3  Literature Review

3.1 Overview

This chapter surveys emerging research at the intersection of AI/LLMs and Model-Based Systems Engineering (MBSE). The literature review serves three purposes:

  1. Inform design decisions for the MCP server architecture
  2. Validate project relevance by identifying gaps in current tooling
  3. Position contributions within the academic landscape

3.2 AI + SysML v2 Research

3.2.1 Li et al. (2025) - LLM-Assisted Semantic Alignment for SysML v2

Citation: [1]
Venue: IEEE ISSE 2025

Key Contribution: Proposes a prompt-driven approach for LLM-assisted semantic alignment of SysML v2 models across organizations.

Summary:

7-stage iterative process with human-in-the-loop verification for cross-organizational model alignment. Key context management insights:

  • Staged decomposition: Process split into discrete stages (preparation, extraction, matching, verification, generation, consistency check, export) with explicit user confirmation gates
  • JSON intermediate representation: SysML v2 textual models converted to structured JSON before LLM processing
  • Confidence scoring: All outputs include confidence metadata for verification
  • Coverage checking: LLM explicitly confirms all elements processed to prevent silent omissions

On context windows: Paper explicitly acknowledges “attention degradation and token limits” as concerns. They balance prompt completeness vs. brevity by using detailed prompts for extraction stages, lighter prompts for validation.

MCP tool implications: Suggests tools such as extract_model_elements, suggest_alignments, and verify_alignment with a JSON intermediate format.
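
The staged extraction pattern can be sketched as a single MCP-style tool. The function name follows the suggested extract_model_elements; the regex extraction, JSON schema, and confidence values are illustrative assumptions, not Li et al.'s implementation:

```python
import json
import re

def extract_model_elements(sysml_text: str) -> str:
    """Hypothetical MCP tool: pull SysML v2 part definitions into a JSON
    intermediate representation with per-element confidence metadata."""
    elements = []
    for match in re.finditer(r"part def (\w+)", sysml_text):
        elements.append({
            "kind": "part def",
            "name": match.group(1),
            # Confidence metadata supports the human-in-the-loop gate
            "confidence": 1.0,  # exact syntactic match, so maximal confidence
        })
    # The element count lets the caller run a coverage check on the source
    return json.dumps({"elements": elements, "count": len(elements)})

result = json.loads(extract_model_elements("part def Engine; part def Wheel;"))
```

The coverage count mirrors the paper's explicit "all elements processed" confirmation step.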

3.2.2 Hendricks & Cicirello (2025) - Text to Model via SysML

Citation: [2]
Venue: arXiv preprint

Key Contribution: Five-step NLP pipeline converting natural language text to SysML diagrams (BDD), then to computational models via code generation.

Summary:

5-step pipeline (preprocessing → knowledge graph → BDD → code → simulation) that deliberately minimizes LLM usage. Context management approach: avoidance.

  • The LLM is used only for sentence-level attribute extraction; it never sees full documents
  • Traditional NLP (TF-IDF, coreference resolution, OpenIE) handles document-level processing
  • Per-sentence prompts sidestep context window limitations entirely
  • Intermediate outputs at each step enable human inspection

Performance: Higher recall than GPT-4o zero-shot on key phrase extraction; validated end-to-end on a simple pendulum example, where Copilot hallucinated parameter values.

The “harness matters” lens: demonstrates that reasonable results are achievable by avoiding context management entirely, limiting the LLM to tiny, well-structured prompts.
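
The avoidance strategy reduces to prompt construction that never sees more than one sentence. A minimal sketch, with a naive regex splitter standing in for the paper's traditional NLP preprocessing (coreference resolution, OpenIE):

```python
import re

def sentence_prompts(document: str, template: str) -> list[str]:
    """Per-sentence prompt construction: each LLM call sees exactly one
    sentence, never the full document, so the context window is never a
    constraint no matter how long the source document grows."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return [template.format(sentence=s) for s in sentences]

prompts = sentence_prompts(
    "The pendulum has length L. Gravity acts downward.",
    "Extract attribute names and values from: {sentence}",
)
```

Each prompt is independent, so the calls can also be batched or parallelized freely.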

3.2.3 Darm et al. (2025) - Inference-Time Intervention for Requirement Verification

Citation: [3]
Venue: arXiv preprint

Key Contribution: Uses intervention techniques on specific LLM attention heads to verify requirements against Capella SysML models. Achieves perfect precision on requirement fulfillment checking.

Summary:

Inference-time intervention modifies 1-3 attention heads to achieve 100% precision on requirement verification. Key context management insight: progressive context narrowing.

Graph-based model representation:

  • Capella models extracted to triple format: |Entity| |Relation| |Entity|
  • Semantic similarity filtering finds top-k relevant components
  • LLM re-ranking narrows to top-1
  • Breadth-first traversal extracts adjacent components from starting point
  • Chain-of-thought prompt with constrained output (“Final Answer: Yes/No”)

Key finding: LLMs exhibit “overconfidence” on requirement fulfillment—they default to “yes.” The intervention shifts toward conservative (higher precision) outputs.

MCP relevance: The subgraph extraction pattern (similarity → re-rank → BFS traversal) is directly applicable. The context management patterns transfer even when intervention techniques don’t.
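The similarity → re-rank → BFS pattern can be sketched over the triple representation. Lexical overlap stands in here for the semantic similarity and LLM re-ranking stages; the graph and scoring are illustrative assumptions:

```python
from collections import deque

def narrow_context(triples, requirement_terms, depth=1):
    """Progressive context narrowing over |Entity| |Relation| |Entity| triples:
    (1) score entities against the requirement (lexical overlap stands in for
    semantic similarity filtering plus LLM re-ranking), (2) keep the top-1
    entity, (3) breadth-first traversal collects adjacent components."""
    adjacency = {}
    for head, _relation, tail in triples:
        adjacency.setdefault(head, set()).add(tail)
        adjacency.setdefault(tail, set()).add(head)
    scores = {e: sum(term.lower() in e.lower() for term in requirement_terms)
              for e in adjacency}
    start = max(scores, key=scores.get)  # top-1 after filtering/re-ranking
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen

subgraph = narrow_context(
    [("Battery", "powers", "Motor"), ("Motor", "drives", "Wheel")],
    ["battery", "capacity"],
)
```

Only the returned subgraph (not the whole model) then enters the chain-of-thought verification prompt.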

3.2.4 Otten et al. (2026) - Generative AI in Systems Engineering: LLM Risk Assessment

Citation: [4]
Venue: IEEE SysCon 2026

Key Contribution: Introduces LLM Risk Assessment Framework (LRF) for evaluating LLM use in systems engineering across autonomy and impact dimensions.

Summary:

LLM Risk Assessment Framework (LRF): 2D matrix classifying LLM applications by autonomy (4 levels: Assisted → Fully Automated) and impact (Low/Medium/High).

Autonomy levels (inspired by SAE driving automation):

  • Level 0 - Assisted: Human in charge, LLM provides support
  • Level 1 - Guided: LLM suggests, human approves
  • Level 2 - Supervised: AI executes under monitoring
  • Level 3 - Fully Automated: AI acts independently

MCP tool classification implications:

  • Query/exploration tools → Level 0-1, Low impact → Minimal risk
  • Model modification tools → Level 1-2, Medium-High impact → Medium risk
  • Automated design generation → Level 2-3, High impact → High risk (requires safeguards)
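The 2D classification above can be sketched as a simple lookup; the additive scoring and thresholds are illustrative assumptions chosen to roughly reproduce the tool classifications listed, not values from the paper:

```python
def lrf_risk(autonomy_level: int, impact: str) -> str:
    """Coarse LRF-style classification: autonomy (0-3) crossed with
    impact (low/medium/high) yields a risk tier. The additive score
    and its cutoffs are illustrative, not taken from Otten et al."""
    score = autonomy_level + {"low": 0, "medium": 1, "high": 2}[impact]
    if score <= 1:
        return "minimal"
    if score <= 4:
        return "medium"
    return "high"

assert lrf_risk(0, "low") == "minimal"  # query/exploration tool
assert lrf_risk(3, "high") == "high"    # automated design generation
```

An MCP server could attach such a tier to each tool's metadata so clients can gate high-risk operations behind confirmation.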

3.2.5 Bouamra et al. (2025) - SysTemp: Template-Based SysML v2 Generation

Attribute Value
Citation [5]
Venue arXiv preprint

Key Contribution: Multi-agent system using template-based structuring to generate SysML v2 from natural language, addressing corpus scarcity and complex syntax.

Summary:

Multi-agent template generator that decomposes SysML v2 generation into structured template selection and population. Context management via template-mediated structuring—the template constrains LLM output to valid patterns.

Architecture:

  • Template generator agent identifies appropriate SysML v2 structural patterns
  • Population agent fills templates from natural language specifications
  • Templates encode syntactic constraints, reducing hallucination risk

Key insight: Corpus scarcity is the central obstacle for SysML v2 generation. Templates compensate by providing structural scaffolding that LLMs cannot learn from limited examples.

MCP relevance: Template-based approach aligns with tool-mediated generation. An MCP server providing parse/validate feedback enables the same constraint-first pattern without hardcoded templates. Validates our grammar-first architecture: structural knowledge must come from tooling, not LLM training data.
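Template-mediated structuring can be sketched in a few lines; the template text and slot names below are illustrative, not SysTemp's actual library:

```python
def populate_template(template: str, slots: dict) -> str:
    """Template-mediated structuring: the template fixes valid SysML v2
    syntax up front, so the population step (a plain dict here, standing
    in for LLM output) can only fill named slots, never emit malformed
    structure."""
    return template.format(**slots)

# Illustrative template for a part definition with one attribute
PART_TEMPLATE = "part def {name} {{\n    attribute {attr} : {attr_type};\n}}"

text = populate_template(
    PART_TEMPLATE, {"name": "Battery", "attr": "capacity", "attr_type": "Real"}
)
```

The syntactic skeleton is guaranteed by construction; only slot contents can be wrong, which shrinks the verification surface.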

3.2.6 Jin et al. (2025) - SysMBench: Benchmarking LLMs on System Model Generation

Citation: [6]
Venue: arXiv preprint

Key Contribution: First benchmark for evaluating LLM-generated system models. 151 curated scenarios across 17 LLMs demonstrate that raw LLM capability is insufficient—best BLEU score is 4%.

Summary:

151 human-curated scenarios spanning multiple domains and difficulty levels, evaluated with SysMEval (semantic-aware metric) and traditional metrics (BLEU, CodeBLEU). Strongest quantitative evidence that LLMs cannot reliably generate system models without external support.

Benchmark design:

  • Each scenario: NL requirements → model description language → visualized diagram
  • Evaluation: BLEU, CodeBLEU, and custom SysMEval-F1 metric
  • Three enhancement strategies tested: direct prompting, few-shot, chain-of-thought

Key findings:

  • Best BLEU: 4% (across all 17 LLMs tested)
  • Best SysMEval-F1: 62%
  • Enhancement strategies provide marginal improvement
  • Model description language syntax is a primary failure mode

MCP relevance: Provides the strongest quantitative argument for our thesis—harness design matters more than model capability. If the best LLMs achieve only 4% BLEU on system models, external tooling (parsing, validation, structural feedback) is not optional but essential. Validates the need for grammar-aware MCP tools that provide syntactic scaffolding.

3.3 AI + UML/General Modeling

3.3.1 Giannouris & Ananiadou (2025) - NOMAD: Multi-Agent UML Generation

Citation: [7]
Venue: arXiv preprint

Key Contribution: Cognitively-inspired multi-agent framework decomposing UML class diagram generation into entity extraction, relationship classification, and diagram synthesis.

Summary:

Cognitively-inspired multi-agent framework decomposing UML generation into specialized subtasks. Context management via pipeline partitioning—each agent sees only what it needs.

Agent architecture:

Agent | Input | Output
Concept Extractor | NL requirements | Classes + attributes
Relationship Comprehender | Requirements + entities | Typed relationships
Model Integrator | Entities + relationships | JSON intermediate
Code Articulator | JSON | PlantUML

Key insight: JSON intermediate representation acts as “context checkpoint”—formalizing output before next stage removes NL ambiguity.

Performance: F1 improves 0.66 → 0.70 (breadth), 0.74 → 0.84 (depth). Relationship modeling sees largest gains (0.52 → 0.92).

MCP relevance: Pipeline pattern with intermediate representations directly applicable. Schema constraints align with MCP tool design.
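The context-checkpoint idea reduces to a formal intermediate between agents. A sketch of the last two pipeline stages, with an assumed JSON schema (NOMAD's actual format is not specified here):

```python
import json

def integrate(entities, relations) -> str:
    """Model Integrator stage: formalize upstream agent output as JSON,
    the "context checkpoint" that strips NL ambiguity before code
    generation. The schema is an illustrative guess."""
    return json.dumps({"classes": entities, "relations": relations})

def articulate(model_json: str) -> str:
    """Code Articulator stage: deterministic JSON -> PlantUML rendering,
    so no generation step downstream of the checkpoint can hallucinate."""
    model = json.loads(model_json)
    lines = ["@startuml"]
    lines += [f"class {cls}" for cls in model["classes"]]
    arrows = {"association": "--", "inheritance": "--|>"}
    lines += [f"{src} {arrows[kind]} {dst}" for src, kind, dst in model["relations"]]
    lines.append("@enduml")
    return "\n".join(lines)

uml = articulate(integrate(["Car", "Wheel"], [("Car", "association", "Wheel")]))
```

Because the final stage is deterministic, errors can only enter at the extraction stages, where they are easier to inspect.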

3.3.2 Ferrari et al. (2024) - Model Generation with LLMs: Requirements to UML

Attribute Value
Citation [8]
Venue arXiv preprint

Key Contribution: Evaluates ChatGPT generating UML sequence diagrams from 28 requirements documents. Identifies challenges with requirements smells.

Summary:

First systematic study of GPT-3.5 generating UML sequence diagrams from 28 real-world requirements documents (87 variants). Qualitative analysis by 3 experts identified 23 categories of issues.

Performance (5-point scale, mean=3):

  • Standard adherence: 4.54 ✓
  • Terminological alignment: 4.49 ✓
  • Understandability: 4.37 ✓
  • Completeness: 3.63 ✓
  • Correctness: 3.22 (NOT significantly above mean—critical weakness)

Key failure modes:

  • Requirements smells (ambiguity/inconsistency) cause LLM to “hide” conflicts by abstracting
  • Session memory pollution causes hallucinations
  • Cross-reference handling fails when context unavailable

MCP implications: Single-shot generation inadequate; need iterative refinement. Session isolation essential.

3.4 Foundational Industry Work

3.4.1 Bader et al. (2024) - User-Centric MBSE Using Generative AI

Citation: [9]
Venue: MODELSWARD 2024

Key Contribution: LLM-assisted model understanding patterns. Explicitly recommends SysML v2 as more LLM-friendly than XMI.

Summary:

Fine-tuned GPT-3.5 on UML component diagrams (XMI format) to generate models from natural language. Identifies three critical obstacles and explicitly recommends SysML v2 as more LLM-friendly.

Critical findings:

  • Context window severely limiting: the 16K-token limit was exceeded when generating only ~8 elements
  • XMI verbosity exacerbates context issues (~30% reduction via pre-processing still insufficient)
  • Moving context window prevents referencing past elements

SysML v2 validation: Paper explicitly recommends SysML v2’s textual notation as superior for LLM interaction—validates our project direction.

MCP implications: Context chunking critical even for simple models. Server should abstract/manage element IDs. Post-processing required—implement syntax checking.
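The element-ID abstraction could take the form of a small server-side registry; the class name, handle scheme, and example GUID below are illustrative assumptions:

```python
class IdRegistry:
    """Server-side ID abstraction (a sketch): verbose tool-internal IDs
    such as XMI GUIDs are mapped to short, stable handles, so an agent
    with a moving context window can still reference past elements
    cheaply."""

    def __init__(self):
        self._id_by_handle = {}
        self._handle_by_id = {}

    def register(self, internal_id: str, name: str) -> str:
        # Idempotent: re-registering an element returns its existing handle
        if internal_id in self._handle_by_id:
            return self._handle_by_id[internal_id]
        handle = f"{name}#{len(self._id_by_handle) + 1}"
        self._id_by_handle[handle] = internal_id
        self._handle_by_id[internal_id] = handle
        return handle

    def resolve(self, handle: str) -> str:
        return self._id_by_handle[handle]

registry = IdRegistry()
handle = registry.register("xmi-guid-0001", "Engine")  # hypothetical GUID
```

Handles cost a few tokens each, versus tens of tokens per raw XMI identifier, which compounds over long sessions.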

3.4.2 Neema et al. (2025) - Evaluating Engineering AGI

Attribute Value
Citation [10]
Venue arXiv preprint

Key Contribution: Bloom’s taxonomy-based framework for evaluating engineering AGI capabilities, including CAD/SysML model evaluation criteria.

Summary:

Bloom’s taxonomy-based framework for evaluating engineering AGI agents across cognitive levels from factual recall to meta-reasoning.

6 cognitive levels:

  1. Remember - Factual recall (equations, standards)
  2. Understand - Interpret design structure/function
  3. Apply - Predict performance, invoke simulation tools
  4. Analyze - Complete partial designs, detect errors
  5. Create - Synthesize full designs from requirements
  6. Reflect - Critique decisions, recognize limitations

MCP relevance:

  • Level 3+ expects agents to “invoke external tools such as solvers, simulators”—aligns with MCP tool-calling
  • Structured artifact I/O assumes automated validation—MCP could provide these tools
  • Metadata-driven test generation enables domain-specific benchmarks

3.4.3 Lopopolo (2026) - Harness Engineering: Agent-First Software Development

Attribute Value
Citation [11]
Venue OpenAI Engineering Blog

Key Contribution: Documents building a million-line production product with zero manually-written code, identifying environment design as the primary engineering activity in agent-first workflows.

TipSummary

Five-month experiment building an internal beta product entirely via Codex agents (~1,500 PRs, 3-7 engineers). Key findings directly relevant to MCP server design:

  • Progressive disclosure over monolithic context: “One big AGENTS.md” failed; replaced with structured docs/ directory where a short map (~100 lines) points to deeper sources of truth. Validates L0/L1/L2 tiered loading patterns
  • Repository knowledge as system of record: Context that lives outside the repository (chat threads, documents, tribal knowledge) is effectively invisible to agents. All architectural decisions, plans, and quality standards must be versioned and co-located with code
  • Mechanical enforcement: Custom linters and structural tests enforce architectural invariants (dependency directions, naming conventions, file size limits). Error messages inject remediation instructions into agent context
  • “Golden principles” and garbage collection: Recurring agent tasks scan for pattern drift, update quality grades, and open targeted refactoring PRs—continuous entropy management rather than periodic cleanup

On throughput: Average 3.5 PRs per engineer per day. Agent runs regularly exceed 6 hours on single tasks. Minimal blocking merge gates because corrections are cheap at high throughput.

MCP relevance: The article’s framing of “harness engineering” as a discipline—designing environments, feedback loops, and control systems for agents—validates this project’s central thesis (Section 5). Their progressive disclosure patterns map directly to our L0/L1/L2 context budgeting approach. Their observation that “boring technologies” are easier for agents to model supports our tree-sitter and MCP protocol choices.
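The progressive-disclosure pattern can be sketched as budgeted, map-first context assembly. The L0/L1/L2 path naming follows this project's tiers; the keyword trigger and word-count budget are crude illustrative stand-ins for real retrieval:

```python
def load_context(question: str, docs: dict, budget_words: int) -> str:
    """Progressive disclosure sketch: the short L0 map is always
    included; deeper L1/L2 documents are pulled in only when the
    question mentions their topic and the word budget (a crude token
    proxy) allows."""
    selected = [docs["L0/map"]]
    spent = len(docs["L0/map"].split())
    for path, text in sorted(docs.items()):
        topic = path.split("/")[-1]
        if path != "L0/map" and topic in question.lower():
            cost = len(text.split())
            if spent + cost <= budget_words:
                selected.append(text)
                spent += cost
    return "\n\n".join(selected)

docs = {
    "L0/map": "Start here. See parser and grammar docs for details.",
    "L1/parser": "The parser wraps tree-sitter and exposes a CST walker.",
    "L2/grammar": "Grammar rules cover part defs, attributes, and ports.",
}
context = load_context("how does the parser handle errors?", docs, budget_words=40)
```

The map stays tiny and constant-cost; everything else is paid for only on demand, which is what lets the approach scale with repository size.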

3.5 Additional References

Paper | Key Finding | Domain
Jerry et al. (2026) [12] | SysML v2 as semantic backbone for WCAG-compliant UI generation | Healthcare
Erikstad (2024) [13] | CrewAI multi-agent + MBSE for ship design optimization | Marine
Rouabhia & Hadjadj (2025) [14] | 9-LLM benchmark on UML method generation; 100% syntactic validity | Benchmarking
Mao et al. (2025) [15] | Data dependency inference improves code gen +11.66% | Code generation
Trendowicz et al. (2026) [16] | GPT-4o requirements quality assessment validated by experts | Agile RE
Crabb & Jones (2024) [17] | “Draft materials” workflow (LLM generates, engineer refines) effective | Industry practice

3.6 Research Gaps & Our Contribution

Gap Identified | Evidence | Our Response
No MCP-based MBSE integration | 0 MCP servers for SysML in 7,364+ public repos | open-mcp-sysml MCP server
Limited open-source SysML v2 + AI tooling | Sensmetry sysml-2ls archived Oct 2025 | tree-sitter-sysml grammar (MIT)
Raw LLM capability insufficient | SysMBench: 4% BLEU across 17 LLMs | Grammar-aware MCP tools provide structural feedback
Validation approaches still emerging | Papers focus on specific interventions | Hybrid architecture: local parse + API validation
No standardized AI-MBSE interface | Each paper implements custom integration | MCP protocol standardization

3.7 Synthesis

Analysis of the literature reveals consistent themes directly applicable to MCP server design.

3.7.1 Context Management Strategies

Strategy | Papers | Trade-off
Avoidance | Hendricks | Limit LLM to sentence level; sidesteps problem but limits capability
Staged decomposition | Li, NOMAD | Pipeline with intermediate representations; adds latency but controls context
Template-mediated structuring | SysTemp | Templates constrain output to valid patterns; requires template library
Progressive narrowing | Darm | Similarity → re-rank → traversal; requires graph structure
Multi-agent partitioning | NOMAD, Erikstad, SysTemp | Each agent sees only what it needs; adds orchestration complexity
Progressive disclosure | Lopopolo (OpenAI) | Map-first entry point with structured deep dives; scales to 1M LOC

3.7.2 Design Patterns for MCP Tools

  1. JSON intermediate representation (Li, NOMAD): Convert SysML v2 to structured JSON before LLM processing
  2. Confidence scoring (Li): Tool responses should include confidence metadata
  3. Reachability-based pruning (Darm, Mao): Graph traversal to select relevant context
  4. Quality model injection (Trendowicz): Provide explicit evaluation criteria in prompt

3.7.3 Validated Claims

  • SysML v2 textual notation is LLM-friendly (Bader explicitly recommends it over XMI)
  • Raw LLM capability is insufficient for system model generation (SysMBench: best BLEU 4%, best SysMEval-F1 62% across 17 LLMs)
  • Corpus scarcity demands structural tooling (SysTemp: templates compensate for lack of training data; SysMBench: enhancement strategies provide only marginal improvement)
  • Draft-then-refine workflow works (Crabb: engineer maintains control)
  • Single-shot generation is insufficient (Ferrari: correctness not significantly above baseline)
  • Session isolation essential (Ferrari: memory pollution causes hallucinations)
  • Environment design outweighs direct coding (Lopopolo: 1M LOC product built with zero manually-written code; engineering effort shifts to harness design)

3.7.4 Implications for Tool Architecture

The literature supports our central thesis (Section 5): harness design matters more than model capability. SysMBench [6] provides the strongest quantitative evidence—when the best available LLMs achieve only 4% BLEU on system model generation, external tooling is not optional but essential. Key architectural implications:

  • Granular tools over monolithic operations: Enable staged workflows with verification gates
  • Server-side context selection: Use parser/grammar to extract relevant subgraphs before LLM query
  • Iterative refinement support: Design for multi-turn interactions, not single-shot generation
  • Validation hooks: Syntax checking essential; LLMs cannot reliably self-correct parsing errors
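
A minimal sketch of such a validation hook, with simple heuristics (brace balance, statement termination) standing in for a real tree-sitter parse; each error carries a remediation instruction for injection back into the agent's context, following Lopopolo's error-message pattern:

```python
def validate_sysml(text: str) -> list[str]:
    """Illustrative validation hook: heuristic checks that return
    remediation messages rather than bare pass/fail. A production
    server would run a real tree-sitter parse instead."""
    errors = []
    if text.count("{") != text.count("}"):
        errors.append("Unbalanced braces: close every 'part def ... {' block.")
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.endswith(("{", "}", ";")):
            errors.append(f"Possibly unterminated statement: {stripped!r}")
    return errors

ok = validate_sysml("part def Engine {\n    attribute mass : Real;\n}")
bad = validate_sysml("part def Engine {")
```

Returning an empty list signals the model is syntactically clean; any non-empty result feeds the next refinement turn.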