3 Literature Review
3.1 Overview
This chapter surveys emerging research at the intersection of AI/LLMs and Model-Based Systems Engineering (MBSE). The literature review serves three purposes:
- Inform design decisions for the MCP server architecture
- Validate project relevance by identifying gaps in current tooling
- Position contributions within the academic landscape
3.2 AI + SysML v2 Research
3.2.1 Li et al. (2025) - LLM-Assisted Semantic Alignment for SysML v2
| Attribute | Value |
|---|---|
| Citation | [1] |
| Venue | IEEE ISSE 2025 |
Key Contribution: Proposes a prompt-driven approach for LLM-assisted semantic alignment of SysML v2 models across organizations.
7-stage iterative process with human-in-the-loop verification for cross-organizational model alignment. Key context management insights:
- Staged decomposition: Process split into discrete stages (preparation, extraction, matching, verification, generation, consistency check, export) with explicit user confirmation gates
- JSON intermediate representation: SysML v2 textual models converted to structured JSON before LLM processing
- Confidence scoring: All outputs include confidence metadata for verification
- Coverage checking: LLM explicitly confirms all elements processed to prevent silent omissions
On context windows: Paper explicitly acknowledges “attention degradation and token limits” as concerns. They balance prompt completeness vs. brevity by using detailed prompts for extraction stages, lighter prompts for validation.
MCP tool implications: Suggests tools like extract_model_elements → suggest_alignments → verify_alignment with JSON intermediate format.
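A minimal sketch of how this staged pipeline might surface as server-side tools, with toy extraction logic and confidence/coverage fields modeled on the paper's metadata pattern. Function names and schemas are our suggestion, not Li et al.'s implementation:

```python
# Hypothetical staged tool pipeline per Li et al. [1]: each stage returns
# structured JSON with confidence metadata; a harness would pause for user
# confirmation between stages. Extraction here is a toy: one element per
# "part def" line, not a real SysML v2 parser.

def extract_model_elements(sysml_text: str) -> dict:
    """Stage 1: convert SysML v2 text into a JSON-style intermediate form."""
    elements = [line.split()[2].rstrip(";") for line in sysml_text.splitlines()
                if line.strip().startswith("part def")]
    return {"elements": elements, "confidence": 0.9}

def suggest_alignments(left: dict, right: dict) -> dict:
    """Stage 2: propose name-based alignments between two element sets."""
    matches = [{"left": e, "right": e, "confidence": 1.0}
               for e in left["elements"] if e in right["elements"]]
    coverage = len(matches) / max(len(left["elements"]), 1)
    return {"alignments": matches, "coverage": coverage}

def verify_alignment(result: dict) -> dict:
    """Stage 3: coverage check, confirming nothing was silently dropped."""
    ok = all(a["confidence"] >= 0.5 for a in result["alignments"])
    return {"verified": ok, "coverage": result["coverage"]}

model_a = "part def Engine;\npart def Wheel;"
model_b = "part def Engine;\npart def Chassis;"
aligned = suggest_alignments(extract_model_elements(model_a),
                             extract_model_elements(model_b))
print(verify_alignment(aligned))  # → {'verified': True, 'coverage': 0.5}
```

The explicit coverage field mirrors the paper's coverage-checking insight: the harness, not the LLM, decides whether all elements were processed.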
3.2.2 Hendricks & Cicirello (2025) - Text to Model via SysML
| Attribute | Value |
|---|---|
| Citation | [2] |
| Venue | arXiv preprint |
Key Contribution: Five-step NLP pipeline converting natural language text to SysML diagrams (BDD), then to computational models via code generation.
5-step pipeline (preprocessing → knowledge graph → BDD → code → simulation) that deliberately minimizes LLM usage. Context management approach: avoidance.
- LLMs used only for sentence-level attribute extraction—never sees full documents
- Traditional NLP (TF-IDF, coreference resolution, OpenIE) handles document-level processing
- Per-sentence prompts sidestep context window limitations entirely
- Intermediate outputs at each step enable human inspection
Performance: Higher recall than GPT-4o zero-shot on key phrase extraction; validated end-to-end on a simple pendulum example, where Copilot hallucinated parameter values.
The “harness matters” lens: Demonstrates that reasonable results are achievable by sidestepping context management entirely, limiting the LLM to tiny, well-structured prompts.
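The avoidance strategy can be sketched as follows. The per-sentence extractor below is a regex stand-in for the pipeline's sentence-level LLM call; the sentence splitter stands in for the traditional NLP preprocessing:

```python
import re

# Sketch of the "avoidance" strategy from Hendricks & Cicirello [2]: the
# model never sees the whole document. Each sentence is processed in its
# own tiny prompt, so no call can exceed the context window. The extractor
# is a regex stand-in for the real attribute-extraction LLM call.

def extract_attributes(sentence: str) -> list[str]:
    # Stand-in for a sentence-level LLM prompt: pull "<number> <unit>" phrases.
    return re.findall(r"\d+(?:\.\d+)?\s*(?:kg|m|s)\b", sentence)

def process_document(text: str) -> dict[int, list[str]]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # One tiny prompt per sentence; document-level work stays in classic NLP.
    return {i: extract_attributes(s) for i, s in enumerate(sentences)}

doc = "The pendulum bob weighs 2 kg. Its rod is 1.5 m long. It swings freely."
print(process_document(doc))  # → {0: ['2 kg'], 1: ['1.5 m'], 2: []}
```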
3.2.3 Darm et al. (2025) - Inference-Time Intervention for Requirement Verification
| Attribute | Value |
|---|---|
| Citation | [3] |
| Venue | arXiv preprint |
Key Contribution: Uses intervention techniques on specific LLM attention heads to verify requirements against Capella SysML models. Achieves perfect precision on requirement fulfillment checking.
Inference-time intervention modifies 1-3 attention heads to achieve 100% precision on requirement verification. Key context management insight: progressive context narrowing.
Graph-based model representation:
- Capella models extracted to triple format: |Entity| |Relation| |Entity|
- Semantic similarity filtering finds top-k relevant components
- LLM re-ranking narrows to top-1
- Breadth-first traversal extracts adjacent components from starting point
- Chain-of-thought prompt with constrained output (“Final Answer: Yes/No”)
Key finding: LLMs exhibit “overconfidence” on requirement fulfillment—they default to “yes.” The intervention shifts toward conservative (higher precision) outputs.
MCP relevance: The subgraph extraction pattern (similarity → re-rank → BFS traversal) is directly applicable. The context management patterns transfer even when intervention techniques don’t.
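The narrowing pattern above can be sketched with toy stand-ins: word-overlap similarity instead of embeddings, a trivial re-ranker instead of an LLM call, and a dict-of-lists adjacency graph in place of an extracted Capella model:

```python
from collections import deque

# Sketch of Darm et al.'s [3] progressive context narrowing:
# similarity filter -> re-rank to top-1 -> BFS over adjacent components.
# GRAPH and the requirement text are invented examples.

GRAPH = {  # component -> adjacent components
    "Battery": ["PowerBus"], "PowerBus": ["Battery", "Radio", "Heater"],
    "Radio": ["PowerBus", "Antenna"], "Antenna": ["Radio"], "Heater": ["PowerBus"],
}

def similarity(req: str, component: str) -> float:
    # Toy stand-in for embedding similarity: exact word overlap.
    return 1.0 if component.lower() in set(req.lower().split()) else 0.0

def narrow_context(requirement: str, k: int = 3, depth: int = 1) -> list[str]:
    # 1. Similarity filtering keeps the top-k candidate components.
    ranked = sorted(GRAPH, key=lambda c: similarity(requirement, c), reverse=True)[:k]
    # 2. Re-ranking narrows to top-1 (an LLM call in the paper; first here).
    start = ranked[0]
    # 3. Breadth-first traversal collects the adjacent subgraph as context.
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if d < depth:
            for nxt in GRAPH[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return sorted(seen)

print(narrow_context("The radio shall receive power from the bus"))
# → ['Antenna', 'PowerBus', 'Radio']
```

Only the returned subgraph, not the full model, would be placed in the LLM's context before the chain-of-thought verification prompt.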
3.2.4 Otten et al. (2026) - Generative AI in Systems Engineering: LLM Risk Assessment
| Attribute | Value |
|---|---|
| Citation | [4] |
| Venue | IEEE SysCon 2026 |
Key Contribution: Introduces LLM Risk Assessment Framework (LRF) for evaluating LLM use in systems engineering across autonomy and impact dimensions.
LLM Risk Assessment Framework (LRF): 2D matrix classifying LLM applications by autonomy (4 levels: Assisted → Fully Automated) and impact (Low/Medium/High).
Autonomy levels (inspired by SAE driving automation):
- Level 0 - Assisted: Human in charge, LLM provides support
- Level 1 - Guided: LLM suggests, human approves
- Level 2 - Supervised: AI executes under monitoring
- Level 3 - Fully Automated: AI acts independently
MCP tool classification implications:
- Query/exploration tools → Level 0-1, Low impact → Minimal risk
- Model modification tools → Level 1-2, Medium-High impact → Medium risk
- Automated design generation → Level 2-3, High impact → High risk (requires safeguards)
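A hedged sketch of how the LRF matrix could gate tools in practice: each tool is tagged with an autonomy level and impact, and the harness derives a risk tier deciding which safeguards apply. The scoring thresholds are illustrative, not taken from the paper:

```python
# Illustrative LRF-style risk gating per Otten et al. [4]. The additive
# score and cutoffs are our own simplification of the 2D matrix.

IMPACT = {"low": 0, "medium": 1, "high": 2}

def lrf_risk(autonomy: int, impact: str) -> str:
    score = autonomy + IMPACT[impact]
    if score <= 1:
        return "minimal"
    if score <= 3:
        return "medium"
    return "high"  # e.g. automated design generation: requires safeguards

# The three tool families from the classification above:
print(lrf_risk(0, "low"))     # query/exploration tool → minimal
print(lrf_risk(2, "medium"))  # model modification tool → medium
print(lrf_risk(3, "high"))    # automated design generation → high
```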
3.2.5 Bouamra et al. (2025) - SysTemp: Template-Based SysML v2 Generation
| Attribute | Value |
|---|---|
| Citation | [5] |
| Venue | arXiv preprint |
Key Contribution: Multi-agent system using template-based structuring to generate SysML v2 from natural language, addressing corpus scarcity and complex syntax.
Multi-agent template generator that decomposes SysML v2 generation into structured template selection and population. Context management via template-mediated structuring—the template constrains LLM output to valid patterns.
Architecture:
- Template generator agent identifies appropriate SysML v2 structural patterns
- Population agent fills templates from natural language specifications
- Templates encode syntactic constraints, reducing hallucination risk
Key insight: Corpus scarcity is the central obstacle for SysML v2 generation. Templates compensate by providing structural scaffolding that LLMs cannot learn from limited examples.
MCP relevance: Template-based approach aligns with tool-mediated generation. An MCP server providing parse/validate feedback enables the same constraint-first pattern without hardcoded templates. Validates our grammar-first architecture: structural knowledge must come from tooling, not LLM training data.
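The tool-mediated, constraint-first loop can be sketched as below. The generator is a stub standing in for an LLM call, and the "grammar" check is a toy balance-and-keyword test rather than a real SysML v2 parser; the point is the feedback shape, not the validation logic:

```python
# Sketch of a parse/validate feedback loop replacing hardcoded templates:
# validation errors are fed back to the generator until output is valid.

def validate(text: str) -> list[str]:
    # Toy structural checks standing in for a real grammar-based validator.
    errors = []
    if text.count("{") != text.count("}"):
        errors.append("unbalanced braces")
    if "part def" not in text:
        errors.append("expected at least one 'part def'")
    return errors

def generate(prompt: str, feedback: list[str]) -> str:
    # Stub LLM: first attempt is malformed; with feedback it "corrects" itself.
    if feedback:
        return "part def Vehicle { part engine : Engine; }"
    return "part Vehicle {"

def generate_with_validation(prompt: str, max_rounds: int = 3) -> str:
    feedback: list[str] = []
    for _ in range(max_rounds):
        candidate = generate(prompt, feedback)
        feedback = validate(candidate)
        if not feedback:
            return candidate
    raise RuntimeError(f"still invalid after {max_rounds} rounds: {feedback}")

print(generate_with_validation("Model a vehicle with an engine"))
```

Where SysTemp encodes constraints in templates ahead of time, this pattern enforces them after the fact via tooling, which is what a grammar-aware server makes possible.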
3.2.6 Jin et al. (2025) - SysMBench: Benchmarking LLMs on System Model Generation
| Attribute | Value |
|---|---|
| Citation | [6] |
| Venue | arXiv preprint |
Key Contribution: First benchmark for evaluating LLM-generated system models. 151 curated scenarios across 17 LLMs demonstrate that raw LLM capability is insufficient—best BLEU score is 4%.
151 human-curated scenarios spanning multiple domains and difficulty levels, evaluated with SysMEval (semantic-aware metric) and traditional metrics (BLEU, CodeBLEU). Strongest quantitative evidence that LLMs cannot reliably generate system models without external support.
Benchmark design:
- Each scenario: NL requirements → model description language → visualized diagram
- Evaluation: BLEU, CodeBLEU, and custom SysMEval-F1 metric
- Three enhancement strategies tested: direct prompting, few-shot, chain-of-thought
Key findings:
- Best BLEU: 4% (across all 17 LLMs tested)
- Best SysMEval-F1: 62%
- Enhancement strategies provide marginal improvement
- Model description language syntax is a primary failure mode
MCP relevance: Provides the strongest quantitative argument for our thesis—harness design matters more than model capability. If the best LLMs achieve only 4% BLEU on system models, external tooling (parsing, validation, structural feedback) is not optional but essential. Validates the need for grammar-aware MCP tools that provide syntactic scaffolding.
3.3 AI + UML/General Modeling
3.3.1 Giannouris & Ananiadou (2025) - NOMAD: Multi-Agent UML Generation
| Attribute | Value |
|---|---|
| Citation | [7] |
| Venue | arXiv preprint |
Key Contribution: Cognitively-inspired multi-agent framework decomposing UML class diagram generation into entity extraction, relationship classification, and diagram synthesis.
Cognitively-inspired multi-agent framework decomposing UML generation into specialized subtasks. Context management via pipeline partitioning—each agent sees only what it needs.
Agent architecture:
| Agent | Input | Output |
|---|---|---|
| Concept Extractor | NL requirements | Classes + attributes |
| Relationship Comprehender | Requirements + entities | Typed relationships |
| Model Integrator | Entities + relationships | JSON intermediate |
| Code Articulator | JSON | PlantUML |
Key insight: JSON intermediate representation acts as “context checkpoint”—formalizing output before next stage removes NL ambiguity.
Performance: F1 improves 0.66 → 0.70 (breadth), 0.74 → 0.84 (depth). Relationship modeling sees largest gains (0.52 → 0.92).
MCP relevance: Pipeline pattern with intermediate representations directly applicable. Schema constraints align with MCP tool design.
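The "context checkpoint" idea can be sketched as a shape-check between stages: each agent's output must parse as JSON and satisfy a minimal schema before the next stage runs. The hand-rolled check below stands in for a real validator such as jsonschema or pydantic, and the field names are invented:

```python
import json

# Sketch of NOMAD's [7] JSON checkpoint between agents: formalizing output
# before the next stage removes NL ambiguity and fails fast on malformed
# intermediate results.

def checkpoint(raw_agent_output: str) -> dict:
    """Parse and shape-check an agent's JSON output before handing it on."""
    data = json.loads(raw_agent_output)  # fails loudly on non-JSON output
    assert isinstance(data.get("classes"), list), "missing 'classes' list"
    for cls in data["classes"]:
        assert "name" in cls and "attributes" in cls, f"malformed class: {cls}"
    return data

def to_plantuml(model: dict) -> str:
    """Final (Code Articulator) stage: JSON intermediate -> PlantUML text."""
    lines = ["@startuml"]
    for cls in model["classes"]:
        lines.append(f"class {cls['name']} {{")
        lines += [f"  {attr}" for attr in cls["attributes"]]
        lines.append("}")
    lines.append("@enduml")
    return "\n".join(lines)

raw = '{"classes": [{"name": "Order", "attributes": ["id: int", "total: float"]}]}'
print(to_plantuml(checkpoint(raw)))
```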
3.3.2 Ferrari et al. (2024) - Model Generation with LLMs: Requirements to UML
| Attribute | Value |
|---|---|
| Citation | [8] |
| Venue | arXiv preprint |
Key Contribution: Evaluates ChatGPT generating UML sequence diagrams from 28 requirements documents. Identifies challenges with requirements smells.
First systematic study of GPT-3.5 generating UML sequence diagrams from 28 real-world requirements documents (87 variants). Qualitative analysis by 3 experts identified 23 categories of issues.
Performance (5-point scale; scale midpoint = 3, ✓ marks scores significantly above the midpoint):
- Standard adherence: 4.54 ✓
- Terminological alignment: 4.49 ✓
- Understandability: 4.37 ✓
- Completeness: 3.63 ✓
- Correctness: 3.22 (NOT significantly above mean—critical weakness)
Key failure modes:
- Requirements smells (ambiguity/inconsistency) cause LLM to “hide” conflicts by abstracting
- Session memory pollution causes hallucinations
- Cross-reference handling fails when context unavailable
MCP implications: Single-shot generation inadequate; need iterative refinement. Session isolation essential.
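The session-isolation implication can be sketched minimally: give each generation task a fresh context so residue from earlier documents cannot pollute later outputs. The Session class and stub model are our illustration, not an API from the paper:

```python
# Sketch of session isolation per Ferrari et al.'s [8] memory-pollution
# finding: one fresh session per task, no shared history across documents.

class Session:
    def __init__(self):
        self.history: list[str] = []

    def ask(self, prompt: str) -> str:
        self.history.append(prompt)
        # Stub model: the reply "sees" everything in this session's history.
        return f"reply based on {len(self.history)} message(s)"

def run_isolated(tasks: list[str]) -> list[str]:
    # A new Session per task guarantees no cross-document pollution.
    return [Session().ask(t) for t in tasks]

print(run_isolated(["doc A requirements", "doc B requirements"]))
# → ['reply based on 1 message(s)', 'reply based on 1 message(s)']
```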
3.4 Foundational Industry Work
3.4.1 Bader et al. (2024) - User-Centric MBSE Using Generative AI
| Attribute | Value |
|---|---|
| Citation | [9] |
| Venue | MODELSWARD 2024 |
Key Contribution: LLM-assisted model understanding patterns. Explicitly recommends SysML v2 as more LLM-friendly than XMI.
Fine-tuned GPT-3.5 on UML component diagrams (XMI format) to generate models from natural language. Identifies three critical obstacles and explicitly recommends SysML v2 as more LLM-friendly.
Critical findings:
- Context window severely limiting: the 16K-token limit was exceeded after generating only ~8 elements
- XMI verbosity exacerbates context issues (~30% reduction via pre-processing still insufficient)
- Moving context window prevents referencing past elements
SysML v2 validation: Paper explicitly recommends SysML v2’s textual notation as superior for LLM interaction—validates our project direction.
MCP implications: Context chunking critical even for simple models. Server should abstract/manage element IDs. Post-processing required—implement syntax checking.
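The ID-abstraction implication can be sketched as a registry that swaps verbose XMI-style identifiers for short handles before text reaches the LLM, and maps them back afterwards. The class, names, and example ID below are invented for illustration:

```python
# Sketch of server-side element ID management per Bader et al. [9]:
# short handles in LLM-facing text save tokens and keep references stable
# across a moving context window; the server resolves them back.

class IdRegistry:
    def __init__(self):
        self._by_handle: dict[str, str] = {}
        self._by_id: dict[str, str] = {}

    def shorten(self, element_id: str) -> str:
        # Idempotent: the same element always gets the same handle.
        if element_id not in self._by_id:
            handle = f"e{len(self._by_id)}"
            self._by_id[element_id] = handle
            self._by_handle[handle] = element_id
        return self._by_id[element_id]

    def resolve(self, handle: str) -> str:
        return self._by_handle[handle]

reg = IdRegistry()
xmi_id = "_gM4x0F2qEe-9XYZ_component_Engine"  # invented XMI-style ID
h = reg.shorten(xmi_id)
print(h, "->", reg.resolve(h))  # → e0 -> _gM4x0F2qEe-9XYZ_component_Engine
```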
3.4.2 Neema et al. (2025) - Evaluating Engineering AGI
| Attribute | Value |
|---|---|
| Citation | [10] |
| Venue | arXiv preprint |
Key Contribution: Bloom’s taxonomy-based framework for evaluating engineering AGI capabilities, including CAD/SysML model evaluation criteria.
Bloom’s taxonomy-based framework for evaluating engineering AGI agents across cognitive levels from factual recall to meta-reasoning.
6 cognitive levels:
- Remember - Factual recall (equations, standards)
- Understand - Interpret design structure/function
- Apply - Predict performance, invoke simulation tools
- Analyze - Complete partial designs, detect errors
- Create - Synthesize full designs from requirements
- Reflect - Critique decisions, recognize limitations
MCP relevance:
- Level 3+ expects agents to “invoke external tools such as solvers, simulators”—aligns with MCP tool-calling
- Structured artifact I/O assumes automated validation—MCP could provide these tools
- Metadata-driven test generation enables domain-specific benchmarks
3.4.3 Lopopolo (2026) - Harness Engineering: Agent-First Software Development
| Attribute | Value |
|---|---|
| Citation | [11] |
| Venue | OpenAI Engineering Blog |
Key Contribution: Documents building a million-line production product with zero manually-written code, identifying environment design as the primary engineering activity in agent-first workflows.
Five-month experiment building an internal beta product entirely via Codex agents (~1,500 PRs, 3-7 engineers). Key findings directly relevant to MCP server design:
- Progressive disclosure over monolithic context: “One big AGENTS.md” failed; replaced with a structured docs/ directory where a short map (~100 lines) points to deeper sources of truth. Validates L0/L1/L2 tiered loading patterns
- Repository knowledge as system of record: Context that lives outside the repository (chat threads, documents, tribal knowledge) is effectively invisible to agents. All architectural decisions, plans, and quality standards must be versioned and co-located with code
- Mechanical enforcement: Custom linters and structural tests enforce architectural invariants (dependency directions, naming conventions, file size limits). Error messages inject remediation instructions into agent context
- “Golden principles” and garbage collection: Recurring agent tasks scan for pattern drift, update quality grades, and open targeted refactoring PRs—continuous entropy management rather than periodic cleanup
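The mechanical-enforcement pattern can be illustrated with a structural check whose failure message carries remediation instructions straight into the agent's context. The file-size limit and wording below are invented examples, not OpenAI's actual linter:

```python
# Illustrative "mechanical enforcement" check per the Codex harness write-up
# [11]: architectural invariants are enforced by tooling, and the error text
# itself tells the agent how to fix the violation.

MAX_LINES = 400  # hypothetical architectural invariant: file size limit

def check_file_size(path: str, source: str) -> list[str]:
    n = len(source.splitlines())
    if n <= MAX_LINES:
        return []
    return [
        f"{path}: {n} lines exceeds the {MAX_LINES}-line limit. "
        "Remediation: split this module along its class boundaries and "
        "re-export the public names from the original path."
    ]

violations = check_file_size("services/orders.py", "x = 1\n" * 500)
for msg in violations:
    print(msg)
```

Because the remediation text lands in the agent's context verbatim, the check doubles as a targeted prompt, which is what makes corrections cheap at high throughput.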
On throughput: Average 3.5 PRs per engineer per day. Agent runs regularly exceed 6 hours on single tasks. Minimal blocking merge gates because corrections are cheap at high throughput.
MCP relevance: The article’s framing of “harness engineering” as a discipline—designing environments, feedback loops, and control systems for agents—validates this project’s central thesis (Section 5). Their progressive disclosure patterns map directly to our L0/L1/L2 context budgeting approach. Their observation that “boring technologies” are easier for agents to model supports our tree-sitter and MCP protocol choices.
3.5 Additional References
| Paper | Key Finding | Domain |
|---|---|---|
| Jerry et al. (2026) [12] | SysML v2 as semantic backbone for WCAG-compliant UI generation | Healthcare |
| Erikstad (2024) [13] | CrewAI multi-agent + MBSE for ship design optimization | Marine |
| Rouabhia & Hadjadj (2025) [14] | 9-LLM benchmark on UML method generation; 100% syntactic validity | Benchmarking |
| Mao et al. (2025) [15] | Data dependency inference improves code gen +11.66% | Code generation |
| Trendowicz et al. (2026) [16] | GPT-4o requirements quality assessment validated by experts | Agile RE |
| Crabb & Jones (2024) [17] | “Draft materials” workflow (LLM generates, engineer refines) effective | Industry practice |
3.6 Research Gaps & Our Contribution
| Gap Identified | Evidence | Our Response |
|---|---|---|
| No MCP-based MBSE integration | 0 MCP servers for SysML in 7,364+ public repos | open-mcp-sysml MCP server |
| Limited open source SysML v2 + AI tooling | Sensmetry sysml-2ls archived Oct 2025 | tree-sitter-sysml grammar (MIT) |
| Raw LLM capability insufficient | SysMBench: 4% BLEU across 17 LLMs | Grammar-aware MCP tools provide structural feedback |
| Validation approaches still emerging | Papers focus on specific interventions | Hybrid architecture: local parse + API validation |
| No standardized AI-MBSE interface | Each paper implements custom integration | MCP protocol standardization |
3.7 Synthesis
Analysis of the literature reveals consistent themes directly applicable to MCP server design.
3.7.1 Context Management Strategies
| Strategy | Papers | Trade-off |
|---|---|---|
| Avoidance | Hendricks | Limit LLM to sentence-level; sidesteps problem but limits capability |
| Staged decomposition | Li, NOMAD | Pipeline with intermediate representations; adds latency but controls context |
| Template-mediated structuring | SysTemp | Templates constrain output to valid patterns; requires template library |
| Progressive narrowing | Darm | Similarity→re-rank→traversal; requires graph structure |
| Multi-agent partitioning | NOMAD, Erikstad, SysTemp | Each agent sees only what it needs; adds orchestration complexity |
| Progressive disclosure | Lopopolo (OpenAI) | Map-first entry point with structured deep dives; scales to 1M LOC |
3.7.2 Design Patterns for MCP Tools
- JSON intermediate representation (Li, NOMAD): Convert SysML v2 to structured JSON before LLM processing
- Confidence scoring (Li): Tool responses should include confidence metadata
- Reachability-based pruning (Darm, Mao): Graph traversal to select relevant context
- Quality model injection (Trendowicz): Provide explicit evaluation criteria in prompt
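Two of the patterns above, confidence scoring (Li) and quality model injection (Trendowicz), can be combined into a uniform tool response envelope. The field names and toy tool below are our suggestion, not drawn from the papers:

```python
from dataclasses import dataclass, field, asdict

# Sketch of a tool response envelope: every reply carries a confidence
# score plus the explicit evaluation criteria the caller should apply.

@dataclass
class ToolResponse:
    result: dict
    confidence: float  # 0.0-1.0, per Li et al.'s confidence-metadata pattern
    criteria: list[str] = field(default_factory=list)  # injected quality model

def suggest_name(element_kind: str) -> ToolResponse:
    # Toy tool: propose a name and report how sure the server is.
    proposals = {"part": "VehicleAssembly", "port": "PowerInterface"}
    known = element_kind in proposals
    return ToolResponse(
        result={"name": proposals.get(element_kind, "Unnamed")},
        confidence=0.9 if known else 0.2,
        criteria=["name is CamelCase", "name reflects element role"],
    )

print(asdict(suggest_name("part")))
```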
3.7.3 Validated Claims
- SysML v2 textual notation is LLM-friendly (Bader explicitly recommends it over XMI)
- Raw LLM capability is insufficient for system model generation (SysMBench: best BLEU 4%, best SysMEval-F1 62% across 17 LLMs)
- Corpus scarcity demands structural tooling (SysTemp: templates compensate for lack of training data; SysMBench: enhancement strategies provide only marginal improvement)
- Draft-then-refine workflow works (Crabb: engineer maintains control)
- Single-shot generation is insufficient (Ferrari: correctness not significantly above baseline)
- Session isolation essential (Ferrari: memory pollution causes hallucinations)
- Environment design outweighs direct coding (Lopopolo: 1M LOC product built with zero manually-written code; engineering effort shifts to harness design)
3.7.4 Implications for Tool Architecture
The literature supports our central thesis (Section 5): harness design matters more than model capability. SysMBench [6] provides the strongest quantitative evidence—when the best available LLMs achieve only 4% BLEU on system model generation, external tooling is not optional but essential. Key architectural implications:
- Granular tools over monolithic operations: Enable staged workflows with verification gates
- Server-side context selection: Use parser/grammar to extract relevant subgraphs before LLM query
- Iterative refinement support: Design for multi-turn interactions, not single-shot generation
- Validation hooks: Syntax checking essential; LLMs cannot reliably self-correct parsing errors