3 Literature Review
3.1 Overview
This chapter surveys emerging research at the intersection of AI/LLMs and Model-Based Systems Engineering (MBSE). The literature review serves three purposes:
- Inform design decisions for the MCP server architecture
- Validate project relevance by identifying gaps in current tooling
- Position contributions within the academic landscape
3.2 AI + SysML v2 Research
3.2.1 Li et al. (2025) - LLM-Assisted Semantic Alignment for SysML v2
| Attribute | Value |
|---|---|
| Citation | [1] |
| Venue | IEEE ISSE 2025 |
Key Contribution: Proposes a prompt-driven approach for LLM-assisted semantic alignment of SysML v2 models across organizations.
7-stage iterative process with human-in-the-loop verification for cross-organizational model alignment. Key context management insights:
- Staged decomposition: Process split into discrete stages (preparation, extraction, matching, verification, generation, consistency check, export) with explicit user confirmation gates
- JSON intermediate representation: SysML v2 textual models converted to structured JSON before LLM processing
- Confidence scoring: All outputs include confidence metadata for verification
- Coverage checking: LLM explicitly confirms all elements processed to prevent silent omissions
On context windows: Paper explicitly acknowledges “attention degradation and token limits” as concerns. They balance prompt completeness vs. brevity by using detailed prompts for extraction stages, lighter prompts for validation.
MCP tool implications: Suggests tools like extract_model_elements → suggest_alignments → verify_alignment with JSON intermediate format.
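A minimal sketch of how this staged pipeline might surface as server-side tools, with toy extraction logic and confidence/coverage fields modeled on the paper's metadata pattern. Function names and schemas are our suggestion, not Li et al.'s implementation:

```python
# Hypothetical staged tool pipeline per Li et al. [1]: each stage returns
# structured JSON with confidence metadata; a harness would pause for user
# confirmation between stages. Extraction here is a toy: one element per
# "part def" line, not a real SysML v2 parser.

def extract_model_elements(sysml_text: str) -> dict:
    """Stage 1: convert SysML v2 text into a JSON-style intermediate form."""
    elements = [line.split()[2].rstrip(";") for line in sysml_text.splitlines()
                if line.strip().startswith("part def")]
    return {"elements": elements, "confidence": 0.9}

def suggest_alignments(left: dict, right: dict) -> dict:
    """Stage 2: propose name-based alignments between two element sets."""
    matches = [{"left": e, "right": e, "confidence": 1.0}
               for e in left["elements"] if e in right["elements"]]
    coverage = len(matches) / max(len(left["elements"]), 1)
    return {"alignments": matches, "coverage": coverage}

def verify_alignment(result: dict) -> dict:
    """Stage 3: coverage check, confirming nothing was silently dropped."""
    ok = all(a["confidence"] >= 0.5 for a in result["alignments"])
    return {"verified": ok, "coverage": result["coverage"]}

model_a = "part def Engine;\npart def Wheel;"
model_b = "part def Engine;\npart def Chassis;"
aligned = suggest_alignments(extract_model_elements(model_a),
                             extract_model_elements(model_b))
print(verify_alignment(aligned))  # → {'verified': True, 'coverage': 0.5}
```

The explicit coverage field mirrors the paper's coverage-checking insight: the harness, not the LLM, decides whether all elements were processed.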
3.2.2 Hendricks & Cicirello (2025) - Text to Model via SysML
| Attribute | Value |
|---|---|
| Citation | [2] |
| Venue | arXiv preprint |
Key Contribution: Five-step NLP pipeline converting natural language text to SysML diagrams (BDD), then to computational models via code generation.
5-step pipeline (preprocessing → knowledge graph → BDD → code → simulation) that deliberately minimizes LLM usage. Context management approach: avoidance.
- LLMs used only for sentence-level attribute extraction—never sees full documents
- Traditional NLP (TF-IDF, coreference resolution, OpenIE) handles document-level processing
- Per-sentence prompts sidestep context window limitations entirely
- Intermediate outputs at each step enable human inspection
Performance: Higher recall than GPT-4o zero-shot on key phrase extraction; validated end-to-end on a simple pendulum example, where Copilot hallucinated parameter values.
The “harness matters” lens: Demonstrates that reasonable results are achievable by sidestepping context management entirely, limiting the LLM to tiny, well-structured prompts.
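The avoidance strategy can be sketched as follows. The per-sentence extractor below is a regex stand-in for the pipeline's sentence-level LLM call; the sentence splitter stands in for the traditional NLP preprocessing:

```python
import re

# Sketch of the "avoidance" strategy from Hendricks & Cicirello [2]: the
# model never sees the whole document. Each sentence is processed in its
# own tiny prompt, so no call can exceed the context window. The extractor
# is a regex stand-in for the real attribute-extraction LLM call.

def extract_attributes(sentence: str) -> list[str]:
    # Stand-in for a sentence-level LLM prompt: pull "<number> <unit>" phrases.
    return re.findall(r"\d+(?:\.\d+)?\s*(?:kg|m|s)\b", sentence)

def process_document(text: str) -> dict[int, list[str]]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # One tiny prompt per sentence; document-level work stays in classic NLP.
    return {i: extract_attributes(s) for i, s in enumerate(sentences)}

doc = "The pendulum bob weighs 2 kg. Its rod is 1.5 m long. It swings freely."
print(process_document(doc))  # → {0: ['2 kg'], 1: ['1.5 m'], 2: []}
```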
3.2.3 Darm et al. (2025) - Inference-Time Intervention for Requirement Verification
| Attribute | Value |
|---|---|
| Citation | [3] |
| Venue | arXiv preprint |
Key Contribution: Uses intervention techniques on specific LLM attention heads to verify requirements against Capella SysML models. Achieves perfect precision on requirement fulfillment checking.
Inference-time intervention modifies 1-3 attention heads to achieve 100% precision on requirement verification. Key context management insight: progressive context narrowing.
Graph-based model representation:
- Capella models extracted to triple format: |Entity| |Relation| |Entity|
- Semantic similarity filtering finds top-k relevant components
- LLM re-ranking narrows to top-1
- Breadth-first traversal extracts adjacent components from starting point
- Chain-of-thought prompt with constrained output (“Final Answer: Yes/No”)
Key finding: LLMs exhibit “overconfidence” on requirement fulfillment—they default to “yes.” The intervention shifts toward conservative (higher precision) outputs.
MCP relevance: The subgraph extraction pattern (similarity → re-rank → BFS traversal) is directly applicable. The context management patterns transfer even when intervention techniques don’t.
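The narrowing pattern above can be sketched with toy stand-ins: word-overlap similarity instead of embeddings, a trivial re-ranker instead of an LLM call, and a dict-of-lists adjacency graph in place of an extracted Capella model:

```python
from collections import deque

# Sketch of Darm et al.'s [3] progressive context narrowing:
# similarity filter -> re-rank to top-1 -> BFS over adjacent components.
# GRAPH and the requirement text are invented examples.

GRAPH = {  # component -> adjacent components
    "Battery": ["PowerBus"], "PowerBus": ["Battery", "Radio", "Heater"],
    "Radio": ["PowerBus", "Antenna"], "Antenna": ["Radio"], "Heater": ["PowerBus"],
}

def similarity(req: str, component: str) -> float:
    # Toy stand-in for embedding similarity: exact word overlap.
    return 1.0 if component.lower() in set(req.lower().split()) else 0.0

def narrow_context(requirement: str, k: int = 3, depth: int = 1) -> list[str]:
    # 1. Similarity filtering keeps the top-k candidate components.
    ranked = sorted(GRAPH, key=lambda c: similarity(requirement, c), reverse=True)[:k]
    # 2. Re-ranking narrows to top-1 (an LLM call in the paper; first here).
    start = ranked[0]
    # 3. Breadth-first traversal collects the adjacent subgraph as context.
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if d < depth:
            for nxt in GRAPH[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return sorted(seen)

print(narrow_context("The radio shall receive power from the bus"))
# → ['Antenna', 'PowerBus', 'Radio']
```

Only the returned subgraph, not the full model, would be placed in the LLM's context before the chain-of-thought verification prompt.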
3.2.4 Otten et al. (2026) - Generative AI in Systems Engineering: LLM Risk Assessment
| Attribute | Value |
|---|---|
| Citation | [4] |
| Venue | IEEE SysCon 2026 |
Key Contribution: Introduces LLM Risk Assessment Framework (LRF) for evaluating LLM use in systems engineering across autonomy and impact dimensions.
LLM Risk Assessment Framework (LRF): 2D matrix classifying LLM applications by autonomy (4 levels: Assisted → Fully Automated) and impact (Low/Medium/High).
Autonomy levels (inspired by SAE driving automation):
- Level 0 - Assisted: Human in charge, LLM provides support
- Level 1 - Guided: LLM suggests, human approves
- Level 2 - Supervised: AI executes under monitoring
- Level 3 - Fully Automated: AI acts independently
MCP tool classification implications:
- Query/exploration tools → Level 0-1, Low impact → Minimal risk
- Model modification tools → Level 1-2, Medium-High impact → Medium risk
- Automated design generation → Level 2-3, High impact → High risk (requires safeguards)
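A hedged sketch of how the LRF matrix could gate tools in practice: each tool is tagged with an autonomy level and impact, and the harness derives a risk tier deciding which safeguards apply. The scoring thresholds are illustrative, not taken from the paper:

```python
# Illustrative LRF-style risk gating per Otten et al. [4]. The additive
# score and cutoffs are our own simplification of the 2D matrix.

IMPACT = {"low": 0, "medium": 1, "high": 2}

def lrf_risk(autonomy: int, impact: str) -> str:
    score = autonomy + IMPACT[impact]
    if score <= 1:
        return "minimal"
    if score <= 3:
        return "medium"
    return "high"  # e.g. automated design generation: requires safeguards

# The three tool families from the classification above:
print(lrf_risk(0, "low"))     # query/exploration tool → minimal
print(lrf_risk(2, "medium"))  # model modification tool → medium
print(lrf_risk(3, "high"))    # automated design generation → high
```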
3.2.5 Bouamra et al. (2025) - SysTemp: Template-Based SysML v2 Generation
| Attribute | Value |
|---|---|
| Citation | [5] |
| Venue | arXiv preprint |
Key Contribution: Multi-agent system using template-based structuring to generate SysML v2 from natural language, addressing corpus scarcity and complex syntax.
Multi-agent template generator that decomposes SysML v2 generation into structured template selection and population. Context management via template-mediated structuring—the template constrains LLM output to valid patterns.
Architecture:
- Template generator agent identifies appropriate SysML v2 structural patterns
- Population agent fills templates from natural language specifications
- Templates encode syntactic constraints, reducing hallucination risk
Key insight: Corpus scarcity is the central obstacle for SysML v2 generation. Templates compensate by providing structural scaffolding that LLMs cannot learn from limited examples.
MCP relevance: Template-based approach aligns with tool-mediated generation. An MCP server providing parse/validate feedback enables the same constraint-first pattern without hardcoded templates. Validates our grammar-first architecture: structural knowledge must come from tooling, not LLM training data.
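The tool-mediated, constraint-first loop can be sketched as below. The generator is a stub standing in for an LLM call, and the "grammar" check is a toy balance-and-keyword test rather than a real SysML v2 parser; the point is the feedback shape, not the validation logic:

```python
# Sketch of a parse/validate feedback loop replacing hardcoded templates:
# validation errors are fed back to the generator until output is valid.

def validate(text: str) -> list[str]:
    # Toy structural checks standing in for a real grammar-based validator.
    errors = []
    if text.count("{") != text.count("}"):
        errors.append("unbalanced braces")
    if "part def" not in text:
        errors.append("expected at least one 'part def'")
    return errors

def generate(prompt: str, feedback: list[str]) -> str:
    # Stub LLM: first attempt is malformed; with feedback it "corrects" itself.
    if feedback:
        return "part def Vehicle { part engine : Engine; }"
    return "part Vehicle {"

def generate_with_validation(prompt: str, max_rounds: int = 3) -> str:
    feedback: list[str] = []
    for _ in range(max_rounds):
        candidate = generate(prompt, feedback)
        feedback = validate(candidate)
        if not feedback:
            return candidate
    raise RuntimeError(f"still invalid after {max_rounds} rounds: {feedback}")

print(generate_with_validation("Model a vehicle with an engine"))
```

Where SysTemp encodes constraints in templates ahead of time, this pattern enforces them after the fact via tooling, which is what a grammar-aware server makes possible.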
3.2.6 Jin et al. (2025) - SysMBench: Benchmarking LLMs on System Model Generation
| Attribute | Value |
|---|---|
| Citation | [6] |
| Venue | arXiv preprint |
Key Contribution: First benchmark for evaluating LLM-generated system models. 151 curated scenarios across 17 LLMs demonstrate that raw LLM capability is insufficient—best BLEU score is 4%.
151 human-curated scenarios spanning multiple domains and difficulty levels, evaluated with SysMEval (semantic-aware metric) and traditional metrics (BLEU, CodeBLEU). Strongest quantitative evidence that LLMs cannot reliably generate system models without external support.
Benchmark design:
- Each scenario: NL requirements → model description language → visualized diagram
- Evaluation: BLEU, CodeBLEU, and custom SysMEval-F1 metric
- Three enhancement strategies tested: direct prompting, few-shot, chain-of-thought
Key findings:
- Best BLEU: 4% (across all 17 LLMs tested)
- Best SysMEval-F1: 62%
- Enhancement strategies provide marginal improvement
- Model description language syntax is a primary failure mode
MCP relevance: Provides the strongest quantitative argument for our thesis—harness design matters more than model capability. If the best LLMs achieve only 4% BLEU on system models, external tooling (parsing, validation, structural feedback) is not optional but essential. Validates the need for grammar-aware MCP tools that provide syntactic scaffolding.
3.3 AI + UML/General Modeling
3.3.1 Giannouris & Ananiadou (2025) - NOMAD: Multi-Agent UML Generation
| Attribute | Value |
|---|---|
| Citation | [7] |
| Venue | arXiv preprint |
Key Contribution: Cognitively-inspired multi-agent framework decomposing UML class diagram generation into entity extraction, relationship classification, and diagram synthesis.
Cognitively-inspired multi-agent framework decomposing UML generation into specialized subtasks. Context management via pipeline partitioning—each agent sees only what it needs.
Agent architecture:
| Agent | Input | Output |
|---|---|---|
| Concept Extractor | NL requirements | Classes + attributes |
| Relationship Comprehender | Requirements + entities | Typed relationships |
| Model Integrator | Entities + relationships | JSON intermediate |
| Code Articulator | JSON | PlantUML |
Key insight: JSON intermediate representation acts as “context checkpoint”—formalizing output before next stage removes NL ambiguity.
Performance: F1 improves 0.66 → 0.70 (breadth), 0.74 → 0.84 (depth). Relationship modeling sees largest gains (0.52 → 0.92).
MCP relevance: Pipeline pattern with intermediate representations directly applicable. Schema constraints align with MCP tool design.
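The "context checkpoint" idea can be sketched as a shape-check between stages: each agent's output must parse as JSON and satisfy a minimal schema before the next stage runs. The hand-rolled check below stands in for a real validator such as jsonschema or pydantic, and the field names are invented:

```python
import json

# Sketch of NOMAD's [7] JSON checkpoint between agents: formalizing output
# before the next stage removes NL ambiguity and fails fast on malformed
# intermediate results.

def checkpoint(raw_agent_output: str) -> dict:
    """Parse and shape-check an agent's JSON output before handing it on."""
    data = json.loads(raw_agent_output)  # fails loudly on non-JSON output
    assert isinstance(data.get("classes"), list), "missing 'classes' list"
    for cls in data["classes"]:
        assert "name" in cls and "attributes" in cls, f"malformed class: {cls}"
    return data

def to_plantuml(model: dict) -> str:
    """Final (Code Articulator) stage: JSON intermediate -> PlantUML text."""
    lines = ["@startuml"]
    for cls in model["classes"]:
        lines.append(f"class {cls['name']} {{")
        lines += [f"  {attr}" for attr in cls["attributes"]]
        lines.append("}")
    lines.append("@enduml")
    return "\n".join(lines)

raw = '{"classes": [{"name": "Order", "attributes": ["id: int", "total: float"]}]}'
print(to_plantuml(checkpoint(raw)))
```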
3.3.2 Ferrari et al. (2024) - Model Generation with LLMs: Requirements to UML
| Attribute | Value |
|---|---|
| Citation | [8] |
| Venue | arXiv preprint |
Key Contribution: Evaluates ChatGPT generating UML sequence diagrams from 28 requirements documents. Identifies challenges with requirements smells.
First systematic study of GPT-3.5 generating UML sequence diagrams from 28 real-world requirements documents (87 variants). Qualitative analysis by 3 experts identified 23 categories of issues.
Performance (5-point scale; scale midpoint = 3, ✓ marks scores significantly above the midpoint):
- Standard adherence: 4.54 ✓
- Terminological alignment: 4.49 ✓
- Understandability: 4.37 ✓
- Completeness: 3.63 ✓
- Correctness: 3.22 (NOT significantly above mean—critical weakness)
Key failure modes:
- Requirements smells (ambiguity/inconsistency) cause LLM to “hide” conflicts by abstracting
- Session memory pollution causes hallucinations
- Cross-reference handling fails when context unavailable
MCP implications: Single-shot generation inadequate; need iterative refinement. Session isolation essential.
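The session-isolation implication can be sketched minimally: give each generation task a fresh context so residue from earlier documents cannot pollute later outputs. The Session class and stub model are our illustration, not an API from the paper:

```python
# Sketch of session isolation per Ferrari et al.'s [8] memory-pollution
# finding: one fresh session per task, no shared history across documents.

class Session:
    def __init__(self):
        self.history: list[str] = []

    def ask(self, prompt: str) -> str:
        self.history.append(prompt)
        # Stub model: the reply "sees" everything in this session's history.
        return f"reply based on {len(self.history)} message(s)"

def run_isolated(tasks: list[str]) -> list[str]:
    # A new Session per task guarantees no cross-document pollution.
    return [Session().ask(t) for t in tasks]

print(run_isolated(["doc A requirements", "doc B requirements"]))
# → ['reply based on 1 message(s)', 'reply based on 1 message(s)']
```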
3.4 Foundational Industry Work
3.4.1 Bader et al. (2024) - User-Centric MBSE Using Generative AI
| Attribute | Value |
|---|---|
| Citation | [9] |
| Venue | MODELSWARD 2024 |
Key Contribution: LLM-assisted model understanding patterns. Explicitly recommends SysML v2 as more LLM-friendly than XMI.
Fine-tuned GPT-3.5 on UML component diagrams (XMI format) to generate models from natural language. Identifies three critical obstacles and explicitly recommends SysML v2 as more LLM-friendly.
Critical findings:
- Context window severely limiting: the 16K-token limit was exceeded after generating only ~8 elements
- XMI verbosity exacerbates context issues (~30% reduction via pre-processing still insufficient)
- Moving context window prevents referencing past elements
SysML v2 validation: Paper explicitly recommends SysML v2’s textual notation as superior for LLM interaction—validates our project direction.
MCP implications: Context chunking critical even for simple models. Server should abstract/manage element IDs. Post-processing required—implement syntax checking.
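The ID-abstraction implication can be sketched as a registry that swaps verbose XMI-style identifiers for short handles before text reaches the LLM, and maps them back afterwards. The class, names, and example ID below are invented for illustration:

```python
# Sketch of server-side element ID management per Bader et al. [9]:
# short handles in LLM-facing text save tokens and keep references stable
# across a moving context window; the server resolves them back.

class IdRegistry:
    def __init__(self):
        self._by_handle: dict[str, str] = {}
        self._by_id: dict[str, str] = {}

    def shorten(self, element_id: str) -> str:
        # Idempotent: the same element always gets the same handle.
        if element_id not in self._by_id:
            handle = f"e{len(self._by_id)}"
            self._by_id[element_id] = handle
            self._by_handle[handle] = element_id
        return self._by_id[element_id]

    def resolve(self, handle: str) -> str:
        return self._by_handle[handle]

reg = IdRegistry()
xmi_id = "_gM4x0F2qEe-9XYZ_component_Engine"  # invented XMI-style ID
h = reg.shorten(xmi_id)
print(h, "->", reg.resolve(h))  # → e0 -> _gM4x0F2qEe-9XYZ_component_Engine
```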
3.4.2 Neema et al. (2025) - Evaluating Engineering AGI
| Attribute | Value |
|---|---|
| Citation | [10] |
| Venue | arXiv preprint |
Key Contribution: Bloom’s taxonomy-based framework for evaluating engineering AGI capabilities, including CAD/SysML model evaluation criteria.
Bloom’s taxonomy-based framework for evaluating engineering AGI agents across cognitive levels from factual recall to meta-reasoning.
6 cognitive levels:
- Remember - Factual recall (equations, standards)
- Understand - Interpret design structure/function
- Apply - Predict performance, invoke simulation tools
- Analyze - Complete partial designs, detect errors
- Create - Synthesize full designs from requirements
- Reflect - Critique decisions, recognize limitations
MCP relevance:
- Level 3+ expects agents to “invoke external tools such as solvers, simulators”—aligns with MCP tool-calling
- Structured artifact I/O assumes automated validation—MCP could provide these tools
- Metadata-driven test generation enables domain-specific benchmarks
3.4.3 Lopopolo (2026) - Harness Engineering: Agent-First Software Development
| Attribute | Value |
|---|---|
| Citation | [11] |
| Venue | OpenAI Engineering Blog |
Key Contribution: Documents building a million-line production product with zero manually-written code, identifying environment design as the primary engineering activity in agent-first workflows.
Five-month experiment building an internal beta product entirely via Codex agents (~1,500 PRs, 3-7 engineers). Key findings directly relevant to MCP server design:
- Progressive disclosure over monolithic context: “One big AGENTS.md” failed; replaced with a structured docs/ directory where a short map (~100 lines) points to deeper sources of truth. Validates L0/L1/L2 tiered loading patterns
- Repository knowledge as system of record: Context that lives outside the repository (chat threads, documents, tribal knowledge) is effectively invisible to agents. All architectural decisions, plans, and quality standards must be versioned and co-located with code
- Mechanical enforcement: Custom linters and structural tests enforce architectural invariants (dependency directions, naming conventions, file size limits). Error messages inject remediation instructions into agent context
- “Golden principles” and garbage collection: Recurring agent tasks scan for pattern drift, update quality grades, and open targeted refactoring PRs—continuous entropy management rather than periodic cleanup
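The mechanical-enforcement pattern can be illustrated with a structural check whose failure message carries remediation instructions straight into the agent's context. The file-size limit and wording below are invented examples, not OpenAI's actual linter:

```python
# Illustrative "mechanical enforcement" check per the Codex harness write-up
# [11]: architectural invariants are enforced by tooling, and the error text
# itself tells the agent how to fix the violation.

MAX_LINES = 400  # hypothetical architectural invariant: file size limit

def check_file_size(path: str, source: str) -> list[str]:
    n = len(source.splitlines())
    if n <= MAX_LINES:
        return []
    return [
        f"{path}: {n} lines exceeds the {MAX_LINES}-line limit. "
        "Remediation: split this module along its class boundaries and "
        "re-export the public names from the original path."
    ]

violations = check_file_size("services/orders.py", "x = 1\n" * 500)
for msg in violations:
    print(msg)
```

Because the remediation text lands in the agent's context verbatim, the check doubles as a targeted prompt, which is what makes corrections cheap at high throughput.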
On throughput: Average 3.5 PRs per engineer per day. Agent runs regularly exceed 6 hours on single tasks. Minimal blocking merge gates because corrections are cheap at high throughput.
MCP relevance: The article’s framing of “harness engineering” as a discipline—designing environments, feedback loops, and control systems for agents—validates this project’s central thesis (Section 5). Their progressive disclosure patterns map directly to our L0/L1/L2 context budgeting approach. Their observation that “boring technologies” are easier for agents to model supports our tree-sitter and MCP protocol choices.
3.5 Additional References
| Paper | Key Finding | Domain |
|---|---|---|
| Jerry et al. (2026) [12] | SysML v2 as semantic backbone for WCAG-compliant UI generation | Healthcare |
| Erikstad (2024) [13] | CrewAI multi-agent + MBSE for ship design optimization | Marine |
| Rouabhia & Hadjadj (2025) [14] | 9-LLM benchmark on UML method generation; 100% syntactic validity | Benchmarking |
| Mao et al. (2025) [15] | Data dependency inference improves code gen +11.66% | Code generation |
| Trendowicz et al. (2026) [16] | GPT-4o requirements quality assessment validated by experts | Agile RE |
| Crabb & Jones (2024) [17] | “Draft materials” workflow (LLM generates, engineer refines) effective | Industry practice |
3.6 Research Gaps & Our Contribution
| Gap Identified | Evidence | Our Response |
|---|---|---|
| No MCP-based MBSE integration | 0 MCP servers for SysML in 7,364+ public repos | open-mcp-sysml MCP server |
| Limited open source SysML v2 + AI tooling | Sensmetry sysml-2ls archived Oct 2025 | tree-sitter-sysml grammar (MIT) |
| Raw LLM capability insufficient | SysMBench: 4% BLEU across 17 LLMs | Grammar-aware MCP tools provide structural feedback |
| Validation approaches still emerging | Papers focus on specific interventions | Hybrid architecture: local parse + API validation |
| No standardized AI-MBSE interface | Each paper implements custom integration | MCP protocol standardization |
3.7 Synthesis
Analysis of the literature reveals consistent themes directly applicable to MCP server design.
3.7.1 Context Management Strategies
| Strategy | Papers | Trade-off |
|---|---|---|
| Avoidance | Hendricks | Limit LLM to sentence-level; sidesteps problem but limits capability |
| Staged decomposition | Li, NOMAD | Pipeline with intermediate representations; adds latency but controls context |
| Template-mediated structuring | SysTemp | Templates constrain output to valid patterns; requires template library |
| Progressive narrowing | Darm | Similarity→re-rank→traversal; requires graph structure |
| Multi-agent partitioning | NOMAD, Erikstad, SysTemp | Each agent sees only what it needs; adds orchestration complexity |
| Progressive disclosure | Lopopolo (OpenAI) | Map-first entry point with structured deep dives; scales to 1M LOC |
3.7.2 Design Patterns for MCP Tools
- JSON intermediate representation (Li, NOMAD): Convert SysML v2 to structured JSON before LLM processing
- Confidence scoring (Li): Tool responses should include confidence metadata
- Reachability-based pruning (Darm, Mao): Graph traversal to select relevant context
- Quality model injection (Trendowicz): Provide explicit evaluation criteria in prompt
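Two of the patterns above, confidence scoring (Li) and quality model injection (Trendowicz), can be combined into a uniform tool response envelope. The field names and toy tool below are our suggestion, not drawn from the papers:

```python
from dataclasses import dataclass, field, asdict

# Sketch of a tool response envelope: every reply carries a confidence
# score plus the explicit evaluation criteria the caller should apply.

@dataclass
class ToolResponse:
    result: dict
    confidence: float  # 0.0-1.0, per Li et al.'s confidence-metadata pattern
    criteria: list[str] = field(default_factory=list)  # injected quality model

def suggest_name(element_kind: str) -> ToolResponse:
    # Toy tool: propose a name and report how sure the server is.
    proposals = {"part": "VehicleAssembly", "port": "PowerInterface"}
    known = element_kind in proposals
    return ToolResponse(
        result={"name": proposals.get(element_kind, "Unnamed")},
        confidence=0.9 if known else 0.2,
        criteria=["name is CamelCase", "name reflects element role"],
    )

print(asdict(suggest_name("part")))
```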
3.7.3 Validated Claims
- SysML v2 textual notation is LLM-friendly (Bader explicitly recommends it over XMI)
- Raw LLM capability is insufficient for system model generation (SysMBench: best BLEU 4%, best SysMEval-F1 62% across 17 LLMs)
- Corpus scarcity demands structural tooling (SysTemp: templates compensate for lack of training data; SysMBench: enhancement strategies provide only marginal improvement)
- Draft-then-refine workflow works (Crabb: engineer maintains control)
- Single-shot generation is insufficient (Ferrari: correctness not significantly above baseline)
- Session isolation essential (Ferrari: memory pollution causes hallucinations)
- Environment design outweighs direct coding (Lopopolo: 1M LOC product built with zero manually-written code; engineering effort shifts to harness design)
3.7.4 Implications for Tool Architecture
The literature supports our central thesis (Section 5): harness design matters more than model capability. SysMBench [6] provides the strongest quantitative evidence—when the best available LLMs achieve only 4% BLEU on system model generation, external tooling is not optional but essential. Key architectural implications:
- Granular tools over monolithic operations: Enable staged workflows with verification gates
- Server-side context selection: Use parser/grammar to extract relevant subgraphs before LLM query
- Iterative refinement support: Design for multi-turn interactions, not single-shot generation
- Validation hooks: Syntax checking essential; LLMs cannot reliably self-correct parsing errors