Overview
This appendix defines concrete, reproducible benchmark tasks for evaluating MCP-based SysML v2 tooling. Each vignette measures token efficiency, retrieval accuracy, or both, using publicly available models.
Design Principles
- Reproducibility: All models are publicly available on GitHub
- Measurability: Each task has a deterministic ground truth
- Discriminative: The “dump everything” baseline fails; targeted retrieval succeeds
- Executable: Implementable within a 3-week timeline (by March 2026)
Source Models
Repository                                      Model                      Files  Chars   Est. Tokens
GfSE/SysML-v2-Models                            Eve Online Mining Frigate  18     ~98K    ~25K
GfSE/SysML-v2-Models                            VehicleModel.sysml         1      ~27K    ~7K
Systems-Modeling/SysML-v2-Pilot-Implementation  Training Examples          ~40+   ~200K   ~50K
Token estimates use a heuristic of 4 characters per token for code.
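The heuristic can be sketched in a few lines. This is a minimal illustration that reuses the character counts listed for the source models; estimate_tokens is a hypothetical helper name, and a real tokenizer will produce different counts.

```python
def estimate_tokens(char_count: int, chars_per_token: int = 4) -> int:
    """Rough token estimate: characters divided by an assumed chars-per-token ratio."""
    return round(char_count / chars_per_token)


# Character counts taken from the source model list (approximate values).
models = [
    ("Eve Online Mining Frigate", 98_000),
    ("VehicleModel.sysml", 27_000),
    ("Training Examples", 200_000),
]

for name, chars in models:
    print(f"{name}: ~{estimate_tokens(chars):,} tokens")
```

The estimates only need to be accurate enough to compare the baseline and MCP conditions; actual runs should record the token counts reported by the model API.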
Vignette 1: Requirement Constraint Extraction
Task: Extract all numeric constraint values from requirements in a multi-file model.
Specification
Question: “List all requirements with numeric thresholds and their constraint values”
Input Model: Eve Online Mining Frigate (18 files, ~98K chars)
Files Required: MiningFrigateRequirementsDef.sysml (4.8K chars)
Measurement: Token efficiency + Accuracy
Ground Truth
MFRQ01: miningRateLS >= 50.0 (m³/min)
MFRQ02: cargoCapacity >= 5000.0 (m³)
MFRQ03: shieldStrengthHS >= 200.0 (DPS), shieldStrengthLS >= 400.0 (DPS)
MFRQ04: droneCapacity >= 5
STRQ05: threatDetectionRange >= 20.0 (AU)
MFRQ06: warpSpeed >= 5.0 (AU/s), alignTime <= 3.0 (s)
MFRQ07: lockedTargets <= 3
MFRQ08: dockingTime <= 60.0 (s)
MFRQ09: fleetSize <= 10
MFRQ10: compressionFactor == 10.0
Evaluation
Accuracy: F1 score = 2 × correct / (2 × correct + missed + hallucinated)
Token Efficiency: Baseline tokens / MCP tokens
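Both metrics can be computed from sets of extracted constraints and recorded token counts. The sketch below uses the standard F1 definition; the tuple layout (id, attribute, operator, value) and the function names are assumptions made for illustration, and the predicted set is a made-up model answer.

```python
def f1_score(predicted: set, truth: set) -> float:
    """F1 over sets of extracted constraints (standard definition)."""
    correct = len(predicted & truth)        # true positives
    hallucinated = len(predicted - truth)   # false positives
    missed = len(truth - predicted)         # false negatives
    denom = 2 * correct + hallucinated + missed
    return 2 * correct / denom if denom else 0.0


def token_efficiency(baseline_tokens: int, mcp_tokens: int) -> float:
    """Ratio of baseline input tokens to MCP input tokens."""
    return baseline_tokens / mcp_tokens


# Illustrative subset of the ground truth: (id, attribute, operator, value).
truth = {
    ("MFRQ01", "miningRateLS", ">=", 50.0),
    ("MFRQ02", "cargoCapacity", ">=", 5000.0),
    ("MFRQ04", "droneCapacity", ">=", 5.0),
}
# Hypothetical answer: two constraints correct, one with a wrong value.
predicted = {
    ("MFRQ01", "miningRateLS", ">=", 50.0),
    ("MFRQ02", "cargoCapacity", ">=", 5000.0),
    ("MFRQ04", "droneCapacity", ">=", 50.0),
}

print(f"F1 = {f1_score(predicted, truth):.2f}")  # F1 = 0.67
print(f"Efficiency = {token_efficiency(25_000, 1_500):.1f}:1")
```

A wrong value counts twice here: once as a hallucinated constraint and once as a missed one, which is the intended penalty for plausible but incorrect extractions.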
Baseline vs MCP
Approach                Input Tokens  Expected Accuracy
Baseline (all files)    ~25,000       60-80% (distractor content)
MCP (targeted query)    ~1,500        95%+ (requirement file only)
Vignette 2: Requirements-to-Verification Traceability
Task: For a specific requirement, identify which verification cases test it.
Specification
Question: “Which verification case tests requirement MFRQ03 (SurvivabilityRequirement)?”
Input Model: Eve Online Mining Frigate
Files Required: MiningFrigateVerificationCases.sysml, MiningFrigateRequirements.sysml
Measurement: Accuracy (correct/incorrect, with partial credit for a correct case name)
Ground Truth
Requirement: MFRQ03 (SurvivabilityRequirement)
Verification Case: SurvivabilityTest
Verification Instance: survivabilityTest
Method: test
Actions: simulateHighSecAttack, simulateLowSecAttack, evaluateData
Verdict Binding: survivabilityRequirementLowSec
Evaluation
Accuracy: 1 if the correct case is identified together with its method; 0 if the case is wrong
Partial Credit: 0.5 if the case name is correct but the method or actions are wrong
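The rubric can be expressed as a small scoring function. This is a sketch under one reading of the rubric, in which full credit requires the case name, the method, and the action set to all match; the dictionary layout and the function name are assumptions.

```python
# Ground truth for Vignette 2, taken from the values listed above.
TRUTH = {
    "case": "SurvivabilityTest",
    "method": "test",
    "actions": ["simulateHighSecAttack", "simulateLowSecAttack", "evaluateData"],
}


def score_traceability(answer: dict, truth: dict = TRUTH) -> float:
    """Return 1.0, 0.5, or 0.0 following the accuracy / partial-credit rubric."""
    if answer.get("case") != truth["case"]:
        return 0.0  # wrong verification case: no credit
    method_ok = answer.get("method") == truth["method"]
    actions_ok = set(answer.get("actions", [])) == set(truth["actions"])
    # Assumption: both method and actions must match for full credit.
    return 1.0 if method_ok and actions_ok else 0.5


print(score_traceability({"case": "SurvivabilityTest", "method": "analysis"}))  # 0.5
```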
Baseline vs MCP
Approach  Input Tokens  Notes
Baseline  ~25,000       Must find verify survivabilityRequirementLowSec across files
MCP       ~3,500        Query verification cases, follow verify reference
Vignette 3: Cross-File Interface Compatibility
Task: Determine if two ports are type-compatible for connection.
Specification
Question: “Can MiningFrigate.controlPort connect to Domain.PodPort?”
Input Model: Eve Online Mining Frigate
Files Required: MiningFrigate.sysml, Domain.sysml
Measurement: Accuracy + Reasoning correctness
Ground Truth
MiningFrigate.controlPort : ~Domain::PodPort (conjugate)
Domain.PodPort : (defined in Domain.sysml)
Compatibility: YES
Reason: controlPort is typed by the conjugate (~) of PodPort, so its feature directions are reversed and match those of a PodPort on the other end of the connection
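The conjugation check at the heart of this vignette can be illustrated with a deliberately simplified sketch. It compares type names as strings and treats two ports as connectable when they share a base type and exactly one is conjugated; it ignores specialization and feature-level conformance, which a real SysML v2 tool would have to resolve. The function names are hypothetical.

```python
def split_port_type(port_type: str) -> tuple[bool, str]:
    """Return (is_conjugated, base type name) for a type such as '~Domain::PodPort'."""
    text = port_type.strip()
    conjugated = text.startswith("~")
    return conjugated, text.lstrip("~").strip()


def can_connect(type_a: str, type_b: str) -> bool:
    """Simplified rule: same base type, and exactly one side is conjugated."""
    conj_a, base_a = split_port_type(type_a)
    conj_b, base_b = split_port_type(type_b)
    return base_a == base_b and conj_a != conj_b


print(can_connect("~Domain::PodPort", "Domain::PodPort"))  # True
```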
Evaluation
Binary Accuracy: 1 if the yes/no answer is correct, 0 otherwise
Reasoning Score: 1 if the conjugate relationship is explained, 0 otherwise
Baseline vs MCP
Approach  Input Tokens  Notes
Baseline  ~25,000       Must understand ~ conjugate syntax across files
MCP       ~4,000        Query port definitions, resolve type references
Vignette 4: Element Inventory Count
Task: Count specific element types across a model.
Specification
Question: “How many part def definitions exist in the VehicleModel?”
Input Model: VehicleModel.sysml (598 lines, 27K chars)
Measurement: Exact match accuracy
Ground Truth
part def count: 32
Including: Vehicle, Engine, Cylinder, Transmission, Driveshaft,
AxleAssembly, Axle, FrontAxle, HalfAxle, Differential, Wheel,
Software, VehicleSoftware, VehicleController, FuelTank, Road,
VehicleRoadContext, SpatialTemporalReference, Engine4Cyl,
Engine6Cyl, TransmissionChoices, TransmissionAutomatic,
TransmissionManual, Sunroof, ...
Evaluation
Exact Match: 1 if count == 32
Tolerance: 0.5 partial credit if the count is within ±2 of 32 but not exact; 0 otherwise
Baseline vs MCP
Approach  Input Tokens  Notes
Baseline  ~7,000        Must parse and count every part def
MCP       ~500          count_elements(type="part def") tool call
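The source does not describe how count_elements is implemented, so the following is only a rough stand-in: a regular-expression count of part def declarations plus the scoring rule with ±2 tolerance. A real tool would use a SysML v2 parser; this sketch skips line comments only by accident of the pattern and does not handle block comments. The sample text is invented.

```python
import re

# Matches 'part def Name' at the start of a line, optionally preceded by 'abstract'.
PART_DEF = re.compile(r"^[ \t]*(?:abstract[ \t]+)?part[ \t]+def[ \t]+(\w+)", re.MULTILINE)


def count_part_defs(sysml_text: str) -> int:
    """Count part definitions by pattern matching (approximation, not a parser)."""
    return len(PART_DEF.findall(sysml_text))


def score_count(predicted: int, expected: int = 32, tolerance: int = 2) -> float:
    """1.0 for an exact match, 0.5 within the tolerance, 0.0 otherwise."""
    if predicted == expected:
        return 1.0
    return 0.5 if abs(predicted - expected) <= tolerance else 0.0


sample = """
package Demo {
    part def Vehicle {
        part engine : Engine;
    }
    part def Engine;
    abstract part def Software;
    // part def Commented;
}
"""
print(count_part_defs(sample))  # 3
```

Part usages such as "part engine : Engine;" are not counted because the pattern requires the def keyword.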
Vignette 5: Constraint Satisfaction Check
Task: Given attribute values, determine if a constraint is satisfied.
Specification
Question: “If miningRateLS = 45.0, does the Mining Frigate satisfy MFRQ01?”
Input Model: Eve Online Mining Frigate
Files Required: MiningFrigateRequirementsDef.sysml
Measurement: Accuracy + Reasoning
Ground Truth
Requirement MFRQ01 (OreExtractionEfficiencyRequirement):
    require constraint { miningRateLS >= 50.0 }
Given: miningRateLS = 45.0
Evaluation: 45.0 >= 50.0 → FALSE
Answer: NO, the requirement is NOT satisfied (45.0 is below the required 50.0)
Evaluation
Binary Accuracy: 1 if the yes/no answer is correct
Reasoning: 1 if the constraint value is cited correctly
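The ground-truth evaluation is a single comparison, which can be reproduced mechanically. The sketch below assumes constraints are reduced to an operator and a threshold, as in the Vignette 1 ground truth; the satisfies helper is a hypothetical name.

```python
import operator

# Comparison operators that appear in the requirement constraints.
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}


def satisfies(value: float, op: str, threshold: float) -> bool:
    """Evaluate a simple numeric constraint of the form 'value op threshold'."""
    return OPS[op](value, threshold)


# MFRQ01 requires a mining rate of at least 50.0; the given value is 45.0.
print(satisfies(45.0, ">=", 50.0))  # False
```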
Vignette 6: State Machine Transition Query
Task: Identify what triggers a specific state transition.
Specification
Question: “What command triggers the Mining Frigate to transition from InGrid to OnWarp state?”
Input Model: Eve Online Mining Frigate
Files Required: MiningFrigate.sysml
Measurement: Accuracy
Ground Truth
State Machine: miningFrigatesStates
Transition: inGrid_to_onWarp
    first InGrid
    accept warpCommand : Domain::ShipCommand via miningFrigates.controlPort
    do action executeWarpDrive : ExecuteWarpDrive
    then OnWarp
Trigger: warpCommand (Domain::ShipCommand) via controlPort
Action: executeWarpDrive
Evaluation
Trigger Accuracy: 1 if warpCommand is identified
Port Accuracy: 1 if controlPort is identified
Action Accuracy: 1 if executeWarpDrive is identified
Composite: Average of the three scores
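The composite score can be computed from a free-text answer with simple substring checks. This is a minimal sketch, assuming case-sensitive matching on the three identifiers is an acceptable proxy for "identified"; a stricter grader might require structured output instead.

```python
# Identifiers from the ground truth, keyed by the criterion they satisfy.
EXPECTED = {
    "trigger": "warpCommand",
    "port": "controlPort",
    "action": "executeWarpDrive",
}


def score_transition(answer: str) -> dict:
    """Score each criterion as 1.0 or 0.0 and add their average as 'composite'."""
    scores = {name: 1.0 if ident in answer else 0.0 for name, ident in EXPECTED.items()}
    scores["composite"] = sum(scores.values()) / len(EXPECTED)
    return scores


answer = "The warpCommand received via controlPort triggers the transition."
print(score_transition(answer))
```

In this example the trigger and port are found but the action is not, so the composite is two thirds.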
Vignette 7: Stakeholder Concern Traceability
Task: Find which requirements address a specific stakeholder concern.
Specification
Question: “Which requirements frame the SecurityConcern?”
Input Model: Eve Online Mining Frigate
Files Required: MiningFrigateRequirementsDef.sysml, Concerns.sysml
Measurement: Precision/Recall
Ground Truth
Requirements framing SecurityConcern:
1. MFRQ03 (SurvivabilityRequirement) - frame concern SecurityConcern
2. STRQ05 (ThreatDetectionRequirement) - frame concern SecurityConcern
Evaluation
Precision: Correct / Total Returned
Recall: Correct / Ground Truth (2)
F1: 2 × (P × R) / (P + R)
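As a worked example of these three formulas, the sketch below scores a hypothetical answer that returns one correct and one incorrect requirement. The function name and the zero-division conventions (a score of 0.0 when nothing is returned) are assumptions.

```python
def precision_recall_f1(returned: set, truth: set) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for a set of returned requirement IDs."""
    correct = len(returned & truth)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


truth = {"MFRQ03", "STRQ05"}      # requirements framing SecurityConcern
returned = {"MFRQ03", "MFRQ06"}   # hypothetical answer: one hit, one miss

p, r, f1 = precision_recall_f1(returned, truth)
print(p, r, f1)  # 0.5 0.5 0.5
```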
Vignette 8: Import Dependency Resolution
Task: Trace what a specific element depends on through imports.
Specification
Question: “What packages must be imported to use MiningFrigate::MiningFrigate?”
Input Model: Eve Online Mining Frigate
Files Required: Multiple (follow import chain)
Measurement: Completeness
Ground Truth
Direct imports in MiningFrigate.sysml:
- ScalarValues::*
- ISQ::*
- SI::*
- ParametersOfInterestMetadata::*
- OperationalUseCaseActions::*
- Domain::*
Transitive: Domain imports additional packages...
Evaluation
Direct Import Recall: Correct direct imports / 6
Token Efficiency: Baseline tokens / MCP tokens needed to reach the answer
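Direct import recall can be checked by extracting import statements from the file text and comparing them with the six expected entries. The sketch below is an approximation based on a regular expression rather than a SysML v2 parser, and the sample text is invented for illustration; it is not the content of MiningFrigate.sysml.

```python
import re

# Matches 'import X::*;' with an optional visibility keyword in front.
IMPORT = re.compile(
    r"^[ \t]*(?:(?:private|public)[ \t]+)?import[ \t]+([^;\s]+)[ \t]*;",
    re.MULTILINE,
)

EXPECTED_IMPORTS = {
    "ScalarValues::*", "ISQ::*", "SI::*",
    "ParametersOfInterestMetadata::*", "OperationalUseCaseActions::*", "Domain::*",
}


def direct_imports(sysml_text: str) -> set:
    """Return the set of import targets declared in the given text."""
    return set(IMPORT.findall(sysml_text))


def import_recall(found: set, expected: set = EXPECTED_IMPORTS) -> float:
    """Fraction of expected direct imports that were found."""
    return len(found & expected) / len(expected)


sample = """
package Example {
    private import ScalarValues::*;
    private import ISQ::*;
    import Domain::*;
}
"""
print(import_recall(direct_imports(sample)))  # 0.5
```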
Summary: Benchmark Matrix
Vignette  Category       Metric             Expected Efficiency  Difficulty
V1        Extraction     F1 + Efficiency    15:1                 Medium
V2        Traceability   Accuracy           7:1                  Medium
V3        Compatibility  Accuracy + Reason  6:1                  Hard
V4        Inventory      Exact Match        14:1                 Easy
V5        Constraint     Accuracy + Reason  10:1                 Easy
V6        Behavior       Composite          8:1                  Medium
V7        Traceability   F1                 10:1                 Medium
V8        Resolution     Completeness       5:1                  Hard
Execution Protocol
Setup
1. Clone model repositories:
   git clone https://github.com/GfSE/SysML-v2-Models.git
   git clone https://github.com/Systems-Modeling/SysML-v2-Pilot-Implementation.git
2. Configure the MCP server with model paths
3. Prepare baseline prompts (concatenated file contents)
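Preparing a baseline prompt amounts to concatenating the model files and appending the question. The sketch below shows one way to do that; the file-header comment format, the function name, and the directory layout are assumptions, not part of the benchmark definition.

```python
from pathlib import Path


def build_baseline_prompt(model_dir: str, question: str) -> str:
    """Concatenate every .sysml file under model_dir and append the question."""
    parts = []
    # Sort paths so the prompt is identical across runs.
    for path in sorted(Path(model_dir).rglob("*.sysml")):
        text = path.read_text(encoding="utf-8")
        parts.append(f"// File: {path.name}\n{text}")
    return "\n\n".join(parts) + f"\n\nQuestion: {question}"
```

Sorting the file list keeps the baseline prompt deterministic, which matters when the same prompt is submitted several times per condition.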
Per-Vignette Execution
1. Baseline Run:
   - Concatenate all model files into the prompt
   - Record the input token count
   - Submit the question and record the response
   - Evaluate against ground truth
2. MCP Run:
   - Submit the question with MCP tools available
   - Record tool calls and their token costs
   - Record total token usage
   - Evaluate against ground truth
3. Metrics Collection:
   - Accuracy score per evaluation criteria
   - Input tokens (baseline)
   - Input tokens (MCP total)
   - Output tokens (both conditions)
   - Latency (optional)
Statistical Validity
- Run each vignette 3 times per condition
- Report the mean and standard deviation
- Use the same model temperature for every run (0.0 for reproducibility)
- Document the model version and run date
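Aggregating the three runs per condition needs only the standard library. This sketch uses the sample standard deviation (n - 1 in the denominator), which is an assumption since the text does not say which variant to report; the scores shown are invented.

```python
from statistics import mean, stdev


def summarize(scores: list) -> tuple:
    """Return (mean, sample standard deviation) for one vignette and condition."""
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Hypothetical accuracy scores from three runs of one vignette.
runs = [1.0, 0.5, 1.0]
avg, sd = summarize(runs)
print(f"mean={avg:.3f} sd={sd:.3f}")  # mean=0.833 sd=0.289
```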
Future Extensions
Additional Vignettes (Post-March)
- V9: Multi-model consistency checking
- V10: Requirements completeness analysis
- V11: Allocation verification (requirements → design)
- V12: Change impact analysis
Expanded Model Set
- OMG SysML v2 training examples (~50K tokens)
- Larger industrial models (if available under open license)
- Synthetic models with known properties