Appendix C — Benchmark Vignettes: MCP Evaluation Tasks

C.1 Overview

This appendix defines concrete, reproducible benchmark tasks for evaluating MCP-based SysML v2 tooling. Each vignette measures either token efficiency or retrieval accuracy, using publicly available models.

C.1.1 Design Principles

  1. Reproducible: All models are publicly available on GitHub
  2. Measurable: Each task has deterministic ground truth
  3. Discriminative: A “dump everything” baseline fails where targeted retrieval succeeds
  4. Executable: Implementable within the 3-week timeline (by March 2026)

C.1.2 Source Models

| Repository | Model | Files | Characters | Est. Tokens |
|---|---|---|---|---|
| GfSE/SysML-v2-Models | Eve Online Mining Frigate | 18 | ~98K | ~25K |
| GfSE/SysML-v2-Models | VehicleModel.sysml | 1 | ~27K | ~7K |
| Systems-Modeling/SysML-v2-Pilot-Implementation | Training Examples | ~40+ | ~200K | ~50K |

Token estimates use a 4 chars/token heuristic for code.
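
With that heuristic, character counts convert to rough token budgets mechanically; a minimal sketch (the function name is ours):

```python
# Rough token estimates from character counts, using the appendix's
# 4-chars-per-token heuristic for code-like text.
def estimate_tokens(char_count: int, chars_per_token: int = 4) -> int:
    return char_count // chars_per_token

# e.g. the Mining Frigate model: ~98K characters -> ~24.5K tokens
print(estimate_tokens(98_000))  # 24500
```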


C.2 Vignette 1: Requirement Constraint Extraction

Task: Extract all numeric constraint values from requirements in a multi-file model.

C.2.1 Specification

| Attribute | Value |
|---|---|
| Question | “List all requirements with numeric thresholds and their constraint values” |
| Input Model | Eve Online Mining Frigate (18 files, ~98K chars) |
| Files Required | MiningFrigateRequirementsDef.sysml (4.8K chars) |
| Measurement | Token efficiency + Accuracy |

C.2.2 Ground Truth

MFRQ01: miningRateLS >= 50.0 (m³/min)
MFRQ02: cargoCapacity >= 5000.0 (m³)
MFRQ03: shieldStrengthHS >= 200.0 (DPS), shieldStrengthLS >= 400.0 (DPS)
MFRQ04: droneCapacity >= 5
STRQ05: threatDetectionRange >= 20.0 (AU)
MFRQ06: warpSpeed >= 5.0 (AU/s), alignTime <= 3.0 (s)
MFRQ07: lockedTargets <= 3
MFRQ08: dockingTime <= 60.0 (s)
MFRQ09: fleetSize <= 10
MFRQ10: compressionFactor == 10.0
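
The ground-truth constraints above all follow a flat "attribute operator value" pattern, so they can be regenerated deterministically from the requirement file. A sketch, assuming the simple relational form shown above (the regex and function name are ours; nested or derived constraints would need a real parser):

```python
import re

# Matches the flat "attribute >= 50.0" style constraints used in the
# ground truth above; does not handle nested or computed expressions.
CONSTRAINT_RE = re.compile(r"(\w+)\s*(>=|<=|==)\s*(\d+(?:\.\d+)?)")

def extract_constraints(text: str):
    return [(name, op, float(value))
            for name, op, value in CONSTRAINT_RE.findall(text)]

sample = "require constraint { miningRateLS >= 50.0 }"
print(extract_constraints(sample))  # [('miningRateLS', '>=', 50.0)]
```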

C.2.3 Evaluation

| Metric | Calculation |
|---|---|
| Accuracy | F1 score: 2 × correct / (2 × correct + missed + hallucinated) |
| Token Efficiency | Baseline tokens / MCP tokens |
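
The F1 computation from the three error counts can be made explicit (a small helper, names ours; correct = true positives, missed = false negatives, hallucinated = false positives):

```python
def f1_score(correct: int, missed: int, hallucinated: int) -> float:
    # Precision: fraction of returned constraints that are right.
    precision = correct / (correct + hallucinated) if correct + hallucinated else 0.0
    # Recall: fraction of ground-truth constraints that were found.
    recall = correct / (correct + missed) if correct + missed else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 9 of 10 constraints found, 1 missed, 1 hallucinated:
print(round(f1_score(9, 1, 1), 3))  # 0.9
```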

C.2.4 Baseline vs MCP

| Condition | Input Tokens | Expected Accuracy |
|---|---|---|
| Baseline (all files) | ~25,000 | 60-80% (distractor content) |
| MCP (targeted query) | ~1,500 | 95%+ (requirement file only) |

C.3 Vignette 2: Requirements-to-Verification Traceability

Task: For a specific requirement, identify which verification cases test it.

C.3.1 Specification

| Attribute | Value |
|---|---|
| Question | “Which verification case tests requirement MFRQ03 (SurvivabilityRequirement)?” |
| Input Model | Eve Online Mining Frigate |
| Files Required | MiningFrigateVerificationCases.sysml, MiningFrigateRequirements.sysml |
| Measurement | Accuracy (binary: correct/incorrect) |

C.3.2 Ground Truth

Requirement: MFRQ03 (SurvivabilityRequirement)
Verification Case: SurvivabilityTest
Verification Instance: survivabilityTest
Method: test
Actions: simulateHighSecAttack, simulateLowSecAttack, evaluateData
Verdict Binding: survivabilityRequirementLowSec

C.3.3 Evaluation

| Metric | Calculation |
|---|---|
| Accuracy | 1 if correct case identified with method, 0 otherwise |
| Partial Credit | 0.5 if case name correct but method/actions wrong |

C.3.4 Baseline vs MCP

| Condition | Input Tokens | Challenge |
|---|---|---|
| Baseline | ~25,000 | Must find the verify survivabilityRequirementLowSec reference across files |
| MCP | ~3,500 | Query verification cases, follow the verify reference |

C.4 Vignette 3: Cross-File Interface Compatibility

Task: Determine if two ports are type-compatible for connection.

C.4.1 Specification

| Attribute | Value |
|---|---|
| Question | “Can MiningFrigate.controlPort connect to Domain.PodPort?” |
| Input Model | Eve Online Mining Frigate |
| Files Required | MiningFrigate.sysml, Domain.sysml |
| Measurement | Accuracy + Reasoning correctness |

C.4.2 Ground Truth

MiningFrigate.controlPort : ~Domain::PodPort (conjugate)
Domain.PodPort : (defined in Domain.sysml)

Compatibility: YES
Reason: controlPort is conjugate (~) of PodPort, enabling bidirectional flow
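
The compatibility rule can be sketched as a string-level check on the two port types, assuming they are available as plain qualified names (a deliberate simplification: real SysML v2 conformance also involves specialization and feature directions, which this ignores):

```python
# Two ports can connect when exactly one of them is typed by the
# conjugate (~) of the other's port definition.
def conjugate_compatible(type_a: str, type_b: str) -> bool:
    conjugated = type_a.startswith("~") != type_b.startswith("~")
    return conjugated and type_a.lstrip("~") == type_b.lstrip("~")

print(conjugate_compatible("~Domain::PodPort", "Domain::PodPort"))  # True
```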

C.4.3 Evaluation

| Metric | Calculation |
|---|---|
| Binary Accuracy | 1 if yes/no correct |
| Reasoning Score | 1 if conjugate relationship explained, 0 otherwise |

C.4.4 Baseline vs MCP

| Condition | Input Tokens | Challenge |
|---|---|---|
| Baseline | ~25,000 | Must understand the ~ conjugate syntax across files |
| MCP | ~4,000 | Query port definitions, resolve type references |

C.5 Vignette 4: Element Inventory Count

Task: Count specific element types across a model.

C.5.1 Specification

| Attribute | Value |
|---|---|
| Question | “How many part def declarations exist in the VehicleModel?” |
| Input Model | VehicleModel.sysml (598 lines, 27K chars) |
| Measurement | Exact match accuracy |

C.5.2 Ground Truth

part def count: 32

Including: Vehicle, Engine, Cylinder, Transmission, Driveshaft, 
AxleAssembly, Axle, FrontAxle, HalfAxle, Differential, Wheel,
Software, VehicleSoftware, VehicleController, FuelTank, Road,
VehicleRoadContext, SpatialTemporalReference, Engine4Cyl, 
Engine6Cyl, TransmissionChoices, TransmissionAutomatic,
TransmissionManual, Sunroof, ...

C.5.3 Evaluation

| Metric | Calculation |
|---|---|
| Exact Match | 1 if count == 32, 0 otherwise |
| Tolerance | ±2 for partial credit (0.5) |

C.5.4 Baseline vs MCP

| Condition | Input Tokens | Challenge |
|---|---|---|
| Baseline | ~7,000 | Must parse and count every part def declaration |
| MCP | ~500 | A single count_elements(type="part def") tool call |
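
Because the ground truth is deterministic, V4 is also scriptable for graders; a naive counter is enough to regenerate the expected count (assumes no "part def" occurrences inside comments or strings):

```python
import re

# Count "part def" declarations in the raw text of a .sysml file.
def count_part_defs(text: str) -> int:
    return len(re.findall(r"\bpart\s+def\b", text))

sample = "part def Vehicle { part eng : Engine; }\npart def Engine;"
print(count_part_defs(sample))  # 2
```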

C.6 Vignette 5: Constraint Satisfaction Check

Task: Given attribute values, determine if a constraint is satisfied.

C.6.1 Specification

| Attribute | Value |
|---|---|
| Question | “If miningRate = 45.0, does the Mining Frigate satisfy MFRQ01?” |
| Input Model | Eve Online Mining Frigate |
| Files Required | MiningFrigateRequirementsDef.sysml |
| Measurement | Accuracy + Reasoning |

C.6.2 Ground Truth

Requirement MFRQ01 (OreExtractionEfficiencyRequirement):
  require constraint { miningRateLS >= 50.0 }

Given: miningRate = 45.0
Evaluation: 45.0 >= 50.0 → FALSE

Answer: NO, requirement is NOT satisfied (45.0 < 50.0 required)
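
Once the constraint is retrieved, the check itself is mechanical; a sketch (function and table names are ours):

```python
import operator

# Map the relational operators used in the requirements to callables.
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}

def satisfies(value: float, op: str, threshold: float) -> bool:
    return OPS[op](value, threshold)

print(satisfies(45.0, ">=", 50.0))  # False -> MFRQ01 not satisfied
```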

C.6.3 Evaluation

| Metric | Calculation |
|---|---|
| Binary Accuracy | 1 if yes/no correct |
| Reasoning | 1 if constraint value cited correctly |

C.7 Vignette 6: State Machine Transition Query

Task: Identify what triggers a specific state transition.

C.7.1 Specification

| Attribute | Value |
|---|---|
| Question | “What command triggers the Mining Frigate to transition from InGrid to OnWarp state?” |
| Input Model | Eve Online Mining Frigate |
| Files Required | MiningFrigate.sysml |
| Measurement | Accuracy |

C.7.2 Ground Truth

State Machine: miningFrigatesStates
Transition: inGrid_to_onWarp
  first InGrid
  accept warpCommand : Domain::ShipCommand via miningFrigates.controlPort
  do action executeWarpDrive : ExecuteWarpDrive
  then OnWarp

Trigger: warpCommand (Domain::ShipCommand) via controlPort
Action: executeWarpDrive

C.7.3 Evaluation

| Metric | Calculation |
|---|---|
| Trigger Accuracy | 1 if warpCommand identified |
| Port Accuracy | 1 if controlPort identified |
| Action Accuracy | 1 if executeWarpDrive identified |
| Composite | Average of the three |

C.8 Vignette 7: Stakeholder Concern Traceability

Task: Find which requirements address a specific stakeholder concern.

C.8.1 Specification

| Attribute | Value |
|---|---|
| Question | “Which requirements frame the SecurityConcern?” |
| Input Model | Eve Online Mining Frigate |
| Files Required | MiningFrigateRequirementsDef.sysml, Concerns.sysml |
| Measurement | Precision/Recall |

C.8.2 Ground Truth

Requirements framing SecurityConcern:
1. MFRQ03 (SurvivabilityRequirement) - frame concern SecurityConcern
2. STRQ05 (ThreatDetectionRequirement) - frame concern SecurityConcern

C.8.3 Evaluation

| Metric | Calculation |
|---|---|
| Precision | Correct / Total returned |
| Recall | Correct / Ground truth count (2) |
| F1 | 2 × (P × R) / (P + R) |

C.9 Vignette 8: Import Dependency Resolution

Task: Trace what a specific element depends on through imports.

C.9.1 Specification

| Attribute | Value |
|---|---|
| Question | “What packages must be imported to use MiningFrigate::MiningFrigate?” |
| Input Model | Eve Online Mining Frigate |
| Files Required | Multiple (follow the import chain) |
| Measurement | Completeness |

C.9.2 Ground Truth

Direct imports in MiningFrigate.sysml:
- ScalarValues::*
- ISQ::*
- SI::*
- ParametersOfInterestMetadata::*
- OperationalUseCaseActions::*
- Domain::*

Transitive: Domain imports additional packages...
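
The direct-import ground truth can likewise be scraped deterministically; a sketch that handles only the wildcard "import X::*;" form listed above (function name and regex are ours; transitive resolution would repeat this over each imported package's file):

```python
import re

# Collect the package names from direct wildcard imports in one .sysml file.
IMPORT_RE = re.compile(
    r"^\s*(?:private\s+|public\s+)?import\s+([\w:]+)\s*::\s*\*\s*;",
    re.MULTILINE,
)

def direct_imports(text: str) -> list[str]:
    return IMPORT_RE.findall(text)

sample = "import ScalarValues::*;\nimport Domain::*;"
print(direct_imports(sample))  # ['ScalarValues', 'Domain']
```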

C.9.3 Evaluation

| Metric | Calculation |
|---|---|
| Direct Import Recall | Correct direct imports / 6 |
| Token Efficiency | Baseline tokens / MCP tokens to get the answer |

C.10 Summary: Benchmark Matrix

| ID | Task Type | Primary Metric | Token Ratio Target | Difficulty |
|---|---|---|---|---|
| V1 | Extraction | F1 + Efficiency | 15:1 | Medium |
| V2 | Traceability | Accuracy | 7:1 | Medium |
| V3 | Compatibility | Accuracy + Reason | 6:1 | Hard |
| V4 | Inventory | Exact Match | 14:1 | Easy |
| V5 | Constraint | Accuracy + Reason | 10:1 | Easy |
| V6 | Behavior | Composite | 8:1 | Medium |
| V7 | Traceability | F1 | 10:1 | Medium |
| V8 | Resolution | Completeness | 5:1 | Hard |

C.11 Execution Protocol

C.11.1 Setup

  1. Clone model repositories:

    git clone https://github.com/GfSE/SysML-v2-Models.git
    git clone https://github.com/Systems-Modeling/SysML-v2-Pilot-Implementation.git
  2. Configure MCP server with model paths

  3. Prepare baseline prompts (concatenated file contents)

C.11.2 Per-Vignette Execution

  1. Baseline Run:
    • Concatenate all relevant files into prompt
    • Record input token count
    • Submit question, record response
    • Evaluate against ground truth
  2. MCP Run:
    • Submit question with MCP tools available
    • Record tool calls and their token costs
    • Record total token usage
    • Evaluate against ground truth
  3. Metrics Collection:
    • Accuracy score per evaluation criteria
    • Input tokens (baseline)
    • Input tokens (MCP total)
    • Output tokens (both conditions)
    • Latency (optional)
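
The per-vignette bookkeeping above reduces to a small scoring wrapper; a sketch in which run_baseline and run_mcp stand in for whatever client harness actually submits the prompts, and the dictionary shape is our assumption:

```python
# Collect the headline metrics for one vignette from two condition runs.
def score_vignette(run_baseline, run_mcp):
    base = run_baseline()  # expected shape: {"input_tokens": int, "accuracy": float}
    mcp = run_mcp()
    return {
        "token_ratio": base["input_tokens"] / mcp["input_tokens"],
        "baseline_accuracy": base["accuracy"],
        "mcp_accuracy": mcp["accuracy"],
    }

# e.g. V1's estimates: ~25,000 baseline tokens vs ~1,500 MCP tokens
result = score_vignette(lambda: {"input_tokens": 25_000, "accuracy": 0.70},
                        lambda: {"input_tokens": 1_500, "accuracy": 0.95})
print(round(result["token_ratio"], 1))  # 16.7
```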

C.11.3 Statistical Validity

  • Run each vignette 3 times per condition
  • Report mean and standard deviation
  • Use same model temperature (0.0 for reproducibility)
  • Document model version and date
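
Mean and standard deviation over the three runs come straight from the standard library; the scores below are illustrative, not measured data:

```python
from statistics import mean, stdev

# Sample standard deviation (n-1 denominator) over the 3 runs per condition.
runs = [0.92, 0.95, 0.95]  # illustrative accuracy scores, not real results
print(f"{mean(runs):.3f} +/- {stdev(runs):.3f}")  # 0.940 +/- 0.017
```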

C.12 Future Extensions

C.12.1 Additional Vignettes (Post-March)

  • V9: Multi-model consistency checking
  • V10: Requirements completeness analysis
  • V11: Allocation verification (requirements → design)
  • V12: Change impact analysis

C.12.2 Expanded Model Set

  • OMG SysML v2 training examples (~50K tokens)
  • Larger industrial models (if available under open license)
  • Synthetic models with known properties