Appendix A: GitLab Knowledge Graph Plugin Architecture Proposal

A.1 Executive Summary

This appendix documents exploratory research into extending GitLab Knowledge Graph (GKG) for domain-specific language support, using SysML v2 as a case study. The goal is to identify architectural patterns that could enable GKG to understand repository content beyond traditional programming languages.

Status: Aspirational Future Work

This proposal has not been coordinated with GKG maintainers and represents aspirational future work. The immediate project goal is to build a standalone SysML v2 MCP server; lessons learned from that implementation would inform any future GKG contribution.

Contributing to GKG would require:

  1. Demonstrating value through the standalone implementation
  2. Coordinating with GKG maintainers on priorities
  3. Aligning with GKG’s roadmap and architectural decisions

A.2 Background

A.2.1 What is GitLab Knowledge Graph?

GitLab Knowledge Graph (GKG) is an open-source project that creates structured, queryable representations of code repositories to power AI features and enhance developer productivity. Key characteristics:

Aspect     Description
---------  -------------------------------------------------------------
Language   Rust
Parser     gitlab-code-parser using tree-sitter + ast-grep
Storage    KuzuDB (embedded graph database) + Parquet files
Interface  CLI, HTTP server, MCP protocol
Scope      Code structure: definitions, references, imports, call graphs

Repository: gitlab.com/gitlab-org/rust/knowledge-graph

A.2.2 Current Language Support

GKG supports programming languages via gitlab-code-parser:

Language Definitions Intra-file Refs Cross-file Refs
Ruby
Python Partial
TypeScript/JavaScript Partial
Kotlin
Java
Rust Partial

A.2.3 How Languages Are Added

Languages are added to GKG via compile-time extension of gitlab-code-parser:

  1. Add language to SupportedLanguage enum in parser.rs
  2. Create YAML rule files using ast-grep patterns
  3. Implement post-processing logic for structured extraction
  4. Update RuleManager to load rules for the new language

This model assumes:

  • A tree-sitter grammar exists for the language
  • The language follows programming language patterns (definitions, references, imports)
  • Language support is baked into the binary at compile time
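
To make the third assumption concrete, here is a minimal sketch of what a compile-time language list implies. The variant names and the extension mapping are assumptions made for illustration; this is not a copy of the enum in gitlab-code-parser.

```rust
/// Illustrative stand-in for a compile-time language list.
/// The variants and the extension mapping are assumptions for this sketch,
/// not a copy of gitlab-code-parser.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SupportedLanguage {
    Ruby,
    Python,
    TypeScript,
    Kotlin,
    Java,
    Rust,
}

impl SupportedLanguage {
    /// Map a file extension to a language known at compile time.
    pub fn from_extension(ext: &str) -> Option<Self> {
        match ext {
            "rb" => Some(Self::Ruby),
            "py" => Some(Self::Python),
            "ts" | "tsx" | "js" | "jsx" => Some(Self::TypeScript),
            "kt" => Some(Self::Kotlin),
            "java" => Some(Self::Java),
            "rs" => Some(Self::Rust),
            // Anything else is invisible to the indexer until the enum is
            // extended and the binary is rebuilt.
            _ => None,
        }
    }
}
```

Under this model a .sysml file falls through to the final arm, and the only way to change that is a new release of the binary.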

A.3 The Domain-Specific Language Challenge

A.3.1 Programming Languages vs. Modeling Languages

GKG’s current architecture is optimized for programming languages. Domain-specific languages (DSLs) like SysML v2 present different characteristics:

Aspect               Programming Languages        SysML v2
-------------------  ---------------------------  ---------------------------------------
Primary artifacts    Functions, classes, modules  Parts, requirements, actions
Relationships        Calls, imports, inheritance  Specialization, allocation, composition
Semantics            Execution semantics          Model semantics (KerML FOL)
Files                .py, .rs, .ts                .sysml
Tree-sitter grammar  Widely available             Does not exist

A.3.2 Why SysML v2 Doesn’t Fit the Current Model

  1. No tree-sitter grammar: GKG relies on tree-sitter for AST generation. No tree-sitter grammar exists for SysML v2.

  2. Different relationship types: GKG’s graph schema centers on code relationships (calls, imports). SysML v2 has different relationship types (allocation, satisfaction, derivation).

  3. Compile-time language list: Languages must be added at compile time. Organizations can’t add domain-specific support without forking GKG.

A.3.3 Other Affected Domains

SysML v2 is not unique. Other domain-specific content faces similar challenges:

Domain      File Types    Value to Repository Understanding
----------  ------------  ---------------------------------------------------
SysML v2    .sysml        Systems architecture, requirements traceability
Terraform   .tf           Infrastructure dependencies, resource relationships
OpenAPI     .yaml, .json  API structure, endpoint relationships
Protobuf    .proto        Service definitions, message relationships
GraphQL     .graphql      Schema structure, type relationships
Kubernetes  .yaml         Deployment topology, resource dependencies
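
The table lists .yaml for both OpenAPI and Kubernetes, so dispatching on file extension alone would be ambiguous for some domains. A minimal sketch of a registry that exposes the ambiguity rather than hiding it follows; ProviderRegistry and its methods are hypothetical names, not part of GKG.

```rust
use std::collections::HashMap;

/// Hypothetical registry from file extension to the providers that claim it.
pub struct ProviderRegistry {
    by_extension: HashMap<String, Vec<String>>,
}

impl ProviderRegistry {
    pub fn new() -> Self {
        Self { by_extension: HashMap::new() }
    }

    /// Record that `provider` handles files ending in `ext`.
    pub fn register(&mut self, provider: &str, ext: &str) {
        self.by_extension
            .entry(ext.to_string())
            .or_default()
            .push(provider.to_string());
    }

    /// All providers that claim this extension, in registration order.
    /// More than one entry means the host needs a second signal, such as
    /// inspecting the file content, before choosing a provider.
    pub fn candidates(&self, ext: &str) -> &[String] {
        self.by_extension.get(ext).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

Registering both an OpenAPI and a Kubernetes provider for "yaml" yields two candidates, while "sysml" yields exactly one.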

A.4 GKG’s Existing Extensibility Work

GKG maintainers have already considered extensibility. Key initiatives:

A.4.1 Graph Extractor Language (GEL)

Issue #227 introduced GEL, a DSL for custom extraction rules:

“Custom Extraction with Graph Extractor Language (GEL): We will introduce and document GEL, a custom DSL that allows developers to define their own rules for extracting framework-specific nodes and relationships from the AST.”

Example use case: Extracting Next.js API routes from TypeScript files.

Limitation: GEL operates on top of already-parsed ASTs. It cannot handle languages without tree-sitter grammars.

A.4.2 Contributions Pipeline

Issue #139 outlines a vision for extensibility:

“Engineer an Extensible Framework: The core of the strategy is to provide clear, powerful extension points that allow developers to add significant value without needing to modify the indexer’s core logic.”

Key pillars identified:

  1. Adding new languages via gitlab-code-parser patterns
  2. Custom extraction via GEL
  3. Comprehensive documentation

A.4.3 SCIP Integration

Issue #270 explores SCIP (SCIP Code Intelligence Protocol) for broader language support:

“We are currently developing custom code parsers in-house, which requires significant maintenance effort and limits our language coverage. We should investigate integrating with SCIP…”

This indicates maintainer interest in reducing the parser maintenance burden.

A.5 Proposed Plugin Architecture

To support domain-specific languages like SysML v2, GKG could introduce a plugin architecture for context providers.

A.5.1 Design Goals

  1. Runtime extensibility: Add language support without recompiling GKG
  2. Isolation: Plugins cannot crash the core indexer
  3. Schema flexibility: Plugins can introduce custom node/relationship types
  4. Discoverability: Users can find and install plugins easily

A.5.2 Proposed Interface

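use std::collections::HashMap;

// Assumption: `Value` is a JSON value type such as `serde_json::Value`.
// `Location` and `PluginError` are left undefined by this proposal.
use serde_json::Value;

/// A plugin that provides domain-specific context for a file type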
pub trait ContextProvider: Send + Sync {
    /// Plugin metadata
    fn metadata(&self) -> PluginMetadata;
    
    /// File extensions this plugin handles
    fn supported_extensions(&self) -> &[&str];
    
    /// Parse a file and extract definitions
    fn extract_definitions(
        &self, 
        content: &str, 
        path: &str
    ) -> Result<Vec<Definition>, PluginError>;
    
    /// Extract relationships between definitions
    fn extract_relationships(
        &self,
        content: &str,
        path: &str,
        definitions: &[Definition],
    ) -> Result<Vec<Relationship>, PluginError>;
}

pub struct PluginMetadata {
    pub name: String,
    pub version: String,
    pub description: String,
    pub author: String,
}

pub struct Definition {
    pub id: String,
    pub name: String,
    pub definition_type: String,  // e.g., "PartDefinition", "Requirement"
    pub fqn: String,
    pub location: Location,
    pub properties: HashMap<String, Value>,
}

pub struct Relationship {
    pub source_id: String,
    pub target_id: String,
    pub relationship_type: String,  // e.g., "specializes", "allocates"
    pub properties: HashMap<String, Value>,
}
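
To show how a provider might satisfy this interface, the sketch below implements a toy extractor that recognizes only `part def Name` lines. It uses trimmed, std-only copies of the proposed types so that it compiles on its own: Location is reduced to a line number, PluginError to a message, property values to plain strings, and extract_relationships is omitted. ToySysmlProvider is a hypothetical name, and line scanning is a stand-in for real parsing.

```rust
use std::collections::HashMap;

// Trimmed, std-only copies of the proposed types so the sketch compiles alone.
// `Location` (a line number here) and `PluginError` are placeholders, and
// property values are plain strings instead of a JSON value type.
#[derive(Debug, Clone, PartialEq)]
pub struct Location { pub line: usize }

#[derive(Debug)]
pub struct PluginError(pub String);

#[derive(Debug, Clone, PartialEq)]
pub struct Definition {
    pub id: String,
    pub name: String,
    pub definition_type: String,
    pub fqn: String,
    pub location: Location,
    pub properties: HashMap<String, String>,
}

pub trait ContextProvider: Send + Sync {
    fn supported_extensions(&self) -> &[&str];
    fn extract_definitions(&self, content: &str, path: &str)
        -> Result<Vec<Definition>, PluginError>;
}

/// Toy provider: recognizes only `part def Name` at the start of a line.
pub struct ToySysmlProvider;

impl ContextProvider for ToySysmlProvider {
    fn supported_extensions(&self) -> &[&str] {
        &["sysml"]
    }

    fn extract_definitions(&self, content: &str, path: &str)
        -> Result<Vec<Definition>, PluginError>
    {
        let mut definitions = Vec::new();
        for (index, line) in content.lines().enumerate() {
            let Some(rest) = line.trim_start().strip_prefix("part def ") else {
                continue;
            };
            // The name ends at the first character that cannot be part of an identifier.
            let name: String = rest
                .chars()
                .take_while(|c| c.is_alphanumeric() || *c == '_')
                .collect();
            if name.is_empty() {
                return Err(PluginError(format!("{path}:{}: missing part name", index + 1)));
            }
            definitions.push(Definition {
                id: format!("{path}#{name}"),
                name: name.clone(),
                definition_type: "PartDefinition".to_string(),
                // A real provider would qualify the name with its enclosing packages.
                fqn: name,
                location: Location { line: index + 1 },
                properties: HashMap::new(),
            });
        }
        Ok(definitions)
    }
}
```

The host only ever sees the trait, so replacing the line scanner with a real parser would not change the calling code.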

A.5.3 Plugin Discovery Mechanisms

Several approaches could enable runtime plugin loading:

Mechanism          Pros                                Cons
-----------------  ----------------------------------  ------------------------------------
WASM               Sandboxed, portable                 Performance overhead, limited I/O
Dynamic libraries  Native performance                  Platform-specific, security concerns
Subprocess         Language-agnostic, isolated         IPC overhead, process management
gRPC service       Network-capable, language-agnostic  Deployment complexity

Recommendation: Start with subprocess-based plugins (a simple JSON protocol over stdin/stdout), then evolve to WASM for sandboxing.
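
The plugin side of such a protocol could be as small as the loop sketched below. The message shape is an assumption made for illustration: each request is one line holding a file path, and each response is one line of JSON. To stay std-only the JSON is written by hand; a real plugin would more likely use a JSON library and carry structured requests.

```rust
use std::io::{self, BufRead, Write};

/// Escape a string for use inside a JSON string literal.
fn json_escape(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for c in raw.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Plugin-side loop: one request per input line, one JSON response per output line.
/// In this sketch a request is just a file path and the response only reports
/// whether the plugin handles it; a real plugin would return definitions.
pub fn serve<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<()> {
    for line in input.lines() {
        let path = line?;
        if path.trim().is_empty() {
            continue;
        }
        let supported = path.ends_with(".sysml");
        writeln!(
            output,
            "{{\"path\":\"{}\",\"supported\":{}}}",
            json_escape(&path),
            supported
        )?;
        // Flush per response so the host is never left waiting on a buffered reply.
        output.flush()?;
    }
    Ok(())
}
```

A plugin binary would call serve(io::stdin().lock(), io::stdout().lock()). Keeping the loop generic over BufRead and Write lets the same code be exercised with in-memory buffers.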

A.5.4 Schema Evolution

Plugins introducing custom node/relationship types need schema management:

# sysml-plugin/schema.yaml
nodes:
  - name: PartDefinition
    extends: Definition
    properties:
      - name: isAbstract
        type: boolean
  - name: RequirementDefinition
    extends: Definition
    properties:
      - name: text
        type: string

relationships:
  - name: specializes
    from: [PartDefinition, RequirementDefinition]
    to: [PartDefinition, RequirementDefinition]
  - name: satisfies
    from: [PartDefinition]
    to: [RequirementDefinition]
  - name: allocates
    from: [PartDefinition]
    to: [PartDefinition]

GKG’s schema manager would:

  1. Load plugin schemas at startup
  2. Create corresponding KuzuDB tables
  3. Validate plugin output against declared schema
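
Step 3 could look roughly like the sketch below. It hard-codes the three relationship rules from the example schema instead of loading YAML, and RelationshipRule and validate_edge are hypothetical names rather than GKG APIs.

```rust
/// One relationship rule from a plugin schema, mirroring the YAML fields
/// `name`, `from`, and `to`. The struct itself is a hypothetical in-memory form.
pub struct RelationshipRule {
    pub name: &'static str,
    pub from: &'static [&'static str],
    pub to: &'static [&'static str],
}

/// The three rules declared in the example sysml-plugin/schema.yaml.
pub const SYSML_RULES: &[RelationshipRule] = &[
    RelationshipRule {
        name: "specializes",
        from: &["PartDefinition", "RequirementDefinition"],
        to: &["PartDefinition", "RequirementDefinition"],
    },
    RelationshipRule {
        name: "satisfies",
        from: &["PartDefinition"],
        to: &["RequirementDefinition"],
    },
    RelationshipRule {
        name: "allocates",
        from: &["PartDefinition"],
        to: &["PartDefinition"],
    },
];

/// Check one edge emitted by a plugin against the declared rules.
pub fn validate_edge(
    rules: &[RelationshipRule],
    relationship_type: &str,
    source_type: &str,
    target_type: &str,
) -> Result<(), String> {
    let rule = rules
        .iter()
        .find(|rule| rule.name == relationship_type)
        .ok_or_else(|| format!("undeclared relationship type: {relationship_type}"))?;
    if !rule.from.contains(&source_type) {
        return Err(format!("{source_type} cannot be the source of {relationship_type}"));
    }
    if !rule.to.contains(&target_type) {
        return Err(format!("{target_type} cannot be the target of {relationship_type}"));
    }
    Ok(())
}
```

Rejecting undeclared edge types at this boundary keeps a misbehaving plugin from writing relationships that no table was created for.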

A.6 SysML v2 as Reference Plugin

A.6.1 What SysML Context Would Provide

A SysML v2 plugin would enable GKG to understand:

Graph Node                Description                   Query Example
------------------------  ----------------------------  -------------------------------------------
PartDefinition            System/component definitions  “What parts does Vehicle contain?”
RequirementDefinition     Requirements                  “What requirements trace to Engine?”
ActionDefinition          Behaviors/functions           “What actions does StartEngine perform?”
AllocationRelationship    Function-to-structure         “What functions are allocated to ECU?”
SatisfactionRelationship  Requirement satisfaction      “What requirements are satisfied by tests?”

A.6.2 Integration with Standalone MCP Server

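The standalone SysML v2 MCP server (this project) would share parsing logic with a potential GKG plugin via a common tree-sitter grammar. That grammar (tree-sitter-sysml below) does not exist yet; it is the subject of Phase B in A.8.2: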

┌─────────────────────────────────────────────────────────┐
│                  tree-sitter-sysml                       │
│         (grammar, Rust/C/WASM bindings)                  │
└─────────────────────────────────────────────────────────┘
         │                              │
         ▼                              ▼
┌─────────────────┐      ┌─────────────────────────────┐
│   MCP Server    │      │      GKG Plugin (Future)    │
│  (standalone)   │      │   (ContextProvider impl)    │
└─────────────────┘      └─────────────────────────────┘

This allows:

  1. Proving grammar correctness via MCP server
  2. Reusing grammar in GKG plugin without duplication
  3. Contributing grammar to GitLab vendor/grammars/ for syntax highlighting
  4. Independent evolution of MCP tools vs. GKG integration

A.6.3 Example Queries Enabled

With SysML v2 support, GKG could answer:

// Find all parts that satisfy safety requirements
MATCH (p:PartDefinition)-[:satisfies]->(r:RequirementDefinition)
WHERE r.name CONTAINS 'Safety'
RETURN p.name, r.name

// Trace function allocation to physical structure
// (assumes the plugin schema also declares ActionDefinition as a source of allocates)
MATCH (a:ActionDefinition)-[:allocates]->(p:PartDefinition)
RETURN a.name AS function, p.name AS component

// Find requirements that no part satisfies
MATCH (r:RequirementDefinition)
WHERE NOT (r)<-[:satisfies]-()
RETURN r.name AS unsatisfied_requirement
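
The last query is an anti-join: it keeps every requirement that has no incoming satisfies edge. The same logic can be sketched over in-memory plugin output, which may be useful for checking a plugin before its results reach the graph. The tuple layout and the function name are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Minimal edge record: (source_id, relationship_type, target_id).
pub type Edge<'a> = (&'a str, &'a str, &'a str);

/// Return the requirement ids that are not the target of any `satisfies` edge,
/// preserving input order. This mirrors the anti-join in the last query.
pub fn unsatisfied<'a>(requirement_ids: &[&'a str], edges: &[Edge<'_>]) -> Vec<&'a str> {
    // Collect every id that some part claims to satisfy.
    let satisfied: HashSet<&str> = edges
        .iter()
        .filter(|(_, kind, _)| *kind == "satisfies")
        .map(|(_, _, target)| *target)
        .collect();
    requirement_ids
        .iter()
        .copied()
        .filter(|id| !satisfied.contains(id))
        .collect()
}
```

Edges of other types, such as specializes, are ignored, so a requirement that is only specialized still counts as unsatisfied.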

A.7 Implementation Considerations

A.7.1 WASM vs Native vs Subprocess

Approach             Performance  Security          Portability  Complexity
-------------------  -----------  ----------------  -----------  ----------
WASM                 Medium       High (sandboxed)  High         Medium
Native (.so/.dylib)  High         Low               Low          High
Subprocess           Low-Medium   Medium            High         Low

Recommendation for MVP: Subprocess with JSON-over-stdio. Simple to implement, language-agnostic, and isolates plugin crashes.

A.7.2 Performance Implications

Plugin-based parsing adds overhead. Approximate, order-of-magnitude figures:

Operation         In-process  Subprocess
----------------  ----------  -----------
Parse 1 KB file   ~1 ms       ~10-50 ms
Parse 1 MB file   ~100 ms     ~200-500 ms
Batch 1000 files  ~1 s        ~10-30 s
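
One way to read the table is as a fixed cost per plugin invocation plus a per-file parse cost. The sketch below encodes that simple model. The model itself and the sample numbers (10 ms overhead and 1 ms parse time for a small file) are assumptions taken from the low end of the ranges above, not measurements.

```rust
/// Estimated wall time in milliseconds for parsing `files` files when they are
/// sent to the plugin in batches of `batch_size`. Each batch pays the fixed
/// overhead once; every file pays its own parse time.
pub fn estimated_ms(files: u64, batch_size: u64, overhead_ms: u64, parse_ms: u64) -> u64 {
    assert!(batch_size > 0, "batch_size must be at least 1");
    // Ceiling division: a partially filled final batch still costs one invocation.
    let batches = (files + batch_size - 1) / batch_size;
    batches * overhead_ms + files * parse_ms
}
```

With these inputs, 1000 files sent one at a time cost 1000 × 10 + 1000 × 1 = 11,000 ms, while a single batch of 1000 costs 10 + 1000 = 1,010 ms. Under this model the fixed overhead, not the parsing, accounts for most of the subprocess column.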

Mitigations:

  1. Batch file processing to amortize IPC cost
  2. Plugin-side caching of parsed ASTs
  3. Incremental re-indexing (only changed files)
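
The third mitigation can be sketched with content fingerprints: the host remembers a hash per path and forwards only files whose hash changed. IndexState and take_changed are hypothetical names, and keeping the state only in memory is a simplification.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Fingerprint file content. DefaultHasher is adequate for change detection within
/// one build of the indexer, but its algorithm is not guaranteed to stay the same
/// across Rust releases, so a persisted index would want a fixed algorithm instead.
fn fingerprint(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical index state: path -> fingerprint of the content last sent to the plugin.
#[derive(Default)]
pub struct IndexState {
    seen: HashMap<String, u64>,
}

impl IndexState {
    /// Return the paths whose content is new or changed, and remember their fingerprints.
    pub fn take_changed<'a>(&mut self, files: &[(&'a str, &str)]) -> Vec<&'a str> {
        let mut changed = Vec::new();
        for (path, content) in files {
            let current = fingerprint(content);
            if self.seen.get(*path) != Some(&current) {
                self.seen.insert(path.to_string(), current);
                changed.push(*path);
            }
        }
        changed
    }
}
```

Only the returned paths need to cross the process boundary, so an unchanged repository costs no plugin invocations at all.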

A.7.3 Security Considerations

Plugins execute arbitrary code. Security measures:

  1. Sandboxing: Prefer WASM for production plugins
  2. Capabilities: Plugins declare required permissions (file read, network, etc.)
  3. Signed plugins: Require cryptographic signatures for trusted plugins
  4. Review process: Plugin registry with review before listing
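
The capability measure could be enforced with a check like the one sketched below, run before a plugin is loaded. The Capability variants and the denied function are illustrative assumptions; the proposal does not define a permission model.

```rust
use std::collections::HashSet;

/// Permissions a plugin may request. The variants are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Capability {
    FileRead,
    FileWrite,
    Network,
}

/// Return the requested capabilities that host policy does not grant, in request
/// order. An empty result means the plugin may be loaded.
pub fn denied(requested: &[Capability], granted: &HashSet<Capability>) -> Vec<Capability> {
    requested
        .iter()
        .copied()
        .filter(|capability| !granted.contains(capability))
        .collect()
}
```

Returning the full list of denied capabilities, rather than a boolean, lets the host tell the user exactly which permission blocked the plugin.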

A.8 Contribution Path

A.8.1 Alignment with GKG Roadmap

This proposal aligns with GKG maintainers’ expressed interests:

GKG Initiative                       Alignment
-----------------------------------  --------------------------------------------------
Issue #139 (Contributions Pipeline)  Plugin architecture enables external contributions
Issue #227 (GEL)                     Plugins extend beyond GEL’s AST-based scope
Issue #270 (SCIP)                    Plugin approach is complementary to SCIP

A.8.2 Phased Approach

  1. Phase A: Standalone MCP Server (this project)
    • Prove SysML v2 parser correctness
    • Establish parsing patterns and AST design
    • Demonstrate value to MBSE practitioners
  2. Phase B: Tree-sitter Grammar (future)
    • Contribute tree-sitter grammar for SysML v2
    • Enables native GKG support without plugins
    • Requires significant specification analysis
  3. Phase C: Plugin Architecture Proposal (future)
    • Present findings to GKG maintainers
    • Collaborate on plugin interface design
    • Implement SysML v2 as reference plugin

A.8.3 Prerequisites

Before proposing to GKG team:

A.9 Conclusion

Extending GitLab Knowledge Graph for domain-specific languages like SysML v2 is technically feasible through a plugin architecture. The standalone SysML v2 MCP server serves as a proving ground for:

  1. Parser design and correctness
  2. Graph schema for MBSE concepts
  3. Value proposition for AI-assisted systems engineering

Lessons learned will inform future contributions to GKG, contingent on maintainer interest and project priorities.

A.10 References