Appendix A: GitLab Knowledge Graph Plugin Architecture Proposal
A.1 Executive Summary
This appendix documents exploratory research into extending GitLab Knowledge Graph (GKG) for domain-specific language support, using SysML v2 as a case study. The goal is to identify architectural patterns that could enable GKG to understand repository content beyond traditional programming languages.
This proposal has not been coordinated with GKG maintainers and represents aspirational future work. The immediate project goal is to build a standalone SysML v2 MCP server; lessons learned from that implementation would inform any future GKG contribution.
Contributing to GKG would require:
- Demonstrating value through the standalone implementation
- Coordinating with GKG maintainers on priorities
- Aligning with GKG’s roadmap and architectural decisions
A.2 Background
A.2.1 What is GitLab Knowledge Graph?
GitLab Knowledge Graph (GKG) is an open-source project that creates structured, queryable representations of code repositories to power AI features and enhance developer productivity. Key characteristics:
| Aspect | Description |
|---|---|
| Language | Rust |
| Parser | gitlab-code-parser using tree-sitter + ast-grep |
| Storage | KuzuDB (embedded graph database) + Parquet files |
| Interface | CLI, HTTP server, MCP (Model Context Protocol) |
| Scope | Code structure: definitions, references, imports, call graphs |
Repository: gitlab.com/gitlab-org/rust/knowledge-graph
A.2.2 Current Language Support
GKG supports programming languages via gitlab-code-parser:
| Language | Definitions | Intra-file Refs | Cross-file Refs |
|---|---|---|---|
| Ruby | ✓ | ✓ | ✓ |
| Python | ✓ | ✓ | Partial |
| TypeScript/JavaScript | ✓ | ✓ | Partial |
| Kotlin | ✓ | ✓ | ✓ |
| Java | ✓ | ✓ | ✓ |
| Rust | ✓ | ✓ | Partial |
A.2.3 How Languages Are Added
Languages are added to GKG via compile-time extension of gitlab-code-parser:
- Add the language to the SupportedLanguage enum in parser.rs
- Create YAML rule files using ast-grep patterns
- Implement post-processing logic for structured extraction
- Update RuleManager to load rules for the new language
This model assumes:
- A tree-sitter grammar exists for the language
- The language follows programming language patterns (definitions, references, imports)
- Language support is baked into the binary at compile time
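A minimal sketch of what this compile-time model implies. Only the enum name SupportedLanguage comes from the steps above; the variants, the `from_extension` helper, and the extension mapping are hypothetical and are not taken from the gitlab-code-parser source.

```rust
/// Hypothetical mirror of a compile-time language list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SupportedLanguage {
    Ruby,
    Python,
    TypeScript,
    Kotlin,
    Java,
    Rust,
}

impl SupportedLanguage {
    /// Map a file extension (without the dot) to a compiled-in language.
    /// The mapping below is an assumption made for illustration.
    pub fn from_extension(ext: &str) -> Option<Self> {
        match ext {
            "rb" => Some(Self::Ruby),
            "py" => Some(Self::Python),
            "ts" | "tsx" | "js" | "jsx" => Some(Self::TypeScript),
            "kt" | "kts" => Some(Self::Kotlin),
            "java" => Some(Self::Java),
            "rs" => Some(Self::Rust),
            // Anything else, including "sysml", requires a code change
            // and a rebuild of the binary.
            _ => None,
        }
    }
}
```

With a closed enum like this, a `.sysml` file has no matching variant, so an organization would have to change the source and rebuild to add one.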
A.3 The Domain-Specific Language Challenge
A.3.1 Programming Languages vs. Modeling Languages
GKG’s current architecture is optimized for programming languages. Domain-specific languages (DSLs) like SysML v2 present different characteristics:
| Aspect | Programming Languages | SysML v2 |
|---|---|---|
| Primary artifacts | Functions, classes, modules | Parts, requirements, actions |
| Relationships | Calls, imports, inheritance | Specialization, allocation, composition |
| Semantics | Execution semantics | Model semantics (KerML FOL) |
| Files | .py, .rs, .ts | .sysml |
| Tree-sitter grammar | Widely available | Does not exist |
A.3.2 Why SysML v2 Doesn’t Fit the Current Model
No tree-sitter grammar: GKG relies on tree-sitter for AST generation. No tree-sitter grammar exists for SysML v2.
Different relationship types: GKG’s graph schema centers on code relationships (calls, imports). SysML v2 has different relationship types (allocation, satisfaction, derivation).
Compile-time language list: Languages must be added at compile time. Organizations can’t add domain-specific support without forking GKG.
A.3.3 Other Affected Domains
SysML v2 is not unique. Other domain-specific content faces similar challenges:
| Domain | File Types | Value to Repository Understanding |
|---|---|---|
| SysML v2 | .sysml | Systems architecture, requirements traceability |
| Terraform | .tf | Infrastructure dependencies, resource relationships |
| OpenAPI | .yaml, .json | API structure, endpoint relationships |
| Protobuf | .proto | Service definitions, message relationships |
| GraphQL | .graphql | Schema structure, type relationships |
| Kubernetes | .yaml | Deployment topology, resource dependencies |
A.4 GKG’s Existing Extensibility Work
GKG maintainers have already considered extensibility. Key initiatives:
A.4.1 Graph Extractor Language (GEL)
Issue #227 introduced GEL, a DSL for custom extraction rules:
“Custom Extraction with Graph Extractor Language (GEL): We will introduce and document GEL, a custom DSL that allows developers to define their own rules for extracting framework-specific nodes and relationships from the AST.”
Example use case: Extracting Next.js API routes from TypeScript files.
Limitation: GEL operates on top of already-parsed ASTs. It cannot handle languages without tree-sitter grammars.
A.4.2 Contributions Pipeline
Issue #139 outlines a vision for extensibility:
“Engineer an Extensible Framework: The core of the strategy is to provide clear, powerful extension points that allow developers to add significant value without needing to modify the indexer’s core logic.”
Key pillars identified:
- Adding new languages via gitlab-code-parser patterns
- Custom extraction via GEL
- Comprehensive documentation
A.4.3 SCIP Integration
Issue #270 explores SCIP (SCIP Code Intelligence Protocol) for broader language support:
“We are currently developing custom code parsers in-house, which requires significant maintenance effort and limits our language coverage. We should investigate integrating with SCIP…”
This indicates interest in reducing parser maintenance burden.
A.5 Proposed Plugin Architecture
To support domain-specific languages like SysML v2, GKG could introduce a plugin architecture for context providers.
A.5.1 Design Goals
- Runtime extensibility: Add language support without recompiling GKG
- Isolation: Plugins cannot crash the core indexer
- Schema flexibility: Plugins can introduce custom node/relationship types
- Discoverability: Users can find and install plugins easily
A.5.2 Proposed Interface
    use std::collections::HashMap;

    // `Value` (for example serde_json::Value), `Location`, and `PluginError`
    // are assumed to be defined elsewhere; this proposal does not specify them.

    /// A plugin that provides domain-specific context for a file type
    pub trait ContextProvider: Send + Sync {
        /// Plugin metadata
        fn metadata(&self) -> PluginMetadata;

        /// File extensions this plugin handles
        fn supported_extensions(&self) -> &[&str];

        /// Parse a file and extract definitions
        fn extract_definitions(
            &self,
            content: &str,
            path: &str,
        ) -> Result<Vec<Definition>, PluginError>;

        /// Extract relationships between definitions
        fn extract_relationships(
            &self,
            content: &str,
            path: &str,
            definitions: &[Definition],
        ) -> Result<Vec<Relationship>, PluginError>;
    }

    pub struct PluginMetadata {
        pub name: String,
        pub version: String,
        pub description: String,
        pub author: String,
    }

    pub struct Definition {
        pub id: String,
        pub name: String,
        pub definition_type: String, // e.g., "PartDefinition", "Requirement"
        pub fqn: String,             // fully qualified name
        pub location: Location,
        pub properties: HashMap<String, Value>,
    }

    pub struct Relationship {
        pub source_id: String,
        pub target_id: String,
        pub relationship_type: String, // e.g., "specializes", "allocates"
        pub properties: HashMap<String, Value>,
    }

A.5.3 Plugin Discovery Mechanisms
Several approaches could enable runtime plugin loading:
| Mechanism | Pros | Cons |
|---|---|---|
| WASM | Sandboxed, portable | Performance overhead, limited I/O |
| Dynamic libraries | Native performance | Platform-specific, security concerns |
| Subprocess | Language-agnostic, isolated | IPC overhead, process management |
| gRPC service | Network-capable, language-agnostic | Deployment complexity |
Recommendation: Start with subprocess-based plugins (simple JSON protocol over stdin/stdout), evolve to WASM for sandboxing.
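The subprocess option could look roughly like the sketch below, which uses only the Rust standard library. The message shape (one JSON object per line with `method`, `path`, and `content` fields) and the helper names `json_escape`, `encode_request`, and `call_plugin` are assumptions for illustration; no such protocol has been defined by GKG. A real implementation would more likely use a JSON library than hand-written escaping.

```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

/// Escape a string so it can sit inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build one newline-terminated request (hypothetical message shape).
pub fn encode_request(method: &str, path: &str, content: &str) -> String {
    format!(
        "{{\"method\":\"{}\",\"path\":\"{}\",\"content\":\"{}\"}}\n",
        json_escape(method),
        json_escape(path),
        json_escape(content)
    )
}

/// Spawn a plugin executable, send one request, and return its one-line reply.
/// A crash in the plugin surfaces as an I/O error or empty reply in the host.
pub fn call_plugin(program: &str, request: &str) -> std::io::Result<String> {
    let mut child = Command::new(program)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    // The temporary stdin handle is dropped at the end of this statement,
    // which closes the pipe and signals end of input to the plugin.
    child
        .stdin
        .take()
        .expect("stdin was piped")
        .write_all(request.as_bytes())?;
    let mut reply = String::new();
    BufReader::new(child.stdout.take().expect("stdout was piped")).read_line(&mut reply)?;
    child.wait()?;
    Ok(reply)
}
```

Spawning one process per request, as here, is the simplest form; a long-lived plugin process that reads many request lines would amortize startup cost.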
A.5.4 Schema Evolution
Plugins introducing custom node/relationship types need schema management:
    # sysml-plugin/schema.yaml
    nodes:
      - name: PartDefinition
        extends: Definition
        properties:
          - name: isAbstract
            type: boolean
      - name: RequirementDefinition
        extends: Definition
        properties:
          - name: text
            type: string
    relationships:
      - name: specializes
        from: [PartDefinition, RequirementDefinition]
        to: [PartDefinition, RequirementDefinition]
      - name: satisfies
        from: [PartDefinition]
        to: [RequirementDefinition]
      - name: allocates
        from: [PartDefinition]
        to: [PartDefinition]

GKG’s schema manager would:
- Load plugin schemas at startup
- Create corresponding KuzuDB tables
- Validate plugin output against declared schema
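The validation step could be sketched as follows. The types `RelationshipRule` and `PluginSchema`, the method `validate_edge`, and the hand-built `sysml_schema` function are hypothetical; the schema content mirrors the YAML example but is constructed in code rather than parsed, and table creation in KuzuDB is not shown.

```rust
use std::collections::HashMap;

/// Allowed endpoints for one relationship type (hypothetical in-memory form).
pub struct RelationshipRule {
    pub from: Vec<String>,
    pub to: Vec<String>,
}

/// Node and relationship types declared by a single plugin.
pub struct PluginSchema {
    pub nodes: Vec<String>,
    pub relationships: HashMap<String, RelationshipRule>,
}

impl PluginSchema {
    /// Check one emitted edge against the declared schema.
    pub fn validate_edge(&self, rel: &str, source: &str, target: &str) -> Result<(), String> {
        for node_type in [source, target] {
            if !self.nodes.iter().any(|n| n == node_type) {
                return Err(format!("undeclared node type: {}", node_type));
            }
        }
        let rule = self
            .relationships
            .get(rel)
            .ok_or_else(|| format!("undeclared relationship type: {}", rel))?;
        if !rule.from.iter().any(|n| n == source) {
            return Err(format!("{} may not start at {}", rel, source));
        }
        if !rule.to.iter().any(|n| n == target) {
            return Err(format!("{} may not end at {}", rel, target));
        }
        Ok(())
    }
}

fn names(items: &[&str]) -> Vec<String> {
    items.iter().map(|s| s.to_string()).collect()
}

/// The example schema, written out by hand instead of loaded from YAML.
pub fn sysml_schema() -> PluginSchema {
    let part = "PartDefinition";
    let req = "RequirementDefinition";
    let mut relationships = HashMap::new();
    relationships.insert(
        "specializes".to_string(),
        RelationshipRule { from: names(&[part, req]), to: names(&[part, req]) },
    );
    relationships.insert(
        "satisfies".to_string(),
        RelationshipRule { from: names(&[part]), to: names(&[req]) },
    );
    relationships.insert(
        "allocates".to_string(),
        RelationshipRule { from: names(&[part]), to: names(&[part]) },
    );
    PluginSchema { nodes: names(&[part, req]), relationships }
}
```

Rejecting edges at this boundary keeps a misbehaving plugin from writing node or relationship types that the graph database has no table for.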
A.6 SysML v2 as Reference Plugin
A.6.1 What SysML Context Would Provide
A SysML v2 plugin would enable GKG to understand:
| Graph Node | Description | Query Example |
|---|---|---|
| PartDefinition | System/component definitions | “What parts does Vehicle contain?” |
| RequirementDefinition | Requirements | “What requirements trace to Engine?” |
| ActionDefinition | Behaviors/functions | “What actions does StartEngine perform?” |
| AllocationRelationship | Function-to-structure | “What functions are allocated to ECU?” |
| SatisfactionRelationship | Requirement satisfaction | “What requirements are satisfied by tests?” |
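To make the node types above concrete, here is a toy provider that emits them. It uses a trimmed copy of the A.5.2 interface (metadata, properties, and relationship extraction are omitted), and `ToySysmlProvider` is a naive line scanner invented for this sketch, not a SysML v2 parser: it ignores nesting, comments, and most of the language.

```rust
#[derive(Debug, PartialEq)]
pub struct Location {
    pub line: usize, // 1-based line number
}

#[derive(Debug, PartialEq)]
pub struct Definition {
    pub name: String,
    pub definition_type: String,
    pub fqn: String,
    pub location: Location,
}

#[derive(Debug)]
pub struct PluginError(pub String);

/// Trimmed version of the proposed trait, kept short for the sketch.
pub trait ContextProvider: Send + Sync {
    fn supported_extensions(&self) -> &[&str];
    fn extract_definitions(&self, content: &str, path: &str)
        -> Result<Vec<Definition>, PluginError>;
}

/// Hypothetical provider that recognizes a few `... def Name` forms.
pub struct ToySysmlProvider;

/// Leading run of identifier characters in `s`.
fn identifier(s: &str) -> &str {
    let end = s
        .find(|c: char| !(c.is_alphanumeric() || c == '_'))
        .unwrap_or(s.len());
    &s[..end]
}

impl ContextProvider for ToySysmlProvider {
    fn supported_extensions(&self) -> &[&str] {
        &["sysml"]
    }

    fn extract_definitions(&self, content: &str, _path: &str)
        -> Result<Vec<Definition>, PluginError> {
        // Keyword prefix mapped to the graph node type it produces.
        const KINDS: [(&str, &str); 3] = [
            ("part def ", "PartDefinition"),
            ("requirement def ", "RequirementDefinition"),
            ("action def ", "ActionDefinition"),
        ];
        let mut package = String::new();
        let mut defs = Vec::new();
        for (idx, raw) in content.lines().enumerate() {
            let line = raw.trim();
            if let Some(rest) = line.strip_prefix("package ") {
                // Naive: remembers only the most recent package name.
                package = identifier(rest).to_string();
                continue;
            }
            for (prefix, kind) in KINDS {
                if let Some(rest) = line.strip_prefix(prefix) {
                    let name = identifier(rest);
                    if name.is_empty() {
                        return Err(PluginError(format!("line {}: missing name", idx + 1)));
                    }
                    let fqn = if package.is_empty() {
                        name.to_string()
                    } else {
                        format!("{}::{}", package, name)
                    };
                    defs.push(Definition {
                        name: name.to_string(),
                        definition_type: kind.to_string(),
                        fqn,
                        location: Location { line: idx + 1 },
                    });
                }
            }
        }
        Ok(defs)
    }
}
```

A real plugin would replace the line scanner with a proper parser while keeping the same output shape, which is what the graph layer consumes.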
A.6.2 Integration with Standalone MCP Server
The standalone SysML v2 MCP server (this project) would share parsing logic with a potential GKG plugin through a common tree-sitter grammar, which does not exist yet (see A.3.2) and would have to be written first:
    ┌─────────────────────────────────────────────────────────┐
    │                    tree-sitter-sysml                    │
    │             (grammar, Rust/C/WASM bindings)             │
    └─────────────────────────────────────────────────────────┘
             │                                 │
             ▼                                 ▼
    ┌─────────────────┐         ┌─────────────────────────────┐
    │   MCP Server    │         │     GKG Plugin (Future)     │
    │  (standalone)   │         │   (ContextProvider impl)    │
    └─────────────────┘         └─────────────────────────────┘
This allows:
- Proving grammar correctness via MCP server
- Reusing grammar in GKG plugin without duplication
- Contributing grammar to GitLab vendor/grammars/ for syntax highlighting
- Independent evolution of MCP tools vs. GKG integration
A.6.3 Example Queries Enabled
With SysML v2 support, GKG could answer:
    // Find all parts that satisfy safety requirements
    MATCH (p:PartDefinition)-[:satisfies]->(r:RequirementDefinition)
    WHERE r.name CONTAINS 'Safety'
    RETURN p.name, r.name

    // Trace function allocation to physical structure
    MATCH (a:ActionDefinition)-[:allocates]->(p:PartDefinition)
    RETURN a.name AS function, p.name AS component

    // Find unsatisfied requirements (no incoming satisfies edge)
    MATCH (r:RequirementDefinition)
    WHERE NOT (r)<-[:satisfies]-()
    RETURN r.name AS untraced_requirement
A.7 Implementation Considerations
A.7.1 WASM vs Native vs Subprocess
| Approach | Performance | Security | Portability | Complexity |
|---|---|---|---|---|
| WASM | Medium | High (sandboxed) | High | Medium |
| Native (.so/.dylib) | High | Low | Low | High |
| Subprocess | Low-Medium | Medium | High | Low |
Recommendation for MVP: Subprocess with JSON-over-stdio. Simple to implement, language-agnostic, and isolates plugin crashes.
A.7.2 Performance Implications
Plugin-based parsing adds overhead, mainly from process management and IPC. Approximate figures:
| Operation | In-process | Subprocess |
|---|---|---|
| Parse 1KB file | ~1ms | ~10-50ms |
| Parse 1MB file | ~100ms | ~200-500ms |
| Batch 1000 files | ~1s | ~10-30s |
Mitigations:
- Batch file processing to amortize IPC cost
- Plugin-side caching of parsed ASTs
- Incremental re-indexing (only changed files)
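Two of these mitigations, incremental re-indexing and batching, can be sketched with the standard library alone. The function names `fingerprint`, `changed_files`, and `batches` are hypothetical, and content hashing is one assumed way to detect changes; GKG may track changes differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Fingerprint file content. DefaultHasher is enough for a sketch, but its
/// algorithm is not guaranteed stable across Rust releases, so an index that
/// is persisted to disk would want a fixed hash function instead.
fn fingerprint(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Return the paths whose content differs from the stored fingerprint,
/// updating the stored fingerprints in place. New files count as changed.
pub fn changed_files(index: &mut HashMap<String, u64>, files: &[(&str, &str)]) -> Vec<String> {
    let mut changed = Vec::new();
    for (path, content) in files {
        let fp = fingerprint(content);
        // `insert` returns the previous fingerprint, if any.
        if index.insert(path.to_string(), fp) != Some(fp) {
            changed.push(path.to_string());
        }
    }
    changed
}

/// Group paths into batches so that one plugin request covers many files.
pub fn batches(paths: &[String], size: usize) -> Vec<Vec<String>> {
    paths.chunks(size.max(1)).map(|chunk| chunk.to_vec()).collect()
}
```

Only the paths returned by `changed_files` need to be sent to the plugin, and sending them in batches means the per-request IPC cost is paid once per batch rather than once per file.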
A.7.3 Security Considerations
Plugins execute arbitrary code. Security measures:
- Sandboxing: Prefer WASM for production plugins
- Capabilities: Plugins declare required permissions (file read, network, etc.)
- Signed plugins: Require cryptographic signatures for trusted plugins
- Review process: Plugin registry with review before listing
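The capabilities item could start from a check like the one below. The `Capability` enum and `missing_capabilities` function are hypothetical, and the permission set is an assumption based on the examples in the list above.

```rust
use std::collections::BTreeSet;

/// Permissions a plugin may request (assumed set, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Capability {
    FileRead,
    FileWrite,
    Network,
}

/// Return the capabilities a plugin declares but the host has not granted,
/// in enum order. An empty result means the plugin may be loaded.
pub fn missing_capabilities(declared: &[Capability], granted: &[Capability]) -> Vec<Capability> {
    let granted: BTreeSet<Capability> = granted.iter().copied().collect();
    let declared: BTreeSet<Capability> = declared.iter().copied().collect();
    declared.difference(&granted).copied().collect()
}
```

A declaration check like this only gates loading. Enforcing the declared limits at runtime still depends on the isolation mechanism, which is one reason to prefer WASM for production plugins.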
A.8 Contribution Path
A.8.1 Alignment with GKG Roadmap
This proposal aligns with GKG maintainers’ expressed interests:
| GKG Initiative | Alignment |
|---|---|
| Issue #139 (Contributions Pipeline) | Plugin architecture enables external contributions |
| Issue #227 (GEL) | Plugins extend beyond GEL’s AST-based scope |
| Issue #270 (SCIP) | Plugin approach is complementary to SCIP |
A.8.2 Phased Approach
- Phase A: Standalone MCP Server (this project)
  - Prove SysML v2 parser correctness
  - Establish parsing patterns and AST design
  - Demonstrate value to MBSE practitioners
- Phase B: Tree-sitter Grammar (future)
  - Contribute tree-sitter grammar for SysML v2
  - Enables native GKG support without plugins
  - Requires significant specification analysis
- Phase C: Plugin Architecture Proposal (future)
  - Present findings to GKG maintainers
  - Collaborate on plugin interface design
  - Implement SysML v2 as reference plugin
A.8.3 Prerequisites
Before proposing to GKG team:
A.9 Conclusion
Extending GitLab Knowledge Graph for domain-specific languages like SysML v2 is technically feasible through a plugin architecture. The standalone SysML v2 MCP server serves as a proving ground for:
- Parser design and correctness
- Graph schema for MBSE concepts
- Value proposition for AI-assisted systems engineering
Lessons learned will inform future contributions to GKG, contingent on maintainer interest and project priorities.
A.10 References
- GitLab Knowledge Graph: https://gitlab.com/gitlab-org/rust/knowledge-graph
- gitlab-code-parser: https://gitlab.com/gitlab-org/rust/gitlab-code-parser
- GKG Issue #139 (Contributions Pipeline): https://gitlab.com/gitlab-org/rust/knowledge-graph/-/issues/139
- GKG Issue #227 (GEL POC): https://gitlab.com/gitlab-org/rust/knowledge-graph/-/issues/227
- GKG Issue #270 (SCIP Integration): https://gitlab.com/gitlab-org/rust/knowledge-graph/-/issues/270
- ASIMOV Platform (reference): https://asimov.blog/introducing-asimov/
- sysml.rs (reference): https://github.com/artob/sysml.rs