Skip to content

feat: Implement GDS (Graph Data Science) Extension #351

@longbinlai

Description

@longbinlai

What do you want to be done?

Implement the GDS (Graph Data Science) extension for NeuG based on the design specification in PR #64 (feat: Add GDS Design Spec).

This is a major feature that brings Graph Data Science capabilities to NeuG, enabling users to run complex graph algorithms on projected subgraphs without copying data.

Background

PR #64 introduced the comprehensive GDS design specification (specs/004-gds/spec.md, 1120+ lines) which covers the full stack from product requirements to implementation details:

  • 8 core graph algorithms: PageRank, Connected Components, Louvain, Leiden, Label Propagation, Weakly Connected Components, Shortest Path, BFS
  • User-facing Cypher API: INSTALL/LOAD EXTENSION, project_graph, CALL algo
  • C++ implementation structures: ProjectedSubgraph, GDSGraph, GDSAlgo physical plan
  • Developer extension API and prioritized roadmap
  • Platform support: Multi-platform .so distribution (Linux/macOS, x86_64/ARM64)

Implementation Requirements

1. Extension Infrastructure

Goal: Enable NeuG to load and manage external extensions dynamically.

  • Extension Registry System
    • Implement extension metadata catalog (name, version, platform, entry point)
    • Track installed vs loaded extension state
  • INSTALL EXTENSION Support
    • Parse INSTALL EXTENSION 'gds' Cypher command
    • Download .neug_extension file from OSS or load from local filesystem
    • Verify platform compatibility (OS + architecture)
    • Store extension metadata in registry
  • LOAD EXTENSION Support
    • Parse LOAD EXTENSION 'gds' Cypher command
    • dlopen() the extension .so file
    • Call extension Init() function to register functions in catalog
    • Handle extension lifecycle (load/unload/reload)
  • Extension Build System
    • Add CMake configuration for building extensions as shared libraries
    • Define extension API headers (include/neug/extension/)
    • Support cross-platform builds (Linux/macOS, x86_64/ARM64)

Key Files to Create/Modify:

  • include/neug/extension/extension_api.h - Extension entry point and callback interfaces
  • include/neug/extension/extension_registry.h - Extension catalog management
  • src/extension/extension_registry.cpp - Implementation
  • src/extension/extension_loader.cpp - dlopen/dlclose wrapper
  • CMakeLists.txt updates for extension build targets

2. Graph Projection (ProjectedSubgraph)

Goal: Create zero-copy subgraph projections that store labels + predicates, not duplicated data.

  • ProjectedSubgraph Data Structure
    • struct ProjectedSubgraph containing:
      • std::unordered_map<std::string, VertexEntry> - node labels + filter expressions
      • std::unordered_map<std::string, EdgeEntry> - edge labels + filter expressions (triplet format: <src_label, edge_type, dst_label>)
    • struct VertexEntry - stores label filter (e.g., "Person: true" or "User.age > 30")
    • struct EdgeEntry - stores triplet edge label + relationship filter
  • Session-level Subgraph Context
    • Store ProjectedSubgraph in session context keyed by name
    • Support multiple named projections per session
    • Handle projection lifecycle (create/use/drop)
  • CALL project_graph Implementation
    • Parse CALL project_graph('g', {Person:'true'}, {KNOWS:'true'})
    • Validate node/edge specs against schema
    • Store projection in session context (no data copy)
    • Return projection metadata (vertex count, edge count estimates)

Key Files to Create/Modify:

  • include/neug/gds/projected_subgraph.h - ProjectedSubgraph definition
  • src/gds/projected_subgraph.cpp - Implementation
  • include/neug/main/session.h - Add subgraph context storage
  • src/compiler/ - Add CALL procedure parser support

3. GDSGraph Runtime View

Goal: Provide a runtime view over the full graph with predicate filtering applied at scan time.

  • GDSGraph Structure
    • struct GDSGraph wrapping:
      • Reference to base CSR storage
      • Reference to ProjectedSubgraph for label/predicate info
      • Runtime vertex/edge iterators that apply predicates
    • Zero-copy design: filter at runtime, don't materialize
  • Predicate Application
    • Compile node/edge filter expressions to Expression objects
    • Apply filters during graph traversal (scan + neighbor access)
    • Support label_id filtering for fast path
  • Homogeneous Graph Support
    • Handle single-label graphs (all nodes/edges)
    • Optimize for no-filter projections

Key Files to Create/Modify:

  • include/neug/gds/gds_graph.h - GDSGraph definition
  • src/gds/gds_graph.cpp - Implementation with CSR integration

4. Algorithm Implementation

Goal: Implement 8 graph algorithms with proper result streaming.

Phase 1 - Core Algorithms (MVP):

  • PageRank

    • Input: CALL page_rank('g', {damping: 0.85, tolerance: 1e-6, max_iter: 100})
    • Output: YIELD node, rank
    • Implementation: Iterative power method with damping factor
  • Connected Components

    • Input: CALL connected_components('g')
    • Output: YIELD node, component
    • Implementation: Union-Find or label propagation
  • Label Propagation

    • Input: CALL label_propagation('g', {max_iter: 10})
    • Output: YIELD node, community
    • Implementation: Synchronous/asynchronous label propagation
  • Leiden (AI/GraphRAG use case)

    • Input: CALL leiden('g', {resolution: 1.0, max_iter: 10})
    • Output: YIELD node, community, quality
    • Implementation: Multi-level modularity optimization with refinement

Phase 2 - Extended Algorithms:

  • Louvain

    • Input: CALL louvain('g', {resolution: 1.0})
    • Output: YIELD node, community, modularity
  • Weakly Connected Components (WCC)

    • Similar to CC but handles directed graphs
  • Shortest Path

    • Input: CALL shortest_path('g', {source: 123, target: 456, weight_property: 'distance'})
    • Output: YIELD node, distance, path
    • Implementation: Dijkstra with weight support
    • Note: weight_property: null should be documented as equivalent to BFS
  • BFS

    • Input: CALL bfs('g', {source: 123, max_depth: 5})
    • Output: YIELD node, depth
    • Implementation: Breadth-first search with depth limit
    • Note: Distinct from shortest_path due to max_depth parameter

Key Files to Create/Modify:

  • include/neug/gds/algorithms/ - Algorithm headers (one per algo)
  • src/gds/algorithms/ - Algorithm implementations
  • Each algorithm should follow the pattern:
    class PageRankAlgo : public GDSAlgoBase {
      Result Execute(const GDSGraph& graph, const AlgoParams& params) override;
    };

5. Physical Plan Integration (GDSAlgo)

Goal: Integrate GDS algorithms into the NeuG query execution pipeline.

  • GDSAlgo Physical Plan Operator
    • Create GDSAlgoPhysicalPlan extending base physical plan
    • Bind label_ids from ProjectedSubgraph to actual storage IDs
    • Compile Expression predicates from string to executable form
  • Execution Pipeline
    • Integrate with existing physical plan executor
    • Handle algorithm result streaming (don't materialize all results)
    • Support YIELD clause for column projection
  • Result Handling
    • Standardize algorithm output format (ResultTable)
    • Support pagination for large result sets
    • Handle early termination (e.g., convergence, max_iter)

Key Files to Create/Modify:

  • include/neug/execution/gds_algo_physical_plan.h
  • src/execution/gds_algo_physical_plan.cpp
  • src/compiler/ - Add logical-to-physical plan conversion for GDS calls

6. Cypher Parser Support

Goal: Parse GDS-specific Cypher syntax and integrate with query pipeline.

  • CALL Procedure Parsing
    • Extend ANTLR4 grammar for CALL procedure_name(params) YIELD columns
    • Support nested map parameters: {key1: value1, key2: {nested: value}}
    • Parse YIELD column list and aliases
  • Extension Function Registration
    • Register page_rank, connected_components, etc. in function catalog
    • Type-check parameters at bind time
    • Generate logical plan for GDS calls
  • gopt Integration
    • Add GDS-specific logical plan nodes
    • Convert to physical plan via existing gopt converter

Key Files to Modify:

  • src/compiler/parser/ - ANTLR4 grammar updates
  • src/compiler/binder/ - Function binding for GDS procedures
  • src/compiler/ - Logical plan generation

Technical Notes

Key Design Decisions (from spec):

  1. Zero-copy ProjectedSubgraph: Store predicates, not duplicated data. Apply filters at runtime during algorithm execution.
  2. Triplet Edge Labels: Edge projections use <src_label, edge_type, dst_label> format for precise filtering.
  3. Weight Semantics: shortest_path with weight_property: null is equivalent to BFS, but bfs procedure has additional max_depth parameter - document this distinction clearly.
  4. Platform Support: Multi-platform .so distribution requires platform detection at install time.

Architecture Overview (from spec sequence diagram):

User → NeuG: INSTALL EXTENSION 'gds'
NeuG → OSS: Download libjson.neug_extension
OSS → NeuG: .so file

User → NeuG: LOAD EXTENSION 'gds'
NeuG → ExtReg: dlopen() → call Init()
ExtReg → NeuG: Functions registered in catalog

User → NeuG: CALL project_graph('g', {Person:'true'}, {KNOWS:'true'})
NeuG → SubgraphCtx: Store ProjectedSubgraph (labels + predicates, no data copy)
SubgraphCtx → NeuG: OK

User → NeuG: CALL k_core('g', {min_k:3}) YIELD node, core_number
NeuG → SubgraphCtx: Lookup ProjectedSubgraph by name 'g'
SubgraphCtx → NeuG: VertexEntries + EdgeEntries
NeuG → NeuG: Compile to GDSAlgo physical plan (bind label_ids + Expression predicates)
NeuG → GDSAlgo: Execute with GDSGraph (scan full graph, apply predicates at runtime)
GDSAlgo → NeuG: (node, core_number) tuples
NeuG → User: Result set

Known Issues from Spec Review (PR #64 Greptile comments):

  • Algorithm count mismatch in "AI/GraphRAG 刚需算法" section (claims 3, lists 2) - needs fix in spec
  • Malformed platform support table in §2.2 - needs fix in spec
  • Ambiguous semantics between bfs and shortest_path with no weight - needs clarification in implementation

Acceptance Criteria

  • All Phase 1 algorithms functional with correct results on test graphs
  • Extension install/load lifecycle works end-to-end (download → dlopen → execute)
  • project_graph creates zero-copy projections (verify no data duplication in memory)
  • CALL algo.* syntax fully integrated with Cypher parser
  • GDSAlgo physical plan integrates with existing execution pipeline
  • Benchmarks match or exceed spec performance targets (reference LDBC SNB or similar)
  • Python bindings exposed for all public APIs (Database.call_procedure())
  • Unit tests for each algorithm with known-good results
  • Integration tests for full pipeline (install → load → project → execute)

Implementation Phases

Phase Scope Priority
Phase 0 Extension infrastructure + build system P0
Phase 1 ProjectedSubgraph + GDSGraph runtime view P0
Phase 2 Phase 1 algorithms (PageRank, CC, Label Prop, Leiden) P0
Phase 3 Physical plan integration + Cypher parser P0
Phase 4 Phase 2 algorithms (Louvain, WCC, Shortest Path, BFS) P1
Phase 5 Python bindings + benchmarks P1

Related

Metadata

Metadata

Assignees

Labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions