Skip to content

Latest commit

 

History

History
84 lines (62 loc) · 3.74 KB

File metadata and controls

84 lines (62 loc) · 3.74 KB

Architecture

graphify is a Claude Code skill backed by a Python library. The skill orchestrates the library; the library can be used standalone.

Pipeline

detect()  →  extract()  →  build_graph()  →  cluster()  →  analyze()  →  report()  →  export()

Each stage is a single function in its own module. They communicate through plain Python dicts and NetworkX graphs - no shared state, no side effects outside graphify-out/.

Module responsibilities

Module Function Input → Output
detect.py collect_files(root) directory → [Path] filtered list
extract.py extract(path) file path → {nodes, edges} dict
build.py build_graph(extractions) list of extraction dicts → nx.Graph
cluster.py cluster(G) graph → graph with community attr on each node
analyze.py analyze(G) graph → analysis dict (god nodes, surprises, questions)
report.py render_report(G, analysis) graph + analysis → GRAPH_REPORT.md string
export.py export(G, out_dir, ...) graph → Obsidian vault, graph.json, graph.html, graph.svg
ingest.py ingest(url, ...) URL → file saved to corpus dir
cache.py check_semantic_cache / save_semantic_cache files → (cached, uncached) split
security.py validation helpers URL / path / label → validated or raises
validate.py validate_extraction(data) extraction dict → raises on schema errors
serve.py start_server(graph_path) graph file path → MCP stdio server
watch.py watch(root, flag_path) directory → writes flag file on change
benchmark.py run_benchmark(graph_path) graph file → corpus vs subgraph token comparison

Extraction output schema

Every extractor returns:

{
  "nodes": [
    {"id": "unique_string", "label": "human name", "source_file": "path", "source_location": "L42"}
  ],
  "edges": [
    {"source": "id_a", "target": "id_b", "relation": "calls|imports|uses|...", "confidence": "EXTRACTED|INFERRED|AMBIGUOUS"}
  ]
}

validate.py enforces this schema before build_graph() consumes it.

Confidence labels

Label Meaning
EXTRACTED Relationship is explicitly stated in the source (e.g., an import statement, a direct call)
INFERRED Relationship is a reasonable deduction (e.g., call-graph second pass, co-occurrence in context)
AMBIGUOUS Relationship is uncertain; flagged for human review in GRAPH_REPORT.md

Adding a new language extractor

  1. Add a extract_<lang>(path: Path) -> dict function in extract.py following the existing pattern (tree-sitter parse → walk nodes → collect nodes and edges → call-graph second pass for INFERRED calls edges).
  2. Register the file suffix in extract() dispatch and collect_files().
  3. Add the suffix to CODE_EXTENSIONS in detect.py and _WATCHED_EXTENSIONS in watch.py.
  4. Add the tree-sitter package to pyproject.toml dependencies.
  5. Add a fixture file to tests/fixtures/ and tests to tests/test_languages.py.

Security

All external input passes through graphify/security.py before use:

  • URLs → validate_url() (http/https only) + _NoFileRedirectHandler (blocks file:// redirects)
  • Fetched content → safe_fetch() / safe_fetch_text() (size cap, timeout)
  • Graph file paths → validate_graph_path() (must resolve inside graphify-out/)
  • Node labels → sanitize_label() (strips control chars, caps 256 chars, HTML-escapes)

See SECURITY.md for the full threat model.

Testing

One test file per module under tests/. Run with:

pytest tests/ -q

All tests are pure unit tests - no network calls, no file system side effects outside tmp_path.