Architecture

graphify is a Claude Code skill backed by a Python library. The skill orchestrates the library; the library can be used standalone.

Pipeline

detect()  →  extract()  →  build_graph()  →  cluster()  →  analyze()  →  report()  →  export()

Each stage is a single function in its own module. They communicate through plain Python dicts and NetworkX graphs - no shared state, no side effects outside graphify-out/.
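
The data flow between the first three stages can be sketched roughly as follows. This is a toy illustration, not the library's code: `toy_collect`, `toy_extract`, `toy_build`, and `run_pipeline` are hypothetical names, the extractor only looks for `import x` lines, and a plain adjacency dict stands in for the NetworkX graph.

```python
from __future__ import annotations

from pathlib import Path


def toy_collect(root: Path, suffixes: tuple[str, ...] = (".py",)) -> list[Path]:
    """Stand-in for the detect stage: list matching files in a stable order."""
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix in suffixes)


def toy_extract(path: Path) -> dict:
    """Stand-in for the extract stage: one node per file, one edge per 'import x' line."""
    node_id = path.stem
    nodes = [{"id": node_id, "label": path.name,
              "source_file": str(path), "source_location": "L1"}]
    edges = []
    for line in path.read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "import":
            edges.append({"source": node_id, "target": parts[1],
                          "relation": "imports", "confidence": "EXTRACTED"})
    return {"nodes": nodes, "edges": edges}


def toy_build(extractions: list[dict]) -> dict[str, set[str]]:
    """Stand-in for the build stage: merge extraction dicts into an adjacency mapping."""
    graph: dict[str, set[str]] = {}
    for extraction in extractions:
        for node in extraction["nodes"]:
            graph.setdefault(node["id"], set())
        for edge in extraction["edges"]:
            graph.setdefault(edge["source"], set()).add(edge["target"])
            graph.setdefault(edge["target"], set())
    return graph


def run_pipeline(root: Path) -> dict[str, set[str]]:
    """Chain the stages; each one only sees the previous stage's return value."""
    return toy_build([toy_extract(p) for p in toy_collect(root)])
```

Because every stage takes its input as an argument and returns a new value, any stage can be called or tested in isolation with hand-built dicts.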

Module responsibilities

Module	Function	Input → Output
`detect.py`	`collect_files(root)`	directory → `[Path]` filtered list
`extract.py`	`extract(path)`	file path → `{nodes, edges}` dict
`build.py`	`build_graph(extractions)`	list of extraction dicts → `nx.Graph`
`cluster.py`	`cluster(G)`	graph → graph with `community` attr on each node
`analyze.py`	`analyze(G)`	graph → analysis dict (god nodes, surprises, questions)
`report.py`	`render_report(G, analysis)`	graph + analysis → GRAPH_REPORT.md string
`export.py`	`export(G, out_dir, ...)`	graph → Obsidian vault, graph.json, graph.html, graph.svg
`ingest.py`	`ingest(url, ...)`	URL → file saved to corpus dir
`cache.py`	`check_semantic_cache / save_semantic_cache`	files → (cached, uncached) split
`security.py`	validation helpers	URL / path / label → validated or raises
`validate.py`	`validate_extraction(data)`	extraction dict → raises on schema errors
`serve.py`	`start_server(graph_path)`	graph file path → MCP stdio server
`watch.py`	`watch(root, flag_path)`	directory → writes flag file on change
`benchmark.py`	`run_benchmark(graph_path)`	graph file → corpus vs subgraph token comparison

Extraction output schema

Every extractor returns:

{
  "nodes": [
    {"id": "unique_string", "label": "human name", "source_file": "path", "source_location": "L42"}
  ],
  "edges": [
    {"source": "id_a", "target": "id_b", "relation": "calls|imports|uses|...", "confidence": "EXTRACTED|INFERRED|AMBIGUOUS"}
  ]
}

validate.py enforces this schema before build_graph() consumes it.
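
A check along these lines could enforce the schema above. It is a minimal sketch, not the contents of validate.py: `check_extraction` and the constant names are assumptions, and it deliberately does not require edge targets to appear in `nodes`, since the source does not say whether the real validator does.

```python
# Hypothetical schema check; the required keys are taken from the example above.
REQUIRED_NODE_KEYS = {"id", "label", "source_file", "source_location"}
REQUIRED_EDGE_KEYS = {"source", "target", "relation", "confidence"}
CONFIDENCE_LABELS = {"EXTRACTED", "INFERRED", "AMBIGUOUS"}


def check_extraction(data: dict) -> None:
    """Raise ValueError on the first schema problem; return None if the dict is valid."""
    if not isinstance(data, dict):
        raise ValueError("extraction must be a dict")
    for key in ("nodes", "edges"):
        if not isinstance(data.get(key), list):
            raise ValueError(f"'{key}' must be a list")
    for i, node in enumerate(data["nodes"]):
        if not isinstance(node, dict):
            raise ValueError(f"node {i} must be a dict")
        missing = REQUIRED_NODE_KEYS - set(node)
        if missing:
            raise ValueError(f"node {i} is missing keys: {sorted(missing)}")
    for i, edge in enumerate(data["edges"]):
        if not isinstance(edge, dict):
            raise ValueError(f"edge {i} must be a dict")
        missing = REQUIRED_EDGE_KEYS - set(edge)
        if missing:
            raise ValueError(f"edge {i} is missing keys: {sorted(missing)}")
        if edge["confidence"] not in CONFIDENCE_LABELS:
            raise ValueError(f"edge {i} has unknown confidence {edge['confidence']!r}")
```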

Confidence labels

Label	Meaning
`EXTRACTED`	Relationship is explicitly stated in the source (e.g., an import statement, a direct call)
`INFERRED`	Relationship is a reasonable deduction (e.g., call-graph second pass, co-occurrence in context)
`AMBIGUOUS`	Relationship is uncertain; flagged for human review in GRAPH_REPORT.md
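
As an illustration of how the labels might be consumed downstream, the sketch below tallies edges per label and pulls out the `AMBIGUOUS` ones for review. Both helper names are hypothetical; how the report stage actually selects edges is not described here.

```python
from collections import Counter


def summarize_confidence(edges: list) -> Counter:
    """Count edges per confidence label."""
    return Counter(edge["confidence"] for edge in edges)


def edges_for_review(edges: list) -> list:
    """Return only the edges a human should look at."""
    return [edge for edge in edges if edge["confidence"] == "AMBIGUOUS"]
```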

Adding a new language extractor

Add an extract_<lang>(path: Path) -> dict function in extract.py following the existing pattern (tree-sitter parse → walk nodes → collect nodes and edges → call-graph second pass for INFERRED calls edges).
Register the file suffix in extract() dispatch and collect_files().
Add the suffix to CODE_EXTENSIONS in detect.py and _WATCHED_EXTENSIONS in watch.py.
Add the tree-sitter package to pyproject.toml dependencies.
Add a fixture file to tests/fixtures/ and tests to tests/test_languages.py.
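
The shape of the first two steps can be sketched as below. Treat it as an assumption-heavy illustration: Lua is an arbitrary example language, a regular expression stands in for the tree-sitter parse, no call-graph second pass is done, and `EXTRACTORS` and `dispatch_extract` are hypothetical names rather than the actual dispatch code in extract.py.

```python
import re
from pathlib import Path

# Toy pattern for 'function name(' and 'local function name('; a real extractor
# would walk a tree-sitter syntax tree instead.
_LUA_FUNC = re.compile(r"^\s*(?:local\s+)?function\s+([A-Za-z_]\w*)")


def extract_lua(path: Path) -> dict:
    """Collect one node per top-level-looking function definition; no edges."""
    nodes = []
    lines = path.read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        match = _LUA_FUNC.match(line)
        if match:
            name = match.group(1)
            nodes.append({
                "id": f"{path.stem}.{name}",
                "label": name,
                "source_file": str(path),
                "source_location": f"L{lineno}",
            })
    return {"nodes": nodes, "edges": []}


# Hypothetical suffix registry: adding a language means adding one entry here.
EXTRACTORS = {".lua": extract_lua}


def dispatch_extract(path: Path) -> dict:
    """Pick the extractor by file suffix, failing loudly on unknown suffixes."""
    try:
        handler = EXTRACTORS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"no extractor registered for {path.suffix!r}") from None
    return handler(path)
```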

Security

All external input passes through graphify/security.py before use:

URLs → validate_url() (http/https only) + _NoFileRedirectHandler (blocks file:// redirects)
Fetched content → safe_fetch() / safe_fetch_text() (size cap, timeout)
Graph file paths → validate_graph_path() (must resolve inside graphify-out/)
Node labels → sanitize_label() (strips control chars, caps 256 chars, HTML-escapes)
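
For a feel of what three of these checks involve, here is a simplified sketch using only the standard library. The function names are deliberately different from the real helpers, the `base` parameter is an assumption, and the redirect handler and fetch limits are not covered; the actual behavior is defined by graphify/security.py.

```python
import html
import unicodedata
from pathlib import Path
from urllib.parse import urlparse

MAX_LABEL_LEN = 256  # cap taken from the list above


def check_url(url: str) -> str:
    """Accept only http/https URLs that name a host."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"unsupported URL: {url!r}")
    return url


def clean_label(label: str) -> str:
    """Strip control characters, cap the length, then HTML-escape.

    Escaping happens after the cap, so the returned string can be longer
    than MAX_LABEL_LEN when the label contains characters such as '<'.
    """
    stripped = "".join(ch for ch in label if unicodedata.category(ch) != "Cc")
    return html.escape(stripped[:MAX_LABEL_LEN])


def check_graph_path(path: str, base: str = "graphify-out") -> Path:
    """Resolve the path and require it to stay inside the base directory."""
    base_dir = Path(base).resolve()
    resolved = Path(path).resolve()
    if not resolved.is_relative_to(base_dir):  # requires Python 3.9+
        raise ValueError(f"{path!r} escapes {base_dir}")
    return resolved
```

Resolving before comparing is what defeats `..` segments: `graphify-out/../secret` resolves to a sibling of the output directory and is rejected.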

See SECURITY.md for the full threat model.

Testing

One test file per module under tests/. Run with:

pytest tests/ -q

All tests are pure unit tests - no network calls, no file system side effects outside tmp_path.
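
A test in this style might look like the following sketch. `count_lines` is a hypothetical function under test, included only so the example is self-contained; `tmp_path` is pytest's built-in per-test temporary directory fixture.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Hypothetical function under test: number of lines in a text file."""
    return len(path.read_text(encoding="utf-8").splitlines())


def test_count_lines(tmp_path: Path) -> None:
    # All files are created under tmp_path, so nothing leaks into the repo.
    fixture = tmp_path / "sample.py"
    fixture.write_text("import os\nprint(os.getcwd())\n", encoding="utf-8")
    assert count_lines(fixture) == 2
```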
