fix: cross-file call resolution — preserve N-way collision signal by jackshiung · Pull Request #397 · safishamsi/graphify

jackshiung · 2026-04-16T08:16:44Z

Summary

Fixes a silent data-loss bug in cross-file call resolution (introduced in #348 / 5c77d9c) where global_label_to_nid: dict[str, str] drops all but one candidate when multiple nodes share the same normalised label.

Problem

When multiple functions/methods across files share a name (e.g. .get() defined in 32 files of a Laravel/CodeIgniter monorepo):

global_label_to_nid: dict[str, str] = {}
for n in all_nodes:
    global_label_to_nid[normalised.lower()] = n["id"]   # N-1 candidates silently dropped

Every call-site of .get() gets routed to whichever nid happened to be iterated last, producing a pseudo-god-node. All ambiguity_degree=0 — the ambiguity signal is gone.

Observed on a real 5-subsystem monorepo (997 files):

.get(): 32 unique nodes, but only 1 received all 321 INFERRED incoming edges
Similar collapse for .all(), .delete(), .create(), .update()

Approach

Three layered commits (pickable individually):

refactor — bucket the lookup: dict[str, list[str]] preserving all candidates, with per-bucket dedup and consumption-site invariant assert.
fix — on multi-candidate resolution, emit one AMBIGUOUS edge per candidate (confidence=AMBIGUOUS, confidence_score=0.2, ambiguity_degree=N). Single-candidate paths remain INFERRED@0.8 unchanged.
feat — add a fan-out cap (max_ambiguity_fanout, default 20, override via kwarg or GRAPHIFY_MAX_AMBIGUITY_FANOUT). Above the cap, emit no edges and record to cross_file_call_stats (propagated into graph.json by build/export/watch).

The cap exists because generic verbs (.get() with 30+ candidates) produce fan-outs that are AST-undecidable — the edges would be semantically noise. Default of 20 is conservative; fleet users can raise it via env var without code change.

Evidence

Tests: 441 pass (4 existing + 4 new collision tests; all existing extract/confidence/pipeline/etc. green). No regressions.

Real monorepo validation (TBA-backend, 997 files):

Metric	Before (v0.4.16)	After (this PR)
Total edges	10,214	10,529 (+3.1%)
INFERRED calls	2,580	1,230 (collisions correctly reclassified)
AMBIGUOUS calls	0	1,426
truncated_high_degree	n/a	1,079 (examples: update, all, get, create, delete)
Communities	474	449 (-5.3%, no collapse)
Top god nodes	stable	stable

Without the cap, TBA-backend exploded to 37,166 edges (+264%) — this was the motivation for commit 3.

Notes

ambiguity_degree field is additive — consumers that don't read it are unaffected.
cross_file_call_stats is optional and present in graph.json only when cross-file resolution runs.
Default cap can be raised for fleet users (monorepo-heavy codebases) via GRAPHIFY_MAX_AMBIGUITY_FANOUT=40 graphify update . without patching.

Testing

pytest tests/test_cross_file_collision.py -v   # 8/8 new
pytest tests/                                   # 441/441

Happy to split into multiple PRs if preferred (each commit stands alone).

🤖 Authored with assistance from Claude Opus 4.6

The cross-file call resolution was using dict[str, str], causing dict-overwrite: when N nodes shared the same normalised label, only the last-iterated nid survived in the lookup table, silently dropping N-1 valid candidates. This commit changes the lookup table to dict[str, list[str]] with per-bucket uniqueness guaranteed via a seen-set, and adds list(dict.fromkeys(...)) dedup at consumption site as a belt-and-braces invariant. Resolution behaviour is kept equivalent for now — we still pick candidates[0] as the target — preparing the ground for subsequent commits to emit proper AMBIGUOUS edges. No behaviour change in this commit. All existing tests pass.

…_degree When cross-file call resolution finds multiple candidates for a callee (e.g. `.get()` defined in 32 files), emitting a single edge to an arbitrary winner is indistinguishable from dict-overwrite. This commit fans out the edge to all candidates, marking each as AMBIGUOUS (confidence_score=0.2) and recording ambiguity_degree = number of candidates on each edge. Single-candidate resolution remains INFERRED at 0.8 (unchanged). Self-reference is filtered (caller is excluded from its own candidate list). Real-world impact on a 5-subsystem monorepo (TBA-backend, 997 files): - INFERRED calls: 2580 → 1230 (collisions correctly reclassified) - AMBIGUOUS calls: 0 → N (exposes previously-hidden ambiguity) Note: heavy collisions (CRUD verbs like `.get()` with 30+ candidates) cause edge explosion. The next commit addresses this with a fan-out cap.

Labels with degree > 20 (typically generic verbs like `.get()`, `.all()`, `.delete()` in multi-subsystem monorepos) produce N-way AMBIGUOUS fanouts with no semantic value — the AST cannot disambiguate them regardless. On TBA-backend this caused edge count to jump 264% (10K → 37K). This commit adds a configurable cap: - extract() gains max_ambiguity_fanout kw-arg (default 20) - env var GRAPHIFY_MAX_AMBIGUITY_FANOUT overrides - When len(candidates) > cap → drop fan-out, record to stats - Stats surface in cross_file_call_stats with: - resolved_single / resolved_ambiguous / truncated_high_degree - truncated_examples (first 5 dropped labels) - max_ambiguity_fanout (effective value) build.py / export.py / watch.py propagate the stats into graph.json so downstream tools can see what was truncated. Real-world impact on TBA-backend: - Edges: 37166 (+264%) → 10529 (+3.1%) - truncated_high_degree: 1079 (examples: update, all, get, create, delete) - All other consumers (cluster, god_nodes, report) behave normally.

Adds 8 tests covering: - Single candidate → INFERRED 0.8 (unchanged behaviour) - N candidates → N AMBIGUOUS edges with ambiguity_degree=N - Self-reference filter correctness - Default cap (20) drops high-degree fanouts and records to stats - Cap override via kw-arg - Cap override via GRAPHIFY_MAX_AMBIGUITY_FANOUT env var - ambiguity_degree always matches actual fan-out count (invariant) - Unique targets within each call-site's fan-out Fixtures (tests/fixtures/collision/*.py) provide minimal Python programs that exercise each case.

jackshiung added 4 commits April 16, 2026 16:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: cross-file call resolution — preserve N-way collision signal#397

fix: cross-file call resolution — preserve N-way collision signal#397
jackshiung wants to merge 4 commits intosafishamsi:v4from
jackshiung:chillvibe/main

jackshiung commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jackshiung commented Apr 16, 2026

Summary

Problem

Approach

Evidence

Notes

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant