
fix(extract): use full path instead of stem for within-file node IDs #329

Open
BTCB wants to merge 1 commit into safishamsi:v4 from BTCB:fix/stem-id-collision
@BTCB BTCB commented Apr 14, 2026

Node IDs were generated as `<path.stem>_<ident>`, causing collisions whenever two files in the corpus shared a basename (`lib.rs` across Rust crates, `__init__.py` across Python packages, `index.ts` in TS monorepos) and also contained symbols with the same name. Merged nodes inherited edges from multiple files, producing spurious high-betweenness bridges and duplicate entries in the God Nodes ranking.
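The collision can be reproduced in a few lines. This is a minimal sketch, not the code in extract.py: `make_id` is a hypothetical stand-in for `_make_id` (its real normalization rules are not shown in this PR), and the two file paths are made up.

```python
import re
from collections import defaultdict
from pathlib import Path


def make_id(*parts: str) -> str:
    """Hypothetical stand-in for _make_id: join parts, normalize to [a-z0-9_]."""
    joined = "_".join(parts)
    return re.sub(r"[^a-z0-9]+", "_", joined.lower()).strip("_")


# Two crates, each with a lib.rs that defines a symbol named test_config.
files = [Path("crate_a/src/lib.rs"), Path("crate_b/src/lib.rs")]

# Node store keyed by ID, remembering which files contributed to each node.
nodes: dict[str, set[str]] = defaultdict(set)
for path in files:
    stem = path.stem  # "lib" for both files
    nodes[make_id(stem, "test_config")].add(str(path))

# Both definitions land on the same ID, so the graph sees one merged node.
merged = {nid: srcs for nid, srcs in nodes.items() if len(srcs) > 1}
print(merged)
```

Any edge attached to either `test_config` would end up on the single merged node, which is how two unrelated crates can appear connected.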

Fix: replace `stem = path.stem` with `stem = _make_id(str(path))` at all 10 declaration sites in `extract.py`. The resulting IDs have a path-unique prefix (same as the file-level `file_nid`), so within-file `_make_id(stem, ident)` calls never collide across files.
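A sketch of the effect of that one-line change, under the same assumptions as before: `make_id` is a hypothetical approximation of `_make_id`, and `node_id` is an illustrative wrapper that does not exist in extract.py.

```python
import re
from pathlib import Path


def make_id(*parts: str) -> str:
    """Hypothetical stand-in for _make_id: join parts, normalize to [a-z0-9_]."""
    joined = "_".join(parts)
    return re.sub(r"[^a-z0-9]+", "_", joined.lower()).strip("_")


def node_id(path: Path, ident: str) -> str:
    """Illustrative helper: build a within-file node ID from the full path."""
    stem = make_id(str(path))  # was: stem = path.stem
    return make_id(stem, ident)


a = node_id(Path("crate_a/src/lib.rs"), "test_config")
b = node_id(Path("crate_b/src/lib.rs"), "test_config")
print(a)
print(b)
```

With the full path folded into the prefix, the two `test_config` symbols get distinct IDs. How strictly unique the prefix is depends on how the real `_make_id` normalizes path separators, which this sketch only approximates.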

Cross-file references are unaffected: import handlers don't use `stem`, label-based call resolution uses labels rather than IDs, and `pre_link_python`'s `stem_to_entities` lookups remain consistent because the keys and the values change together.

Verified on a 23-crate Rust workspace: 59 collision casualties -> 0, the largest weakly-connected component grew from 367 to 939 nodes, and the god node ranking stopped showing duplicate labels (`test_config()` had appeared at both rank 1 and rank 6 due to non-deterministic merge order).
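For anyone wanting to repeat the component-size measurement on their own graph, a minimal stdlib sketch follows. It assumes edges are available as `(source, target)` ID pairs; the `largest_wcc_size` helper and the toy edge list are illustrative and are not taken from this PR or from the workspace above.

```python
from collections import Counter


def largest_wcc_size(edges):
    """Size of the largest weakly-connected component (edge direction ignored)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the endpoints of every edge, treating it as undirected.
    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst

    # Count how many nodes share each root.
    sizes = Counter(find(node) for node in list(parent))
    return max(sizes.values(), default=0)


edges = [("a", "b"), ("c", "b"), ("d", "e")]
print(largest_wcc_size(edges))
```

Running it on the edge list extracted before and after the patch gives a before/after comparison of the same kind as the one reported above.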

Full test suite: 426 passed, 0 regressions. (The 7 failures in `test_security.py` are pre-existing and environment-specific; they reproduce identically on unpatched HEAD.)
