fix(rust): resolve use crate::... into real cross-file edges #330
Open
BTCB wants to merge 1 commit into safishamsi:v4 from
The Rust extractor's `use_declaration` handler was emitting a single
edge per use statement to a stem-only node ID like `_make_id("types")`,
which never matched any real node and got garbage-collected by the
dangling-edge filter. Net result: **zero** cross-file edges for Rust
workspaces, so every `lib.rs` / `types.rs` / `model.rs` showed up as
an orphan weakly-connected component with degree=1.
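The failure mode can be reproduced in miniature. The sketch below is illustrative only: `make_id` and `drop_dangling` are hypothetical stand-ins (not the extractor's actual `_make_id` or its filter), and the `imports_from` relation name is an assumption. It shows why an edge aimed at a stem-only ID never survives filtering, while an edge aimed at the defining node does.

```python
def make_id(*parts: str) -> str:
    """Hypothetical ID helper: lowercase the parts and join them with '_'."""
    return "_".join(p.lower() for p in parts)


def drop_dangling(nodes: list[dict], edges: list[dict]) -> list[dict]:
    """Keep only edges whose source and target are both real node IDs."""
    known = {n["id"] for n in nodes}
    return [e for e in edges if e["source"] in known and e["target"] in known]


nodes = [
    {"id": make_id("manager", "StrategyLifecycleManager")},
    {"id": make_id("types", "StrategyRuntime")},
]
src = make_id("manager", "StrategyLifecycleManager")

# Old behaviour: one edge per use statement, aimed at the bare file stem.
stem_edge = {"source": src, "target": make_id("types"), "relation": "imports_from"}
# What the graph needs: an edge aimed at the node that defines the type.
real_edge = {"source": src, "target": make_id("types", "StrategyRuntime"), "relation": "uses"}

kept = drop_dangling(nodes, [stem_edge, real_edge])
print([e["relation"] for e in kept])  # prints ['uses']
```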
On a 23-crate real-world Rust workspace, this meant:
- 422 weakly-connected "orphan" nodes (mostly type-definition modules)
- same-file vs cross-file edge ratio of 52:1 (healthy Rust projects
should be ~3-5:1)
- edge relation histogram: 8681 calls + 4970 contains + only 4 uses
- orphaned type modules disconnected from their in-crate consumers
despite sitting in the same crate — e.g. `crates/graduation/types.rs`
had NO PATH to `crates/graduation/manager.rs` even though manager.rs
contains `use crate::types::{ModeTransition, StrategyMode, StrategyRuntime};`
## Fix
Two-pass resolution, mirroring the Python cross-file resolver's structure:
1. **Per-file pass** (`extract_rust`): each `use_declaration` is parsed
   and recorded on the result dict's new `_rust_uses` list. The parser
   handles:
   - `use crate::foo::Bar;` (single ident)
   - `use crate::foo::{A, B as C};` (brace group + aliases)
   - `use crate::foo::*;` (glob: skipped, can't be resolved statically)
   - `pub use foo::{A, B};` in `lib.rs` / `mod.rs` (relative re-export
     of sibling module)
   - Filters out `self::`, `super::`, and external-crate paths.
   - No edge is emitted in this pass, because the targets live in other
     files.
2. **Cross-file pass** (`_resolve_cross_file_rust_imports`): builds a
   global `label → [(node_id, source_file)]` index across all parsed
   Rust files, then for each recorded use, resolves each imported
   identifier to a concrete target node. Prefers candidates whose
   `source_file` matches the use's module prefix (so `use crate::types::Foo`
   picks a `Foo` defined in a `types.rs`). Falls back to the first
   non-self candidate if no prefix matches.
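As a rough illustration of the per-file pass, the sketch below parses a single `use` line with a regular expression. This is an assumption-heavy stand-in: the real handler works on tree-sitter `use_declaration` nodes, and `parse_use`, `USE_RE`, and the record keys (`module`, `names`, `reexport`) are hypothetical names chosen for this example. Nested paths inside a brace group are simply skipped here.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical line-level matcher; the real extractor walks the syntax tree.
USE_RE = re.compile(r"^\s*(pub\s+)?use\s+(.+?)\s*;\s*$")


def parse_use(line: str, file_name: str) -> Optional[dict]:
    """Return {'module', 'names', 'reexport'} for a resolvable use, else None."""
    m = USE_RE.match(line)
    if m is None:
        return None
    is_pub, path = m.group(1) is not None, m.group(2)
    in_module_root = PurePosixPath(file_name).name in {"lib.rs", "mod.rs"}

    if path.startswith("crate::"):
        path = path[len("crate::"):]
    elif path.startswith(("self::", "super::")) or not (is_pub and in_module_root):
        # self::/super:: paths and external-crate paths are filtered out.
        return None
    # Otherwise: `pub use foo::...` in lib.rs / mod.rs, read as a sibling module.

    if "::{" in path:
        module, _, group = path.partition("::{")
        if not group.endswith("}"):
            return None
        items = group[:-1].split(",")
    else:
        module, _, last = path.rpartition("::")
        items = [last]

    names = []
    for item in items:
        name = item.split(" as ")[0].strip()  # keep the original name, drop the alias
        if name and name != "*" and "::" not in name and "{" not in name:
            names.append(name)
    if not module or not names:
        return None  # globs and bare module imports yield nothing to resolve
    return {"module": module, "names": names, "reexport": is_pub and in_module_root}


record = parse_use("use crate::types::{ModeTransition, StrategyMode as Mode};", "manager.rs")
```

Note that the record stores the original name (`StrategyMode`), not the alias, since the alias only exists in the importing file and cannot be looked up elsewhere.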
The cross-file pass produces two new edge relations:
- `uses` (confidence_score 0.95) for ordinary imports
- `reexports` (confidence_score 1.0) for `pub use` in `lib.rs` / `mod.rs`
Both are `INFERRED`, not `EXTRACTED`, because tree-sitter alone cannot
fully verify type identity the way rust-analyzer would — two different
crates may define identically-named types and the heuristic picks a
best-effort candidate.
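The cross-file pass can be sketched along these lines. It is not the actual `_resolve_cross_file_rust_imports`: the function names, the use-record keys (`source_id`, `source_file`, `module`, `names`, `reexport`), the `confidence` field name, and the exact prefix-matching rule (file stem, or parent directory for `mod.rs`) are all assumptions made for illustration. Only the relation names and confidence scores come from the description above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def module_stem(source_file: str) -> str:
    """`types.rs` -> 'types'; `types/mod.rs` -> 'types' (assumed matching rule)."""
    p = PurePosixPath(source_file)
    return p.parent.name if p.name == "mod.rs" else p.stem


def resolve_uses(nodes: list[dict], uses: list[dict]) -> list[dict]:
    # Global index: label -> [(node_id, source_file)]
    index = defaultdict(list)
    for n in nodes:
        index[n["label"]].append((n["id"], n["source_file"]))

    edges = []
    for use in uses:
        wanted = use["module"].split("::")[-1]
        for name in use["names"]:
            # "Non-self" is read here as "not defined in the importing file".
            candidates = [c for c in index.get(name, []) if c[1] != use["source_file"]]
            if not candidates:
                continue  # unresolved: emit nothing rather than a dangling edge
            preferred = [c for c in candidates if module_stem(c[1]) == wanted]
            target_id = (preferred or candidates)[0][0]
            edges.append({
                "source": use["source_id"],
                "target": target_id,
                "relation": "reexports" if use["reexport"] else "uses",
                "confidence": "INFERRED",
                "confidence_score": 1.0 if use["reexport"] else 0.95,
            })
    return edges


nodes = [
    {"id": "other_foo", "label": "Foo", "source_file": "crates/a/src/other.rs"},
    {"id": "types_foo", "label": "Foo", "source_file": "crates/a/src/types.rs"},
]
uses = [{
    "source_id": "manager", "source_file": "crates/a/src/manager.rs",
    "module": "types", "names": ["Foo", "Missing"], "reexport": False,
}]
edges = resolve_uses(nodes, uses)
```

With two `Foo` definitions in the index, the one in `types.rs` wins because its stem matches the `types` module prefix; `Missing` resolves to nothing and produces no edge.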
## Cache schema bump
`cache.py` now mixes a `_CACHE_SCHEMA_TAG = b"v2"` constant into every
file hash. Pre-existing cache entries (which lack the new `_rust_uses`
field) silently stop matching and get re-extracted on next run. No
manual `rm -rf graphify-out/cache` needed. Bump this tag whenever the
extractor output schema changes again.
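A minimal sketch of the idea, assuming a SHA-256 content hash (the hash function actually used in `cache.py` is not stated here, and `file_hash` is a hypothetical name): mixing the tag into the digest means any change to the tag changes every cache key, so stale entries are never looked up again.

```python
import hashlib

_CACHE_SCHEMA_TAG = b"v2"


def file_hash(content: bytes, schema_tag: bytes = _CACHE_SCHEMA_TAG) -> str:
    """Hash file content together with the schema tag (SHA-256 is an assumption)."""
    h = hashlib.sha256()
    h.update(schema_tag)
    h.update(b"\x00")  # separator so tag and content cannot run together
    h.update(content)
    return h.hexdigest()


source = b"use crate::types::Foo;\n"
old_key = file_hash(source, schema_tag=b"v1")
new_key = file_hash(source)  # uses the current tag, b"v2"
```

Because `old_key` and `new_key` differ for identical file content, entries written under the previous tag simply miss and the file is re-extracted.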
## Verified on arbitrage-bot (23 crates, 237 .rs files)
```
Before: uses=4, reexports=0
After: uses=506, reexports=276 (+782 real cross-file edges)
```
Specifically on the `graduation` crate (6 files, the example described
above): 0 cross-file edges → 15 `uses` + 17 `reexports` = 32 new edges.
`StrategyRuntime`, `ModeTransition`, and `GraduationScorecard` now reach
`StrategyLifecycleManager` in 1-2 hops.
## Tests
Added 5 new unit tests in `test_multilang.py` covering a minimal
multi-file Rust fixture (`tests/fixtures/rust_crate/`):
- `test_rust_use_crate_resolves_braced_imports`
- `test_rust_use_crate_resolves_single_ident_import`
- `test_rust_pub_use_reexports_in_lib_rs`
- `test_rust_cross_file_edges_are_inferred`
- `test_rust_use_crate_never_produces_dangling_imports_from`
Full test suite: 417 passed, 0 regressions. (7 pre-existing failures
in `tests/test_security.py` are environment-specific — they reproduce
identically on unpatched `main` when local DNS resolves `example.com`
to a private IP range, unrelated to this PR.)