fix(rust): resolve use crate::... into real cross-file edges #330

Open
BTCB wants to merge 1 commit into safishamsi:v4 from BTCB:fix/rust-cross-file-use-edges

BTCB commented Apr 14, 2026

The Rust extractor's `use_declaration` handler was emitting a single
edge per use statement to a stem-only node ID like `_make_id("types")`,
which never matched any real node and got garbage-collected by the
dangling-edge filter. Net result: **zero** cross-file edges for Rust
workspaces, so every `lib.rs` / `types.rs` / `model.rs` showed up as
an orphan weakly-connected component with degree=1.

On a 23-crate real-world Rust workspace, this meant:

- 422 weakly-connected "orphan" nodes (mostly type-definition modules)
- same-file vs cross-file edge ratio of 52:1 (healthy Rust projects
  should be ~3-5:1)
- edge relation histogram: 8681 calls + 4970 contains + only 4 uses
- orphaned type modules disconnected from their in-crate consumers
  despite sitting in the same crate — e.g. `crates/graduation/types.rs`
  had NO PATH to `crates/graduation/manager.rs` even though manager.rs
  contains `use crate::types::{ModeTransition, StrategyMode, StrategyRuntime};`
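
As a rough illustration of why those edges vanished, here is a minimal sketch of a dangling-edge filter. The function name, the dict shapes, and the node IDs are all hypothetical stand-ins; the real filter and `_make_id` are not reproduced here.

```python
def drop_dangling_edges(nodes, edges):
    """Keep only edges whose source and target both exist as nodes."""
    known = {n["id"] for n in nodes}
    return [e for e in edges if e["source"] in known and e["target"] in known]


# Hypothetical IDs: real nodes carry file-qualified IDs, while the old
# handler pointed the edge at a bare stem such as "types".
nodes = [{"id": "graduation_manager_rs"}, {"id": "graduation_types_rs"}]
edges = [{"source": "graduation_manager_rs", "target": "types"}]

remaining = drop_dangling_edges(nodes, edges)  # the stem-only edge is dropped
```

Because the stem-only target never appears in the node set, the filter removes the edge and the two files end up with no path between them.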

## Fix

Two-pass resolution, mirroring the Python cross-file resolver's structure:

1. **Per-file pass** (`extract_rust`): each `use_declaration` is parsed
   and recorded on the result dict's new `_rust_uses` list. The parser
   handles:
   - `use crate::foo::Bar;` (single ident)
   - `use crate::foo::{A, B as C};` (brace group + aliases)
   - `use crate::foo::*;` (glob; skipped because it can't be resolved
     statically)
   - `pub use foo::{A, B};` in `lib.rs` / `mod.rs` (relative re-export of
     a sibling module)

   Paths starting with `self::` or `super::` and external-crate paths are
   filtered out. No edge is emitted in this pass because the targets live
   in other files.

2. **Cross-file pass** (`_resolve_cross_file_rust_imports`): builds a
   global `label → [(node_id, source_file)]` index across all parsed
   Rust files, then for each recorded use, resolves each imported
   identifier to a concrete target node. Prefers candidates whose
   `source_file` matches the use's module prefix (so `use crate::types::Foo`
   picks a `Foo` defined in a `types.rs`). Falls back to the first
   non-self candidate if no prefix matches.
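
The per-file pass can be approximated with the sketch below. It is not the PR's code: the actual extractor walks tree-sitter `use_declaration` nodes, whereas this version swaps in a regular expression over a single line so the example stays self-contained. `RustUse`, `parse_use`, and the `in_module_root` flag are hypothetical names, and nested brace groups are deliberately not handled.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class RustUse(NamedTuple):
    module_path: Tuple[str, ...]  # ("foo",) for `use crate::foo::Bar;`
    name: str                     # identifier as defined in the target module
    alias: Optional[str]          # local name introduced by `as`, if any
    is_reexport: bool             # `pub use` inside lib.rs / mod.rs


_USE_RE = re.compile(r"^\s*(pub\s+)?use\s+(.+?)\s*;\s*$")


def parse_use(line: str, in_module_root: bool = False) -> List[RustUse]:
    """Record the imports named by one `use` line; emit no edges here."""
    m = _USE_RE.match(line)
    if not m:
        return []
    is_pub, path = bool(m.group(1)), m.group(2)
    if path.startswith(("self::", "super::")):
        return []  # filtered out
    if path.startswith("crate::"):
        path = path[len("crate::"):]
    elif not (is_pub and in_module_root):
        return []  # treated as an external-crate path
    if "{" in path:
        prefix = path[: path.index("{")].rstrip(":")
        items = path[path.index("{") + 1 : path.rindex("}")].split(",")
    else:
        prefix, _, tail = path.rpartition("::")
        items = [tail]
    module_path = tuple(prefix.split("::")) if prefix else ()
    uses = []
    for item in (i.strip() for i in items):
        if not item or item == "*" or "::" in item or "{" in item:
            continue  # globs and nested groups are skipped in this sketch
        name, _, alias = item.partition(" as ")
        uses.append(RustUse(module_path, name.strip(), alias.strip() or None,
                            is_pub and in_module_root))
    return uses
```

Under these assumptions, `parse_use("use crate::foo::{A, B as C};")` records two entries under module path `("foo",)`, the second carrying alias `C`, while glob, `self::`, `super::`, and external-crate lines yield an empty list.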

Produces two new edge relations:
- `uses`       (confidence_score 0.95) for ordinary imports
- `reexports`  (confidence_score 1.0)  for `pub use` in `lib.rs` / `mod.rs`

Both are `INFERRED`, not `EXTRACTED`, because tree-sitter alone cannot
fully verify type identity the way rust-analyzer would — two different
crates may define identically-named types and the heuristic picks a
best-effort candidate.

## Cache schema bump

`cache.py` now mixes a `_CACHE_SCHEMA_TAG = b"v2"` constant into every
file hash. Pre-existing cache entries (which lack the new `_rust_uses`
field) silently stop matching and get re-extracted on next run. No
manual `rm -rf graphify-out/cache` needed. Bump this tag whenever the
extractor output schema changes again.
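
The invalidation mechanism can be sketched as follows. Only the `_CACHE_SCHEMA_TAG = b"v2"` constant comes from the description; the `file_hash` name, the choice of SHA-256, and the separator byte are assumptions made for illustration, not the contents of `cache.py`.

```python
import hashlib

_CACHE_SCHEMA_TAG = b"v2"


def file_hash(content: bytes, schema_tag: bytes = _CACHE_SCHEMA_TAG) -> str:
    """Hash file content together with the extractor schema tag."""
    h = hashlib.sha256()
    h.update(schema_tag)
    h.update(b"\x00")  # separator so tag and content cannot run together
    h.update(content)
    return h.hexdigest()


source = b"use crate::types::Foo;\n"
old_key = file_hash(source, schema_tag=b"v1")  # key a pre-bump cache would hold
new_key = file_hash(source)                    # key computed after the bump
```

The same file bytes produce a different key once the tag changes, so old entries are never looked up again and the file is re-extracted.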

## Verified on arbitrage-bot (23 crates, 237 .rs files)

```
Before: uses=4, reexports=0
After:  uses=506, reexports=276  (+782 real cross-file edges)
```

Specifically on the `graduation` crate (6 files, shown in commit-msg
narrative above): 0 cross-file edges → 15 uses + 17 reexports = 32
new edges. `StrategyRuntime`, `ModeTransition`, `GraduationScorecard`
now correctly path to `StrategyLifecycleManager` in 1-2 hops.

## Tests

Added 5 new unit tests in `test_multilang.py` covering a minimal
multi-file Rust fixture (`tests/fixtures/rust_crate/`):
- `test_rust_use_crate_resolves_braced_imports`
- `test_rust_use_crate_resolves_single_ident_import`
- `test_rust_pub_use_reexports_in_lib_rs`
- `test_rust_cross_file_edges_are_inferred`
- `test_rust_use_crate_never_produces_dangling_imports_from`

Full test suite: 417 passed, 0 regressions. (7 pre-existing failures
in `tests/test_security.py` are environment-specific — they reproduce
identically on unpatched `main` when local DNS resolves `example.com`
to a private IP range, unrelated to this PR.)