- Add 4-stage Rust parser using tree-sitter-rust
  - repository_scanner.py: enumerate .rs files
  - function_extractor.py: extract functions/methods from AST
  - call_graph_builder.py: build bidirectional call graphs
  - unit_generator.py: generate dataset.json
  - test_pipeline.py: orchestrator with all 4 processing levels
- Register Rust in parser_adapter.py (detection + dispatch)
- Add 'rust' to CLI language whitelist (cli.py, parse.go)
- Add tree-sitter-rust to dependencies (requirements.txt, pyproject.toml)
- Update README with Rust in supported languages
- Add adding-a-parser.md guide with CLI whitelist and venv dependency docs
NahumKorda left a comment
Generated by Claude Code:
Summary
Well-structured PR that adds a Rust parser following the established 4-stage pipeline. The adding-a-parser.md guide is a valuable addition. Overall quality is good and the parser will work for typical Rust codebases, but there are several correctness issues to address.
Major Issues
M1. qualified_name uses :: separator, inconsistent with all other parsers
function_extractor.py builds IDs like src/lib.rs:Config::new, while the guide documents . as the separator (Config.method). The :: also makes splitting on : ambiguous: src/lib.rs:Config::new yields ["src/lib.rs", "Config", "", "new"] instead of ["src/lib.rs", "Config::new"]. Use . to match convention.
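The ambiguity can be reproduced in a few lines. This is a minimal sketch; split_function_id is a hypothetical helper for illustration and is not taken from the PR.

```python
def split_function_id(function_id: str) -> tuple[str, str]:
    """Split '<file path>:<qualified name>' at the first colon only."""
    file_path, _, qualified_name = function_id.partition(":")
    return file_path, qualified_name


colon_id = "src/lib.rs:Config::new"  # ID shape the Rust parser currently emits
dot_id = "src/lib.rs:Config.new"     # ID shape the guide documents

naive_colon = colon_id.split(":")    # four parts, one of them empty
naive_dot = dot_id.split(":")        # exactly two parts

print(naive_colon)
print(naive_dot)
print(split_function_id(colon_id))
```

Splitting at the first colon only would tolerate ::, but any downstream consumer that uses a plain split(":") still breaks, which is why matching the . convention is the safer fix.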
M2. Trait impl names include for keyword, creating IDs with spaces
impl Display for Config produces impl_name = "Display for Config", leading to src/lib.rs:Display for Config::fmt. Spaces in function IDs will break downstream processing. Normalize to Config (the implementing type).
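One possible normalization, sketched under the assumption that impl_name is the plain text between impl and the opening brace. implementing_type is a hypothetical helper name; if the grammar exposes the implementing type as its own field on the impl node, reading that field would likely be more robust than string handling.

```python
def implementing_type(impl_name: str) -> str:
    """Return the implementing type, dropping any leading 'Trait for ' part."""
    # rpartition returns ('', '', impl_name) when ' for ' is absent,
    # so inherent impls such as 'Config' pass through unchanged.
    _, separator, target = impl_name.rpartition(" for ")
    return (target if separator else impl_name).strip()


print(implementing_type("Display for Config"))
print(implementing_type("Config"))
```

This sketch does not strip generic parameters (Wrapper<T> stays as written) and assumes where clauses are not part of impl_name.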
M3. rstrip('::') bug in _extract_imports
base = parts.split('{')[0].rstrip('::')
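str.rstrip('::') treats its argument as a set of characters, not as the substring ::. Here the set is just {':'}, so the call strips every trailing colon, however many there are: it happens to give the right result for std::collections::, but it also removes a single stray : or an odd number of colons. Use .removesuffix('::') (Python 3.9+) to remove exactly one trailing :: instead.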
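The difference between the two calls is easiest to see side by side. A short sketch, assuming Python 3.9 or later for str.removesuffix; the input strings are made up for illustration.

```python
# rstrip takes a set of characters; removesuffix takes a literal suffix.
well_formed = "std::collections::"
odd_colons = "crate:::"

rstrip_ok = well_formed.rstrip("::")         # strips every trailing ':'
suffix_ok = well_formed.removesuffix("::")   # strips one literal '::'

rstrip_odd = odd_colons.rstrip("::")         # all three colons removed
suffix_odd = odd_colons.removesuffix("::")   # only the last two removed

# The character-set behaviour is more visible with a multi-character set.
rstrip_ext = "items.rs".rstrip(".rs")        # also eats the trailing 's' of 'items'
suffix_ext = "items.rs".removesuffix(".rs")

print(rstrip_ok, suffix_ok)
print(rstrip_odd, suffix_odd)
print(rstrip_ext, suffix_ext)
```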
M4. _is_async uses fragile string matching instead of AST
return code.strip().startswith('async ') or 'async fn' in code[:50]
A comment containing async in the first 50 chars causes false positives. Check tree-sitter children instead: any(child.type == 'async' for child in node.children).
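The false positive and an AST-style check can be sketched without tree-sitter installed. FakeNode is a hypothetical stand-in that mimics only the .type and .children attributes of a tree-sitter node; the function_modifiers nesting is an assumption about the grammar layout and should be verified against the installed tree-sitter-rust version.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Stand-in for a tree-sitter node (hypothetical, for illustration)."""
    type: str
    children: list["FakeNode"] = field(default_factory=list)


def is_async_by_string(code: str) -> bool:
    # The heuristic quoted in M4.
    return code.strip().startswith('async ') or 'async fn' in code[:50]


def is_async_by_ast(node) -> bool:
    # Look for an 'async' token directly under the function node, or one
    # level down under a modifiers node (assumed layout, see lead-in).
    for child in node.children:
        if child.type == "async":
            return True
        if child.type == "function_modifiers":
            if any(grandchild.type == "async" for grandchild in child.children):
                return True
    return False


commented = "// TODO: make this an async fn later\nfn load() {}"
sync_fn = FakeNode("function_item", [FakeNode("fn"), FakeNode("identifier")])
async_fn = FakeNode(
    "function_item",
    [FakeNode("function_modifiers", [FakeNode("async")]), FakeNode("fn")],
)

print(is_async_by_string(commented))  # string heuristic misfires on the comment
print(is_async_by_ast(sync_fn), is_async_by_ast(async_fn))
```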
Minor Issues
m1. RUST_BUILTINS has duplicates (contains, assert, assert_eq, debug, take, flatten). Harmless but suggests copy-paste oversight.
m2. _has_test_attribute walks parent siblings, but in tree-sitter-rust attributes may be direct children of the function node inside impl/mod blocks. Test detection may not work reliably in all contexts. Same issue for _has_route_attribute and _has_main_attribute.
m3. _resolve_simple_call checks the same impl block first for bare function calls. In Rust, foo() inside an impl block does NOT resolve to Self::foo() — only self.foo() does. This creates false call graph edges.
m4. Closures (|args| body) are not extracted. Code inside closures (common in iterator chains, async) won't appear in the call graph, potentially missing vulnerability-relevant call chains.
m5. _extract_imports ignores glob imports (use foo::*) and doesn't handle nested group uses (use std::{collections::{HashMap, BTreeMap}, io::Read}).
m6. Scanner excludes examples/ by default — other parsers don't exclude equivalent directories, and example code can contain vulnerability patterns.
m7. _resolve_method_call unique-name matching is overly aggressive — if there's exactly one method named process in the entire codebase, any .process() call resolves to it regardless of type.
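For m5, one way to handle nested groups is to expand the use path recursively on its text. This is a rough sketch, not the PR's code: expand_use and split_top_level are hypothetical names, and the sketch ignores self entries and as aliases. Walking the use-declaration nodes in the AST would be an alternative.

```python
def split_top_level(text: str) -> list[str]:
    """Split on commas that are not nested inside braces."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def expand_use(path: str) -> list[str]:
    """Expand 'a::{b::{C, D}, e::F}' into one full path per imported item."""
    path = path.strip()
    brace = path.find("{")
    if brace == -1:
        return [path]  # plain path or glob such as 'foo::*'
    prefix = path[:brace]
    inner = path[brace + 1 : path.rindex("}")]
    return [
        prefix + expanded
        for item in split_top_level(inner)
        for expanded in expand_use(item)
    ]


print(expand_use("std::{collections::{HashMap, BTreeMap}, io::Read}"))
print(expand_use("foo::*"))
```

Glob imports come back unchanged (foo::*), so the caller can decide whether to record them as a wildcard dependency on the module.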
Nits
n1. adding-a-parser.md shows Config.new (dot) in Rust examples, but the parser produces Config::new (double colon). Should be consistent.
n2. _is_public doesn't distinguish pub vs pub(crate) vs pub(super) — these have different attack surface implications.
n3. build is classified as constructor, but in Rust's builder pattern, build() is the finalizer, not the constructor.
n4. associated_function and entry_point unit types are returned by the classifier but not documented in adding-a-parser.md.
n5. get_dependencies/get_callers are duplicated identically in both CallGraphBuilder and UnitGenerator (consistent with other parsers, but noted).
Positives
- Excellent structural consistency with existing parsers
- All integration points updated (parser_adapter, cli.py, parse.go, pyproject.toml, requirements.txt, README)
- Comprehensive RUST_BUILTINS filter covering macros, iterators, Option/Result, async, logging
- Correctly handles impl blocks, trait implementations, self/Self:: calls, async, route attributes (actix/axum/rocket)
- adding-a-parser.md is genuinely useful and will lower the barrier for future contributions
- Stack-based traversal (no recursion), robust error handling with regex fallback