fix(kernel): emit 1-based UTF-16 code-unit columns from tree-sitter nodes#914
Conversation
…odes - Introduced `LineColumnIndex` in `crates/common/src/utils/position_utils.rs`, a pre-built per-source index that converts tree-sitter's 0-based UTF-8 byte columns to 1-based UTF-16 code-unit columns, with an ASCII fast path and full unit-test coverage (BMP, supplementary-plane emoji, CJK, combining marks, CRLF, empty lines). - Documented `Position.col` as a 1-based UTF-16 code-unit column in `crates/common/src/model/position.rs` to lock in the contract. - Changed `map_node` and `get_query_nodes` (`tree_sitter.rs`) to accept and use `&LineColumnIndex`; updated `process_tree_sitter_tree_request` (`tree_sitter_tree.rs`) accordingly. - Replaced `TreeSitterNode::from_ts_node` with `from_ts_node_with_index`, threading `LineColumnIndex` through the full ddsa call chain: `JsRuntime::execute_rule_internal` → `QueryMatchBridge::set_data` → `TsNodeBridge::insert`/`insert_capture`/`build_v8_node`. - Updated the two taint-flow region builders (`LocatedNode::new_cst` in `flow/graph.rs`, `position_eq` helper in `flow/java.rs`) and the deno ops (`op_ts_node_named_children`, `op_ts_node_parent`, `op_digraph_adjacency_list_to_dot`) to use the same helper. - Moved the `ctx_bridge` borrow in `op_ts_node_named_children` into the `else` branch where it is actually needed. - Added multibyte regression tests across all layers: `LineColumnIndex` unit matrix, `map_node`/`get_query_nodes` emoji+CJK tests, `TsNodeBridge` bridge test, `LocatedNode` regression test, `stella_compat::getCode` JS execution test, server integration test on base64-encoded multibyte Python, and SARIF/CSV serialization assertions. IDE-6037
|
🎯 Code Coverage (details) 🔗 Commit SHA: 1f052b9 | Docs | Datadog PR Page | Give us feedback! |
- Deleted the `find_node` function from `tree_sitter.rs` as it was not utilized in the current implementation. This cleanup helps streamline the code and improve maintainability.
There was a problem hiding this comment.
Pull request overview
This PR standardizes tree-sitter-derived column reporting on 1-based UTF-16 code-unit columns, aligning kernel/server/bridge outputs with LSP, VS Code, and SARIF expectations for non-ASCII source.
Changes:
- Adds
LineColumnIndexconversion helper and documentsPosition.colsemantics. - Threads UTF-16 column conversion through tree-sitter mapping, DDSA V8 node bridging, server AST responses, and taint-flow graph regions.
- Adds focused regression tests for multibyte columns plus SARIF/CSV pass-through behavior.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
crates/common/src/utils/position_utils.rs |
Adds byte-column to UTF-16 column conversion helper and tests. |
crates/common/src/model/position.rs |
Documents column semantics for positions. |
crates/static-analysis-kernel/src/analysis/tree_sitter.rs |
Uses LineColumnIndex in tree-sitter node/query mapping and adds multibyte tests. |
crates/static-analysis-kernel/src/analysis/ddsa_lib/runtime.rs |
Builds an index for DDSA query match bridging and adds an end-to-end multibyte test. |
crates/static-analysis-kernel/src/analysis/ddsa_lib/ops.rs |
Applies UTF-16 columns when dynamically inserting child/parent/graph nodes. |
crates/static-analysis-kernel/src/analysis/ddsa_lib/js/ts_node.rs |
Replaces raw byte-column constructor with indexed UTF-16 constructor and tests it. |
crates/static-analysis-kernel/src/analysis/ddsa_lib/bridge/ts_node.rs |
Threads LineColumnIndex through TS node insertion and V8 object creation. |
crates/static-analysis-kernel/src/analysis/ddsa_lib/bridge/query_match.rs |
Passes the index through query match capture insertion. |
crates/static-analysis-kernel/src/analysis/ddsa_lib/test_utils.rs |
Exposes test source text for indexed conversions. |
crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/java.rs |
Updates Java flow tests to compare UTF-16 columns. |
crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/graph.rs |
Updates CST located-node column construction and adds regression coverage. |
crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/graph_test_utils.rs |
Supplies full source text to graph test utilities. |
crates/static-analysis-server/src/tree_sitter_tree.rs |
Uses UTF-16 column conversion for server AST responses and tests multibyte output. |
crates/cli/src/sarif/sarif_utils.rs |
Adds SARIF column pass-through regression coverage. |
crates/cli/src/csv.rs |
Adds CSV column pass-through regression coverage. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Added `LineColumnIndex::compute_line_starts` (extracted from `new`) and `LineColumnIndex::from_parts` to allow constructing an index from a pre-computed `Vec<usize>` without re-scanning the source. - Added a `line_starts: Vec<usize>` field to `RootContext`, populated in `set_text` so the scan runs once per file rather than once per op call. - Added `RootContext::line_column_index()` which builds a `LineColumnIndex` from the cached offsets in O(1) (a Vec clone, not a source scan). - Updated `op_ts_node_named_children` and `op_ts_node_parent` to use the cached index, eliminating O(source_len × visited_nodes) behavior on deep tree walks or large files. Co-authored-by: Cursor <cursoragent@cursor.com>
The old signature built a new LineColumnIndex (O(source_len) scan) on every node, making op_digraph_adjacency_list_to_dot O(source_len × CST_vertices). Change new_cst to accept &LineColumnIndex<'_> so callers own the index lifetime and can build it once before any loop: - ops.rs (op_digraph_adjacency_list_to_dot): build once before the closure, re-using the RootContext-cached index when available. - graph_test_utils.rs (cst_dot_digraph): build once before the iterator. - graph.rs (located_node_multibyte_col_is_utf16 test): pass &idx directly. Co-authored-by: Cursor <cursoragent@cursor.com>
…n calculation The old comment had a tangent line ending in "10?" that contradicted the actual expected value of 11. Replace with a single, step-by-step UTF-16 prefix sum that maps directly to the assert_eq! below it. Co-authored-by: Cursor <cursoragent@cursor.com>
…y_nodes_cjk The comment had a first derivation stating end starts at byte 7, then a second "let's recompute" block arriving at byte 9. Remove the wrong first derivation and keep only the corrected step-by-step calculation. Co-authored-by: Cursor <cursoragent@cursor.com>
The previous header said the source was '🚀protectedName = 1;' (emoji
directly preceding the identifier), but the actual source places the emoji
in a string literal: '"\u{1F680}"; protectedName = 1;'. Rewrite the header
to match the real source layout and update the byte/UTF-16 derivation.
Co-authored-by: Cursor <cursoragent@cursor.com>
The previous docstring led with UTF-16 as the universal meaning, then footnoted the secrets-scanner exception. Rewrite so both contracts are equal-weight bullet points: kernel (UTF-16 code units) and secrets scanner (Unicode grapheme clusters). Consumers reading just the field-level doc now see the full picture without having to read further. Co-authored-by: Cursor <cursoragent@cursor.com>
The unwrap_or_else(|| LineColumnIndex::new("")) fallback would silently
produce col=1 for every node if tree text was never set, making bugs
invisible. Replace with expect() in both op_ts_node_named_children and
op_ts_node_parent, consistent with the already-present panic in
op_digraph_adjacency_list_to_dot for the same invariant.
Co-authored-by: Cursor <cursoragent@cursor.com>
…→UTF-16 transition This PR changes the kernel from emitting 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns (matching LSP / VS Code / SARIF v2.1). While the PR is in review, the regression workflow compares two SARIF runs as JSON strings — one built from `main` (byte cols) and one from this branch (UTF-16 cols). Every violation that lives on a line with a non-ASCII character drifts by N columns (N = number of multibyte chars before the position) and shows up as a "removed + added" pair. To unblock the CI we drop `startColumn` / `endColumn` from both the comparison key (in `parseFile`) and the summary location display, and compare by (file, ruleId, message, startLine, endLine) only. This is a ONE-SHOT loosening for the encoding transition. A stacked follow-up PR will restore the column fields once this PR lands on `main` (at which point both runs are on UTF-16 columns again and column-level regression detection becomes meaningful). Verified locally on numpy's `_core/tests/test_strings.py`: - before fix: 3 "removed" + 3 "added" false positives (lines with λ μ ·) - after fix: 0 + 0 (clean) Co-authored-by: Cursor <cursoragent@cursor.com>
Heads-up for reviewers — temporary loosening of
|
| startColumn | endColumn | |
|---|---|---|
| main (bytes) | 14 | 32 |
| branch (UTF-16) | 14 | 31 |
The string-equality check sees this as "violation X removed, new violation Y added" — even though it's the same violation, only the column number changed. No rule is firing in a new place, no rule has stopped firing.
Reproduced locally on _core/tests/test_strings.py:
- Without the fix: 3 "removed" + 3 "added" false positives (lines with
λ,μ,·) - With the fix: 0 + 0 (clean)
What 329df0c does
parseFilestripsstartColumn/endColumnfrom the in-memory region — the comparison key becomes(file, ruleId, message, startLine, endLine).- The summary table location string drops columns too, since they no longer participate in the comparison.
- A long inline comment explains why this is temporary and what the follow-up will do.
Why not just fix the regression script properly here?
We do — in #915, stacked on this branch. Once this PR merges to main, both pre and post runs are on UTF-16, so column comparison becomes meaningful again and #915 restores the original behaviour. Doing it in-place here would either require leaving main as the "loosened" baseline forever, or shipping two competing changes that touch the same lines in the same PR.
What we are NOT giving up
Column-level regression detection — for one PR cycle. The check still catches:
- Rules that stop firing
- Rules that fire in new files/lines
- Different error messages on the same line
That's the bulk of what this workflow protects against. Column shifts are rare in practice and the gap is fully closed by #915.
Sorry for the extra commit on this PR — happy to discuss alternatives if you'd prefer to land the kernel change with column-comparison fully disabled long-term, or with a different escape-hatch mechanism.
Replace hand-rolled LineColumnIndex implementation with a thin newtype wrapper around line_index::LineIndex. Public API and all call sites are unchanged. Simplifies RootContext cache to own the index directly, eliminating per-call Vec<usize> clones. Addresses reviewer feedback on #914.
…-16 transition - Reverted the temporary `startColumn` / `endColumn` strip introduced in #914 (commit 329df0c), which was added to unblock CI while the kernel was migrated from 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns. - Restored the comparison key to the full `physicalLocation` (including `startColumn` and `endColumn`) so the regression workflow once again detects column-level drift on top of file / rule / line changes. - Restored the summary table location string to the original `startLine:startColumn-endLine:endColumn` format to keep the GitHub Actions summary informative. - Safe to land once #914 is on `main`: both pre- and post-runs produce UTF-16 columns, so column-aware diffing is meaningful again and no false "removed + added" pairs are expected. IDE-6037
This stack of pull requests is managed by Graphite. Learn more about stacking. |
jasonforal
left a comment
There was a problem hiding this comment.
Ok, I started to leave some comments but then noticed an overall pattern, so some of my comments will overlap with this summary, but I left them in just to demonstrate what I meant. But they aren't exhaustive so please take a look over the full diff to find other examples:
Redundant Tests
Now that we've switched to the line-index crate, the only unit test we need for each fixed call site that previous just did + 1 is that we're calling byte_col_to_utf16_col. So just one source string is needed to prove that. We trust the byte_col_to_utf16_col unit tests cover all these other cases.
In general, I prefer tests to be single-responsibility (no redundant code paths tested) and highly readable
Unnecessary doc comments
Seems like a lot of LLM comments leaked into the doc comments (like how idx is passed to provide O(1) lookup). I think this is just noise. If it was a code comment, I'd question it but let it go, but since these are doc comments, they are just clutter.
I think the reason to pass in the index is self-explanatory and doesn't need any documentation.
Otherwise, this is a great fix!
| /// When the kernel produces UTF-16 columns (e.g. col 3 for a node after "🚀"), the CSV row | ||
| /// contains 3, not the raw byte col (5). | ||
| #[test] | ||
| fn test_export_csv_multibyte_col() { |
There was a problem hiding this comment.
Do you think this test is necessary? It seems to cover the same code path as test_export_csv
- Removed eight unit tests in position_utils.rs that duplicated coverage already provided by the line-index crate, collapsing them into a single wiring test that proves the wrapper delegates to LineIndex::to_wide(WideEncoding::Utf16, ..) and adds 1. - Dropped per-call-site multibyte tests covered by the new helper: test_export_csv_multibyte_col, sarif_region_carries_utf16_col, ts_node_line_col_multibyte, stella_compat_getcode_multibyte, test_map_node_multibyte_emoji, test_get_query_nodes_cjk, ts_node_bridge_multibyte_utf16_col, located_node_multibyte_col_is_utf16, and test_process_tree_sitter_tree_request_multibyte. - Trimmed LLM-flavored doc comments on insert_capture, RootContext (lci field, set_text, line_column_index), LocatedNode::new_cst, TreeSitterNode::from_ts_node_with_index, QueryMatchBridge::set_data, and map_node. - Removed noisy O(1)/pre-computed inline comments from three ops.rs call sites that explained self-evident caching behaviour. - Changed LineColumnIndex::byte_col_to_utf16_col to return Option<u32> instead of silently falling back to byte_col + 1, making the fallibility explicit at the type level. - Updated the five production call sites (tree_sitter.rs, js/ts_node.rs, js/flow/graph.rs) to use unwrap_or(byte_col as u32 + 1) so the ASCII fallback is visible at every caller, and updated test sites in js/flow/java.rs to use unwrap(). IDE-6037
…-16 transition - Reverted the temporary `startColumn` / `endColumn` strip introduced in #914 (commit 329df0c), which was added to unblock CI while the kernel was migrated from 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns. - Restored the comparison key to the full `physicalLocation` (including `startColumn` and `endColumn`) so the regression workflow once again detects column-level drift on top of file / rule / line changes. - Restored the summary table location string to the original `startLine:startColumn-endLine:endColumn` format to keep the GitHub Actions summary informative. - Safe to land once #914 is on `main`: both pre- and post-runs produce UTF-16 columns, so column-aware diffing is meaningful again and no false "removed + added" pairs are expected. IDE-6037

Summary
This PR fixes a long-standing semantic bug in how the static analyzer reports column numbers. Tree-sitter's
Point.columnis a 0-based UTF-8 byte offset, but our internalPosition.colfield was simply doingbyte_col + 1and calling it a column number. This produces correct results for ASCII-only source files, but silently emits wrong columns for any source containing multibyte characters (emoji, CJK ideographs, accented letters, combining marks, etc.).LSP, VS Code, and SARIF v2.1 all define columns as 1-based UTF-16 code units. This PR aligns
Position.colwith that standard.Errors
Playground:
VS Code:
What changed
Core conversion helper (
crates/common)Introduced
LineColumnIndexincrates/common/src/utils/position_utils.rs, a thin newtype wrapper around theline-indexcrate (the line/column indexing engine extracted from rust-analyzer):LineIndex::new.byte_col_to_utf16_col(row, byte_col) -> u32which delegates toLineIndex::to_wide(WideEncoding::Utf16, ..)and applies the+ 1adjustment for 1-based columns.line-indexsplits on\nonly, which mirrors tree-sitter's line model exactly. Tree-sitter does not treat bare\ras a line terminator, and for\r\nfiles the\ris part of the column count on the same line.RootContextowns anOption<LineColumnIndex>directly, populated once inset_text; ops borrow&LineColumnIndexper call (no re-scanning, no per-callVec<usize>clones).Note
Earlier revisions of this PR shipped a hand-rolled implementation with an ASCII
is_ascii()fast path and a per-scalarchar::len_utf16()slow path. Following reviewer feedback, commit 8947d3bb replaced it with theline-indexwrapper to lean on a battle-tested upstream implementation. All existing unit tests (ASCII, BMP non-ASCII, supplementary-plane emoji, CJK, combining marks, CRLF, EOL/empty lines) pass unchanged against the new backing.Contract documentation (
crates/common)Position.colis now documented as a 1-based UTF-16 code-unit column incrates/common/src/model/position.rs, locking in the semantics for all future contributors.Kernel — tree-sitter layer (
crates/static-analysis-kernel)map_nodeandget_query_nodesintree_sitter.rsnow accept a&LineColumnIndexparameter and use it instead of the raw+ 1offset.LineColumnIndexis built once perget_query_nodescall (one scan of the source), then reused for every node in that query result.Kernel — ddsa JavaScript bridge
The
LineColumnIndexis threaded through the full ddsa call chain that feeds tree-sitter nodes into V8:TreeSitterNode::from_ts_node(the old ASCII-only constructor) has been removed. There is now a single constructor,from_ts_node_with_index, which always takes a&LineColumnIndex. This eliminates the footgun of accidentally using the byte-based constructor on a multibyte source.op_ts_node_named_childrenandop_ts_node_parentbuild aLineColumnIndexfrom the source text stored inRootContextbefore inserting newly discovered nodes into the bridge. Thectx_bridgeborrow inop_ts_node_named_childrenwas moved into theelsebranch (when there are actually children to process) to keep the borrow scope tight and the intent clear.Kernel — taint-flow region builders
LocatedNode::new_cstinflow/graph.rs(used by the taint-flow graph engine) previously didnode.start_position().column + 1. It now accepts the full source string, builds aLineColumnIndex, and emits the same UTF-16 column asfrom_ts_node_with_index— ensuring the taint-flow graph deduplication key stays consistent with the rest of the system.The
position_eqtest helper inflow/java.rswas updated to compare againstLineColumnIndex-derived columns rather than raw byte offsets.Server (
crates/static-analysis-server)process_tree_sitter_tree_requestnow builds aLineColumnIndexfrom the decoded source before callingmap_node, so IDE AST responses carry UTF-16 columns.SARIF and CSV output (
crates/cli)No changes were required to the serialization logic — SARIF and CSV both read
Position.coldirectly, so fixing the kernel's output automatically fixes their output. Focused tests were added to assert that the values pass through unchanged.Why UTF-16?
Blast radius
Position.colcontains (now UTF-16)LineColumnIndexhelperget_position_in_string)The change is intentionally a semantic breaking change for columns on non-ASCII input. Any downstream consumer (e.g. suppression rules keyed on exact column numbers for non-ASCII code) will need to update their column values. For ASCII code — which is the overwhelming majority of suppression use cases — nothing changes.
What we considered and rejected
Storing
LineColumnIndexin the bridge instead of passing it as a parameterRejected because the source text lives in
RootContextwhich is reset per file, and the bridge is long-lived across files. Storing the index in the bridge would require either cloning the source string (expensive) or carefully managing a lifetime that would become dangling. Passing&LineColumnIndexas a parameter keeps lifetimes simple and makes the dependency explicit.Keeping
TreeSitterNode::from_ts_node(the old byte-based constructor) for ASCII testsRejected because having two constructors with different semantics is a footgun — a future test author could reach for the shorter name on a multibyte source and get silently wrong columns. The ASCII tests were trivially updated to use
from_ts_node_with_indexwith aLineColumnIndex::newcall; for pure ASCII sources the results are identical.Using
str::lines()for line splitting inLineColumnIndexRejected because
str::lines()also splits on bare\r(old Mac OS 9 line endings), which tree-sitter does not recognise as a line terminator. Diverging here would produce wrong row/column pairs on\r\nfiles. The\n-only scan is intentional and documented at the call site.Using
usizefor column numbers everywherebyte_col_to_utf16_colreturnsu32to matchTreeSitterNode's field types andPosition.col. TheLocatedNode::Cst.colfield usesusize(a pre-existing inconsistency); the cast at the boundary is harmless but noted as a mild code smell if a future cleanup is desired.Tests added
LineColumnIndexunit matrix — ASCII fast path, BMP non-ASCII (é), supplementary-plane emoji (🚀), CJK (日本), combining marks, CRLF line endings, EOL boundaries, empty lines.map_nodemultibyte — emoji prefix shiftsnumidentifier to UTF-16 col 11 instead of raw byte col 13.get_query_nodesCJK — 日 (3 UTF-8 bytes, 1 UTF-16 unit) shiftsendidentifier to UTF-16 col 8.TsNodeBridgemultibyte — v8_startColfield is 7 (UTF-16) forabcafter"🚀";.TreeSitterNode::from_ts_node_with_indexmultibyte — emoji and CJK prefix tests.LocatedNode::new_cstregression — asserts UTF-16 col differs from raw byte col + 1 on multibyte input and agrees on ASCII input.stella_compat::getCodemultibyte — end-to-end JS execution test;violation.base_region.start_colis 7 (UTF-16), not 9 (byte).x = "🚀"; num = 5parsed as Python; asserts a node withstart.col == 11exists and no node reportsstart.col == 13.startColumn/endColumnpass through unchanged.Notes for the reviewers
See #914 (comment)