Skip to content

fix(kernel): emit 1-based UTF-16 code-unit columns from tree-sitter nodes#914

Merged
robertohuertasm-datadog merged 12 commits into
mainfrom
rob/fix/tree-sitter-byte-column-bug-IDE-6037
May 20, 2026
Merged

fix(kernel): emit 1-based UTF-16 code-unit columns from tree-sitter nodes#914
robertohuertasm-datadog merged 12 commits into
mainfrom
rob/fix/tree-sitter-byte-column-bug-IDE-6037

Conversation

@robertohuertasm-datadog
Copy link
Copy Markdown
Contributor

@robertohuertasm-datadog robertohuertasm-datadog commented May 17, 2026

Summary

This PR fixes a long-standing semantic bug in how the static analyzer reports column numbers. Tree-sitter's Point.column is a 0-based UTF-8 byte offset, but our internal Position.col field was simply doing byte_col + 1 and calling it a column number. This produces correct results for ASCII-only source files, but silently emits wrong columns for any source containing multibyte characters (emoji, CJK ideographs, accented letters, combining marks, etc.).

LSP, VS Code, and SARIF v2.1 all define columns as 1-based UTF-16 code units. This PR aligns Position.col with that standard.

Errors

Playground:

image

VS Code:

image

What changed

Core conversion helper (crates/common)

Introduced LineColumnIndex in crates/common/src/utils/position_utils.rs, a thin newtype wrapper around the line-index crate (the line/column indexing engine extracted from rust-analyzer):

  • Built once per source string via LineIndex::new.
  • Exposes byte_col_to_utf16_col(row, byte_col) -> u32 which delegates to LineIndex::to_wide(WideEncoding::Utf16, ..) and applies the + 1 adjustment for 1-based columns.
  • line-index splits on \n only, which mirrors tree-sitter's line model exactly. Tree-sitter does not treat bare \r as a line terminator, and for \r\n files the \r is part of the column count on the same line.
  • RootContext owns an Option<LineColumnIndex> directly, populated once in set_text; ops borrow &LineColumnIndex per call (no re-scanning, no per-call Vec<usize> clones).

Note

Earlier revisions of this PR shipped a hand-rolled implementation with an ASCII is_ascii() fast path and a per-scalar char::len_utf16() slow path. Following reviewer feedback, commit 8947d3bb replaced it with the line-index wrapper to lean on a battle-tested upstream implementation. All existing unit tests (ASCII, BMP non-ASCII, supplementary-plane emoji, CJK, combining marks, CRLF, EOL/empty lines) pass unchanged against the new backing.

Contract documentation (crates/common)

Position.col is now documented as a 1-based UTF-16 code-unit column in crates/common/src/model/position.rs, locking in the semantics for all future contributors.

Kernel — tree-sitter layer (crates/static-analysis-kernel)

  • map_node and get_query_nodes in tree_sitter.rs now accept a &LineColumnIndex parameter and use it instead of the raw + 1 offset.
  • LineColumnIndex is built once per get_query_nodes call (one scan of the source), then reused for every node in that query result.

Kernel — ddsa JavaScript bridge

The LineColumnIndex is threaded through the full ddsa call chain that feeds tree-sitter nodes into V8:

JsRuntime::execute_rule_internal
  └─ QueryMatchBridge::set_data
       └─ TsNodeBridge::insert_capture / insert / build_v8_node
            └─ TreeSitterNode::from_ts_node_with_index   ← single source of truth
  • TreeSitterNode::from_ts_node (the old ASCII-only constructor) has been removed. There is now a single constructor, from_ts_node_with_index, which always takes a &LineColumnIndex. This eliminates the footgun of accidentally using the byte-based constructor on a multibyte source.
  • The deno ops op_ts_node_named_children and op_ts_node_parent build a LineColumnIndex from the source text stored in RootContext before inserting newly discovered nodes into the bridge. The ctx_bridge borrow in op_ts_node_named_children was moved into the else branch (when there are actually children to process) to keep the borrow scope tight and the intent clear.

Kernel — taint-flow region builders

LocatedNode::new_cst in flow/graph.rs (used by the taint-flow graph engine) previously did node.start_position().column + 1. It now accepts the full source string, builds a LineColumnIndex, and emits the same UTF-16 column as from_ts_node_with_index — ensuring the taint-flow graph deduplication key stays consistent with the rest of the system.

The position_eq test helper in flow/java.rs was updated to compare against LineColumnIndex-derived columns rather than raw byte offsets.

Server (crates/static-analysis-server)

process_tree_sitter_tree_request now builds a LineColumnIndex from the decoded source before calling map_node, so IDE AST responses carry UTF-16 columns.

SARIF and CSV output (crates/cli)

No changes were required to the serialization logic — SARIF and CSV both read Position.col directly, so fixing the kernel's output automatically fixes their output. Focused tests were added to assert that the values pass through unchanged.


Why UTF-16?

  • SARIF v2.1 specifies column numbers as 1-based UTF-16 code units (the format's default encoding is UTF-16).
  • LSP / VS Code use 0-based UTF-16 code units; our 1-based variant is the natural complement.
  • CJK characters (e.g. 日, 本) are 3 UTF-8 bytes but 1 UTF-16 unit — they are common in production code and the byte-based columns were visibly wrong.
  • Emoji (e.g. 🚀) are 4 UTF-8 bytes but 2 UTF-16 units (surrogate pair) — even more extreme divergence.
  • For pure ASCII (the vast majority of lines), byte col == UTF-16 col, so there is zero behavioral change on ASCII-only files.

Blast radius

Area Impact
ASCII-only source files No change — byte col == UTF-16 col for ASCII
Source files with CJK, accented characters Column numbers decrease (fewer UTF-16 units than UTF-8 bytes)
Source files with emoji (surrogate pairs) Column numbers decrease further (4 bytes → 2 UTF-16 units)
SARIF output Now conforms to SARIF v2.1 UTF-16 column semantics
CSV output Faithfully serializes whatever Position.col contains (now UTF-16)
IDE / server AST responses Column numbers updated to UTF-16
Taint-flow graph deduplication Consistent — all paths use the same LineColumnIndex helper
Secrets scanner (get_position_in_string) Not affected — uses a separate grapheme-based column contract that does not flow through the kernel output

The change is intentionally a semantic breaking change for columns on non-ASCII input. Any downstream consumer (e.g. suppression rules keyed on exact column numbers for non-ASCII code) will need to update their column values. For ASCII code — which is the overwhelming majority of suppression use cases — nothing changes.


What we considered and rejected

Storing LineColumnIndex in the bridge instead of passing it as a parameter
Rejected because the source text lives in RootContext which is reset per file, and the bridge is long-lived across files. Storing the index in the bridge would require either cloning the source string (expensive) or carefully managing a lifetime that would become dangling. Passing &LineColumnIndex as a parameter keeps lifetimes simple and makes the dependency explicit.

Keeping TreeSitterNode::from_ts_node (the old byte-based constructor) for ASCII tests
Rejected because having two constructors with different semantics is a footgun — a future test author could reach for the shorter name on a multibyte source and get silently wrong columns. The ASCII tests were trivially updated to use from_ts_node_with_index with a LineColumnIndex::new call; for pure ASCII sources the results are identical.

Using str::lines() for line splitting in LineColumnIndex
Rejected because str::lines() also splits on bare \r (old Mac OS 9 line endings), which tree-sitter does not recognise as a line terminator. Diverging here would produce wrong row/column pairs on \r\n files. The \n-only scan is intentional and documented at the call site.

Using usize for column numbers everywhere
byte_col_to_utf16_col returns u32 to match TreeSitterNode's field types and Position.col. The LocatedNode::Cst.col field uses usize (a pre-existing inconsistency); the cast at the boundary is harmless but noted as a mild code smell if a future cleanup is desired.


Tests added

  • LineColumnIndex unit matrix — ASCII fast path, BMP non-ASCII (é), supplementary-plane emoji (🚀), CJK (日本), combining marks, CRLF line endings, EOL boundaries, empty lines.
  • map_node multibyte — emoji prefix shifts num identifier to UTF-16 col 11 instead of raw byte col 13.
  • get_query_nodes CJK — 日 (3 UTF-8 bytes, 1 UTF-16 unit) shifts end identifier to UTF-16 col 8.
  • TsNodeBridge multibyte — v8 _startCol field is 7 (UTF-16) for abc after "🚀"; .
  • TreeSitterNode::from_ts_node_with_index multibyte — emoji and CJK prefix tests.
  • LocatedNode::new_cst regression — asserts UTF-16 col differs from raw byte col + 1 on multibyte input and agrees on ASCII input.
  • stella_compat::getCode multibyte — end-to-end JS execution test; violation.base_region.start_col is 7 (UTF-16), not 9 (byte).
  • Server integration test — base64-encoded x = "🚀"; num = 5 parsed as Python; asserts a node with start.col == 11 exists and no node reports start.col == 13.
  • SARIF serializationstartColumn/endColumn pass through unchanged.
  • CSV serialization — UTF-16 col 3 appears in the CSV row, not raw byte col.

Notes for the reviewers

See #914 (comment)

…odes

- Introduced `LineColumnIndex` in `crates/common/src/utils/position_utils.rs`, a pre-built per-source index that converts tree-sitter's 0-based UTF-8 byte columns to 1-based UTF-16 code-unit columns, with an ASCII fast path and full unit-test coverage (BMP, supplementary-plane emoji, CJK, combining marks, CRLF, empty lines).
- Documented `Position.col` as a 1-based UTF-16 code-unit column in `crates/common/src/model/position.rs` to lock in the contract.
- Changed `map_node` and `get_query_nodes` (`tree_sitter.rs`) to accept and use `&LineColumnIndex`; updated `process_tree_sitter_tree_request` (`tree_sitter_tree.rs`) accordingly.
- Replaced `TreeSitterNode::from_ts_node` with `from_ts_node_with_index`, threading `LineColumnIndex` through the full ddsa call chain: `JsRuntime::execute_rule_internal` → `QueryMatchBridge::set_data` → `TsNodeBridge::insert`/`insert_capture`/`build_v8_node`.
- Updated the two taint-flow region builders (`LocatedNode::new_cst` in `flow/graph.rs`, `position_eq` helper in `flow/java.rs`) and the deno ops (`op_ts_node_named_children`, `op_ts_node_parent`, `op_digraph_adjacency_list_to_dot`) to use the same helper.
- Moved the `ctx_bridge` borrow in `op_ts_node_named_children` into the `else` branch where it is actually needed.
- Added multibyte regression tests across all layers: `LineColumnIndex` unit matrix, `map_node`/`get_query_nodes` emoji+CJK tests, `TsNodeBridge` bridge test, `LocatedNode` regression test, `stella_compat::getCode` JS execution test, server integration test on base64-encoded multibyte Python, and SARIF/CSV serialization assertions.

IDE-6037
@datadog-prod-us1-6
Copy link
Copy Markdown

datadog-prod-us1-6 Bot commented May 17, 2026

🎯 Code Coverage (details)
Patch Coverage: 99.39%
Overall Coverage: 85.45% (+0.06%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 1f052b9 | Docs | Datadog PR Page | Give us feedback!

- Deleted the `find_node` function from `tree_sitter.rs` as it was not utilized in the current implementation. This cleanup helps streamline the code and improve maintainability.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR standardizes tree-sitter-derived column reporting on 1-based UTF-16 code-unit columns, aligning kernel/server/bridge outputs with LSP, VS Code, and SARIF expectations for non-ASCII source.

Changes:

  • Adds LineColumnIndex conversion helper and documents Position.col semantics.
  • Threads UTF-16 column conversion through tree-sitter mapping, DDSA V8 node bridging, server AST responses, and taint-flow graph regions.
  • Adds focused regression tests for multibyte columns plus SARIF/CSV pass-through behavior.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
crates/common/src/utils/position_utils.rs Adds byte-column to UTF-16 column conversion helper and tests.
crates/common/src/model/position.rs Documents column semantics for positions.
crates/static-analysis-kernel/src/analysis/tree_sitter.rs Uses LineColumnIndex in tree-sitter node/query mapping and adds multibyte tests.
crates/static-analysis-kernel/src/analysis/ddsa_lib/runtime.rs Builds an index for DDSA query match bridging and adds an end-to-end multibyte test.
crates/static-analysis-kernel/src/analysis/ddsa_lib/ops.rs Applies UTF-16 columns when dynamically inserting child/parent/graph nodes.
crates/static-analysis-kernel/src/analysis/ddsa_lib/js/ts_node.rs Replaces raw byte-column constructor with indexed UTF-16 constructor and tests it.
crates/static-analysis-kernel/src/analysis/ddsa_lib/bridge/ts_node.rs Threads LineColumnIndex through TS node insertion and V8 object creation.
crates/static-analysis-kernel/src/analysis/ddsa_lib/bridge/query_match.rs Passes the index through query match capture insertion.
crates/static-analysis-kernel/src/analysis/ddsa_lib/test_utils.rs Exposes test source text for indexed conversions.
crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/java.rs Updates Java flow tests to compare UTF-16 columns.
crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/graph.rs Updates CST located-node column construction and adds regression coverage.
crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/graph_test_utils.rs Supplies full source text to graph test utilities.
crates/static-analysis-server/src/tree_sitter_tree.rs Uses UTF-16 column conversion for server AST responses and tests multibyte output.
crates/cli/src/sarif/sarif_utils.rs Adds SARIF column pass-through regression coverage.
crates/cli/src/csv.rs Adds CSV column pass-through regression coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/ops.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/ops.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/graph.rs Outdated
Comment thread crates/common/src/utils/position_utils.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/tree_sitter.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/runtime.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/tree_sitter.rs Outdated
Comment thread crates/common/src/model/position.rs Outdated
robertohuertasm-datadog and others added 8 commits May 17, 2026 12:24
- Added `LineColumnIndex::compute_line_starts` (extracted from `new`) and
  `LineColumnIndex::from_parts` to allow constructing an index from a
  pre-computed `Vec<usize>` without re-scanning the source.
- Added a `line_starts: Vec<usize>` field to `RootContext`, populated in
  `set_text` so the scan runs once per file rather than once per op call.
- Added `RootContext::line_column_index()` which builds a `LineColumnIndex`
  from the cached offsets in O(1) (a Vec clone, not a source scan).
- Updated `op_ts_node_named_children` and `op_ts_node_parent` to use the
  cached index, eliminating O(source_len × visited_nodes) behavior on deep
  tree walks or large files.

Co-authored-by: Cursor <cursoragent@cursor.com>
The old signature built a new LineColumnIndex (O(source_len) scan) on every
node, making op_digraph_adjacency_list_to_dot O(source_len × CST_vertices).

Change new_cst to accept &LineColumnIndex<'_> so callers own the index
lifetime and can build it once before any loop:
- ops.rs (op_digraph_adjacency_list_to_dot): build once before the closure,
  re-using the RootContext-cached index when available.
- graph_test_utils.rs (cst_dot_digraph): build once before the iterator.
- graph.rs (located_node_multibyte_col_is_utf16 test): pass &idx directly.

Co-authored-by: Cursor <cursoragent@cursor.com>
…n calculation

The old comment had a tangent line ending in "10?" that contradicted the
actual expected value of 11. Replace with a single, step-by-step UTF-16
prefix sum that maps directly to the assert_eq! below it.

Co-authored-by: Cursor <cursoragent@cursor.com>
…y_nodes_cjk

The comment had a first derivation stating end starts at byte 7, then a
second "let's recompute" block arriving at byte 9. Remove the wrong first
derivation and keep only the corrected step-by-step calculation.

Co-authored-by: Cursor <cursoragent@cursor.com>
The previous header said the source was '🚀protectedName = 1;' (emoji
directly preceding the identifier), but the actual source places the emoji
in a string literal: '"\u{1F680}"; protectedName = 1;'. Rewrite the header
to match the real source layout and update the byte/UTF-16 derivation.

Co-authored-by: Cursor <cursoragent@cursor.com>
The previous docstring led with UTF-16 as the universal meaning, then
footnoted the secrets-scanner exception. Rewrite so both contracts are
equal-weight bullet points: kernel (UTF-16 code units) and secrets scanner
(Unicode grapheme clusters). Consumers reading just the field-level doc now
see the full picture without having to read further.

Co-authored-by: Cursor <cursoragent@cursor.com>
The unwrap_or_else(|| LineColumnIndex::new("")) fallback would silently
produce col=1 for every node if tree text was never set, making bugs
invisible. Replace with expect() in both op_ts_node_named_children and
op_ts_node_parent, consistent with the already-present panic in
op_digraph_adjacency_list_to_dot for the same invariant.

Co-authored-by: Cursor <cursoragent@cursor.com>
…→UTF-16 transition

This PR changes the kernel from emitting 1-based UTF-8 byte columns to
1-based UTF-16 code-unit columns (matching LSP / VS Code / SARIF v2.1).

While the PR is in review, the regression workflow compares two SARIF
runs as JSON strings — one built from `main` (byte cols) and one from
this branch (UTF-16 cols). Every violation that lives on a line with a
non-ASCII character drifts by N columns (N = number of multibyte chars
before the position) and shows up as a "removed + added" pair.

To unblock the CI we drop `startColumn` / `endColumn` from both the
comparison key (in `parseFile`) and the summary location display, and
compare by (file, ruleId, message, startLine, endLine) only.

This is a ONE-SHOT loosening for the encoding transition. A stacked
follow-up PR will restore the column fields once this PR lands on
`main` (at which point both runs are on UTF-16 columns again and
column-level regression detection becomes meaningful).

Verified locally on numpy's `_core/tests/test_strings.py`:
- before fix: 3 "removed" + 3 "added" false positives (lines with λ μ ·)
- after  fix: 0 + 0 (clean)

Co-authored-by: Cursor <cursoragent@cursor.com>
@robertohuertasm-datadog
Copy link
Copy Markdown
Contributor Author

robertohuertasm-datadog commented May 17, 2026

Heads-up for reviewers — temporary loosening of check-regressions.js in 329df0c

Update: the follow-up that restores the column-aware comparison is now open as #915, stacked on top of this branch. It will close the loop the moment this PR lands on main.

TL;DR: this PR also includes a one-shot, intentional weakening of the regression workflow (startColumn / endColumn removed from the comparison key). #915 reverts that part immediately after this lands on main.

Why was the regression CI failing?

The workflow runs the analyzer twice — once on main, once on the PR branch — and compares the two SARIF outputs as JSON strings (set1 vs set2 in check-regressions.js). This PR changes the kernel from emitting 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns. While the PR is in review:

  • main build → byte columns
  • branch build → UTF-16 columns

Every violation that lives on a line containing a non-ASCII character drifts by N columns (N = number of multibyte chars before the violation position). For example, on numpy/_core/tests/test_strings.py line 1284 ("λ" * 5 + "μ" * 2):

startColumn endColumn
main (bytes) 14 32
branch (UTF-16) 14 31

The string-equality check sees this as "violation X removed, new violation Y added" — even though it's the same violation, only the column number changed. No rule is firing in a new place, no rule has stopped firing.

Reproduced locally on _core/tests/test_strings.py:

  • Without the fix: 3 "removed" + 3 "added" false positives (lines with λ, μ, ·)
  • With the fix: 0 + 0 (clean)

What 329df0c does

  1. parseFile strips startColumn / endColumn from the in-memory region — the comparison key becomes (file, ruleId, message, startLine, endLine).
  2. The summary table location string drops columns too, since they no longer participate in the comparison.
  3. A long inline comment explains why this is temporary and what the follow-up will do.

Why not just fix the regression script properly here?

We do — in #915, stacked on this branch. Once this PR merges to main, both pre and post runs are on UTF-16, so column comparison becomes meaningful again and #915 restores the original behaviour. Doing it in-place here would either require leaving main as the "loosened" baseline forever, or shipping two competing changes that touch the same lines in the same PR.

What we are NOT giving up

Column-level regression detection — for one PR cycle. The check still catches:

  • Rules that stop firing
  • Rules that fire in new files/lines
  • Different error messages on the same line

That's the bulk of what this workflow protects against. Column shifts are rare in practice and the gap is fully closed by #915.

Sorry for the extra commit on this PR — happy to discuss alternatives if you'd prefer to land the kernel change with column-comparison fully disabled long-term, or with a different escape-hatch mechanism.

Replace hand-rolled LineColumnIndex implementation with a thin newtype
wrapper around line_index::LineIndex. Public API and all call sites are
unchanged. Simplifies RootContext cache to own the index directly,
eliminating per-call Vec<usize> clones.

Addresses reviewer feedback on #914.
robertohuertasm-datadog added a commit that referenced this pull request May 19, 2026
…-16 transition

- Reverted the temporary `startColumn` / `endColumn` strip introduced in #914 (commit 329df0c), which was added to unblock CI while the kernel was migrated from 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns.
- Restored the comparison key to the full `physicalLocation` (including `startColumn` and `endColumn`) so the regression workflow once again detects column-level drift on top of file / rule / line changes.
- Restored the summary table location string to the original `startLine:startColumn-endLine:endColumn` format to keep the GitHub Actions summary informative.
- Safe to land once #914 is on `main`: both pre- and post-runs produce UTF-16 columns, so column-aware diffing is meaningful again and no false "removed + added" pairs are expected.

IDE-6037
Copy link
Copy Markdown
Contributor Author

Comment thread crates/common/src/utils/position_utils.rs Outdated
Copy link
Copy Markdown
Collaborator

@jasonforal jasonforal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I started to leave some comments but then noticed an overall pattern, so some of my comments will overlap with this summary, but I left them in just to demonstrate what I meant. But they aren't exhaustive so please take a look over the full diff to find other examples:

Redundant Tests
Now that we've switched to the line-index crate, the only unit test we need for each fixed call site that previous just did + 1 is that we're calling byte_col_to_utf16_col. So just one source string is needed to prove that. We trust the byte_col_to_utf16_col unit tests cover all these other cases.

In general, I prefer tests to be single-responsibility (no redundant code paths tested) and highly readable

Unnecessary doc comments
Seems like a lot of LLM comments leaked into the doc comments (like how idx is passed to provide O(1) lookup). I think this is just noise. If it was a code comment, I'd question it but let it go, but since these are doc comments, they are just clutter.

I think the reason to pass in the index is self-explanatory and doesn't need any documentation.


Otherwise, this is a great fix!

Comment thread crates/cli/src/csv.rs Outdated
/// When the kernel produces UTF-16 columns (e.g. col 3 for a node after "🚀"), the CSV row
/// contains 3, not the raw byte col (5).
#[test]
fn test_export_csv_multibyte_col() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this test is necessary? It seems to cover the same code path as test_export_csv

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, removed in 1f052b9.

Comment thread crates/cli/src/sarif/sarif_utils.rs Outdated
Comment thread crates/common/src/utils/position_utils.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/bridge/ts_node.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/bridge/ts_node.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/context/root.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/graph.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/js/ts_node.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/js/ts_node.rs Outdated
Comment thread crates/static-analysis-kernel/src/analysis/ddsa_lib/runtime.rs Outdated
- Removed eight unit tests in position_utils.rs that duplicated coverage already provided by the line-index crate, collapsing them into a single wiring test that proves the wrapper delegates to LineIndex::to_wide(WideEncoding::Utf16, ..) and adds 1.
- Dropped per-call-site multibyte tests covered by the new helper: test_export_csv_multibyte_col, sarif_region_carries_utf16_col, ts_node_line_col_multibyte, stella_compat_getcode_multibyte, test_map_node_multibyte_emoji, test_get_query_nodes_cjk, ts_node_bridge_multibyte_utf16_col, located_node_multibyte_col_is_utf16, and test_process_tree_sitter_tree_request_multibyte.
- Trimmed LLM-flavored doc comments on insert_capture, RootContext (lci field, set_text, line_column_index), LocatedNode::new_cst, TreeSitterNode::from_ts_node_with_index, QueryMatchBridge::set_data, and map_node.
- Removed noisy O(1)/pre-computed inline comments from three ops.rs call sites that explained self-evident caching behaviour.
- Changed LineColumnIndex::byte_col_to_utf16_col to return Option<u32> instead of silently falling back to byte_col + 1, making the fallibility explicit at the type level.
- Updated the five production call sites (tree_sitter.rs, js/ts_node.rs, js/flow/graph.rs) to use unwrap_or(byte_col as u32 + 1) so the ASCII fallback is visible at every caller, and updated test sites in js/flow/java.rs to use unwrap().

IDE-6037
Comment thread crates/common/src/model/position.rs
@robertohuertasm-datadog robertohuertasm-datadog merged commit 39f4b84 into main May 20, 2026
108 of 126 checks passed
@robertohuertasm-datadog robertohuertasm-datadog deleted the rob/fix/tree-sitter-byte-column-bug-IDE-6037 branch May 20, 2026 17:24
robertohuertasm-datadog added a commit that referenced this pull request May 20, 2026
…-16 transition

- Reverted the temporary `startColumn` / `endColumn` strip introduced in #914 (commit 329df0c), which was added to unblock CI while the kernel was migrated from 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns.
- Restored the comparison key to the full `physicalLocation` (including `startColumn` and `endColumn`) so the regression workflow once again detects column-level drift on top of file / rule / line changes.
- Restored the summary table location string to the original `startLine:startColumn-endLine:endColumn` format to keep the GitHub Actions summary informative.
- Safe to land once #914 is on `main`: both pre- and post-runs produce UTF-16 columns, so column-aware diffing is meaningful again and no false "removed + added" pairs are expected.

IDE-6037
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants