fix(kernel): emit 1-based UTF-16 code-unit columns from tree-sitter nodes by robertohuertasm-datadog · Pull Request #914 · DataDog/datadog-static-analyzer

robertohuertasm-datadog · 2026-05-17T09:43:40Z

Summary

This PR fixes a long-standing semantic bug in how the static analyzer reports column numbers. Tree-sitter's Point.column is a 0-based UTF-8 byte offset, but our internal Position.col field was simply computed as byte_col + 1 and treated as a column number. That is correct for ASCII-only source files, but it silently yields wrong columns for any position that follows a multibyte character on its line (emoji, CJK ideographs, accented letters, combining marks, etc.).

LSP, VS Code, and SARIF v2.1 all measure columns in UTF-16 code units; SARIF columns are 1-based, while LSP positions are 0-based. This PR aligns Position.col with the 1-based UTF-16 convention.
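To make the difference concrete, here is a minimal standard-library sketch (not the code in this PR) that computes both column flavours for a byte offset on a single line. The function names are hypothetical and exist only for illustration.

```rust
/// Old behaviour: treat tree-sitter's 0-based byte column as if it were a
/// character column and just add 1.
fn byte_col_1_based(byte_col: usize) -> u32 {
    byte_col as u32 + 1
}

/// New behaviour: count the UTF-16 code units that precede `byte_col` on the
/// line, then add 1. Returns `None` if `byte_col` is past the end of the line
/// or falls in the middle of a multibyte character.
fn utf16_col_1_based(line: &str, byte_col: usize) -> Option<u32> {
    let prefix = line.get(..byte_col)?;
    Some(prefix.encode_utf16().count() as u32 + 1)
}
```

For the line `é = 1`, the `=` sits at byte offset 3, so the byte-based column is 4 while the UTF-16 column is 3; on a pure ASCII line the two functions agree.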

Errors

Playground:

VS Code:

What changed

Core conversion helper (`crates/common`)

Introduced LineColumnIndex in crates/common/src/utils/position_utils.rs, a thin newtype wrapper around the line-index crate (the line/column indexing engine extracted from rust-analyzer):

Built once per source string via LineIndex::new.
Exposes byte_col_to_utf16_col(row, byte_col) -> u32 which delegates to LineIndex::to_wide(WideEncoding::Utf16, ..) and applies the + 1 adjustment for 1-based columns.
line-index splits on \n only, which mirrors tree-sitter's line model exactly. Tree-sitter does not treat bare \r as a line terminator, and for \r\n files the \r is part of the column count on the same line.
RootContext owns an Option<LineColumnIndex> directly, populated once in set_text; ops borrow &LineColumnIndex per call (no re-scanning, no per-call Vec<usize> clones).
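The shape of such a helper can be sketched with the standard library alone. This is a hypothetical stand-in, not the wrapper from this PR (which delegates to the line-index crate); the type name `LineStarts` and the `Option<u32>` return type are assumptions made for the sketch.

```rust
/// Hypothetical stand-in for a per-source line index. Lines are split on `\n`
/// only, so a `\r` that precedes `\n` stays part of its line.
struct LineStarts<'a> {
    text: &'a str,
    /// Byte offset of the first byte of every line.
    starts: Vec<usize>,
}

impl<'a> LineStarts<'a> {
    /// Scans the source once; the result can then be shared by reference.
    fn new(text: &'a str) -> Self {
        let mut starts = vec![0];
        starts.extend(text.match_indices('\n').map(|(i, _)| i + 1));
        Self { text, starts }
    }

    /// Converts a 0-based (row, UTF-8 byte column) pair into a 1-based UTF-16
    /// column. Returns `None` if the row does not exist, the byte column runs
    /// past the end of the line, or it lands inside a multibyte character.
    fn byte_col_to_utf16_col(&self, row: usize, byte_col: usize) -> Option<u32> {
        let start = *self.starts.get(row)?;
        let end = start.checked_add(byte_col)?;
        let prefix = self.text.get(start..end)?;
        if prefix.contains('\n') {
            return None;
        }
        Some(prefix.encode_utf16().count() as u32 + 1)
    }
}
```

Unlike a real line index, this sketch rescans the line prefix on every lookup; it is meant to show the contract (0-based byte column in, 1-based UTF-16 column out), not the performance characteristics.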

Note

Earlier revisions of this PR shipped a hand-rolled implementation with an ASCII is_ascii() fast path and a per-scalar char::len_utf16() slow path. Following reviewer feedback, commit 8947d3bb replaced it with the line-index wrapper to lean on a battle-tested upstream implementation. All existing unit tests (ASCII, BMP non-ASCII, supplementary-plane emoji, CJK, combining marks, CRLF, EOL/empty lines) pass unchanged against the new backing.

Contract documentation (`crates/common`)

Position.col is now documented as a 1-based UTF-16 code-unit column in crates/common/src/model/position.rs, locking in the semantics for all future contributors.

Kernel — tree-sitter layer (`crates/static-analysis-kernel`)

map_node and get_query_nodes in tree_sitter.rs now accept a &LineColumnIndex parameter and use it instead of the raw + 1 offset.
LineColumnIndex is built once per get_query_nodes call (one scan of the source), then reused for every node in that query result.

Kernel — ddsa JavaScript bridge

The LineColumnIndex is threaded through the full ddsa call chain that feeds tree-sitter nodes into V8:

JsRuntime::execute_rule_internal
  └─ QueryMatchBridge::set_data
       └─ TsNodeBridge::insert_capture / insert / build_v8_node
            └─ TreeSitterNode::from_ts_node_with_index   ← single source of truth

TreeSitterNode::from_ts_node (the old ASCII-only constructor) has been removed. There is now a single constructor, from_ts_node_with_index, which always takes a &LineColumnIndex. This eliminates the footgun of accidentally using the byte-based constructor on a multibyte source.
The deno ops op_ts_node_named_children and op_ts_node_parent build a LineColumnIndex from the source text stored in RootContext before inserting newly discovered nodes into the bridge. The ctx_bridge borrow in op_ts_node_named_children was moved into the else branch (when there are actually children to process) to keep the borrow scope tight and the intent clear.

Kernel — taint-flow region builders

LocatedNode::new_cst in flow/graph.rs (used by the taint-flow graph engine) previously did node.start_position().column + 1. It now accepts the full source string, builds a LineColumnIndex, and emits the same UTF-16 column as from_ts_node_with_index — ensuring the taint-flow graph deduplication key stays consistent with the rest of the system.

The position_eq test helper in flow/java.rs was updated to compare against LineColumnIndex-derived columns rather than raw byte offsets.

Server (`crates/static-analysis-server`)

process_tree_sitter_tree_request now builds a LineColumnIndex from the decoded source before calling map_node, so IDE AST responses carry UTF-16 columns.

SARIF and CSV output (`crates/cli`)

No changes were required to the serialization logic — SARIF and CSV both read Position.col directly, so fixing the kernel's output automatically fixes their output. Focused tests were added to assert that the values pass through unchanged.

Why UTF-16?

SARIF v2.1 specifies column numbers as 1-based UTF-16 code units (the format's default encoding is UTF-16).
LSP / VS Code use 0-based UTF-16 code units; our 1-based column is the same measurement offset by one.
CJK characters (e.g. 日, 本) are 3 UTF-8 bytes but 1 UTF-16 unit — they are common in production code and the byte-based columns were visibly wrong.
Emoji (e.g. 🚀) are 4 UTF-8 bytes but 2 UTF-16 units (surrogate pair) — even more extreme divergence.
For pure ASCII (the vast majority of lines), byte col == UTF-16 col, so there is zero behavioral change on ASCII-only files.
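The byte and code-unit counts quoted above can be checked directly with the standard library. The helper name `widths` is hypothetical.

```rust
/// Returns (UTF-8 bytes, UTF-16 code units) needed to encode `c`.
fn widths(c: char) -> (usize, usize) {
    (c.len_utf8(), c.len_utf16())
}
```

`widths('a')` is `(1, 1)`, `widths('日')` is `(3, 1)`, and `widths('🚀')` is `(4, 2)`, which is why columns after CJK text or emoji shrink while ASCII columns stay put.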

Blast radius

Area	Impact
ASCII-only source files	No change — byte col == UTF-16 col for ASCII
Source files with CJK, accented characters	Column numbers decrease (fewer UTF-16 units than UTF-8 bytes)
Source files with emoji (surrogate pairs)	Column numbers decrease further (4 bytes → 2 UTF-16 units)
SARIF output	Now conforms to SARIF v2.1 UTF-16 column semantics
CSV output	Faithfully serializes whatever `Position.col` contains (now UTF-16)
IDE / server AST responses	Column numbers updated to UTF-16
Taint-flow graph deduplication	Consistent — all paths use the same `LineColumnIndex` helper
Secrets scanner (`get_position_in_string`)	Not affected — uses a separate grapheme-based column contract that does not flow through the kernel output

The change is intentionally a semantic breaking change for columns on non-ASCII input. Any downstream consumer (e.g. suppression rules keyed on exact column numbers for non-ASCII code) will need to update their column values. For ASCII code — which is the overwhelming majority of suppression use cases — nothing changes.

What we considered and rejected

Storing LineColumnIndex in the bridge instead of passing it as a parameter
Rejected because the source text lives in RootContext which is reset per file, and the bridge is long-lived across files. Storing the index in the bridge would require either cloning the source string (expensive) or carefully managing a lifetime that would become dangling. Passing &LineColumnIndex as a parameter keeps lifetimes simple and makes the dependency explicit.

Keeping TreeSitterNode::from_ts_node (the old byte-based constructor) for ASCII tests
Rejected because having two constructors with different semantics is a footgun — a future test author could reach for the shorter name on a multibyte source and get silently wrong columns. The ASCII tests were trivially updated to use from_ts_node_with_index with a LineColumnIndex::new call; for pure ASCII sources the results are identical.

Using str::lines() for line splitting in LineColumnIndex
Rejected because str::lines() treats \r\n as a single terminator and strips the \r from the line, whereas tree-sitter ends a row only at \n and counts the \r as part of that row's columns. Diverging here would make line lengths disagree with tree-sitter on \r\n files. The \n-only scan is intentional and documented at the call site.
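A quick standard-library check of how the two splitting strategies differ on CRLF input; the helper names are hypothetical.

```rust
/// Line contents as a `\n`-only splitter sees them.
fn split_on_lf(text: &str) -> Vec<&str> {
    text.split('\n').collect()
}

/// Line contents as `str::lines()` sees them.
fn split_with_lines(text: &str) -> Vec<&str> {
    text.lines().collect()
}
```

On `"ab\r\ncd"` the `\n`-only split keeps the carriage return in the first line (`"ab\r"`), while `str::lines()` yields `"ab"`, so the two disagree by one byte on the length of every CRLF-terminated line.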

Using usize for column numbers everywhere
byte_col_to_utf16_col returns u32 to match TreeSitterNode's field types and Position.col. The LocatedNode::Cst.col field uses usize (a pre-existing inconsistency); the cast at the boundary is harmless but noted as a mild code smell if a future cleanup is desired.

Tests added

LineColumnIndex unit matrix — ASCII fast path, BMP non-ASCII (é), supplementary-plane emoji (🚀), CJK (日本), combining marks, CRLF line endings, EOL boundaries, empty lines.
map_node multibyte — emoji prefix shifts num identifier to UTF-16 col 11 instead of raw byte col 13.
get_query_nodes CJK — 日 (3 UTF-8 bytes, 1 UTF-16 unit) shifts end identifier to UTF-16 col 8.
TsNodeBridge multibyte — v8 _startCol field is 7 (UTF-16) for abc after the prefix "🚀"; (including its trailing space).
TreeSitterNode::from_ts_node_with_index multibyte — emoji and CJK prefix tests.
LocatedNode::new_cst regression — asserts UTF-16 col differs from raw byte col + 1 on multibyte input and agrees on ASCII input.
stella_compat::getCode multibyte — end-to-end JS execution test; violation.base_region.start_col is 7 (UTF-16), not 9 (byte).
Server integration test — base64-encoded x = "🚀"; num = 5 parsed as Python; asserts a node with start.col == 11 exists and no node reports start.col == 13.
SARIF serialization — startColumn/endColumn pass through unchanged.
CSV serialization — UTF-16 col 3 appears in the CSV row, not raw byte col.
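The expected values in these tests follow from a running prefix sum over the line. A minimal sketch of that arithmetic, assuming a hypothetical helper `column_table` that is not part of the PR:

```rust
/// Walks one line and returns, for every char, its 1-based byte column and its
/// 1-based UTF-16 column (the running prefix sums plus one).
fn column_table(line: &str) -> Vec<(char, u32, u32)> {
    let mut utf16 = 0u32;
    line.char_indices()
        .map(|(byte, c)| {
            let row = (c, byte as u32 + 1, utf16 + 1);
            // Advance the UTF-16 prefix sum by this char's width (1 or 2).
            utf16 += c.len_utf16() as u32;
            row
        })
        .collect()
}
```

For `x = "🚀"; num = 5`, the entry for the `n` of `num` is `('n', 13, 11)`: twelve bytes but only ten UTF-16 code units precede it, matching the col 11 versus col 13 values asserted above.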

Notes for the reviewers

See #914 (comment)

…odes

- Introduced `LineColumnIndex` in `crates/common/src/utils/position_utils.rs`, a pre-built per-source index that converts tree-sitter's 0-based UTF-8 byte columns to 1-based UTF-16 code-unit columns, with an ASCII fast path and full unit-test coverage (BMP, supplementary-plane emoji, CJK, combining marks, CRLF, empty lines).
- Documented `Position.col` as a 1-based UTF-16 code-unit column in `crates/common/src/model/position.rs` to lock in the contract.
- Changed `map_node` and `get_query_nodes` (`tree_sitter.rs`) to accept and use `&LineColumnIndex`; updated `process_tree_sitter_tree_request` (`tree_sitter_tree.rs`) accordingly.
- Replaced `TreeSitterNode::from_ts_node` with `from_ts_node_with_index`, threading `LineColumnIndex` through the full ddsa call chain: `JsRuntime::execute_rule_internal` → `QueryMatchBridge::set_data` → `TsNodeBridge::insert`/`insert_capture`/`build_v8_node`.
- Updated the two taint-flow region builders (`LocatedNode::new_cst` in `flow/graph.rs`, `position_eq` helper in `flow/java.rs`) and the deno ops (`op_ts_node_named_children`, `op_ts_node_parent`, `op_digraph_adjacency_list_to_dot`) to use the same helper.
- Moved the `ctx_bridge` borrow in `op_ts_node_named_children` into the `else` branch where it is actually needed.
- Added multibyte regression tests across all layers: `LineColumnIndex` unit matrix, `map_node`/`get_query_nodes` emoji+CJK tests, `TsNodeBridge` bridge test, `LocatedNode` regression test, `stella_compat::getCode` JS execution test, server integration test on base64-encoded multibyte Python, and SARIF/CSV serialization assertions.

IDE-6037

datadog-prod-us1-6 · 2026-05-17T09:48:06Z

🎯 Code Coverage (details)
• Patch Coverage: 99.39%
• Overall Coverage: 85.45% (+0.06%)

This comment will be updated automatically if new data arrives.

Commit SHA: 1f052b9

- Deleted the `find_node` function from `tree_sitter.rs` as it was not utilized in the current implementation. This cleanup helps streamline the code and improve maintainability.

Copilot

Pull request overview

This PR standardizes tree-sitter-derived column reporting on 1-based UTF-16 code-unit columns, aligning kernel/server/bridge outputs with LSP, VS Code, and SARIF expectations for non-ASCII source.

Changes:

Adds LineColumnIndex conversion helper and documents Position.col semantics.
Threads UTF-16 column conversion through tree-sitter mapping, DDSA V8 node bridging, server AST responses, and taint-flow graph regions.
Adds focused regression tests for multibyte columns plus SARIF/CSV pass-through behavior.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.

File	Description
`crates/common/src/utils/position_utils.rs`	Adds byte-column to UTF-16 column conversion helper and tests.
`crates/common/src/model/position.rs`	Documents column semantics for positions.
`crates/static-analysis-kernel/src/analysis/tree_sitter.rs`	Uses `LineColumnIndex` in tree-sitter node/query mapping and adds multibyte tests.
`crates/static-analysis-kernel/src/analysis/ddsa_lib/runtime.rs`	Builds an index for DDSA query match bridging and adds an end-to-end multibyte test.
`crates/static-analysis-kernel/src/analysis/ddsa_lib/ops.rs`	Applies UTF-16 columns when dynamically inserting child/parent/graph nodes.
`crates/static-analysis-kernel/src/analysis/ddsa_lib/js/ts_node.rs`	Replaces raw byte-column constructor with indexed UTF-16 constructor and tests it.
`crates/static-analysis-kernel/src/analysis/ddsa_lib/bridge/ts_node.rs`	Threads `LineColumnIndex` through TS node insertion and V8 object creation.
`crates/static-analysis-kernel/src/analysis/ddsa_lib/bridge/query_match.rs`	Passes the index through query match capture insertion.
`crates/static-analysis-kernel/src/analysis/ddsa_lib/test_utils.rs`	Exposes test source text for indexed conversions.
`crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/java.rs`	Updates Java flow tests to compare UTF-16 columns.
`crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/graph.rs`	Updates CST located-node column construction and adds regression coverage.
`crates/static-analysis-kernel/src/analysis/ddsa_lib/js/flow/graph_test_utils.rs`	Supplies full source text to graph test utilities.
`crates/static-analysis-server/src/tree_sitter_tree.rs`	Uses UTF-16 column conversion for server AST responses and tests multibyte output.
`crates/cli/src/sarif/sarif_utils.rs`	Adds SARIF column pass-through regression coverage.
`crates/cli/src/csv.rs`	Adds CSV column pass-through regression coverage.

- Added `LineColumnIndex::compute_line_starts` (extracted from `new`) and `LineColumnIndex::from_parts` to allow constructing an index from a pre-computed `Vec<usize>` without re-scanning the source.
- Added a `line_starts: Vec<usize>` field to `RootContext`, populated in `set_text` so the scan runs once per file rather than once per op call.
- Added `RootContext::line_column_index()` which builds a `LineColumnIndex` from the cached offsets in O(1) (a Vec clone, not a source scan).
- Updated `op_ts_node_named_children` and `op_ts_node_parent` to use the cached index, eliminating O(source_len × visited_nodes) behavior on deep tree walks or large files.

Co-authored-by: Cursor <cursoragent@cursor.com>

The old signature built a new LineColumnIndex (O(source_len) scan) on every node, making op_digraph_adjacency_list_to_dot O(source_len × CST_vertices). Change new_cst to accept &LineColumnIndex<'_> so callers own the index lifetime and can build it once before any loop:

- ops.rs (op_digraph_adjacency_list_to_dot): build once before the closure, re-using the RootContext-cached index when available.
- graph_test_utils.rs (cst_dot_digraph): build once before the iterator.
- graph.rs (located_node_multibyte_col_is_utf16 test): pass &idx directly.

Co-authored-by: Cursor <cursoragent@cursor.com>

…n calculation

The old comment had a tangent line ending in "10?" that contradicted the actual expected value of 11. Replace with a single, step-by-step UTF-16 prefix sum that maps directly to the assert_eq! below it.

Co-authored-by: Cursor <cursoragent@cursor.com>

…y_nodes_cjk

The comment had a first derivation stating end starts at byte 7, then a second "let's recompute" block arriving at byte 9. Remove the wrong first derivation and keep only the corrected step-by-step calculation.

Co-authored-by: Cursor <cursoragent@cursor.com>

The previous header said the source was '🚀protectedName = 1;' (emoji directly preceding the identifier), but the actual source places the emoji in a string literal: '"\u{1F680}"; protectedName = 1;'. Rewrite the header to match the real source layout and update the byte/UTF-16 derivation. Co-authored-by: Cursor <cursoragent@cursor.com>

The previous docstring led with UTF-16 as the universal meaning, then footnoted the secrets-scanner exception. Rewrite so both contracts are equal-weight bullet points: kernel (UTF-16 code units) and secrets scanner (Unicode grapheme clusters). Consumers reading just the field-level doc now see the full picture without having to read further. Co-authored-by: Cursor <cursoragent@cursor.com>

The unwrap_or_else(|| LineColumnIndex::new("")) fallback would silently produce col=1 for every node if tree text was never set, making bugs invisible. Replace with expect() in both op_ts_node_named_children and op_ts_node_parent, consistent with the already-present panic in op_digraph_adjacency_list_to_dot for the same invariant. Co-authored-by: Cursor <cursoragent@cursor.com>
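The trade-off between the two failure modes can be sketched as follows. The `FileContext` type and its methods are hypothetical and only mirror the idea; they are not the `RootContext` API.

```rust
/// Hypothetical per-file context; `text` is `None` until the source is set.
struct FileContext {
    text: Option<String>,
}

impl FileContext {
    /// Silent fallback: a missing source looks like an empty file, so every
    /// column derived from it would come out as 1 and the bug stays hidden.
    fn text_or_empty(&self) -> &str {
        self.text.as_deref().unwrap_or("")
    }

    /// Loud failure: a missing source is treated as a broken invariant.
    fn text_or_panic(&self) -> &str {
        self.text
            .as_deref()
            .expect("source text must be set before nodes are converted")
    }
}
```

With the text unset, `text_or_empty` quietly returns `""`, whereas `text_or_panic` aborts the operation at the point where the invariant is violated.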

…→UTF-16 transition

This PR changes the kernel from emitting 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns (matching LSP / VS Code / SARIF v2.1). While the PR is in review, the regression workflow compares two SARIF runs as JSON strings: one built from `main` (byte cols) and one from this branch (UTF-16 cols). Every violation that lives on a line with a non-ASCII character drifts by N columns (N = UTF-8 bytes minus UTF-16 code units in the text before the position) and shows up as a "removed + added" pair.

To unblock the CI we drop `startColumn` / `endColumn` from both the comparison key (in `parseFile`) and the summary location display, and compare by (file, ruleId, message, startLine, endLine) only.

This is a ONE-SHOT loosening for the encoding transition. A stacked follow-up PR will restore the column fields once this PR lands on `main` (at which point both runs are on UTF-16 columns again and column-level regression detection becomes meaningful).

Verified locally on numpy's `_core/tests/test_strings.py`:

- before fix: 3 "removed" + 3 "added" false positives (lines with λ μ ·)
- after fix: 0 + 0 (clean)

Co-authored-by: Cursor <cursoragent@cursor.com>

robertohuertasm-datadog · 2026-05-17T15:19:07Z

Heads-up for reviewers — temporary loosening of `check-regressions.js` in `329df0c`

Update: the follow-up that restores the column-aware comparison is now open as #915, stacked on top of this branch. It will close the loop the moment this PR lands on main.

TL;DR: this PR also includes a one-shot, intentional weakening of the regression workflow (startColumn / endColumn removed from the comparison key). #915 reverts that part immediately after this lands on main.

Why was the regression CI failing?

The workflow runs the analyzer twice — once on main, once on the PR branch — and compares the two SARIF outputs as JSON strings (set1 vs set2 in check-regressions.js). This PR changes the kernel from emitting 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns. While the PR is in review:

main build → byte columns
branch build → UTF-16 columns

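Every violation that lives on a line containing a non-ASCII character drifts by N columns, where N is the number of UTF-8 bytes minus the number of UTF-16 code units in the text before the violation position (1 per two-byte character such as λ, 2 per CJK ideograph or emoji). For example, on numpy/_core/tests/test_strings.py line 1284 ("λ" * 5 + "μ" * 2):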

	startColumn	endColumn
main (bytes)	14	32
branch (UTF-16)	14	31

The string-equality check sees this as "violation X removed, new violation Y added" — even though it's the same violation, only the column number changed. No rule is firing in a new place, no rule has stopped firing.
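The size of the drift for a given position can be computed from the text that precedes it on the line. This is a hypothetical helper for illustration; it does not attempt to reproduce the exact numpy line above.

```rust
/// How far a byte-based column overshoots the UTF-16 column after `prefix`:
/// UTF-8 length minus UTF-16 length of the text before the position.
/// The subtraction cannot underflow because every char needs at least as many
/// UTF-8 bytes as UTF-16 code units.
fn column_drift(prefix: &str) -> usize {
    prefix.len() - prefix.encode_utf16().count()
}
```

An ASCII prefix drifts by 0, each `λ` adds 1, and each CJK ideograph or emoji adds 2.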

Reproduced locally on _core/tests/test_strings.py:

Without the fix: 3 "removed" + 3 "added" false positives (lines with λ, μ, ·)
With the fix: 0 + 0 (clean)

What 329df0c does

parseFile strips startColumn / endColumn from the in-memory region — the comparison key becomes (file, ruleId, message, startLine, endLine).
The summary table location string drops columns too, since they no longer participate in the comparison.
A long inline comment explains why this is temporary and what the follow-up will do.
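The effect of dropping the columns from the comparison key can be sketched in Rust. The real script is JavaScript (check-regressions.js); the `Violation` struct, the key functions, and `diff_counts` below are hypothetical and only model the idea.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical, simplified view of one SARIF result.
#[derive(Clone, Debug)]
struct Violation {
    file: String,
    rule_id: String,
    message: String,
    start_line: u32,
    end_line: u32,
    start_column: u32,
    end_column: u32,
}

/// Column-aware key: any column change makes two results look different.
fn strict_key(v: &Violation) -> (String, String, String, u32, u32, u32, u32) {
    (
        v.file.clone(),
        v.rule_id.clone(),
        v.message.clone(),
        v.start_line,
        v.end_line,
        v.start_column,
        v.end_column,
    )
}

/// Loosened key: (file, ruleId, message, startLine, endLine) only.
fn loose_key(v: &Violation) -> (String, String, String, u32, u32) {
    (v.file.clone(), v.rule_id.clone(), v.message.clone(), v.start_line, v.end_line)
}

/// Counts keys found only in `baseline` ("removed") and only in `candidate` ("added").
fn diff_counts<K: Eq + Hash>(
    baseline: &[Violation],
    candidate: &[Violation],
    key: impl Fn(&Violation) -> K,
) -> (usize, usize) {
    let a: HashSet<K> = baseline.iter().map(&key).collect();
    let b: HashSet<K> = candidate.iter().map(&key).collect();
    (a.difference(&b).count(), b.difference(&a).count())
}
```

With the same violation reported at end column 32 in one run and 31 in the other, the strict key yields one "removed" and one "added" entry, while the loose key yields none.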

Why not just fix the regression script properly here?

We do — in #915, stacked on this branch. Once this PR merges to main, both pre and post runs are on UTF-16, so column comparison becomes meaningful again and #915 restores the original behaviour. Doing it in-place here would either require leaving main as the "loosened" baseline forever, or shipping two competing changes that touch the same lines in the same PR.

What we are NOT giving up

We give up only column-level regression detection, and only for one PR cycle. The check still catches:

Rules that stop firing
Rules that fire in new files/lines
Different error messages on the same line

That's the bulk of what this workflow protects against. Column shifts are rare in practice and the gap is fully closed by #915.

Sorry for the extra commit on this PR — happy to discuss alternatives if you'd prefer to land the kernel change with column-comparison fully disabled long-term, or with a different escape-hatch mechanism.

Replace hand-rolled LineColumnIndex implementation with a thin newtype wrapper around line_index::LineIndex. Public API and all call sites are unchanged. Simplifies RootContext cache to own the index directly, eliminating per-call Vec<usize> clones. Addresses reviewer feedback on #914.

…-16 transition

- Reverted the temporary `startColumn` / `endColumn` strip introduced in #914 (commit 329df0c), which was added to unblock CI while the kernel was migrated from 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns.
- Restored the comparison key to the full `physicalLocation` (including `startColumn` and `endColumn`) so the regression workflow once again detects column-level drift on top of file / rule / line changes.
- Restored the summary table location string to the original `startLine:startColumn-endLine:endColumn` format to keep the GitHub Actions summary informative.
- Safe to land once #914 is on `main`: both pre- and post-runs produce UTF-16 columns, so column-aware diffing is meaningful again and no false "removed + added" pairs are expected.

IDE-6037

robertohuertasm-datadog · 2026-05-19T14:34:08Z

This stack of pull requests is managed by Graphite.

jasonforal

Ok, I started to leave some comments but then noticed an overall pattern, so some of my comments will overlap with this summary, but I left them in just to demonstrate what I meant. But they aren't exhaustive so please take a look over the full diff to find other examples:

Redundant Tests
Now that we've switched to the line-index crate, the only unit test we need for each fixed call site that previously just did + 1 is one showing that we're calling byte_col_to_utf16_col. So just one source string is needed to prove that. We trust the byte_col_to_utf16_col unit tests cover all these other cases.

In general, I prefer tests to be single-responsibility (no redundant code paths tested) and highly readable.

Unnecessary doc comments
Seems like a lot of LLM comments leaked into the doc comments (like how idx is passed to provide O(1) lookup). I think this is just noise. If it was a code comment, I'd question it but let it go, but since these are doc comments, they are just clutter.

I think the reason to pass in the index is self-explanatory and doesn't need any documentation.

Otherwise, this is a great fix!

jasonforal · 2026-05-20T10:50:42Z

+    /// When the kernel produces UTF-16 columns (e.g. col 3 for a node after "🚀"), the CSV row
+    /// contains 3, not the raw byte col (5).
+    #[test]
+    fn test_export_csv_multibyte_col() {


Do you think this test is necessary? It seems to cover the same code path as test_export_csv

Agreed, removed in 1f052b9.

- Removed eight unit tests in position_utils.rs that duplicated coverage already provided by the line-index crate, collapsing them into a single wiring test that proves the wrapper delegates to LineIndex::to_wide(WideEncoding::Utf16, ..) and adds 1.
- Dropped per-call-site multibyte tests covered by the new helper: test_export_csv_multibyte_col, sarif_region_carries_utf16_col, ts_node_line_col_multibyte, stella_compat_getcode_multibyte, test_map_node_multibyte_emoji, test_get_query_nodes_cjk, ts_node_bridge_multibyte_utf16_col, located_node_multibyte_col_is_utf16, and test_process_tree_sitter_tree_request_multibyte.
- Trimmed LLM-flavored doc comments on insert_capture, RootContext (lci field, set_text, line_column_index), LocatedNode::new_cst, TreeSitterNode::from_ts_node_with_index, QueryMatchBridge::set_data, and map_node.
- Removed noisy O(1)/pre-computed inline comments from three ops.rs call sites that explained self-evident caching behaviour.
- Changed LineColumnIndex::byte_col_to_utf16_col to return Option<u32> instead of silently falling back to byte_col + 1, making the fallibility explicit at the type level.
- Updated the five production call sites (tree_sitter.rs, js/ts_node.rs, js/flow/graph.rs) to use unwrap_or(byte_col as u32 + 1) so the ASCII fallback is visible at every caller, and updated test sites in js/flow/java.rs to use unwrap().

IDE-6037

refactor(tree_sitter): remove unused find_node function

05755f2

- Deleted the `find_node` function from `tree_sitter.rs` as it was not utilized in the current implementation. This cleanup helps streamline the code and improve maintainability.

robertohuertasm-datadog requested a review from Copilot May 17, 2026 09:56

Copilot started reviewing on behalf of robertohuertasm-datadog May 17, 2026 09:56

Copilot AI reviewed May 17, 2026

robertohuertasm-datadog and others added 8 commits May 17, 2026 12:24

robertohuertasm-datadog marked this pull request as ready for review May 17, 2026 15:44

robertohuertasm-datadog requested a review from a team as a code owner May 17, 2026 15:44

robertohuertasm-datadog mentioned this pull request May 17, 2026

ci(regressions): restore column-aware SARIF comparison after byte→UTF-16 transition #915

Merged

jasonforal reviewed May 19, 2026

Comment thread crates/common/src/utils/position_utils.rs Outdated

jasonforal reviewed May 20, 2026

robertohuertasm-datadog requested a review from jasonforal May 20, 2026 14:10

robertohuertasm-datadog commented May 20, 2026

Comment thread crates/common/src/model/position.rs

jasonforal approved these changes May 20, 2026

robertohuertasm-datadog merged commit 39f4b84 into main May 20, 2026
108 of 126 checks passed

robertohuertasm-datadog deleted the rob/fix/tree-sitter-byte-column-bug-IDE-6037 branch May 20, 2026 17:24
