Skip to content

ci(regressions): restore column-aware SARIF comparison after byte→UTF-16 transition#915

Merged
robertohuertasm-datadog merged 1 commit into
mainfrom
rob/fix/tree-sitter-byte-column-bug-02-IDE-6037
May 21, 2026
Merged

ci(regressions): restore column-aware SARIF comparison after byte→UTF-16 transition#915
robertohuertasm-datadog merged 1 commit into
mainfrom
rob/fix/tree-sitter-byte-column-bug-02-IDE-6037

Conversation

@robertohuertasm-datadog
Copy link
Copy Markdown
Contributor

@robertohuertasm-datadog robertohuertasm-datadog commented May 17, 2026

Context

PR #914 changed the static-analysis kernel from emitting 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns (matching LSP, VS Code, and the SARIF v2.1 default encoding).

While that PR was open, the regression workflow (.github/scripts/check-regressions.js) compared two SARIF runs as JSON strings:

  • before = analyzer built from main → byte columns
  • after = analyzer built from the PR branch → UTF-16 columns

For every violation on a line containing a non-ASCII character, the column numbers drift by N (where N is the number of multibyte characters before the violation position). The string-equality comparison saw those as "violation X removed" + "violation Y added" pairs, even though the rule, the file, and the line were identical.

Example from numpy/_core/tests/test_strings.py line 1284:

("λ" * 5 + "μ" * 2, "μ"),
Build startColumn endColumn
main (byte cols) 14 32
#914 (UTF-16 cols) 14 31

→ same violation, two different rows in the diff.

To unblock #914's CI, commit 329df0cc temporarily dropped startColumn / endColumn from both the comparison key and the summary location string, comparing by (file, ruleId, message, startLine, endLine) instead. That commit was explicitly marked as a one-shot loosening and called out a stacked follow-up PR to restore the original behaviour. This is that follow-up.

What this PR does

  • Restores the comparison key to the full physicalLocation (including startColumn and endColumn).
  • Restores the summary table location string to its original startLine:startColumn-endLine:endColumn format.
  • Effectively reverts the JS-only change introduced in 329df0cc. No other files touched.

Why this is safe to land

Once #914 is merged into main, both sides of the regression check run with UTF-16 columns:

…so column-aware diffing becomes meaningful again and the false "removed + added" pairs disappear. Verified locally on _core/tests/test_strings.py:

Comparison logic "Removed" diffs "Added" diffs
Column-aware (this PR) on main-vs-main baselines 0 0

Merge order

This PR depends on #914 landing first. It is opened against the rob/fix/tree-sitter-byte-column-bug-IDE-6037 branch as a stacked PR — GitHub will automatically retarget it to main once #914 is merged.

What we get back

Column-level regression detection, on top of the file / rule / line / message protection we kept throughout. The workflow can again catch column-shift bugs introduced by future tree-sitter or grammar updates, off-by-one regressions in rule code, etc.

Closes IDE-6037 follow-up.

@datadog-prod-us1-6
Copy link
Copy Markdown

datadog-prod-us1-6 Bot commented May 17, 2026

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 85.45% (+0.00%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 0c391ff | Docs | Datadog PR Page | Give us feedback!

@robertohuertasm-datadog robertohuertasm-datadog force-pushed the rob/fix/tree-sitter-byte-column-bug-02-IDE-6037 branch from 1a1b490 to 142502a Compare May 19, 2026 14:33
Copy link
Copy Markdown
Contributor Author

robertohuertasm-datadog commented May 19, 2026

Base automatically changed from rob/fix/tree-sitter-byte-column-bug-IDE-6037 to main May 20, 2026 17:24
…-16 transition

- Reverted the temporary `startColumn` / `endColumn` strip introduced in #914 (commit 329df0c), which was added to unblock CI while the kernel was migrated from 1-based UTF-8 byte columns to 1-based UTF-16 code-unit columns.
- Restored the comparison key to the full `physicalLocation` (including `startColumn` and `endColumn`) so the regression workflow once again detects column-level drift on top of file / rule / line changes.
- Restored the summary table location string to the original `startLine:startColumn-endLine:endColumn` format to keep the GitHub Actions summary informative.
- Safe to land once #914 is on `main`: both pre- and post-runs produce UTF-16 columns, so column-aware diffing is meaningful again and no false "removed + added" pairs are expected.

IDE-6037
@robertohuertasm-datadog robertohuertasm-datadog force-pushed the rob/fix/tree-sitter-byte-column-bug-02-IDE-6037 branch from 142502a to 0c391ff Compare May 20, 2026 17:38
@robertohuertasm-datadog robertohuertasm-datadog merged commit 9e7e2a8 into main May 21, 2026
91 checks passed
@robertohuertasm-datadog robertohuertasm-datadog deleted the rob/fix/tree-sitter-byte-column-bug-02-IDE-6037 branch May 21, 2026 07:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants