Commit 42cbb41
feat(parser): shebang-based language detection for extension-less scripts (#237)
Add a shebang fallback to `CodeParser.detect_language()` so that
extension-less Unix scripts (`bin/myapp`, `.git/hooks/pre-commit`,
`scripts/deploy`, `.husky/pre-push`, `installer`, ...) are routed to the
correct tree-sitter grammar based on their first line.
Root cause of #237
------------------
`detect_language()` was a single-line lookup against `EXTENSION_TO_LANGUAGE`
keyed on `path.suffix.lower()`. Any file with no extension returned
`None`, which filters it out of both `incremental_update()` and
`full_build()` before parsing. Real-world repos rely heavily on
extension-less scripts for entrypoints, git hooks, CI installers, and
shell tooling — all currently invisible to `callers_of`,
`get_impact_radius`, `detect_changes`, and architecture mapping.
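The failure mode can be reproduced with `pathlib` alone. The following is a minimal sketch, assuming a two-entry stand-in for `EXTENSION_TO_LANGUAGE` (the real table is larger and its contents are not shown in this message); `detect_language_old` is a hypothetical name for the pre-fix lookup:

```python
from pathlib import Path
from typing import Optional

# Hypothetical two-entry stand-in for the real EXTENSION_TO_LANGUAGE table.
EXTENSION_TO_LANGUAGE = {".py": "python", ".sh": "bash"}


def detect_language_old(path: Path) -> Optional[str]:
    """Pre-fix behaviour as described above: a suffix lookup and nothing else."""
    return EXTENSION_TO_LANGUAGE.get(path.suffix.lower())


for name in ("bin/myapp", ".git/hooks/pre-commit", "scripts/deploy", "deploy.sh"):
    p = Path(name)
    # Extension-less names have an empty suffix, so the lookup yields None.
    print(f"{name}: suffix={p.suffix!r} -> {detect_language_old(p)}")
```

Only `deploy.sh` resolves to a language here; the three extension-less paths all fall through to `None` and would be skipped before parsing.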
Fix
---
1. New module-level `SHEBANG_INTERPRETER_TO_LANGUAGE` table mapping common
interpreter basenames to languages that are *already* registered:
- bash / sh / zsh / ksh / dash / ash -> "bash"
- python / python2 / python3 / pypy / pypy3 -> "python"
- node / nodejs -> "javascript"
- ruby, perl, lua, Rscript, php
This table strictly *routes* extension-less files to existing languages;
it does NOT introduce new grammars.
2. New `_SHEBANG_PROBE_BYTES = 256` constant — maximum bytes read from the
head of a file when probing. Enough for any reasonable shebang line
while keeping worst-case I/O tiny.
3. New `CodeParser._detect_language_from_shebang(path)` static method.
Opens the file, reads up to 256 bytes, verifies `#!` prefix, splits on
the first newline AND first NUL byte (defensive against binary), and
decodes as strict UTF-8, returning None on malformed content instead of
raising. Handles:
- direct form: `#!/bin/bash`
- env indirection: `#!/usr/bin/env bash`
- env -S flag (Linux): `#!/usr/bin/env -S node --experimental-vm-modules`
- trailing flags: `#!/bin/bash -e`
- interpreter basename extraction from any absolute path
- CRLF line endings (`.split(b"\n", 1)`)
4. `detect_language(path)` now tries the extension lookup first, and if it
returns None AND `path.suffix == ""`, falls back to the shebang probe.
Files with a *known* extension are NEVER re-read — extension-based
detection remains authoritative.
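The four steps above can be sketched roughly as follows. This is an illustrative reconstruction from the description, not the code in `code_review_graph/parser.py`: the tables are abbreviated stand-ins, the functions are written as free functions rather than `CodeParser` methods, and the `env` handling (skip every flag-like token) is an assumption about one reasonable way to cover `env -S`.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Abbreviated stand-ins; the real tables may differ in contents.
EXTENSION_TO_LANGUAGE = {".py": "python", ".sh": "bash", ".js": "javascript"}
SHEBANG_INTERPRETER_TO_LANGUAGE = {
    "bash": "bash", "sh": "bash", "zsh": "bash",
    "python": "python", "python3": "python",
    "node": "javascript", "nodejs": "javascript",
}
_SHEBANG_PROBE_BYTES = 256


def detect_language_from_shebang(path: Path) -> Optional[str]:
    """Map the interpreter named on the first line to a known language."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(_SHEBANG_PROBE_BYTES)
    except OSError:
        return None
    if not head.startswith(b"#!"):
        return None
    # Keep only the first line, and cut at a NUL byte if one appears.
    line = head[2:].split(b"\n", 1)[0].split(b"\0", 1)[0]
    try:
        text = line.decode("utf-8")  # strict decoding
    except UnicodeDecodeError:
        return None
    tokens = text.split()  # whitespace split also drops a trailing "\r"
    if not tokens:
        return None
    interpreter = tokens[0].rsplit("/", 1)[-1]
    if interpreter == "env":
        # Assumption: the first non-flag token after env is the interpreter.
        target = next((t for t in tokens[1:] if not t.startswith("-")), None)
        if target is None:
            return None
        interpreter = target.rsplit("/", 1)[-1]
    return SHEBANG_INTERPRETER_TO_LANGUAGE.get(interpreter)


def detect_language(path: Path) -> Optional[str]:
    """Extension lookup first; shebang probe only for extension-less files."""
    language = EXTENSION_TO_LANGUAGE.get(path.suffix.lower())
    if language is None and path.suffix == "":
        return detect_language_from_shebang(path)
    return language


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "deploy"
    script.write_bytes(b"#!/usr/bin/env -S node --experimental-vm-modules\r\n")
    print(detect_language(script))  # javascript
```

Because the probe is gated on `path.suffix == ""`, a file such as `tool.py` is never opened by `detect_language`, whatever its first line says.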
Non-regressions guaranteed by the design
----------------------------------------
- `.py` files still parse as Python even if the first line is a misleading
`#!/bin/bash` (`test_detect_shebang_does_not_override_extension`)
- Extension-less README / LICENSE files return None after a read of at
most 256 bytes that finds no shebang.
- Binary files whose first bytes are not `#!` return None without raising.
- Unknown interpreters (e.g. `#!/usr/bin/env ocaml`) return None — same
semantics as an unmapped extension.
Tests added (tests/test_parser.py::TestCodeParser — 16 tests)
-------------------------------------------------------------
- test_detect_shebang_bin_bash
- test_detect_shebang_bin_sh_routed_to_bash
- test_detect_shebang_env_bash
- test_detect_shebang_env_python3
- test_detect_shebang_direct_python
- test_detect_shebang_node
- test_detect_shebang_env_dash_s_flag
- test_detect_shebang_ruby
- test_detect_shebang_perl
- test_detect_shebang_with_trailing_flags
- test_detect_shebang_missing_returns_none
- test_detect_shebang_empty_file_returns_none
- test_detect_shebang_binary_content_returns_none
- test_detect_shebang_unknown_interpreter_returns_none
- test_detect_shebang_does_not_override_extension
- test_parse_shebang_script_produces_function_nodes (end-to-end parse_file
check: extension-less bash script is detected AND parsed into File +
Function nodes, all tagged language="bash")
Test results
------------
Stage 1 (new targeted shebang tests): 16/16 passed.
Stage 2 (tests/test_parser.py full): 83/83 passed.
Stage 3 (tests/test_multilang.py adjacent): 151/151 passed.
Stage 4 (full suite): 748 passed (up from 733),
8 pre-existing Windows failures in test_incremental (3) + test_main
async coroutine detection (1) + test_notebook Databricks (4) —
verified identical on unchanged main.
Stage 5 (ruff check):
- code_review_graph/parser.py: clean
- tests/test_parser.py: 1 pre-existing F841 on line 1038
(test_map_dispatch_qualified_reference, unrelated to this PR —
reproducible on unchanged main at line 901).
Zero regressions. Purely additive fallback that only fires for files
with no extension.
1 parent 80d22bf, commit 42cbb41
2 files changed: +254 -1 lines changed