Commit 42cbb41

feat(parser): shebang-based language detection for extension-less scripts (#237)
Add a shebang fallback to `CodeParser.detect_language()` so that extension-less Unix scripts (`bin/myapp`, `.git/hooks/pre-commit`, `scripts/deploy`, `.husky/pre-push`, `installer`, ...) are routed to the correct tree-sitter grammar based on their first line.

Root cause of #237
------------------
`detect_language()` was a single-line lookup against `EXTENSION_TO_LANGUAGE` keyed on `path.suffix.lower()`. Any file with no extension returned `None`, which filtered it out of both `incremental_update()` and `full_build()` before parsing. Real-world repos rely heavily on extension-less scripts for entrypoints, git hooks, CI installers, and shell tooling, all of which were invisible to `callers_of`, `get_impact_radius`, `detect_changes`, and architecture mapping before this change.

Fix
---
1. New module-level `SHEBANG_INTERPRETER_TO_LANGUAGE` table mapping common interpreter basenames to languages that are *already* registered:
   - bash / sh / zsh / ksh / dash / ash -> "bash"
   - python / python2 / python3 / pypy / pypy3 -> "python"
   - node / nodejs -> "javascript"
   - ruby, perl, lua, php -> the language of the same name; Rscript -> "r"
   This table strictly *routes* extension-less files to existing languages; it does NOT introduce new grammars.
2. New `_SHEBANG_PROBE_BYTES = 256` constant: the maximum number of bytes read from the head of a file when probing. Enough for any reasonable shebang line while keeping worst-case I/O tiny.
3. New `CodeParser._detect_language_from_shebang(path)` static method. Opens the file, reads up to 256 bytes, verifies the `#!` prefix, cuts the line at the first newline AND at the first NUL byte (defensive against binary content), and decodes UTF-8 strictly, catching `UnicodeDecodeError` so malformed content returns None instead of raising. Handles:
   - direct form: `#!/bin/bash`
   - env indirection: `#!/usr/bin/env bash`
   - env -S flag (Linux): `#!/usr/bin/env -S node --experimental-vm-modules`
   - trailing flags: `#!/bin/bash -e`
   - interpreter basename extraction from any absolute path
   - CRLF line endings (the line is split on `\n`, then `.strip()` drops the trailing `\r`)
4. `detect_language(path)` now tries the extension lookup first, and if it returns None AND `path.suffix == ""`, falls back to the shebang probe. Files with a *known* extension are NEVER re-read; extension-based detection remains authoritative.

Non-regressions guaranteed by the design
----------------------------------------
- `.py` files still parse as Python even if the first line is a misleading `#!/bin/bash` (`test_detect_shebang_does_not_override_extension`).
- Extension-less README / LICENSE files return None after a read of at most 256 bytes that finds no shebang.
- Binary files whose first bytes are not `#!` return None without raising.
- Unknown interpreters (e.g. `#!/usr/bin/env ocaml`) return None, the same semantics as an unmapped extension.

Tests added (tests/test_parser.py::TestCodeParser, 16 tests)
------------------------------------------------------------
- test_detect_shebang_bin_bash
- test_detect_shebang_bin_sh_routed_to_bash
- test_detect_shebang_env_bash
- test_detect_shebang_env_python3
- test_detect_shebang_direct_python
- test_detect_shebang_node
- test_detect_shebang_env_dash_s_flag
- test_detect_shebang_ruby
- test_detect_shebang_perl
- test_detect_shebang_with_trailing_flags
- test_detect_shebang_missing_returns_none
- test_detect_shebang_empty_file_returns_none
- test_detect_shebang_binary_content_returns_none
- test_detect_shebang_unknown_interpreter_returns_none
- test_detect_shebang_does_not_override_extension
- test_parse_shebang_script_produces_function_nodes (end-to-end parse_file check: an extension-less bash script is detected AND parsed into File + Function nodes, all tagged language="bash")

Test results
------------
Stage 1 (new targeted shebang tests): 16/16 passed.
Stage 2 (tests/test_parser.py full): 83/83 passed.
Stage 3 (tests/test_multilang.py adjacent): 151/151 passed.
Stage 4 (full suite): 748 passed (up from 733), 8 pre-existing Windows failures in test_incremental (3) + test_main async coroutine detection (1) + test_notebook Databricks (4), verified identical on unchanged main.
Stage 5 (ruff check):
- code_review_graph/parser.py: clean
- tests/test_parser.py: 1 pre-existing F841 on line 1038 (test_map_dispatch_qualified_reference, unrelated to this PR and reproducible on unchanged main at line 901).

Zero regressions. Purely additive fallback that only fires for files with no extension.
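
The probe steps described under "Fix" can be sketched in isolation. This is a minimal sketch, not the committed code: it works on an in-memory `bytes` head instead of opening a file, and `language_from_head` and the abbreviated `INTERPRETER_TO_LANGUAGE` table are illustrative names that do not appear in the commit.

```python
from typing import Optional

# Illustrative subset of the routing table; the commit's table has more keys.
INTERPRETER_TO_LANGUAGE = {
    "bash": "bash",
    "sh": "bash",
    "python3": "python",
    "node": "javascript",
}


def language_from_head(head: bytes) -> Optional[str]:
    """Resolve a language from the first bytes of a file, or None."""
    if not head.startswith(b"#!"):
        return None
    # Keep only the first line, and cut at a NUL in case binary data follows.
    first_line = head.split(b"\n", 1)[0].split(b"\0", 1)[0]
    try:
        # strip() also drops the trailing "\r" of a CRLF line ending.
        line = first_line[2:].decode("utf-8").strip()
    except UnicodeDecodeError:
        return None
    tokens = line.split()
    if not tokens:
        return None
    if tokens[0] == "env" or tokens[0].endswith("/env"):
        # env indirection: the first token that is not an option is the interpreter.
        candidates = [t for t in tokens[1:] if not t.startswith("-")]
        if not candidates:
            return None
        interpreter = candidates[0]
    else:
        interpreter = tokens[0]
    # Only the basename is looked up, so any absolute path works.
    return INTERPRETER_TO_LANGUAGE.get(interpreter.rsplit("/", 1)[-1])


print(language_from_head(b"#!/usr/bin/env -S node --experimental-vm-modules\n"))
print(language_from_head(b"#!/bin/bash -e\r\n"))
print(language_from_head(b"\x00\x01 not a shebang"))
```

Taking bytes as input keeps the sketch free of file I/O, which makes the parsing rules easy to check one shebang shape at a time.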
1 parent 80d22bf commit 42cbb41

2 files changed

+254
-1
lines changed


code_review_graph/parser.py

Lines changed: 117 additions & 1 deletion
@@ -119,6 +119,41 @@ class EdgeInfo:
     ".jl": "julia",
 }

+# Shebang interpreter → language mapping for extension-less Unix scripts.
+# Each key is the **basename** of the interpreter path as it appears after
+# ``#!`` (or after ``#!/usr/bin/env``). Only languages already registered
+# above are listed — this file strictly routes extension-less scripts, it
+# does NOT introduce new languages on its own. See issue #237.
+SHEBANG_INTERPRETER_TO_LANGUAGE: dict[str, str] = {
+    # POSIX / bash-compatible shells — all routed through tree-sitter-bash
+    "bash": "bash",
+    "sh": "bash",
+    "zsh": "bash",
+    "ksh": "bash",
+    "dash": "bash",
+    "ash": "bash",
+    # Python (every common variant)
+    "python": "python",
+    "python2": "python",
+    "python3": "python",
+    "pypy": "python",
+    "pypy3": "python",
+    # JavaScript via Node
+    "node": "javascript",
+    "nodejs": "javascript",
+    # Ruby / Perl / Lua / R / PHP
+    "ruby": "ruby",
+    "perl": "perl",
+    "lua": "lua",
+    "Rscript": "r",
+    "php": "php",
+}
+
+# Maximum bytes to read from the head of a file when probing for a shebang.
+# 256 is enough for any reasonable shebang line (``#!/usr/bin/env python3 -u\n``
+# is ~30 chars) while keeping the worst-case read tiny even on fat binaries.
+_SHEBANG_PROBE_BYTES = 256
+
 # Tree-sitter node type mappings per language
 # Maps (language) -> dict of semantic role -> list of TS node types
 _CLASS_TYPES: dict[str, list[str]] = {
@@ -383,7 +418,88 @@ def _get_parser(self, language: str): # type: ignore[arg-type]
         return self._parsers[language]

     def detect_language(self, path: Path) -> Optional[str]:
-        return EXTENSION_TO_LANGUAGE.get(path.suffix.lower())
+        """Map a file path to its language name.
+
+        Extension-based lookup is tried first. For extension-less files
+        (typical for Unix scripts like ``bin/myapp`` or ``.git/hooks/pre-commit``)
+        we fall back to reading the first line for a shebang. Files that
+        already have a known extension are never re-read — shebang probing
+        only runs when the extension lookup returns ``None`` **and** the path
+        has no suffix at all. See issue #237.
+        """
+        suffix = path.suffix.lower()
+        lang = EXTENSION_TO_LANGUAGE.get(suffix)
+        if lang is not None:
+            return lang
+        # Only probe shebang for files without any extension — "README", "LICENSE",
+        # and other extension-less text files also fall here, but the probe is a
+        # cheap 256-byte read that returns None when no shebang is found.
+        if suffix == "":
+            return self._detect_language_from_shebang(path)
+        return None
+
+    @staticmethod
+    def _detect_language_from_shebang(path: Path) -> Optional[str]:
+        """Inspect the first line of ``path`` for a shebang interpreter.
+
+        Returns the mapped language name or ``None`` if the file has no
+        shebang, is unreadable, or names an interpreter we don't map.
+
+        Accepted shapes::
+
+            #!/bin/bash
+            #!/usr/bin/env python3
+            #!/usr/bin/env -S node --experimental-vm-modules
+            #!/usr/bin/bash -e
+
+        Only the basename of the interpreter is consulted. Trailing flags
+        after the interpreter are ignored. Windows-style ``\r\n`` line
+        endings are handled. Binary files read as garbage bytes simply
+        fail the ``#!`` prefix check and return ``None``.
+        """
+        try:
+            with path.open("rb") as fh:
+                head = fh.read(_SHEBANG_PROBE_BYTES)
+        except (OSError, PermissionError):
+            return None
+        if not head.startswith(b"#!"):
+            return None
+
+        # Take just the first line, stripped of leading "#!" and any
+        # surrounding whitespace. Split on NUL to defend against accidental
+        # binary content following a ``#!`` prefix.
+        first_line = head.split(b"\n", 1)[0].split(b"\0", 1)[0]
+        try:
+            line = first_line[2:].decode("utf-8", errors="strict").strip()
+        except UnicodeDecodeError:
+            return None
+        if not line:
+            return None
+
+        tokens = line.split()
+        if not tokens:
+            return None
+
+        first = tokens[0]
+        # `/usr/bin/env` indirection: the interpreter is the next token.
+        # `/usr/bin/env -S node --flag` is also valid — skip any leading
+        # ``-`` options after env.
+        if first.endswith("/env") or first == "env":
+            interpreter_token: Optional[str] = None
+            for tok in tokens[1:]:
+                if tok.startswith("-"):
+                    # ``-S`` takes no argument in most envs; skip and continue.
+                    continue
+                interpreter_token = tok
+                break
+            if interpreter_token is None:
+                return None
+            interpreter = interpreter_token.rsplit("/", 1)[-1]
+        else:
+            # Direct form: ``#!/bin/bash`` or ``#!/usr/local/bin/python3``.
+            interpreter = first.rsplit("/", 1)[-1]
+
+        return SHEBANG_INTERPRETER_TO_LANGUAGE.get(interpreter)

     def parse_file(self, path: Path) -> tuple[list[NodeInfo], list[EdgeInfo]]:
         """Parse a single file and return extracted nodes and edges."""

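Because only the interpreter basename is consulted and the table lookup is an exact match, a versioned name such as `python3.11` matches no key in the table as committed and resolves to `None`. The sketch below illustrates that behaviour under stated assumptions: `TABLE` is an abbreviated copy of the mapping, and `strip_version` is a hypothetical normaliser that is not part of this commit.

```python
import re
from typing import Optional

# Abbreviated, illustrative copy of the routing table from the diff above.
TABLE = {"bash": "bash", "python": "python", "python3": "python"}


def strip_version(name: str) -> str:
    """Hypothetical helper: drop trailing ".N" groups (python3.11 -> python3)."""
    return re.sub(r"(\.\d+)+$", "", name)


def lookup(token: str, normalize: bool = False) -> Optional[str]:
    """Exact-match lookup on the basename, optionally after normalising."""
    name = token.rsplit("/", 1)[-1]  # a token without "/" is returned whole
    if normalize:
        name = strip_version(name)
    return TABLE.get(name)


print(lookup("/usr/bin/python3"))                      # exact key
print(lookup("/usr/bin/python3.11"))                   # no matching key
print(lookup("/usr/bin/python3.11", normalize=True))   # key found after normalising
```

Whether such normalisation is desirable is a design choice for the project; the commit deliberately treats any unmapped basename the same way as an unmapped extension.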
tests/test_parser.py

Lines changed: 137 additions & 0 deletions
@@ -21,6 +21,143 @@ def test_detect_language_typescript(self):
     def test_detect_language_unknown(self):
         assert self.parser.detect_language(Path("foo.txt")) is None

+    # --- Shebang detection for extension-less Unix scripts (#237) ---
+
+    def _write_shebang_file(self, tmp_path: Path, name: str, content: str) -> Path:
+        """Helper: write an extension-less file with ``content`` and return its path."""
+        p = tmp_path / name
+        p.write_text(content, encoding="utf-8")
+        return p
+
+    def test_detect_shebang_bin_bash(self, tmp_path):
+        p = self._write_shebang_file(
+            tmp_path, "deploy", "#!/bin/bash\nfoo() { echo hi; }\n",
+        )
+        assert self.parser.detect_language(p) == "bash"
+
+    def test_detect_shebang_bin_sh_routed_to_bash(self, tmp_path):
+        """/bin/sh scripts are parsed through the bash grammar."""
+        p = self._write_shebang_file(
+            tmp_path, "install-hook", "#!/bin/sh\necho hello\n",
+        )
+        assert self.parser.detect_language(p) == "bash"
+
+    def test_detect_shebang_env_bash(self, tmp_path):
+        p = self._write_shebang_file(
+            tmp_path, "runner", "#!/usr/bin/env bash\nfoo() { echo hi; }\n",
+        )
+        assert self.parser.detect_language(p) == "bash"
+
+    def test_detect_shebang_env_python3(self, tmp_path):
+        p = self._write_shebang_file(
+            tmp_path, "myapp",
+            "#!/usr/bin/env python3\ndef main():\n pass\n",
+        )
+        assert self.parser.detect_language(p) == "python"
+
+    def test_detect_shebang_direct_python(self, tmp_path):
+        p = self._write_shebang_file(
+            tmp_path, "tool", "#!/usr/bin/python3\nprint('hi')\n",
+        )
+        assert self.parser.detect_language(p) == "python"
+
+    def test_detect_shebang_node(self, tmp_path):
+        p = self._write_shebang_file(
+            tmp_path, "cli", "#!/usr/bin/env node\nconsole.log(1);\n",
+        )
+        assert self.parser.detect_language(p) == "javascript"
+
+    def test_detect_shebang_env_dash_s_flag(self, tmp_path):
+        """``#!/usr/bin/env -S node --flag`` (Linux -S) resolves to the interpreter."""
+        p = self._write_shebang_file(
+            tmp_path, "esm-tool",
+            "#!/usr/bin/env -S node --experimental-vm-modules\n"
+            "console.log('esm');\n",
+        )
+        assert self.parser.detect_language(p) == "javascript"
+
+    def test_detect_shebang_ruby(self, tmp_path):
+        p = self._write_shebang_file(
+            tmp_path, "rake-task", "#!/usr/bin/env ruby\nputs 1\n",
+        )
+        assert self.parser.detect_language(p) == "ruby"
+
+    def test_detect_shebang_perl(self, tmp_path):
+        p = self._write_shebang_file(
+            tmp_path, "cgi-script", "#!/usr/bin/env perl\nprint 1;\n",
+        )
+        assert self.parser.detect_language(p) == "perl"
+
+    def test_detect_shebang_with_trailing_flags(self, tmp_path):
+        """``#!/bin/bash -e`` still maps to bash (flags ignored)."""
+        p = self._write_shebang_file(
+            tmp_path, "strict", "#!/bin/bash -e\nfoo() { echo hi; }\n",
+        )
+        assert self.parser.detect_language(p) == "bash"
+
+    def test_detect_shebang_missing_returns_none(self, tmp_path):
+        """Extension-less text files without a shebang return None, not bash."""
+        p = self._write_shebang_file(
+            tmp_path, "README", "# just a readme, no shebang\nsome content\n",
+        )
+        assert self.parser.detect_language(p) is None
+
+    def test_detect_shebang_empty_file_returns_none(self, tmp_path):
+        p = tmp_path / "EMPTY"
+        p.write_bytes(b"")
+        assert self.parser.detect_language(p) is None
+
+    def test_detect_shebang_binary_content_returns_none(self, tmp_path):
+        """A garbage-byte first line that happens not to start with ``#!``
+        must not raise and must return None."""
+        p = tmp_path / "binary-blob"
+        p.write_bytes(b"\x00\x01\x02\x03 garbage bytes not a shebang\n")
+        assert self.parser.detect_language(p) is None
+
+    def test_detect_shebang_unknown_interpreter_returns_none(self, tmp_path):
+        """A valid shebang to an interpreter we don't route is treated as
+        'unknown language' — same as an unmapped extension."""
+        p = self._write_shebang_file(
+            tmp_path, "ocaml-script", "#!/usr/bin/env ocaml\nlet x = 1\n",
+        )
+        assert self.parser.detect_language(p) is None
+
+    def test_detect_shebang_does_not_override_extension(self, tmp_path):
+        """A file with a known extension must still use extension-based
+        detection, even if its first line is a misleading shebang."""
+        p = tmp_path / "script.py"
+        p.write_text("#!/bin/bash\nprint('hi')\n", encoding="utf-8")
+        # .py wins over the bash shebang — non-intuitive-looking content
+        # in a .py file must not fool the detector.
+        assert self.parser.detect_language(p) == "python"
+
+    def test_parse_shebang_script_produces_function_nodes(self, tmp_path):
+        """End-to-end regression: an extension-less bash script is not only
+        detected but also fully parsed into structural nodes via parse_file.
+        """
+        script = (
+            "#!/usr/bin/env bash\n"
+            "greet() {\n"
+            ' echo "hi $1"\n'
+            "}\n"
+            "main() {\n"
+            " greet world\n"
+            "}\n"
+            "main\n"
+        )
+        p = self._write_shebang_file(tmp_path, "deploy", script)
+
+        nodes, edges = self.parser.parse_file(p)
+
+        # We at least got the File node plus both functions.
+        assert len(nodes) >= 3
+        funcs = [n for n in nodes if n.kind == "Function"]
+        func_names = {f.name for f in funcs}
+        assert "greet" in func_names
+        assert "main" in func_names
+        for n in nodes:
+            assert n.language == "bash"
+
     def test_parse_python_file(self):
         nodes, edges = self.parser.parse_file(FIXTURES / "sample_python.py")
