evo2 SAE eval: label producers + probing harness (on #1629)#1636
evo2 SAE eval: label producers + probing harness (on #1629)#1636polinabinder1 wants to merge 9 commits into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThree new scripts implement a complete Evo2 sparse autoencoder interpretability suite. ChangesEvo2 SAE Interpretability Probing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py (1)
37-52: ⚡ Quick winAdd Google-style docstrings instead of suppressing D103 in
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.pyandbionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py.The shared root cause is the same in both files: new public helpers and CLI entry points are being introduced with
# noqa: D103(or no docstring) instead of repo-required docstrings. Please document the FASTA/sampling helpers inevo2_buffer.pyand the public CLI commands inprobe.pywith Google-style docstrings rather than suppressing the lint rule. As per coding guidelines,**/*.py: Use Google-style docstrings (pydocstyle convention) in Python code and Use Ruff for Python linting and formatting with Google-style docstrings.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py` around lines 37 - 52, Replace the "# noqa: D103" suppressions by adding Google-style docstrings for the public helpers and CLI entry points: add a short summary, Args, Returns, and Examples where appropriate to the read_fasta and sample_sequences functions in evo2_buffer.py (document parameters like path, fasta, max_tokens, seq_len, kingdoms, seed and what is yielded/returned), and likewise add Google-style docstrings to the public CLI functions/commands in probe.py (document command purpose, parameters/flags and exit behavior); ensure docstrings follow pydocstyle/Google conventions so Ruff/linting passes.Source: Coding guidelines
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py (1)
51-85: ⚡ Quick winReplace the
D10xsuppressions with actual Google-style docstrings.This new file opts out of pydocstyle for
SAEWrap,L26Hook, andmain()instead of documenting the new CLI surface and hook behavior. Adding concise Google-style docstrings here is a small cleanup that keeps the script aligned with the repository standard.As per coding guidelines,
**/*.py: "Ensure all Python files follow Google-style docstrings (pydocstyle convention)" and "Use Ruff for Python linting and formatting with Google-style docstrings".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py` around lines 51 - 85, Add Google-style docstrings for SAEWrap, L26Hook, and main() instead of suppressing pydocstyle (remove the noqa D10x markers). For SAEWrap, document the class purpose, the constructor parameter sae (type and expected interface: encode, decoder, pre_bias, optional normalize_input) and describe forward(x) return values (recon, codes) and normalization behavior. For L26Hook, document the class purpose, the attributes mode/override/captured, valid mode values ("off", "capture", "replace"), and __call__ behavior (how it captures or replaces the first tensor in output and dtype handling). For main(), add a concise docstring describing the CLI surface and what the script does when executed (setup, hooking behavior, and overall flow). Ensure docstrings follow Google style for one-line summary, optional Args/Returns sections where useful.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py`:
- Around line 107-115: The code can proceed with an empty `batches` list and
produce misleading perfect scores; add a fail-fast guard after building
`batches` (after the loop that calls `sample_sequences`, `engine.tokenize`, and
appends to `batches`) that checks `if not batches:` and raises a clear exception
(e.g., ValueError) explaining no evaluable sequences were produced (mention
`--n-seqs`, filtered FASTA, or token length <= 4) so the evaluator (which
computes `ce_original`, `ce_sae`, `ce_zero` and `loss_recovered`) never runs on
empty data.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py`:
- Around line 343-351: The --auroc-device argument in _add_model_args is
hardcoded to "cuda:1", causing failures on single-GPU hosts; change its default
to match the primary device (e.g., "cuda:0") or make it None and resolve to the
same device as --device at runtime; update the same hardcoded default at the
other occurrence (lines referenced in the review) so all model-backed buffers
use the primary device by default (reference: _add_model_args and the
--auroc-device argument).
- Around line 19-27: The CLI help lists a "loss-recovered" subcommand but main()
never registers it; update main() to add a subparser for "loss-recovered" (using
the existing ArgumentParser.add_subparsers usage), set its help text to match
the help block, and wire it to the actual handler (e.g.,
set_defaults(func=probe_loss_recovered.main) or call the probe_loss_recovered
entrypoint) so invoking "probe.py loss-recovered" runs the intended logic; also
audit the similar missing registrations noted around the 354-405 region and
register any other commands mentioned in the top help that lack corresponding
subparsers.
- Around line 86-90: The _load function currently silently filters unknown
labels; change it to validate requested labels and fail fast: after loading buf
via ActivationBuffer.load(a.acts) and parsing labels from a.labels or
buf.label_names, compute missing = [t for t in labels if t not in buf.name_idx]
and if missing is non-empty raise a ValueError (or SystemExit) with a clear
message listing the unknown label names and available label names; otherwise
return buf and the requested labels in the original order (not filtered).
- Around line 114-120: The current guard skips only all-zero test labels but not
all-one test labels, causing best_single_train_test and auroc_vec to receive
single-class targets; update the conditional that checks ytr and yte so it also
treats an all-ones yte as invalid (e.g., check yte.sum() in (0, len(yte)) or
equivalent) and set out[n] to (nan,nan) in that case; modify the logic around
ytr, yte, the existing if-block, and the block that calls fit_logreg,
best_single_train_test and auroc_vec so both all-negative and all-positive test
folds are skipped.
- Around line 159-165: The codon→aa probe is leaking test info because Xz is
standardized with full-dataset mean/std; compute mean/std from the training
split and apply that normalization to both train and test instead. Specifically,
after calling split_indices(...) (tr, te), compute mu and sigma from X[tr] (use
X[tr].mean(0) and X[tr].std(0)+1e-6), then set Xz = (X - mu)/sigma before
calling decode_eval and fit_softmax; also ensure trn is chosen as the
intersection of tr with the non-hidx indices (instead of using the full-dataset
non-hidx mask) so fit_softmax(Xz[trn], ...) only uses training examples.
---
Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py`:
- Around line 37-52: Replace the "# noqa: D103" suppressions by adding
Google-style docstrings for the public helpers and CLI entry points: add a short
summary, Args, Returns, and Examples where appropriate to the read_fasta and
sample_sequences functions in evo2_buffer.py (document parameters like path,
fasta, max_tokens, seq_len, kingdoms, seed and what is yielded/returned), and
likewise add Google-style docstrings to the public CLI functions/commands in
probe.py (document command purpose, parameters/flags and exit behavior); ensure
docstrings follow pydocstyle/Google conventions so Ruff/linting passes.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py`:
- Around line 51-85: Add Google-style docstrings for SAEWrap, L26Hook, and
main() instead of suppressing pydocstyle (remove the noqa D10x markers). For
SAEWrap, document the class purpose, the constructor parameter sae (type and
expected interface: encode, decoder, pre_bias, optional normalize_input) and
describe forward(x) return values (recon, codes) and normalization behavior. For
L26Hook, document the class purpose, the attributes mode/override/captured,
valid mode values ("off", "capture", "replace"), and __call__ behavior (how it
captures or replaces the first tensor in output and dtype handling). For main(),
add a concise docstring describing the CLI surface and what the script does when
executed (setup, hooking behavior, and overall flow). Ensure docstrings follow
Google style for one-line summary, optional Args/Returns sections where useful.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5bc44345-d335-4a1a-b937-a1042492fd94
📒 Files selected for processing (3)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.pybionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.pybionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py
f794c31 to
3557958
Compare
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
70a34e8 to
2394d71
Compare
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
2394d71 to
7f960de
Compare
…validation test - test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled. - test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA / the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which arrive when the serve + eval stacks share recipes/evo2/ — so it runs in the merged megatron GPU lane. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
3b93f3d to
1ebd296
Compare
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…validation test - test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled. - test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA / the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which arrive when the serve + eval stacks share recipes/evo2/ — so it runs in the merged megatron GPU lane. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
26dd036 to
73c261f
Compare
Re-lands #1630 on the post-#1633 layout, on top of the rebased #1629: the DNA label producers (scripts/{labelers,annot_tracks,euk_windows}.py) that emit per-token concept labels (genes/exons/ motifs) to fill #1629's ActivationBuffer, + biopython dep (genetic code in labelers.py). Validated: tests/{test_labelers,test_annot_tracks}.py -> 8 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
… correctness, tests - declare pyrodigal (predict_cds/predict_codons import it lazily; only biopython was declared, so the gene-calling path would ImportError in a clean env) - euk_windows: reject a multi-record FASTA (parse_gff drops seqid + we concatenate, so a multi-chrom GFF would silently mislabel against a blob) and document the single-chromosome assumption - skip windows overlapping a *second* gene's span (central-gene approx mislabels a neighbor's exons as intergenic) and de-dup near-duplicate adjacent-exon windows so they don't eat the token budget - extract the CDS reverse-strand frame math into a unit-tested helper (_frame_and_start) - tests: euk_windows had none — add parse_gff/1-based->0-based labeling/single-chrom guard; cover the previously-untested labelers (gc/homopolymer/dinuc/kozak/splice) + the +/- strand frame anchor #1630 CPU tests: 16 passed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
A new ubuntu-latest workflow installs sae + the recipe (CPU torch) and runs the recipe's model-agnostic tests (-m 'not slow') — the label producers (#1630), eval metrics, etc. — so they run cheaply on the probing-stack branches instead of waiting for #1622's megatron GPU lane (which would run them on an L4 after a full build). Registers the 'slow' marker on the recipe pyproject so the GPU tests are excluded without an unknown-marker warning. Validated: pytest tests/ -m 'not slow' -> 16 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…s-recovered engine call Adds end-to-end CPU tests that cross the eval-layer seams the per-layer unit tests miss: buffer save/load -> probe.main() (auroc/annotate/linear), and annot_tracks.label_windows -> domain_f1. Guards the dense-twin round trip (the SAE-vs-dense `linear` comparison only renders if dense survives save->load) and verifies the stale-base harness still runs against Also fixes probe_loss_recovered.py to call the real engine API (Evo2SAE._ensure_engine().model; the previous _ensure_gen_model() does not exist) and adds tests/conftest.py to centralize the scripts/ sys.path insertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…se SAE forward - Extract evo2_buffer.forward_codes(engine, id_lists) — the one place that touches the engine internals (locked GPU forward + SAE encode). build_buffer and probe._encode_windows both use it, so the #1622 engine-API coupling lives in a single spot, and the per-token label/buffer work moves out of the GPU lock. Add a CPU unit test (fake engine) for the helper's contract. - Hoist KINGDOM_TAGS to evo2_buffer (was duplicated in probe_loss_recovered). - Remove the `codon-aa` subcommand: it consumed a codon/aa npz no command produces (and was the only raw np.load); drop it and its now-unused decode_eval/fit_softmax imports until a producer exists. - SAEWrap delegates to the SAE's own forward() (top-k + normalize_input denormalization) instead of hand-rolling decoder(codes)+pre_bias and mean/std — the path the steering hook uses, so the loss-recovered recon can't drift from the SAE's actual (de)normalization. - Make evo2_buffer importable without the evo2_sae engine (lazy read_fasta), so the CPU tests exercise forward_codes and the harness imports cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…ck (#1622) recipes/evo2/ is co-owned with #1622 (Dockerfile/build + src/evo2_sae). Align the shared files so the two stacks merge without conflict, regardless of order: - pyproject.toml: keep `[tool.setuptools] packages = []` (unchanged from main, so #1622's `where = ["src"]` wins cleanly at merge and `pip install -e recipes/evo2` still works here with no src/ dir); make the `[tool.pytest.ini_options]` markers block byte-identical to #1622's so the add/add merges cleanly. The biopython/pyrodigal deps stay a one-sided add. - Drop tests/conftest.py (it add/add-collided with #1622's GPU-fixture conftest) and restore the per-file scripts/ sys.path insert in test_probe_integration.py, matching the sibling tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…validation test - test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled. - test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA / the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which arrive when the serve + eval stacks share recipes/evo2/ — so it runs in the merged megatron GPU lane. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
1ebd296 to
c48938b
Compare
…; drop CI lane Relocate #1636's probe harness from scripts/ into the evo2_sae.eval.probing package (alongside the #1629 primitives, now the package __init__): scripts/{labelers,evo2_buffer,annot_tracks,euk_windows,probe,probe_loss_recovered}.py -> src/evo2_sae/eval/probing/*.py Fix imports to package-relative (from . import labelers; from .evo2_buffer import ...) and pull the primitives from evo2_sae.eval.probing; loss_recovered stays in the shared sae lib. Re-point the tests at the package (drop the sys.path-into-scripts/ hack). Remove the CPU CI lane (defer; run via .ci_build.sh + pytest). Reparented onto the moved #1629. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…; drop CI lane Relocate #1636's probe harness from scripts/ into the evo2_sae.eval.probing package (alongside the #1629 primitives, now the package __init__): scripts/{labelers,evo2_buffer,annot_tracks,euk_windows,probe,probe_loss_recovered}.py -> src/evo2_sae/eval/probing/*.py Fix imports to package-relative (from . import labelers; from .evo2_buffer import ...) and pull the primitives from evo2_sae.eval.probing; loss_recovered stays in the shared sae lib. Re-point the tests at the package (drop the sys.path-into-scripts/ hack). Remove the CPU CI lane (defer; run via .ci_build.sh + pytest). Reparented onto the moved #1629. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
c48938b to
1974169
Compare
Summary
Evo2 SAE eval: label producers + probing harness — turn DNA into an
ActivationBuffer(the one model-touching step) and score it through the probing CLI (AUROC / annotate / linear / domain-F1 / loss-recovered). Lives in theevo2_sae.eval.probingpackage, alongside the #1629 primitives.Stacked on #1629 (→ #1622). #1630 supplies the eval labels.
Contents —
evo2_sae.eval.probingevo2_buffer.py— DNA →ActivationBuffer(the only model-touching code: Evo2 → layer-L residual →SAE.encode+ per-token labels)labelers.py— per-token biological labelers (genetic code / CDS frame; prokaryotic gene calling viapyrodigal)annot_tracks.py— BED/GFF interval-track loader → per-token masks (RefSeq / Rfam / JASPAR / ENCODE)euk_windows.py— eukaryotic gene-structure windowsprobe.py— the probing CLI (extract/auroc/annotate/linear/euk-f1/domain-eval)probe_loss_recovered.py— SAE fidelity (loss recovered); reuses the sharedsae.eval.loss_recoveredImports are package-relative; the primitives come from
evo2_sae.eval.probing.loss_recoveredstays in the sharedsaelib (used by esm2/codonfm too).How to run
No dedicated CI lane (deferred — see #1622; CI should fold into the repo-wide recipe lane later).
Tests
labelers/annot_tracks/euk_windows) + the probe-CLI integration (buffer save/load roundtrip incl. the dense twin, AUROC/annotate/linear over a planted feature,domain_f1over interval tracks). 35 passed.@pytest.mark.skipif(not torch.cuda.is_available())— runs on a GPU box, skips otherwise.