evo2 SAE eval: label producers + probing harness (on #1629) by polinabinder1 · Pull Request #1636 · NVIDIA-BioNeMo/bionemo-recipes

polinabinder1 · 2026-06-12T03:45:18Z

Summary

Evo2 SAE eval: label producers + probing harness — turn DNA into an ActivationBuffer (the one model-touching step) and score it through the probing CLI (AUROC / annotate / linear / domain-F1 / loss-recovered). Lives in the evo2_sae.eval.probing package, alongside the #1629 primitives.

Stacked on #1629 (→ #1622). #1630 supplies the eval labels.

Contents — `evo2_sae.eval.probing`

evo2_buffer.py — DNA → ActivationBuffer (the only model-touching code: Evo2 → layer-L residual → SAE.encode + per-token labels)
labelers.py — per-token biological labelers (genetic code / CDS frame; prokaryotic gene calling via pyrodigal)
annot_tracks.py — BED/GFF interval-track loader → per-token masks (RefSeq / Rfam / JASPAR / ENCODE)
euk_windows.py — eukaryotic gene-structure windows
probe.py — the probing CLI (extract / auroc / annotate / linear / euk-f1 / domain-eval)
probe_loss_recovered.py — SAE fidelity (loss recovered); reuses the shared sae.eval.loss_recovered

Imports are package-relative; the primitives come from evo2_sae.eval.probing. loss_recovered stays in the shared sae lib (used by esm2/codonfm too).

How to run

cd interpretability/sparse_autoencoders/recipes/evo2
bash .ci_build.sh && source .ci_test_env.sh        # or: PYTHONPATH=src:../../sae/src
pytest tests/test_probe_integration.py tests/test_labelers.py tests/test_annot_tracks.py tests/test_euk_windows.py
# CLI: python -m evo2_sae.eval.probing.probe extract|auroc|annotate|linear|domain-eval ...

No dedicated CI lane (deferred — see #1622; CI should fold into the repo-wide recipe lane later).

Tests

CPU (no model): label producers (labelers / annot_tracks / euk_windows) + the probe-CLI integration (buffer save/load roundtrip incl. the dense twin, AUROC/annotate/linear over a planted feature, domain_f1 over interval tracks). 35 passed.
GPU: the real-engine buffer/loss-recovered path is gated by @pytest.mark.skipif(not torch.cuda.is_available()) — runs on a GPU box, skips otherwise.

copy-pr-bot · 2026-06-12T03:45:22Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-12T03:45:25Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 242e5252-9c73-471f-9d5d-955e250ac801

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Three new scripts implement a complete Evo2 sparse autoencoder interpretability suite. evo2_buffer.py reads FASTA files, samples DNA sequences from prokaryotic/eukaryotic pools, and builds ActivationBuffer objects containing sparse codes and per-token labels. probe.py provides a unified CLI for buffer-based metric probing (AUROC, linear classifiers, codon-AA analysis, feature annotation) and model-backed genomic domain evaluation. probe_loss_recovered.py computes loss-recovery metrics by wrapping the SAE and installing layer hooks to capture or replace residual activations.

Changes

Evo2 SAE Interpretability Probing

Layer / File(s)	Summary
Activation buffer construction from sequences `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py`	FASTA parser, sequence sampler with kingdom-based oversampling and token-budget controls, and buffer builder that preallocates tensors, tokenizes sequences with kingdom tags, runs locked Evo2 forward passes, encodes sparse codes, computes per-token labeler outputs, and returns NumPy-backed `ActivationBuffer`.
Metric-based probing from precomputed buffers `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py` (lines 1–352, 408–409)	Buffer loading and label resolution utilities, AUROC command to compute per-label metrics and print tables, `_eval_matrix` helper to fit logistic regression and compute train/test scores, `linear` command to compare SAE vs. dense representations, `codon-aa` classifier to compute family-disjoint amino-acid recall, `annotate` command to generate per-feature concept metadata as Parquet, and CLI dispatcher with shared argument handling for all subcommands.
Model-backed streaming and genomic domain evaluation `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py` (lines 206–318, 320–341)	Core `_encode_windows` helper that tiles windows through batches, performs locked Evo2 forward passes, fills label/instance masks, `extract` command to build and save buffers using Evo2, `euk-f1` command to compute exon/intron/CDS domain F1, and `domain-eval` command to evaluate arbitrary genomic tracks against SAE features with per-concept AUROC and optimal thresholds.
Loss recovery and reconstruction fidelity evaluation `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py`	`SAEWrap` class to produce reconstructions in residual space with optional denormalization, `L26Hook` forward-hook for capturing or replacing layer outputs, and main evaluation routine that samples sequences, registers the layer hook, computes next-token cross-entropy deltas under locked engine execution, and reports loss-recovered metrics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

jstjohn
trvachov

Poem

🐰 A rabbit hops through DNA so bright,
Sparse codes hidden from the light,
Probing domains, aurocs gleam,
Building buffers in the stream,
Loss recovered—fidelity's dream! 🧬✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 32.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title 'evo2 SAE eval: label producers + probing harness (on `#1629`)' clearly and concisely summarizes the main change: adding label producers and a probing harness for Evo2 SAE evaluation, building on prior work (`#1629`).
Description check	✅ Passed	PR description is detailed and well-structured, covering summary, contents, usage instructions, testing approach, and how to run the code, with clear references to dependencies.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch pbinder/evo2-eval-probe

_{Comment @coderabbitai help to get the list of available commands.}

polinabinder1 · 2026-06-12T04:31:44Z

@coderabbitai review

coderabbitai · 2026-06-12T04:31:51Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

polinabinder1 · 2026-06-12T04:36:44Z

@coderabbitai review

coderabbitai · 2026-06-12T04:36:48Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py (1)
37-52: ⚡ Quick win

Add Google-style docstrings instead of suppressing D103 in bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py and bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py.

The shared root cause is the same in both files: new public helpers and CLI entry points are being introduced with # noqa: D103 (or no docstring) instead of repo-required docstrings. Please document the FASTA/sampling helpers in evo2_buffer.py and the public CLI commands in probe.py with Google-style docstrings rather than suppressing the lint rule. As per coding guidelines, **/*.py: Use Google-style docstrings (pydocstyle convention) in Python code and Use Ruff for Python linting and formatting with Google-style docstrings.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py`
around lines 37 - 52, Replace the "# noqa: D103" suppressions by adding
Google-style docstrings for the public helpers and CLI entry points: add a short
summary, Args, Returns, and Examples where appropriate to the read_fasta and
sample_sequences functions in evo2_buffer.py (document parameters like path,
fasta, max_tokens, seq_len, kingdoms, seed and what is yielded/returned), and
likewise add Google-style docstrings to the public CLI functions/commands in
probe.py (document command purpose, parameters/flags and exit behavior); ensure
docstrings follow pydocstyle/Google conventions so Ruff/linting passes.
Source: Coding guidelines
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py (1)
51-85: ⚡ Quick win

Replace the D10x suppressions with actual Google-style docstrings.

This new file opts out of pydocstyle for SAEWrap, L26Hook, and main() instead of documenting the new CLI surface and hook behavior. Adding concise Google-style docstrings here is a small cleanup that keeps the script aligned with the repository standard.

As per coding guidelines, **/*.py: "Ensure all Python files follow Google-style docstrings (pydocstyle convention)" and "Use Ruff for Python linting and formatting with Google-style docstrings".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py`
around lines 51 - 85, Add Google-style docstrings for SAEWrap, L26Hook, and
main() instead of suppressing pydocstyle (remove the noqa D10x markers). For
SAEWrap, document the class purpose, the constructor parameter sae (type and
expected interface: encode, decoder, pre_bias, optional normalize_input) and
describe forward(x) return values (recon, codes) and normalization behavior. For
L26Hook, document the class purpose, the attributes mode/override/captured,
valid mode values ("off", "capture", "replace"), and __call__ behavior (how it
captures or replaces the first tensor in output and dtype handling). For main(),
add a concise docstring describing the CLI surface and what the script does when
executed (setup, hooking behavior, and overall flow). Ensure docstrings follow
Google style for one-line summary, optional Args/Returns sections where useful.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py`:
- Around line 107-115: The code can proceed with an empty `batches` list and
produce misleading perfect scores; add a fail-fast guard after building
`batches` (after the loop that calls `sample_sequences`, `engine.tokenize`, and
appends to `batches`) that checks `if not batches:` and raises a clear exception
(e.g., ValueError) explaining no evaluable sequences were produced (mention
`--n-seqs`, filtered FASTA, or token length <= 4) so the evaluator (which
computes `ce_original`, `ce_sae`, `ce_zero` and `loss_recovered`) never runs on
empty data.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py`:
- Around line 343-351: The --auroc-device argument in _add_model_args is
hardcoded to "cuda:1", causing failures on single-GPU hosts; change its default
to match the primary device (e.g., "cuda:0") or make it None and resolve to the
same device as --device at runtime; update the same hardcoded default at the
other occurrence (lines referenced in the review) so all model-backed buffers
use the primary device by default (reference: _add_model_args and the
--auroc-device argument).
- Around line 19-27: The CLI help lists a "loss-recovered" subcommand but main()
never registers it; update main() to add a subparser for "loss-recovered" (using
the existing ArgumentParser.add_subparsers usage), set its help text to match
the help block, and wire it to the actual handler (e.g.,
set_defaults(func=probe_loss_recovered.main) or call the probe_loss_recovered
entrypoint) so invoking "probe.py loss-recovered" runs the intended logic; also
audit the similar missing registrations noted around the 354-405 region and
register any other commands mentioned in the top help that lack corresponding
subparsers.
- Around line 86-90: The _load function currently silently filters unknown
labels; change it to validate requested labels and fail fast: after loading buf
via ActivationBuffer.load(a.acts) and parsing labels from a.labels or
buf.label_names, compute missing = [t for t in labels if t not in buf.name_idx]
and if missing is non-empty raise a ValueError (or SystemExit) with a clear
message listing the unknown label names and available label names; otherwise
return buf and the requested labels in the original order (not filtered).
- Around line 114-120: The current guard skips only all-zero test labels but not
all-one test labels, causing best_single_train_test and auroc_vec to receive
single-class targets; update the conditional that checks ytr and yte so it also
treats an all-ones yte as invalid (e.g., check yte.sum() in (0, len(yte)) or
equivalent) and set out[n] to (nan,nan) in that case; modify the logic around
ytr, yte, the existing if-block, and the block that calls fit_logreg,
best_single_train_test and auroc_vec so both all-negative and all-positive test
folds are skipped.
- Around line 159-165: The codon→aa probe is leaking test info because Xz is
standardized with full-dataset mean/std; compute mean/std from the training
split and apply that normalization to both train and test instead. Specifically,
after calling split_indices(...) (tr, te), compute mu and sigma from X[tr] (use
X[tr].mean(0) and X[tr].std(0)+1e-6), then set Xz = (X - mu)/sigma before
calling decode_eval and fit_softmax; also ensure trn is chosen as the
intersection of tr with the non-hidx indices (instead of using the full-dataset
non-hidx mask) so fit_softmax(Xz[trn], ...) only uses training examples.

---

Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py`:
- Around line 37-52: Replace the "# noqa: D103" suppressions by adding
Google-style docstrings for the public helpers and CLI entry points: add a short
summary, Args, Returns, and Examples where appropriate to the read_fasta and
sample_sequences functions in evo2_buffer.py (document parameters like path,
fasta, max_tokens, seq_len, kingdoms, seed and what is yielded/returned), and
likewise add Google-style docstrings to the public CLI functions/commands in
probe.py (document command purpose, parameters/flags and exit behavior); ensure
docstrings follow pydocstyle/Google conventions so Ruff/linting passes.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py`:
- Around line 51-85: Add Google-style docstrings for SAEWrap, L26Hook, and
main() instead of suppressing pydocstyle (remove the noqa D10x markers). For
SAEWrap, document the class purpose, the constructor parameter sae (type and
expected interface: encode, decoder, pre_bias, optional normalize_input) and
describe forward(x) return values (recon, codes) and normalization behavior. For
L26Hook, document the class purpose, the attributes mode/override/captured,
valid mode values ("off", "capture", "replace"), and __call__ behavior (how it
captures or replaces the first tensor in output and dtype handling). For main(),
add a concise docstring describing the CLI surface and what the script does when
executed (setup, hooking behavior, and overall flow). Ensure docstrings follow
Google style for one-line summary, optional Args/Returns sections where useful.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5bc44345-d335-4a1a-b937-a1042492fd94

📥 Commits

Reviewing files that changed from the base of the PR and between fe6bb14 and d892bb8.

📒 Files selected for processing (3)

bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py

Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

copy-pr-bot · 2026-06-23T18:54:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…validation test - test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled. - test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA / the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which arrive when the serve + eval stacks share recipes/evo2/ — so it runs in the merged megatron GPU lane. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

…validation test - test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled. - test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA / the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which arrive when the serve + eval stacks share recipes/evo2/ — so it runs in the merged megatron GPU lane. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

Re-lands #1630 on the post-#1633 layout, on top of the rebased #1629: the DNA label producers (scripts/{labelers,annot_tracks,euk_windows}.py) that emit per-token concept labels (genes/exons/ motifs) to fill #1629's ActivationBuffer, + biopython dep (genetic code in labelers.py). Validated: tests/{test_labelers,test_annot_tracks}.py -> 8 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

… correctness, tests - declare pyrodigal (predict_cds/predict_codons import it lazily; only biopython was declared, so the gene-calling path would ImportError in a clean env) - euk_windows: reject a multi-record FASTA (parse_gff drops seqid + we concatenate, so a multi-chrom GFF would silently mislabel against a blob) and document the single-chromosome assumption - skip windows overlapping a *second* gene's span (central-gene approx mislabels a neighbor's exons as intergenic) and de-dup near-duplicate adjacent-exon windows so they don't eat the token budget - extract the CDS reverse-strand frame math into a unit-tested helper (_frame_and_start) - tests: euk_windows had none — add parse_gff/1-based->0-based labeling/single-chrom guard; cover the previously-untested labelers (gc/homopolymer/dinuc/kozak/splice) + the +/- strand frame anchor #1630 CPU tests: 16 passed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

A new ubuntu-latest workflow installs sae + the recipe (CPU torch) and runs the recipe's model-agnostic tests (-m 'not slow') — the label producers (#1630), eval metrics, etc. — so they run cheaply on the probing-stack branches instead of waiting for #1622's megatron GPU lane (which would run them on an L4 after a full build). Registers the 'slow' marker on the recipe pyproject so the GPU tests are excluded without an unknown-marker warning. Validated: pytest tests/ -m 'not slow' -> 16 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

…s-recovered engine call Adds end-to-end CPU tests that cross the eval-layer seams the per-layer unit tests miss: buffer save/load -> probe.main() (auroc/annotate/linear), and annot_tracks.label_windows -> domain_f1. Guards the dense-twin round trip (the SAE-vs-dense `linear` comparison only renders if dense survives save->load) and verifies the stale-base harness still runs against Also fixes probe_loss_recovered.py to call the real engine API (Evo2SAE._ensure_engine().model; the previous _ensure_gen_model() does not exist) and adds tests/conftest.py to centralize the scripts/ sys.path insertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

…se SAE forward - Extract evo2_buffer.forward_codes(engine, id_lists) — the one place that touches the engine internals (locked GPU forward + SAE encode). build_buffer and probe._encode_windows both use it, so the #1622 engine-API coupling lives in a single spot, and the per-token label/buffer work moves out of the GPU lock. Add a CPU unit test (fake engine) for the helper's contract. - Hoist KINGDOM_TAGS to evo2_buffer (was duplicated in probe_loss_recovered). - Remove the `codon-aa` subcommand: it consumed a codon/aa npz no command produces (and was the only raw np.load); drop it and its now-unused decode_eval/fit_softmax imports until a producer exists. - SAEWrap delegates to the SAE's own forward() (top-k + normalize_input denormalization) instead of hand-rolling decoder(codes)+pre_bias and mean/std — the path the steering hook uses, so the loss-recovered recon can't drift from the SAE's actual (de)normalization. - Make evo2_buffer importable without the evo2_sae engine (lazy read_fasta), so the CPU tests exercise forward_codes and the harness imports cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

…ck (#1622) recipes/evo2/ is co-owned with #1622 (Dockerfile/build + src/evo2_sae). Align the shared files so the two stacks merge without conflict, regardless of order: - pyproject.toml: keep `[tool.setuptools] packages = []` (unchanged from main, so #1622's `where = ["src"]` wins cleanly at merge and `pip install -e recipes/evo2` still works here with no src/ dir); make the `[tool.pytest.ini_options]` markers block byte-identical to #1622's so the add/add merges cleanly. The biopython/pyrodigal deps stay a one-sided add. - Drop tests/conftest.py (it add/add-collided with #1622's GPU-fixture conftest) and restore the per-file scripts/ sys.path insert in test_probe_integration.py, matching the sibling tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

…validation test - test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled. - test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA / the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which arrive when the serve + eval stacks share recipes/evo2/ — so it runs in the merged megatron GPU lane. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

…; drop CI lane Relocate #1636's probe harness from scripts/ into the evo2_sae.eval.probing package (alongside the #1629 primitives, now the package __init__): scripts/{labelers,evo2_buffer,annot_tracks,euk_windows,probe,probe_loss_recovered}.py -> src/evo2_sae/eval/probing/*.py Fix imports to package-relative (from . import labelers; from .evo2_buffer import ...) and pull the primitives from evo2_sae.eval.probing; loss_recovered stays in the shared sae lib. Re-point the tests at the package (drop the sys.path-into-scripts/ hack). Remove the CPU CI lane (defer; run via .ci_build.sh + pytest). Reparented onto the moved #1629. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

This was referenced Jun 12, 2026

evo2-sae: probing primitives (eval metrics + ActivationBuffer) #1629

Open

evo2 SAE eval (1/2): label producers — imported by #1636 #1630

Closed

polinabinder1 changed the title ~~evo2 SAE eval: probing harness + CLI~~ evo2 SAE eval (2/2): probing harness + CLI (on #1630) Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

polinabinder1 marked this pull request as ready for review June 12, 2026 05:32

polinabinder1 requested review from jstjohn, jwilber, pstjohn, savitha-eng and trvachov as code owners June 12, 2026 05:32

polinabinder1 requested review from jstjohn and pstjohn and removed request for jstjohn, pstjohn and trvachov June 12, 2026 05:32

polinabinder1 force-pushed the pbinder/evo2-eval-harness branch from f794c31 to 3557958 Compare June 23, 2026 06:18

polinabinder1 force-pushed the pbinder/evo2-eval-probe branch from 70a34e8 to 2394d71 Compare June 23, 2026 06:19

polinabinder1 force-pushed the pbinder/evo2-eval-probe branch from 2394d71 to 7f960de Compare June 23, 2026 18:54

polinabinder1 changed the base branch from pbinder/evo2-eval-harness to pbinder/sae-interp-primitives June 23, 2026 18:54

polinabinder1 changed the title ~~evo2 SAE eval (2/2): probing harness + CLI (on #1630)~~ evo2 SAE eval: label producers + probing harness (on #1629) Jun 23, 2026

polinabinder1 mentioned this pull request Jun 23, 2026

evo2 SAE: inference engine + steering, CI lane, tests, Dockerfile #1622

Open

polinabinder1 force-pushed the pbinder/evo2-eval-probe branch from 3b93f3d to 1ebd296 Compare June 23, 2026 19:58

polinabinder1 force-pushed the pbinder/sae-interp-primitives branch 3 times, most recently from 26dd036 to 73c261f Compare June 24, 2026 04:24

polinabinder1 and others added 8 commits June 24, 2026 04:25

polinabinder1 force-pushed the pbinder/evo2-eval-probe branch from 1ebd296 to c48938b Compare June 24, 2026 04:42

polinabinder1 force-pushed the pbinder/evo2-eval-probe branch from c48938b to 1974169 Compare June 24, 2026 04:49

Conversation

polinabinder1 commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Contents — evo2_sae.eval.probing

How to run

Tests

Uh oh!

copy-pr-bot Bot commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

polinabinder1 commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

polinabinder1 commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

copy-pr-bot Bot commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

polinabinder1 commented Jun 12, 2026 •

edited

Loading

Contents — `evo2_sae.eval.probing`

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading