evo2-sae: probing primitives (eval metrics + ActivationBuffer)#1629
evo2-sae: probing primitives (eval metrics + ActivationBuffer)#1629polinabinder1 wants to merge 5 commits into
Conversation
📝 WalkthroughWalkthroughThis PR adds a comprehensive SAE feature-probing evaluation module ( ChangesSAE Probing Evaluation Suite
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py (2)
54-65: 💤 Low value
allow_pickle=Trueposes a deserialization risk if loading untrusted files.This is acceptable for internal artifacts but worth documenting. If these buffers might come from external sources, consider validating provenance or using a safer serialization format.
`@classmethod` def load(cls, path: str) -> "ActivationBuffer": - """Load an ActivationBuffer from an .npz written by save().""" + """Load an ActivationBuffer from an .npz written by save(). + + Warning: + Uses allow_pickle=True; only load files from trusted sources. + """ z = np.load(path, allow_pickle=True)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py` around lines 54 - 65, The load method in ActivationBuffer uses np.load(..., allow_pickle=True) which is unsafe for untrusted files; change load to avoid allow_pickle=True by default (use allow_pickle=False) or add an explicit parameter (e.g., allow_pickle: bool = False) and fail with a clear error if pickled objects are required, and update the ActivationBuffer.load docstring to document the deserialization risk and the need to validate provenance when loading external files; ensure references to ActivationBuffer.load and the local variable z are used to implement and surface the safer behavior.
243-245: 💤 Low valueConsider adding a comment explaining the
+2sizing for the remap tensor.The
+2accounts for 0-indexing and ensures negative indexing (-1) wraps to a valid buffer position. While correct, this is subtle:- remap = torch.full((int(inst_ids.max().item()) + 2,), -1, device=dev, dtype=torch.long) + # +2: one for 0-indexing, one so that -1 wraps to a valid (unused) slot + remap = torch.full((int(inst_ids.max().item()) + 2,), -1, device=dev, dtype=torch.long)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py` around lines 243 - 245, Add an inline comment above the remap creation explaining why the size is int(inst_ids.max().item()) + 2: we need +1 for 0-based indexing of the maximum id and an extra slot so that using -1 as a sentinel (when indexing remap with potentially -1 inst_ids) will wrap to a valid buffer position instead of raising an out-of-bounds error; reference the remap tensor and the subsequent remap[uniq.long()] / remap[inst_ids.long()] usage (and the torch.full default -1) so readers understand the sentinel handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py`:
- Around line 54-65: The load method in ActivationBuffer uses np.load(...,
allow_pickle=True) which is unsafe for untrusted files; change load to avoid
allow_pickle=True by default (use allow_pickle=False) or add an explicit
parameter (e.g., allow_pickle: bool = False) and fail with a clear error if
pickled objects are required, and update the ActivationBuffer.load docstring to
document the deserialization risk and the need to validate provenance when
loading external files; ensure references to ActivationBuffer.load and the local
variable z are used to implement and surface the safer behavior.
- Around line 243-245: Add an inline comment above the remap creation explaining
why the size is int(inst_ids.max().item()) + 2: we need +1 for 0-based indexing
of the maximum id and an extra slot so that using -1 as a sentinel (when
indexing remap with potentially -1 inst_ids) will wrap to a valid buffer
position instead of raising an out-of-bounds error; reference the remap tensor
and the subsequent remap[uniq.long()] / remap[inst_ids.long()] usage (and the
torch.full default -1) so readers understand the sentinel handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 23ddf87a-6a45-46a2-8264-db968ee016e5
📒 Files selected for processing (3)
bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/__init__.pybionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.pybionemo-recipes/interpretability/sparse_autoencoders/sae/tests/test_probing.py
|
Addressed the two nitpicks in |
Re-lands #1629 (sae.eval.probing: AUROC / domain-F1 / linear probes + ActivationBuffer) onto the post-#1633 top-level layout, and adds a dedicated CPU workflow (ubuntu-latest, no model/GPU) that runs the model-agnostic probing tests. Separate from the evo2 GPU lane; the tensor-parallel sae tests (torchrun/multi-GPU) are out of scope here. Validated: tests/test_probing.py -> 6 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
57837ec to
13a0690
Compare
Re-lands #1630 on the post-#1633 layout, on top of the rebased #1629: the DNA label producers (scripts/{labelers,annot_tracks,euk_windows}.py) that emit per-token concept labels (genes/exons/ motifs) to fill #1629's ActivationBuffer, + biopython dep (genetic code in labelers.py). Validated: tests/{test_labelers,test_annot_tracks}.py -> 8 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Re-lands #1630 on the post-#1633 layout, on top of the rebased #1629: the DNA label producers (scripts/{labelers,annot_tracks,euk_windows}.py) that emit per-token concept labels (genes/exons/ motifs) to fill #1629's ActivationBuffer, + biopython dep (genetic code in labelers.py). Validated: tests/{test_labelers,test_annot_tracks}.py -> 8 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Re-lands #1629 (sae.eval.probing: AUROC / domain-F1 / linear probes + ActivationBuffer) onto the post-#1633 top-level layout, and adds a dedicated CPU workflow (ubuntu-latest, no model/GPU) that runs the model-agnostic probing tests. Separate from the evo2 GPU lane; the tensor-parallel sae tests (torchrun/multi-GPU) are out of scope here. Validated: tests/test_probing.py -> 6 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
auroc_all / auroc_vec / best_single / macro_auroc ranked via argsort().argsort(), giving tied values arbitrary distinct ranks. SAE codes are sparse (heavy zero-mass), so that biased the AUROC on the real data distribution — and the oracle test only covered randn (no ties). Switch to average (Mann-Whitney) ranks via a vectorized searchsorted helper (keeps the all-features-at-once speed that motivates hand-rolling), make the oracle tie-aware, and add sparse-tie + constant-feature tests. Also documents why these metrics are hand-rolled. tests/test_probing.py -> 8 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…er None paths - a never/always-firing concept -> AUROC 0.5 (the valid-mask branch; realistic for rare concepts) - auroc_vec directly (was only tested transitively via best_single) on tied scores - ActivationBuffer with no dense twin / no instances (the Optional -> None save/load paths) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
standardize z-scores SAE codes for the linear/codon probes, where ~20% of latents are dead (constant 0). Add a direct test that the 1e-6 std floor keeps those columns finite (no NaN into the logreg fit) and that mean/std use the train rows only (no test-set leakage). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…ed sae lib); drop CI lane These probing primitives (eval metrics + ActivationBuffer) are evo2-specific, so move them from the shared sae library into the evo2_sae recipe package: * sae/src/sae/eval/probing.py -> recipes/evo2/src/evo2_sae/eval/probing.py * new recipes/evo2/src/evo2_sae/eval/__init__.py (re-exports the probing API) * sae/src/sae/eval/__init__.py reverted (no longer exports probing — stays shared for esm2/codonfm) * sae/tests/test_probing.py -> recipes/evo2/tests/test_probing.py (import evo2_sae.eval.probing) Remove .github/workflows/unit-tests-sae.yaml (defer CI; run tests via the recipe's .ci_build.sh + pytest). Re-parented onto #1622 so the evo2_sae package is available. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
26dd036 to
73c261f
Compare
73c261f to
26dd036
Compare
26dd036 to
73c261f
Compare
Re-lands #1630 on the post-#1633 layout, on top of the rebased #1629: the DNA label producers (scripts/{labelers,annot_tracks,euk_windows}.py) that emit per-token concept labels (genes/exons/ motifs) to fill #1629's ActivationBuffer, + biopython dep (genetic code in labelers.py). Validated: tests/{test_labelers,test_annot_tracks}.py -> 8 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI (scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked; the GPU extract->score smoke is a follow-up (no unit tests in this PR yet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…; drop CI lane Relocate #1636's probe harness from scripts/ into the evo2_sae.eval.probing package (alongside the #1629 primitives, now the package __init__): scripts/{labelers,evo2_buffer,annot_tracks,euk_windows,probe,probe_loss_recovered}.py -> src/evo2_sae/eval/probing/*.py Fix imports to package-relative (from . import labelers; from .evo2_buffer import ...) and pull the primitives from evo2_sae.eval.probing; loss_recovered stays in the shared sae lib. Re-point the tests at the package (drop the sys.path-into-scripts/ hack). Remove the CPU CI lane (defer; run via .ci_build.sh + pytest). Reparented onto the moved #1629. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…; drop CI lane Relocate #1636's probe harness from scripts/ into the evo2_sae.eval.probing package (alongside the #1629 primitives, now the package __init__): scripts/{labelers,evo2_buffer,annot_tracks,euk_windows,probe,probe_loss_recovered}.py -> src/evo2_sae/eval/probing/*.py Fix imports to package-relative (from . import labelers; from .evo2_buffer import ...) and pull the primitives from evo2_sae.eval.probing; loss_recovered stays in the shared sae lib. Re-point the tests at the package (drop the sys.path-into-scripts/ hack). Remove the CPU CI lane (defer; run via .ci_build.sh + pytest). Reparented onto the moved #1629. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Summary
SAE probing primitives (eval metrics +
ActivationBuffer) for Evo2 — scoring metrics + per-feature annotation, all pure functions of codes + labels. Lives in the evo2 recipe atevo2_sae.eval.probing— moved out of the sharedsaelibrary because it's evo2-specific (the sharedsae.evalkeepsloss_recovered/sparsity/dead_latentsfor esm2/codonfm).Stacked on #1622 (uses the
evo2_saepackage). Base of #1636 (probe harness); #1630 supplies the eval labels.Contents —
evo2_sae.eval.probingActivationBuffer(codes + optional dense twin + per-token labels + instance ids)auroc_all,auroc_vec,best_single_train_testfit_logreg/fit_softmax/macro_auroc/decode_evaldomain_f1(precision-per-nt, recall-per-instance)annotate_features(per-feature best concept by AUROC → the annotation table)How to use
Tests
No dedicated CI lane (deferred — see #1622). Run via the recipe:
12 passed (CPU, no model): AUROC vs a pairwise-definition oracle,
domain_f1vs a hand-computed reference,best_singlewinner's-curse flip,decode_evalseparability,annotate_featuresbest-concept, buffer roundtrip, tie-correct (average) ranks, degenerate-label / tie / sparse edge cases, andstandardize's zero-variance floor.Why hand-rolled (not sklearn / torchmetrics) — checked, not a win
GPU-vectorized over the whole ~32k-feature dictionary in one pass; the library options are CPU and per-
(scores, label), so a 32k-feature dictionary becomes a 32k-iteration CPU loop. Function by function:auroc_all— no library computes a vectorized[features × labels]AUROC matrix on GPU. Kept.domain_f1,best_single_train_test,annotate_features— bespoke (instance-F1, winner's-curse, per-feature assignment); no library equivalent.fit_logreg/fit_softmax/decode_eval— the only sklearn-replaceable code, but they fit on the[N≈50k, F≈32k]SAE-code matrix, exactly where CodonFM hit thesklearn.LogisticRegressionscaling wall and had to subsample to ≤5k features. Swapping reintroduces that coverage loss + a runtime dep. Net regression.ActivationBuffer/split_indices/standardize—np.savez+ tiny helpers; nothing to gain.Conclusion: the module stays
torch+numpy-only. Each metric is a standard formula (Mann–Whitney rank-AUROC, Adam BCE/softmax, instance-F1) vectorized for full-dictionary GPU scale, and each is validated against an independent reference in the tests.