evo2-sae: probing primitives (eval metrics + ActivationBuffer)#1629

Open
polinabinder1 wants to merge 5 commits into pbinder/evo2-sae-serve from pbinder/sae-interp-primitives

@polinabinder1 polinabinder1 commented Jun 11, 2026

Collaborator

Summary

SAE probing primitives (eval metrics + ActivationBuffer) for Evo2 — scoring metrics + per-feature annotation, all pure functions of codes + labels. Lives in the evo2 recipe at evo2_sae.eval.probing — moved out of the shared sae library because it's evo2-specific (the shared sae.eval keeps loss_recovered / sparsity / dead_latents for esm2/codonfm).

Stacked on #1622 (uses the evo2_sae package). Base of #1636 (probe harness); #1630 supplies the eval labels.

Contents — evo2_sae.eval.probing

  • ActivationBuffer (codes + optional dense twin + per-token labels + instance ids)
  • AUROC: auroc_all, auroc_vec, best_single_train_test
  • decoders: fit_logreg / fit_softmax / macro_auroc / decode_eval
  • domain_f1 (precision-per-nt, recall-per-instance)
  • annotate_features (per-feature best concept by AUROC → the annotation table)

How to use

from evo2_sae.eval.probing import auroc_all, annotate_features
au  = auroc_all(codes, labels)                                   # [F, L]
ann = annotate_features(codes, labels, names, min_auroc=0.85)    # [{feature_id, label, auroc}]
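
To make the shape of the annotation table concrete, here is a minimal sketch of the per-feature best-concept step, assuming an [F, L] AUROC matrix is already in hand. The helper name `annotate_from_auroc` is hypothetical and is not the PR's `annotate_features` (which takes codes and labels directly); the inclusive `>=` threshold is also an assumption. Only the output keys and the `min_auroc` parameter come from the usage above.

```python
import numpy as np

def annotate_from_auroc(au, names, min_auroc=0.85):
    """Hypothetical helper: pick each feature's best label from an [F, L] AUROC matrix."""
    au = np.asarray(au, dtype=float)
    best = au.argmax(axis=1)                          # best label index per feature
    best_auroc = au[np.arange(au.shape[0]), best]     # AUROC of that best label
    rows = []
    # Assumption: the threshold is inclusive; features below it get no annotation.
    for f in np.flatnonzero(best_auroc >= min_auroc):
        rows.append({"feature_id": int(f),
                     "label": names[best[f]],
                     "auroc": float(best_auroc[f])})
    return rows

toy_au = np.array([[0.95, 0.60],    # feature 0: clearly "exon"
                   [0.55, 0.52],    # feature 1: near chance, dropped
                   [0.70, 0.91]])   # feature 2: clearly "intron"
toy_ann = annotate_from_auroc(toy_au, ["exon", "intron"])
```

In this toy case only features 0 and 2 survive the threshold, so the table has two rows.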

Tests

No dedicated CI lane (deferred — see #1622). Run via the recipe:

cd interpretability/sparse_autoencoders/recipes/evo2
bash .ci_build.sh && source .ci_test_env.sh        # or: PYTHONPATH=src:../../sae/src
pytest tests/test_probing.py

12 passed (CPU, no model): AUROC vs a pairwise-definition oracle, domain_f1 vs a hand-computed reference, best_single winner's-curse flip, decode_eval separability, annotate_features best-concept, buffer roundtrip, tie-correct (average) ranks, degenerate-label / tie / sparse edge cases, and standardize's zero-variance floor.
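
For readers unfamiliar with the pairwise definition used as the oracle, a minimal sketch follows. The function name `auroc_pairwise` is hypothetical (the walkthrough below mentions a test oracle called `_auroc_ref`, whose code is not shown here); this version counts ties as half a win and returns 0.5 for degenerate labels, matching the behavior described in the commit messages.

```python
def auroc_pairwise(scores, labels):
    """Brute-force AUROC: P(pos score > neg score) + 0.5 * P(tie), over all pos/neg pairs."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return 0.5  # never/always-firing concept: score as chance
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc_pairwise([0.0, 0.0, 0.3, 0.9], [0, 1, 0, 1]))  # 0.625
```

The loop is O(n_pos × n_neg), which is fine for a test oracle and is exactly why the production path uses ranks instead.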

Why hand-rolled (not sklearn / torchmetrics) — checked, not a win

GPU-vectorized over the whole ~32k-feature dictionary in one pass; the library options are CPU and per-(scores, label), so a 32k-feature dictionary becomes a 32k-iteration CPU loop. Function by function:

  • auroc_all — no library computes a vectorized [features × labels] AUROC matrix on GPU. Kept.
  • domain_f1, best_single_train_test, annotate_features — bespoke (instance-F1, winner's-curse, per-feature assignment); no library equivalent.
  • fit_logreg / fit_softmax / decode_eval — the only sklearn-replaceable code, but they fit on the [N≈50k, F≈32k] SAE-code matrix, exactly where CodonFM hit the sklearn.LogisticRegression scaling wall and had to subsample to ≤5k features. Swapping reintroduces that coverage loss + a runtime dep. Net regression.
  • ActivationBuffer / split_indices / standardize: np.savez + tiny helpers; nothing to gain.

Conclusion: the module stays torch + numpy-only. Each metric is a standard formula (Mann–Whitney rank-AUROC, Adam BCE/softmax, instance-F1) vectorized for full-dictionary GPU scale, and each is validated against an independent reference in the tests.
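
The reason a full [features × labels] matrix is cheap can be sketched as follows: once each feature column is converted to ranks, the sum of positive ranks for every (feature, label) pair is a single matrix product. This is a NumPy illustration under stated assumptions, not the PR's torch code; the names `average_ranks` and `auroc_matrix` are hypothetical, and the per-column rank loop here stands in for whatever batched ranking the real module uses.

```python
import numpy as np

def average_ranks(x):
    """1-based ranks for a 1-D array; tied values share their mean rank."""
    _, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)                  # rank of the last member of each tie group
    return (last - (counts - 1) / 2.0)[inv]

def auroc_matrix(codes, labels):
    """AUROC of every feature (column of codes) against every label (column of labels)."""
    codes = np.asarray(codes, dtype=float)    # [N, F]
    y = np.asarray(labels, dtype=float)       # [N, L], entries 0/1
    ranks = np.stack([average_ranks(codes[:, f]) for f in range(codes.shape[1])], axis=1)
    n_pos = y.sum(axis=0)                     # [L]
    n_neg = y.shape[0] - n_pos
    rank_sum = ranks.T @ y                    # [F, L]: sum of ranks over positive rows
    u = rank_sum - n_pos * (n_pos + 1) / 2.0  # Mann-Whitney U statistic
    denom = n_pos * n_neg
    valid = denom > 0                         # labels with both classes present
    out = np.full(u.shape, 0.5)               # degenerate labels stay at chance
    out[:, valid] = u[:, valid] / denom[valid]
    return out
```

The per-pair cost collapses into `ranks.T @ y`, which is the part that maps naturally onto a GPU.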

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Contributor

📝 Walkthrough

This PR adds a comprehensive SAE feature-probing evaluation module (probing.py) to enable model-agnostic interpretation of learned features through metrics, classifiers, and annotation tools, along with an ActivationBuffer artifact for persistence and a full test suite validating correctness across all components.

Changes

SAE Probing Evaluation Suite

| Layer | File(s) | Summary |
|---|---|---|
| ActivationBuffer data structure and persistence | bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py (lines 1–65), bionemo-recipes/interpretability/sparse_autoencoders/sae/tests/test_probing.py (lines 123–142) | Dataclass storing SAE feature codes, per-token boolean labels and names, optional dense residuals, and concept-to-instance id mappings; .save() serializes to typed .npz with per-concept instance arrays; .load() reconstructs the dataclass; .name_idx property maps label names to column indices. |
| Dataset utilities and standardization | bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py (lines 73–84) | split_indices performs deterministic train/test splitting via seeded torch.randperm; standardize computes mean and std on training rows with epsilon-clamped std normalization. |
| AUROC computation and best-feature selection | bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py (lines 86–145), bionemo-recipes/interpretability/sparse_autoencoders/sae/tests/test_probing.py (lines 37–71) | auroc_all computes full [feature, label] AUROC matrix via chunked rank-statistics; auroc_vec handles single-vector AUROC with degenerate-case handling; best_single_train_test selects best feature on training set and reports test AUROC without winner's-curse bias; test oracle _auroc_ref validates against brute-force reference. |
| Feature concept annotation via AUROC thresholding | bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py (lines 147–174), bionemo-recipes/interpretability/sparse_autoencoders/sae/tests/test_probing.py (lines 110–121) | annotate_features derives per-feature best-label annotations by selecting max AUROC across labels and filtering by configurable AUROC threshold; excludes low-information features. |
| Linear classifier training and macro-AUROC evaluation | bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py (lines 176–226), bionemo-recipes/interpretability/sparse_autoencoders/sae/tests/test_probing.py (lines 89–108) | fit_logreg trains binary logistic regression; fit_softmax trains multinomial softmax; both use Adam with BCE-with-logits and cross-entropy respectively; macro_auroc computes macro one-vs-rest AUROC; decode_eval orchestrates training and dual metric reporting for test accuracy and macro AUROC. |
| Domain-adjusted F1 with instance-aware thresholding | bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py (lines 228–270), bionemo-recipes/interpretability/sparse_autoencoders/sae/tests/test_probing.py (lines 73–87) | domain_f1 computes threshold-swept per-feature F1 by normalizing activations per-feature, remapping instance ids, aggregating per-instance firing via index_reduce_, combining precision from concept masks with recall from instance aggregation, and selecting best F1 threshold per feature in chunked passes. |
| Module public API and test setup | bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/__init__.py (lines 25–71), bionemo-recipes/interpretability/sparse_autoencoders/sae/tests/test_probing.py (lines 1–35) | Imports and re-exports all probing.py utilities in __all__ for public access; test module imports and validates all components. |
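
The domain-F1 row above mixes two granularities (precision per token, recall per instance), which is easy to misread. A minimal sketch for one feature at one fixed threshold follows. The name `domain_f1_single` is hypothetical, the `-1` "no instance" convention is an assumption, and the per-feature normalization, threshold sweep, and chunking described in the summary are deliberately left out.

```python
import numpy as np

def domain_f1_single(act, in_concept, inst_ids, thr):
    """Hypothetical single-feature, single-threshold domain F1.

    act        [N] feature activations
    in_concept [N] bool, token lies inside the concept
    inst_ids   [N] int, instance id per token, -1 outside any instance (assumed)
    """
    act = np.asarray(act, dtype=float)
    in_concept = np.asarray(in_concept, dtype=bool)
    inst_ids = np.asarray(inst_ids)
    fires = act > thr
    n_fire = int(fires.sum())
    # Precision per token: share of firing tokens that fall inside the concept.
    precision = float((fires & in_concept).sum()) / n_fire if n_fire else 0.0
    # Recall per instance: share of instances hit by at least one firing token.
    all_inst = np.unique(inst_ids[inst_ids >= 0])
    hit_inst = np.unique(inst_ids[fires & (inst_ids >= 0)])
    recall = len(hit_inst) / len(all_inst) if len(all_inst) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two instances (ids 0 and 1); the feature fires on two tokens of instance 0
# and on one token outside the concept, and never on instance 1.
p, r, f1 = domain_f1_single(
    act=[0.9, 0.8, 0.0, 0.0, 0.7, 0.0],
    in_concept=[True, True, True, False, False, True],
    inst_ids=[0, 0, 0, -1, -1, 1],
    thr=0.5,
)
```

Here precision is 2/3 (two of three firing tokens are inside the concept) and recall is 1/2 (one of two instances is hit), so a feature is not penalized for marking only part of a long domain.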

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A warren of metrics, now bundled with care,
AUROC and F1 floating through air,
Buffers that save what the features unfold,
Linear probes seeking wisdom untold,
Domain-aware thresholds, adaptive and keen—
The richest of probing suites ever been seen! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 91.30% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately captures the main change: introducing probing primitives (eval metrics and ActivationBuffer) for SAE evaluation, which aligns with the core additions across three files.
Description check ✅ Passed The PR description provides a clear summary, usage examples, comprehensive explanation of included components, and justification for implementation choices. However, the template requires explicit marking of change type and CI configuration labels.


@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@polinabinder1 polinabinder1 changed the title from "sae: shared interpretability primitives (probing + steering)" to "sae: shared probing primitives (eval metrics + ActivationBuffer)" Jun 11, 2026
@polinabinder1

Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py (2)

54-65: 💤 Low value

allow_pickle=True poses a deserialization risk if loading untrusted files.

This is acceptable for internal artifacts but worth documenting. If these buffers might come from external sources, consider validating provenance or using a safer serialization format.

     @classmethod
     def load(cls, path: str) -> "ActivationBuffer":
-        """Load an ActivationBuffer from an .npz written by save()."""
+        """Load an ActivationBuffer from an .npz written by save().
+
+        Warning:
+            Uses allow_pickle=True; only load files from trusted sources.
+        """
         z = np.load(path, allow_pickle=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py`
around lines 54 - 65, The load method in ActivationBuffer uses np.load(...,
allow_pickle=True) which is unsafe for untrusted files; change load to avoid
allow_pickle=True by default (use allow_pickle=False) or add an explicit
parameter (e.g., allow_pickle: bool = False) and fail with a clear error if
pickled objects are required, and update the ActivationBuffer.load docstring to
document the deserialization risk and the need to validate provenance when
loading external files; ensure references to ActivationBuffer.load and the local
variable z are used to implement and surface the safer behavior.
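
As an illustration of the safer option raised in this finding, the sketch below round-trips a buffer-like .npz with `allow_pickle=False`. It assumes label names can be stored as a fixed-width unicode array rather than an object array; `save_buffer` and `load_buffer` are hypothetical helpers, not the `ActivationBuffer.save` / `.load` methods, and whether the real buffer's per-concept instance arrays fit this scheme is not verified here.

```python
import io
import numpy as np

def save_buffer(fileobj, codes, labels, names):
    """Hypothetical pickle-free save: names are stored as a unicode (not object) array."""
    np.savez(fileobj, codes=codes, labels=labels, names=np.array(names, dtype=np.str_))

def load_buffer(fileobj):
    """Load with allow_pickle=False, so a file containing object arrays is rejected."""
    with np.load(fileobj, allow_pickle=False) as z:
        return z["codes"], z["labels"], [str(n) for n in z["names"]]

buf = io.BytesIO()                       # in-memory stand-in for a file on disk
save_buffer(buf,
            np.zeros((4, 3), dtype=np.float32),
            np.ones((4, 2), dtype=bool),
            ["exon", "intron"])
buf.seek(0)
codes, labels, names = load_buffer(buf)
```

Numeric, boolean, and unicode arrays all load without pickle; only object-dtype arrays need `allow_pickle=True`.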

243-245: 💤 Low value

Consider adding a comment explaining the +2 sizing for the remap tensor.

The +2 accounts for 0-indexing and ensures negative indexing (-1) wraps to a valid buffer position. While correct, this is subtle:

-    remap = torch.full((int(inst_ids.max().item()) + 2,), -1, device=dev, dtype=torch.long)
+    # +2: one for 0-indexing, one so that -1 wraps to a valid (unused) slot
+    remap = torch.full((int(inst_ids.max().item()) + 2,), -1, device=dev, dtype=torch.long)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py`
around lines 243 - 245, Add an inline comment above the remap creation
explaining why the size is int(inst_ids.max().item()) + 2: we need +1 for
0-based indexing of the maximum id and an extra slot so that using -1 as a
sentinel (when indexing remap with potentially -1 inst_ids) will wrap to a valid
buffer position instead of raising an out-of-bounds error; reference the remap
tensor and the subsequent remap[uniq.long()] / remap[inst_ids.long()] usage (and
the torch.full default -1) so readers understand the sentinel handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py`:
- Around line 54-65: The load method in ActivationBuffer uses np.load(...,
allow_pickle=True) which is unsafe for untrusted files; change load to avoid
allow_pickle=True by default (use allow_pickle=False) or add an explicit
parameter (e.g., allow_pickle: bool = False) and fail with a clear error if
pickled objects are required, and update the ActivationBuffer.load docstring to
document the deserialization risk and the need to validate provenance when
loading external files; ensure references to ActivationBuffer.load and the local
variable z are used to implement and surface the safer behavior.
- Around line 243-245: Add an inline comment above the remap creation explaining
why the size is int(inst_ids.max().item()) + 2: we need +1 for 0-based indexing
of the maximum id and an extra slot so that using -1 as a sentinel (when
indexing remap with potentially -1 inst_ids) will wrap to a valid buffer
position instead of raising an out-of-bounds error; reference the remap tensor
and the subsequent remap[uniq.long()] / remap[inst_ids.long()] usage (and the
torch.full default -1) so readers understand the sentinel handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 23ddf87a-6a45-46a2-8264-db968ee016e5

📥 Commits

Reviewing files that changed from the base of the PR and between e407165 and 79df727.

📒 Files selected for processing (3)
  • bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/__init__.py
  • bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/eval/probing.py
  • bionemo-recipes/interpretability/sparse_autoencoders/sae/tests/test_probing.py

@polinabinder1

Collaborator Author

Addressed the two nitpicks in 57837ec7: documented the allow_pickle=True trust caveat on ActivationBuffer.load, and added a comment explaining the +2 remap-tensor sizing (index-by-max-id + sentinel headroom). Tests still green (6 passed).
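
The "+2" sizing can be shown with a small NumPy sketch. This mirrors the `torch.full((max_id + 2,), -1)` pattern quoted in the review but is not the PR's code; the example ids and the convention that `-1` marks tokens outside any instance are assumptions for illustration.

```python
import numpy as np

inst_ids = np.array([7, 7, -1, 42, -1, 7])    # -1: token outside any instance (assumed)
uniq = np.unique(inst_ids[inst_ids >= 0])      # real ids: [7, 42]

# Size max_id + 2: slots 0..max_id cover every real id, and the extra last slot
# is where index -1 lands, so the sentinel maps to -1 rather than to a real instance.
remap = np.full(int(inst_ids.max()) + 2, -1, dtype=np.int64)
remap[uniq] = np.arange(len(uniq))             # compact ids: 7 -> 0, 42 -> 1
dense = remap[inst_ids]
print(dense.tolist())                          # [0, 0, -1, 1, -1, 0]
```

With size `max_id + 1` instead, index `-1` would land on the slot of the largest real id in NumPy, so unlabeled tokens would be silently merged into that instance.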

@polinabinder1 polinabinder1 marked this pull request as ready for review June 12, 2026 05:32
polinabinder1 added a commit that referenced this pull request Jun 23, 2026
Re-lands #1629 (sae.eval.probing: AUROC / domain-F1 / linear probes + ActivationBuffer) onto
the post-#1633 top-level layout, and adds a dedicated CPU workflow (ubuntu-latest, no model/GPU)
that runs the model-agnostic probing tests. Separate from the evo2 GPU lane; the tensor-parallel
sae tests (torchrun/multi-GPU) are out of scope here.

Validated: tests/test_probing.py -> 6 passed (CPU).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@polinabinder1 polinabinder1 force-pushed the pbinder/sae-interp-primitives branch from 57837ec to 13a0690 June 23, 2026 06:06
polinabinder1 added a commit that referenced this pull request Jun 23, 2026
Re-lands #1630 on the post-#1633 layout, on top of the rebased #1629: the DNA label producers
(scripts/{labelers,annot_tracks,euk_windows}.py) that emit per-token concept labels (genes/exons/
motifs) to fill #1629's ActivationBuffer, + biopython dep (genetic code in labelers.py).

Validated: tests/{test_labelers,test_annot_tracks}.py -> 8 passed (CPU).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 23, 2026
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI
(scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an
ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked;
the GPU extract->score smoke is a follow-up (no unit tests in this PR yet).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

polinabinder1 and others added 3 commits June 24, 2026 04:13
Re-lands #1629 (sae.eval.probing: AUROC / domain-F1 / linear probes + ActivationBuffer) onto
the post-#1633 top-level layout, and adds a dedicated CPU workflow (ubuntu-latest, no model/GPU)
that runs the model-agnostic probing tests. Separate from the evo2 GPU lane; the tensor-parallel
sae tests (torchrun/multi-GPU) are out of scope here.

Validated: tests/test_probing.py -> 6 passed (CPU).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
auroc_all / auroc_vec / best_single / macro_auroc ranked via argsort().argsort(), giving tied
values arbitrary distinct ranks. SAE codes are sparse (heavy zero-mass), so that biased the AUROC
on the real data distribution — and the oracle test only covered randn (no ties). Switch to
average (Mann-Whitney) ranks via a vectorized searchsorted helper (keeps the all-features-at-once
speed that motivates hand-rolling), make the oracle tie-aware, and add sparse-tie +
constant-feature tests. Also documents why these metrics are hand-rolled.

tests/test_probing.py -> 8 passed (CPU).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
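
The tie problem this commit describes can be reproduced in a few lines. The sketch below contrasts `argsort().argsort()` ranks with average ranks built from two `searchsorted` calls, in NumPy rather than the torch helper the commit mentions; all function names here are hypothetical.

```python
import numpy as np

def ordinal_ranks(x):
    """argsort().argsort(): tied values receive arbitrary distinct ranks (1-based)."""
    return np.argsort(np.argsort(x, kind="stable"), kind="stable") + 1

def average_ranks(x):
    """Tie-aware ranks: mean of the first and last rank within each tie group."""
    s = np.sort(x)
    lo = np.searchsorted(s, x, side="left")    # number of values strictly below x
    hi = np.searchsorted(s, x, side="right")   # number of values <= x
    return (lo + 1 + hi) / 2.0

def rank_auroc(ranks, y):
    """Mann-Whitney AUROC from precomputed ranks and 0/1 labels."""
    y = np.asarray(y, dtype=bool)
    n_pos, n_neg = y.sum(), (~y).sum()
    return (ranks[y].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

x = np.array([0.0, 0.0, 0.0, 0.0, 0.5])        # sparse code: mostly zeros
y = np.array([1, 0, 0, 0, 1])
biased = rank_auroc(ordinal_ranks(x), y)
tie_aware = rank_auroc(average_ranks(x), y)
```

In this toy case the positive zero happens to sit first, so ordinal ranks give 0.5 while average ranks give 0.75, which is what the pairwise definition (ties counted as half) yields.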
…er None paths

- a never/always-firing concept -> AUROC 0.5 (the valid-mask branch; realistic for rare concepts)
- auroc_vec directly (was only tested transitively via best_single) on tied scores
- ActivationBuffer with no dense twin / no instances (the Optional -> None save/load paths)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
root and others added 2 commits June 24, 2026 04:13
standardize z-scores SAE codes for the linear/codon probes, where ~20% of latents are dead
(constant 0). Add a direct test that the 1e-6 std floor keeps those columns finite (no NaN into
the logreg fit) and that mean/std use the train rows only (no test-set leakage).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
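
A minimal NumPy sketch of the two properties this commit tests, the std floor and train-only statistics, is shown below. The real helpers use torch (the walkthrough mentions a seeded `torch.randperm`); the signatures, the `test_frac` parameter, and the use of `np.random.default_rng` here are assumptions for illustration. Only the 1e-6 floor comes from the commit message.

```python
import numpy as np

def split_indices(n, test_frac=0.2, seed=0):
    """Deterministic train/test split from a seeded permutation (hypothetical signature)."""
    perm = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_frac))
    return perm[n_test:], perm[:n_test]

def standardize(x, train_idx, eps=1e-6):
    """Z-score all rows with mean/std from the training rows only; std is floored at eps."""
    mu = x[train_idx].mean(axis=0)
    sd = np.maximum(x[train_idx].std(axis=0), eps)   # dead columns have std 0
    return (x - mu) / sd

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))
x[:, 2] = 0.0                                        # a dead latent: constant zero
train_idx, test_idx = split_indices(len(x))
z = standardize(x, train_idx)
```

Without the floor, the dead column would produce 0/0 = NaN, which would then propagate into the probe's loss.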
…ed sae lib); drop CI lane

These probing primitives (eval metrics + ActivationBuffer) are evo2-specific, so move them from the
shared sae library into the evo2_sae recipe package:
  * sae/src/sae/eval/probing.py            -> recipes/evo2/src/evo2_sae/eval/probing.py
  * new recipes/evo2/src/evo2_sae/eval/__init__.py (re-exports the probing API)
  * sae/src/sae/eval/__init__.py reverted (no longer exports probing — stays shared for esm2/codonfm)
  * sae/tests/test_probing.py              -> recipes/evo2/tests/test_probing.py (import evo2_sae.eval.probing)
Remove .github/workflows/unit-tests-sae.yaml (defer CI; run tests via the recipe's .ci_build.sh + pytest).
Re-parented onto #1622 so the evo2_sae package is available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@polinabinder1 polinabinder1 force-pushed the pbinder/sae-interp-primitives branch from 26dd036 to 73c261f June 24, 2026 04:16
@polinabinder1 polinabinder1 changed the base branch from main to pbinder/evo2-sae-serve June 24, 2026 04:16
@polinabinder1 polinabinder1 force-pushed the pbinder/sae-interp-primitives branch from 73c261f to 26dd036 June 24, 2026 04:19
@polinabinder1 polinabinder1 changed the base branch from pbinder/evo2-sae-serve to main June 24, 2026 04:19
@polinabinder1 polinabinder1 force-pushed the pbinder/sae-interp-primitives branch from 26dd036 to 73c261f June 24, 2026 04:24
@polinabinder1 polinabinder1 changed the base branch from main to pbinder/evo2-sae-serve June 24, 2026 04:24
@polinabinder1 polinabinder1 changed the title from "sae: shared probing primitives (eval metrics + ActivationBuffer)" to "evo2-sae: probing primitives (eval metrics + ActivationBuffer)" Jun 24, 2026
polinabinder1 added a commit that referenced this pull request Jun 24, 2026
Re-lands #1630 on the post-#1633 layout, on top of the rebased #1629: the DNA label producers
(scripts/{labelers,annot_tracks,euk_windows}.py) that emit per-token concept labels (genes/exons/
motifs) to fill #1629's ActivationBuffer, + biopython dep (genetic code in labelers.py).

Validated: tests/{test_labelers,test_annot_tracks}.py -> 8 passed (CPU).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 24, 2026
Re-lands #1636 on the post-#1633 layout, on top of rebased #1630: the harness/CLI
(scripts/{evo2_buffer,probe,probe_loss_recovered}.py) that runs the model to build an
ActivationBuffer (#1629) from #1630's labels and emits the probing metrics. Syntax-checked;
the GPU extract->score smoke is a follow-up (no unit tests in this PR yet).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 24, 2026
…; drop CI lane

Relocate #1636's probe harness from scripts/ into the evo2_sae.eval.probing package (alongside the
#1629 primitives, now the package __init__):
  scripts/{labelers,evo2_buffer,annot_tracks,euk_windows,probe,probe_loss_recovered}.py
    -> src/evo2_sae/eval/probing/*.py
Fix imports to package-relative (from . import labelers; from .evo2_buffer import ...) and pull the
primitives from evo2_sae.eval.probing; loss_recovered stays in the shared sae lib. Re-point the tests
at the package (drop the sys.path-into-scripts/ hack). Remove the CPU CI lane (defer; run via .ci_build.sh
+ pytest). Reparented onto the moved #1629.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>