evo2 SAE: inference engine + steering, CI lane, tests, Dockerfile #1622

Open
polinabinder1 wants to merge 13 commits into main from pbinder/evo2-sae-serve

Conversation

@polinabinder1

@polinabinder1 polinabinder1 commented Jun 10, 2026

Collaborator

Summary

The importable Evo2SAE inference engine + feature steering — the base of the serve stack — with tests and a runnable (layer-cached) Docker image. A single Evo2 inference engine is loaded once and serves both paths: encode reads the residual stream off a layer-L forward hook; generate drives the same model's decode with decode-only feature steering. No web/CLI here: the server + CLI (#1637), dashboard (#1623), and steering eval (#1635) build on it.

Rebased onto the post-#1633 top-level layout (interpretability/sparse_autoencoders/).

Architecture: one model, both paths

Earlier iterations loaded two copies of Evo2 — a truncated post_process=False model for encode/highlight and the full inference engine for generate (~1.8× the weights). This collapses to a single engine (infer.setup_inference_engine, run eager with cuda_graph_impl="none" so the steering hook applies):

  • load() builds the one engine and takes self.model = unwrap_model(comp.model) + comp.tokenizer from it.
  • encode/highlight (_forward_hidden) runs a normal full-sequence forward and reads layer L off a forward hook — the engine model is post_process=True (it produces logits for generation), so output_embeddings can't be used; the hook captures the same [S, B, H] module output the steering clamp_hook reads, so encode and steer see identical activations by construction.
  • generate steers on self.model.decoder.layers[L] — the same module encode reads.

Validated end-to-end on the 1B-8k-bf16 (21/21 tests, incl. a highlight↔steer interleaving test proving no state bleed between the shared model's encode forward and decode path). 7B fidelity is the remaining gate.

Contents

Engine + steering

  • src/evo2_sae/core.py (Evo2SAE): load → encode / encode_batch / feature_tracks / generate (decode-only clamp via sae.steering) + input-sanitization guards (_sanitize_steering: feature-id range, clamp-magnitude cap, non-finite/top_k/temperature coercion). encode_batch is length-bucketed: work is sorted by token length to minimize padding waste on mixed-length inputs, and results are restored to input order.
  • Load-time SAE/model fit check: load() verifies that the SAE's input_dim equals the model's hidden size (_model_hidden_size via config, or a 1-token forward) and raises a clear error on a mismatch ("wrong SAE/model pairing") instead of a cryptic matmul failure on the first encode. Known gap: a wrong layer number with the same hidden size can't be caught here because the SAE checkpoint records no training layer, so it silently yields out-of-distribution features. /health surfaces the configured layer, and stamping the layer into the checkpoint at train time is a follow-up.
  • sae/src/sae/steering.py — model-agnostic delta-clamp hook + steer().
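
For readers unfamiliar with length bucketing, here is a minimal sketch of the idea behind encode_batch as described above: sort by length, run batches, then scatter results back to input order. It is not the PR's implementation. The names encode_batch_bucketed and fake_encode are hypothetical, and fake_encode only reports the padded width of its batch instead of running a model.

```python
from typing import Callable, List, Sequence, Tuple


def encode_batch_bucketed(
    seqs: Sequence[str],
    encode_padded: Callable[[List[str]], list],
    batch_size: int = 2,
) -> list:
    """Run `encode_padded` over length-sorted batches and restore input order."""
    # Sort indices (not the sequences) so we remember where each result belongs.
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    results = [None] * len(seqs)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        outputs = encode_padded([seqs[i] for i in idx])
        # Scatter each output back to the slot of its original input.
        for i, out in zip(idx, outputs):
            results[i] = out
    return results


def fake_encode(batch: List[str]) -> List[Tuple[str, int]]:
    """Hypothetical stand-in for the model call: every row pads to the longest."""
    width = max(len(s) for s in batch)
    return [(s, width) for s in batch]


seqs = ["ACGTACGT", "AC", "ACGTACG", "A"]
results = encode_batch_bucketed(seqs, fake_encode, batch_size=2)
padded_tokens = sum(width for _, width in results)
```

With these four inputs the bucketed run pads to 20 tokens in total, whereas batching in input order would pad to 30, and results still lines up with seqs position by position.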

Build / run / CI

  • .ci_build.sh (env | install | all) + .ci_test_env.sh — build the env by delegating to evo2_megatron's own build (no fork of the pinned megatron stack), then install sae + this recipe into that venv. The phase arg lets the Dockerfile cache the two steps separately.
  • Dockerfile — thin, non-forking, layer-cached: the ~30-min mbridge megatron build is its own layer (depends only on recipes/evo2_megatron), and the SAE source + editable installs are a separate trailing layer — so editing engine/SAE code rebuilds only the cheap install layer, not megatron. (+ a per-Dockerfile .dockerignore.)
  • tests/conftest.py — 1B-8k-bf16 fixture (bionemo_loadrun_nemo2_to_mbridge) + a synthesized tiny SAE, GPU-memory-gated; honors EVO2_CKPT_DIR / SAE_CKPT_PATH for manual / 7B runs. The GPU tests are gated by @pytest.mark.skipif(not torch.cuda.is_available()), so they run on a GPU box and skip otherwise.

Dependency on bionemo.evo2

The engine reuses bionemo.evo2's model code (the mbridge recipes/evo2_megatron recipe), which isn't pip-installable. .ci_build.sh (and the Dockerfile) build it via evo2_megatron's own script; it's intentionally not in pyproject.toml, matching the codonfm/esm2 recipes (base model is environment-provided).

How to run

# Build once from the repo root, then run with a GPU:
docker build -f interpretability/sparse_autoencoders/recipes/evo2/Dockerfile -t evo2-sae .
docker run --gpus all -it evo2-sae bash -lc "source .ci_test_env.sh && pytest tests/"

On build time / making it easier. The engine needs bionemo.evo2 (the mbridge
evo2_megatron recipe), which isn't pip-installable — so the first docker build
compiles the full megatron stack (megatron-bridge, causal-conv1d, …) and takes ~30 min.
After that, the build is layer-cached: editing engine/SAE code re-runs only the two
editable pip installs (seconds), not the megatron compile. Other shortcuts:

  • Build once, reuse: push the built image to a registry; coworkers docker run it and never rebuild.
  • Skip the compile: the Dockerfile's ARG BASE_IMAGE can point at a prebuilt evo2_megatron / bionemo image once one exists — the build then collapses to just the two pip installs.
  • No container at all (dev): inside an existing megatron env, pip install -e sae/ && pip install -e recipes/evo2/ (what local validation does).

from evo2_sae import Evo2SAE
eng   = Evo2SAE(evo2_ckpt_dir, sae_ckpt_path, layer=19).load()    # 1B layer 19 (7B: 26)
codes = eng.encode("ATGGCC...")                                    # [S, n_features], sparse (TopK)
out   = eng.generate(prompt="ATGGCC...", features=[{"feature_id": 123, "strength": 200}])
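
The features list passed to generate above is user input, so it is worth checking before it indexes any SAE tensor. The following is a rough sketch of that kind of guard, assuming the same {"feature_id", "strength"} shape; it is not the PR's _sanitize_steering. The function name validate_features, the default cap of 1e4, and the choice to raise on non-finite strengths are all assumptions made for illustration.

```python
import math
from typing import Any, Dict, List, Tuple


def validate_features(
    features: List[Dict[str, Any]],
    n_features: int,
    max_strength: float = 1e4,  # hypothetical cap, not the PR's value
) -> List[Tuple[int, float]]:
    """Return (feature_id, strength) pairs, raising ValueError on bad input."""
    specs = []
    for f in features:
        fid = f.get("feature_id")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(fid, bool) or not isinstance(fid, int):
            raise ValueError(f"feature_id must be an integer, got {fid!r}")
        if not 0 <= fid < n_features:
            raise ValueError(f"feature_id {fid} out of range [0, {n_features})")
        strength = float(f.get("strength", 1.0))
        if not math.isfinite(strength):
            raise ValueError(f"strength for feature {fid} must be finite")
        # Cap the magnitude so an extreme clamp cannot blow up activations.
        strength = max(-max_strength, min(max_strength, strength))
        specs.append((fid, strength))
    return specs


specs = validate_features([{"feature_id": 123, "strength": 200}], n_features=1000)
```

Raising ValueError keeps the guard framework-agnostic; a web layer can translate it into a 400 response.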

Tests

There's no dedicated CI lane right now (deferred — it should later fold into the repo-wide recipe lane, which already runs .ci_build.sh + pytest). Run them manually:

cd interpretability/sparse_autoencoders/recipes/evo2
bash .ci_build.sh && source .ci_test_env.sh   # build + activate the megatron venv
pytest tests/

  • CPU (no model): test_core.py (engine plumbing: top_features, _load_sae, generate guards, the SAE/model dim check, encode_batch length-bucketing order) + test_steering.py sanitize guards + sae/tests/test_steering.py (exact clamp math). Quick CPU-only run without the venv: PYTHONPATH=src:../../sae/src pytest tests/test_core.py.
  • GPU: test_steering.py — bf16 encode, generation in-distribution, steering changes the continuation (+ compare_baseline), batched/empty-sequence encode, max-clamp stays finite, and highlight↔steer interleaving (encode bit-identical across a steered generate; baseline unaffected by history). Gated by @pytest.mark.skipif(not torch.cuda.is_available()) — runs on a GPU box (megatron venv); set EVO2_CKPT_DIR/SAE_CKPT_PATH for a specific model, else the fixtures build the 1B-8k-bf16 + a synthesized SAE.

Base of

#1637 (server) → #1623 (dashboard), and #1635 (steering eval).


Note: recipes/evo2/ is co-owned with the eval stack (#1636)

This PR owns the recipe's Dockerfile / .ci_build.sh / src/evo2_sae + tests/conftest.py; the eval stack (#1636) adds scripts/ (labelers, probe harness) and its biopython/pyrodigal deps to the same recipes/evo2/. The eval branch is pre-reconciled against this PR (verified clean with git merge-tree): it keeps [tool.setuptools] packages = [] (so this PR's where = ["src"] wins at merge), carries a byte-identical pytest-markers block, and has no conftest.py. No change needed here — just merge order awareness, and pip install -e recipes/evo2 (in .ci_build.sh) will install the eval deps automatically once both land.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Contributor

Review Change Stack

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e588a942-69f4-41f6-83f6-7516464f3c2e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR introduces sparse autoencoder (SAE) feature steering capabilities for the Evo2 foundation model, along with a complete inference recipe. It adds a reusable clamp_hook steering primitive that injects only the delta between clamped and original SAE reconstructions, applies it to a new Evo2SAE inference engine supporting encoding and steered generation, and refactors FASTA parsing into a shared utility.
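
The delta-injection arithmetic summarized above can be illustrated with a toy example. This is a sketch only: plain Python lists stand in for tensors, the encoder uses ReLU where the real SAE is TopK, and the weights W_ENC and W_DEC are made up. It shows why injecting only the difference between the clamped and original reconstructions leaves the activation untouched when the clamp is a no-op, even though the SAE reconstructs imperfectly.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix `m` (rows x cols) by vector `v` (cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode(w_enc: Matrix, x: Vector) -> Vector:
    """Toy SAE encoder: linear map followed by ReLU (an assumption)."""
    return [max(0.0, c) for c in matvec(w_enc, x)]


def delta_clamp(x: Vector, w_enc: Matrix, w_dec: Matrix, clamps: Dict[int, float]) -> Vector:
    """Return x plus the change in reconstruction caused by clamping features."""
    codes = encode(w_enc, x)
    clamped = list(codes)
    for fid, strength in clamps.items():
        clamped[fid] = strength
    recon = matvec(w_dec, codes)
    recon_clamped = matvec(w_dec, clamped)
    # Only the difference is injected, so reconstruction error cancels out.
    return [xi + (rc - r) for xi, r, rc in zip(x, recon, recon_clamped)]


# Hypothetical 2-dim hidden state with 3 SAE features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # features x hidden
W_DEC = [[0.5, 0.0, 0.25], [0.0, 0.5, 0.25]]    # hidden x features
X = [1.0, 2.0]

steered = delta_clamp(X, W_ENC, W_DEC, {2: 5.0})   # feature 2 is 3.0 before clamping
unchanged = delta_clamp(X, W_ENC, W_DEC, {2: 3.0}) # clamp to its current value
```

Here steered equals X plus (5.0 - 3.0) times the decoder column of feature 2, and unchanged equals X exactly, which mirrors the two properties the unit tests are described as checking.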

Changes

Evo2 SAE Steering and Inference Recipe

| Layer / File(s) | Summary |
| --- | --- |
| Build configuration and runtime dependencies (bionemo-recipes/.../evo2/pyproject.toml) | Added fastapi>=0.110, uvicorn>=0.29, pandas>=1.5 to project dependencies and enabled setuptools package discovery under src/. |
| SAE feature steering primitives (bionemo-recipes/.../sae/src/sae/steering.py) | New clamp_hook forward hook re-encodes activations through the SAE, clamps specified feature codes, and injects the delta between clamped and original decoded outputs. steer context manager registers/removes the hook reliably. |
| SAE steering unit tests (bionemo-recipes/.../sae/tests/test_steering.py) | Validates delta-clamp correctness (no-op leaves unchanged, real clamp matches analytic delta), tuple output isolation (only hidden state modified), and decode-only mode (skips prefill). |
| Evo2SAE package API and lazy loading (bionemo-recipes/.../evo2/src/evo2_sae/__init__.py) | Public API via __all__ constraint on Evo2SAE, clean_dna, DEFAULT_ORGANISM_TAGS with module-level __getattr__ for lazy core module loading. |
| Evo2SAE core inference engine (bionemo-recipes/.../evo2/src/evo2_sae/core.py) | Main Evo2SAE class loads truncated Evo2 model + SAE checkpoint, supports tokenization, single/batch encoding to SAE codes, feature extraction, top-k feature selection, and generation with optional decode-time SAE steering via delta injection. |
| Shared FASTA parsing utility (bionemo-recipes/.../evo2/src/evo2_sae/fasta.py) | Streaming read_fasta() reader transparently supports plain and gzip-compressed FASTA, yields (seq_id, sequence) tuples, and auto-generates sequential IDs for headerless records. |
| FASTA integration in chunk script (bionemo-recipes/.../evo2/scripts/chunk_fasta.py) | Updated to use shared read_fasta() from evo2_sae.fasta instead of local parse_fasta(). |
| Evo2 SAE recipe integration tests (bionemo-recipes/.../evo2/tests/test_steering.py) | CPU test validates clamp-hook arithmetic. GPU tests verify encode produces finite positive codes, unsteered generation produces valid DNA (ACGTN), and steering changes continuation deterministically. |
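
As an illustration of the streaming FASTA reader described in the summary, here is a minimal sketch of a generator with the same stated behavior: plain or gzip input, (seq_id, sequence) tuples, and generated ids for headerless records. It is not the code in evo2_sae/fasta.py. Detecting gzip by the .gz suffix and the seq_N id format are assumptions.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (seq_id, sequence) from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open  # suffix check is an assumption
    seq_id, chunks, n_records = None, [], 0
    with opener(path, "rt") as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                # A bare ">" header has no token, so fall back to a generated id.
                parts = line[1:].split()
                seq_id = parts[0] if parts else f"seq_{n_records}"
                n_records += 1
                chunks = []
            else:
                if seq_id is None:  # sequence data before any header
                    seq_id = f"seq_{n_records}"
                    n_records += 1
                chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Small demo: the same records written as plain text and as gzip.
text = "> \nACGT\nAC\n>chr1 some description\nGG\n"
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "demo.fa"
    plain.write_text(text)
    zipped = Path(tmp) / "demo.fa.gz"
    with gzip.open(zipped, "wt") as fh:
        fh.write(text)
    records_plain = list(read_fasta(str(plain)))
    records_gz = list(read_fasta(str(zipped)))
```

Because it is a generator, only one record is held in memory at a time, which matters for genome-scale inputs.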

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA-BioNeMo/bionemo-framework#1621: Both PRs modify the Evo2 recipe's FASTA chunking pipeline by refactoring parse_fasta logic in scripts/chunk_fasta.py; this PR removes the local parser in favor of a shared evo2_sae.fasta.read_fasta utility.

Suggested labels

ciflow:all

Suggested reviewers

  • jstjohn
  • pstjohn
  • jwilber
  • trvachov

🐰 A SAE hook so clever, it clamps with a delta,
Evo2 now steers genes with precision so bright—
Encoding and clamping, a feature to frame,
Generation refined: DNA shaped just right! 🧬

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.83% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title 'evo2 SAE: inference engine + steering, CI lane, tests, Dockerfile' accurately and specifically summarizes the main components added in this PR. |
| Description check | ✅ Passed | PR description provides comprehensive technical overview with architecture, contents, dependencies, usage examples, and tests, but lacks explicit mapping to template sections. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands.

@jwilber

jwilber commented Jun 10, 2026

Collaborator

Have you tried this? Any examples or screenshots you can share?

polinabinder1 added a commit that referenced this pull request Jun 10, 2026
…ashboard.py

- Remove the committed sample parquets; the dashboard now reads atlas data the user provides
  (gitignored public/*.parquet). It does NOT generate — generation is a separate offline step.
- Add scripts/launch_dashboard.py: validate the 3 atlas parquets in --data-dir (exist +
  feature_id schema, fail fast) -> stage into feature_explorer/public/ -> start Vite. Mirrors
  the codonfm/esm2 launch_dashboard convention; engine-free (stdlib + pyarrow), so this PR stays
  a pure front-end (runtime dep on the #1622 server only).
- Fix stale refs (evo2_sae_infer -> evo2_sae, steering_server.py -> server.py, layer 19 -> 26).
- tests/test_launch_dashboard.py (CPU): staging copies the parquets; missing file -> FileNotFoundError;
  wrong schema -> ValueError. 3 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@polinabinder1

polinabinder1 commented Jun 10, 2026

Collaborator Author

@jwilber This only deals with the steering backend. The visualization is in PR 1623.

polinabinder1 added a commit that referenced this pull request Jun 10, 2026
Users pick from a preset library or paste sequences; the backend embeds them live
(Evo2 -> layer-L -> SAE, mean/max-pooled per sequence) and the client UMAPs them, recoloring
by feature. SequenceUMAPView.jsx (umap-js, already a dep) + the 'sequmap' tab + a small preset
sequence_library.json. Needs the /gene_embed endpoint on the server (added in #1622).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@polinabinder1 polinabinder1 force-pushed the pbinder/evo2-sae-serve branch from 91d1e30 to de81106 on June 11, 2026 00:28
polinabinder1 added a commit that referenced this pull request Jun 11, 2026
React/Vite dashboard for the evo2 SAE — three tabs (Feature atlas, Generative steering,
Sequence inspector) plus a feature-detail drill-down. Front-end only: the atlas tab reads
static parquet (works with no backend); the inspector + steering tabs call the live engine
(`launch_inference.sh serve`, #1622) through the Vite /api -> :8001 proxy. Runtime dependency
on the server only — no code dependency, so it merges independently of #1622.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 11, 2026
…ashboard.py

- Remove the committed sample parquets; the dashboard now reads atlas data the user provides
  (gitignored public/*.parquet). It does NOT generate — generation is a separate offline step.
- Add scripts/launch_dashboard.py: validate the 3 atlas parquets in --data-dir (exist +
  feature_id schema, fail fast) -> stage into feature_explorer/public/ -> start Vite. Mirrors
  the codonfm/esm2 launch_dashboard convention; engine-free (stdlib + pyarrow), so this PR stays
  a pure front-end (runtime dep on the #1622 server only).
- Fix stale refs (evo2_sae_infer -> evo2_sae, steering_server.py -> server.py, layer 19 -> 26).
- tests/test_launch_dashboard.py (CPU): staging copies the parquets; missing file -> FileNotFoundError;
  wrong schema -> ValueError. 3 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 11, 2026
Users pick from a preset library or paste sequences; the backend embeds them live
(Evo2 -> layer-L -> SAE, mean/max-pooled per sequence) and the client UMAPs them, recoloring
by feature. SequenceUMAPView.jsx (umap-js, already a dep) + the 'sequmap' tab + a small preset
sequence_library.json. Needs the /gene_embed endpoint on the server (added in #1622).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@polinabinder1

Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 10

🧹 Nitpick comments (8)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py (1)

37-50: ⚖️ Poor tradeoff

Consider more portable default paths.

Similar to the shell script, the default checkpoint and annotation paths are hardcoded to /data/interp/evo2/... which won't exist for other users. While these can be overridden via CLI arguments or environment variables (making this less critical than the shell script issue), consider removing these hardcoded defaults or documenting the required setup clearly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`
around lines 37 - 50, Default file paths for CLI args (--sae-ckpt-path,
--feature-annotations and EVO2_CKPT_DIR env fallback) are hardcoded to
/data/interp/evo2/...; remove or replace these with portable defaults by making
the argparse defaults None (or point to a user/home-relative path) and rely on
environment variables (SAE_CKPT_PATH, FEATURE_ANNOTATIONS, EVO2_CKPT_DIR) or
explicit CLI input, and update the code that consumes these values (where these
args are referenced) to validate and raise a clear error if no path is provided;
target the add_argument calls for "--sae-ckpt-path", "--feature-annotations" and
the EVO2_CKPT_DIR default.
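
One possible shape for the portable defaults suggested above, as a sketch rather than the actual cli.py: argparse defaults come from the environment (or stay None), and a missing required path fails with a message naming the flag. The --evo2-ckpt-dir flag name, the parse_args helper, and treating --feature-annotations as optional are assumptions.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def build_parser(env: Mapping[str, str] = os.environ) -> argparse.ArgumentParser:
    """Register path arguments whose defaults come only from the environment."""
    p = argparse.ArgumentParser(description="evo2 SAE CLI (illustrative sketch)")
    p.add_argument("--evo2-ckpt-dir", default=env.get("EVO2_CKPT_DIR"))
    p.add_argument("--sae-ckpt-path", default=env.get("SAE_CKPT_PATH"))
    # Assumed optional here: annotations only add labels to features.
    p.add_argument("--feature-annotations", default=env.get("FEATURE_ANNOTATIONS"))
    return p


def parse_args(
    argv: Optional[Sequence[str]] = None,
    env: Mapping[str, str] = os.environ,
) -> argparse.Namespace:
    """Parse arguments and fail fast when a required path is missing."""
    args = build_parser(env).parse_args(argv)
    missing = [n for n in ("evo2_ckpt_dir", "sae_ckpt_path") if getattr(args, n) is None]
    if missing:
        flags = ", ".join("--" + n.replace("_", "-") for n in missing)
        raise SystemExit(f"missing required path(s): {flags} (or set the matching env var)")
    return args
```

For example, parse_args(["--sae-ckpt-path", "sae.pt"], env={"EVO2_CKPT_DIR": "/ckpt"}) resolves both paths, while parse_args([], env={}) exits with an error that names both flags.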
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py (7)

51-55: ⚡ Quick win

Add field docstrings to FeatureClamp.

📝 Example enhancement
 class FeatureClamp(BaseModel):
     """A single SAE-feature steering clamp (feature id + target strength)."""
 
-    feature_id: int
-    strength: float = 1.0
+    feature_id: int
+    """SAE feature ID to clamp during generation."""
+    strength: float = 1.0
+    """Target activation strength for the feature."""

As per coding guidelines, use Google-style docstrings (pydocstyle convention) in Python code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 51 - 55, Add Google-style (pydocstyle) docstrings describing each
field on the FeatureClamp Pydantic model: update the class docstring for
FeatureClamp (subclassing BaseModel) to include an Args section documenting
feature_id (int) and strength (float) with concise descriptions and
units/semantics (e.g., feature index and target steering strength, default 1.0).
Keep the top-line summary intact and ensure the Args block follows Google style
so linters accept it.

Source: Coding guidelines


58-68: ⚡ Quick win

Add field docstrings to GenerateRequest.

📝 Example enhancement
 class GenerateRequest(BaseModel):
     """Request body for /generate (autoregressive generation + optional SAE-feature clamps)."""
 
-    prompt: str = ""
-    organism: str = "None (raw DNA)"
-    tag: Optional[str] = None
-    features: list[FeatureClamp] = []
-    n_tokens: int = 120
-    temperature: float = 1.0
-    top_k: int = 0
-    compare_baseline: bool = False
+    prompt: str = ""
+    """Initial DNA sequence to condition generation."""
+    organism: str = "None (raw DNA)"
+    """Organism identifier for phylogenetic tagging."""
+    tag: Optional[str] = None
+    """Custom phylogenetic tag (overrides organism lookup)."""
+    features: list[FeatureClamp] = []
+    """SAE feature clamps for steering generation."""
+    n_tokens: int = 120
+    """Number of tokens to generate."""
+    temperature: float = 1.0
+    """Sampling temperature (higher = more random)."""
+    top_k: int = 0
+    """Top-k sampling parameter (0 = disabled)."""
+    compare_baseline: bool = False
+    """Whether to generate an unsteered baseline for comparison."""

As per coding guidelines, use Google-style docstrings (pydocstyle convention) in Python code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 58 - 68, Add Google-style (pydocstyle) docstrings for the
GenerateRequest datamodel: add a class docstring describing the purpose of
GenerateRequest and include an Args section that documents each attribute
(prompt, organism, tag, features: list[FeatureClamp], n_tokens, temperature,
top_k, compare_baseline) with types and brief descriptions (e.g., prompt: input
sequence string; organism: organism context or "None (raw DNA)"; tag: optional
user tag; features: SAE FeatureClamp list used for clamping; n_tokens: number of
tokens to generate; temperature: sampling temperature; top_k: top-k sampling
value; compare_baseline: whether to compare to baseline). Ensure the formatting
follows Google-style pydocstyle conventions and place the docstring immediately
under the class GenerateRequest declaration.

Source: Coding guidelines


39-48: ⚡ Quick win

Add field docstrings to AnnotateRequest.

The class is missing Google-style field docstrings. Each field should document its purpose, especially fields like mode that have specific allowed values ("topk" | "pick").

📝 Example enhancement
 class AnnotateRequest(BaseModel):
     """Request body for /annotate (top-k feature scan or an explicit feature pick)."""
 
-    sequence: str
-    organism: str = "None (raw DNA)"
-    tag: Optional[str] = None
-    mode: str = "topk"  # "topk" | "pick"
-    k: int = 8
-    feature_ids: Optional[list[int]] = None
-    feature_id: Optional[int] = None
+    sequence: str
+    """DNA sequence to annotate."""
+    organism: str = "None (raw DNA)"
+    """Organism identifier for phylogenetic tagging."""
+    tag: Optional[str] = None
+    """Custom phylogenetic tag (overrides organism lookup)."""
+    mode: str = "topk"
+    """Feature selection mode: 'topk' (top-k scan) or 'pick' (explicit features)."""
+    k: int = 8
+    """Number of top features to return when mode='topk'."""
+    feature_ids: Optional[list[int]] = None
+    """Explicit feature IDs when mode='pick'."""
+    feature_id: Optional[int] = None
+    """Single feature ID when mode='pick' (alternative to feature_ids)."""

As per coding guidelines, use Google-style docstrings (pydocstyle convention) in Python code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 39 - 48, The AnnotateRequest Pydantic model lacks Google-style
field docstrings; update the class docstring for AnnotateRequest to include a
Google-style "Attributes:" section that documents each field (sequence,
organism, tag, mode, k, feature_ids, feature_id), describing purpose,
types/constraints and allowed values for mode ("topk" | "pick") and any
relationships (e.g., feature_ids vs feature_id) so readers and linters can
validate the field meanings. Ensure the docstring follows pydocstyle/Google
conventions and mentions defaults where relevant.

Source: Coding guidelines


99-107: ⚡ Quick win

Add return type hint to features endpoint.

     @app.get("/features")
-    def features():
+    def features() -> list[dict]:

As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 99 - 107, The endpoint function features lacks a return type hint;
update its signature to include a typed return such as def features() ->
List[Dict[str, Any]]: and add the necessary imports (from typing import List,
Dict, Any) at the top of the module, or alternatively define and use a pydantic
model and set response_model on `@app.get`; modify the function signature and
imports so Pyright type checking passes while keeping the existing logic in
features().

Source: Coding guidelines


109-154: ⚡ Quick win

Add return type hint to annotate endpoint.

     @app.post("/annotate")
-    def annotate(req: AnnotateRequest):
+    def annotate(req: AnnotateRequest) -> dict:

As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 109 - 154, The annotate endpoint lacks a return type hint which
fails Pyright checks; update the annotate function signature (def annotate(req:
AnnotateRequest)) to include an explicit return type like -> Dict[str, Any] (or
a proper TypedDict/AnnotateResponse if available), and add the corresponding
typing import (e.g., from typing import Dict, Any) at the top of the module so
Pyright accepts the annotated return for the function annotate and its returned
JSON structure.

Source: Coding guidelines


86-97: ⚡ Quick win

Add return type hint to health endpoint.

     @app.get("/health")
-    def health():
+    def health() -> dict:

As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 86 - 97, The health endpoint lacks a return type hint; update the
health function signature (def health) to declare a typed return such as ->
Dict[str, Any] or -> dict[str, Any] and add the corresponding import (from
typing import Any, Dict) so Pyright can validate the returned mapping built from
engine (engine.ready, engine.layer, engine.n_features, engine.labels,
engine.sae_ckpt_path, engine.organism_tags, engine.device); keep the returned
structure unchanged and ensure the type hint covers the mixed value types.

Source: Coding guidelines


156-172: ⚡ Quick win

Add return type hint to generate endpoint.

     @app.post("/generate")
-    def generate(req: GenerateRequest):
+    def generate(req: GenerateRequest) -> dict:

As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 156 - 172, Add an explicit return type annotation to the FastAPI
endpoint function generate (def generate(req: GenerateRequest) -> Any) and
import Any from typing; update the signature so Pyright knows the endpoint's
return type (e.g., def generate(req: GenerateRequest) -> Any:), leaving the body
and exception handling (engine.generate call and HTTPException raises)
unchanged.

Source: Coding guidelines
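
If a bare dict return hint turns out to be too loose for Pyright, a TypedDict is one stdlib-only alternative for the endpoint hints suggested above. The sketch below covers a subset of the /health fields named in the review. The field types, the HealthResponse and health_payload names, and the demo engine values are assumptions, not taken from server.py.

```python
from types import SimpleNamespace
from typing import Any, TypedDict


class HealthResponse(TypedDict):
    """Assumed shape of part of the /health payload."""

    ready: bool
    layer: int
    n_features: int
    device: str


def health_payload(engine: Any) -> HealthResponse:
    """Build the typed payload from an engine-like object."""
    return {
        "ready": bool(engine.ready),
        "layer": int(engine.layer),
        "n_features": int(engine.n_features),
        "device": str(engine.device),
    }


# Hypothetical stand-in for a loaded engine.
demo_engine = SimpleNamespace(ready=True, layer=19, n_features=16, device="cuda:0")
payload = health_payload(demo_engine)
```

A type checker can then flag a misspelled or missing key in the returned mapping, which a plain dict annotation would not catch.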

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh`:
- Around line 17-21: The script currently embeds development-only absolute
defaults for VENV, EVO2_CKPT_DIR, SAE_CKPT_PATH, and FEATURE_ANNOTATIONS which
will break elsewhere; remove those hardcoded paths and instead either (a) set
VENV to a relative default like RECIPE_DIR/.venv and leave
EVO2_CKPT_DIR/SAE_CKPT_PATH/FEATURE_ANNOTATIONS unset, or (b) require these env
vars be provided and add an explicit validation block that checks VENV,
EVO2_CKPT_DIR, SAE_CKPT_PATH, and FEATURE_ANNOTATIONS (while allowing
EMBEDDING_LAYER to keep a sane numeric default), and if any are missing print a
clear error naming the missing variable(s) and exit non‑zero; update the code
references to VENV, EVO2_CKPT_DIR, SAE_CKPT_PATH, FEATURE_ANNOTATIONS, and
EMBEDDING_LAYER accordingly.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`:
- Around line 57-67: Add a Google-style docstring to the _engine function
describing its purpose, parameters, and return value: explain that _engine
constructs and returns an Evo2SAE instance, document each parameter passed to
Evo2SAE (evo2_ckpt_dir, sae_ckpt_path, layer, device, max_seq_len,
feature_annotations) with types and brief descriptions, and state the return
type (Evo2SAE). Place the docstring immediately below the def _engine(args):
line using the standard Google style (Args:, Returns:) so tools and linters can
pick it up.
- Around line 34-55: The function _add_common is missing a Google-style
docstring; add a concise Google-style docstring immediately below the def
_add_common(p: argparse.ArgumentParser) -> None: line describing the function’s
purpose (registers shared CLI arguments), the parameter p (an
argparse.ArgumentParser), and any side effects/returns (modifies the parser in
place, returns None). Use the Google docstring sections: Args and Returns, and
keep wording aligned with surrounding code style.
- Around line 70-87: Add a Google-style docstring to _read_fasta describing
parameters (path), return values (ids, seqs), behavior (supports gzipped files)
and exceptions; and fix the header-parsing edge case by replacing the brittle
line that does line[1:].split()[0] with logic that strips the leading ">" and
whitespace, uses .split() safely (e.g., parts = line[1:].strip().split(); name =
parts[0] if parts else f"seq_{len(ids)}") so headers like "> " don't raise
IndexError and still produce a generated id when no token is present.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/core.py`:
- Around line 168-189: The docstring and logic in the method that reads
self.feature_annotations (variables: labels, peaks, path, path.suffix) claim to
support parquet/tsv/csv/json but only handle parquet; update the code in the
function (the block starting with path = Path(self.feature_annotations)) to
detect .csv/.tsv (use csv or pandas.read_csv), .json (json.load or
pandas.read_json), and parse the same columns ("feature_id", "label" or
"annotation", "max_activation") into labels and peaks just like the parquet
branch, and for any other suffix emit an explicit logger.warning stating the
format is unsupported and return empty labels/peaks; ensure you reuse the same
keys/behavior (casting ids to int, labels to str, peaks to float) as done in the
pq branch so the rest of the code remains compatible.
- Around line 352-366: The code indexes SAE tensors using incoming feature IDs
(see fids, features and usages of self.sae.encoder.weight /
self.sae.decoder.weight) without validation; add explicit bounds and type checks
before any tensor indexing inside the block that builds specs (validate each fid
is an integer >=0 and < self.sae.encoder.weight.size(0) and similarly valid for
decoder indexing), and if invalid raise a ValueError with a clear message so the
/generate handler returns 400; perform these checks at the start of the with
self._lock block (before accessing self.sae.* tensors) or filter/convert
f["feature_id"] to int safely and validate before using it in specs construction
(references: fids, features, self.sae.encoder.weight, self.sae.decoder.weight,
self.layer).
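
A minimal sketch of the feature-id validation requested above, assuming the number of SAE features is passed in as `n_features` (in the real engine it would presumably come from something like `self.sae.encoder.weight.size(0)`). The helper name `validate_feature_ids` is hypothetical and not part of the PR.

```python
def validate_feature_ids(features: list[dict], n_features: int) -> list[int]:
    """Check every requested feature id before any tensor indexing.

    Args:
        features: Clamp specs, each a dict with a "feature_id" key.
        n_features: Number of SAE features (valid ids are 0..n_features-1).

    Returns:
        The validated ids, in request order.

    Raises:
        ValueError: If an id is missing, not an integer, or out of range.
    """
    fids: list[int] = []
    for spec in features:
        raw = spec.get("feature_id")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(raw, bool) or not isinstance(raw, int):
            raise ValueError(f"feature_id must be an integer, got {raw!r}")
        if not 0 <= raw < n_features:
            raise ValueError(f"feature_id {raw} out of range [0, {n_features})")
        fids.append(raw)
    return fids
```

Raising `ValueError` before the lock-protected tensor work lets a request handler map the failure to a 400 response rather than surfacing an indexing error.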

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`:
- Around line 124-131: The code currently treats any non-"pick" mode as "topk";
update the conditional around req.mode in server.py to explicitly handle "pick"
and "topk" only and raise an HTTPException(400, "invalid mode; allowed values:
'pick', 'topk'") for any other value. Concretely, change the if/else to if
req.mode == "pick": ... elif req.mode == "topk": compute k and call
engine.top_features(...); else: raise the 400 error so typos or unsupported
modes are rejected (refer to req.mode, engine.top_features, chosen).
- Line 84: The CORS middleware is currently set to allow all origins via
app.add_middleware(CORSMiddleware, allow_origins=["*"]) which is too permissive
for production; update the server startup to read an environment variable (e.g.,
CORS_ALLOWED_ORIGINS or CORS_ALLOWED_ORIGIN) and use that to populate
allow_origins (parse a comma-separated list into a list), defaulting to a safe
value like an empty list or localhost for dev, and ensure
allow_methods/allow_headers remain appropriate; locate the use of
app.add_middleware and replace the hardcoded ["*"] with the parsed config so
deployments can restrict origins without code changes.
- Around line 23-34: Add a second blank line after the import block (the line
ending with "from .core import Evo2SAE, clean_dna") so there are two blank lines
before the next top-level statement (e.g., the logger = logging.getLogger(...)
or any subsequent definitions); this aligns with the isort rule and ensures the
import section (including Evo2SAE and clean_dna) is separated from module-level
code.
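
The environment-driven CORS configuration suggested above might look like the sketch below. The variable name `CORS_ALLOWED_ORIGINS` follows the review comment; the helper name `cors_origins_from_env` and the empty-list default are assumptions, not code from the PR.

```python
import os


def cors_origins_from_env(
    var: str = "CORS_ALLOWED_ORIGINS",
    default: tuple[str, ...] = (),
) -> list[str]:
    """Parse a comma-separated origin list from the environment.

    Args:
        var: Name of the environment variable to read.
        default: Origins to use when the variable is unset or empty.

    Returns:
        A list of origins with surrounding whitespace removed.
    """
    raw = os.environ.get(var, "")
    origins = [item.strip() for item in raw.split(",") if item.strip()]
    return origins or list(default)


# Intended use at app construction (not executed here):
#   app.add_middleware(CORSMiddleware, allow_origins=cors_origins_from_env())
```

Defaulting to an empty list means no cross-origin access unless a deployment opts in; a development setup could pass `default=("http://localhost:5173",)` or similar.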

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py`:
- Around line 99-105: Update the test_endpoints_503_until_ready to also assert
that the /generate endpoint returns 503 when the engine is not ready: in the
existing test that creates FakeEngine (eng.ready = False), using
TestClient(build_app(eng)) add a POST request to "/generate" with a
representative JSON payload (similar shape to other tests, e.g. prompt/sequence
fields) and assert c.post("/generate", json=...).status_code == 503 so /generate
is covered like /features and /annotate.

---

Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`:
- Around line 37-50: Default file paths for CLI args (--sae-ckpt-path,
--feature-annotations and EVO2_CKPT_DIR env fallback) are hardcoded to
/data/interp/evo2/...; remove or replace these with portable defaults by making
the argparse defaults None (or point to a user/home-relative path) and rely on
environment variables (SAE_CKPT_PATH, FEATURE_ANNOTATIONS, EVO2_CKPT_DIR) or
explicit CLI input, and update the code that consumes these values (where these
args are referenced) to validate and raise a clear error if no path is provided;
target the add_argument calls for "--sae-ckpt-path", "--feature-annotations" and
the EVO2_CKPT_DIR default.
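
One way to make the defaults portable, sketched under the assumption that the flags fall back to the environment variables named in the comment. `build_parser` and `require_path` are hypothetical names used only for illustration.

```python
import argparse
import os
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose path flags default to environment variables."""
    parser = argparse.ArgumentParser()
    # No hardcoded /data/... default: the value is None unless the env var is set.
    parser.add_argument("--sae-ckpt-path", default=os.environ.get("SAE_CKPT_PATH"))
    parser.add_argument(
        "--feature-annotations", default=os.environ.get("FEATURE_ANNOTATIONS")
    )
    return parser


def require_path(value: Optional[str], flag: str, env: str) -> str:
    """Return value, or exit with a clear message naming the flag and env var."""
    if not value:
        raise SystemExit(f"missing {flag}: pass it explicitly or set {env}")
    return value
```

A caller would then validate at the point of use, for example `require_path(args.sae_ckpt_path, "--sae-ckpt-path", "SAE_CKPT_PATH")`.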

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`:
- Around line 51-55: Add a Google-style (pydocstyle) docstring describing each
field on the FeatureClamp Pydantic model: update the class docstring for
FeatureClamp (subclassing BaseModel) to include an Attributes section documenting
feature_id (int) and strength (float) with concise descriptions and
units/semantics (e.g., feature index and target steering strength, default 1.0).
Keep the top-line summary intact and ensure the Attributes block follows Google
style so linters accept it.
- Around line 58-68: Add a Google-style (pydocstyle) docstring for the
GenerateRequest data model: add a class docstring describing the purpose of
GenerateRequest and include an Attributes section that documents each field
(prompt, organism, tag, features: list[FeatureClamp], n_tokens, temperature,
top_k, compare_baseline) with types and brief descriptions (e.g., prompt: input
sequence string; organism: organism context or "None (raw DNA)"; tag: optional
user tag; features: SAE FeatureClamp list used for clamping; n_tokens: number of
tokens to generate; temperature: sampling temperature; top_k: top-k sampling
value; compare_baseline: whether to compare to baseline). Ensure the formatting
follows Google-style pydocstyle conventions and place the docstring immediately
under the class GenerateRequest declaration.
- Around line 39-48: The AnnotateRequest Pydantic model lacks Google-style field
docstrings; update the class docstring for AnnotateRequest to include a
Google-style "Attributes:" section that documents each field (sequence,
organism, tag, mode, k, feature_ids, feature_id), describing purpose,
types/constraints and allowed values for mode ("topk" | "pick") and any
relationships (e.g., feature_ids vs feature_id) so readers and linters can
validate the field meanings. Ensure the docstring follows pydocstyle/Google
conventions and mentions defaults where relevant.
- Around line 99-107: The endpoint function features lacks a return type hint;
update its signature to include a typed return such as def features() ->
List[Dict[str, Any]]: and add the necessary imports (from typing import List,
Dict, Any) at the top of the module, or alternatively define and use a pydantic
model and set response_model on `@app.get`; modify the function signature and
imports so Pyright type checking passes while keeping the existing logic in
features().
- Around line 109-154: The annotate endpoint lacks a return type hint which
fails Pyright checks; update the annotate function signature (def annotate(req:
AnnotateRequest)) to include an explicit return type like -> Dict[str, Any] (or
a proper TypedDict/AnnotateResponse if available), and add the corresponding
typing import (e.g., from typing import Dict, Any) at the top of the module so
Pyright accepts the annotated return for the function annotate and its returned
JSON structure.
- Around line 86-97: The health endpoint lacks a return type hint; update the
health function signature (def health) to declare a typed return such as ->
Dict[str, Any] or -> dict[str, Any] and add the corresponding import (from
typing import Any, Dict) so Pyright can validate the returned mapping built from
engine (engine.ready, engine.layer, engine.n_features, engine.labels,
engine.sae_ckpt_path, engine.organism_tags, engine.device); keep the returned
structure unchanged and ensure the type hint covers the mixed value types.
- Around line 156-172: Add an explicit return type annotation to the FastAPI
endpoint function generate and import Any from typing, so Pyright knows the
endpoint's return type (e.g., def generate(req: GenerateRequest) -> Any:).
Leave the body and exception handling (engine.generate call and HTTPException
raises) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3e499152-13aa-449c-b7ad-6b67a8279836

📥 Commits

Reviewing files that changed from the base of the PR and between e407165 and de81106.

📒 Files selected for processing (8)
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/pyproject.toml
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/__init__.py
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/core.py
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steering.py

polinabinder1 added a commit that referenced this pull request Jun 11, 2026
…_sae serve`

Shrink the inference PR to the engine + server + their tests. The encode/batch/generate
command-line tools (cli.py) and launch_inference.sh move to the stacked CLI PR (#1632); the
server stays launchable here via `python -m evo2_sae serve` (__main__.py, env-configured).
fasta.py stays (shared by the extraction-side chunk_fasta.py and, via the base, the CLI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@polinabinder1 polinabinder1 force-pushed the pbinder/evo2-sae-serve branch from 5f4fce4 to 4a0de59 Compare June 11, 2026 20:18
polinabinder1 added a commit that referenced this pull request Jun 11, 2026
…1622)

Steering's only consumers (the live engine's clamp hook + the steer.py harness) both
live in the evo2 serve recipe (#1622), and the harness imports Evo2SAE from it. So the
steering primitive + harness move to a dedicated PR stacked on #1622, where the core
clamp-hook dedup can happen in-place. This base is now the probing library only.

Signed-off-by: Polina Binder <pbinder@nvidia.com>
@polinabinder1 polinabinder1 marked this pull request as draft June 11, 2026 21:18
@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


polinabinder1 added a commit that referenced this pull request Jun 23, 2026
#1637 was re-landed on the migrated #1622 (new top-level layout); #1623 is stacked on it, so
it's layered onto the new #1637 by hand (the old dashboard branch predates the #1637 hardening,
so a whole-file diff would revert it).

- Clean adds: feature_explorer/ (the React dashboard, incl. the gene_embed firing-column
  reduction, request timeouts, per-pane descriptions, shared components.jsx, crash-safe
  localStorage), scripts/{dashboard,launch_dashboard}.py, tests/test_launch_dashboard.py.
- server.py: add GeneEmbedRequest + /gene_embed (uses core.clean_dna + _require_ready).
- core.py: add Evo2SAE.embed_bundle (ships only firing columns + feature_ids, not the full
  65536-wide matrix).
- tests: gene_embed contract test in test_server.py; bind embed_bundle on the conftest FakeEngine.
- pyproject: add scikit-learn (scripts/dashboard.py atlas PCA/TSNE).

Validated in the evo2_megatron venv: CPU 44 passed; frontend `vite build` clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 23, 2026
…d onto migrated #1622

Re-lands #1635 on the post-#1633 layout, on top of migrated #1622: the steering-eval harness
(scripts/{steer,steer_analysis}.py) over #1622's generate(), with model-agnostic metrics.

Validated: tests/test_steer_analysis.py -> 3 passed (CPU); harness scripts compile.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
Loading an SAE whose input_dim != the model's hidden size (wrong SAE/model pairing) used to
succeed at load and then fail with a cryptic matmul shape error on the first encode. load() now
checks it up front and raises a clear message ("SAE input_dim=X does not match the Evo2 hidden
size=H at layer L — wrong SAE/model pairing (check --sae-ckpt-path / --layer)").

- _model_hidden_size(): read it from the model config (cheap) or a 1-token forward (ground
  truth); None if neither works -> check skipped, never blocks an otherwise-fine load.
- _check_dim(): pure, unit-tested on CPU (test_check_dim_rejects_sae_model_mismatch).

NOT detectable, documented in code: a *wrong layer number* with the same hidden size — encode
still matches dims and silently yields out-of-distribution features. The SAE checkpoint records
no training layer; /health surfaces the configured layer. Follow-up: stamp the layer into the SAE
checkpoint at train time and assert it here.

Validated in the evo2_megatron venv: CPU test_core 7 passed, GPU test_steering 12 passed on the 1B
(real load() exercises the new check; 1920==1920, no false positive).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
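
The up-front dimension check described in this commit can be sketched as a pure function. This is an illustration only: the real `_check_dim` signature is not shown on this page, so the argument names and the use of `ValueError` are assumptions; the message text follows the commit description.

```python
from typing import Optional


def check_dim(sae_input_dim: int, model_hidden: Optional[int], layer: int) -> None:
    """Reject an SAE whose input width differs from the model's hidden size.

    Args:
        sae_input_dim: Input dimension recorded in the SAE checkpoint.
        model_hidden: Hidden size of the model, or None if it could not be read.
        layer: Configured layer index, used only in the error message.

    Raises:
        ValueError: If both sizes are known and they differ.
    """
    if model_hidden is None:
        return  # size unknown: skip the check rather than block the load
    if sae_input_dim != model_hidden:
        raise ValueError(
            f"SAE input_dim={sae_input_dim} does not match the Evo2 hidden "
            f"size={model_hidden} at layer {layer}: wrong SAE/model pairing "
            "(check --sae-ckpt-path / --layer)"
        )
```

As the commit notes, a check of this kind cannot catch a wrong layer number when the hidden size is the same at every layer.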
polinabinder1 added a commit that referenced this pull request Jun 23, 2026
#1622 was migrated to the new top-level layout (interpretability/sparse_autoencoders/…,
no bionemo-recipes/ prefix) and squash-replayed, so #1637 is layered on by hand rather
than rebased.

- Clean adds: src/evo2_sae/{server,cli}.py, scripts/launch_inference.sh,
  tests/test_{cli,server}.py.
- pyproject: add pandas/fastapi/uvicorn/anyio.
- tests/conftest.py: keep #1622's 1B GPU fixtures + bionemo.common loader; append the
  serve-layer FakeEngine + fake_engine fixture.
- core.py (semantic merge): keep #1622's _sanitize_steering (all CPU sanitize tests) and
  fold in the explicit non-finite-strength guard (no min/max arg-order reliance); add the
  shared annotate() + parse_clamp_spec() (CLI strings ⇄ API dicts) and feed parse_clamp_spec
  in front of _sanitize_steering; add _is_unrecoverable_cuda + flip the engine not-ready on
  an unrecoverable CUDA fault in generate(). (Kept _sanitize_steering rather than swapping in
  _normalize_clamps — non-redundant and preserves #1622's sampler hardening + tests.)
- test_steering.py: keep #1622's sanitize + GPU tests; add the _is_unrecoverable_cuda test.

Preserved from #1622: clamp_hook canonical encode/decode, TopKSAE-only _load_sae (no ReLU),
bionemo.common(/core fallback) loader. Validated in the evo2_megatron venv: CPU 38 passed,
GPU test_steering 13 passed on the 1B (ran, not skipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 23, 2026
A new ubuntu-latest workflow installs sae + the recipe (CPU torch) and runs the recipe's
model-agnostic tests (-m 'not slow') — the label producers (#1630), eval metrics, etc. — so they
run cheaply on the probing-stack branches instead of waiting for #1622's megatron GPU lane (which
would run them on an L4 after a full build). Registers the 'slow' marker on the recipe pyproject
so the GPU tests are excluded without an unknown-marker warning.

Validated: pytest tests/ -m 'not slow' -> 16 passed (CPU).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 and others added 2 commits June 23, 2026 18:26
Sort encode_batch work by token length so each micro-batch holds similar-length
sequences (less wasted padding on mixed-length inputs). Results are written back by
original index, so the returned order still matches the input order. Add a CPU test
that stubs the model and asserts input-order output despite the internal length-sort.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
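
The length-sorted batching with write-back by original index can be sketched as below. `encode_batch_sorted` and `encode_fn` are hypothetical names; the real `encode_batch` works on token ids and a GPU model, which this stand-in does not attempt to reproduce.

```python
from typing import Any, Callable, Sequence


def encode_batch_sorted(
    items: Sequence[Sequence[Any]],
    encode_fn: Callable[[list], list],
    batch_size: int,
) -> list:
    """Encode items in length-sorted micro-batches, returning input order.

    Args:
        items: Input sequences (for example token-id lists).
        encode_fn: Maps one micro-batch to a list of per-item results.
        batch_size: Maximum number of items per micro-batch.

    Returns:
        One result per input, in the same order as ``items``.
    """
    # Indices sorted by length, so each micro-batch holds similar-length items.
    order = sorted(range(len(items)), key=lambda i: len(items[i]))
    out: list = [None] * len(items)
    for start in range(0, len(order), batch_size):
        idx = order[start : start + batch_size]
        results = encode_fn([items[i] for i in idx])
        # Write back by original index to restore the caller's ordering.
        for i, result in zip(idx, results):
            out[i] = result
    return out
```

A stubbed `encode_fn` that records the batches it receives is enough to test both properties on CPU: batches are grouped by length, and the output order matches the input order.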
Drop the separate truncated encode model (load_model_to_layer(full=False)) and serve both
paths from the one inference engine built by setup_inference_engine:

  * load() builds the engine and takes self.model = unwrap_model(comp.model) + comp.tokenizer.
  * encode/highlight (_forward_hidden) runs a normal full-sequence forward on that model and
    reads layer L off a forward hook — the engine model is post_process=True so output_embeddings
    can't be used; the hook captures the same [S,B,H] module output the steering clamp_hook reads,
    so encode and steer see identical activations.
  * generate() steers on self.model.decoder.layers[L] (the same module encode reads).

Removes the ~1.8x model duplication (one set of weights instead of truncated + full). The
num-microbatches double-init teardown is now just defensive (only one model inits it).

Test: add GPU test_highlight_steer_interleaving_no_bleed — encode is bit-identical across a
steered generate, and a baseline generate is unaffected by prior encode/steer history (proves
no state bleed between the shared model's highlight forward and decode path).

Validated end-to-end on the 1B-8k-bf16 (21/21 tests pass, incl. the interleaving + steering
GPU tests). 7B fidelity still unconfirmed (no 7B checkpoint available) — HOLD push for that gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 pushed a commit that referenced this pull request Jun 23, 2026
…se SAE forward

- Extract evo2_buffer.forward_codes(engine, id_lists) — the one place that touches the engine
  internals (locked GPU forward + SAE encode). build_buffer and probe._encode_windows both use
  it, so the #1622 engine-API coupling lives in a single spot, and the per-token label/buffer
  work moves out of the GPU lock. Add a CPU unit test (fake engine) for the helper's contract.
- Hoist KINGDOM_TAGS to evo2_buffer (was duplicated in probe_loss_recovered).
- Remove the `codon-aa` subcommand: it consumed a codon/aa npz no command produces (and was the
  only raw np.load); drop it and its now-unused decode_eval/fit_softmax imports until a producer
  exists.
- SAEWrap delegates to the SAE's own forward() (top-k + normalize_input denormalization) instead
  of hand-rolling decoder(codes)+pre_bias and mean/std — the path the steering hook uses, so the
  loss-recovered recon can't drift from the SAE's actual (de)normalization.
- Make evo2_buffer importable without the evo2_sae engine (lazy read_fasta), so the CPU tests
  exercise forward_codes and the harness imports cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 pushed a commit that referenced this pull request Jun 23, 2026
…ck (#1622)

recipes/evo2/ is co-owned with #1622 (Dockerfile/build + src/evo2_sae). Align the shared files
so the two stacks merge without conflict, regardless of order:
- pyproject.toml: keep `[tool.setuptools] packages = []` (unchanged from main, so #1622's
  `where = ["src"]` wins cleanly at merge and `pip install -e recipes/evo2` still works here with
  no src/ dir); make the `[tool.pytest.ini_options]` markers block byte-identical to #1622's so the
  add/add merges cleanly. The biopython/pyrodigal deps stay a one-sided add.
- Drop tests/conftest.py (it add/add-collided with #1622's GPU-fixture conftest) and restore the
  per-file scripts/ sys.path insert in test_probe_integration.py, matching the sibling tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 pushed a commit that referenced this pull request Jun 23, 2026
…validation test

- test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer
  (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels
  shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled.
- test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end
  against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA /
  the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which
  arrive when the serve + eval stacks share recipes/evo2/ — so it runs in the merged megatron GPU lane.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 23, 2026
…d onto migrated #1622

Clamp an SAE feature via the production Evo2SAE.generate path and quantify the causal effect:
dose-response (effect vs strength) + selectivity (target vs control features), persisted to a
structured steering_results.json.

Review fixes:
  * metric: replace positional Hamming with normalized edit (Levenshtein) distance. Greedy
    decode is autoregressive, so one early flipped token shifts every downstream base and pins
    Hamming at ~1.0 — erasing the dose curve. Edit distance is shift-robust; first_divergence
    (shared-prefix length) is the complementary monotone signal. Tested with the shift case.
  * surface the clamp cap: generate() silently caps |strength| to MAX_CLAMP_STRENGTH, so two
    requests above it produce an identical clamp (a fake plateau). run_steering now warns,
    steers at the effective value, and records max_clamp_strength + capped_strengths.
  * fix dangling doc reference (probe.py -> extract.py, which exists).
  * refactor steer.py into injectable pick_target()/run_steering() and add CPU test_steer.py
    (fake engine, local not in conftest) covering target picking, dose monotonicity,
    selectivity, and cap reporting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
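
The metric change can be illustrated with a small sketch. The function names are hypothetical and the normalization by the longer string's length is an assumption; the harness's own implementation is not shown on this page.

```python
def hamming(a: str, b: str) -> float:
    """Positional mismatch rate; a length difference counts as mismatches."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_edit(a: str, b: str) -> float:
    """Edit distance divided by the longer length (0.0 for two empty strings)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def first_divergence(a: str, b: str) -> int:
    """Length of the shared prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Shift case: one base inserted at the front, one dropped at the end.
baseline = "ACGTACGTAC"
steered = "TACGTACGTA"
print(hamming(baseline, steered))          # 1.0: every position mismatches
print(normalized_edit(baseline, steered))  # 0.2: one insertion plus one deletion
```

In the shift case positional Hamming saturates even though the two sequences differ by only two edits, which is the plateau effect the commit describes.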
polinabinder1 added a commit that referenced this pull request Jun 23, 2026
…d onto migrated #1622

Clamp an SAE feature via the production Evo2SAE.generate path and quantify the causal effect:
dose-response (effect vs strength) + selectivity (target vs control features), persisted to a
structured steering_results.json.

Metric / robustness:
  * normalized edit (Levenshtein) distance, not positional Hamming. Greedy decode is
    autoregressive, so one early flipped token shifts every downstream base and pins Hamming at
    ~1.0 — erasing the dose curve. Edit distance is shift-robust; first_divergence (shared-prefix
    length) is the complementary monotone signal. Tested with the shift case.
  * surface the clamp cap: generate() silently caps |strength| to MAX_CLAMP_STRENGTH, so two
    requests above it produce an identical clamp (a fake plateau). run_steering warns, steers at
    the effective value, and records max_clamp_strength + capped_strengths.

Consolidation:
  * harness + metrics live in the package (src/evo2_sae/steer_analysis.py), engine injected, so
    they import as a normal torch-free module like evo2_sae.fasta — dropped all four sys.path
    inserts. scripts/steer.py is now a thin CLI (matches train.py/extract.py).
  * pick_target reuses Evo2SAE.top_features (the CLI/server ranking) instead of re-deriving topk.
  * one CPU test file (metrics + fake-engine harness) instead of two; fake stays local, not in
    conftest, to avoid colliding with the sibling server PR's engine fixtures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 and others added 2 commits June 24, 2026 03:35
Replace the 'slow' marker with an explicit @pytest.mark.skipif(not torch.cuda.is_available()) on the
GPU/integration tests in test_steering.py (shared 'requires_gpu' decorator). They run when a GPU is
present (CI's L4 + megatron env) and skip with a clear 'requires a GPU' reason otherwise — the
conftest fixtures still further skip on too-little GPU memory or an unfetchable/unimportable
checkpoint. Remove the now-unused 'slow' marker registration from pyproject.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
Remove .github/workflows/unit-tests-interpretability-recipes.yaml per review (recoverable from
history / 9bedf2b). Keep .ci_build.sh + .ci_test_env.sh + the tests — those are the build/run
machinery (used by the Dockerfile and manual runs), not the CI lane. How to build + run the tests
is documented in the PR description; CI should later fold into the repo-wide recipe lane rather
than a bespoke workflow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 24, 2026
no bionemo-recipes/ prefix) and squash-replayed, so #1637 is layered on by hand rather
than rebased.

- Clean adds: src/evo2_sae/{server,cli}.py, scripts/launch_inference.sh,
  tests/test_{cli,server}.py.
- pyproject: add pandas/fastapi/uvicorn/anyio.
- tests/conftest.py: keep #1622's 1B GPU fixtures + bionemo.common loader; append the
  serve-layer FakeEngine + fake_engine fixture.
- core.py (semantic merge): keep #1622's _sanitize_steering (all CPU sanitize tests) and
  fold in the explicit non-finite-strength guard (no min/max arg-order reliance); add the
  shared annotate() + parse_clamp_spec() (CLI strings ⇄ API dicts) and feed parse_clamp_spec
  in front of _sanitize_steering; add _is_unrecoverable_cuda + flip the engine not-ready on
  an unrecoverable CUDA fault in generate(). (Kept _sanitize_steering rather than swapping in
  _normalize_clamps — non-redundant and preserves #1622's sampler hardening + tests.)
- test_steering.py: keep #1622's sanitize + GPU tests; add the _is_unrecoverable_cuda test.
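
The remark about min/max argument order can be illustrated with a small sketch. It is an assumption-laden example, not the real _sanitize_steering: the cap value, the function names, and the choice to raise on non-finite input are all hypothetical.

```python
import math

# Assumed value for illustration only; the real cap is defined by the engine.
MAX_CLAMP_STRENGTH = 10.0


def clamp_value_first(x, cap):
    """min/max clamp with the value as the first argument: NaN passes straight through."""
    return min(max(x, -cap), cap)


def clamp_bound_first(x, cap):
    """Same clamp with the bound first: NaN silently becomes -cap."""
    return min(max(-cap, x), cap)


def sanitize_strength(x, cap=MAX_CLAMP_STRENGTH):
    """Reject non-finite strengths explicitly, then clamp to [-cap, cap]."""
    x = float(x)
    if not math.isfinite(x):
        # Explicit guard: the result no longer depends on how min/max order NaN.
        raise ValueError(f"steering strength must be finite, got {x!r}")
    return min(max(x, -cap), cap)


nan = float("nan")
value_first = clamp_value_first(nan, MAX_CLAMP_STRENGTH)
bound_first = clamp_bound_first(nan, MAX_CLAMP_STRENGTH)
```

Because every comparison against NaN is false, Python's built-in min and max return whichever argument came first, so the two clamps disagree on NaN. The explicit isfinite check removes that dependence.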

Preserved from #1622: clamp_hook canonical encode/decode, TopKSAE-only _load_sae (no ReLU),
bionemo.common(/core fallback) loader. Validated in the evo2_megatron venv: CPU 38 passed,
GPU test_steering 13 passed on the 1B (ran, not skipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 24, 2026
…ed sae lib); drop CI lane

These probing primitives (eval metrics + ActivationBuffer) are evo2-specific, so move them from the
shared sae library into the evo2_sae recipe package:
  * sae/src/sae/eval/probing.py            -> recipes/evo2/src/evo2_sae/eval/probing.py
  * new recipes/evo2/src/evo2_sae/eval/__init__.py (re-exports the probing API)
  * sae/src/sae/eval/__init__.py reverted (no longer exports probing — stays shared for esm2/codonfm)
  * sae/tests/test_probing.py              -> recipes/evo2/tests/test_probing.py (import evo2_sae.eval.probing)
Remove .github/workflows/unit-tests-sae.yaml (defer CI; run tests via the recipe's .ci_build.sh + pytest).
Re-parented onto #1622 so the evo2_sae package is available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 24, 2026
A new ubuntu-latest workflow installs sae + the recipe (CPU torch) and runs the recipe's
model-agnostic tests (-m 'not slow') — the label producers (#1630), eval metrics, etc. — so they
run cheaply on the probing-stack branches instead of waiting for #1622's megatron GPU lane (which
would run them on an L4 after a full build). Registers the 'slow' marker on the recipe pyproject
so the GPU tests are excluded without an unknown-marker warning.

Validated: pytest tests/ -m 'not slow' -> 16 passed (CPU).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 pushed a commit that referenced this pull request Jun 24, 2026
…se SAE forward

- Extract evo2_buffer.forward_codes(engine, id_lists) — the one place that touches the engine
  internals (locked GPU forward + SAE encode). build_buffer and probe._encode_windows both use
  it, so the #1622 engine-API coupling lives in a single spot, and the per-token label/buffer
  work moves out of the GPU lock. Add a CPU unit test (fake engine) for the helper's contract.
- Hoist KINGDOM_TAGS to evo2_buffer (was duplicated in probe_loss_recovered).
- Remove the `codon-aa` subcommand: it consumed a codon/aa npz no command produces (and was the
  only raw np.load); drop it and its now-unused decode_eval/fit_softmax imports until a producer
  exists.
- SAEWrap delegates to the SAE's own forward() (top-k + normalize_input denormalization) instead
  of hand-rolling decoder(codes)+pre_bias and mean/std — the path the steering hook uses, so the
  loss-recovered recon can't drift from the SAE's actual (de)normalization.
- Make evo2_buffer importable without the evo2_sae engine (lazy read_fasta), so the CPU tests
  exercise forward_codes and the harness imports cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 pushed a commit that referenced this pull request Jun 24, 2026
…ck (#1622)

recipes/evo2/ is co-owned with #1622 (Dockerfile/build + src/evo2_sae). Align the shared files
so the two stacks merge without conflict, regardless of order:
- pyproject.toml: keep `[tool.setuptools] packages = []` (unchanged from main, so #1622's
  `where = ["src"]` wins cleanly at merge and `pip install -e recipes/evo2` still works here with
  no src/ dir); make the `[tool.pytest.ini_options]` markers block byte-identical to #1622's so the
  add/add merges cleanly. The biopython/pyrodigal deps stay a one-sided add.
- Drop tests/conftest.py (it add/add-collided with #1622's GPU-fixture conftest) and restore the
  per-file scripts/ sys.path insert in test_probe_integration.py, matching the sibling tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 pushed a commit that referenced this pull request Jun 24, 2026
…validation test

- test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer
  (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels
  shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled.
- test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end
  against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA /
  the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which
  arrive when the serve + eval stacks share recipes/evo2/ — so it runs in the merged megatron GPU lane.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 24, 2026
…d onto migrated #1622

Clamp an SAE feature via the production Evo2SAE.generate path and quantify the causal effect:
dose-response (effect vs strength) + selectivity (target vs control features), persisted to a
structured steering_results.json.

Metric / robustness:
  * normalized edit (Levenshtein) distance, not positional Hamming. Greedy decode is
    autoregressive, so one early flipped token shifts every downstream base and pins Hamming at
    ~1.0 — erasing the dose curve. Edit distance is shift-robust; first_divergence (shared-prefix
    length) is the complementary monotone signal. Tested with the shift case.
  * surface the clamp cap: generate() silently caps |strength| to MAX_CLAMP_STRENGTH, so two
    requests above it produce an identical clamp (a fake plateau). run_steering warns, steers at
    the effective value, and records max_clamp_strength + capped_strengths.

Consolidation:
  * harness + metrics live in the package (src/evo2_sae/steer_analysis.py), engine injected, so
    they import as a normal torch-free module like evo2_sae.fasta — dropped all four sys.path
    inserts. scripts/steer.py is now a thin CLI (matches train.py/extract.py).
  * pick_target reuses Evo2SAE.top_features (the CLI/server ranking) instead of re-deriving topk.
  * one CPU test file (metrics + fake-engine harness) instead of two; fake stays local, not in
    conftest, to avoid colliding with the sibling server PR's engine fixtures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 added a commit that referenced this pull request Jun 24, 2026
Relocate the steering dose-response / selectivity metrics from evo2_sae.steer_analysis into the
evo2_sae.eval package (src/evo2_sae/steer_analysis.py -> src/evo2_sae/eval/steering.py), alongside
the eval/probing harness. Update the importers (scripts/steer.py CLI + tests/test_steer_analysis.py)
to evo2_sae.eval.steering. The CI lane is dropped via the rebase onto the updated #1622. Pure-CPU
tests, no GPU/model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>