evo2 SAE: inference engine + steering, CI lane, tests, Dockerfile by polinabinder1 · Pull Request #1622 · NVIDIA-BioNeMo/bionemo-recipes

polinabinder1 · 2026-06-10T16:46:05Z

Summary

The importable Evo2SAE inference engine + feature steering — the base of the serve stack — with tests and a runnable (layer-cached) Docker image. A single Evo2 inference engine is loaded once and serves both paths: encode reads the residual stream off a layer-L forward hook; generate drives the same model's decode with decode-only feature steering. No web/CLI here: the server + CLI (#1637), dashboard (#1623), and steering eval (#1635) build on it.

Rebased onto the post-#1633 top-level layout (interpretability/sparse_autoencoders/).

Architecture: one model, both paths

Earlier iterations loaded two copies of Evo2 — a truncated post_process=False model for encode/highlight and the full inference engine for generate (~1.8× the weights). This collapses to a single engine (infer.setup_inference_engine, run eager with cuda_graph_impl="none" so the steering hook applies):

load() builds the one engine and takes self.model = unwrap_model(comp.model) + comp.tokenizer from it.
encode/highlight (_forward_hidden) runs a normal full-sequence forward and reads layer L off a forward hook — the engine model is post_process=True (it produces logits for generation), so output_embeddings can't be used; the hook captures the same [S, B, H] module output the steering clamp_hook reads, so encode and steer see identical activations by construction.
generate steers on self.model.decoder.layers[L] — the same module encode reads.
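
The shared-layer arrangement above can be illustrated without Megatron or PyTorch. What follows is a toy, torch-free sketch: `ToyLayer`, `capture`, and `steer` are hypothetical stand-ins, not the PR's code. They mimic a module with forward hooks to show how a read-only capture hook and a temporary steering hook can attach to the same layer output, and why removing the steering hook in a `finally` block keeps later encodes unaffected.

```python
from contextlib import contextmanager


class ToyLayer:
    """Hypothetical stand-in for a decoder layer that supports forward hooks."""

    def __init__(self):
        self._hooks = {}
        self._next_id = 0

    def register_forward_hook(self, fn):
        # Returns a handle id; hooks run in registration order.
        hook_id = self._next_id
        self._next_id += 1
        self._hooks[hook_id] = fn
        return hook_id

    def remove_hook(self, hook_id):
        self._hooks.pop(hook_id, None)

    def __call__(self, x):
        out = [2.0 * v for v in x]  # the layer's "real" computation
        for fn in list(self._hooks.values()):
            result = fn(out)
            if result is not None:  # a hook may replace the output
                out = result
        return out


def capture(layer, x):
    """Encode path: run a forward pass and read the layer output off a hook."""
    seen = []
    hook_id = layer.register_forward_hook(lambda out: seen.append(list(out)))
    try:
        layer(x)
    finally:
        layer.remove_hook(hook_id)
    return seen[0]


@contextmanager
def steer(layer, delta):
    """Generate path: temporarily add `delta` to the same layer output."""
    hook_id = layer.register_forward_hook(lambda out: [v + delta for v in out])
    try:
        yield
    finally:
        layer.remove_hook(hook_id)  # removed even if generation raises


layer = ToyLayer()
before = capture(layer, [1.0, 2.0])
with steer(layer, delta=10.0):
    steered = layer([1.0, 2.0])
after = capture(layer, [1.0, 2.0])
```

In this toy, `after` equals `before` because no hook survives the `with` block, which is the property the highlight↔steer interleaving test is described as checking on the real model.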

Validated end-to-end on the 1B-8k-bf16 (21/21 tests, incl. a highlight↔steer interleaving test proving no state bleed between the shared model's encode forward and decode path). 7B fidelity is the remaining gate.

src/evo2_sae/core.py — Evo2SAE: load → encode / encode_batch / feature_tracks / generate (decode-only clamp via sae.steering) + input-sanitization guards (_sanitize_steering: feature-id range, clamp-magnitude cap, non-finite/top_k/temperature coercion). encode_batch is length-bucketed (work sorted by token length to minimize padding waste on mixed-length inputs; results un-sorted back to input order).
Load-time SAE/model fit check — load() verifies the SAE's input_dim equals the model's hidden size (_model_hidden_size via config, or a 1-token forward) and raises a clear error on a mismatch ("wrong SAE/model pairing"), instead of a cryptic matmul failure on the first encode. Known gap: a wrong layer number with the same hidden size can't be caught here (the SAE checkpoint records no training layer) — it silently yields out-of-distribution features; /health surfaces the configured layer, and stamping the layer into the checkpoint at train time is a follow-up.
sae/src/sae/steering.py — model-agnostic delta-clamp hook + steer().
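
The length-bucketing described for encode_batch can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in core.py: `encode_batch_bucketed` and `toy_encode_padded` are hypothetical names, character length stands in for token length, and the toy encoder only reports how much padding each item would need.

```python
def encode_batch_bucketed(seqs, encode_padded, batch_size=2):
    """Encode sequences in length-sorted batches, returning results in input order."""
    # Sort indices by length so each batch holds similar-length items.
    # (Assumption: len(seq) is used here as a proxy for token length.)
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    results = [None] * len(seqs)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        outputs = encode_padded([seqs[i] for i in idx])
        # Scatter each output back to its original position (the "un-sort").
        for i, out in zip(idx, outputs):
            results[i] = out
    return results


def toy_encode_padded(batch):
    """Hypothetical encoder: pads to the batch max and reports the padding used."""
    width = max(len(s) for s in batch)
    return [(s, width - len(s)) for s in batch]


seqs = ["ACGTACGTAC", "AC", "ACGTACGT", "ACG"]
out = encode_batch_bucketed(seqs, toy_encode_padded)
total_padding = sum(pad for _, pad in out)
```

With these four inputs the sorted batches need 3 padding positions in total, whereas batching in the original order (`[0, 1]` then `[2, 3]`) would need 13, and `out` is still aligned with `seqs`.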

Build / run / CI

.ci_build.sh (env | install | all) + .ci_test_env.sh — build the env by delegating to evo2_megatron's own build (no fork of the pinned megatron stack), then install sae + this recipe into that venv. The phase arg lets the Dockerfile cache the two steps separately.
Dockerfile — thin, non-forking, layer-cached: the ~30-min mbridge megatron build is its own layer (depends only on recipes/evo2_megatron), and the SAE source + editable installs are a separate trailing layer — so editing engine/SAE code rebuilds only the cheap install layer, not megatron. (+ a per-Dockerfile .dockerignore.)
tests/conftest.py — 1B-8k-bf16 fixture (bionemo_load → run_nemo2_to_mbridge) + a synthesized tiny SAE, GPU-memory-gated; honors EVO2_CKPT_DIR / SAE_CKPT_PATH for manual / 7B runs. The GPU tests are gated by @pytest.mark.skipif(not torch.cuda.is_available()), so they run on a GPU box and skip otherwise.

Dependency on `bionemo.evo2`

The engine reuses bionemo.evo2's model code (the mbridge recipes/evo2_megatron recipe), which isn't pip-installable. .ci_build.sh (and the Dockerfile) build it via evo2_megatron's own script; it's intentionally not in pyproject.toml, matching the codonfm/esm2 recipes (base model is environment-provided).

How to run

# Build once from the repo root, then run with a GPU:
docker build -f interpretability/sparse_autoencoders/recipes/evo2/Dockerfile -t evo2-sae .
docker run --gpus all -it evo2-sae bash -lc "source .ci_test_env.sh && pytest tests/"

On build time / making it easier. The engine needs bionemo.evo2 (the mbridge
evo2_megatron recipe), which isn't pip-installable — so the first docker build
compiles the full megatron stack (megatron-bridge, causal-conv1d, …) and takes ~30 min.
After that, the build is layer-cached: editing engine/SAE code re-runs only the two
editable pip installs (seconds), not the megatron compile. Other shortcuts:

Build once, reuse: push the built image to a registry; coworkers docker run it and never rebuild.

Skip the compile: the Dockerfile's ARG BASE_IMAGE can point at a prebuilt evo2_megatron / bionemo image once one exists — the build then collapses to just the two pip installs.

No container at all (dev): inside an existing megatron env, pip install -e sae/ && pip install -e recipes/evo2/ (what local validation does).

from evo2_sae import Evo2SAE
eng   = Evo2SAE(evo2_ckpt_dir, sae_ckpt_path, layer=19).load()    # 1B layer 19 (7B: 26)
codes = eng.encode("ATGGCC...")                                    # [S, n_features], sparse (TopK)
out   = eng.generate(prompt="ATGGCC...", features=[{"feature_id": 123, "strength": 200}])
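
The summary says the `features` payload is guarded by `_sanitize_steering` (feature-id range, clamp-magnitude cap, non-finite/top_k/temperature coercion). The sketch below shows one way such guards could look. The function names, the cap of 1000.0, and the exact policy (raise on out-of-range ids, drop non-finite strengths, fall back to temperature 1.0) are assumptions for illustration, not the PR's actual behavior.

```python
import math


def sanitize_features(features, n_features, max_strength=1000.0):
    """Validate and normalize steering clamps (hypothetical policy)."""
    clean = []
    for f in features:
        fid = int(f["feature_id"])
        if not 0 <= fid < n_features:
            raise ValueError(f"feature_id {fid} out of range [0, {n_features})")
        strength = float(f.get("strength", 1.0))
        if not math.isfinite(strength):
            continue  # assumption: clamps with NaN/inf strength are dropped
        # Cap the magnitude so a huge clamp cannot blow up the activations.
        strength = max(-max_strength, min(max_strength, strength))
        clean.append({"feature_id": fid, "strength": strength})
    return clean


def sanitize_sampling(temperature, top_k):
    """Coerce sampling parameters to safe values (assumed defaults)."""
    if not math.isfinite(temperature) or temperature <= 0:
        temperature = 1.0
    return temperature, max(0, int(top_k))


clamps = sanitize_features([{"feature_id": 123, "strength": 200}], n_features=1000)
```

Raising `ValueError` for a bad id lets a server layer translate it into a 400 response, while coercing sampling parameters keeps an otherwise valid request running.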

Tests

There's no dedicated CI lane right now (deferred — it should later fold into the repo-wide recipe lane, which already runs .ci_build.sh + pytest). Run them manually:

cd interpretability/sparse_autoencoders/recipes/evo2
bash .ci_build.sh && source .ci_test_env.sh   # build + activate the megatron venv
pytest tests/

CPU (no model): test_core.py (engine plumbing — top_features, _load_sae, generate guards, the SAE/model dim check, encode_batch length-bucketing order) + test_steering.py sanitize guards + sae/tests/test_steering.py (exact clamp math). Quick CPU-only run without the venv: PYTHONPATH=src:../../sae/src pytest tests/test_core.py.
GPU: test_steering.py — bf16 encode, generation in-distribution, steering changes the continuation (+ compare_baseline), batched/empty-sequence encode, max-clamp stays finite, and highlight↔steer interleaving (encode bit-identical across a steered generate; baseline unaffected by history). Gated by @pytest.mark.skipif(not torch.cuda.is_available()) — runs on a GPU box (megatron venv); set EVO2_CKPT_DIR/SAE_CKPT_PATH for a specific model, else the fixtures build the 1B-8k-bf16 + a synthesized SAE.
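
The SAE/model dimension check exercised by the CPU tests can be pictured with a small sketch. It assumes a config object that may expose `hidden_size` and a hypothetical `probe_forward` callable standing in for the 1-token forward; the function names are illustrative, and only the "wrong SAE/model pairing" wording comes from the description.

```python
from types import SimpleNamespace


def model_hidden_size(config, probe_forward):
    """Return the hidden size from config, else from a 1-token probe forward."""
    hidden = getattr(config, "hidden_size", None)
    if hidden is not None:
        return int(hidden)
    # Fallback: the probe returns one token's activation vector; use its width.
    return len(probe_forward())


def check_sae_model_fit(sae_input_dim, hidden_size):
    """Fail fast at load time instead of with a matmul error on first encode."""
    if sae_input_dim != hidden_size:
        raise ValueError(
            f"wrong SAE/model pairing: SAE input_dim={sae_input_dim}, "
            f"model hidden size={hidden_size}"
        )


# Toy sizes, not the real Evo2 dimensions.
from_config = model_hidden_size(SimpleNamespace(hidden_size=16), lambda: [])
from_probe = model_hidden_size(SimpleNamespace(), lambda: [0.0] * 8)
check_sae_model_fit(16, from_config)  # passes silently
```

As the description notes, a check like this cannot detect a wrong layer index when the hidden size matches, since the width alone carries no layer information.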

Base of

#1637 (server) → #1623 (dashboard), and #1635 (steering eval).

Note: `recipes/evo2/` is co-owned with the eval stack (#1636)

This PR owns the recipe's Dockerfile / .ci_build.sh / src/evo2_sae + tests/conftest.py; the eval stack (#1636) adds scripts/ (labelers, probe harness) and its biopython/pyrodigal deps to the same recipes/evo2/. The eval branch is pre-reconciled against this PR (verified clean with git merge-tree): it keeps [tool.setuptools] packages = [] (so this PR's where = ["src"] wins at merge), carries a byte-identical pytest-markers block, and has no conftest.py. No change needed here — just merge order awareness, and pip install -e recipes/evo2 (in .ci_build.sh) will install the eval deps automatically once both land.

coderabbitai · 2026-06-10T16:46:14Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e588a942-69f4-41f6-83f6-7516464f3c2e

Walkthrough

This PR introduces sparse autoencoder (SAE) feature steering capabilities for the Evo2 foundation model, along with a complete inference recipe. It adds a reusable clamp_hook steering primitive that injects only the delta between clamped and original SAE reconstructions, applies it to a new Evo2SAE inference engine supporting encoding and steered generation, and refactors FASTA parsing into a shared utility.

Changes

Evo2 SAE Steering and Inference Recipe

Layer / File(s)	Summary
Build configuration and runtime dependencies `bionemo-recipes/.../evo2/pyproject.toml`	Added `fastapi>=0.110`, `uvicorn>=0.29`, `pandas>=1.5` to project dependencies and enabled setuptools package discovery under `src/`.
SAE feature steering primitives `bionemo-recipes/.../sae/src/sae/steering.py`	New `clamp_hook` forward hook re-encodes activations through the SAE, clamps specified feature codes, and injects the delta between clamped and original decoded outputs. `steer` context manager registers/removes the hook reliably.
SAE steering unit tests `bionemo-recipes/.../sae/tests/test_steering.py`	Validates delta-clamp correctness (no-op leaves unchanged, real clamp matches analytic delta), tuple output isolation (only hidden state modified), and decode-only mode (skips prefill).
Evo2SAE package API and lazy loading `bionemo-recipes/.../evo2/src/evo2_sae/__init__.py`	Public API via `__all__` constraint on `Evo2SAE`, `clean_dna`, `DEFAULT_ORGANISM_TAGS` with module-level `__getattr__` for lazy core module loading.
Evo2SAE core inference engine `bionemo-recipes/.../evo2/src/evo2_sae/core.py`	Main `Evo2SAE` class loads truncated Evo2 model + SAE checkpoint, supports tokenization, single/batch encoding to SAE codes, feature extraction, top-k feature selection, and generation with optional decode-time SAE steering via delta injection.
Shared FASTA parsing utility `bionemo-recipes/.../evo2/src/evo2_sae/fasta.py`	Streaming `read_fasta()` reader transparently supports plain and gzip-compressed FASTA, yields `(seq_id, sequence)` tuples, and auto-generates sequential IDs for headerless records.
FASTA integration in chunk script `bionemo-recipes/.../evo2/scripts/chunk_fasta.py`	Updated to use shared `read_fasta()` from `evo2_sae.fasta` instead of local `parse_fasta()`.
Evo2 SAE recipe integration tests `bionemo-recipes/.../evo2/tests/test_steering.py`	CPU test validates clamp-hook arithmetic. GPU tests verify encode produces finite positive codes, unsteered generation produces valid DNA (ACGTN), and steering changes continuation deterministically.
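
The delta-clamp arithmetic summarized for `clamp_hook` can be worked through on a tiny example. This is a pure-Python sketch under stated assumptions: a ReLU encoder without bias is used for brevity (the PR describes a TopK SAE), and `encode`, `decode`, and `delta_clamp` are hypothetical names rather than the functions in sae/steering.py.

```python
def encode(x, w_enc):
    # codes[j] = relu(sum_i w_enc[j][i] * x[i])
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w_enc]


def decode(z, w_dec):
    # w_dec[j] is the decoder direction of feature j (same length as x).
    dim = len(w_dec[0])
    return [sum(z[j] * w_dec[j][i] for j in range(len(z))) for i in range(dim)]


def delta_clamp(x, w_enc, w_dec, clamps):
    """Return x plus the change in the SAE reconstruction caused by the clamps."""
    z = encode(x, w_enc)
    z_clamped = list(z)
    for fid, value in clamps.items():
        z_clamped[fid] = value
    base = decode(z, w_dec)
    steered = decode(z_clamped, w_dec)
    # Inject only the difference, so the SAE's reconstruction error never leaks in.
    return [xi + (s - b) for xi, s, b in zip(x, steered, base)]


w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # 3 features, hidden size 2
w_dec = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.25]]
x = [1.0, 2.0]

unchanged = delta_clamp(x, w_enc, w_dec, {})        # no clamps
steered = delta_clamp(x, w_enc, w_dec, {0: 5.0})    # feature 0: 1.0 -> 5.0
```

Here the SAE reconstructs `x` imperfectly (`decode(encode(x))` is `[1.25, 1.75]`), yet `unchanged` equals `x` exactly, and `steered` equals `x + (5.0 - 1.0) * w_dec[0]`, the analytic delta that the unit tests are described as checking.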

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA-BioNeMo/bionemo-framework#1621: Both PRs modify the Evo2 recipe's FASTA chunking pipeline by refactoring parse_fasta logic in scripts/chunk_fasta.py; this PR removes the local parser in favor of a shared evo2_sae.fasta.read_fasta utility.

Suggested labels

ciflow:all

Suggested reviewers

jstjohn
pstjohn
jwilber
trvachov

🐰 A SAE hook so clever, it clamps with a delta,
Evo2 now steers genes with precision so bright—
Encoding and clamping, a feature to frame,
Generation refined: DNA shaped just right! 🧬

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 52.83% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title 'evo2 SAE: inference engine + steering, CI lane, tests, Dockerfile' accurately and specifically summarizes the main components added in this PR.
Description check	✅ Passed	PR description provides comprehensive technical overview with architecture, contents, dependencies, usage examples, and tests, but lacks explicit mapping to template sections.

jwilber · 2026-06-10T18:05:58Z

Have you tried this? Do you have any examples or screenshots you can share?

…ashboard.py - Remove the committed sample parquets; the dashboard now reads atlas data the user provides (gitignored public/*.parquet). It does NOT generate — generation is a separate offline step. - Add scripts/launch_dashboard.py: validate the 3 atlas parquets in --data-dir (exist + feature_id schema, fail fast) -> stage into feature_explorer/public/ -> start Vite. Mirrors the codonfm/esm2 launch_dashboard convention; engine-free (stdlib + pyarrow), so this PR stays a pure front-end (runtime dep on the #1622 server only). - Fix stale refs (evo2_sae_infer -> evo2_sae, steering_server.py -> server.py, layer 19 -> 26). - tests/test_launch_dashboard.py (CPU): staging copies the parquets; missing file -> FileNotFoundError; wrong schema -> ValueError. 3 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

polinabinder1 · 2026-06-10T23:33:51Z

@jwilber This only deals with the steering backend. The visualization is in PR 1623.

Users pick from a preset library or paste sequences; the backend embeds them live (Evo2 -> layer-L -> SAE, mean/max-pooled per sequence) and the client UMAPs them, recoloring by feature. SequenceUMAPView.jsx (umap-js, already a dep) + the 'sequmap' tab + a small preset sequence_library.json. Needs the /gene_embed endpoint on the server (added in #1622). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

React/Vite dashboard for the evo2 SAE — three tabs (Feature atlas, Generative steering, Sequence inspector) plus a feature-detail drill-down. Front-end only: the atlas tab reads static parquet (works with no backend); the inspector + steering tabs call the live engine (`launch_inference.sh serve`, #1622) through the Vite /api -> :8001 proxy. Runtime dependency on the server only — no code dependency, so it merges independently of #1622. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

polinabinder1 · 2026-06-11T18:13:27Z

@coderabbitai review

coderabbitai · 2026-06-11T18:13:34Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 10

🧹 Nitpick comments (8)

bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py (1)

37-50: ⚖️ Poor tradeoff

Consider more portable default paths.

Similar to the shell script, the default checkpoint and annotation paths are hardcoded to /data/interp/evo2/... which won't exist for other users. While these can be overridden via CLI arguments or environment variables (making this less critical than the shell script issue), consider removing these hardcoded defaults or documenting the required setup clearly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`
around lines 37 - 50, Default file paths for CLI args (--sae-ckpt-path,
--feature-annotations and EVO2_CKPT_DIR env fallback) are hardcoded to
/data/interp/evo2/...; remove or replace these with portable defaults by making
the argparse defaults None (or point to a user/home-relative path) and rely on
environment variables (SAE_CKPT_PATH, FEATURE_ANNOTATIONS, EVO2_CKPT_DIR) or
explicit CLI input, and update the code that consumes these values (where these
args are referenced) to validate and raise a clear error if no path is provided;
target the add_argument calls for "--sae-ckpt-path", "--feature-annotations" and
the EVO2_CKPT_DIR default.
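
A minimal sketch of the portable-defaults suggestion, using only `argparse` and environment lookups. The `--evo2-ckpt-dir` flag name, the `build_parser` / `require_paths` helpers, and the choice of which paths are mandatory are assumptions for illustration; the environment variable names are the ones mentioned in the review.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose path defaults come from the environment only."""
    env = os.environ if env is None else env
    p = argparse.ArgumentParser(prog="evo2-sae-sketch")
    # No machine-specific paths: an unset variable leaves the default as None.
    p.add_argument("--evo2-ckpt-dir", default=env.get("EVO2_CKPT_DIR"))
    p.add_argument("--sae-ckpt-path", default=env.get("SAE_CKPT_PATH"))
    p.add_argument("--feature-annotations", default=env.get("FEATURE_ANNOTATIONS"))
    return p


def require_paths(args):
    """Exit with a clear message naming every missing required path."""
    required = {
        "--evo2-ckpt-dir (EVO2_CKPT_DIR)": args.evo2_ckpt_dir,
        "--sae-ckpt-path (SAE_CKPT_PATH)": args.sae_ckpt_path,
    }
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise SystemExit("missing required path(s): " + ", ".join(missing))
    return args


args = build_parser(env={"EVO2_CKPT_DIR": "/ckpt"}).parse_args(
    ["--sae-ckpt-path", "sae.pt"]
)
require_paths(args)
```

Passing `env` explicitly keeps the helper testable; in a real CLI the default `os.environ` would be used.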

bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py (7)

51-55: ⚡ Quick win

Add field docstrings to FeatureClamp.

📝 Example enhancement

 class FeatureClamp(BaseModel):
     """A single SAE-feature steering clamp (feature id + target strength)."""
 
-    feature_id: int
-    strength: float = 1.0
+    feature_id: int
+    """SAE feature ID to clamp during generation."""
+    strength: float = 1.0
+    """Target activation strength for the feature."""

As per coding guidelines, use Google-style docstrings (pydocstyle convention) in Python code.

🤖 Prompt for AI Agents

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 51 - 55, Add Google-style (pydocstyle) docstrings describing each
field on the FeatureClamp Pydantic model: update the class docstring for
FeatureClamp (subclassing BaseModel) to include an Args section documenting
feature_id (int) and strength (float) with concise descriptions and
units/semantics (e.g., feature index and target steering strength, default 1.0).
Keep the top-line summary intact and ensure the Args block follows Google style
so linters accept it.

Source: Coding guidelines

58-68: ⚡ Quick win

Add field docstrings to GenerateRequest.

📝 Example enhancement

 class GenerateRequest(BaseModel):
     """Request body for /generate (autoregressive generation + optional SAE-feature clamps)."""
 
-    prompt: str = ""
-    organism: str = "None (raw DNA)"
-    tag: Optional[str] = None
-    features: list[FeatureClamp] = []
-    n_tokens: int = 120
-    temperature: float = 1.0
-    top_k: int = 0
-    compare_baseline: bool = False
+    prompt: str = ""
+    """Initial DNA sequence to condition generation."""
+    organism: str = "None (raw DNA)"
+    """Organism identifier for phylogenetic tagging."""
+    tag: Optional[str] = None
+    """Custom phylogenetic tag (overrides organism lookup)."""
+    features: list[FeatureClamp] = []
+    """SAE feature clamps for steering generation."""
+    n_tokens: int = 120
+    """Number of tokens to generate."""
+    temperature: float = 1.0
+    """Sampling temperature (higher = more random)."""
+    top_k: int = 0
+    """Top-k sampling parameter (0 = disabled)."""
+    compare_baseline: bool = False
+    """Whether to generate an unsteered baseline for comparison."""

As per coding guidelines, use Google-style docstrings (pydocstyle convention) in Python code.

🤖 Prompt for AI Agents

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 58 - 68, Add Google-style (pydocstyle) docstrings for the
GenerateRequest datamodel: add a class docstring describing the purpose of
GenerateRequest and include an Args section that documents each attribute
(prompt, organism, tag, features: list[FeatureClamp], n_tokens, temperature,
top_k, compare_baseline) with types and brief descriptions (e.g., prompt: input
sequence string; organism: organism context or "None (raw DNA)"; tag: optional
user tag; features: SAE FeatureClamp list used for clamping; n_tokens: number of
tokens to generate; temperature: sampling temperature; top_k: top-k sampling
value; compare_baseline: whether to compare to baseline). Ensure the formatting
follows Google-style pydocstyle conventions and place the docstring immediately
under the class GenerateRequest declaration.

Source: Coding guidelines

39-48: ⚡ Quick win

Add field docstrings to AnnotateRequest.

The class is missing Google-style field docstrings. Each field should document its purpose, especially fields like mode that have specific allowed values ("topk" | "pick").

📝 Example enhancement

 class AnnotateRequest(BaseModel):
     """Request body for /annotate (top-k feature scan or an explicit feature pick)."""
 
-    sequence: str
-    organism: str = "None (raw DNA)"
-    tag: Optional[str] = None
-    mode: str = "topk"  # "topk" | "pick"
-    k: int = 8
-    feature_ids: Optional[list[int]] = None
-    feature_id: Optional[int] = None
+    sequence: str
+    """DNA sequence to annotate."""
+    organism: str = "None (raw DNA)"
+    """Organism identifier for phylogenetic tagging."""
+    tag: Optional[str] = None
+    """Custom phylogenetic tag (overrides organism lookup)."""
+    mode: str = "topk"
+    """Feature selection mode: 'topk' (top-k scan) or 'pick' (explicit features)."""
+    k: int = 8
+    """Number of top features to return when mode='topk'."""
+    feature_ids: Optional[list[int]] = None
+    """Explicit feature IDs when mode='pick'."""
+    feature_id: Optional[int] = None
+    """Single feature ID when mode='pick' (alternative to feature_ids)."""

As per coding guidelines, use Google-style docstrings (pydocstyle convention) in Python code.

🤖 Prompt for AI Agents

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 39 - 48, The AnnotateRequest Pydantic model lacks Google-style
field docstrings; update the class docstring for AnnotateRequest to include a
Google-style "Attributes:" section that documents each field (sequence,
organism, tag, mode, k, feature_ids, feature_id), describing purpose,
types/constraints and allowed values for mode ("topk" | "pick") and any
relationships (e.g., feature_ids vs feature_id) so readers and linters can
validate the field meanings. Ensure the docstring follows pydocstyle/Google
conventions and mentions defaults where relevant.

Source: Coding guidelines

99-107: ⚡ Quick win

Add return type hint to features endpoint.

     @app.get("/features")
-    def features():
+    def features() -> list[dict]:

As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.

🤖 Prompt for AI Agents

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 99 - 107, The endpoint function features lacks a return type hint;
update its signature to include a typed return such as def features() ->
List[Dict[str, Any]]: and add the necessary imports (from typing import List,
Dict, Any) at the top of the module, or alternatively define and use a pydantic
model and set response_model on `@app.get`; modify the function signature and
imports so Pyright type checking passes while keeping the existing logic in
features().

Source: Coding guidelines

109-154: ⚡ Quick win

Add return type hint to annotate endpoint.

     @app.post("/annotate")
-    def annotate(req: AnnotateRequest):
+    def annotate(req: AnnotateRequest) -> dict:

As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.

🤖 Prompt for AI Agents

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 109 - 154, The annotate endpoint lacks a return type hint which
fails Pyright checks; update the annotate function signature (def annotate(req:
AnnotateRequest)) to include an explicit return type like -> Dict[str, Any] (or
a proper TypedDict/AnnotateResponse if available), and add the corresponding
typing import (e.g., from typing import Dict, Any) at the top of the module so
Pyright accepts the annotated return for the function annotate and its returned
JSON structure.

Source: Coding guidelines

86-97: ⚡ Quick win

Add return type hint to health endpoint.

     @app.get("/health")
-    def health():
+    def health() -> dict:

As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.

🤖 Prompt for AI Agents

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 86 - 97, The health endpoint lacks a return type hint; update the
health function signature (def health) to declare a typed return such as ->
Dict[str, Any] or -> dict[str, Any] and add the corresponding import (from
typing import Any, Dict) so Pyright can validate the returned mapping built from
engine (engine.ready, engine.layer, engine.n_features, engine.labels,
engine.sae_ckpt_path, engine.organism_tags, engine.device); keep the returned
structure unchanged and ensure the type hint covers the mixed value types.

Source: Coding guidelines

156-172: ⚡ Quick win

Add return type hint to generate endpoint.

     @app.post("/generate")
-    def generate(req: GenerateRequest):
+    def generate(req: GenerateRequest) -> dict:

As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.

🤖 Prompt for AI Agents

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`
around lines 156 - 172, Add an explicit return type annotation to the FastAPI
endpoint function generate (def generate(req: GenerateRequest) -> Any) and
import Any from typing; update the signature so Pyright knows the endpoint's
return type (e.g., def generate(req: GenerateRequest) -> Any:), leaving the body
and exception handling (engine.generate call and HTTPException raises)
unchanged.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh`:
- Around line 17-21: The script currently embeds development-only absolute
defaults for VENV, EVO2_CKPT_DIR, SAE_CKPT_PATH, and FEATURE_ANNOTATIONS which
will break elsewhere; remove those hardcoded paths and instead either (a) set
VENV to a relative default like RECIPE_DIR/.venv and leave
EVO2_CKPT_DIR/SAE_CKPT_PATH/FEATURE_ANNOTATIONS unset, or (b) require these env
vars be provided and add an explicit validation block that checks VENV,
EVO2_CKPT_DIR, SAE_CKPT_PATH, and FEATURE_ANNOTATIONS (while allowing
EMBEDDING_LAYER to keep a sane numeric default), and if any are missing print a
clear error naming the missing variable(s) and exit non‑zero; update the code
references to VENV, EVO2_CKPT_DIR, SAE_CKPT_PATH, FEATURE_ANNOTATIONS, and
EMBEDDING_LAYER accordingly.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`:
- Around line 57-67: Add a Google-style docstring to the _engine function
describing its purpose, parameters, and return value: explain that _engine
constructs and returns an Evo2SAE instance, document each parameter passed to
Evo2SAE (evo2_ckpt_dir, sae_ckpt_path, layer, device, max_seq_len,
feature_annotations) with types and brief descriptions, and state the return
type (Evo2SAE). Place the docstring immediately below the def _engine(args):
line using the standard Google style (Args:, Returns:) so tools and linters can
pick it up.
- Around line 34-55: The function _add_common is missing a Google-style
docstring; add a concise Google-style docstring immediately below the def
_add_common(p: argparse.ArgumentParser) -> None: line describing the function’s
purpose (registers shared CLI arguments), the parameter p (an
argparse.ArgumentParser), and any side effects/returns (modifies the parser in
place, returns None). Use the Google docstring sections: Args and Returns, and
keep wording aligned with surrounding code style.
- Around line 70-87: Add a Google-style docstring to _read_fasta describing
parameters (path), return values (ids, seqs), behavior (supports gzipped files)
and exceptions; and fix the header-parsing edge case by replacing the brittle
line that does line[1:].split()[0] with logic that strips the leading ">" and
whitespace, uses .split() safely (e.g., parts = line[1:].strip().split(); name =
parts[0] if parts else f"seq_{len(ids)}") so headers like "> " don't raise
IndexError and still produce a generated id when no token is present.
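
The header-parsing fix suggested for `_read_fasta` can be sketched as below. This is an illustrative version, not the code in cli.py: parsing is split into a hypothetical `parse_fasta_lines` helper that works on any iterable of lines, so the empty-header case can be exercised without touching the filesystem.

```python
import gzip


def parse_fasta_lines(lines):
    """Parse FASTA text lines into parallel lists (ids, seqs)."""
    ids, chunks = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].strip().split()
            # A bare header such as "> " gets a generated id instead of IndexError.
            ids.append(parts[0] if parts else f"seq_{len(ids)}")
            chunks.append([])
        else:
            if not chunks:  # sequence data before any header line
                ids.append(f"seq_{len(ids)}")
                chunks.append([])
            chunks[-1].append(line)
    return ids, ["".join(c) for c in chunks]


def read_fasta(path):
    """Read a plain or gzip-compressed FASTA file.

    Args:
        path: Path to a FASTA file; a ".gz" suffix selects gzip decoding.

    Returns:
        A tuple (ids, seqs) of equal-length lists.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return parse_fasta_lines(handle)


ids, seqs = parse_fasta_lines([">chr1 some description", "ACG", "T", "> ", "GG"])
```

With this input, `ids` is `["chr1", "seq_1"]` and `seqs` is `["ACGT", "GG"]`, so the empty header no longer raises.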

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/core.py`:
- Around line 168-189: The docstring and logic in the method that reads
self.feature_annotations (variables: labels, peaks, path, path.suffix) claim to
support parquet/tsv/csv/json but only handle parquet; update the code in the
function (the block starting with path = Path(self.feature_annotations)) to
detect .csv/.tsv (use csv or pandas.read_csv), .json (json.load or
pandas.read_json), and parse the same columns ("feature_id", "label" or
"annotation", "max_activation") into labels and peaks just like the parquet
branch, and for any other suffix emit an explicit logger.warning stating the
format is unsupported and return empty labels/peaks; ensure you reuse the same
keys/behavior (casting ids to int, labels to str, peaks to float) as done in the
pq branch so the rest of the code remains compatible.
- Around line 352-366: The code indexes SAE tensors using incoming feature IDs
(see fids, features and usages of self.sae.encoder.weight /
self.sae.decoder.weight) without validation; add explicit bounds and type checks
before any tensor indexing inside the block that builds specs (validate each fid
is an integer >=0 and < self.sae.encoder.weight.size(0) and similarly valid for
decoder indexing), and if invalid raise a ValueError with a clear message so the
/generate handler returns 400; perform these checks at the start of the with
self._lock block (before accessing self.sae.* tensors) or filter/convert
f["feature_id"] to int safely and validate before using it in specs construction
(references: fids, features, self.sae.encoder.weight, self.sae.decoder.weight,
self.layer).
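
The bounds check requested for the steering specs could look roughly like the sketch below. `validate_feature_ids` is a hypothetical helper, and `n_features` stands in for `self.sae.encoder.weight.size(0)`; the default strength of 1.0 follows the `FeatureClamp` description elsewhere in this review.

```python
def validate_feature_ids(features, n_features):
    """Return [(feature_id, strength)] or raise ValueError.

    Intended to run before any SAE tensor is indexed, so a bad id surfaces
    as a client error rather than an out-of-bounds index.
    """
    specs = []
    for f in features:
        fid = f.get("feature_id")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(fid, bool) or not isinstance(fid, int):
            raise ValueError(f"feature_id must be an integer, got {fid!r}")
        if not 0 <= fid < n_features:
            raise ValueError(f"feature_id {fid} out of range [0, {n_features})")
        specs.append((fid, float(f.get("strength", 1.0))))
    return specs
```

A `/generate` handler that catches `ValueError` and maps it to a 400 response would then reject invalid ids without touching the model.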

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`:
- Around line 124-131: The code currently treats any non-"pick" mode as "topk";
update the conditional around req.mode in server.py to explicitly handle "pick"
and "topk" only and raise an HTTPException(400, "invalid mode; allowed values:
'pick', 'topk'") for any other value. Concretely, change the if/else to if
req.mode == "pick": ... elif req.mode == "topk": compute k and call
engine.top_features(...); else: raise the 400 error so typos or unsupported
modes are rejected (refer to req.mode, engine.top_features, chosen).
- Line 84: The CORS middleware is currently set to allow all origins via
app.add_middleware(CORSMiddleware, allow_origins=["*"]) which is too permissive
for production; update the server startup to read an environment variable (e.g.,
CORS_ALLOWED_ORIGINS or CORS_ALLOWED_ORIGIN) and use that to populate
allow_origins (parse a comma-separated list into a list), defaulting to a safe
value like an empty list or localhost for dev, and ensure
allow_methods/allow_headers remain appropriate; locate the use of
app.add_middleware and replace the hardcoded ["*"] with the parsed config so
deployments can restrict origins without code changes.
- Around line 23-34: Add a second blank line after the import block (the line
ending with "from .core import Evo2SAE, clean_dna") so there are two blank lines
before the next top-level statement (e.g., the logger = logging.getLogger(...)
or any subsequent definitions); this aligns with the isort rule and ensures the
import section (including Evo2SAE and clean_dna) is separated from module-level
code.
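
The mode and CORS items above can be sketched without FastAPI as two small pure functions. `parse_cors_origins` and `check_mode` are hypothetical names; the environment variable name `CORS_ALLOWED_ORIGINS` comes from the review comment, and the empty-list default is one of the two defaults it suggests.

```python
import os

ALLOWED_MODES = ("pick", "topk")


def parse_cors_origins(raw):
    """Turn a comma-separated origin list into a list; unset or blank means no origins."""
    if raw is None or not raw.strip():
        return []
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


def check_mode(mode):
    """Accept only the two supported modes instead of treating every non-'pick' value as 'topk'."""
    if mode not in ALLOWED_MODES:
        raise ValueError("invalid mode; allowed values: 'pick', 'topk'")
    return mode


# The parsed list would be passed as allow_origins when the middleware is added.
origins = parse_cors_origins(os.environ.get("CORS_ALLOWED_ORIGINS"))
```

In the server, the `ValueError` from `check_mode` would be translated into `HTTPException(400, ...)` as the comment requests.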

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py`:
- Around line 99-105: Update the test_endpoints_503_until_ready to also assert
that the /generate endpoint returns 503 when the engine is not ready: in the
existing test that creates FakeEngine (eng.ready = False), using
TestClient(build_app(eng)) add a POST request to "/generate" with a
representative JSON payload (similar shape to other tests, e.g. prompt/sequence
fields) and assert c.post("/generate", json=...).status_code == 503 so /generate
is covered like /features and /annotate.

---

Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`:
- Around line 37-50: Default file paths for CLI args (--sae-ckpt-path,
--feature-annotations and EVO2_CKPT_DIR env fallback) are hardcoded to
/data/interp/evo2/...; remove or replace these with portable defaults by making
the argparse defaults None (or point to a user/home-relative path) and rely on
environment variables (SAE_CKPT_PATH, FEATURE_ANNOTATIONS, EVO2_CKPT_DIR) or
explicit CLI input, and update the code that consumes these values (where these
args are referenced) to validate and raise a clear error if no path is provided;
target the add_argument calls for "--sae-ckpt-path", "--feature-annotations" and
the EVO2_CKPT_DIR default.
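
One way to make the defaults portable is sketched below, assuming the flags and environment variable names quoted in the comment. `parse_paths` is a hypothetical helper, and treating `--feature-annotations` as optional is an assumption, not something the review states.

```python
import argparse


def parse_paths(argv, env):
    """Resolve checkpoint paths from CLI flags first, then environment variables."""
    parser = argparse.ArgumentParser()
    # Defaults are None so no machine-specific path is baked into the CLI.
    parser.add_argument("--sae-ckpt-path", default=None)
    parser.add_argument("--feature-annotations", default=None)
    args = parser.parse_args(argv)

    resolved = {
        "sae_ckpt_path": args.sae_ckpt_path or env.get("SAE_CKPT_PATH"),
        "feature_annotations": args.feature_annotations or env.get("FEATURE_ANNOTATIONS"),
        "evo2_ckpt_dir": env.get("EVO2_CKPT_DIR"),
    }
    # Assumption: annotations are optional, the two checkpoint paths are required.
    missing = [k for k in ("sae_ckpt_path", "evo2_ckpt_dir") if not resolved[k]]
    if missing:
        raise ValueError(
            "no path provided for: " + ", ".join(missing)
            + " (pass the CLI flag or set the matching environment variable)"
        )
    return resolved
```

Calling `parse_paths(sys.argv[1:], os.environ)` at startup would fail fast with a readable message instead of a missing-file error deep in `load()`.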

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`:
- Around line 51-55: Add Google-style (pydocstyle) docstrings describing each
field on the FeatureClamp Pydantic model: update the class docstring for
FeatureClamp (subclassing BaseModel) to include an Args section documenting
feature_id (int) and strength (float) with concise descriptions and
units/semantics (e.g., feature index and target steering strength, default 1.0).
Keep the top-line summary intact and ensure the Args block follows Google style
so linters accept it.
- Around line 58-68: Add Google-style (pydocstyle) docstrings for the
GenerateRequest datamodel: add a class docstring describing the purpose of
GenerateRequest and include an Args section that documents each attribute
(prompt, organism, tag, features: list[FeatureClamp], n_tokens, temperature,
top_k, compare_baseline) with types and brief descriptions (e.g., prompt: input
sequence string; organism: organism context or "None (raw DNA)"; tag: optional
user tag; features: SAE FeatureClamp list used for clamping; n_tokens: number of
tokens to generate; temperature: sampling temperature; top_k: top-k sampling
value; compare_baseline: whether to compare to baseline). Ensure the formatting
follows Google-style pydocstyle conventions and place the docstring immediately
under the class GenerateRequest declaration.
- Around line 39-48: The AnnotateRequest Pydantic model lacks Google-style field
docstrings; update the class docstring for AnnotateRequest to include a
Google-style "Attributes:" section that documents each field (sequence,
organism, tag, mode, k, feature_ids, feature_id), describing purpose,
types/constraints and allowed values for mode ("topk" | "pick") and any
relationships (e.g., feature_ids vs feature_id) so readers and linters can
validate the field meanings. Ensure the docstring follows pydocstyle/Google
conventions and mentions defaults where relevant.
- Around line 99-107: The endpoint function features lacks a return type hint;
update its signature to include a typed return such as def features() ->
List[Dict[str, Any]]: and add the necessary imports (from typing import List,
Dict, Any) at the top of the module, or alternatively define and use a pydantic
model and set response_model on `@app.get`; modify the function signature and
imports so Pyright type checking passes while keeping the existing logic in
features().
- Around line 109-154: The annotate endpoint lacks a return type hint which
fails Pyright checks; update the annotate function signature (def annotate(req:
AnnotateRequest)) to include an explicit return type like -> Dict[str, Any] (or
a proper TypedDict/AnnotateResponse if available), and add the corresponding
typing import (e.g., from typing import Dict, Any) at the top of the module so
Pyright accepts the annotated return for the function annotate and its returned
JSON structure.
- Around line 86-97: The health endpoint lacks a return type hint; update the
health function signature (def health) to declare a typed return such as ->
Dict[str, Any] or -> dict[str, Any] and add the corresponding import (from
typing import Any, Dict) so Pyright can validate the returned mapping built from
engine (engine.ready, engine.layer, engine.n_features, engine.labels,
engine.sae_ckpt_path, engine.organism_tags, engine.device); keep the returned
structure unchanged and ensure the type hint covers the mixed value types.
- Around line 156-172: Add an explicit return type annotation to the FastAPI
endpoint function generate (def generate(req: GenerateRequest) -> Any) and
import Any from typing; update the signature so Pyright knows the endpoint's
return type (e.g., def generate(req: GenerateRequest) -> Any:), leaving the body
and exception handling (engine.generate call and HTTPException raises)
unchanged.

📥 Commits

Reviewing files that changed from the base of the PR and between e407165 and de81106.

📒 Files selected for processing (8)

bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/pyproject.toml
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/__init__.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/core.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steering.py

…_sae serve` Shrink the inference PR to the engine + server + their tests. The encode/batch/generate command-line tools (cli.py) and launch_inference.sh move to the stacked CLI PR (#1632); the server stays launchable here via `python -m evo2_sae serve` (__main__.py, env-configured). fasta.py stays (shared by the extraction-side chunk_fasta.py and, via the base, the CLI). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

…1622) Steering's only consumers (the live engine's clamp hook + the steer.py harness) both live in the evo2 serve recipe (#1622), and the harness imports Evo2SAE from it. So the steering primitive + harness move to a dedicated PR stacked on #1622, where the core clamp-hook dedup can happen in-place. This base is now the probing library only. Signed-off-by: Polina Binder <pbinder@nvidia.com>

copy-pr-bot · 2026-06-11T22:00:32Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

#1637 was re-landed on the migrated #1622 (new top-level layout); #1623 is stacked on it, so it's layered onto the new #1637 by hand (the old dashboard branch predates the #1637 hardening, so a whole-file diff would revert it).

- Clean adds: feature_explorer/ (the React dashboard, incl. the gene_embed firing-column reduction, request timeouts, per-pane descriptions, shared components.jsx, crash-safe localStorage), scripts/{dashboard,launch_dashboard}.py, tests/test_launch_dashboard.py.
- server.py: add GeneEmbedRequest + /gene_embed (uses core.clean_dna + _require_ready).
- core.py: add Evo2SAE.embed_bundle (ships only firing columns + feature_ids, not the full 65536-wide matrix).
- tests: gene_embed contract test in test_server.py; bind embed_bundle on the conftest FakeEngine.
- pyproject: add scikit-learn (scripts/dashboard.py atlas PCA/TSNE).

Validated in the evo2_megatron venv: CPU 44 passed; frontend `vite build` clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

…d onto migrated #1622 Re-lands #1635 on the post-#1633 layout, on top of migrated #1622: the steering-eval harness (scripts/{steer,steer_analysis}.py) over #1622's generate(), with model-agnostic metrics. Validated: tests/test_steer_analysis.py -> 3 passed (CPU); harness scripts compile. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

Loading an SAE whose input_dim != the model's hidden size (wrong SAE/model pairing) used to succeed at load and then fail with a cryptic matmul shape error on the first encode. load() now checks it up front and raises a clear message ("SAE input_dim=X does not match the Evo2 hidden size=H at layer L — wrong SAE/model pairing (check --sae-ckpt-path / --layer)"). - _model_hidden_size(): read it from the model config (cheap) or a 1-token forward (ground truth); None if neither works -> check skipped, never blocks an otherwise-fine load. - _check_dim(): pure, unit-tested on CPU (test_check_dim_rejects_sae_model_mismatch). NOT detectable, documented in code: a *wrong layer number* with the same hidden size — encode still matches dims and silently yields out-of-distribution features. The SAE checkpoint records no training layer; /health surfaces the configured layer. Follow-up: stamp the layer into the SAE checkpoint at train time and assert it here. Validated in the evo2_megatron venv: CPU test_core 7 passed, GPU test_steering 12 passed on the 1B (real load() exercises the new check; 1920==1920, no false positive). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
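
The up-front dimension check described in this commit can be sketched as a pure function. The signature below is an assumption (the commit only names `_check_dim()` and says it is pure and CPU-tested), and the message text is paraphrased from the one quoted in the commit.

```python
from typing import Optional


def check_dim(sae_input_dim: int, model_hidden: Optional[int], layer: int) -> None:
    """Raise if the SAE input width cannot match the model's hidden size.

    model_hidden is None when the hidden size could not be determined; in
    that case the check is skipped so it never blocks an otherwise-fine load.
    """
    if model_hidden is None:
        return
    if sae_input_dim != model_hidden:
        raise ValueError(
            f"SAE input_dim={sae_input_dim} does not match the Evo2 hidden "
            f"size={model_hidden} at layer {layer}: wrong SAE/model pairing "
            "(check --sae-ckpt-path / --layer)"
        )
```

As the commit notes, this cannot catch a wrong layer number when the hidden size is the same; the dims still match and the check passes.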

#1622 was migrated to the new top-level layout (interpretability/sparse_autoencoders/…, no bionemo-recipes/ prefix) and squash-replayed, so #1637 is layered on by hand rather than rebased.

- Clean adds: src/evo2_sae/{server,cli}.py, scripts/launch_inference.sh, tests/test_{cli,server}.py.
- pyproject: add pandas/fastapi/uvicorn/anyio.
- tests/conftest.py: keep #1622's 1B GPU fixtures + bionemo.common loader; append the serve-layer FakeEngine + fake_engine fixture.
- core.py (semantic merge): keep #1622's _sanitize_steering (all CPU sanitize tests) and fold in the explicit non-finite-strength guard (no min/max arg-order reliance); add the shared annotate() + parse_clamp_spec() (CLI strings ⇄ API dicts) and feed parse_clamp_spec in front of _sanitize_steering; add _is_unrecoverable_cuda + flip the engine not-ready on an unrecoverable CUDA fault in generate(). (Kept _sanitize_steering rather than swapping in _normalize_clamps; it is non-redundant and preserves #1622's sampler hardening + tests.)
- test_steering.py: keep #1622's sanitize + GPU tests; add the _is_unrecoverable_cuda test.

Preserved from #1622: clamp_hook canonical encode/decode, TopKSAE-only _load_sae (no ReLU), bionemo.common(/core fallback) loader.

Validated in the evo2_megatron venv: CPU 38 passed, GPU test_steering 13 passed on the 1B (ran, not skipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

A new ubuntu-latest workflow installs sae + the recipe (CPU torch) and runs the recipe's model-agnostic tests (-m 'not slow') — the label producers (#1630), eval metrics, etc. — so they run cheaply on the probing-stack branches instead of waiting for #1622's megatron GPU lane (which would run them on an L4 after a full build). Registers the 'slow' marker on the recipe pyproject so the GPU tests are excluded without an unknown-marker warning. Validated: pytest tests/ -m 'not slow' -> 16 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

Sort encode_batch work by token length so each micro-batch holds similar-length sequences (less wasted padding on mixed-length inputs). Results are written back by original index, so the returned order still matches the input order. Add a CPU test that stubs the model and asserts input-order output despite the internal length-sort. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
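
The sort-then-restore pattern in this commit can be illustrated with a small model-free sketch. `encode_batch_bucketed` and its `encode_fn` callback are hypothetical stand-ins for the engine's `encode_batch` and its per-micro-batch forward.

```python
def encode_batch_bucketed(token_lists, encode_fn, batch_size):
    """Encode sequences in length-sorted micro-batches, returning results in input order.

    encode_fn takes a list of token lists and returns one result per list.
    """
    # Indices of the inputs, ordered by token length (stable for ties).
    order = sorted(range(len(token_lists)), key=lambda i: len(token_lists[i]))
    results = [None] * len(token_lists)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        outs = encode_fn([token_lists[i] for i in idx])
        # Write each result back at its original position.
        for i, out in zip(idx, outs):
            results[i] = out
    return results
```

Because each micro-batch holds neighbors in the length ordering, the padding needed to reach the longest member of a batch stays small, while callers still see outputs aligned with their inputs.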

Drop the separate truncated encode model (load_model_to_layer(full=False)) and serve both paths from the one inference engine built by setup_inference_engine:
* load() builds the engine and takes self.model = unwrap_model(comp.model) + comp.tokenizer.
* encode/highlight (_forward_hidden) runs a normal full-sequence forward on that model and reads layer L off a forward hook. The engine model is post_process=True, so output_embeddings can't be used; the hook captures the same [S,B,H] module output the steering clamp_hook reads, so encode and steer see identical activations.
* generate() steers on self.model.decoder.layers[L] (the same module encode reads).

Removes the ~1.8x model duplication (one set of weights instead of truncated + full). The num-microbatches double-init teardown is now just defensive (only one model inits it).

Test: add GPU test_highlight_steer_interleaving_no_bleed: encode is bit-identical across a steered generate, and a baseline generate is unaffected by prior encode/steer history (proves no state bleed between the shared model's highlight forward and decode path).

Validated end-to-end on the 1B-8k-bf16 (21/21 tests pass, incl. the interleaving + steering GPU tests). 7B fidelity still unconfirmed (no 7B checkpoint available); HOLD push for that gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

…se SAE forward

- Extract evo2_buffer.forward_codes(engine, id_lists): the one place that touches the engine internals (locked GPU forward + SAE encode). build_buffer and probe._encode_windows both use it, so the #1622 engine-API coupling lives in a single spot, and the per-token label/buffer work moves out of the GPU lock. Add a CPU unit test (fake engine) for the helper's contract.
- Hoist KINGDOM_TAGS to evo2_buffer (was duplicated in probe_loss_recovered).
- Remove the `codon-aa` subcommand: it consumed a codon/aa npz no command produces (and was the only raw np.load); drop it and its now-unused decode_eval/fit_softmax imports until a producer exists.
- SAEWrap delegates to the SAE's own forward() (top-k + normalize_input denormalization) instead of hand-rolling decoder(codes)+pre_bias and mean/std. This is the path the steering hook uses, so the loss-recovered recon can't drift from the SAE's actual (de)normalization.
- Make evo2_buffer importable without the evo2_sae engine (lazy read_fasta), so the CPU tests exercise forward_codes and the harness imports cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

…ck (#1622)

recipes/evo2/ is co-owned with #1622 (Dockerfile/build + src/evo2_sae). Align the shared files so the two stacks merge without conflict, regardless of order:

- pyproject.toml: keep `[tool.setuptools] packages = []` (unchanged from main, so #1622's `where = ["src"]` wins cleanly at merge and `pip install -e recipes/evo2` still works here with no src/ dir); make the `[tool.pytest.ini_options]` markers block byte-identical to #1622's so the add/add merges cleanly. The biopython/pyrodigal deps stay a one-sided add.
- Drop tests/conftest.py (it add/add-collided with #1622's GPU-fixture conftest) and restore the per-file scripts/ sys.path insert in test_probe_integration.py, matching the sibling tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

…validation test

- test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled.
- test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA / the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which arrive when the serve + eval stacks share recipes/evo2/, so it runs in the merged megatron GPU lane.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

…d onto migrated #1622

Clamp an SAE feature via the production Evo2SAE.generate path and quantify the causal effect: dose-response (effect vs strength) + selectivity (target vs control features), persisted to a structured steering_results.json.

Review fixes:
* metric: replace positional Hamming with normalized edit (Levenshtein) distance. Greedy decode is autoregressive, so one early flipped token shifts every downstream base and pins Hamming at ~1.0, erasing the dose curve. Edit distance is shift-robust; first_divergence (shared-prefix length) is the complementary monotone signal. Tested with the shift case.
* surface the clamp cap: generate() silently caps |strength| to MAX_CLAMP_STRENGTH, so two requests above it produce an identical clamp (a fake plateau). run_steering now warns, steers at the effective value, and records max_clamp_strength + capped_strengths.
* fix dangling doc reference (probe.py -> extract.py, which exists).
* refactor steer.py into injectable pick_target()/run_steering() and add CPU test_steer.py (fake engine, local not in conftest) covering target picking, dose monotonicity, selectivity, and cap reporting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
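
The shift problem behind the metric change is easy to reproduce. The sketch below uses hypothetical helper names and assumes the edit distance is normalized by the longer sequence's length; the commit does not state the normalizer.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    n = max(len(a), len(b))
    return levenshtein(a, b) / n if n else 0.0


def hamming_fraction(a, b):
    """Fraction of mismatching positions over the shared length."""
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a, b)) / n if n else 0.0


def first_divergence(a, b):
    """Length of the shared prefix."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


baseline = "ACGTACGTAC"
steered = "AACGTACGTA"  # one base inserted near the start, so everything after it shifts
print(hamming_fraction(baseline, steered))          # 0.9
print(normalized_edit_distance(baseline, steered))  # 0.2
print(first_divergence(baseline, steered))          # 1
```

A single early insertion makes almost every position mismatch under Hamming, while the edit distance stays proportional to the actual change.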

…d onto migrated #1622

Clamp an SAE feature via the production Evo2SAE.generate path and quantify the causal effect: dose-response (effect vs strength) + selectivity (target vs control features), persisted to a structured steering_results.json.

Metric / robustness:
* normalized edit (Levenshtein) distance, not positional Hamming. Greedy decode is autoregressive, so one early flipped token shifts every downstream base and pins Hamming at ~1.0, erasing the dose curve. Edit distance is shift-robust; first_divergence (shared-prefix length) is the complementary monotone signal. Tested with the shift case.
* surface the clamp cap: generate() silently caps |strength| to MAX_CLAMP_STRENGTH, so two requests above it produce an identical clamp (a fake plateau). run_steering warns, steers at the effective value, and records max_clamp_strength + capped_strengths.

Consolidation:
* harness + metrics live in the package (src/evo2_sae/steer_analysis.py), engine injected, so they import as a normal torch-free module like evo2_sae.fasta; dropped all four sys.path inserts. scripts/steer.py is now a thin CLI (matches train.py/extract.py).
* pick_target reuses Evo2SAE.top_features (the CLI/server ranking) instead of re-deriving topk.
* one CPU test file (metrics + fake-engine harness) instead of two; fake stays local, not in conftest, to avoid colliding with the sibling server PR's engine fixtures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
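
The clamp-cap reporting can be sketched as follows. The cap value of 100.0 is a placeholder, since the real `MAX_CLAMP_STRENGTH` is not given here; `effective_strengths` is a hypothetical helper, and the inputs are assumed to be finite.

```python
import math

MAX_CLAMP_STRENGTH = 100.0  # placeholder value, not the recipe's actual constant


def effective_strengths(requested, cap=MAX_CLAMP_STRENGTH):
    """Map requested strengths to the values a capped generate() would apply.

    Returns the effective strengths plus the list of requests that were
    capped, so a plateau caused by the cap is visible in the results.
    """
    effective, capped = [], []
    for s in requested:
        # Cap the magnitude while preserving the sign.
        e = math.copysign(min(abs(s), cap), s)
        if e != s:
            capped.append(s)
        effective.append(e)
    return {
        "effective": effective,
        "max_clamp_strength": cap,
        "capped_strengths": capped,
    }
```

Recording `capped_strengths` next to the dose curve lets a reader tell a true saturation apart from two requests that were clamped to the same value.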

Replace the 'slow' marker with an explicit @pytest.mark.skipif(not torch.cuda.is_available()) on the GPU/integration tests in test_steering.py (shared 'requires_gpu' decorator). They run when a GPU is present (CI's L4 + megatron env) and skip with a clear 'requires a GPU' reason otherwise — the conftest fixtures still further skip on too-little GPU memory or an unfetchable/unimportable checkpoint. Remove the now-unused 'slow' marker registration from pyproject. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

Remove .github/workflows/unit-tests-interpretability-recipes.yaml per review (recoverable from history / 9bedf2b). Keep .ci_build.sh + .ci_test_env.sh + the tests — those are the build/run machinery (used by the Dockerfile and manual runs), not the CI lane. How to build + run the tests is documented in the PR description; CI should later fold into the repo-wide recipe lane rather than a bespoke workflow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

…ed sae lib); drop CI lane

These probing primitives (eval metrics + ActivationBuffer) are evo2-specific, so move them from the shared sae library into the evo2_sae recipe package:
* sae/src/sae/eval/probing.py -> recipes/evo2/src/evo2_sae/eval/probing.py
* new recipes/evo2/src/evo2_sae/eval/__init__.py (re-exports the probing API)
* sae/src/sae/eval/__init__.py reverted (no longer exports probing; stays shared for esm2/codonfm)
* sae/tests/test_probing.py -> recipes/evo2/tests/test_probing.py (import evo2_sae.eval.probing)

Remove .github/workflows/unit-tests-sae.yaml (defer CI; run tests via the recipe's .ci_build.sh + pytest). Re-parented onto #1622 so the evo2_sae package is available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

A new ubuntu-latest workflow installs sae + the recipe (CPU torch) and runs the recipe's model-agnostic tests (-m 'not slow'; the label producers (#1630), eval metrics, etc.), so they run cheaply on the probing-stack branches instead of waiting for #1622's megatron GPU lane (which would run them on an L4 after a full build).

Registers the 'slow' marker on the recipe pyproject so the GPU tests are excluded without an unknown-marker warning.

Validated: pytest tests/ -m 'not slow' -> 16 passed (CPU).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

…se SAE forward

- Extract evo2_buffer.forward_codes(engine, id_lists), the one place that touches the engine internals (locked GPU forward + SAE encode). build_buffer and probe._encode_windows both use it, so the #1622 engine-API coupling lives in a single spot, and the per-token label/buffer work moves out of the GPU lock. Add a CPU unit test (fake engine) for the helper's contract.
- Hoist KINGDOM_TAGS to evo2_buffer (was duplicated in probe_loss_recovered).
- Remove the `codon-aa` subcommand: it consumed a codon/aa npz no command produces (and was the only raw np.load); drop it and its now-unused decode_eval/fit_softmax imports until a producer exists.
- SAEWrap delegates to the SAE's own forward() (top-k + normalize_input denormalization) instead of hand-rolling decoder(codes)+pre_bias and mean/std. This is the path the steering hook uses, so the loss-recovered recon can't drift from the SAE's actual (de)normalization.
- Make evo2_buffer importable without the evo2_sae engine (lazy read_fasta), so the CPU tests exercise forward_codes and the harness imports cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

…ck (#1622)

recipes/evo2/ is co-owned with #1622 (Dockerfile/build + src/evo2_sae). Align the shared files so the two stacks merge without conflict, regardless of order:

- pyproject.toml: keep `[tool.setuptools] packages = []` (unchanged from main, so #1622's `where = ["src"]` wins cleanly at merge and `pip install -e recipes/evo2` still works here with no src/ dir); make the `[tool.pytest.ini_options]` markers block byte-identical to #1622's so the add/add merges cleanly. The biopython/pyrodigal deps stay a one-sided add.
- Drop tests/conftest.py (it add/add-collided with #1622's GPU-fixture conftest) and restore the per-file scripts/ sys.path insert in test_probe_integration.py, matching the sibling tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

…validation test

- test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled.
- test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA / the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which arrive when the serve + eval stacks share recipes/evo2/, so it runs in the merged megatron GPU lane.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

…d onto migrated #1622

Clamp an SAE feature via the production Evo2SAE.generate path and quantify the causal effect: dose-response (effect vs strength) + selectivity (target vs control features), persisted to a structured steering_results.json.

Metric / robustness:

* normalized edit (Levenshtein) distance, not positional Hamming. Greedy decode is autoregressive, so one early flipped token shifts every downstream base and pins Hamming at ~1.0, erasing the dose curve. Edit distance is shift-robust; first_divergence (shared-prefix length) is the complementary monotone signal. Tested with the shift case.
* surface the clamp cap: generate() silently caps |strength| to MAX_CLAMP_STRENGTH, so two requests above it produce an identical clamp (a fake plateau). run_steering warns, steers at the effective value, and records max_clamp_strength + capped_strengths.

Consolidation:

* harness + metrics live in the package (src/evo2_sae/steer_analysis.py), engine injected, so they import as a normal torch-free module like evo2_sae.fasta; dropped all four sys.path inserts. scripts/steer.py is now a thin CLI (matches train.py/extract.py).
* pick_target reuses Evo2SAE.top_features (the CLI/server ranking) instead of re-deriving topk.
* one CPU test file (metrics + fake-engine harness) instead of two; fake stays local, not in conftest, to avoid colliding with the sibling server PR's engine fixtures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
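The metric argument above (edit distance is shift-robust where positional Hamming saturates) can be checked with a small, dependency-free sketch. This is not the implementation in the package: apart from first_divergence, which the commit message names, the function names and the toy sequences are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so the result lies in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def normalized_hamming(a: str, b: str) -> float:
    """Positional mismatch rate; any length difference counts as mismatches."""
    longest = max(len(a), len(b))
    if not longest:
        return 0.0
    mismatches = sum(ca != cb for ca, cb in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


def first_divergence(a: str, b: str) -> int:
    """Length of the shared prefix, i.e. the index of the first differing position."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


if __name__ == "__main__":
    baseline = "ACGTACGTACGT"
    # One extra base early on shifts every downstream position by one.
    steered = "ATCGTACGTACG"
    print(normalized_hamming(baseline, steered))        # 11 of 12 positions differ
    print(normalized_edit_distance(baseline, steered))  # 2 edits over length 12
    print(first_divergence(baseline, steered))          # shared prefix of length 1
```

In the toy shift case a single inserted base makes 11 of 12 positions mismatch, while the edit distance is only 2 (one insertion plus one deletion at the end), which is the behavior the commit message relies on to keep the dose curve visible.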

Relocate the steering dose-response / selectivity metrics from evo2_sae.steer_analysis into the evo2_sae.eval package (src/evo2_sae/steer_analysis.py -> src/evo2_sae/eval/steering.py), alongside the eval/probing harness. Update the importers (scripts/steer.py CLI + tests/test_steer_analysis.py) to evo2_sae.eval.steering. The CI lane is dropped via the rebase onto the updated #1622. Pure-CPU tests, no GPU/model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

polinabinder1 requested review from jstjohn, jwilber, pstjohn and trvachov as code owners June 10, 2026 16:46

polinabinder1 mentioned this pull request Jun 10, 2026

evo2: live Evo2+SAE inference engine + steering server + CLI #1603

Closed

This was referenced Jun 10, 2026

evo2 SAE recipe: feature-explorer dashboard (viz) #1623

Open

evo2 SAE feature-explorer dashboard #1604

Closed

polinabinder1 force-pushed the pbinder/evo2-sae-serve branch from 91d1e30 to de81106 Compare June 11, 2026 00:28

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

polinabinder1 requested a review from savitha-eng as a code owner June 11, 2026 18:55

This was referenced Jun 11, 2026

evo2 SAE eval (1/2): label producers — imported by #1636 #1630

Closed

evo2 SAE steering: dose-response / selectivity harness #1631

Closed

evo2 infer: generate CLI mode (steer from the command line) #1632

Closed

polinabinder1 force-pushed the pbinder/evo2-sae-serve branch from 5f4fce4 to 4a0de59 Compare June 11, 2026 20:18

This was referenced Jun 11, 2026

evo2 SAE steering: one clamp (sae.steering) for engine, CLI + harness #1634

Closed

evo2-sae: probing primitives (eval metrics + ActivationBuffer) #1629

Open

polinabinder1 marked this pull request as draft June 11, 2026 21:18

polinabinder1 and others added 2 commits June 23, 2026 18:26

polinabinder1 and others added 2 commits June 24, 2026 03:35

Conversation

polinabinder1 commented Jun 10, 2026 • edited

Summary

Architecture: one model, both paths

Contents

Dependency on bionemo.evo2

How to run

Tests

Base of

Note: recipes/evo2/ is co-owned with the eval stack (#1636)

coderabbitai Bot commented Jun 10, 2026 • edited

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

jwilber commented Jun 10, 2026

polinabinder1 commented Jun 10, 2026 • edited

polinabinder1 commented Jun 11, 2026

coderabbitai Bot commented Jun 11, 2026 • edited

coderabbitai Bot left a comment

copy-pr-bot Bot commented Jun 11, 2026


3 participants
