evo2 SAE: inference engine + steering, CI lane, tests, Dockerfile#1622
polinabinder1 wants to merge 13 commits into
📝 Walkthrough
This PR introduces sparse autoencoder (SAE) feature steering capabilities for the Evo2 foundation model, along with a complete inference recipe. It adds a reusable …
Changes: Evo2 SAE Steering and Inference Recipe
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Have you tried this? Any examples or screenshots you can share?
…ashboard.py
- Remove the committed sample parquets; the dashboard now reads atlas data the user provides (gitignored public/*.parquet). It does NOT generate; generation is a separate offline step.
- Add scripts/launch_dashboard.py: validate the 3 atlas parquets in --data-dir (exist + feature_id schema, fail fast) -> stage into feature_explorer/public/ -> start Vite. Mirrors the codonfm/esm2 launch_dashboard convention; engine-free (stdlib + pyarrow), so this PR stays a pure front-end (runtime dep on the #1622 server only).
- Fix stale refs (evo2_sae_infer -> evo2_sae, steering_server.py -> server.py, layer 19 -> 26).
- tests/test_launch_dashboard.py (CPU): staging copies the parquets; missing file -> FileNotFoundError; wrong schema -> ValueError. 3 pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@jwilber This only deals with the steering backend. The visualization is in PR 1623.
Users pick from a preset library or paste sequences; the backend embeds them live (Evo2 -> layer-L -> SAE, mean/max-pooled per sequence) and the client UMAPs them, recoloring by feature. SequenceUMAPView.jsx (umap-js, already a dep) + the 'sequmap' tab + a small preset sequence_library.json. Needs the /gene_embed endpoint on the server (added in #1622). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
91d1e30 to de81106
React/Vite dashboard for the evo2 SAE — three tabs (Feature atlas, Generative steering, Sequence inspector) plus a feature-detail drill-down. Front-end only: the atlas tab reads static parquet (works with no backend); the inspector + steering tabs call the live engine (`launch_inference.sh serve`, #1622) through the Vite /api -> :8001 proxy. Runtime dependency on the server only — no code dependency, so it merges independently of #1622. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
@coderabbitai review
✅ Action performed: review finished.
Actionable comments posted: 10
🧹 Nitpick comments (8)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py (1)
37-50: ⚖️ Poor tradeoff. Consider more portable default paths.
Similar to the shell script, the default checkpoint and annotation paths are hardcoded to `/data/interp/evo2/...`, which won't exist for other users. While these can be overridden via CLI arguments or environment variables (making this less critical than the shell script issue), consider removing these hardcoded defaults or documenting the required setup clearly.
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py (7)
51-55: ⚡ Quick win. Add field docstrings to FeatureClamp.
📝 Example enhancement
     class FeatureClamp(BaseModel):
         """A single SAE-feature steering clamp (feature id + target strength)."""

    -    feature_id: int
    -    strength: float = 1.0
    +    feature_id: int
    +    """SAE feature ID to clamp during generation."""
    +    strength: float = 1.0
    +    """Target activation strength for the feature."""
As per coding guidelines, use Google-style docstrings (pydocstyle convention) in Python code.
Source: Coding guidelines
58-68: ⚡ Quick win. Add field docstrings to GenerateRequest.
📝 Example enhancement
     class GenerateRequest(BaseModel):
         """Request body for /generate (autoregressive generation + optional SAE-feature clamps)."""

    -    prompt: str = ""
    -    organism: str = "None (raw DNA)"
    -    tag: Optional[str] = None
    -    features: list[FeatureClamp] = []
    -    n_tokens: int = 120
    -    temperature: float = 1.0
    -    top_k: int = 0
    -    compare_baseline: bool = False
    +    prompt: str = ""
    +    """Initial DNA sequence to condition generation."""
    +    organism: str = "None (raw DNA)"
    +    """Organism identifier for phylogenetic tagging."""
    +    tag: Optional[str] = None
    +    """Custom phylogenetic tag (overrides organism lookup)."""
    +    features: list[FeatureClamp] = []
    +    """SAE feature clamps for steering generation."""
    +    n_tokens: int = 120
    +    """Number of tokens to generate."""
    +    temperature: float = 1.0
    +    """Sampling temperature (higher = more random)."""
    +    top_k: int = 0
    +    """Top-k sampling parameter (0 = disabled)."""
    +    compare_baseline: bool = False
    +    """Whether to generate an unsteered baseline for comparison."""
As per coding guidelines, use Google-style docstrings (pydocstyle convention) in Python code.
Source: Coding guidelines
39-48: ⚡ Quick win. Add field docstrings to AnnotateRequest.
The class is missing Google-style field docstrings. Each field should document its purpose, especially fields like mode that have specific allowed values ("topk" | "pick").
📝 Example enhancement
     class AnnotateRequest(BaseModel):
         """Request body for /annotate (top-k feature scan or an explicit feature pick)."""

    -    sequence: str
    -    organism: str = "None (raw DNA)"
    -    tag: Optional[str] = None
    -    mode: str = "topk"  # "topk" | "pick"
    -    k: int = 8
    -    feature_ids: Optional[list[int]] = None
    -    feature_id: Optional[int] = None
    +    sequence: str
    +    """DNA sequence to annotate."""
    +    organism: str = "None (raw DNA)"
    +    """Organism identifier for phylogenetic tagging."""
    +    tag: Optional[str] = None
    +    """Custom phylogenetic tag (overrides organism lookup)."""
    +    mode: str = "topk"
    +    """Feature selection mode: 'topk' (top-k scan) or 'pick' (explicit features)."""
    +    k: int = 8
    +    """Number of top features to return when mode='topk'."""
    +    feature_ids: Optional[list[int]] = None
    +    """Explicit feature IDs when mode='pick'."""
    +    feature_id: Optional[int] = None
    +    """Single feature ID when mode='pick' (alternative to feature_ids)."""
As per coding guidelines, use Google-style docstrings (pydocstyle convention) in Python code.
Source: Coding guidelines
99-107: ⚡ Quick win. Add return type hint to features endpoint.
     @app.get("/features")
    -def features():
    +def features() -> list[dict]:
As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.
Source: Coding guidelines
109-154: ⚡ Quick win. Add return type hint to annotate endpoint.
     @app.post("/annotate")
    -def annotate(req: AnnotateRequest):
    +def annotate(req: AnnotateRequest) -> dict:
As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.
Source: Coding guidelines
86-97: ⚡ Quick win. Add return type hint to health endpoint.
     @app.get("/health")
    -def health():
    +def health() -> dict:
As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.
Source: Coding guidelines
156-172: ⚡ Quick win. Add return type hint to generate endpoint.
     @app.post("/generate")
    -def generate(req: GenerateRequest):
    +def generate(req: GenerateRequest) -> dict:
As per coding guidelines, use Pyright for type checking in Python files following pyproject.toml configuration.
Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh`:
- Around line 17-21: The script currently embeds development-only absolute
defaults for VENV, EVO2_CKPT_DIR, SAE_CKPT_PATH, and FEATURE_ANNOTATIONS which
will break elsewhere; remove those hardcoded paths and instead either (a) set
VENV to a relative default like RECIPE_DIR/.venv and leave
EVO2_CKPT_DIR/SAE_CKPT_PATH/FEATURE_ANNOTATIONS unset, or (b) require these env
vars be provided and add an explicit validation block that checks VENV,
EVO2_CKPT_DIR, SAE_CKPT_PATH, and FEATURE_ANNOTATIONS (while allowing
EMBEDDING_LAYER to keep a sane numeric default), and if any are missing print a
clear error naming the missing variable(s) and exit non‑zero; update the code
references to VENV, EVO2_CKPT_DIR, SAE_CKPT_PATH, FEATURE_ANNOTATIONS, and
EMBEDDING_LAYER accordingly.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`:
- Around line 57-67: Add a Google-style docstring to the _engine function
describing its purpose, parameters, and return value: explain that _engine
constructs and returns an Evo2SAE instance, document each parameter passed to
Evo2SAE (evo2_ckpt_dir, sae_ckpt_path, layer, device, max_seq_len,
feature_annotations) with types and brief descriptions, and state the return
type (Evo2SAE). Place the docstring immediately below the def _engine(args):
line using the standard Google style (Args:, Returns:) so tools and linters can
pick it up.
- Around line 34-55: The function _add_common is missing a Google-style
docstring; add a concise Google-style docstring immediately below the def
_add_common(p: argparse.ArgumentParser) -> None: line describing the function’s
purpose (registers shared CLI arguments), the parameter p (an
argparse.ArgumentParser), and any side effects/returns (modifies the parser in
place, returns None). Use the Google docstring sections: Args and Returns, and
keep wording aligned with surrounding code style.
- Around line 70-87: Add a Google-style docstring to _read_fasta describing
parameters (path), return values (ids, seqs), behavior (supports gzipped files)
and exceptions; and fix the header-parsing edge case by replacing the brittle
line that does line[1:].split()[0] with logic that strips the leading ">" and
whitespace, uses .split() safely (e.g., parts = line[1:].strip().split(); name =
parts[0] if parts else f"seq_{len(ids)}") so headers like "> " don't raise
IndexError and still produce a generated id when no token is present.
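The header-parsing fix described for _read_fasta can be sketched in isolation. This is only an illustration of the suggested logic: the helper name `parse_fasta_header` and its `index` argument are hypothetical and are not taken from cli.py.

```python
def parse_fasta_header(line: str, index: int) -> str:
    """Return the sequence id from a FASTA header line.

    Hypothetical helper: `index` stands in for len(ids) in the review text.
    """
    # Drop the leading ">" and any surrounding whitespace before tokenizing.
    parts = line[1:].strip().split()
    # A header such as "> " yields no tokens, so fall back to a generated id
    # instead of indexing into an empty list.
    return parts[0] if parts else f"seq_{index}"


if __name__ == "__main__":
    for i, header in enumerate([">chr1 Homo sapiens", "> ", ">"]):
        print(repr(header), "->", parse_fasta_header(header, i))
```

Taking the first whitespace-delimited token keeps the id stable when a description follows it, while the fallback keeps ids unique per record position.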
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/core.py`:
- Around line 168-189: The docstring and logic in the method that reads
self.feature_annotations (variables: labels, peaks, path, path.suffix) claim to
support parquet/tsv/csv/json but only handle parquet; update the code in the
function (the block starting with path = Path(self.feature_annotations)) to
detect .csv/.tsv (use csv or pandas.read_csv), .json (json.load or
pandas.read_json), and parse the same columns ("feature_id", "label" or
"annotation", "max_activation") into labels and peaks just like the parquet
branch, and for any other suffix emit an explicit logger.warning stating the
format is unsupported and return empty labels/peaks; ensure you reuse the same
keys/behavior (casting ids to int, labels to str, peaks to float) as done in the
pq branch so the rest of the code remains compatible.
- Around line 352-366: The code indexes SAE tensors using incoming feature IDs
(see fids, features and usages of self.sae.encoder.weight /
self.sae.decoder.weight) without validation; add explicit bounds and type checks
before any tensor indexing inside the block that builds specs (validate each fid
is an integer >=0 and < self.sae.encoder.weight.size(0) and similarly valid for
decoder indexing), and if invalid raise a ValueError with a clear message so the
/generate handler returns 400; perform these checks at the start of the with
self._lock block (before accessing self.sae.* tensors) or filter/convert
f["feature_id"] to int safely and validate before using it in specs construction
(references: fids, features, self.sae.encoder.weight, self.sae.decoder.weight,
self.layer).
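The feature-id bounds check requested above might look roughly like the following. It is a sketch under assumptions: `validate_feature_ids` is a hypothetical helper, and `n_features` stands in for whatever size the real code reads from the SAE weights.

```python
def validate_feature_ids(feature_ids, n_features: int) -> list:
    """Return the ids unchanged if all are valid, else raise ValueError.

    Hypothetical helper; n_features stands in for the SAE feature count.
    """
    checked = []
    for fid in feature_ids:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(fid, bool) or not isinstance(fid, int):
            raise ValueError(f"feature_id must be an integer, got {fid!r}")
        if not 0 <= fid < n_features:
            raise ValueError(f"feature_id {fid} is out of range [0, {n_features})")
        checked.append(fid)
    return checked


if __name__ == "__main__":
    print(validate_feature_ids([0, 5, 7], n_features=8))
    try:
        validate_feature_ids([8], n_features=8)
    except ValueError as err:
        print("rejected:", err)
```

Raising ValueError before any tensor is touched is what lets a request handler translate the failure into a 400 response rather than an indexing error.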
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`:
- Around line 124-131: The code currently treats any non-"pick" mode as "topk";
update the conditional around req.mode in server.py to explicitly handle "pick"
and "topk" only and raise an HTTPException(400, "invalid mode; allowed values:
'pick', 'topk'") for any other value. Concretely, change the if/else to if
req.mode == "pick": ... elif req.mode == "topk": compute k and call
engine.top_features(...); else: raise the 400 error so typos or unsupported
modes are rejected (refer to req.mode, engine.top_features, chosen).
- Line 84: The CORS middleware is currently set to allow all origins via
app.add_middleware(CORSMiddleware, allow_origins=["*"]) which is too permissive
for production; update the server startup to read an environment variable (e.g.,
CORS_ALLOWED_ORIGINS or CORS_ALLOWED_ORIGIN) and use that to populate
allow_origins (parse a comma-separated list into a list), defaulting to a safe
value like an empty list or localhost for dev, and ensure
allow_methods/allow_headers remain appropriate; locate the use of
app.add_middleware and replace the hardcoded ["*"] with the parsed config so
deployments can restrict origins without code changes.
- Around line 23-34: Add a second blank line after the import block (the line
ending with "from .core import Evo2SAE, clean_dna") so there are two blank lines
before the next top-level statement (e.g., the logger = logging.getLogger(...)
or any subsequent definitions); this aligns with the isort rule and ensures the
import section (including Evo2SAE and clean_dna) is separated from module-level
code.
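For the CORS comment, parsing a comma-separated origin list could be sketched as below. The variable name CORS_ALLOWED_ORIGINS comes from the review suggestion; the function names and the empty-list default are assumptions, not the server's actual behavior.

```python
import os
from typing import Optional


def parse_cors_origins(raw: Optional[str]) -> list:
    """Split a comma-separated origin string into a clean list.

    An unset or empty value yields an empty list (assumed safe default).
    """
    if not raw:
        return []
    # Strip whitespace around each entry and drop empty fragments such as
    # those produced by a trailing comma.
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


def cors_origins_from_env(var: str = "CORS_ALLOWED_ORIGINS") -> list:
    """Read the allowed origins from the environment (hypothetical helper)."""
    return parse_cors_origins(os.environ.get(var))


if __name__ == "__main__":
    print(parse_cors_origins("http://localhost:5173, https://example.org,"))
```

The resulting list would then be passed as allow_origins in place of the hardcoded ["*"].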
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py`:
- Around line 99-105: Update the test_endpoints_503_until_ready to also assert
that the /generate endpoint returns 503 when the engine is not ready: in the
existing test that creates FakeEngine (eng.ready = False), using
TestClient(build_app(eng)) add a POST request to "/generate" with a
representative JSON payload (similar shape to other tests, e.g. prompt/sequence
fields) and assert c.post("/generate", json=...).status_code == 503 so /generate
is covered like /features and /annotate.
---
Nitpick comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`:
- Around line 37-50: Default file paths for CLI args (--sae-ckpt-path,
--feature-annotations and EVO2_CKPT_DIR env fallback) are hardcoded to
/data/interp/evo2/...; remove or replace these with portable defaults by making
the argparse defaults None (or point to a user/home-relative path) and rely on
environment variables (SAE_CKPT_PATH, FEATURE_ANNOTATIONS, EVO2_CKPT_DIR) or
explicit CLI input, and update the code that consumes these values (where these
args are referenced) to validate and raise a clear error if no path is provided;
target the add_argument calls for "--sae-ckpt-path", "--feature-annotations" and
the EVO2_CKPT_DIR default.
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py`:
- Around line 51-55: Add Google-style (pydocstyle) docstrings describing each
field on the FeatureClamp Pydantic model: update the class docstring for
FeatureClamp (subclassing BaseModel) to include an Args section documenting
feature_id (int) and strength (float) with concise descriptions and
units/semantics (e.g., feature index and target steering strength, default 1.0).
Keep the top-line summary intact and ensure the Args block follows Google style
so linters accept it.
- Around line 58-68: Add Google-style (pydocstyle) docstrings for the
GenerateRequest datamodel: add a class docstring describing the purpose of
GenerateRequest and include an Args section that documents each attribute
(prompt, organism, tag, features: list[FeatureClamp], n_tokens, temperature,
top_k, compare_baseline) with types and brief descriptions (e.g., prompt: input
sequence string; organism: organism context or "None (raw DNA)"; tag: optional
user tag; features: SAE FeatureClamp list used for clamping; n_tokens: number of
tokens to generate; temperature: sampling temperature; top_k: top-k sampling
value; compare_baseline: whether to compare to baseline). Ensure the formatting
follows Google-style pydocstyle conventions and place the docstring immediately
under the class GenerateRequest declaration.
- Around line 39-48: The AnnotateRequest Pydantic model lacks Google-style field
docstrings; update the class docstring for AnnotateRequest to include a
Google-style "Attributes:" section that documents each field (sequence,
organism, tag, mode, k, feature_ids, feature_id), describing purpose,
types/constraints and allowed values for mode ("topk" | "pick") and any
relationships (e.g., feature_ids vs feature_id) so readers and linters can
validate the field meanings. Ensure the docstring follows pydocstyle/Google
conventions and mentions defaults where relevant.
- Around line 99-107: The endpoint function features lacks a return type hint;
update its signature to include a typed return such as def features() ->
List[Dict[str, Any]]: and add the necessary imports (from typing import List,
Dict, Any) at the top of the module, or alternatively define and use a pydantic
model and set response_model on `@app.get`; modify the function signature and
imports so Pyright type checking passes while keeping the existing logic in
features().
- Around line 109-154: The annotate endpoint lacks a return type hint which
fails Pyright checks; update the annotate function signature (def annotate(req:
AnnotateRequest)) to include an explicit return type like -> Dict[str, Any] (or
a proper TypedDict/AnnotateResponse if available), and add the corresponding
typing import (e.g., from typing import Dict, Any) at the top of the module so
Pyright accepts the annotated return for the function annotate and its returned
JSON structure.
- Around line 86-97: The health endpoint lacks a return type hint; update the
health function signature (def health) to declare a typed return such as ->
Dict[str, Any] or -> dict[str, Any] and add the corresponding import (from
typing import Any, Dict) so Pyright can validate the returned mapping built from
engine (engine.ready, engine.layer, engine.n_features, engine.labels,
engine.sae_ckpt_path, engine.organism_tags, engine.device); keep the returned
structure unchanged and ensure the type hint covers the mixed value types.
- Around line 156-172: Add an explicit return type annotation to the FastAPI
endpoint function generate (def generate(req: GenerateRequest) -> Any) and
import Any from typing; update the signature so Pyright knows the endpoint's
return type (e.g., def generate(req: GenerateRequest) -> Any:), leaving the body
and exception handling (engine.generate call and HTTPException raises)
unchanged.
📒 Files selected for processing (8)
- bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/pyproject.toml
- bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh
- bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/__init__.py
- bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py
- bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/core.py
- bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py
- bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py
- bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_steering.py
…_sae serve` Shrink the inference PR to the engine + server + their tests. The encode/batch/generate command-line tools (cli.py) and launch_inference.sh move to the stacked CLI PR (#1632); the server stays launchable here via `python -m evo2_sae serve` (__main__.py, env-configured). fasta.py stays (shared by the extraction-side chunk_fasta.py and, via the base, the CLI). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
5f4fce4 to 4a0de59
…1622) Steering's only consumers (the live engine's clamp hook + the steer.py harness) both live in the evo2 serve recipe (#1622), and the harness imports Evo2SAE from it. So the steering primitive + harness move to a dedicated PR stacked on #1622, where the core clamp-hook dedup can happen in-place. This base is now the probing library only. Signed-off-by: Polina Binder <pbinder@nvidia.com>
#1637 was re-landed on the migrated #1622 (new top-level layout); #1623 is stacked on it, so it's layered onto the new #1637 by hand (the old dashboard branch predates the #1637 hardening, so a whole-file diff would revert it).
- Clean adds: feature_explorer/ (the React dashboard, incl. the gene_embed firing-column reduction, request timeouts, per-pane descriptions, shared components.jsx, crash-safe localStorage), scripts/{dashboard,launch_dashboard}.py, tests/test_launch_dashboard.py.
- server.py: add GeneEmbedRequest + /gene_embed (uses core.clean_dna + _require_ready).
- core.py: add Evo2SAE.embed_bundle (ships only firing columns + feature_ids, not the full 65536-wide matrix).
- tests: gene_embed contract test in test_server.py; bind embed_bundle on the conftest FakeEngine.
- pyproject: add scikit-learn (scripts/dashboard.py atlas PCA/TSNE).
Validated in the evo2_megatron venv: CPU 44 passed; frontend `vite build` clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
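The "firing columns" reduction mentioned for embed_bundle can be illustrated with plain lists. This is a sketch of the general idea only: the function name, the `threshold` parameter, and the return shape are assumptions, and the real Evo2SAE.embed_bundle may differ.

```python
def firing_columns(matrix, threshold: float = 0.0):
    """Keep only the feature columns that fire in at least one row.

    Hypothetical illustration: returns (feature_ids, reduced_matrix), where
    feature_ids are the original column indices of the kept columns.
    """
    if not matrix:
        return [], []
    n_features = len(matrix[0])
    # A column "fires" if any sequence has an activation above the threshold.
    feature_ids = [
        j for j in range(n_features) if any(row[j] > threshold for row in matrix)
    ]
    # Slice every row down to the firing columns, preserving column order.
    reduced = [[row[j] for j in feature_ids] for row in matrix]
    return feature_ids, reduced


if __name__ == "__main__":
    pooled = [
        [0.0, 1.5, 0.0, 0.0],
        [0.0, 0.0, 0.0, 2.0],
    ]
    ids, reduced = firing_columns(pooled)
    print(ids, reduced)
```

Because SAE activations are sparse, shipping the kept column indices alongside the reduced matrix lets a client rebuild per-feature values without receiving the full-width matrix.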
…d onto migrated #1622 Re-lands #1635 on the post-#1633 layout, on top of migrated #1622: the steering-eval harness (scripts/{steer,steer_analysis}.py) over #1622's generate(), with model-agnostic metrics. Validated: tests/test_steer_analysis.py -> 3 passed (CPU); harness scripts compile. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Loading an SAE whose input_dim != the model's hidden size (wrong SAE/model pairing) used to
succeed at load and then fail with a cryptic matmul shape error on the first encode. load() now
checks it up front and raises a clear message ("SAE input_dim=X does not match the Evo2 hidden
size=H at layer L — wrong SAE/model pairing (check --sae-ckpt-path / --layer)").
- _model_hidden_size(): read it from the model config (cheap) or a 1-token forward (ground
truth); None if neither works -> check skipped, never blocks an otherwise-fine load.
- _check_dim(): pure, unit-tested on CPU (test_check_dim_rejects_sae_model_mismatch).
NOT detectable, documented in code: a *wrong layer number* with the same hidden size — encode
still matches dims and silently yields out-of-distribution features. The SAE checkpoint records
no training layer; /health surfaces the configured layer. Follow-up: stamp the layer into the SAE
checkpoint at train time and assert it here.
Validated in the evo2_megatron venv: CPU test_core 7 passed, GPU test_steering 12 passed on the 1B
(real load() exercises the new check; 1920==1920, no false positive).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
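The up-front dimension check described in this commit message could be sketched as a pure function. The names and the ValueError type here are assumptions; only the message wording and the "None means skip" behavior follow the description above.

```python
from typing import Optional


def check_dim(sae_input_dim: int, hidden_size: Optional[int], layer: int) -> None:
    """Raise if the SAE input width does not match the model hidden size.

    Hypothetical sketch: hidden_size=None means the size could not be
    determined, so the check is skipped rather than blocking the load.
    """
    if hidden_size is None:
        return
    if sae_input_dim != hidden_size:
        raise ValueError(
            f"SAE input_dim={sae_input_dim} does not match the Evo2 hidden "
            f"size={hidden_size} at layer {layer}: wrong SAE/model pairing "
            "(check --sae-ckpt-path / --layer)"
        )


if __name__ == "__main__":
    check_dim(1920, 1920, layer=26)  # matching sizes pass silently
    check_dim(1920, None, layer=26)  # unknown hidden size skips the check
    try:
        check_dim(4096, 1920, layer=26)
    except ValueError as err:
        print(err)
```

As the message notes, a check of this kind cannot catch a wrong layer number when the hidden size is the same at every layer.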
#1622 was migrated to the new top-level layout (interpretability/sparse_autoencoders/…, no bionemo-recipes/ prefix) and squash-replayed, so #1637 is layered on by hand rather than rebased. - Clean adds: src/evo2_sae/{server,cli}.py, scripts/launch_inference.sh, tests/test_{cli,server}.py. - pyproject: add pandas/fastapi/uvicorn/anyio. - tests/conftest.py: keep #1622's 1B GPU fixtures + bionemo.common loader; append the serve-layer FakeEngine + fake_engine fixture. - core.py (semantic merge): keep #1622's _sanitize_steering (all CPU sanitize tests) and fold in the explicit non-finite-strength guard (no min/max arg-order reliance); add the shared annotate() + parse_clamp_spec() (CLI strings ⇄ API dicts) and feed parse_clamp_spec in front of _sanitize_steering; add _is_unrecoverable_cuda + flip the engine not-ready on an unrecoverable CUDA fault in generate(). (Kept _sanitize_steering rather than swapping in _normalize_clamps — non-redundant and preserves #1622's sampler hardening + tests.) - test_steering.py: keep #1622's sanitize + GPU tests; add the _is_unrecoverable_cuda test. Preserved from #1622: clamp_hook canonical encode/decode, TopKSAE-only _load_sae (no ReLU), bionemo.common(/core fallback) loader. Validated in the evo2_megatron venv: CPU 38 passed, GPU test_steering 13 passed on the 1B (ran, not skipped). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
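The "no min/max arg-order reliance" remark can be illustrated with a short sketch. The names and the cap value here are assumptions: `sanitize_strength` is hypothetical, `MAX_CLAMP_STRENGTH = 100.0` is a placeholder rather than the engine's real cap, and this version rejects non-finite input (the commit does not say whether the real guard rejects or coerces).

```python
import math

MAX_CLAMP_STRENGTH = 100.0  # placeholder value; the real cap lives in the engine


def sanitize_strength(value: float, cap: float = MAX_CLAMP_STRENGTH) -> float:
    """Return a finite strength in [-cap, cap]; reject NaN/inf explicitly."""
    strength = float(value)
    if not math.isfinite(strength):
        raise ValueError(f"clamp strength must be finite, got {value!r}")
    return max(-cap, min(cap, strength))


# Why an explicit guard: with NaN, a bare min/max clamp depends on argument order,
# because every comparison against NaN is False.
nan = float("nan")
order_a = min(max(nan, -1.0), 1.0)   # NaN survives the clamp
order_b = min(1.0, max(-1.0, nan))   # NaN is silently replaced by -1.0
```

Both orderings look like the same clamp, yet one lets NaN through and the other quietly turns it into a valid-looking strength, so checking `math.isfinite` first removes the ambiguity.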
A new ubuntu-latest workflow installs sae + the recipe (CPU torch) and runs the recipe's model-agnostic tests (-m 'not slow') — the label producers (#1630), eval metrics, etc. — so they run cheaply on the probing-stack branches instead of waiting for #1622's megatron GPU lane (which would run them on an L4 after a full build). Registers the 'slow' marker on the recipe pyproject so the GPU tests are excluded without an unknown-marker warning. Validated: pytest tests/ -m 'not slow' -> 16 passed (CPU). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Sort encode_batch work by token length so each micro-batch holds similar-length sequences (less wasted padding on mixed-length inputs). Results are written back by original index, so the returned order still matches the input order. Add a CPU test that stubs the model and asserts input-order output despite the internal length-sort. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
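The sort-then-restore pattern described here can be sketched as below. This is not the engine's code: `encode_in_length_order`, the `encode_micro_batch` callback, and `micro_batch_size` are hypothetical names, and a fake encoder stands in for the model, much like the stub the commit's CPU test uses.

```python
from typing import Callable, List, Sequence


def encode_in_length_order(
    token_lists: Sequence[List[int]],
    encode_micro_batch: Callable[[List[List[int]]], list],
    micro_batch_size: int = 2,
) -> list:
    """Encode in length-sorted micro-batches, returning results in input order."""
    order = sorted(range(len(token_lists)), key=lambda i: len(token_lists[i]))
    results = [None] * len(token_lists)
    for start in range(0, len(order), micro_batch_size):
        idx = order[start:start + micro_batch_size]
        outputs = encode_micro_batch([token_lists[i] for i in idx])
        for i, out in zip(idx, outputs):
            results[i] = out  # write back by original index
    return results


seen_batches = []


def fake_encode(batch):
    # Stand-in for the model: records batch lengths, returns each sequence's sum.
    seen_batches.append([len(seq) for seq in batch])
    return [sum(seq) for seq in batch]


inputs = [[1, 1, 1, 1, 1], [2], [3, 3, 3, 3], [4, 4]]
outputs = encode_in_length_order(inputs, fake_encode)
```

In this example the micro-batches hold lengths `[1, 2]` and `[4, 5]` instead of the input-order `[5, 1]` and `[4, 2]`, so each batch pads to a length close to its shortest member.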
Drop the separate truncated encode model (load_model_to_layer(full=False)) and serve both
paths from the one inference engine built by setup_inference_engine:
* load() builds the engine and takes self.model = unwrap_model(comp.model) + comp.tokenizer.
* encode/highlight (_forward_hidden) runs a normal full-sequence forward on that model and
reads layer L off a forward hook — the engine model is post_process=True so output_embeddings
can't be used; the hook captures the same [S,B,H] module output the steering clamp_hook reads,
so encode and steer see identical activations.
* generate() steers on self.model.decoder.layers[L] (the same module encode reads).
Removes the ~1.8x model duplication (one set of weights instead of truncated + full). The
num-microbatches double-init teardown is now just defensive (only one model inits it).
Test: add GPU test_highlight_steer_interleaving_no_bleed — encode is bit-identical across a
steered generate, and a baseline generate is unaffected by prior encode/steer history (proves
no state bleed between the shared model's highlight forward and decode path).
Validated end-to-end on the 1B-8k-bf16 (21/21 tests pass, incl. the interleaving + steering
GPU tests). 7B fidelity still unconfirmed (no 7B checkpoint available) — HOLD push for that gate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
…se SAE forward - Extract evo2_buffer.forward_codes(engine, id_lists) — the one place that touches the engine internals (locked GPU forward + SAE encode). build_buffer and probe._encode_windows both use it, so the #1622 engine-API coupling lives in a single spot, and the per-token label/buffer work moves out of the GPU lock. Add a CPU unit test (fake engine) for the helper's contract. - Hoist KINGDOM_TAGS to evo2_buffer (was duplicated in probe_loss_recovered). - Remove the `codon-aa` subcommand: it consumed a codon/aa npz no command produces (and was the only raw np.load); drop it and its now-unused decode_eval/fit_softmax imports until a producer exists. - SAEWrap delegates to the SAE's own forward() (top-k + normalize_input denormalization) instead of hand-rolling decoder(codes)+pre_bias and mean/std — the path the steering hook uses, so the loss-recovered recon can't drift from the SAE's actual (de)normalization. - Make evo2_buffer importable without the evo2_sae engine (lazy read_fasta), so the CPU tests exercise forward_codes and the harness imports cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…ck (#1622) recipes/evo2/ is co-owned with #1622 (Dockerfile/build + src/evo2_sae). Align the shared files so the two stacks merge without conflict, regardless of order: - pyproject.toml: keep `[tool.setuptools] packages = []` (unchanged from main, so #1622's `where = ["src"]` wins cleanly at merge and `pip install -e recipes/evo2` still works here with no src/ dir); make the `[tool.pytest.ini_options]` markers block byte-identical to #1622's so the add/add merges cleanly. The biopython/pyrodigal deps stay a one-sided add. - Drop tests/conftest.py (it add/add-collided with #1622's GPU-fixture conftest) and restore the per-file scripts/ sys.path insert in test_probe_integration.py, matching the sibling tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…validation test - test_build_buffer_shapes_and_label_alignment_with_fake_engine (CPU): drives build_buffer (forward_codes + labelers + ActivationBuffer) on a fake engine, asserting codes/dense/labels shapes align and base_A fires exactly on DNA 'A' positions with the phylo tag left unlabeled. - test_build_buffer_and_score_real_engine (@pytest.mark.slow): the #1636<->#1622 seam end to end against the real Evo2SAE engine (real model -> codes -> labels -> auroc_all). Skips without CUDA / the engine; uses the recipe conftest's evo2_ckpt_dir/sae_ckpt_path/embedding_layer fixtures, which arrive when the serve + eval stacks share recipes/evo2/ — so it runs in the merged megatron GPU lane. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…d onto migrated #1622 Clamp an SAE feature via the production Evo2SAE.generate path and quantify the causal effect: dose-response (effect vs strength) + selectivity (target vs control features), persisted to a structured steering_results.json. Review fixes: * metric: replace positional Hamming with normalized edit (Levenshtein) distance. Greedy decode is autoregressive, so one early flipped token shifts every downstream base and pins Hamming at ~1.0 — erasing the dose curve. Edit distance is shift-robust; first_divergence (shared-prefix length) is the complementary monotone signal. Tested with the shift case. * surface the clamp cap: generate() silently caps |strength| to MAX_CLAMP_STRENGTH, so two requests above it produce an identical clamp (a fake plateau). run_steering now warns, steers at the effective value, and records max_clamp_strength + capped_strengths. * fix dangling doc reference (probe.py -> extract.py, which exists). * refactor steer.py into injectable pick_target()/run_steering() and add CPU test_steer.py (fake engine, local not in conftest) covering target picking, dose monotonicity, selectivity, and cap reporting. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…d onto migrated #1622 Clamp an SAE feature via the production Evo2SAE.generate path and quantify the causal effect: dose-response (effect vs strength) + selectivity (target vs control features), persisted to a structured steering_results.json. Metric / robustness: * normalized edit (Levenshtein) distance, not positional Hamming. Greedy decode is autoregressive, so one early flipped token shifts every downstream base and pins Hamming at ~1.0 — erasing the dose curve. Edit distance is shift-robust; first_divergence (shared-prefix length) is the complementary monotone signal. Tested with the shift case. * surface the clamp cap: generate() silently caps |strength| to MAX_CLAMP_STRENGTH, so two requests above it produce an identical clamp (a fake plateau). run_steering warns, steers at the effective value, and records max_clamp_strength + capped_strengths. Consolidation: * harness + metrics live in the package (src/evo2_sae/steer_analysis.py), engine injected, so they import as a normal torch-free module like evo2_sae.fasta — dropped all four sys.path inserts. scripts/steer.py is now a thin CLI (matches train.py/extract.py). * pick_target reuses Evo2SAE.top_features (the CLI/server ranking) instead of re-deriving topk. * one CPU test file (metrics + fake-engine harness) instead of two; fake stays local, not in conftest, to avoid colliding with the sibling server PR's engine fixtures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
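The shift case behind the metric change can be reproduced with a small, self-contained sketch. It assumes normalization by the longer string's length, which the commit does not specify, and all function names are illustrative rather than the ones in `steer_analysis.py`.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    # Assumption: normalize by the longer input; two empty strings are identical.
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of mismatched positions over the shared length."""
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a, b)) / n if n else 0.0


def first_divergence(a: str, b: str) -> int:
    """Length of the shared prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


baseline = "ACGTACGTACGT"
shifted = "AC" + "T" + "GTACGTACG"  # one extra base after position 2, same total length
```

For this pair the positional Hamming fraction is 10/12 even though a single inserted base explains the difference, while the edit distance is 2 (one insertion plus one deletion at the end) and the shared prefix has length 2.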
Replace the 'slow' marker with an explicit @pytest.mark.skipif(not torch.cuda.is_available()) on the GPU/integration tests in test_steering.py (shared 'requires_gpu' decorator). They run when a GPU is present (CI's L4 + megatron env) and skip with a clear 'requires a GPU' reason otherwise — the conftest fixtures still further skip on too-little GPU memory or an unfetchable/unimportable checkpoint. Remove the now-unused 'slow' marker registration from pyproject. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Remove .github/workflows/unit-tests-interpretability-recipes.yaml per review (recoverable from history / 9bedf2b). Keep .ci_build.sh + .ci_test_env.sh + the tests — those are the build/run machinery (used by the Dockerfile and manual runs), not the CI lane. How to build + run the tests is documented in the PR description; CI should later fold into the repo-wide recipe lane rather than a bespoke workflow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
…ed sae lib); drop CI lane These probing primitives (eval metrics + ActivationBuffer) are evo2-specific, so move them from the shared sae library into the evo2_sae recipe package: * sae/src/sae/eval/probing.py -> recipes/evo2/src/evo2_sae/eval/probing.py * new recipes/evo2/src/evo2_sae/eval/__init__.py (re-exports the probing API) * sae/src/sae/eval/__init__.py reverted (no longer exports probing — stays shared for esm2/codonfm) * sae/tests/test_probing.py -> recipes/evo2/tests/test_probing.py (import evo2_sae.eval.probing) Remove .github/workflows/unit-tests-sae.yaml (defer CI; run tests via the recipe's .ci_build.sh + pytest). Re-parented onto #1622 so the evo2_sae package is available. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Relocate the steering dose-response / selectivity metrics from evo2_sae.steer_analysis into the evo2_sae.eval package (src/evo2_sae/steer_analysis.py -> src/evo2_sae/eval/steering.py), alongside the eval/probing harness. Update the importers (scripts/steer.py CLI + tests/test_steer_analysis.py) to evo2_sae.eval.steering. The CI lane is dropped via the rebase onto the updated #1622. Pure-CPU tests, no GPU/model. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>
Summary
The importable Evo2SAE inference engine plus feature steering (the base of the serve stack), with tests and a runnable, layer-cached Docker image. A single Evo2 inference engine is loaded once and serves both paths: `encode` reads the residual stream off a layer-`L` forward hook; `generate` drives the same model's decode with decode-only feature steering. No web/CLI here: the server + CLI (#1637), dashboard (#1623), and steering eval (#1635) build on it.

Rebased onto the post-#1633 top-level layout (`interpretability/sparse_autoencoders/`).

Architecture: one model, both paths

Earlier iterations loaded two copies of Evo2: a truncated `post_process=False` model for encode/highlight and the full inference engine for generate (~1.8× the weights). This collapses to a single engine (`infer.setup_inference_engine`, run eager with `cuda_graph_impl="none"` so the steering hook applies):

- `load()` builds the one engine and takes `self.model = unwrap_model(comp.model)` + `comp.tokenizer` from it.
- encode/highlight (`_forward_hidden`) runs a normal full-sequence forward and reads layer `L` off a forward hook. The engine model is `post_process=True` (it produces logits for generation), so `output_embeddings` can't be used; the hook captures the same `[S, B, H]` module output the steering `clamp_hook` reads, so encode and steer see identical activations by construction.
- `generate()` steers on `self.model.decoder.layers[L]`, the same module encode reads.

Validated end-to-end on the 1B-8k-bf16 (21/21 tests, incl. a highlight↔steer interleaving test proving no state bleed between the shared model's encode forward and decode path). 7B fidelity is the remaining gate.
Contents
Engine + steering
- `src/evo2_sae/core.py`: `Evo2SAE`, with `load → encode / encode_batch / feature_tracks / generate` (decode-only clamp via `sae.steering`) plus input-sanitization guards (`_sanitize_steering`: feature-id range, clamp-magnitude cap, non-finite/top_k/temperature coercion). `encode_batch` is length-bucketed (work sorted by token length to minimize padding waste on mixed-length inputs; results un-sorted back to input order). `load()` verifies the SAE's `input_dim` equals the model's hidden size (`_model_hidden_size` via config, or a 1-token forward) and raises a clear error on a mismatch ("wrong SAE/model pairing"), instead of a cryptic matmul failure on the first encode. Known gap: a wrong layer number with the same hidden size can't be caught here (the SAE checkpoint records no training layer), so it silently yields out-of-distribution features; `/health` surfaces the configured layer, and stamping the layer into the checkpoint at train time is a follow-up.
- `sae/src/sae/steering.py`: model-agnostic delta-clamp hook + `steer()`.

Build / run / CI
- `.ci_build.sh` (`env|install|all`) + `.ci_test_env.sh`: build the env by delegating to `evo2_megatron`'s own build (no fork of the pinned megatron stack), then install `sae` + this recipe into that venv. The phase arg lets the Dockerfile cache the two steps separately.
- `Dockerfile`: thin, non-forking, layer-cached. The ~30-min mbridge megatron build is its own layer (depends only on `recipes/evo2_megatron`), and the SAE source + editable installs are a separate trailing layer, so editing engine/SAE code rebuilds only the cheap install layer, not megatron. (+ a per-Dockerfile `.dockerignore`.)
- `tests/conftest.py`: 1B-8k-bf16 fixture (`bionemo_load` → `run_nemo2_to_mbridge`) + a synthesized tiny SAE, GPU-memory-gated; honors `EVO2_CKPT_DIR` / `SAE_CKPT_PATH` for manual / 7B runs. The GPU tests are gated by `@pytest.mark.skipif(not torch.cuda.is_available())`, so they run on a GPU box and skip otherwise.

Dependency on `bionemo.evo2`

The engine reuses `bionemo.evo2`'s model code (the mbridge `recipes/evo2_megatron` recipe), which isn't pip-installable. `.ci_build.sh` (and the Dockerfile) build it via evo2_megatron's own script; it's intentionally not in `pyproject.toml`, matching the codonfm/esm2 recipes (base model is environment-provided).

How to run
Tests
There's no dedicated CI lane right now (deferred; it should later fold into the repo-wide recipe lane, which already runs `.ci_build.sh` + `pytest`). Run them manually:

- `test_core.py` (engine plumbing: `top_features`, `_load_sae`, `generate` guards, the SAE/model dim check, encode_batch length-bucketing order) + `test_steering.py` sanitize guards + `sae/tests/test_steering.py` (exact clamp math). Quick CPU-only run without the venv: `PYTHONPATH=src:../../sae/src pytest tests/test_core.py`.
- `test_steering.py`: bf16 encode, generation in-distribution, steering changes the continuation (+ `compare_baseline`), batched/empty-sequence encode, max-clamp stays finite, and highlight↔steer interleaving (encode bit-identical across a steered generate; baseline unaffected by history). Gated by `@pytest.mark.skipif(not torch.cuda.is_available())`, so it runs on a GPU box (megatron venv); set `EVO2_CKPT_DIR` / `SAE_CKPT_PATH` for a specific model, else the fixtures build the 1B-8k-bf16 + a synthesized SAE.

Base of #1637 (server) → #1623 (dashboard), and #1635 (steering eval).
Note: `recipes/evo2/` is co-owned with the eval stack (#1636)

This PR owns the recipe's Dockerfile / `.ci_build.sh` / `src/evo2_sae` + `tests/conftest.py`; the eval stack (#1636) adds `scripts/` (labelers, probe harness) and its `biopython` / `pyrodigal` deps to the same `recipes/evo2/`. The eval branch is pre-reconciled against this PR (verified clean with `git merge-tree`): it keeps `[tool.setuptools] packages = []` (so this PR's `where = ["src"]` wins at merge), carries a byte-identical pytest-markers block, and has no `conftest.py`. No change is needed here, just merge-order awareness, and `pip install -e recipes/evo2` (in `.ci_build.sh`) will install the eval deps automatically once both land.