evo2 SAE serve: FastAPI server + CLI (on the engine) by polinabinder1 · Pull Request #1637 · NVIDIA-BioNeMo/bionemo-recipes

polinabinder1 · 2026-06-12T03:47:45Z

Summary

FastAPI server + CLI over the Evo2SAE engine (#1622). Thin wrappers — all model work lives in core.py — plus the input validation, resource governance, and recovery needed for a shared backend (runs behind NVIDIA SSO on Brev, reachable by many users). API routes live under /api, and the server can mount a prebuilt front-end at /, so the dashboard (#1623) and the API can be served from one origin / one container.

Rebased onto the single-engine #1622 (one inference engine serves both encode and generate; new top-level layout interpretability/sparse_autoencoders/…).

Contents (new layout)

…/src/evo2_sae/server.py — /api/health, /api/features, /api/annotate, /api/generate (+ optional static-frontend mount at /)
…/src/evo2_sae/cli.py — serve / encode / batch / generate
…/scripts/launch_inference.sh; CPU contract tests tests/test_cli.py, tests/test_server.py + the shared FakeEngine appended to #1622's tests/conftest.py

Shared logic (CLI ⇄ server, lives in `core`)

core.annotate(engine, …): clean → resolve-tag → encode → tag-len, behind both CLI encode and server /api/annotate.
core.parse_clamp_spec(spec): one parser for clamps given as CLI "ID[:STRENGTH]" strings or server FeatureClamp JSON; it runs in front of #1622's _sanitize_steering so both surfaces validate identically.
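To make the "one parser, two input shapes" idea concrete, here is a minimal sketch of what such a parser could look like. It is not the code in core.py: the FeatureClamp field names (feature_id, strength), the output shape, and DEFAULT_STRENGTH are assumptions made for illustration. Range and finiteness checks are deliberately left out, since the description assigns those to _sanitize_steering.

```python
from typing import Mapping, Union

# Assumption: the real default strength is not stated in the PR description.
DEFAULT_STRENGTH = 1.0


def parse_clamp_spec(spec: Union[str, Mapping[str, object]]) -> dict:
    """Normalize one clamp to {"feature_id": int, "strength": float}.

    Accepts the CLI form "ID[:STRENGTH]" or a mapping shaped like the
    server's FeatureClamp JSON (field names are assumed here).
    """
    try:
        if isinstance(spec, str):
            fid_text, sep, strength_text = spec.partition(":")
            feature_id = int(fid_text)
            # "ID" alone uses the default; "ID:" with nothing after it is an error.
            strength = float(strength_text) if sep else DEFAULT_STRENGTH
        else:
            feature_id = int(spec["feature_id"])
            strength = float(spec.get("strength", DEFAULT_STRENGTH))
    except (KeyError, TypeError, ValueError) as err:
        # One error type for both surfaces, so CLI and server can report it uniformly.
        raise ValueError(f"invalid clamp {spec!r}: {err}") from err
    return {"feature_id": feature_id, "strength": strength}
```

With this shape, the CLI can pass each --clamp string straight through and the server can pass the decoded JSON object, and both get the same normalized dict or the same ValueError.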

Single-origin serving (`/api` + optional static mount)

API routes are grouped under /api (one APIRouter + include_router).
build_app(engine, static_dir=None) mounts a prebuilt front-end at / via StaticFiles(html=True) when static_dir (or the DASHBOARD_DIST env) points at a real directory; otherwise the server is API-only and / 404s (never crashes). The mount is generic: it serves whatever directory it's pointed at and knows nothing about the dashboard; #1623 supplies the directory and the Docker build that produces it.
This is what lets a single container serve UI + API on one port. Dev hits the same /api/* paths (the Vite proxy forwards /api without rewriting), so there's no dev/prod path drift.
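The "mount only if it is a real directory" rule can be sketched without FastAPI. The helper below is hypothetical (resolve_static_dir is not a name from the PR), and the precedence of the explicit argument over DASHBOARD_DIST is an assumption; the description only says either one can supply the directory.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_static_dir(static_dir: Optional[str] = None,
                       env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the front-end directory to mount at /, or None for API-only mode."""
    env = os.environ if env is None else env
    # Assumption: an explicit argument wins over the DASHBOARD_DIST env var.
    candidate = static_dir or env.get("DASHBOARD_DIST")
    if not candidate:
        return None
    path = Path(candidate)
    # A missing or non-directory path degrades to API-only instead of raising.
    return path if path.is_dir() else None
```

A build_app along these lines would call StaticFiles(directory=..., html=True) only when this returns a path, which is what keeps a misconfigured DASHBOARD_DIST from crashing startup.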

Reliability & governance

/api/health returns 503 until the engine is ready, so readiness probes don't route to a still-loading pod; a startup load failure is caught and leaves the engine not-ready (503) rather than crashing.
Length limits: /api/annotate and /api/generate reject input longer than max_seq_len (413) instead of silently truncating it, which would misalign the per-base activations/bases the viz plots. Generation length is otherwise auto-capped to the remaining context (no fixed token cap).
Pick-id validation: /api/annotate mode=pick range-checks user-supplied feature_ids → 400. An out-of-range id would otherwise 500 on IndexError, and a negative one would silently return the wrong feature.
Steering sanitation: out-of-range ids, extreme/non-finite strengths, temperature<=0, and negative top_k are all rejected or coerced before they reach the GPU (_sanitize_steering).
CUDA-wedge recovery: a device-side assert poisons the process's CUDA context, which cannot be recovered in-process. The guard is purely defensive: sanitation covers every client-reachable trigger, so a wedge is not client-inducible. If one does occur, generate() flips the engine not-ready (→ 503) and, when EXIT_ON_CUDA_WEDGE=1 (set by serve), exits the worker so any restart-on-exit supervisor respawns it, which makes recovery host-independent.
Signal-safe serve: launch_inference.sh serve runs the worker in the background and waits on it. A trap forwards SIGTERM/SIGINT to the worker (uvicorn graceful shutdown) and stops the restart loop, and respawns after a crash are limited by a retry cap + backoff, so docker stop/k8s shuts down cleanly instead of orphaning the worker.
Request body-size limit (MAX_BODY_BYTES, default 16 MiB) → 413. Advisory only: it trusts Content-Length, so a chunked or mis-declared body bypasses it.
Bounded concurrency: Starlette's sync-endpoint threadpool is capped (MAX_CONCURRENCY, default 8); the engine lock already serializes the single GPU.
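The length and pick-id checks above reduce to two comparisons that run before any model work. A minimal sketch follows, assuming a hypothetical helper name (check_annotate_input) and that the server knows the feature count; the status codes are the ones listed above.

```python
from typing import Optional, Sequence, Tuple


def check_annotate_input(seq_len: int,
                         feature_ids: Sequence[int],
                         max_seq_len: int,
                         n_features: int) -> Optional[Tuple[int, str]]:
    """Return (http_status, detail) for a rejected request, or None if acceptable."""
    if seq_len > max_seq_len:
        # Reject instead of truncating, so per-base activations stay aligned.
        return 413, f"sequence length {seq_len} exceeds max_seq_len {max_seq_len}"
    for fid in feature_ids:
        # The lower bound matters as much as the upper one: a negative id
        # would index from the end and return the wrong feature.
        if not 0 <= fid < n_features:
            return 400, f"feature id {fid} is outside [0, {n_features})"
    return None
```

For example, with max_seq_len=8192 and n_features=65536 (illustrative values), a 9000-base sequence maps to 413 and feature id -1 maps to 400, while a 10-base sequence with ids [0, 5] passes.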

Architectural decisions

Two layers: engine vs. surface. All model work stays in core.Evo2SAE (#1622); server.py/cli.py are thin and share core.annotate + core.parse_clamp_spec, so the HTTP API and the CLI can't drift and there's one validated path.
FastAPI, not raw/Flask. We get pydantic structural validation + a threadpool we can bound (MAX_CONCURRENCY) for almost no code; the domain validation that matters (_sanitize_steering, pick-id range) is manual either way. Raw Python would mean hand-rolling routing/validation/concurrency; with Flask we would have to add the threadpool governance by hand.
No app-level auth. Deployed behind NVIDIA SSO on Brev; auth is the proxy's job, not duplicated here (CORS removed too — calls are same-origin).
Single GPU, serialized. The engine lock + bounded threadpool match one GPU; data-parallel replicas behind a balancer are a deferred follow-up (touches no engine code).
/api prefix + generic static mount (above) so one origin/container can serve both UI and API.
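The "single GPU, serialized" decision combines two limits: a cap on how many requests are in flight and a lock that lets only one of them touch the device. The sketch below illustrates that combination with the standard library only. SerializedEngine and run_requests are hypothetical names, and concurrent.futures stands in for Starlette's threadpool; it is an analogy, not the server's code.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence

MAX_CONCURRENCY = 8  # the default named in the PR description


class SerializedEngine:
    """Hypothetical stand-in for the engine: one lock guards the single GPU."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0  # highest number of calls ever inside the lock

    def encode(self, sequence: str) -> int:
        with self._lock:
            self._active += 1
            self.peak_active = max(self.peak_active, self._active)
            time.sleep(0.001)  # placeholder for GPU work
            self._active -= 1
            return len(sequence)


def run_requests(engine: SerializedEngine, sequences: Sequence[str]) -> List[int]:
    """Admit at most MAX_CONCURRENCY requests; the lock serializes them."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        return list(pool.map(engine.encode, sequences))
```

The pool bounds how many requests can queue on the lock at once, and peak_active staying at 1 shows that the device-side section never overlaps. Adding replicas later would mean more engines behind a balancer, not a change to this locking.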

How to run

Run inside the evo2_megatron venv (provides bionemo.evo2 + megatron); in the Docker image it's already active. Full dashboard run modes are in #1623's feature_explorer/README.md.

export EVO2_CKPT_DIR=<mbridge>  SAE_CKPT_PATH=<sae.pt>
export FEATURE_ANNOTATIONS=<feature_metadata.parquet>  EMBEDDING_LAYER=26
scripts/launch_inference.sh serve                                    # API on :8001 (+ UI at / if DASHBOARD_DIST set)
scripts/launch_inference.sh encode   --sequence ATGC...              # one sequence -> top features (JSON)
scripts/launch_inference.sh batch    --fasta in.fa --out out.parquet # many -> parquet
scripts/launch_inference.sh generate --prompt ATGC... --clamp 29244:300  # steered generation

Tunables (env): MAX_BODY_BYTES, MAX_CONCURRENCY, MAX_SEQ_LEN, PORT, EXIT_ON_CUDA_WEDGE, DASHBOARD_DIST.
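Of these tunables, EXIT_ON_CUDA_WEDGE is the least self-explanatory. A minimal sketch of how a generate() wrapper might act on it is below. The Engine class, the injectable exit_fn, and the marker string used to detect a poisoned context are all assumptions for illustration; the real _is_unrecoverable_cuda check is not shown in this PR description.

```python
import os
from typing import Callable

# Assumption: substrings that identify a poisoned CUDA context.
_WEDGE_MARKERS = ("device-side assert",)


def is_unrecoverable_cuda(err: BaseException) -> bool:
    """Heuristic: does this error message indicate a wedged CUDA context?"""
    text = str(err).lower()
    return any(marker in text for marker in _WEDGE_MARKERS)


class Engine:
    """Hypothetical engine wrapper showing only the recovery path."""

    def __init__(self, exit_fn: Callable[[int], object] = os._exit) -> None:
        self.ready = True
        self._exit = exit_fn  # injectable so the sketch can be tested

    def generate(self, run_model: Callable[[], str]) -> str:
        try:
            return run_model()
        except RuntimeError as err:
            if is_unrecoverable_cuda(err):
                self.ready = False  # health checks now report 503
                if os.environ.get("EXIT_ON_CUDA_WEDGE") == "1":
                    # Exit so a restart-on-exit supervisor respawns the worker.
                    self._exit(1)
            raise
```

Leaving the variable unset keeps the fail-closed behavior (503 until restarted), which is the safer default for library, CLI, and test use; only the serve launcher opts in to exiting.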

Tests

No dedicated CI lane (deferred — see #1622). Run them via the recipe's build script:

cd interpretability/sparse_autoencoders/recipes/evo2
bash .ci_build.sh && source .ci_test_env.sh
pytest tests/

CPU (no model): test_cli.py + test_server.py (FastAPI TestClient + FakeEngine: response shapes, 400/413/503, pick out-of-range → 400, /api/generate too-long → 413, body-size, k-bounds, clamp validation, static-frontend mount: SPA at /, asset served, API reachable under /api, unknown /api/* → 404, API-only when no frontend), plus #1622's test_core.py + test_steering.py sanitize guards.
GPU: test_steering.py — encode, in-distribution generation, steering changes the continuation, batched/empty encode, max-clamp finite, highlight↔steer interleaving (single-engine state-bleed check). Gated by @pytest.mark.skipif(not torch.cuda.is_available()) — runs on a GPU box, skips otherwise. Validated on the 1B; the single-engine backend also serves the 7B at layer 26 live.

Deferred follow-up

Multi-GPU data-parallel replicas (one worker per GPU behind a least_conn balancer) for concurrent throughput — touches no engine code; left until concurrency is an observed need.

Stacked on #1622. The dashboard (#1623) builds on this.

copy-pr-bot · 2026-06-12T03:47:48Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai · 2026-06-12T03:47:51Z

📝 Walkthrough

This PR introduces a complete inference system for Evo2 sparse autoencoders. It provides a bash launcher, a Python CLI with four modes (serve, encode, batch, generate), a FastAPI REST API with endpoints for annotation and generation, and comprehensive server contract tests to validate the API contract.

Changes

Evo2 SAE Inference API: Launcher, CLI, and FastAPI Server

Layer / File(s)	Summary
CLI Launcher and Environment Setup `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh`	Bash script that validates the venv path, configures `PYTHONPATH` to include `src`, activates the virtual environment, and execs the Python CLI module with forwarded arguments; documents supported modes (serve, encode, batch, generate) and feature-steering usage.
CLI Interface with Subcommands `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`	Python `main()` entrypoint with shared argument registration for checkpoint paths and runtime controls; dispatches to serve (FastAPI via Uvicorn), encode (single DNA sequence with top-k features), batch (FASTA file with per-sequence feature ranking to Parquet), and generate (with optional feature-clamp steering and optional baseline comparison). Parses repeatable `--clamp FEATURE_ID[:STRENGTH]` options into structured specs.
Server Request Models and App Factory `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py` (partial)	Pydantic request schemas (`AnnotateRequest`, `FeatureClamp`, `GenerateRequest`) with defaults for organism, mode, sampling parameters, and feature clamping; `build_app(engine)` factory loads the engine once via async lifespan, configures CORS using `CORS_ORIGINS`, and returns the configured FastAPI instance.
Server Endpoints `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py` (partial)	Four endpoints enforce readiness gating (HTTP 503 if not ready): `/health` returns readiness and engine metadata (layer, feature counts, SAE checkpoint, organism tags, device); `/features` lists feature id/label/natural_peak; `/annotate` cleans DNA, resolves organism/tag, encodes sequence, selects features via explicit ids or top-k ranking, returns per-feature activation summaries including per-base activations; `/generate` calls engine with feature clamps, translates `ValueError` to HTTP 400.
Server Contract Tests `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py`	`FakeEngine` minimal stand-in providing readiness/metadata, encoding, top-features, and generation; `client` pytest fixture wraps the app in `TestClient`; tests validate response shapes for `/health`, `/features`, `/annotate` (including per-base activation presence and non-DNA rejection with HTTP 400), `/generate` (sequence returned), and readiness gating (HTTP 503 until ready).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 The Evo2 encoder bounds,
With CLI and REST in rounds,
Feature clamps dance free,
Activation to see,
Inference serves up new grounds! 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.83% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and specifically describes the main change: adding a FastAPI server and CLI layer on top of the Evo2SAE engine for inference workflows.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description is comprehensive and well-structured, covering summary, contents, shared logic, architecture, usage examples, tests, and deferred follow-up work.

polinabinder1 · 2026-06-12T04:36:45Z

@coderabbitai review

coderabbitai · 2026-06-12T04:37:03Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`:
- Around line 50-53: The parser currently eagerly calls int(...) on environment
defaults (see add_argument for "--layer" and "--max-seq-len" and the similar
PORT usage), which raises a traceback if the env var is non-numeric; change
these to pass the raw env value (e.g., os.environ.get("EMBEDDING_LAYER", "26")
and os.environ.get("MAX_SEQ_LEN", "8192") and PORT default) as the default and
let argparse's type=int handle conversion and clean error messages—i.e., remove
the outer int(...) wrapper in the default arguments for the "--layer" and
"--max-seq-len" add_argument calls (and the PORT default at the other
occurrence) so invalid numeric env values are reported by argparse rather than
causing an eager exception.
- Around line 144-146: The code currently treats an unknown organism as empty
tag by using eng.resolve_tag(args.organism, None) or "", which silently falls
back to raw-DNA mode; instead, after calling eng.resolve_tag(args.organism,
None) check if the result is None and fail fast (print a clear error referencing
args.organism and exit non‑zero or raise an appropriate exception) before
calling clean_dna and eng.encode; locate the block that calls eng.resolve_tag,
clean_dna and eng.encode (the variables tag, dna, codes) and replace the "or ''"
fallback with an explicit None check that aborts with a helpful message when the
organism is unknown.
- Around line 83-86: The loop that parses clamp strings currently lets int/float
conversions raise raw ValueError; instead catch conversion errors and convert
them into a CLI parser error by raising argparse.ArgumentTypeError with a clear
message including the offending clamp string (e.g., in the function that builds
specs from clamps, wrap the int(fid)/float(strength) casts in try/except and on
failure raise argparse.ArgumentTypeError(f"invalid --clamp value: {c!r}: {err}")
so the CLI shows a clean validation error; ensure argparse is imported and used
consistently where this parsing function is invoked.
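The first finding relies on documented argparse behavior: when a default is a string, argparse runs it through type= at parse time, so a bad env value produces a usage error rather than an import-time traceback. A small sketch of that pattern is below. The env var names and defaults come from the review comment; build_parser, its env parameter, and the prog name are illustrative, not the code in cli.py.

```python
import argparse
from typing import Mapping


def build_parser(env: Mapping[str, str]) -> argparse.ArgumentParser:
    """Build a parser whose numeric defaults come from env vars as raw strings."""
    parser = argparse.ArgumentParser(prog="evo2-sae-sketch")
    # No int(...) around the env lookup: argparse applies type=int to string
    # defaults itself and reports invalid values through parser.error().
    parser.add_argument("--layer", type=int,
                        default=env.get("EMBEDDING_LAYER", "26"))
    parser.add_argument("--max-seq-len", type=int,
                        default=env.get("MAX_SEQ_LEN", "8192"))
    return parser
```

In real use the env argument would be os.environ. The parsed values are ints in both cases (flag given or default used), and a non-numeric EMBEDDING_LAYER exits with argparse's usual status 2 and a one-line message.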

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 141c9623-9e66-42db-8224-ac5633a6c07d

📥 Commits

Reviewing files that changed from the base of the PR and between f310289 and 9fd49ed.

📒 Files selected for processing (4)

bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py

…he engine PR - remove test_clamp_math: it called the deleted Evo2SAE._clamp_hook (we unified onto sae.steering.clamp_hook); the delta-clamp math is covered in sae/tests/test_steering.py - pyproject: drop fastapi/uvicorn/pandas — the engine imports none of them; fastapi+uvicorn move to the serve PR (#1637), pandas was unused Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

#1622 was migrated to the new top-level layout (interpretability/sparse_autoencoders/…, no bionemo-recipes/ prefix) and squash-replayed, so #1637 is layered on by hand rather than rebased. - Clean adds: src/evo2_sae/{server,cli}.py, scripts/launch_inference.sh, tests/test_{cli,server}.py. - pyproject: add pandas/fastapi/uvicorn/anyio. - tests/conftest.py: keep #1622's 1B GPU fixtures + bionemo.common loader; append the serve-layer FakeEngine + fake_engine fixture. - core.py (semantic merge): keep #1622's _sanitize_steering (all CPU sanitize tests) and fold in the explicit non-finite-strength guard (no min/max arg-order reliance); add the shared annotate() + parse_clamp_spec() (CLI strings ⇄ API dicts) and feed parse_clamp_spec in front of _sanitize_steering; add _is_unrecoverable_cuda + flip the engine not-ready on an unrecoverable CUDA fault in generate(). (Kept _sanitize_steering rather than swapping in _normalize_clamps — non-redundant and preserves #1622's sampler hardening + tests.) - test_steering.py: keep #1622's sanitize + GPU tests; add the _is_unrecoverable_cuda test. Preserved from #1622: clamp_hook canonical encode/decode, TopKSAE-only _load_sae (no ReLU), bionemo.common(/core fallback) loader. Validated in the evo2_megatron venv: CPU 38 passed, GPU test_steering 13 passed on the 1B (ran, not skipped). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

#1637 was re-landed on the migrated #1622 (new top-level layout); #1623 is stacked on it, so it's layered onto the new #1637 by hand (the old dashboard branch predates the #1637 hardening, so a whole-file diff would revert it). - Clean adds: feature_explorer/ (the React dashboard, incl. the gene_embed firing-column reduction, request timeouts, per-pane descriptions, shared components.jsx, crash-safe localStorage), scripts/{dashboard,launch_dashboard}.py, tests/test_launch_dashboard.py. - server.py: add GeneEmbedRequest + /gene_embed (uses core.clean_dna + _require_ready). - core.py: add Evo2SAE.embed_bundle (ships only firing columns + feature_ids, not the full 65536-wide matrix). - tests: gene_embed contract test in test_server.py; bind embed_bundle on the conftest FakeEngine. - pyproject: add scikit-learn (scripts/dashboard.py atlas PCA/TSNE). Validated in the evo2_megatron venv: CPU 44 passed; frontend `vite build` clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

… int port, fake shape 1. /annotate pick mode now range-checks user-supplied feature_ids -> 400 (was: out-of-range IndexError -> 500, negative id silently indexed the wrong feature via torch negative-index). + test_annotate_pick_rejects_out_of_range_id. 2. core.generate rejects an over-context prompt ("too long" -> server 413), instead of letting tokenize() silently truncate it — makes the /generate 413 branch live and matches /annotate. + test_generate_rejects_overlong_prompt. 3. cli.py: int() the env-var defaults (PORT/EMBEDDING_LAYER/MAX_SEQ_LEN) — argparse type= only coerces command-line values, so `serve` was handing uvicorn a str port. 4. conftest FakeEngine.generate now returns features keyed {id, label, strength} (the real feat_meta shape the dashboard consumes), not {feature_id, strength}; test_cli updated so the contract test pins the real API shape. 5. Note body-size limit is advisory (Content-Length only; chunked/lying bypasses). 6. Note the CUDA-wedge guard depends on a readiness-based recycler (else 503 until manual restart). Validated in the evo2_megatron venv: CPU 40 passed (was 38), GPU unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

…art loop) A device-side assert poisons the process's CUDA context (unrecoverable in-process), so ready=False alone only recovers under a readiness-based recycler. Add restart-on-exit recovery, which almost every host provides: - core.generate: on an unrecoverable CUDA fault, if EXIT_ON_CUDA_WEDGE=1, os._exit(1) the worker (after ready=False). Default unset -> just fail-closed at 503 (safe for library/CLI/test use). - launch_inference.sh: for `serve`, export EXIT_ON_CUDA_WEDGE=1 and wrap in a restart loop (respawn on crash/wedge exit; stop on clean exit / Ctrl-C 130 / SIGTERM 143). Recovery now works with no external orchestrator (and composes with docker --restart / systemd / k8s). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

… 413 test - launch_inference.sh: stop managing the venv — assume it's already active (Docker: on PATH; bare metal: source the evo2_megatron .venv first, like the tests). Drops the messy VENV= passing; adds a clear "bionemo.evo2 not importable" preflight. - Restart loop signal fix (was a graceful-shutdown regression): run the worker in the background and `wait`, with a trap that forwards SIGTERM/SIGINT to it (uvicorn graceful shutdown) and stops the loop — so `docker stop`/k8s on PID 1 no longer orphans the worker. Adds a 10-restart cap + backoff so a persistent crash (e.g. port already bound) doesn't loop forever. Smoke-tested: SIGTERM stops in ~1s, not the worker's full lifetime. - /generate 413 now pinned at the server layer: FakeEngine raises "too long" past max_seq_len and test_generate_rejects_too_long drives POST /generate -> 413 (was only covered via test_core). - Reframe the CUDA-wedge comment: it's PURELY DEFENSIVE — _sanitize_steering neutralizes every client-reachable assert trigger, so a wedge implies a hardware/driver fault, not a crafted request (exit+restart is not a remote DoS). New triggers must extend _sanitize_steering. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

Move the API routes under an /api prefix (one APIRouter + include_router) and, when a built frontend is configured (build_app(static_dir=...) or DASHBOARD_DIST env), mount it at / via StaticFiles(html=True). This lets a single container serve both the dashboard and the API on one origin: the frontend always calls /api/* (in dev via the Vite proxy, in prod from the same server). The static mount is generic — it serves whatever dir it's pointed at and knows nothing about the dashboard; the dashboard recipe (#1623) supplies the dir + the Docker build. With no frontend configured the server is API-only and / 404s (never crashes). Startup already tolerates a load failure (stays not-ready -> 503), so a frontend+API smoke needs no GPU/checkpoints. Tests: re-point existing contract tests to /api/*, add SPA-index/asset served, API-reachable-under- prefix, unknown-/api-is-404-not-SPA, and API-only-when-no-frontend. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Polina Binder <pbinder@nvidia.com>

it's layered onto the new #1637 by hand (the old dashboard branch predates the #1637 hardening, so a whole-file diff would revert it).

- Clean adds: feature_explorer/ (the React dashboard, incl. the gene_embed firing-column reduction, request timeouts, per-pane descriptions, shared components.jsx, crash-safe localStorage), scripts/{dashboard,launch_dashboard}.py, tests/test_launch_dashboard.py.
- server.py: add GeneEmbedRequest + /gene_embed (uses core.clean_dna + _require_ready).
- core.py: add Evo2SAE.embed_bundle (ships only firing columns + feature_ids, not the full 65536-wide matrix).
- tests: gene_embed contract test in test_server.py; bind embed_bundle on the conftest FakeEngine.
- pyproject: add scikit-learn (scripts/dashboard.py atlas PCA/TSNE).

Validated in the evo2_megatron venv: CPU 44 passed; frontend `vite build` clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
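The firing-column reduction mentioned for embed_bundle can be illustrated with plain Python lists. This is a minimal sketch of the idea only, assuming "firing" means an activation above a threshold at any position; the `firing_columns` name, the threshold parameter, and the list-of-lists layout are hypothetical, and the real Evo2SAE.embed_bundle is not shown in this page.

```python
def firing_columns(activations, threshold=0.0):
    """Keep only feature columns that fire at some position.

    activations: list of rows (one per sequence position), each a list of
    per-feature activation values. Returns (feature_ids, reduced_rows).
    Hypothetical helper; the threshold semantics are an assumption.
    """
    if not activations:
        return [], []
    width = len(activations[0])
    # A feature "fires" if any position exceeds the threshold.
    feature_ids = [
        j for j in range(width) if any(row[j] > threshold for row in activations)
    ]
    # Ship only those columns; feature_ids maps them back to the full width.
    reduced = [[row[j] for j in feature_ids] for row in activations]
    return feature_ids, reduced


if __name__ == "__main__":
    acts = [
        [0.0, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.0, 2.0],
        [0.0, 1.0, 0.0, 0.0],
    ]
    ids, reduced = firing_columns(acts)
    print(ids)      # column indices that fired
    print(reduced)  # the same rows restricted to those columns
```

Because sparse-autoencoder activations are mostly zero, sending the surviving columns together with their feature ids lets a client rebuild the sparse picture without transferring the full-width matrix.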

This was referenced Jun 12, 2026

evo2 SAE: inference engine + steering, CI lane, tests, Dockerfile #1622

Open

evo2 SAE recipe: feature-explorer dashboard (viz) #1623

Open

coderabbitai Bot reviewed Jun 12, 2026

polinabinder1 marked this pull request as ready for review June 12, 2026 05:32

polinabinder1 requested review from jstjohn, jwilber, pstjohn, savitha-eng and trvachov as code owners June 12, 2026 05:32

polinabinder1 requested review from jstjohn and pstjohn and removed request for jstjohn, pstjohn and trvachov June 12, 2026 05:32

polinabinder1 force-pushed the pbinder/evo2-sae-serve branch from e567efc to 9bedf2b Compare June 23, 2026 03:57

polinabinder1 force-pushed the pbinder/evo2-sae-server branch from 8c9c467 to a8a0930 Compare June 23, 2026 05:23

polinabinder1 force-pushed the pbinder/evo2-sae-server branch from 058d7f7 to c15f27f Compare June 23, 2026 06:13

polinabinder1 force-pushed the pbinder/evo2-sae-server branch from c15f27f to a819d98 Compare June 23, 2026 06:35

polinabinder1 force-pushed the pbinder/evo2-sae-server branch from a819d98 to dc46ad5 Compare June 23, 2026 18:50

polinabinder1 mentioned this pull request Jun 23, 2026

evo2 SAE eval: label producers + probing harness (on #1629) #1636

Open

polinabinder1 and others added 5 commits June 24, 2026 03:57

polinabinder1 force-pushed the pbinder/evo2-sae-server branch from cdda2f7 to 28e49be Compare June 24, 2026 03:58

polinabinder1 wants to merge 5 commits into pbinder/evo2-sae-serve from pbinder/evo2-sae-server

polinabinder1 commented Jun 12, 2026 (edited)

copy-pr-bot Bot commented Jun 12, 2026

coderabbitai Bot commented Jun 12, 2026 (edited)

❌ Failed checks (1 warning)
