
evo2 SAE serve: FastAPI server + CLI (on the engine)#1637

Open
polinabinder1 wants to merge 5 commits into pbinder/evo2-sae-serve from pbinder/evo2-sae-server

Conversation

@polinabinder1

@polinabinder1 polinabinder1 commented Jun 12, 2026

Collaborator

Summary

FastAPI server + CLI over the Evo2SAE engine (#1622). Thin wrappers — all model work lives in core.py — plus the input validation, resource governance, and recovery needed for a shared backend (runs behind NVIDIA SSO on Brev, reachable by many users). API routes live under /api, and the server can mount a prebuilt front-end at /, so the dashboard (#1623) and the API can be served from one origin / one container.

Rebased onto the single-engine #1622 (one inference engine serves both encode and generate; new top-level layout interpretability/sparse_autoencoders/…).

Contents (new layout)

  • …/src/evo2_sae/server.py: /api/health, /api/features, /api/annotate, /api/generate (+ optional static-frontend mount at /)
  • …/src/evo2_sae/cli.py: serve / encode / batch / generate
  • …/scripts/launch_inference.sh; CPU contract tests tests/test_cli.py, tests/test_server.py + the shared FakeEngine appended to #1622's tests/conftest.py

Shared logic (CLI ⇄ server live in core)

  • core.annotate(engine, …): clean → resolve-tag → encode → tag-len, behind both CLI encode and server /api/annotate.
  • core.parse_clamp_spec(spec): one parser for clamps as CLI "ID[:STRENGTH]" strings or server FeatureClamp JSON; fed in front of #1622's _sanitize_steering so both surfaces validate identically.
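
The single-parser idea can be sketched roughly as below. This is an illustration, not the code in core.py: the dict keys feature_id and strength, the default strength of 1.0, and the error wording are all assumptions, and the finite-strength check here only stands in for the guard that the PR places in _sanitize_steering.

```python
import math
from typing import Mapping, Tuple, Union

ClampSpec = Union[str, Mapping[str, object]]


def parse_clamp_spec(spec: ClampSpec, default_strength: float = 1.0) -> Tuple[int, float]:
    """Normalize a clamp to (feature_id, strength).

    Accepts a CLI-style "ID[:STRENGTH]" string or an API-style dict.
    Hypothetical sketch: dict keys and the default strength are assumptions.
    """
    try:
        if isinstance(spec, str):
            # "29244:300" -> ("29244", ":", "300"); "29244" -> ("29244", "", "")
            fid_text, sep, strength_text = spec.partition(":")
            feature_id = int(fid_text)
            strength = float(strength_text) if sep else float(default_strength)
        else:
            feature_id = int(spec["feature_id"])
            strength = float(spec.get("strength", default_strength))
    except (KeyError, TypeError, ValueError) as err:
        # One error type for both surfaces, so CLI and server can report it uniformly.
        raise ValueError(f"invalid clamp {spec!r}: {err}") from err
    if not math.isfinite(strength):
        raise ValueError(f"invalid clamp {spec!r}: strength must be finite")
    return feature_id, strength
```

Because both input shapes collapse to the same tuple, whatever validation runs next sees identical data regardless of whether the clamp came from --clamp or from a JSON body.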

Single-origin serving (/api + optional static mount)

  • API routes are grouped under /api (one APIRouter + include_router).
  • build_app(engine, static_dir=None) mounts a prebuilt front-end at / via StaticFiles(html=True) when static_dir (or the DASHBOARD_DIST env) points at a real directory; otherwise the server is API-only and / 404s (never crashes). The mount is generic: it serves whatever dir it's pointed at and knows nothing about the dashboard; #1623 supplies the dir + the Docker build that produces it.
  • This is what lets a single container serve UI + API on one port. Dev hits the same /api/* paths (the Vite proxy forwards /api without rewriting), so there's no dev/prod path drift.
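
The "mount only when it is a real directory" decision might look like the following standard-library sketch. resolve_static_dir is a hypothetical helper name, and giving the explicit argument precedence over DASHBOARD_DIST is an assumption; the description above only says that either may point at the directory.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_static_dir(
    static_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Return the directory to mount at "/", or None for an API-only server."""
    env = os.environ if env is None else env
    # Assumed precedence: explicit argument first, then the DASHBOARD_DIST env var.
    candidate = static_dir or env.get("DASHBOARD_DIST")
    if not candidate:
        return None
    path = Path(candidate)
    # A missing or non-directory path degrades to API-only instead of raising.
    return path if path.is_dir() else None
```

In a FastAPI app the returned path would typically be handed to app.mount("/", StaticFiles(directory=path, html=True)) after the /api router has been included, so the API routes are matched before the catch-all mount.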

Reliability & governance

  • /api/health 503 until ready so readiness probes don't route to a still-loading pod; a startup load failure is caught and leaves the engine not-ready (503) rather than crashing.
  • Length limits: /api/annotate and /api/generate reject input longer than max_seq_len (413) instead of silently truncating (which would misalign the per-base activations/bases the viz plots). Generation length is otherwise auto-capped to the remaining context (no fixed token cap).
  • Pick-id validation: /api/annotate mode=pick range-checks user-supplied feature_ids → 400 (an out-of-range id would otherwise 500 on IndexError, a negative one would silently return the wrong feature).
  • Steering sanitation: out-of-range ids, extreme/non-finite strengths, temperature<=0, negative top_k are all rejected/coerced before the GPU (_sanitize_steering).
  • CUDA-wedge recovery: a device-side assert poisons the process's CUDA context (unrecoverable in-process). Not client-inducible (sanitation covers the reachable triggers, so this is purely defensive), but if it happens generate() flips the engine not-ready (→ 503) and, when EXIT_ON_CUDA_WEDGE=1 (set by serve), exits the worker so any restart-on-exit supervisor respawns it. Recovery is therefore host-independent.
  • Signal-safe serve: launch_inference.sh serve runs the worker in the background, forwards SIGTERM/SIGINT (uvicorn graceful shutdown) before respawning, with a retry cap + backoff, so docker stop/k8s shuts down cleanly instead of orphaning the worker.
  • Request body-size limit (MAX_BODY_BYTES, default 16 MiB) → 413. Advisory only: it trusts Content-Length.
  • Bounded concurrency: Starlette's sync-endpoint threadpool capped (MAX_CONCURRENCY, default 8); the engine lock already serializes the single GPU.
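
As a rough illustration of the two input checks in the list above (length → 413, pick ids → 400), here is a framework-free sketch. ApiError, check_length, and check_pick_ids are hypothetical names, and measuring length in characters is an assumption; the real server may also count an organism tag toward the limit.

```python
from typing import Iterable, List


class ApiError(Exception):
    """Carries the HTTP status a server layer would translate this into."""

    def __init__(self, status: int, detail: str) -> None:
        super().__init__(detail)
        self.status = status
        self.detail = detail


def check_length(sequence: str, max_seq_len: int) -> str:
    """Reject over-long input with 413 instead of silently truncating it."""
    if len(sequence) > max_seq_len:
        raise ApiError(413, f"sequence length {len(sequence)} exceeds max_seq_len {max_seq_len}")
    return sequence


def check_pick_ids(feature_ids: Iterable[int], n_features: int) -> List[int]:
    """Range-check user-supplied ids; negative or too-large ids become a 400."""
    ids = list(feature_ids)
    bad = [i for i in ids if not 0 <= i < n_features]
    if bad:
        raise ApiError(400, f"feature ids out of range [0, {n_features}): {bad}")
    return ids
```

Rejecting negative ids explicitly matters because tensor indexing would otherwise accept them and return a different feature without any error.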

Architectural decisions

  • Two layers: engine vs. surface. All model work stays in core.Evo2SAE (#1622); server.py/cli.py are thin and share core.annotate + core.parse_clamp_spec, so the HTTP API and the CLI can't drift and there's one validated path.
  • FastAPI, not raw/Flask. We get pydantic structural validation + an async threadpool we can bound (MAX_CONCURRENCY) for almost no code; the domain validation that matters (_sanitize_steering, pick-id range) is manual either way. Raw Python would hand-roll routing/validation/concurrency; Flask would add the threadpool governance by hand.
  • No app-level auth. Deployed behind NVIDIA SSO on Brev; auth is the proxy's job, not duplicated here (CORS removed too — calls are same-origin).
  • Single GPU, serialized. The engine lock + bounded threadpool match one GPU; data-parallel replicas behind a balancer are a deferred follow-up (touches no engine code).
  • /api prefix + generic static mount (above) so one origin/container can serve both UI and API.
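
To make the "single GPU, serialized" decision concrete, the sketch below pairs a lock-guarded engine call with a bounded worker pool. It is a stand-in only: the real server bounds Starlette's threadpool rather than using ThreadPoolExecutor, and SerializedEngine and run_bounded are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


class SerializedEngine:
    """Wraps a callable so that at most one call runs at a time (one GPU)."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self._fn = fn
        self._lock = threading.Lock()
        self.active = 0      # calls currently inside the lock
        self.max_active = 0  # high-water mark, for checking serialization

    def __call__(self, request: str) -> str:
        with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            try:
                return self._fn(request)
            finally:
                self.active -= 1


def run_bounded(engine: SerializedEngine, requests: Iterable[str], max_concurrency: int = 8) -> List[str]:
    """Accept up to max_concurrency requests at once; the lock serializes the work."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(engine, requests))
```

The pool bound limits how many requests can queue on the lock at once, while the lock itself guarantees the engine never sees overlapping calls.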

How to run

Run inside the evo2_megatron venv (provides bionemo.evo2 + megatron); in the Docker image it's already active. Full dashboard run modes are in #1623's feature_explorer/README.md.

export EVO2_CKPT_DIR=<mbridge>  SAE_CKPT_PATH=<sae.pt>
export FEATURE_ANNOTATIONS=<feature_metadata.parquet>  EMBEDDING_LAYER=26
scripts/launch_inference.sh serve                                    # API on :8001 (+ UI at / if DASHBOARD_DIST set)
scripts/launch_inference.sh encode   --sequence ATGC...              # one sequence -> top features (JSON)
scripts/launch_inference.sh batch    --fasta in.fa --out out.parquet # many -> parquet
scripts/launch_inference.sh generate --prompt ATGC... --clamp 29244:300  # steered generation

Tunables (env): MAX_BODY_BYTES, MAX_CONCURRENCY, MAX_SEQ_LEN, PORT, EXIT_ON_CUDA_WEDGE, DASHBOARD_DIST.
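
One way these tunables could be read in a single place is sketched below. The Tunables dataclass and load_tunables are hypothetical; the 16 MiB and 8 defaults come from the bullets above, 8001 from the run example, and 8192 for MAX_SEQ_LEN is an assumption taken from the review discussion rather than from this description.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Tunables:
    max_body_bytes: int
    max_concurrency: int
    max_seq_len: int
    port: int
    exit_on_cuda_wedge: bool
    dashboard_dist: Optional[str]


def _int_env(env: Mapping[str, str], name: str, default: int) -> int:
    """Read an integer env var; a non-numeric value raises ValueError."""
    return int(env.get(name, default))


def load_tunables(env: Optional[Mapping[str, str]] = None) -> Tunables:
    env = os.environ if env is None else env
    return Tunables(
        max_body_bytes=_int_env(env, "MAX_BODY_BYTES", 16 * 1024 * 1024),
        max_concurrency=_int_env(env, "MAX_CONCURRENCY", 8),
        max_seq_len=_int_env(env, "MAX_SEQ_LEN", 8192),  # assumed default
        port=_int_env(env, "PORT", 8001),
        exit_on_cuda_wedge=env.get("EXIT_ON_CUDA_WEDGE") == "1",
        dashboard_dist=env.get("DASHBOARD_DIST") or None,
    )
```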

Tests

No dedicated CI lane (deferred — see #1622). Run them via the recipe's build script:

cd interpretability/sparse_autoencoders/recipes/evo2
bash .ci_build.sh && source .ci_test_env.sh
pytest tests/
  • CPU (no model): test_cli.py + test_server.py (FastAPI TestClient + FakeEngine: response shapes, 400/413/503, pick out-of-range → 400, /api/generate too-long → 413, body-size, k-bounds, clamp validation, static-frontend mount: SPA at /, asset served, API reachable under /api, unknown /api/* → 404, API-only when no frontend), plus #1622's test_core.py + test_steering.py sanitize guards.
  • GPU: test_steering.py — encode, in-distribution generation, steering changes the continuation, batched/empty encode, max-clamp finite, highlight↔steer interleaving (single-engine state-bleed check). Gated by @pytest.mark.skipif(not torch.cuda.is_available()) — runs on a GPU box, skips otherwise. Validated on the 1B; the single-engine backend also serves the 7B at layer 26 live.

Deferred follow-up

Multi-GPU data-parallel replicas (one worker per GPU behind a least_conn balancer) for concurrent throughput — touches no engine code; left until concurrency is an observed need.

Stacked on #1622. The dashboard (#1623) builds on this.

@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

Walkthrough

This PR introduces a complete inference system for Evo2 sparse autoencoders. It provides a bash launcher, a Python CLI with four modes (serve, encode, batch, generate), a FastAPI REST API with endpoints for annotation and generation, and comprehensive server contract tests to validate the API contract.

Changes

Evo2 SAE Inference API: Launcher, CLI, and FastAPI Server

| Layer | File(s) | Summary |
|---|---|---|
| CLI Launcher and Environment Setup | bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh | Bash script that validates the venv path, configures PYTHONPATH to include src, activates the virtual environment, and execs the Python CLI module with forwarded arguments; documents supported modes (serve, encode, batch, generate) and feature-steering usage. |
| CLI Interface with Subcommands | bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py | Python main() entrypoint with shared argument registration for checkpoint paths and runtime controls; dispatches to serve (FastAPI via Uvicorn), encode (single DNA sequence with top-k features), batch (FASTA file with per-sequence feature ranking to Parquet), and generate (with optional feature-clamp steering and optional baseline comparison). Parses repeatable --clamp FEATURE_ID[:STRENGTH] options into structured specs. |
| Server Request Models and App Factory | bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py (partial) | Pydantic request schemas (AnnotateRequest, FeatureClamp, GenerateRequest) with defaults for organism, mode, sampling parameters, and feature clamping; build_app(engine) factory loads the engine once via async lifespan, configures CORS using CORS_ORIGINS, and returns the configured FastAPI instance. |
| Server Endpoints | bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py (partial) | Four endpoints enforce readiness gating (HTTP 503 if not ready): /health returns readiness and engine metadata (layer, feature counts, SAE checkpoint, organism tags, device); /features lists feature id/label/natural_peak; /annotate cleans DNA, resolves organism/tag, encodes sequence, selects features via explicit ids or top-k ranking, returns per-feature activation summaries including per-base activations; /generate calls engine with feature clamps, translates ValueError to HTTP 400. |
| Server Contract Tests | bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py | FakeEngine minimal stand-in providing readiness/metadata, encoding, top-features, and generation; client pytest fixture wraps the app in TestClient; tests validate response shapes for /health, /features, /annotate (including per-base activation presence and non-DNA rejection with HTTP 400), /generate (sequence returned), and readiness gating (HTTP 503 until ready). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 The Evo2 encoder bounds,
With CLI and REST in rounds,
Feature clamps dance free,
Activation to see,
Inference serves up new grounds! 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.83% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly and specifically describes the main change: adding a FastAPI server and CLI layer on top of the Evo2SAE engine for inference workflows. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering summary, contents, shared logic, architecture, usage examples, tests, and deferred follow-up work. |

@polinabinder1

Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`:
- Around line 50-53: The parser currently eagerly calls int(...) on environment
defaults (see add_argument for "--layer" and "--max-seq-len" and the similar
PORT usage), which raises a traceback if the env var is non-numeric; change
these to pass the raw env value (e.g., os.environ.get("EMBEDDING_LAYER", "26")
and os.environ.get("MAX_SEQ_LEN", "8192") and PORT default) as the default and
let argparse's type=int handle conversion and clean error messages—i.e., remove
the outer int(...) wrapper in the default arguments for the "--layer" and
"--max-seq-len" add_argument calls (and the PORT default at the other
occurrence) so invalid numeric env values are reported by argparse rather than
causing an eager exception.
- Around line 144-146: The code currently treats an unknown organism as empty
tag by using eng.resolve_tag(args.organism, None) or "", which silently falls
back to raw-DNA mode; instead, after calling eng.resolve_tag(args.organism,
None) check if the result is None and fail fast (print a clear error referencing
args.organism and exit non‑zero or raise an appropriate exception) before
calling clean_dna and eng.encode; locate the block that calls eng.resolve_tag,
clean_dna and eng.encode (the variables tag, dna, codes) and replace the "or ''"
fallback with an explicit None check that aborts with a helpful message when the
organism is unknown.
- Around line 83-86: The loop that parses clamp strings currently lets int/float
conversions raise raw ValueError; instead catch conversion errors and convert
them into a CLI parser error by raising argparse.ArgumentTypeError with a clear
message including the offending clamp string (e.g., in the function that builds
specs from clamps, wrap the int(fid)/float(strength) casts in try/except and on
failure raise argparse.ArgumentTypeError(f"invalid --clamp value: {c!r}: {err}")
so the CLI shows a clean validation error; ensure argparse is imported and used
consistently where this parsing function is invoked.
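
For reference, the first suggestion relies on documented argparse behaviour: when a default is a string, argparse parses it as if it had been given on the command line, so type=int is applied to it. A minimal sketch follows; build_parser and the prog name are hypothetical, and the default values "26", "8192", and "8001" are taken from this thread rather than from cli.py itself.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="evo2-sae-sketch")
    # Defaults stay as raw strings; argparse runs them through type=int when the
    # flag is absent, and reports a bad value as a normal usage error (exit code 2).
    parser.add_argument("--layer", type=int, default=env.get("EMBEDDING_LAYER", "26"))
    parser.add_argument("--max-seq-len", type=int, default=env.get("MAX_SEQ_LEN", "8192"))
    parser.add_argument("--port", type=int, default=env.get("PORT", "8001"))
    return parser
```

With this shape, a non-numeric PORT in the environment produces the same clean "invalid int value" message as a bad --port flag, instead of a traceback at parser-construction time.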
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 141c9623-9e66-42db-8224-ac5633a6c07d

📥 Commits

Reviewing files that changed from the base of the PR and between f310289 and 9fd49ed.

📒 Files selected for processing (4)
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py

@polinabinder1 polinabinder1 marked this pull request as ready for review June 12, 2026 05:32
@polinabinder1 polinabinder1 requested review from jstjohn and pstjohn and removed request for jstjohn, pstjohn and trvachov June 12, 2026 05:32
polinabinder1 added a commit that referenced this pull request Jun 22, 2026
…he engine PR

- remove test_clamp_math: it called the deleted Evo2SAE._clamp_hook (we unified onto
  sae.steering.clamp_hook); the delta-clamp math is covered in sae/tests/test_steering.py
- pyproject: drop fastapi/uvicorn/pandas — the engine imports none of them; fastapi+uvicorn
  move to the serve PR (#1637), pandas was unused

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@polinabinder1 polinabinder1 force-pushed the pbinder/evo2-sae-serve branch from e567efc to 9bedf2b Compare June 23, 2026 03:57
polinabinder1 added a commit that referenced this pull request Jun 23, 2026
#1622 was migrated to the new top-level layout (interpretability/sparse_autoencoders/…,
no bionemo-recipes/ prefix) and squash-replayed, so #1637 is layered on by hand rather
than rebased.

- Clean adds: src/evo2_sae/{server,cli}.py, scripts/launch_inference.sh,
  tests/test_{cli,server}.py.
- pyproject: add pandas/fastapi/uvicorn/anyio.
- tests/conftest.py: keep #1622's 1B GPU fixtures + bionemo.common loader; append the
  serve-layer FakeEngine + fake_engine fixture.
- core.py (semantic merge): keep #1622's _sanitize_steering (all CPU sanitize tests) and
  fold in the explicit non-finite-strength guard (no min/max arg-order reliance); add the
  shared annotate() + parse_clamp_spec() (CLI strings ⇄ API dicts) and feed parse_clamp_spec
  in front of _sanitize_steering; add _is_unrecoverable_cuda + flip the engine not-ready on
  an unrecoverable CUDA fault in generate(). (Kept _sanitize_steering rather than swapping in
  _normalize_clamps — non-redundant and preserves #1622's sampler hardening + tests.)
- test_steering.py: keep #1622's sanitize + GPU tests; add the _is_unrecoverable_cuda test.

Preserved from #1622: clamp_hook canonical encode/decode, TopKSAE-only _load_sae (no ReLU),
bionemo.common(/core fallback) loader. Validated in the evo2_megatron venv: CPU 38 passed,
GPU test_steering 13 passed on the 1B (ran, not skipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@polinabinder1 polinabinder1 force-pushed the pbinder/evo2-sae-server branch from 8c9c467 to a8a0930 Compare June 23, 2026 05:23
polinabinder1 added a commit that referenced this pull request Jun 23, 2026
#1637 was re-landed on the migrated #1622 (new top-level layout); #1623 is stacked on it, so
it's layered onto the new #1637 by hand (the old dashboard branch predates the #1637 hardening,
so a whole-file diff would revert it).

- Clean adds: feature_explorer/ (the React dashboard, incl. the gene_embed firing-column
  reduction, request timeouts, per-pane descriptions, shared components.jsx, crash-safe
  localStorage), scripts/{dashboard,launch_dashboard}.py, tests/test_launch_dashboard.py.
- server.py: add GeneEmbedRequest + /gene_embed (uses core.clean_dna + _require_ready).
- core.py: add Evo2SAE.embed_bundle (ships only firing columns + feature_ids, not the full
  65536-wide matrix).
- tests: gene_embed contract test in test_server.py; bind embed_bundle on the conftest FakeEngine.
- pyproject: add scikit-learn (scripts/dashboard.py atlas PCA/TSNE).

Validated in the evo2_megatron venv: CPU 44 passed; frontend `vite build` clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
@polinabinder1 polinabinder1 force-pushed the pbinder/evo2-sae-server branch from 058d7f7 to c15f27f Compare June 23, 2026 06:13
@polinabinder1 polinabinder1 force-pushed the pbinder/evo2-sae-server branch from c15f27f to a819d98 Compare June 23, 2026 06:35
@polinabinder1 polinabinder1 force-pushed the pbinder/evo2-sae-server branch from a819d98 to dc46ad5 Compare June 23, 2026 18:50
polinabinder1 and others added 5 commits June 24, 2026 03:57
no bionemo-recipes/ prefix) and squash-replayed, so #1637 is layered on by hand rather
than rebased.

- Clean adds: src/evo2_sae/{server,cli}.py, scripts/launch_inference.sh,
  tests/test_{cli,server}.py.
- pyproject: add pandas/fastapi/uvicorn/anyio.
- tests/conftest.py: keep #1622's 1B GPU fixtures + bionemo.common loader; append the
  serve-layer FakeEngine + fake_engine fixture.
- core.py (semantic merge): keep #1622's _sanitize_steering (all CPU sanitize tests) and
  fold in the explicit non-finite-strength guard (no min/max arg-order reliance); add the
  shared annotate() + parse_clamp_spec() (CLI strings ⇄ API dicts) and feed parse_clamp_spec
  in front of _sanitize_steering; add _is_unrecoverable_cuda + flip the engine not-ready on
  an unrecoverable CUDA fault in generate(). (Kept _sanitize_steering rather than swapping in
  _normalize_clamps — non-redundant and preserves #1622's sampler hardening + tests.)
- test_steering.py: keep #1622's sanitize + GPU tests; add the _is_unrecoverable_cuda test.

Preserved from #1622: clamp_hook canonical encode/decode, TopKSAE-only _load_sae (no ReLU),
bionemo.common(/core fallback) loader. Validated in the evo2_megatron venv: CPU 38 passed,
GPU test_steering 13 passed on the 1B (ran, not skipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
… int port, fake shape

1. /annotate pick mode now range-checks user-supplied feature_ids -> 400 (was: out-of-range
   IndexError -> 500, negative id silently indexed the wrong feature via torch negative-index).
   + test_annotate_pick_rejects_out_of_range_id.
2. core.generate rejects an over-context prompt ("too long" -> server 413), instead of letting
   tokenize() silently truncate it — makes the /generate 413 branch live and matches /annotate.
   + test_generate_rejects_overlong_prompt.
3. cli.py: int() the env-var defaults (PORT/EMBEDDING_LAYER/MAX_SEQ_LEN) — argparse type= only
   coerces command-line values, so `serve` was handing uvicorn a str port.
4. conftest FakeEngine.generate now returns features keyed {id, label, strength} (the real
   feat_meta shape the dashboard consumes), not {feature_id, strength}; test_cli updated so the
   contract test pins the real API shape.
5. Note that the body-size limit is advisory: it checks Content-Length only, so chunked or mis-declared bodies bypass it.
6. Note the CUDA-wedge guard depends on a readiness-based recycler (else 503 until manual restart).

Validated in the evo2_megatron venv: CPU 40 passed (was 38), GPU unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
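
Item 1 above (range-checking user-supplied feature ids) can be sketched as follows. The function name and the error text are assumptions for illustration; in the real server the failure is reported as a 400, which this framework-free version models by raising `ValueError` for the handler to translate.

```python
from typing import Iterable, List


def validate_feature_ids(feature_ids: Iterable[int], n_features: int) -> List[int]:
    """Return the ids as a list, or raise ValueError if any is out of range."""
    ids = list(feature_ids)
    bad = [i for i in ids if not (0 <= i < n_features)]
    if bad:
        # A server handler would translate this into an HTTP 400 response.
        raise ValueError(f"feature_ids out of range [0, {n_features}): {bad}")
    return ids


# Why the negative case matters: Python sequences (like torch tensors) accept
# negative indices, so -1 silently selects the last element instead of failing.
activations = [0.1, 0.2, 0.9]
assert activations[-1] == 0.9
```

Checking the lower bound explicitly is the important part: an upper-bound check alone would still let a negative id index the wrong feature.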
…art loop)

A device-side assert poisons the process's CUDA context (unrecoverable in-process), so
ready=False alone only recovers under a readiness-based recycler. Add restart-on-exit recovery,
which almost every host provides:
- core.generate: on an unrecoverable CUDA fault, if EXIT_ON_CUDA_WEDGE=1, os._exit(1) the worker
  (after ready=False). Default unset -> just fail-closed at 503 (safe for library/CLI/test use).
- launch_inference.sh: for `serve`, export EXIT_ON_CUDA_WEDGE=1 and wrap in a restart loop
  (respawn on crash/wedge exit; stop on clean exit / Ctrl-C 130 / SIGTERM 143). Recovery now works
  with no external orchestrator (and composes with docker --restart / systemd / k8s).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
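
The fail-closed-then-exit behaviour described above might look roughly like this. It is a sketch under assumptions: the marker substrings, the `Engine` stand-in, and the `handle_generate_fault` helper are hypothetical, and the real `_is_unrecoverable_cuda` may classify faults differently. The exit function and environment are injectable so the logic can be exercised without killing the process.

```python
import os
from typing import Callable, Mapping

# Hypothetical marker substrings; the real classifier may match differently.
_WEDGE_MARKERS = ("device-side assert", "illegal memory access")


def is_unrecoverable_cuda(exc: BaseException) -> bool:
    """Classify an exception as a CUDA fault that poisons the context."""
    message = str(exc).lower()
    return any(marker in message for marker in _WEDGE_MARKERS)


class Engine:
    """Stand-in for the real engine; only the readiness flag matters here."""

    def __init__(self) -> None:
        self.ready = True


def handle_generate_fault(
    engine: Engine,
    exc: BaseException,
    env: Mapping[str, str] = os.environ,
    exit_fn: Callable[[int], None] = os._exit,
) -> bool:
    """Return True if the fault was classified as an unrecoverable CUDA wedge."""
    if not is_unrecoverable_cuda(exc):
        return False
    engine.ready = False  # fail closed: later requests are refused (503)
    if env.get("EXIT_ON_CUDA_WEDGE") == "1":
        # os._exit ends the process immediately, without atexit handlers or
        # interpreter cleanup, leaving recovery to whatever restarts the worker.
        exit_fn(1)
    return True
```

Leaving the variable unset by default keeps library, CLI, and test callers from having their process terminated underneath them.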
… 413 test

- launch_inference.sh: stop managing the venv — assume it's already active (Docker: on PATH;
  bare metal: source the evo2_megatron .venv first, like the tests). Drops the messy VENV= passing;
  adds a clear "bionemo.evo2 not importable" preflight.
- Restart loop signal fix (was a graceful-shutdown regression): run the worker in the background
  and `wait`, with a trap that forwards SIGTERM/SIGINT to it (uvicorn graceful shutdown) and stops
  the loop — so `docker stop`/k8s on PID 1 no longer orphans the worker. Adds a 10-restart cap +
  backoff so a persistent crash (e.g. port already bound) doesn't loop forever. Smoke-tested:
  SIGTERM stops in ~1s, not the worker's full lifetime.
- /generate 413 now pinned at the server layer: FakeEngine raises "too long" past max_seq_len and
  test_generate_rejects_too_long drives POST /generate -> 413 (was only covered via test_core).
- Reframe the CUDA-wedge comment: it's PURELY DEFENSIVE — _sanitize_steering neutralizes every
  client-reachable assert trigger, so a wedge implies a hardware/driver fault, not a crafted
  request (exit+restart is not a remote DoS). New triggers must extend _sanitize_steering.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
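
The restart policy described above (respawn on crash, stop on clean exit / 130 / 143, cap the number of restarts, back off between attempts) can be modelled in a few lines. The actual script is shell; this Python sketch only captures the decision logic. The backoff schedule and parameter names are assumptions, since the commit only states a 10-restart cap plus backoff, and signal forwarding to the worker is not modelled here.

```python
import time
from typing import Callable

# Clean exit, Ctrl-C (128 + SIGINT), and SIGTERM (128 + 15).
CLEAN_STOP_CODES = {0, 130, 143}


def supervise(
    run_worker: Callable[[], int],
    max_restarts: int = 10,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the worker until it stops cleanly or the restart cap is reached.

    Returns the exit code of the last run.
    """
    restarts = 0
    while True:
        code = run_worker()
        if code in CLEAN_STOP_CODES:
            return code  # deliberate stop: do not respawn
        if restarts >= max_restarts:
            return code  # persistent crash: give up instead of looping forever
        # Exponential backoff, capped at max_delay (assumed schedule).
        sleep(min(base_delay * 2 ** restarts, max_delay))
        restarts += 1
```

With the cap in place, a failure that can never succeed (for example, a port that is already bound) ends after a bounded number of attempts rather than spinning indefinitely.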
Move the API routes under an /api prefix (one APIRouter + include_router) and, when a built
frontend is configured (build_app(static_dir=...) or DASHBOARD_DIST env), mount it at / via
StaticFiles(html=True). This lets a single container serve both the dashboard and the API on one
origin: the frontend always calls /api/* (in dev via the Vite proxy, in prod from the same server).

The static mount is generic — it serves whatever dir it's pointed at and knows nothing about the
dashboard; the dashboard recipe (#1623) supplies the dir + the Docker build. With no frontend
configured, the server is API-only and `/` returns 404 (it never crashes). Startup already tolerates a load
failure (the server stays not-ready and returns 503), so a frontend+API smoke test needs no GPU or checkpoints.

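Tests: re-point the existing contract tests to /api/*, and add tests that the SPA index and assets are
served, the API is reachable under the prefix, an unknown /api path returns 404 rather than the SPA
index, and the server is API-only when no frontend is configured.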

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
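
The resolution order that the new tests pin down can be modelled without any web framework. This is not the server code, which per the commit uses an APIRouter plus a StaticFiles mount; the route table, function name, and return values below are assumptions chosen only to make the precedence rules explicit.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple

API_ROUTES = {"/api/health"}  # hypothetical subset of the real route table


def resolve(path: str, static_dir: Optional[Path]) -> Tuple[int, str]:
    """Return (status, what_was_served) for a GET, in the described order."""
    # 1. API routes are matched first.
    if path in API_ROUTES:
        return 200, "api"
    # 2. Anything else under /api is a plain 404, never the SPA index.
    if path == "/api" or path.startswith("/api/"):
        return 404, "not found"
    # 3. With no frontend configured, the server is API-only.
    if static_dir is None:
        return 404, "not found"
    # 4. Otherwise look for a file under the static dir, refusing to escape it.
    root = static_dir.resolve()
    target = (root / path.lstrip("/")).resolve()
    if target != root and root not in target.parents:
        return 404, "not found"
    if target.is_dir():
        target = target / "index.html"  # directory paths serve their index
    if target.is_file():
        return 200, target.name
    return 404, "not found"
```

Keeping the `/api` check ahead of the static lookup is what prevents a mistyped API path from being answered with the dashboard's HTML.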
@polinabinder1 polinabinder1 force-pushed the pbinder/evo2-sae-server branch from cdda2f7 to 28e49be Compare June 24, 2026 03:58
polinabinder1 added a commit that referenced this pull request Jun 24, 2026