evo2 SAE serve: FastAPI server + CLI (on the engine) #1637
polinabinder1 wants to merge 5 commits into
📝 Walkthrough
This PR introduces a complete inference system for Evo2 sparse autoencoders. It provides a bash launcher, a Python CLI with four modes (serve, encode, batch, generate), a FastAPI REST API with endpoints for annotation and generation, and server contract tests that validate the API contract.
Changes
Evo2 SAE Inference API: Launcher, CLI, and FastAPI Server
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
@coderabbitai review |
✅ Action performed: review finished.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py`:
- Around line 50-53: The parser currently eagerly calls int(...) on environment
defaults (see add_argument for "--layer" and "--max-seq-len" and the similar
PORT usage), which raises a traceback if the env var is non-numeric; change
these to pass the raw env value (e.g., os.environ.get("EMBEDDING_LAYER", "26")
and os.environ.get("MAX_SEQ_LEN", "8192") and PORT default) as the default and
let argparse's type=int handle conversion and clean error messages—i.e., remove
the outer int(...) wrapper in the default arguments for the "--layer" and
"--max-seq-len" add_argument calls (and the PORT default at the other
occurrence) so invalid numeric env values are reported by argparse rather than
causing an eager exception.
- Around line 144-146: The code currently treats an unknown organism as empty
tag by using eng.resolve_tag(args.organism, None) or "", which silently falls
back to raw-DNA mode; instead, after calling eng.resolve_tag(args.organism,
None) check if the result is None and fail fast (print a clear error referencing
args.organism and exit non‑zero or raise an appropriate exception) before
calling clean_dna and eng.encode; locate the block that calls eng.resolve_tag,
clean_dna and eng.encode (the variables tag, dna, codes) and replace the "or ''"
fallback with an explicit None check that aborts with a helpful message when the
organism is unknown.
- Around line 83-86: The loop that parses clamp strings currently lets int/float
conversions raise raw ValueError; instead catch conversion errors and convert
them into a CLI parser error by raising argparse.ArgumentTypeError with a clear
message including the offending clamp string (e.g., in the function that builds
specs from clamps, wrap the int(fid)/float(strength) casts in try/except and on
failure raise argparse.ArgumentTypeError(f"invalid --clamp value: {c!r}: {err}")
so the CLI shows a clean validation error; ensure argparse is imported and used
consistently where this parsing function is invoked.
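The clamp-parsing suggestion above can be sketched as a small argparse `type=` function. This is only an illustration: `parse_clamp`, `build_parser`, and the `DEFAULT_STRENGTH` value are hypothetical names and values, not taken from `cli.py`.

```python
import argparse

# Assumption: the strength used when ":STRENGTH" is omitted is not stated in the review.
DEFAULT_STRENGTH = 1.0


def parse_clamp(text: str) -> tuple[int, float]:
    """Parse one "ID[:STRENGTH]" clamp string for use as an argparse ``type=``."""
    fid, sep, strength = text.partition(":")
    try:
        feature_id = int(fid)
        value = float(strength) if sep else DEFAULT_STRENGTH
    except ValueError as err:
        # argparse turns ArgumentTypeError into a usage error instead of a traceback.
        raise argparse.ArgumentTypeError(f"invalid --clamp value: {text!r}: {err}") from err
    return feature_id, value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="clamp-demo")
    # Each --clamp occurrence is converted by parse_clamp and appended to a list.
    parser.add_argument("--clamp", action="append", default=[], type=parse_clamp)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args(["--clamp", "12:3.5", "--clamp", "7"]).clamp)
```

With this shape, a malformed value such as `--clamp abc:1` is reported by argparse as a normal validation error.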
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 141c9623-9e66-42db-8224-ac5633a6c07d
📒 Files selected for processing (4)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/launch_inference.sh
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/cli.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/src/evo2_sae/server.py
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/tests/test_server.py
…he engine PR
- remove test_clamp_math: it called the deleted Evo2SAE._clamp_hook (we unified onto sae.steering.clamp_hook); the delta-clamp math is covered in sae/tests/test_steering.py
- pyproject: drop fastapi/uvicorn/pandas: the engine imports none of them; fastapi+uvicorn move to the serve PR (#1637), pandas was unused
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
e567efc to 9bedf2b
#1622 was migrated to the new top-level layout (interpretability/sparse_autoencoders/…, no bionemo-recipes/ prefix) and squash-replayed, so #1637 is layered on by hand rather than rebased.
- Clean adds: src/evo2_sae/{server,cli}.py, scripts/launch_inference.sh, tests/test_{cli,server}.py.
- pyproject: add pandas/fastapi/uvicorn/anyio.
- tests/conftest.py: keep #1622's 1B GPU fixtures + bionemo.common loader; append the serve-layer FakeEngine + fake_engine fixture.
- core.py (semantic merge): keep #1622's _sanitize_steering (all CPU sanitize tests) and fold in the explicit non-finite-strength guard (no min/max arg-order reliance); add the shared annotate() + parse_clamp_spec() (CLI strings ⇄ API dicts) and feed parse_clamp_spec in front of _sanitize_steering; add _is_unrecoverable_cuda + flip the engine not-ready on an unrecoverable CUDA fault in generate(). (Kept _sanitize_steering rather than swapping in _normalize_clamps: non-redundant and preserves #1622's sampler hardening + tests.)
- test_steering.py: keep #1622's sanitize + GPU tests; add the _is_unrecoverable_cuda test.
Preserved from #1622: clamp_hook canonical encode/decode, TopKSAE-only _load_sae (no ReLU), bionemo.common(/core fallback) loader.
Validated in the evo2_megatron venv: CPU 38 passed, GPU test_steering 13 passed on the 1B (ran, not skipped).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
8c9c467 to a8a0930
#1637 was re-landed on the migrated #1622 (new top-level layout); #1623 is stacked on it, so it's layered onto the new #1637 by hand (the old dashboard branch predates the #1637 hardening, so a whole-file diff would revert it).
- Clean adds: feature_explorer/ (the React dashboard, incl. the gene_embed firing-column reduction, request timeouts, per-pane descriptions, shared components.jsx, crash-safe localStorage), scripts/{dashboard,launch_dashboard}.py, tests/test_launch_dashboard.py.
- server.py: add GeneEmbedRequest + /gene_embed (uses core.clean_dna + _require_ready).
- core.py: add Evo2SAE.embed_bundle (ships only firing columns + feature_ids, not the full 65536-wide matrix).
- tests: gene_embed contract test in test_server.py; bind embed_bundle on the conftest FakeEngine.
- pyproject: add scikit-learn (scripts/dashboard.py atlas PCA/TSNE).
Validated in the evo2_megatron venv: CPU 44 passed; frontend `vite build` clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
058d7f7 to c15f27f
c15f27f to a819d98
a819d98 to dc46ad5
… int port, fake shape
1. /annotate pick mode now range-checks user-supplied feature_ids -> 400 (was: out-of-range
IndexError -> 500, negative id silently indexed the wrong feature via torch negative-index).
+ test_annotate_pick_rejects_out_of_range_id.
2. core.generate rejects an over-context prompt ("too long" -> server 413), instead of letting
tokenize() silently truncate it — makes the /generate 413 branch live and matches /annotate.
+ test_generate_rejects_overlong_prompt.
3. cli.py: int() the env-var defaults (PORT/EMBEDDING_LAYER/MAX_SEQ_LEN) — argparse type= only
coerces command-line values, so `serve` was handing uvicorn a str port.
4. conftest FakeEngine.generate now returns features keyed {id, label, strength} (the real
feat_meta shape the dashboard consumes), not {feature_id, strength}; test_cli updated so the
contract test pins the real API shape.
5. Note body-size limit is advisory (Content-Length only; chunked/lying bypasses).
6. Note the CUDA-wedge guard depends on a readiness-based recycler (else 503 until manual restart).
Validated in the evo2_megatron venv: CPU 40 passed (was 38), GPU unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
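Items 1 and 2 of the commit above describe two pre-GPU guards. A minimal sketch, assuming the checks raise `ValueError` and a server layer maps them to 400 and 413 respectively; the function names `validate_feature_ids` and `check_prompt_length` are hypothetical and do not appear in the PR:

```python
def validate_feature_ids(feature_ids, n_features):
    """Return the ids unchanged, or raise ValueError if any is outside [0, n_features)."""
    checked = []
    for fid in feature_ids:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(fid, bool) or not isinstance(fid, int):
            raise ValueError(f"feature id must be an integer, got {fid!r}")
        # A negative id would otherwise index from the end (Python/torch semantics)
        # and silently select the wrong feature.
        if not 0 <= fid < n_features:
            raise ValueError(f"feature id {fid} out of range [0, {n_features})")
        checked.append(fid)
    return checked


def check_prompt_length(n_tokens, max_seq_len):
    """Reject an over-context prompt instead of letting it be truncated."""
    if n_tokens > max_seq_len:
        # The "too long" wording mirrors the message the commit says the server maps to 413.
        raise ValueError(f"prompt too long: {n_tokens} tokens > max_seq_len {max_seq_len}")


if __name__ == "__main__":
    print(validate_feature_ids([0, 17, 65535], n_features=65536))
    check_prompt_length(100, max_seq_len=8192)
```

Rejecting up front keeps both failure modes (an `IndexError` surfacing as a 500, and silent negative indexing) out of the engine call entirely.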
…art loop)
A device-side assert poisons the process's CUDA context (unrecoverable in-process), so ready=False alone only recovers under a readiness-based recycler. Add restart-on-exit recovery, which almost every host provides:
- core.generate: on an unrecoverable CUDA fault, if EXIT_ON_CUDA_WEDGE=1, os._exit(1) the worker (after ready=False). Default unset -> just fail-closed at 503 (safe for library/CLI/test use).
- launch_inference.sh: for `serve`, export EXIT_ON_CUDA_WEDGE=1 and wrap in a restart loop (respawn on crash/wedge exit; stop on clean exit / Ctrl-C 130 / SIGTERM 143).
Recovery now works with no external orchestrator (and composes with docker --restart / systemd / k8s).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
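The fail-closed-then-exit behavior described in this commit could look roughly like the sketch below. The `Engine` stand-in, `handle_generate_error`, and the substring checks inside `is_unrecoverable_cuda` are assumptions for illustration; the real `_is_unrecoverable_cuda` is not shown in this PR text and may test different conditions.

```python
import os


class Engine:
    """Stand-in for the real engine; only the readiness flag matters here."""

    def __init__(self):
        self.ready = True


def is_unrecoverable_cuda(exc: BaseException) -> bool:
    # Assumption: match on well-known CUDA error text; the real helper may differ.
    msg = str(exc).lower()
    return "device-side assert" in msg or "illegal memory access" in msg


def handle_generate_error(engine, exc, exit_fn=os._exit) -> bool:
    """Fail closed on a wedged CUDA context; return True if the fault was fatal."""
    if not is_unrecoverable_cuda(exc):
        return False
    engine.ready = False  # readiness checks now fail, so the server answers 503
    if os.environ.get("EXIT_ON_CUDA_WEDGE") == "1":
        # Exit immediately so a restart-on-exit supervisor can respawn the worker.
        exit_fn(1)
    return True
```

The `exit_fn` parameter exists only so the sketch can be exercised without killing the interpreter; leaving `EXIT_ON_CUDA_WEDGE` unset gives the library/CLI/test behavior the commit describes (not-ready, no exit).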
… 413 test
- launch_inference.sh: stop managing the venv; assume it's already active (Docker: on PATH; bare metal: source the evo2_megatron .venv first, like the tests). Drops the messy VENV= passing; adds a clear "bionemo.evo2 not importable" preflight.
- Restart loop signal fix (was a graceful-shutdown regression): run the worker in the background and `wait`, with a trap that forwards SIGTERM/SIGINT to it (uvicorn graceful shutdown) and stops the loop, so `docker stop`/k8s on PID 1 no longer orphans the worker. Adds a 10-restart cap + backoff so a persistent crash (e.g. port already bound) doesn't loop forever. Smoke-tested: SIGTERM stops in ~1s, not the worker's full lifetime.
- /generate 413 now pinned at the server layer: FakeEngine raises "too long" past max_seq_len and test_generate_rejects_too_long drives POST /generate -> 413 (was only covered via test_core).
- Reframe the CUDA-wedge comment: it's PURELY DEFENSIVE. _sanitize_steering neutralizes every client-reachable assert trigger, so a wedge implies a hardware/driver fault, not a crafted request (exit+restart is not a remote DoS). New triggers must extend _sanitize_steering.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
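The restart policy in this commit lives in a bash script; the sketch below restates only the loop logic (respawn on crash, stop on clean exit, cap + backoff) in Python so it can be read and tested in isolation. It is not the script's implementation: `supervise` is a hypothetical name, the linear backoff is an assumption (the commit only says "backoff"), and the signal-forwarding trap is deliberately omitted.

```python
import signal
import subprocess
import sys
import time

# Exit codes treated as "stop, do not respawn": clean exit, Ctrl-C (130), SIGTERM (143).
# Python reports death-by-signal as a negative code, so those forms are included too.
STOP_CODES = {0, 130, 143, -int(signal.SIGINT), -int(signal.SIGTERM)}


def supervise(cmd, max_restarts=10, backoff_s=1.0, sleep=time.sleep):
    """Run cmd, respawning on crash until a stop code or the restart cap.

    Returns (last_exit_code, number_of_launches).
    """
    launches = 0
    while True:
        launches += 1
        code = subprocess.call(cmd)
        if code in STOP_CODES:
            return code, launches
        if launches > max_restarts:
            # Persistent crash (e.g. port already bound): give up instead of looping.
            return code, launches
        sleep(backoff_s * launches)  # assumption: linear backoff


if __name__ == "__main__":
    crashing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(supervise(crashing, max_restarts=2, backoff_s=0.0))
```

With `max_restarts=10` the worker is launched at most 11 times (one initial launch plus ten restarts).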
Move the API routes under an /api prefix (one APIRouter + include_router) and, when a built frontend is configured (build_app(static_dir=...) or DASHBOARD_DIST env), mount it at / via StaticFiles(html=True). This lets a single container serve both the dashboard and the API on one origin: the frontend always calls /api/* (in dev via the Vite proxy, in prod from the same server).
The static mount is generic: it serves whatever dir it's pointed at and knows nothing about the dashboard; the dashboard recipe (#1623) supplies the dir + the Docker build. With no frontend configured the server is API-only and / 404s (never crashes). Startup already tolerates a load failure (stays not-ready -> 503), so a frontend+API smoke needs no GPU/checkpoints.
Tests: re-point existing contract tests to /api/*, add SPA-index/asset served, API-reachable-under-prefix, unknown-/api-is-404-not-SPA, and API-only-when-no-frontend.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
cdda2f7 to 28e49be
Summary
FastAPI server + CLI over the Evo2SAE engine (#1622). Thin wrappers (all model work lives in `core.py`) plus the input validation, resource governance, and recovery needed for a shared backend (runs behind NVIDIA SSO on Brev, reachable by many users). API routes live under `/api`, and the server can mount a prebuilt front-end at `/`, so the dashboard (#1623) and the API can be served from one origin / one container.
Rebased onto the single-engine #1622 (one inference engine serves both encode and generate; new top-level layout `interpretability/sparse_autoencoders/…`).
Contents (new layout)
- `…/src/evo2_sae/server.py`: `/api/health`, `/api/features`, `/api/annotate`, `/api/generate` (+ optional static-frontend mount at `/`)
- `…/src/evo2_sae/cli.py`: `serve` / `encode` / `batch` / `generate`
- `…/scripts/launch_inference.sh`; CPU contract tests `tests/test_cli.py`, `tests/test_server.py` + the shared `FakeEngine` appended to #1622's `tests/conftest.py`
Shared logic (CLI ⇄ server live in `core`)
- `core.annotate(engine, …)`: clean → resolve-tag → encode → tag-len, behind both CLI `encode` and server `/api/annotate`.
- `core.parse_clamp_spec(spec)`: one parser for clamps as CLI `"ID[:STRENGTH]"` strings or server `FeatureClamp` JSON; fed in front of #1622's `_sanitize_steering` so both surfaces validate identically.
Single-origin serving (`/api` + optional static mount)
- API routes live under `/api` (one `APIRouter` + `include_router`).
- `build_app(engine, static_dir=None)` mounts a prebuilt front-end at `/` via `StaticFiles(html=True)` when `static_dir` (or the `DASHBOARD_DIST` env) points at a real directory; otherwise the server is API-only and `/` 404s (never crashes). The mount is generic: it serves whatever dir it's pointed at and knows nothing about the dashboard; #1623 supplies the dir + the Docker build that produces it.
- The frontend always calls `/api/*` paths (the Vite proxy forwards `/api` without rewriting), so there's no dev/prod path drift.
Reliability & governance
- `/api/health` returns 503 until ready so readiness probes don't route to a still-loading pod; a startup load failure is caught and leaves the engine not-ready (503) rather than crashing.
- `/api/annotate` and `/api/generate` reject input longer than `max_seq_len` (413) instead of silently truncating (which would misalign the per-base `activations`/`bases` the viz plots). Generation length is otherwise auto-capped to the remaining context (no fixed token cap).
- `/api/annotate` `mode=pick` range-checks user-supplied `feature_ids` → 400 (an out-of-range id would otherwise 500 on `IndexError`, a negative one would silently return the wrong feature).
- Unsafe steering and sampling inputs (…, `temperature<=0`, negative `top_k`) are all rejected/coerced before the GPU (`_sanitize_steering`).
- On an unrecoverable CUDA fault, `generate()` flips the engine not-ready (→ 503) and, when `EXIT_ON_CUDA_WEDGE=1` (set by `serve`), exits the worker so any restart-on-exit supervisor respawns it: host-independent recovery.
- `launch_inference.sh serve` runs the worker in the background, forwards `SIGTERM`/`SIGINT` (uvicorn graceful shutdown) before respawning, with a retry cap + backoff, so `docker stop`/k8s shuts down cleanly instead of orphaning the worker.
- Request body size is limited (`MAX_BODY_BYTES`, default 16 MiB) → 413; the limit is advisory (trusts `Content-Length`).
- Concurrency is capped (`MAX_CONCURRENCY`, default 8); the engine lock already serializes the single GPU.
Architectural decisions
- All model work lives in `core.Evo2SAE` (#1622); `server.py`/`cli.py` are thin and share `core.annotate` + `core.parse_clamp_spec`, so the HTTP API and the CLI can't drift and there's one validated path.
- FastAPI provides routing, request validation, and a concurrency cap (`MAX_CONCURRENCY`) for almost no code; the domain validation that matters (`_sanitize_steering`, pick-id range) is manual either way. Raw Python would hand-roll routing/validation/concurrency; Flask would add the threadpool governance by hand.
- `/api` prefix + generic static mount (above) so one origin/container can serve both UI and API.
How to run
Run inside the evo2_megatron venv (provides `bionemo.evo2` + megatron); in the Docker image it's already active. Full dashboard run modes are in #1623's `feature_explorer/README.md`.
Tunables (env): `MAX_BODY_BYTES`, `MAX_CONCURRENCY`, `MAX_SEQ_LEN`, `PORT`, `EXIT_ON_CUDA_WEDGE`, `DASHBOARD_DIST`.
Tests
No dedicated CI lane (deferred, see #1622). Run them via the recipe's build script.
- CPU: `test_cli.py` + `test_server.py` (FastAPI `TestClient` + `FakeEngine`: response shapes, 400/413/503, pick out-of-range → 400, `/api/generate` too-long → 413, body-size, k-bounds, clamp validation, static-frontend mount: SPA at `/`, asset served, API reachable under `/api`, unknown `/api/*` → 404, API-only when no frontend), plus #1622's `test_core.py` + `test_steering.py` sanitize guards.
- GPU: `test_steering.py` covers encode, in-distribution generation, steering changes the continuation, batched/empty encode, max-clamp finite, and highlight↔steer interleaving (single-engine state-bleed check). Gated by `@pytest.mark.skipif(not torch.cuda.is_available())`: runs on a GPU box, skips otherwise. Validated on the 1B; the single-engine backend also serves the 7B at layer 26 live.
Deferred follow-up
Multi-GPU data-parallel replicas (one worker per GPU behind a `least_conn` balancer) for concurrent throughput; this touches no engine code and is left until concurrency is an observed need.
Stacked on #1622. The dashboard (#1623) builds on this.
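For reviewers weighing the "advisory" body-size limit listed under Reliability & governance, the check amounts to something like the sketch below. `body_too_large` and its handling of a malformed header are assumptions for illustration; only the `MAX_BODY_BYTES` name, the 16 MiB default, and the reliance on `Content-Length` come from the description.

```python
MAX_BODY_BYTES = 16 * 1024 * 1024  # 16 MiB default from the description


def body_too_large(headers, limit=MAX_BODY_BYTES):
    """Return True if the declared Content-Length exceeds the limit.

    `headers` is any mapping with lower-cased header names.
    """
    raw = headers.get("content-length")
    if raw is None:
        # Chunked uploads carry no Content-Length, so they pass this check:
        # this is exactly why the limit is advisory rather than enforced.
        return False
    try:
        declared = int(raw)
    except ValueError:
        # Assumption: treat a malformed header as oversized rather than trusting it.
        return True
    return declared > limit


if __name__ == "__main__":
    print(body_too_large({"content-length": str(32 * 1024 * 1024)}))
```

A client that understates `Content-Length` or streams a chunked body bypasses this check, so a hard cap would have to count bytes as they are read or be enforced by a reverse proxy in front of the server.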