Draft
54 commits
a07983e
feat(ndi-python): Phase A — install NDI-python + wire VHSB/compressio…
audriB May 13, 2026
b6ac0a6
feat(tabular_query): violin-chart endpoint + NDI-python SHA pinning +…
audriB May 14, 2026
3be7c96
fix(tabular_query): prefer numeric column when multiple match
audriB May 14, 2026
83a9358
feat(tabular_query): substring-match groupBy too (LLM ergonomics)
audriB May 14, 2026
d62610c
feat(backend): labchat wave — image, distinct_summary, Sprint 1.5 bin…
audriB May 14, 2026
6aebed9
feat(tabular_query): per-group docIds for granular sample-row citations
audriB May 14, 2026
bfba171
chore: remove inadvertent Finder dupe file_format 2.py
audriB May 14, 2026
0fc129b
fix(ontology): bypass stale stubs so NDI-python fallback fires across…
audriB May 14, 2026
26f71ad
fix(backend): aggregated audit findings — ontology resolution + tabul…
audriB May 14, 2026
b1bb29f
fix(csrf): exempt /api/ontology/batch-lookup so anonymous popovers re…
audriB May 14, 2026
6b1b9ef
fix(ontology): WBStrain scrape fallback + Caenorhabditis facet dedup
audriB May 14, 2026
aa11de6
fix(chat): probe→element class alias + typed binding-failure codes
audriB May 14, 2026
93f2887
feat(timeline): port treatment-timeline orchestration to Railway/Python
audriB May 14, 2026
eac08c9
feat(spike-summary): Python port of fetch_spike_summary orchestration
audriB May 14, 2026
74ddec9
feat(psth): peri-stimulus time histogram service + router
audriB May 14, 2026
31d2e0c
fix(auth): scope CSRF/session cookie Domain to *.ndi-cloud.com hosts
audriB May 15, 2026
b850d1f
chore(tests): scrub multiplication-sign from cookie_attrs test comments
audriB May 15, 2026
f3c5b75
fix(summary): widen epoch-count fallback chain to phase-A ingest classes
audriB May 15, 2026
0a3c008
fix(security+observability): Stream 1 quick wins — session-id log tru…
audriB May 15, 2026
9fc8b2d
test(compliance): Stream 2.1 — static regression test asserting no PH…
audriB May 15, 2026
9c2bc15
docs: Stream 4.8 — backend service-dependency README
audriB May 15, 2026
580a76b
fix(observability+test-isolation): Stream 5.5 sessions diagnostic + 6…
audriB May 15, 2026
d168134
feat(treatment-timeline): Stream 5.2 — treatment_drug class + adminis…
audriB May 15, 2026
0956236
feat: backend pieces — S3.4 enable_ask + S5.1 fuzzier substring + S5.…
audriB May 15, 2026
6ec72e9
feat(tables): Stream 5.8 — server-side pagination on /tables/{class}
audriB May 15, 2026
bc68b13
feat(aggregate-documents): Stream 4.9 — port aggregation to Railway (…
audriB May 15, 2026
27c93a6
feat(class-aliases): F-1c + F-1d + F-1e — surface legacy class data
audriB May 18, 2026
ea51ff3
feat: F-2 + F-3 — subject filter on tables + direction filter on depe…
audriB May 18, 2026
0231851
feat(stimulus): F-1 — /tables/stimulus projection for StimuliPicker
audriB May 18, 2026
44842e3
feat(tabular-query): F-8 — add POST variant alongside GET
audriB May 18, 2026
9e586b5
fix(projection): use REQUESTED class for _project_for_class dispatch
audriB May 18, 2026
e94fe0a
fix(treatment): F-1e complete — projection + row builder for treatmen…
audriB May 18, 2026
e0124f6
feat(tables): expose treatment_drug + treatment_transfer via /tables …
audriB May 18, 2026
4053119
fix(cache): bump table schema v4 → v5 to invalidate F-1d/F-1e blobs
audriB May 18, 2026
8401286
test(cache): update tests for v5 schema bump
audriB May 18, 2026
de2132d
feat(F-1b): broadcast treatments onto subject summary table
audriB May 18, 2026
a560a41
fix(F-1b): extend subject enrichment with treatment_drug + treatment_…
audriB May 18, 2026
e03d470
fix(signal): smart default file pick — skip channel_list.bin
audriB May 18, 2026
4181c12
fix(documents): apply class-alias chain in /documents listing (B2)
audriB May 18, 2026
5034249
fix(treatment-timeline): parse MATLAB datestr in stringValue (B3)
audriB May 18, 2026
48b9ce7
fix(binary): smart default file pick on image decode paths (B5 sweep)
audriB May 18, 2026
058107a
fix(B6): filter parent/aggregate session docs from counts.sessions
audriB May 18, 2026
9523950
fix(B6): bump SUMMARY_KEY_PREFIX v1 → v2 to invalidate stale entries
audriB May 18, 2026
cc64299
fix(B6): add session.reference prefix-suffix fallback for non-graph d…
audriB May 18, 2026
ba0dcd1
chore(B6): bump SUMMARY_KEY_PREFIX v2 → v3 for prefix-fallback rollout
audriB May 18, 2026
984ec66
chore(B6): add diagnostic log + bump cache v3 → v4 for fresh build
audriB May 18, 2026
302d1a7
chore(B6): surface filter diagnostic via warnings + bump cache v4 → v5
audriB May 18, 2026
1377bc6
chore(B6): move diagnostic upstream of depends_on early-exit + v6 cache
audriB May 18, 2026
15159c3
fix(B6): always prefer prefix-suffix when it filters; remove debug + …
audriB May 18, 2026
46f57f9
fix(F-1c): counts.probes aliases to elements when literal probe is 0
audriB May 18, 2026
357eabc
perf(F-7): aggregate_documents hydrates slim ndiquery refs via bulk_f…
audriB May 18, 2026
2981444
test(F-8): pin tabular_query GET == POST shape + validation parity
audriB May 18, 2026
7157bde
feat(S5.3): cross_table_pairs service + POST /cross-table-query route
audriB May 19, 2026
f6ecb83
test(F-1): apply preserved integration-test stub with respx fix
audriB May 19, 2026
136 changes: 136 additions & 0 deletions backend/SERVICE_DEPENDENCIES.md
@@ -0,0 +1,136 @@
# Backend service dependency map

**Audience:** contributors changing the FastAPI backend; operators
investigating an incident; auditors tracing data flow.

**Last updated:** 2026-05-15

This doc inventories every service the FastAPI backend depends on, plus
the services that call it, grouped by the direction of the dependency
(who calls whom). For each: what it's used for, how failure surfaces,
whether downtime is acceptable, and where the relevant code lives.

The complementary docs (in the sibling `ndi-cloud-app` repo):
- `apps/web/docs/operations/vendor-dependencies.md` — vendor + BAA
  inventory at the higher level
- `apps/web/docs/operations/disaster-recovery.md` — runbooks per
  failure mode

---

## Topology

```
      ┌──────────────────────────────┐
      │       FastAPI backend        │
      │   (this repo, on Railway)    │
      └─────┬───────────┬────────────┘
            │           │
            │           │
            ▼           ▼
     ┌─────────────┐ ┌──────────────┐
     │    Redis    │ │   Postgres   │
     │  (Railway)  │ │  (Railway)   │
     └─────────────┘ └──────────────┘
            │  (rate limits, sessions, table cache)
     ┌──────────────────────────────────┐
     │  ndi-cloud-node                  │
     │  (AWS Lambda + API Gateway)      │
     └──────┬───────────────────────────┘
            ├── AWS Cognito User Pool (identity)
            ├── AWS DocumentDB (metadata)
            └── AWS S3 (binary recordings)
```

---

## Outbound dependencies (what FastAPI calls)

### Redis (Railway-hosted)

| Field | Value |
|---|---|
| **Used for** | Session store (Fernet-encrypted access tokens), rate-limit counters, summary-table response cache, CSRF-failure budget |
| **Failure mode** | Sessions: every request returns 401 (forces re-login). Rate limit: middleware fails open (allows requests), following the swallow-error-and-pass pattern in `csrf.py:_maybe_promote_to_rate_limit`. Cache: every read becomes a miss (slower but correct). |
| **Acceptable downtime?** | Sessions: no — platform unusable. Rate limit + cache: yes, with degraded UX. |
| **Code surface** | `backend/auth/session.py` (sessions), `backend/middleware/rate_limit.py`, `backend/cache/redis_table.py`. |
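
The fail-open behavior in the rate-limit row can be sketched as follows. This is a minimal illustration, not the code in `backend/middleware/rate_limit.py`: `allow_request`, the `incr` callable (standing in for a Redis `INCR`), and the two toy counters are hypothetical names chosen for the example.

```python
from typing import Callable


def allow_request(incr: Callable[[str], int], key: str, limit: int) -> bool:
    """Return True if the request may proceed.

    `incr` stands in for a Redis INCR call (hypothetical). If it raises,
    the limiter fails open: the error is swallowed and the request is
    allowed, trading protection for availability.
    """
    try:
        count = incr(key)
    except Exception:
        return True  # counter store unreachable: fail open
    return count <= limit


def broken_incr(key: str) -> int:
    """Simulates Redis being down."""
    raise ConnectionError("redis unavailable")


_counts: dict[str, int] = {}


def memory_incr(key: str) -> int:
    """In-memory counter used only to exercise the happy path."""
    _counts[key] = _counts.get(key, 0) + 1
    return _counts[key]
```

With `broken_incr` every request is allowed; with `memory_incr` and `limit=3` the fourth request for the same key is rejected.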

### Postgres (Railway-hosted)

| Field | Value |
|---|---|
| **Used for** | pgvector RAG store for `/ask` semantic search; future `chat_usage_events` table (Stream 3) for per-user cost tracking. |
| **Failure mode** | Semantic search returns soft error; chat falls back to structured catalog tools. |
| **Acceptable downtime?** | Yes — chat works without semantic search via fallback. |
| **Code surface** | The RAG-store schema lives in the sibling `ndi-cloud-app` repo at `apps/web/lib/ai/db/`. The cloud-app side reads pgvector directly via `@vercel/postgres`. FastAPI doesn't currently touch the RAG store; it will start using this database when Stream 3.2 (`chat_usage_events`) lands. |

### ndi-cloud-node (AWS Lambda)

| Field | Value |
|---|---|
| **Used for** | All catalog reads, all auth (Cognito-backed login), all dataset metadata, all NDI Query DSL evaluation, all binary-document downloads (proxied via signed S3 URLs). |
| **Failure mode** | Circuit breaker opens after 5 consecutive failures (default `CLOUD_CIRCUIT_BREAKER_THRESHOLD`); cooldown 30s. While the breaker is open, every FastAPI request that needs the cloud fails with a typed `CloudUnreachable` error, returned to the client as 503 `cloud_unreachable`. |
| **Acceptable downtime?** | No — platform unusable. AWS SLO is the binding constraint. |
| **Code surface** | `backend/clients/ndi_cloud.py` (HTTP client + circuit breaker), `backend/clients/circuit_breaker.py`. |
| **Auth** | Bearer access-token (Cognito JWT) per-request, no service account; the user's session-stored token is decrypted and forwarded on the request. |
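
A consecutive-failure circuit breaker with the threshold and cooldown quoted above might look roughly like this. It is a sketch under assumptions, not the contents of `backend/clients/circuit_breaker.py`: the class shape, the method names, and the choice to let a trial request through once the cooldown elapses are all hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial after `cooldown` seconds."""

    def __init__(
        self,
        threshold: int = 5,
        cooldown: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock  # injectable so tests need not sleep
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True when a call may be attempted."""
        if self._opened_at is None:
            return True
        # Assumption: once the cooldown has elapsed, a trial call is allowed.
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        """Any success resets the consecutive-failure count and closes the breaker."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker at the threshold."""
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()
```

A caller would check `allow()` before each upstream request and return the 503 immediately when it is False, then report the outcome with `record_success()` or `record_failure()`.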

### AWS S3 (via signed URLs)

| Field | Value |
|---|---|
| **Used for** | Binary recording downloads. ndi-cloud-node returns a signed S3 URL; FastAPI forwards the URL to the client OR streams the bytes through (depending on size). |
| **Failure mode** | Binary downloads return 502. Catalog reads + metadata are unaffected. |
| **Code surface** | `backend/clients/_url_allowlist.py` enforces an allowlist of S3 hostnames before any FastAPI-side download proxy. The May 2026 audit (`test_download_from_off_allowlist_host_hard_rejects`) verifies the allowlist rejects non-S3 hosts even when ndi-cloud-node returns a redirect to one. |
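
As an illustration of the hostname allowlist idea, the check below accepts only HTTPS URLs whose parsed hostname ends with an allowed suffix. The suffix list and the function name are assumptions made for this sketch; the real rules live in `backend/clients/_url_allowlist.py` and may differ (for example, regional S3 endpoints are not covered here).

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration only.
ALLOWED_HOST_SUFFIXES = (".s3.amazonaws.com",)


def is_allowed_download_url(url: str) -> bool:
    """Return True only for https URLs whose hostname matches the allowlist."""
    try:
        parts = urlparse(url)
    except ValueError:
        return False
    if parts.scheme != "https":
        return False
    # `hostname` drops userinfo and port, so neither can smuggle in a match.
    host = parts.hostname or ""
    return any(host.endswith(suffix) for suffix in ALLOWED_HOST_SUFFIXES)
```

Matching on the parsed hostname rather than on the raw string is what keeps a URL such as `https://evil.example/my-bucket.s3.amazonaws.com` from passing.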

### OpenTelemetry collector (optional)

| Field | Value |
|---|---|
| **Used for** | Trace export when `OTEL_EXPORTER_OTLP_ENDPOINT` is non-empty. Default: empty (tracing disabled). |
| **Failure mode** | Tracing dropped silently. No impact on application requests. |
| **Code surface** | `backend/observability/` (sender), `backend/middleware/request_id.py` (per-request id propagation). |

---

## Inbound dependencies (who calls FastAPI)

### Vercel-hosted ndi-cloud-app frontend (production + preview)

| Field | Value |
|---|---|
| **Used for** | Every `/api/*` request from the browser is proxied to FastAPI via Vercel `rewrites()`. Same for RSC-server-side fetches (`INTERNAL_API_URL`). |
| **Auth posture** | Cookie + CSRF — matches the FastAPI middleware contract. |
| **Branch awareness** | The cloud-app's `feat/experimental-ask-chat` branch routes `/api/*` to **this** experimental FastAPI env (`ndb-v2-experimental`) via the branch-aware rewrite. Main branch routes to production FastAPI. See ADR-005 in the cloud-app repo. |

### vh-lab-chatbot + shrek-lab-chatbot

| Field | Value |
|---|---|
| **Used for** | These two sibling chatbots historically read the same Postgres RAG index. Today they don't call FastAPI directly — they query their own embedding indices. Listed here for completeness because they share the Voyage API key (incident-prone: see the May 2026 leaked-credentials postmortem in the cloud-app repo). |

---

## Service-startup order

The FastAPI app's lifespan handler (`backend/app.py:lifespan`) starts services in this order:

1. **NdiCloudClient.start()** — opens the httpx pool. Lazy DNS, no
   eager call to the cloud.
2. **SessionStore** — instantiates with the Fernet key from settings.
3. **RateLimiter** — Redis-backed; lazy on first use.
4. **Ontology cache** — SQLite at `ONTOLOGY_CACHE_DB_PATH`, created if
   absent.

Shutdown is reverse order. If startup fails at any step, the container
crashes before serving the first request — by design (fail-loud).
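
The start-in-order, stop-in-reverse contract can be sketched with `contextlib.AsyncExitStack`. This is an illustrative stand-in, not the handler in `backend/app.py`: the `service` helper, the service names, and the `events` list are hypothetical and exist only to make the ordering observable.

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

events: list[str] = []


@asynccontextmanager
async def service(name: str):
    """Toy service that records when it starts and stops."""
    events.append(f"start {name}")
    try:
        yield name
    finally:
        events.append(f"stop {name}")


@asynccontextmanager
async def lifespan():
    """Start services in order; AsyncExitStack unwinds them in reverse."""
    async with AsyncExitStack() as stack:
        for name in ("cloud_client", "session_store", "rate_limiter", "ontology_cache"):
            # An exception here propagates, so startup fails loudly and
            # services that already started are still shut down.
            await stack.enter_async_context(service(name))
        yield


async def main() -> None:
    async with lifespan():
        events.append("serving")


asyncio.run(main())
```

After the run, `events` lists the four starts in order, then `serving`, then the four stops in reverse order.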

---

## Update history

| Date | Change |
|---|---|
| 2026-05-15 | Initial draft (Stream 4.8 deliverable). |
118 changes: 118 additions & 0 deletions backend/app.py
@@ -36,17 +36,25 @@
from .observability.logging import configure_logging, get_logger, request_id_ctx
from .observability.tracing import init_tracing
from .routers import (
aggregate_documents,
auth,
binary,
datasets,
documents,
health,
image,
ndi_dataset,
ontology,
psth,
query,
signal,
spike_summary,
tables,
tabular_query,
treatment_timeline,
visualize,
)
from .services.dataset_binding_service import DatasetBindingService
from .services.ontology_cache import OntologyCache
from .services.ontology_service import OntologyService
from .static_files import safe_static_path
@@ -255,6 +263,98 @@ async def _facets_warm() -> None:
log.info("keepwarm.started", interval_seconds=240)
log.info("facets_warm.started", interval_seconds=240)

# NDI-python strict-boot check.
#
# The Phase A integration adds vlt (VHSB), ndicompress, and
# ndi.ontology. When `NDI_PYTHON_REQUIRED=1` (set by the Railway
# Dockerfile), the stack MUST be importable or we hard-fail.
# Unset (dev/test/CI), we log a warning if NDI is missing but
# keep going — every NDI-python call gracefully returns None and
# callers fall through to their legacy paths.
#
# Why an explicit env var rather than guessing from
# `settings.ENVIRONMENT`: the test/CI/local matrix is fuzzy, and
# the only thing that actually matters here is "is this image
# supposed to have NDI-python installed?" The Dockerfile knows;
# nothing else needs to.
import os as _os
if _os.environ.get("NDI_PYTHON_REQUIRED", "").strip() in ("1", "true", "yes"):
from .services import ndi_python_service as _ndi
if not _ndi.is_ndi_available():
raise RuntimeError(
"ndi_python_service.is_ndi_available() returned False at "
"startup but NDI_PYTHON_REQUIRED=1. The NDI-python stack "
"(vlt, ndicompress, ndi.ontology) failed to import. Check "
"the Dockerfile's pinned git SHAs and the install layer logs."
)
log.info("ndi_python.boot_ok")

# Sprint 1.5 dataset-binding service — singleton, lives on app.state.
# Always instantiated (cheap object — empty LRU). The router behind
# ``/api/datasets/{id}/ndi_overview`` calls into it; on internal
# failure (NDI-python missing, cloud unreachable, etc.) the service
# returns None and the router maps that to a 503. Frontend tool
# falls back to ndi_query gracefully.
app.state.dataset_binding_service = DatasetBindingService()

# Optional pre-warm of the 3 demo datasets. We fire-and-forget per
# dataset so a single failure doesn't block the others. Each task
# is parked on app.state so asyncio doesn't GC the reference
# mid-flight (RUF006). We DO NOT await them — they run in the
# background while the app starts serving requests immediately.
#
# If NDI-python isn't available, the service returns None on the
# first call and we skip the rest — costs essentially nothing.
async def _prewarm_dataset(dataset_id: str) -> None:
try:
log.info("dataset_binding.prewarm_start", dataset_id=dataset_id)
result = await app.state.dataset_binding_service.get_dataset(
dataset_id
)
if result is not None:
log.info(
"dataset_binding.prewarm_done",
dataset_id=dataset_id,
)
else:
# Service already logged the reason at WARN — keep this
# at INFO so the boot timeline is one-line-per-dataset.
log.info(
"dataset_binding.prewarm_skipped",
dataset_id=dataset_id,
)
except _asyncio.CancelledError:
raise
except Exception as exc:
# Truly defensive: get_dataset() is documented to never
# raise, but log loudly if that contract breaks so we know.
log.warning(
"dataset_binding.prewarm_unexpected_raise",
dataset_id=dataset_id,
error=str(exc),
error_type=type(exc).__name__,
)

# Three demo datasets surfaced by the experimental /ask chat:
# Dabrowska BNST (EPM behavior), Bhar (chemotaxis), Haley
# (patch-encounter). Order does not matter; tasks run concurrently.
# Pre-warm is gated to production-like environments so dev/test
# boots stay fast.
if settings.ENVIRONMENT in ("production", "preview"):
prewarm_ids = (
"67f723d574f5f79c6062389d", # Dabrowska BNST
"69bc5ca11d547b1f6d083761", # Bhar
"682e7772cdf3f24938176fac", # Haley
)
app.state.dataset_binding_prewarm_tasks = [
_asyncio.create_task(_prewarm_dataset(did))
for did in prewarm_ids
]
log.info(
"dataset_binding.prewarm_started",
count=len(prewarm_ids),
)

log.info("app.startup", environment=settings.ENVIRONMENT)
try:
yield
Expand All @@ -272,6 +372,16 @@ async def _facets_warm() -> None:
# so it surfaces in logs instead of disappearing.
with _contextlib.suppress(_asyncio.CancelledError):
await task
# Cancel any in-flight dataset-binding pre-warm tasks.
# downloadDataset is blocking I/O inside asyncio.to_thread — we
# can't actually interrupt it mid-thread, but cancellation
# prevents the post-download cache-write from running after
# teardown.
prewarm_tasks = getattr(app.state, "dataset_binding_prewarm_tasks", None) or []
for task in prewarm_tasks:
task.cancel()
with _contextlib.suppress(_asyncio.CancelledError, Exception):
await task
await cloud_client.close()
await ontology_service.close()
# `redis.asyncio.Redis.aclose()` is the correct async-context
@@ -422,8 +532,16 @@ async def handle_unhandled(request: Request, exc: Exception) -> JSONResponse:
app.include_router(tables.router)
app.include_router(query.router)
app.include_router(query.facets_router)
# Stream 4.9 (2026-05-16) — heavy aggregate runs on Railway, not Vercel.
app.include_router(aggregate_documents.router)
app.include_router(binary.router)
app.include_router(signal.router)
app.include_router(image.router)
app.include_router(tabular_query.router)
app.include_router(treatment_timeline.router)
app.include_router(psth.router)
app.include_router(spike_summary.router)
app.include_router(ndi_dataset.router)
app.include_router(ontology.router)
app.include_router(visualize.router)

77 changes: 70 additions & 7 deletions backend/auth/cookie_attrs.py
@@ -1,18 +1,81 @@
 """Per-environment cookie attribute helper.

 Centralizes the ``Set-Cookie`` / ``Delete-Cookie`` attribute set used by
-the session and CSRF cookies. Production carries
-``Domain=.ndi-cloud.com`` so the apex Vercel deployment can read cookies
-issued by the Railway backend after the cross-repo unification (Phase
-4); dev keeps host-only + insecure for plain-HTTP localhost; everything
-else (e.g. staging) is host-only + secure.
+the session and CSRF cookies.
+
+Domain attribute
+----------------
+
+Production carries ``Domain=.ndi-cloud.com`` ONLY when the request
+originates from ``*.ndi-cloud.com`` so the apex Vercel deployment can
+read cookies issued by the Railway backend (cross-repo unification,
+Phase 4).
+
+Vercel **preview** deployments at ``*.vercel.app`` get host-only
+cookies. A Set-Cookie that carries ``Domain=.ndi-cloud.com`` on a
+response served back to a non-``ndi-cloud.com`` host is silently
+rejected by the browser — the cookie spec forbids servers from
+setting cookies for domains they don't control. That's why
+preview-time login was breaking with ``CSRF_INVALID`` errors before
+this fix (2026-05-14 tutorial-parity smoke).
+
+Other attributes
+----------------
+
+Dev keeps host-only + insecure for plain-HTTP localhost. Staging (and
+any other ENVIRONMENT value) is host-only + secure.
 """
 from typing import Any
+from urllib.parse import urlparse

+from fastapi import Request
+
 from ..config import Settings


-def cookie_attrs(settings: Settings) -> dict[str, Any]:
+def cookie_attrs(settings: Settings, *, request: Request) -> dict[str, Any]:
+    """Return the Set-Cookie attribute dict for the current env + request.
+
+    The ``request`` parameter is required: the per-request Origin (or
+    Referer) is what decides whether the Domain attribute is safe to
+    attach. Old callers that passed only ``settings`` must be updated —
+    silently guessing wrong is what broke preview login.
+    """
     if settings.ENVIRONMENT == "production":
-        return {"secure": True, "domain": ".ndi-cloud.com"}
+        if _request_from_ndi_cloud(request):
+            return {"secure": True, "domain": ".ndi-cloud.com"}
+        # Preview / vercel.app / anything else served by the production
+        # backend: secure but host-only. The browser will accept these
+        # because the cookie's implicit Domain matches the response
+        # origin (the preview hostname).
+        return {"secure": True}
     return {"secure": settings.ENVIRONMENT != "development"}
+
+
+def _request_from_ndi_cloud(request: Request) -> bool:
+    """Was this request issued by a browser tab on ``*.ndi-cloud.com``?
+
+    Reads the Origin header (modern browsers send it on cross-origin
+    requests and on same-origin POSTs), with a fallback to Referer for
+    older clients and for same-origin GETs, which omit Origin.
+    Returns True only if the URL's hostname is exactly
+    ``ndi-cloud.com`` or a subdomain of it.
+
+    Returns False when:
+    - both Origin and Referer are missing or unparseable
+    - the host is neither ``ndi-cloud.com`` nor a subdomain (e.g. preview)
+    """
+    for header_name in ("origin", "referer"):
+        raw = request.headers.get(header_name)
+        if not raw:
+            continue
+        try:
+            parts = urlparse(raw)
+        except ValueError:
+            continue
+        host = parts.hostname  # drops userinfo and port; already lower-cased
+        if not host:
+            continue
+        if host == "ndi-cloud.com" or host.endswith(".ndi-cloud.com"):
+            return True
+    return False