A production-shaped Retrieval-Augmented Generation app where per-document sharing is a first-class part of the retrieval predicate, not a post-hoc filter. Multi-user from day one — every chunk carries an ACL, every retrieval call runs under the viewer's JWT, every tool-call attribution in the chat UI surfaces why the viewer can see a chunk.
Raw OpenAI SDK + Pydantic (no LLM frameworks), FastAPI backend, React/Vite/Tailwind frontend, Supabase (Postgres + pgvector + Auth + Storage + Realtime), LangSmith observability.
Tool-call attribution renders a per-chunk badge — "via owner" / "via direct grant" / "via {group}" — so the viewer can see exactly which ACL rule granted them access to each retrieved chunk.
The retrieval path is evaluated in two cuts: a correctness eval that proves the security property holds at small scale, and a scale benchmark that characterises the recall curve as the visible set shrinks.
Security — fraction of no_access runs that returned zero gold chunks
(50 questions × 3 modes × 3 viewer setups, 14-chunk Acme corpus):
| Mode | Pre-filter | Post-filter |
|---|---|---|
| vector | 1.000 | 1.000 |
| keyword | 1.000 | 1.000 |
| hybrid | 1.000 | 1.000 |
Pre-filter is the load-bearing row — security is enforced in the SQL predicate, not a Python drop after the fact (post-filter passes too but could in principle leak via timing or payload size).
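The security number in the table above is a plain fraction. A minimal sketch of how it could be computed, assuming each eval run is recorded with a viewer-setup label and the set of gold chunk ids it returned. The `Run` record, its field names, and the setup labels other than `no_access` are illustrative, not the runner's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    # Hypothetical record shape; the real runner's schema may differ.
    viewer_setup: str                                   # e.g. "owner", "no_access"
    returned_gold: set[str] = field(default_factory=set)


def no_access_zero_leak_rate(runs: list[Run]) -> float:
    """Fraction of no_access runs that returned zero gold chunks."""
    no_access = [r for r in runs if r.viewer_setup == "no_access"]
    if not no_access:
        raise ValueError("no no_access runs to score")
    clean = sum(1 for r in no_access if not r.returned_gold)
    return clean / len(no_access)
```

A value of 1.000 means no run under a viewer without access surfaced a gold chunk; anything lower is a leak.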
Recall@5 across viewers, ef_search × selectivity sweep (15 multi-hop queries against a synthetic Wikipedia 10k-chunk corpus, gold = top-5 at the most exhaustive sweep):
| Viewer | Visible chunks | Selectivity | ef_search=40 | ef_search=80 | ef_search=200 | ef_search=500 (gold) |
|---|---|---|---|---|---|---|
| viewer_50pct | 5,000 | 50.0% | 1.000 | 1.000 | 1.000 | 1.000 |
| viewer_10pct | 1,000 | 10.0% | 1.000 | 1.000 | 1.000 | 1.000 |
| viewer_1pct | 100 | 1.0% | 1.000 | 1.000 | 1.000 | 1.000 |
Every cell is 1.000 because at 10k chunks the Postgres planner sidesteps
HNSW entirely — it bitmap-scans chunk_acl, index-scans the visible
chunks, sorts exactly by embedding distance, and takes top-5. EXPLAIN ANALYZE confirms; ef_search is a no-op in that plan. The eval
infrastructure (10k seed, viewer ACL setup, sweep, regression alarm) is
shipped; the recall curve surfaces at the corpus size where exact NN
over the filtered set becomes more expensive than HNSW + post-filter
(tens to hundreds of thousands of visible chunks per query). The
nightly workflow fails loudly if the configured recall floor is
breached. See docs/permissions-aware-rag.md
§5b for the full plan output.
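For orientation, recall@5 against a gold top-5 and the floor check behind the regression alarm can be sketched as below. This is an illustration under assumed names (`recall_at_k`, `cells_below_floor`, and the cell-label format are hypothetical), not the benchmark runner's code:

```python
def recall_at_k(retrieved: list[str], gold: list[str], k: int = 5) -> float:
    """Share of gold chunk ids that appear in the first k retrieved ids."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold set is empty")
    return len(set(retrieved[:k]) & gold_set) / len(gold_set)


def cells_below_floor(recalls: dict[str, float], floor: float) -> list[str]:
    """Return the sweep cells (viewer x ef_search) whose recall breaches the floor."""
    return sorted(cell for cell, value in recalls.items() if value < floor)
```

A nightly job built on this would exit non-zero whenever `cells_below_floor` returns a non-empty list.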
The naive approach to per-document sharing in a RAG retriever is to
leave the vector search alone and post-filter the results: pull
top-k chunks by similarity, then drop the ones the viewer can't see.
This fails on selective ACLs in a way that's easy to miss. The math:
if a viewer can see 5% of the corpus and we ask for top-10, the
expected number of visible chunks in that result is
k × selectivity = 10 × 0.05 = 0.5 — half a chunk on average. The
viewer most often sees zero relevant chunks; multi-hop questions that
need two chunks become unanswerable. "Fetch more candidates and
post-filter harder" doesn't rescue it — at 5% selectivity you'd need
top-100 to expect five visible chunks, and post-filtering top-100
means embedding distance is no longer ranking the visible chunks
against each other. The fix is to push the ACL check into the SQL
predicate so the planner is choosing among visible candidates from the
start — which then opens a second gotcha around HNSW behaviour under
selective filters. The full write-up is in
docs/permissions-aware-rag.md.
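The arithmetic above is easy to reproduce. The sketch below adds one figure the text does not state: the chance that a top-k result contains no visible chunk at all. It treats each of the k slots as independently visible with probability equal to the selectivity, which is a simplifying assumption (similarity-ranked results are not a random sample of the corpus):

```python
def expected_visible(k: int, selectivity: float) -> float:
    """Expected number of visible chunks in a top-k result: k x selectivity."""
    return k * selectivity


def prob_zero_visible(k: int, selectivity: float) -> float:
    """Chance that none of the k results is visible (independence assumed)."""
    return (1.0 - selectivity) ** k


if __name__ == "__main__":
    print(expected_visible(10, 0.05))     # half a chunk on average
    print(expected_visible(100, 0.05))    # top-100 needed to expect five
    print(prob_zero_visible(10, 0.05))    # roughly 0.6 under this model
```

Under that model, a viewer at 5% selectivity gets an empty post-filtered top-10 about 60% of the time, which matches the "most often sees zero relevant chunks" observation.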
- Chat with streaming: OpenAI Responses or Chat Completions API, configurable per-request, streamed token-by-token to the UI. Tool calls and results persist alongside messages.
- Drag-and-drop ingestion: `.txt` / `.md` / `.pdf` / `.docx` / `.html` parsed via docling, chunked, embedded, indexed. Live status updates via Supabase Realtime. Document-level metadata (title, authors, topics, dates) extracted via LLM structured outputs.
- Hybrid retrieval: vector (pgvector HNSW) + keyword (Postgres full-text) fused via Reciprocal Rank Fusion. Optional reranker layer: Cohere, Voyage, or LLM-as-judge. All retrieval runs under the user's JWT; RLS enforces per-user visibility.
- Per-document sharing: share documents with individual users or groups via the per-chunk ACL system. Share dialog in the ingestion UI. Per-chunk badges in chat tool attribution show why the viewer can see each chunk.
- Workspace tenant isolation: a hard tenant boundary above per-document sharing. A chunk is visible only if the viewer is a member of its document's workspace, AND-ed into the same `SECURITY INVOKER` retrieval predicate (resolved from the viewer's JWT, never a backend-passed tenant id) and mirrored in the table RLS. Existing data lives in one operator-managed Default Workspace; the boundary bites once a second workspace exists. See `docs/adr/0002-workspace-tenant-isolation.md`.
- Structured RAG (text-to-SQL): `query_database` tool over an allowlisted read-only schema, with a semantic-layer-aware compiler so the LLM doesn't have to know table internals.
- Web search fallback: `web_search` tool when local retrieval is insufficient.
- Sub-agents: `spawn_document_agent` launches a sub-agent with isolated context and purpose-specific tools.
- Retrieval eval suite: 50-question golden set, runner that exercises vector / keyword / hybrid against the real backend functions, recall@k / MRR / nDCG@5 metrics, optional generation + LLM-judge step. PR CI posts a delta-vs-`main` comment; nightly publishes snapshots to `docs/nightly/`.
- RAGAS metrics: the four canonical RAG-eval scores (Faithfulness, Answer Relevancy, Context Precision, Context Recall) computed weekly alongside the custom Claude judge and published to `docs/ragas-weekly/`.
- Permissions scale benchmark: Wikipedia 10k synthetic corpus, ef_search sweep across three permission selectivities, nightly workflow with regression alarm.
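Reciprocal Rank Fusion, named in the hybrid-retrieval item above, scores each chunk by summing `1 / (k + rank)` over the rankings it appears in. A minimal sketch, assuming each retriever returns an ordered list of chunk ids; the function name and the alphabetical tie-break are illustrative choices, and the default `k=60` mirrors the documented `HYBRID_RRF_K` default:

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked id lists; returns (chunk_id, score) pairs, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]
    keyword_hits = ["b", "d", "a"]
    for chunk_id, score in reciprocal_rank_fusion([vector_hits, keyword_hits]):
        print(chunk_id, round(score, 5))
```

Chunks that appear in both lists accumulate two terms, so they outrank chunks that only one retriever found, without needing the two retrievers' raw scores to be comparable.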
Long-form writeups for the parts of the system that benefit from prose explanation — the kind of context a code review won't recover:
| Doc | What it covers |
|---|---|
| `docs/permissions-aware-rag.md` | The post-filter recall problem, the four-table data model, the SQL predicate, the HNSW interaction, the eval tables, deliberate v0 scope cuts (group nesting, write-vs-read tiers). |
| `docs/adr/0002-workspace-tenant-isolation.md` | Phase 2: the Workspace tenant boundary layered above owner-OR-ACL. Covers where the boundary is enforced (membership clause inside the retrieval predicate, never a backend-passed tenant id), how existing data migrates into a Default Workspace, the alternatives rejected, and the Identity Boundary (AU3): what an integrator may swap in the auth stack (federation-edge only) versus the welded Supabase-JWT pass-through floor. |
| `docs/evals.md` | Corpus design, the 50-question golden set, what each metric measures and what it doesn't, a worked example of CI catching a regression (Δ -0.510 on recall@5 from a one-line chunk-size change), a frank list of the eval's limitations, and the E7 escalation eval (§6) - the deflection-pipeline golden set, why its deterministic legs gate per-PR while the LLM-judged legs run weekly, and the false-resolve ceiling as a pinned safety invariant. |
| `docs/structured-rag.md` | The semantic-layer-aware text-to-SQL compiler, allowlisted schemas, the read-only role boundary. |
| `docs/ingestion-parser-adapters.md` | Write your own `DocumentParser`: the load-bearing markdown-string contract, the edits to add one (subclass + `PARSER` validation + `build_parser`), `PARSER` selection, proving the round-trip, and Unstructured.io as the canonical buyer-written adapter. |
The eval tables in docs/permissions-aware-rag.md are auto-embedded
from the runner-generated summary.md files via marker comments:
python -m evals.retrieval.runner # populates evals/retrieval/summary.md
python -m evals.permissions_scale.runner # populates evals/permissions_scale/summary.md (after wikipedia_seed)
python -m docs._embed_eval_summaries # injects into docs/permissions-aware-rag.md

backend/ FastAPI service (Dockerfile, railway.toml, fly.toml)
frontend/ React + Vite + Tailwind (vercel.json)
supabase/ Migrations + local CLI config
evals/retrieval/ 50-question golden set + E7 escalation golden set + runners + CI workflow integration
evals/permissions_scale/ Wikipedia 10k corpus benchmark + nightly workflow
evals/structured_rag/ Text-to-SQL eval
db_seed/ Deterministic seeders for the eval corpora
docs/ Long-form writeups (evals, structured RAG, permissions-aware RAG)
.github/workflows/ PR + nightly eval workflows
.claude/ Agent task specs (not needed to run the app)
Prerequisites: Node 20+, Python 3.11+, Docker Desktop (for local Supabase), Supabase CLI, OpenAI API key.
# 1. Start the local Supabase stack (Postgres + pgvector + GoTrue + Storage + Studio)
# Brings up Docker containers and applies all migrations in supabase/migrations/.
supabase start
supabase status # note API_URL, SERVICE_ROLE_KEY, DB_URL for env files
# 2. Backend
cd backend
cp .env.example .env # fill in the values below
pip install -r requirements.txt
uvicorn main:app --reload --port 8000
# 3. Frontend
cd ../frontend
cp .env.example .env # fill in VITE_SUPABASE_* + VITE_BACKEND_URL
npm install
npm run dev # http://localhost:5173

To run against hosted Supabase instead of local, push migrations with `supabase db push --linked` and point `SUPABASE_URL` / `VITE_SUPABASE_URL` at the hosted project URL. No other code changes are needed.
| Var | Required | Notes |
|---|---|---|
| `SUPABASE_URL` | yes | `https://<project>.supabase.co` (hosted) or `http://127.0.0.1:54321` (local) |
| `SUPABASE_ANON_KEY` | yes | Used to call GoTrue for JWT validation |
| `SUPABASE_SERVICE_ROLE_KEY` | yes | Reserved for system-level ops (share API owner-lookup, ingestion, support-bot provisioning via the GoTrue admin API - US-069, `backend/support_bot.py`, the backend-mediated conversation-token surface - issuance + the `resume_conversation` RPC, US-071, `backend/conversation_tokens.py`, and the anonymous public widget-key resolution gate - US-072, `backend/widget_keys.py`); never used to touch user data on the retrieval path (RLS enforced via user JWT). The public widget endpoints fail closed with a 503 when it is unset |
| `SUPABASE_JWT_SECRET` | only for support bot | The project JWT secret GoTrue signs with. The support bot self-signs its short-lived bot token with it so `auth.uid()`/RLS resolve it natively (US-068, `backend/supabase_jwt.py`); a knowledge-assistant-only deploy leaves it blank. NEW signing surface - keep server-side only, never embed client-side |
| `SUPPORT_BOT_EMAIL_DOMAIN` | no | Internal, non-routable email domain for the per-workspace support bot's `auth.users` row (US-069, `backend/support_bot.py`). Default `bots.support.internal`. The bot row is admin-created with `email_confirm=true` and no password, so the address never logs in or receives mail |
| `OPENAI_API_KEY` | yes | |
| `OPENAI_MODEL` | no | Default `gpt-4o-mini` |
| `OPENAI_VECTOR_STORE_ID` | no | Enables `file_search` retrieval when set |
| `PARSER` | no | Ingestion parser: `docling` (default) / `llamaparse` / `unstructured`. Invalid value fails fast at startup. To add your own, see `docs/ingestion-parser-adapters.md` |
| `LLAMA_CLOUD_API_KEY` | only if `PARSER=llamaparse` | LlamaParse cloud key; checked at startup, not first ingest |
| `FRONTEND_ORIGIN` | yes (prod) | Comma-separated list of allowed CORS origins for the authenticated app surface (`/api/*`, `/healthz`). Defaults to `http://localhost:5173` for dev. The public widget surface (`/widget/*`) does NOT use this - it has its own posture keyed off each active widget key's registered origins (US-074) |
| `WIDGET_CORS_ORIGIN_CACHE_TTL` | no | Seconds the public-widget CORS layer caches the union of active-key registered origins before re-reading under the service role. Default `30`; must be > 0. Issuing/revoking a key invalidates the cache immediately on that instance; the TTL is the cross-instance backstop (US-074) |
| `RATE_LIMITER` | no | Backend for the public-widget abuse/cost-DoS rate limiter (US-075 seam, `backend/rate_limiting.py`). `postgres` (default - durable counter rows reached over PostgREST via service-role RPCs) or `redis`. No in-memory backend by design (it would under-count per replica and reset on restart). Fails closed at startup on a misconfigured backend; the limiter is only built when support is configured (`SUPABASE_SERVICE_ROLE_KEY` set) |
| `REDIS_URL` | only if `RATE_LIMITER=redis` | Redis connection URL for the Redis limiter backend. The `redis` package is an optional dependency (not in `requirements.txt`; `pip install redis`). Checked at startup |
| `WIDGET_RATE_LIMIT_WINDOW_SECONDS` | no | Sliding-window length (seconds) for the public-widget per-key + per-session/IP rate limits. Default `60`; must be > 0 (US-076) |
| `WIDGET_RATE_LIMIT_PER_KEY` | no | Max requests per `public_key` per window, aggregated across every session/IP. Default `300`; must be > 0. A breach refuses with a 429 + `Retry-After`, having done no retrieval/LLM work (US-076) |
| `WIDGET_RATE_LIMIT_PER_SESSION` | no | Max requests per session/IP (best-effort left-most `X-Forwarded-For` hop) per window, across every key. Default `30`; must be > 0. Defense-in-depth - the per-key window and an edge/WAF limiter (P5) are the harder bounds (US-076) |
| `CHAT_MODE_DEFAULT` | no | `responses` or `completions`. Defaults to `responses` on an `openai` answerer, `completions` on any other provider. `responses` is OpenAI-only and fails closed at startup on a non-`openai` answerer; see `docs/model-surface.md` |
| `CHAT_HISTORY_MAX_TURNS` | no | Default `10` |
| `RETRIEVAL_MODE` | no | `hybrid` (default) / `vector` / `keyword`. Safety escape hatch; production uses `hybrid` |
| `SEARCH_SIMILARITY_THRESHOLD` | no | Cosine threshold for `match_chunks` filter. Default `0.3` |
| `HYBRID_RRF_K` | no | RRF damping constant. Default `60` |
| `RERANKER` | no | `none` (default) / `cohere` / `voyage` / `llm` |
| `COHERE_API_KEY` | only if `RERANKER=cohere` | |
| `VOYAGE_API_KEY` | only if `RERANKER=voyage` | |
| `RERANK_INPUT_K` | no | Pool size fed into the reranker. Default `20` |
| `LANGSMITH_API_KEY` | no | When set, traces ship to LangSmith |
| `LANGSMITH_PROJECT` | no | Default `agentic-rag` |
| `LANGSMITH_TRACING` | no | `true`/`false`; auto-set based on API key presence |
| `PORT` | no | Injected by Railway/Fly at runtime |
| `ANALYTICS_DATABASE_URL` | no | Postgres URL for the `analytics_readonly` role used by the text-to-SQL baseline |
| `CRM_DATABASE_URL` | no | Postgres URL for the `crm_readonly` role used by the semantic-layer-aware SQL search. Falls back to `ANALYTICS_DATABASE_URL` |
| `CRM_SEED_DATABASE_URL` | no | Writable Postgres URL used only by `python -m db_seed.crm_seed`. Falls back to `DATABASE_URL` |
| `ALLOWED_SQL_SCHEMAS` | no | Comma-separated schema allowlist for SQL tools. Default `analytics,crm` |
| `SQL_QUERY_TIMEOUT_MS` | no | Statement timeout for SQL tools. Default `10000` |
| `ANTHROPIC_API_KEY` | only for eval generation | Required by `evals/retrieval/runner.py --include-generation` (the LLM judge runs Claude). Never read by the live backend |
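Several of the variables above are enumerations that fail fast at startup, and some require a companion key. A minimal sketch of that validation pattern follows; `load_settings` and the two lookup tables are hypothetical names, and checking the reranker keys at load time is an assumption (the table only states they are required for the matching `RERANKER` value):

```python
import os
from collections.abc import Mapping

# (default, allowed values) per enumerated variable, taken from the table above.
ENUM_VARS = {
    "PARSER": ("docling", {"docling", "llamaparse", "unstructured"}),
    "RETRIEVAL_MODE": ("hybrid", {"hybrid", "vector", "keyword"}),
    "RERANKER": ("none", {"none", "cohere", "voyage", "llm"}),
}

# Companion secrets that must be present for a given selection.
COMPANION_KEYS = {
    ("PARSER", "llamaparse"): "LLAMA_CLOUD_API_KEY",
    ("RERANKER", "cohere"): "COHERE_API_KEY",
    ("RERANKER", "voyage"): "VOYAGE_API_KEY",
}


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Resolve enumerated settings, raising ValueError on any bad value."""
    settings: dict[str, str] = {}
    for var, (default, allowed) in ENUM_VARS.items():
        value = env.get(var) or default  # unset or empty falls back to default
        if value not in allowed:
            raise ValueError(f"{var}={value!r} is not one of {sorted(allowed)}")
        companion = COMPANION_KEYS.get((var, value))
        if companion and not env.get(companion):
            raise ValueError(f"{var}={value} requires {companion} to be set")
        settings[var] = value
    return settings
```

Calling this once at process start surfaces a typo such as `RERANKER=coher` before the first request instead of on the first rerank.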
Bring your own model host. Provider binds per role (answerer / embedder /
judge); model binds per call-site. Two targets are tested — openai and
azure — and openai accepts a base_url for any OpenAI-compatible endpoint.
The embedder/judge inherit the answerer config unless overridden, so a
single-provider deploy sets only the answerer (bare) vars. Full reference,
role-fallback precedence, worked Azure example, capability matrix, and the
embedder re-index procedure: docs/model-surface.md.
| Var | Required | Notes |
|---|---|---|
| `LLM_PROVIDER` | no | Answerer provider: `openai` (default) or `azure` |
| `OPENAI_BASE_URL` | no | Any OpenAI-compatible endpoint (supported-but-untested) |
| `AZURE_OPENAI_ENDPOINT` / `AZURE_OPENAI_API_VERSION` / `AZURE_OPENAI_API_KEY` | only if `provider=azure` | All three required; `provider=azure` fails closed at startup if any is missing |
| `AZURE_OPENAI_DEPLOYMENT` | no | Azure deployment name (≠ model id); unset → per-call model id is the deployment |
| `EMBEDDER_PROVIDER` / `EMBEDDER_API_KEY` / `EMBEDDER_BASE_URL` / `EMBEDDER_AZURE_OPENAI_*` | no | Embedder-role overrides; fall back to the answerer config (deployment is per-role, not inherited) |
| `JUDGE_PROVIDER` / `JUDGE_API_KEY` / `JUDGE_BASE_URL` / `JUDGE_AZURE_OPENAI_*` | no | Runtime-judge-role overrides; same fallback rules as the embedder |
| `EMBEDDER_MODEL` | no | Embedder model. Falls back to `EMBEDDING_MODEL` → `text-embedding-3-small` |
| `METADATA_MODEL` / `OPENAI_PLANNER_MODEL` / `OPENAI_SQL_MODEL` / `OPENAI_SUBAGENT_MODEL` / `OPENAI_RERANK_MODEL` | no | Per-call-site model selectors within the answerer provider; each falls back to `OPENAI_MODEL` |
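The fallback chains in this table can be read as simple first-non-empty lookups. The sketch below illustrates that reading only; the function names are hypothetical, treating an empty string as unset is an assumption, and the authoritative precedence lives in `docs/model-surface.md`:

```python
from collections.abc import Mapping


def first_set(env: Mapping[str, str], names: list[str], default: str) -> str:
    """Return the first non-empty variable in `names`, else `default`."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return default


def embedder_model(env: Mapping[str, str]) -> str:
    # EMBEDDER_MODEL -> EMBEDDING_MODEL -> text-embedding-3-small
    return first_set(env, ["EMBEDDER_MODEL", "EMBEDDING_MODEL"], "text-embedding-3-small")


def call_site_model(env: Mapping[str, str], selector: str) -> str:
    # e.g. selector="OPENAI_SQL_MODEL"; falls back to OPENAI_MODEL, then its default.
    return first_set(env, [selector, "OPENAI_MODEL"], "gpt-4o-mini")


def role_provider(env: Mapping[str, str], role: str) -> str:
    # role is "EMBEDDER" or "JUDGE"; falls back to the answerer's LLM_PROVIDER.
    return first_set(env, [f"{role}_PROVIDER", "LLM_PROVIDER"], "openai")
```

With only `LLM_PROVIDER=azure` set, `role_provider(env, "EMBEDDER")` resolves to `azure`, which is the single-provider deploy the paragraph above describes.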
| Var | Required | Notes |
|---|---|---|
| `VITE_SUPABASE_URL` | yes | Same as backend `SUPABASE_URL` |
| `VITE_SUPABASE_ANON_KEY` | yes | Same as backend `SUPABASE_ANON_KEY` |
| `VITE_BACKEND_URL` | yes | Backend origin: `http://localhost:8000` for dev, your Railway/Fly URL in prod |
The backend exposes:
| Method | Path | Purpose |
|---|---|---|
| `POST` | `/api/chat` | Streaming chat, tool-using agent loop |
| `GET` | `/api/config` | Frontend bootstrap (chat mode default, etc.) |
| `POST` | `/api/documents/{id}/ingest` | Trigger / re-trigger ingestion for an uploaded document |
| `POST` | `/api/documents/{id}/share` | Grant a user or group access to a document |
| `GET` | `/api/documents/{id}/shares` | List existing grants (owner-only) |
| `DELETE` | `/api/documents/{id}/shares/{principal_id}` | Revoke a grant |
| `POST` | `/api/search` `/api/search/keyword` `/api/search/hybrid` `/api/search/rerank` | Direct retrieval probes (debugging / eval) |
| `POST` | `/api/sql` | Text-to-SQL via the semantic-layer compiler |
| `POST` | `/api/web-search` | Web fallback |
| `POST` | `/api/subagent` | Spawn a document sub-agent |
| `POST` | `/api/support/widget-keys` | Admin: issue a new (non-secret) widget public key for a workspace you administer; the first key issued lazily provisions the workspace support bot. Rejects an empty/blank `allowed_origins` with a 400 - a key with no registered origin is inactive under the US-073 fail-closed gate, so it is refused at creation rather than minted dead (US-072/US-073) |
| `GET` | `/api/support/widget-keys?workspace_id=…` | Admin: list a workspace's widget keys (active + revoked) for the `/support/settings` UI (US-072) |
| `POST` | `/api/support/widget-keys/{key_id}/revoke` | Admin: revoke a widget key - a one-way `revoked_at` latch that blocks new conversations but never terminates a live one (US-072) |
| `POST` | `/widget/keys/resolve` | Public widget: resolve a non-secret `public_key` on open, gating on not-revoked then the per-key registered-origin allowlist (fail-closed - an empty allowlist or a missing/unlisted request `Origin` is refused with the same opaque 404) under the service role; returns `{"active": true}` or 404 and leaks no workspace topology. Rate-limited per-session/IP (charged first, before the resolve) and per-key (after the resolve), 429 + `Retry-After` on breach (US-072/US-073/US-076) |
| `POST` | `/widget/conversations/resume` | Public widget: resume an anonymous support conversation from its opaque per-conversation token (`X-Conversation-Token`, not a JWT) - slides the 24h window (US-071) |
| `GET` | `/widget/conversations/{id}/transcript` | Public widget: fetch a conversation's transcript, authorized by the same opaque token; read-only, never slides the window (US-071) |
| `GET` | `/healthz` | Liveness check |
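As an illustration of calling the sharing routes from a script, the sketch below only builds the requests for the two routes that need no body. Sending the viewer's JWT as an `Authorization: Bearer` header is an assumption, as are the helper names; the share-grant request body is not documented here, so it is left out:

```python
import urllib.request
from urllib.parse import quote


def _authed_request(method: str, base_url: str, path: str, jwt: str) -> urllib.request.Request:
    # Assumption: the backend reads the viewer's JWT from a Bearer header.
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        method=method,
        headers={"Authorization": f"Bearer {jwt}"},
    )


def list_shares_request(base_url: str, document_id: str, jwt: str) -> urllib.request.Request:
    """GET /api/documents/{id}/shares (owner-only)."""
    path = f"/api/documents/{quote(document_id, safe='')}/shares"
    return _authed_request("GET", base_url, path, jwt)


def revoke_share_request(
    base_url: str, document_id: str, principal_id: str, jwt: str
) -> urllib.request.Request:
    """DELETE /api/documents/{id}/shares/{principal_id}."""
    path = (
        f"/api/documents/{quote(document_id, safe='')}"
        f"/shares/{quote(principal_id, safe='')}"
    )
    return _authed_request("DELETE", base_url, path, jwt)
```

Pass the returned object to `urllib.request.urlopen` to send it against a running backend.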
The CI workflows wrap the eval runners:
- `.github/workflows/retrieval-eval.yml`: runs on PRs that touch retrieval / chunking / embeddings / escalation / migrations / the runner itself. Executes the 50-question golden set against PR head AND `main`, posts a delta-vs-`main` comment. The delta comment is advisory; it never fails the build. The PR run additionally executes two hard gates. The first is the E6 second-workspace zero-leak eval (`--include-e6`), where a detected cross-workspace leak (or a structurally blind positive control) fails the build. The second is the E7 escalation tripwire (`e7_runner --include-p1b`, US-059): the deterministic deflection legs (P1a/P1b retrieval-gate decisions + the P1b non-disclosure byte-equality assertion, no LLM), where a P1a/P1b gate clear or a non-disclosure mismatch fails the build. Both are deterministic, so a real verdict can't flake; a transient E6 execution error is surfaced loudly but stays non-blocking.
- `.github/workflows/escalation-eval-weekly.yml`: Sundays 06:00 UTC + manual `workflow_dispatch`. Runs the full E7 deflection sweep including the LLM-judged P2/P3 legs + the knob sweep; publishes to `docs/escalation-weekly/<DATE>.md` + `.json`. A measured false-resolve rate above the buyer's ceiling (the pinned safety number) fails the scheduled workflow and files an issue. It never blocks a merge (a judge wobble must not red-bar a PR; US-059).
- `.github/workflows/retrieval-eval-ragas-weekly.yml`: Sundays 04:00 UTC + manual `workflow_dispatch`. Scores the four canonical RAGAS metrics weekly; publishes to `docs/ragas-weekly/<DATE>.md`; files an issue on a red gate finding.
- `.github/workflows/retrieval-eval-nightly.yml`: daily 02:00 UTC. Publishes snapshots to `docs/nightly/<DATE>.md` + `.json`.
- `.github/workflows/permissions-scale-eval.yml`: daily 03:00 UTC + manual `workflow_dispatch`. Runs the Wikipedia 10k seed + ef_search sweep; publishes to `docs/permissions-scale-nightly/<DATE>.md`. Fails the workflow if the configured recall floor is breached. This is the regression alarm for the day the planner flips to HNSW for some workload.
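The delta-vs-`main` comment boils down to subtracting two metric snapshots. A rough sketch of that step, with hypothetical function names and made-up metric values (the real workflow's comment format is not shown in this README):

```python
def metric_deltas(head: dict[str, float], main: dict[str, float]) -> dict[str, float]:
    """Per-metric difference (head minus main) for metrics present in both runs."""
    return {name: round(head[name] - main[name], 3) for name in sorted(head.keys() & main.keys())}


def render_delta_comment(deltas: dict[str, float]) -> str:
    """Render the deltas as a small markdown table for a PR comment."""
    lines = ["| Metric | Δ vs main |", "|---|---|"]
    lines += [f"| {name} | {value:+.3f} |" for name, value in deltas.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative numbers only, not results from this repo.
    head = {"recall@5": 0.8, "mrr": 0.7}
    main = {"recall@5": 0.9, "mrr": 0.7}
    print(render_delta_comment(metric_deltas(head, main)))
```

Because the comment is advisory, a script like this would always exit zero; only the E6 and E7 gates decide whether the build fails.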
To run the eval locally:
# One-time corpus seed
export CORPUS_SEED_DATABASE_URL=postgresql://postgres:postgres@localhost:54322/postgres
export SUPABASE_URL=http://127.0.0.1:54321
export SUPABASE_SERVICE_ROLE_KEY=<from `supabase status`>
export OPENAI_API_KEY=sk-...
python -m db_seed.corpus_seed
# Eval runs
python -m evals.retrieval.runner # all three modes
python -m evals.retrieval.runner --mode vector # single mode (faster)
python -m evals.retrieval.runner --include-generation # adds LLM-judge faithfulness/helpfulness (needs ANTHROPIC_API_KEY)
python -m evals.retrieval.runner --include-e6 # adds the E6 second-workspace zero-leak gate (exits 1 on a cross-workspace leak)
python -m evals.retrieval.e7_runner --include-p1b # E7 escalation tripwire - the deterministic per-PR gate (P1a/P1b retrieval gate + non-disclosure byte-equality, no LLM; exits 1 on a gate clear or non-disclosure mismatch). The P1b leg also needs DATABASE_URL set. Add --include-p2 --include-p3 --sweep for the weekly LLM-judged legs (needs ANTHROPIC_API_KEY)

The app deploys to Vercel (frontend) + Railway or Fly (backend) + Supabase (DB/Auth/Storage). No code changes are required, only env vars.
- Create a project at supabase.com.
- Link and push the schema:

      cd supabase
      supabase link --project-ref <your-ref>
      supabase db push

- Enable Google and GitHub OAuth providers in Authentication → Providers.
- Grab `SUPABASE_URL`, the `anon` key, and the `service_role` key from Settings → API.
- Push the repo to GitHub.
- Create a Railway project → New Service → Deploy from GitHub repo.
- Set Service Root Directory to `backend/`. Railway picks up `backend/Dockerfile` and `backend/railway.toml` automatically.
- Under Variables, set: `SUPABASE_URL`, `SUPABASE_ANON_KEY`, `SUPABASE_SERVICE_ROLE_KEY`, `OPENAI_API_KEY`, `OPENAI_MODEL`, `OPENAI_VECTOR_STORE_ID`, `FRONTEND_ORIGIN`, `LANGSMITH_API_KEY`, `LANGSMITH_PROJECT`. Add `RERANKER` + the matching API key if you want a reranker on by default.
- Deploy. Note the generated `*.up.railway.app` URL; that's your `VITE_BACKEND_URL`.
- Hit `/healthz` to confirm the service is up.
cd backend
fly launch --copy-config --no-deploy # picks up fly.toml + Dockerfile
fly secrets set \
SUPABASE_URL=... SUPABASE_ANON_KEY=... SUPABASE_SERVICE_ROLE_KEY=... \
OPENAI_API_KEY=... OPENAI_VECTOR_STORE_ID=... \
FRONTEND_ORIGIN=https://<your-vercel-url> \
LANGSMITH_API_KEY=...
fly deploy

- Add New Project → import the GitHub repo.
- Set Root Directory to `frontend/`. Vercel picks up `frontend/vercel.json` (Vite preset, SPA rewrites).
- Set env vars: `VITE_SUPABASE_URL`, `VITE_SUPABASE_ANON_KEY`, `VITE_BACKEND_URL` (← your Railway/Fly URL).
- Deploy. Copy the production URL back into the backend's `FRONTEND_ORIGIN` and redeploy the backend so CORS allows it.
Open the Vercel URL, sign up, create a thread, send a message. The response should stream token-by-token, and a trace should appear in LangSmith tagged with your user_id and thread_id. Upload a document at /ingestion, watch it transition pending → processing → ready, then ask the chat about its contents.
The system landed in 11 progressive modules; the full plan + per-story
acceptance criteria live in .claude/agent/tasks/prd-agentic-rag.md.
| Module | What landed |
|---|---|
| 1 | App shell, auth, threads, streaming chat, LangSmith |
| 2 | BYO retrieval (vector via match_chunks RPC), per-thread memory |
| 3 | Content-hashing dedup on documents and chunks |
| 4 | LLM structured-output metadata extraction at ingestion |
| 5 | Multi-format ingestion (txt/md/pdf/docx/html via docling) |
| 6 | Hybrid retrieval (RRF) + reranker layer (cohere / voyage / llm) |
| 7 | Additional tools — query_database, web_search |
| 8 | Sub-agents — spawn_document_agent |
| 9 | Structured RAG with semantic-layer-aware text-to-SQL |
| 10 | Retrieval eval suite (golden set, metrics, PR CI delta, nightly) |
| 11 | Permission-aware retrieval (per-chunk ACLs, share dialog, granting-principal badges) |
