Add foundational support for durable storage #135

Merged: krisztianfekete merged 5 commits into main from feature/add-durable-storage on May 6, 2026

krisztianfekete (Contributor) commented on May 4, 2026

This PR is opt-in (AGENTEVALS_STORAGE_BACKEND=postgres), so the existing in-memory developer experience is unchanged: agentevals run trace.json keeps working, the React UI behaves identically, OTLP streaming is untouched.

There is no proper UI support for this at the moment. It's a preview feature; expect breaking changes to the APIs and schema.
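
As a rough illustration of the opt-in switch, here is a minimal sketch of env-driven backend selection. It is not the code in src/agentevals/storage/config.py: the function name resolve_backend and the explicit "memory" value are assumptions, and only AGENTEVALS_STORAGE_BACKEND=postgres comes from this PR.

```python
import os

# Assumed set of valid values; the PR only names "postgres" and the in-memory default.
VALID_BACKENDS = {"memory", "postgres"}


def resolve_backend(env=None):
    """Return the storage backend selected by AGENTEVALS_STORAGE_BACKEND (hypothetical helper)."""
    env = os.environ if env is None else env
    # Unset or empty falls back to the zero-config in-memory backend.
    backend = (env.get("AGENTEVALS_STORAGE_BACKEND") or "memory").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unsupported storage backend: {backend!r}")
    return backend
```

Rejecting unknown values at startup, rather than silently falling back to memory, is a design choice of this sketch; it avoids a typo in the env var quietly disabling persistence.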

Setup

uv lock                        # picks up the new [postgres] extra
uv sync --extra postgres       # installs asyncpg
make pg-up                     # boots postgres:17-alpine, waits for pg_isready (idempotent)
make migrate                   # applies 000001_init; idempotent on replay
make dev-backend-pg            # serves with backend=postgres + worker pool

Look for these log lines on startup:

INFO:agentevals.api.app:Applying any pending migrations to schema 'agentevals'
INFO:agentevals.storage.postgres.pool:Creating asyncpg pool (min=4, max=12) for schema 'agentevals'
INFO:agentevals.run.worker:Started 4 run worker(s) (lease=30s, heartbeat=5s, deadline=300s)

The async run pipeline (POST /api/runs)

Submit a run, watch the worker pick it up, read the persisted results back:

RUN_ID=$(uuidgen | tr 'A-Z' 'a-z')
INLINE=$(cat samples/helm.json)
cat > /tmp/req.json <<EOF
{"runId": "$RUN_ID",
"spec": {"approach": "trace_replay",
"target": {"kind": "inline", "inline": $INLINE},
"evalConfig": {"metrics": ["tool_trajectory_avg_score"]}}}
EOF

curl -s -X POST http://localhost:8001/api/runs -H 'content-type: application/json' -d @/tmp/req.json | jq .data.status
# expect: "queued"

sleep 3
curl -s "http://localhost:8001/api/runs/$RUN_ID" | jq .data.status
# expect: "succeeded"

curl -s "http://localhost:8001/api/runs/$RUN_ID/results" | jq '.data | length'
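
The same submit-then-poll flow can be sketched in Python. The payload shape mirrors /tmp/req.json above; the helper names, the polling parameters, and the "failed" status are assumptions (only "queued", "running", "succeeded" and "cancelled" appear in this PR). The status lookup is injected so the sketch stays independent of any HTTP client.

```python
import time
import uuid

# "succeeded" and "cancelled" appear in the PR text; "failed" is an assumption.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def build_run_request(inline_trace, metrics, run_id=None):
    """Build the JSON body for POST /api/runs with an inline trace target."""
    return {
        # uuid4() renders lowercase, matching the `tr 'A-Z' 'a-z'` step in the shell version.
        "runId": run_id or str(uuid.uuid4()),
        "spec": {
            "approach": "trace_replay",
            "target": {"kind": "inline", "inline": inline_trace},
            "evalConfig": {"metrics": list(metrics)},
        },
    }


def wait_for_terminal(fetch_status, timeout_s=30.0, interval_s=0.5, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

In practice fetch_status would wrap GET /api/runs/$RUN_ID and return .data.status; polling replaces the fixed sleep 3, which can race a slow worker.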

Idempotency, 409, and cancel

# Idempotent re-submit (HTTP 202)
curl -s -i -X POST http://localhost:8001/api/runs -H 'content-type: application/json' -d @/tmp/req.json | head -1

# Different spec, same id (HTTP 409)
sed 's/tool_trajectory_avg_score/response_match_score/' /tmp/req.json > /tmp/req2.json
curl -s -i -X POST http://localhost:8001/api/runs -H 'content-type: application/json' -d @/tmp/req2.json | head -3

# Cancel (returns "cancelled" if you race the worker, otherwise the terminal status)
curl -s -X POST "http://localhost:8001/api/runs/$RUN_ID/cancel" | jq -r '.data.status'
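
The idempotency contract exercised above (same id and same spec is accepted again, same id with a different spec is a conflict) can be illustrated with a small in-memory sketch. This is not the PR's RunService: InMemoryRunStore, SpecConflict and the SHA-256 fingerprint are hypothetical names and choices made for the example.

```python
import hashlib
import json


class SpecConflict(Exception):
    """Raised when a runId is reused with a different spec (surfaced as HTTP 409 in the PR)."""


def spec_fingerprint(spec):
    # Canonical JSON (sorted keys, no extra whitespace) so key order does not change the hash.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryRunStore:
    """Hypothetical stand-in for a run repository, used only to show the contract."""

    def __init__(self):
        self._runs = {}

    def submit(self, run_id, spec):
        fingerprint = spec_fingerprint(spec)
        existing = self._runs.get(run_id)
        if existing is None:
            run = {"runId": run_id, "status": "queued", "fingerprint": fingerprint}
            self._runs[run_id] = run
            return run
        if existing["fingerprint"] != fingerprint:
            raise SpecConflict(f"run {run_id} already exists with a different spec")
        # Same id and same spec: return the stored run instead of enqueueing it again.
        return existing
```

Comparing fingerprints rather than raw request bodies means a client that re-serializes the same spec with a different key order still gets the idempotent path.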

Existing /api/evaluate flows persist when backend=postgres

UI uploads, multipart curl, SSE stream, and the JSON variant all now write a Run row plus Result rows. The response carries an extra runId field that wasn't there before. No UI changes required.

# Multipart (UI uses this)
curl -s -X POST http://localhost:8001/api/evaluate \
    -F 'trace_files=@samples/helm.json' \
    -F 'config={"metrics": ["tool_trajectory_avg_score"]}' | jq .data.runId

# SSE stream
curl -N -X POST http://localhost:8001/api/evaluate/stream \
    -F 'trace_files=@samples/helm.json' \
    -F 'config={"metrics": ["tool_trajectory_avg_score"]}' \
    | grep '"done": true' | head -1 | sed 's/^data: //' | jq .result.runId

# JSON body
.venv/bin/python -c 'import json; t=json.load(open("samples/helm.json"));
print(json.dumps({"traces":t,"config":{"metrics":["tool_trajectory_avg_score"]}}))' > /tmp/json_req.json
curl -s -X POST http://localhost:8001/api/evaluate/json -H 'content-type: application/json' -d @/tmp/json_req.json | jq .data.runId

Each call yields a new run row with target.kind = "uploaded". That's the OSS user-facing benefit of this PR: persistent run history for any eval that flows through the existing endpoints.
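
For clients that consume the SSE stream programmatically, the grep/sed/jq pipeline above can be sketched as a small parser. The event shape (a "data: " line carrying JSON with "done" and "result.runId") is inferred from that shell pipeline and is an assumption, as is the function name.

```python
import json


def run_id_from_sse(lines):
    """Return result.runId from the first SSE data event whose payload has done == true."""
    prefix = "data: "
    for line in lines:
        if not line.startswith(prefix):
            continue
        try:
            event = json.loads(line[len(prefix):])
        except json.JSONDecodeError:
            # Skip keep-alives or partial lines that are not valid JSON.
            continue
        if isinstance(event, dict) and event.get("done") is True:
            return (event.get("result") or {}).get("runId")
    return None
```

Unlike the grep on '"done": true', this does not depend on how the server spaces its JSON. With the in-memory backend the function returns None, matching the null runId shown in the regression check.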

Inspecting the data in Postgres

alias aepsql='docker exec agentevals-pg psql -U agentevals -d agentevals'

# Run history, most recent first
aepsql -c "SELECT run_id, status, attempt, created_at FROM agentevals.run ORDER BY created_at DESC LIMIT 10"

# Counts by status
aepsql -c "SELECT status, COUNT(*) FROM agentevals.run GROUP BY status ORDER BY 2 DESC"

# Counts by submission path (uploaded vs inline POST /api/runs)
aepsql -c "SELECT spec->'target'->>'kind' AS target, status, COUNT(*) FROM agentevals.run GROUP BY 1, 2"

# Drill into the last run
RUN=$(aepsql -At -c "SELECT run_id FROM agentevals.run ORDER BY created_at DESC LIMIT 1")
aepsql -c "SELECT evaluator_name, evaluator_type, status, score, latency_ms FROM agentevals.result WHERE run_id = '$RUN'"

# Aggregate scores per evaluator across all runs
aepsql -c "SELECT evaluator_name, ROUND(AVG(score)::numeric, 3) AS avg_score, COUNT(*) FROM agentevals.result WHERE score IS NOT NULL GROUP BY 1 ORDER BY 1"

# Queue / worker state (snapshot during a hot queue)
aepsql -c "SELECT run_id, status, worker_id, attempt, lease_expires_at, cancel_requested FROM agentevals.run WHERE status IN ('queued','running')"

# Schema state
aepsql -c "SELECT version, dirty FROM agentevals.schema_migrations"

Live tail while exercising the worker:

watch -n 1 "docker exec agentevals-pg psql -U agentevals -d agentevals -c \"SELECT status, COUNT(*) FROM agentevals.run GROUP BY status ORDER BY 1\""

Crash recovery

Submit a slow run using a larger trace, then press Ctrl+C to stop the agentevals process. Wait roughly 35 seconds (one 30-second lease window plus slack), then restart with make dev-backend-pg. A new worker re-claims the previously claimed run via the SKIP LOCKED claim query and completes it; the run row's attempt counter reads 2.
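
The lease mechanics behind this test can be sketched as a single-process simulation. It is not the worker in src/agentevals/run/worker.py: the Run dataclass, claim and heartbeat are hypothetical, and incrementing attempt at claim time is inferred from the counter reading 2. The field names follow the columns queried earlier, and the 30-second lease comes from the startup log. In Postgres, SKIP LOCKED additionally keeps concurrent workers from blocking on the same row, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_S = 30.0  # matches lease=30s in the startup log


@dataclass
class Run:
    run_id: str
    status: str = "queued"
    attempt: int = 0
    worker_id: Optional[str] = None
    lease_expires_at: Optional[float] = None


def claim(runs, worker_id, now):
    """Claim the first queued run, or a running run whose lease has expired."""
    for run in runs:
        expired = (
            run.status == "running"
            and run.lease_expires_at is not None
            and run.lease_expires_at <= now
        )
        if run.status == "queued" or expired:
            run.status = "running"
            run.worker_id = worker_id
            run.attempt += 1
            run.lease_expires_at = now + LEASE_S
            return run
    return None


def heartbeat(run, now):
    """Extend the lease; a crashed worker stops calling this, so the lease runs out."""
    run.lease_expires_at = now + LEASE_S
```

A run claimed at t=0 by a worker that then dies is invisible to other workers until t=30, which is why the manual test waits about 35 seconds before restarting.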

Memory backend regression (zero-config flow unchanged)

make pg-down
make dev-backend                  # default in-memory backend, no AGENTEVALS_STORAGE_BACKEND set

curl -s -i http://localhost:8001/api/runs | head -3       # expect: 503 with hint pointing at the env var
curl -s -X POST http://localhost:8001/api/evaluate \
    -F 'trace_files=@samples/helm.json' \
    -F 'config={"metrics": ["tool_trajectory_avg_score"]}' | jq .data.runId    # expect: null (no persistence configured)
curl -s http://localhost:8001/api/health | jq -r .data.status                  # expect: "ok"

Cleanup

make pg-down
rm /tmp/req.json /tmp/req2.json /tmp/json_req.json

krisztianfekete force-pushed the feature/add-durable-storage branch from 18785bd to 99247be on May 4, 2026 16:39
krisztianfekete force-pushed the feature/add-durable-storage branch from 99247be to 5c6d499 on May 4, 2026 16:43
Copilot AI left a comment

Pull request overview

Adds opt-in durable storage and an async run pipeline backed by Postgres, enabling persistent run history/results and a queue-driven worker while keeping the default in-memory workflow unchanged.

Changes:

  • Introduces storage abstractions (models + repos) with memory and postgres backends, plus SQL migrations and an asyncpg pool.
  • Adds /api/runs endpoints + in-process async worker to claim/execute queued runs and persist results.
  • Updates CLI, docs, Makefile, Docker image, and Helm chart to support Postgres-backed deployments; adds comprehensive tests for the new behavior.

Reviewed changes

Copilot reviewed 43 out of 47 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `uv.lock` | Adds postgres extra and locks asyncpg. |
| `pyproject.toml` | Declares postgres optional dependency + ensures migrations are packaged in wheels. |
| `Dockerfile` | Installs postgres extra in the container image. |
| `Makefile` | Adds local Postgres + migrate + pg-backed dev server targets. |
| `README.md` | Documents Helm usage for Postgres backend and /api/runs. |
| `DEVELOPMENT.md` | Documents local dev flow for Postgres backend. |
| `charts/agentevals/values.yaml` | Adds chart values for storage backend + Postgres configuration. |
| `charts/agentevals/templates/service.yaml` | Scopes Service selector labels to avoid matching bundled Postgres pods. |
| `charts/agentevals/templates/deployment.yaml` | Wires env vars for Postgres backend (DSN/urlFile/bundled) into the app deployment. |
| `charts/agentevals/templates/_helpers.tpl` | Adds helper templates for app selectors and bundled Postgres resources. |
| `charts/agentevals/templates/postgresql.yaml` | Adds an optional bundled Postgres Deployment/Service/PVC. |
| `charts/agentevals/templates/postgresql-secret.yaml` | Adds bundled Postgres password Secret. |
| `src/agentevals/storage/repos/postgres.py` | Implements asyncpg-backed Session/Run/Result repositories. |
| `src/agentevals/storage/repos/memory.py` | Implements in-memory Session/Run/Result repositories for OSS default/testing. |
| `src/agentevals/storage/repos/__init__.py` | Defines repository protocols and Repos bundle. |
| `src/agentevals/storage/postgres/pool.py` | Adds asyncpg pool factory with readiness retry. |
| `src/agentevals/storage/postgres/migrator.py` | Adds migration discovery + advisory-lock-protected migrator. |
| `src/agentevals/storage/postgres/migrations/000001_init.up.sql` | Adds baseline schema for sessions/runs/results and supporting tables/indexes. |
| `src/agentevals/storage/postgres/migrations/000001_init.down.sql` | Adds destructive rollback (drop schema). |
| `src/agentevals/storage/postgres/__init__.py` | Documents Postgres backend intent. |
| `src/agentevals/storage/models.py` | Adds Pydantic models for persisted Run/Result and trace targets + result_id hashing. |
| `src/agentevals/storage/config.py` | Adds env-driven storage settings and validation. |
| `src/agentevals/storage/__init__.py` | Adds build_repos() factory for selecting backend. |
| `src/agentevals/runner.py` | Extends RunResult with optional run_id for persistence linkage. |
| `src/agentevals/run/worker.py` | Adds async worker pool for claiming/running queued work with heartbeat/cancellation. |
| `src/agentevals/run/sinks.py` | Adds sink fanout (stdout/file/http webhook) with best-effort delivery. |
| `src/agentevals/run/service.py` | Adds RunService for idempotent submit/list/cancel and /api/evaluate persistence. |
| `src/agentevals/run/result_builder.py` | Adds pure helpers to build persisted Result rows + run summary. |
| `src/agentevals/run/fetcher.py` | Adds inline/http trace fetchers for worker execution. |
| `src/agentevals/run/__init__.py` | Documents run pipeline modules. |
| `src/agentevals/cli.py` | Adds agentevals migrate CLI group (up/down/version/force/create). |
| `src/agentevals/api/runs_routes.py` | Adds /api/runs router (submit/get/list/results/cancel). |
| `src/agentevals/api/routes.py` | Persists /api/evaluate runs/results when run_service is configured and returns runId. |
| `src/agentevals/api/app.py` | On startup: loads storage settings, runs migrations, builds repos, wires RunService, starts worker. |
| `tests/storage/test_models.py` | Unit tests for storage models and deterministic result_id hashing. |
| `tests/storage/test_migrator.py` | Tests migration discovery/schema substitution + optional live PG tests behind env var. |
| `tests/storage/test_memory_repos.py` | Contract tests for memory repo behavior (runs/results/sessions). |
| `tests/storage/test_config.py` | Tests env loading + validation for StorageSettings. |
| `tests/storage/__init__.py` | Adds storage test package. |
| `tests/run/test_sinks.py` | Tests stdout/file/webhook sink behavior and fanout failure isolation. |
| `tests/run/test_service.py` | Tests RunService submit/idempotency/conflicts + /api/evaluate persistence path. |
| `tests/run/test_result_builder.py` | Tests result projection, status mapping, summary building, and evaluator classification. |
| `tests/run/test_fetcher.py` | Tests fetcher dispatch and validation paths. |
| `tests/run/__init__.py` | Adds run test package. |
| `tests/api/test_runs_routes.py` | HTTP-level tests for /api/runs endpoints (503 when unconfigured + happy paths via stub service). |
| `tests/api/test_evaluate_persistence.py` | HTTP-level tests that /api/evaluate persists when run_service is injected. |
| `tests/api/__init__.py` | Adds API test package. |

Comment thread src/agentevals/storage/config.py
Comment thread src/agentevals/storage/config.py Outdated
Comment thread src/agentevals/storage/repos/postgres.py
Comment thread src/agentevals/api/runs_routes.py Outdated
krisztianfekete marked this pull request as ready for review on May 5, 2026 14:55
krisztianfekete requested a review from peterj on May 5, 2026 20:41
krisztianfekete merged commit 78f0d68 into main on May 6, 2026 (5 checks passed)
krisztianfekete deleted the feature/add-durable-storage branch on May 6, 2026 13:54
2 participants