Add foundational support for durable storage by krisztianfekete · Pull Request #135 · agentevals-dev/agentevals

krisztianfekete · 2026-05-04T16:15:15Z

This PR is opt-in (AGENTEVALS_STORAGE_BACKEND=postgres), so the existing in-memory developer experience is unchanged: agentevals run trace.json keeps working, the React UI behaves identically, OTLP streaming is untouched.

There is no proper UI support for this yet. It's a preview feature: expect breaking changes to the APIs and schema.
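The opt-in behaviour can be pictured as a small env lookup that falls back to the in-memory backend. This is a minimal sketch, not the code in src/agentevals/storage/config.py: the function name select_backend and the label "memory" for the default are assumptions; only the variable name AGENTEVALS_STORAGE_BACKEND and the value postgres come from this description.

```python
import os
from typing import Mapping, Optional

# Assumption: "memory" is used here as the label for the default backend.
SUPPORTED_BACKENDS = ("memory", "postgres")


def select_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the storage backend name, defaulting to the in-memory one."""
    env = os.environ if env is None else env
    raw = env.get("AGENTEVALS_STORAGE_BACKEND", "").strip().lower()
    if not raw:
        # Nothing set: zero-config flow, no persistence.
        return "memory"
    if raw not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported storage backend: {raw!r}")
    return raw
```

Passing a plain dict instead of os.environ keeps the helper easy to exercise in tests without touching the real environment.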

Setup

uv lock                        # picks up the new [postgres] extra
uv sync --extra postgres       # installs asyncpg
make pg-up                     # boots postgres:17-alpine, waits for pg_isready (idempotent)
make migrate                   # applies 000001_init; idempotent on replay
make dev-backend-pg            # serves with backend=postgres + worker pool

Look for these log lines on startup:

INFO:agentevals.api.app:Applying any pending migrations to schema 'agentevals'
INFO:agentevals.storage.postgres.pool:Creating asyncpg pool (min=4, max=12) for schema 'agentevals'
INFO:agentevals.run.worker:Started 4 run worker(s) (lease=30s, heartbeat=5s, deadline=300s)

The async run pipeline (POST /api/runs)

Submit a run, watch the worker pick it up, read the persisted results back:

RUN_ID=$(uuidgen | tr 'A-Z' 'a-z')
INLINE=$(cat samples/helm.json)
cat > /tmp/req.json <<EOF
{"runId": "$RUN_ID",
"spec": {"approach": "trace_replay",
"target": {"kind": "inline", "inline": $INLINE},
"evalConfig": {"metrics": ["tool_trajectory_avg_score"]}}}
EOF

curl -s -X POST http://localhost:8001/api/runs -H 'content-type: application/json' -d @/tmp/req.json | jq .data.status
# expect: "queued"

sleep 3
curl -s "http://localhost:8001/api/runs/$RUN_ID" | jq .data.status
# expect: "succeeded"

curl -s "http://localhost:8001/api/runs/$RUN_ID/results" | jq '.data | length'
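The fixed sleep 3 above is fine for the sample trace, but a bigger trace may need longer. A hedged alternative is to poll until the run reaches a terminal status. In this sketch, wait_for_terminal and http_fetcher are hypothetical helper names; "queued", "running", "succeeded" and "cancelled" appear in this walkthrough, while "failed" is an assumed name for the error state.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: "failed" is the name of the error terminal status.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def http_fetcher(base_url: str, run_id: str) -> Callable[[], str]:
    """Build a callable that reads .data.status from GET /api/runs/<run_id>."""
    def fetch() -> str:
        with urllib.request.urlopen(f"{base_url}/api/runs/{run_id}") as resp:
            return json.load(resp)["data"]["status"]
    return fetch


def wait_for_terminal(
    fetch_status: Callable[[], str],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Against the local server this would be called as wait_for_terminal(http_fetcher("http://localhost:8001", run_id)), with run_id holding the same value as $RUN_ID.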

Idempotency, 409, and cancel

# Idempotent re-submit (HTTP 202)
curl -s -i -X POST http://localhost:8001/api/runs -H 'content-type: application/json' -d @/tmp/req.json | head -1

# Different spec, same id (HTTP 409)
sed 's/tool_trajectory_avg_score/response_match_score/' /tmp/req.json > /tmp/req2.json
curl -s -i -X POST http://localhost:8001/api/runs -H 'content-type: application/json' -d @/tmp/req2.json | head -3

# Cancel (returns "cancelled" if you race the worker, otherwise the terminal status)
curl -s -X POST "http://localhost:8001/api/runs/$RUN_ID/cancel" | jq -r '.data.status'
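The 202 versus 409 behaviour can be modelled by comparing a fingerprint of the submitted spec against what is already stored for that run id. This is a toy in-memory sketch, not the RunService implementation: InMemoryRunStore and spec_fingerprint are hypothetical names, hashing canonical JSON is one possible way to compare specs, and returning 202 for the very first submit is an assumption (the description only states 202 for the re-submit).

```python
import hashlib
import json
from typing import Any, Dict, Tuple


def spec_fingerprint(spec: Dict[str, Any]) -> str:
    """Hash a spec with sorted keys so key order does not affect equality."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryRunStore:
    """Toy stand-in for the run table: maps run_id to a spec fingerprint."""

    def __init__(self) -> None:
        self._runs: Dict[str, str] = {}

    def submit(self, run_id: str, spec: Dict[str, Any]) -> Tuple[int, str]:
        fingerprint = spec_fingerprint(spec)
        existing = self._runs.get(run_id)
        if existing is None:
            self._runs[run_id] = fingerprint
            return 202, "queued"       # new run accepted (status code assumed)
        if existing == fingerprint:
            return 202, "duplicate"    # same id, same spec: no new row
        return 409, "conflict"         # same id, different spec
```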

Existing /api/evaluate flows persist when backend=postgres

UI uploads, multipart curl, SSE stream, and the JSON variant all now write a Run row plus Result rows. The response carries an extra runId field that wasn't there before. No UI changes required.

# Multipart (UI uses this)
curl -s -X POST http://localhost:8001/api/evaluate \
    -F 'trace_files=@samples/helm.json' \
    -F 'config={"metrics": ["tool_trajectory_avg_score"]}' | jq .data.runId

# SSE stream
curl -N -X POST http://localhost:8001/api/evaluate/stream \
    -F 'trace_files=@samples/helm.json' \
    -F 'config={"metrics": ["tool_trajectory_avg_score"]}' \
    | grep '"done": true' | head -1 | sed 's/^data: //' | jq .result.runId

# JSON body
.venv/bin/python -c 'import json; t=json.load(open("samples/helm.json"));
print(json.dumps({"traces":t,"config":{"metrics":["tool_trajectory_avg_score"]}}))' > /tmp/json_req.json
curl -s -X POST http://localhost:8001/api/evaluate/json -H 'content-type: application/json' -d @/tmp/json_req.json | jq .data.runId

Each call yields a new run row with target.kind = "uploaded". That's the OSS user-facing benefit of this PR: persistent run history for any eval that flows through the existing endpoints.
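For clients that consume the SSE stream programmatically, the grep/sed/jq pipeline above translates to a few lines of Python. This is a sketch under the assumption that each event arrives as a single "data: <json>" line and that the final event has "done": true with the id under result.runId, which is what the pipeline implies; run_id_from_sse is a hypothetical helper name.

```python
import json
from typing import Iterable, Optional


def run_id_from_sse(lines: Iterable[str]) -> Optional[str]:
    """Return result.runId from the first 'data:' event whose payload has done=true."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        try:
            event = json.loads(line[len("data:"):].strip())
        except json.JSONDecodeError:
            continue  # ignore payloads that are not JSON
        if isinstance(event, dict) and event.get("done") is True:
            return (event.get("result") or {}).get("runId")
    return None
```

With the memory backend the same helper returns None, matching the null runId shown in the memory backend regression section.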

Inspecting the data in Postgres

alias aepsql='docker exec agentevals-pg psql -U agentevals -d agentevals'

# Run history, most recent first
aepsql -c "SELECT run_id, status, attempt, created_at FROM agentevals.run ORDER BY created_at DESC LIMIT 10"

# Counts by status
aepsql -c "SELECT status, COUNT(*) FROM agentevals.run GROUP BY status ORDER BY 2 DESC"

# Counts by submission path (uploaded vs inline POST /api/runs)
aepsql -c "SELECT spec->'target'->>'kind' AS target, status, COUNT(*) FROM agentevals.run GROUP BY 1, 2"

# Drill into the last run
RUN=$(aepsql -At -c "SELECT run_id FROM agentevals.run ORDER BY created_at DESC LIMIT 1")
aepsql -c "SELECT evaluator_name, evaluator_type, status, score, latency_ms FROM agentevals.result WHERE run_id = '$RUN'"

# Aggregate scores per evaluator across all runs
aepsql -c "SELECT evaluator_name, ROUND(AVG(score)::numeric, 3) AS avg_score, COUNT(*) FROM agentevals.result WHERE score IS NOT NULL GROUP BY 1 ORDER BY 1"

# Queue / worker state (snapshot during a hot queue)
aepsql -c "SELECT run_id, status, worker_id, attempt, lease_expires_at, cancel_requested FROM agentevals.run WHERE status IN ('queued','running')"

# Schema state
aepsql -c "SELECT version, dirty FROM agentevals.schema_migrations"

Live tail while exercising the worker:

watch -n 1 "docker exec agentevals-pg psql -U agentevals -d agentevals -c \"SELECT status, COUNT(*) FROM agentevals.run GROUP BY status ORDER BY 1\""

Crash recovery

Submit a slow run using a bigger trace, then Ctrl+C the agentevals process. Wait roughly 35 seconds (one 30-second lease window plus slack), then restart with make dev-backend-pg. A new worker re-claims the previously claimed run through the SKIP LOCKED claim query and completes it; the run row's attempt counter then reads 2.
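The lease mechanics behind this recovery can be sketched without a database. The model below is an in-memory illustration, not the worker code: the field names (status, attempt, worker_id, lease_expires_at) mirror the columns queried in the inspection section, the 30-second lease matches the startup log line, and the LeaseQueue class and its methods are hypothetical. In Postgres, row locking with SKIP LOCKED is what would keep two workers from claiming the same row at once.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Run:
    run_id: str
    status: str = "queued"
    attempt: int = 0
    worker_id: Optional[str] = None
    lease_expires_at: float = 0.0


class LeaseQueue:
    """Toy queue: a run is claimable if queued, or running with an expired lease."""

    def __init__(self, lease_s: float = 30.0) -> None:
        self.lease_s = lease_s
        self.runs: Dict[str, Run] = {}

    def add(self, run_id: str) -> None:
        self.runs[run_id] = Run(run_id)

    def claim(self, worker_id: str, now: float) -> Optional[Run]:
        for run in self.runs.values():
            expired = run.status == "running" and run.lease_expires_at <= now
            if run.status == "queued" or expired:
                run.status = "running"
                run.worker_id = worker_id
                run.attempt += 1  # every (re-)claim counts as a new attempt
                run.lease_expires_at = now + self.lease_s
                return run
        return None

    def heartbeat(self, run_id: str, worker_id: str, now: float) -> bool:
        """Extend the lease; refuse if another worker owns the run."""
        run = self.runs[run_id]
        if run.status != "running" or run.worker_id != worker_id:
            return False
        run.lease_expires_at = now + self.lease_s
        return True
```

A crashed worker stops sending heartbeats, so its lease lapses and the next claim after the expiry picks the run up with attempt incremented, which is the attempt = 2 observed above.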

Memory backend regression (zero-config flow unchanged)

make pg-down
make dev-backend                  # default in-memory backend, no AGENTEVALS_STORAGE_BACKEND set

curl -s -i http://localhost:8001/api/runs | head -3       # expect: 503 with hint pointing at the env var
curl -s -X POST http://localhost:8001/api/evaluate \
    -F 'trace_files=@samples/helm.json' \
    -F 'config={"metrics": ["tool_trajectory_avg_score"]}' | jq .data.runId    # expect: null (no persistence configured)
curl -s http://localhost:8001/api/health | jq -r .data.status                  # expect: "ok"

Cleanup

make pg-down
rm /tmp/req.json /tmp/req2.json /tmp/json_req.json

Copilot

Pull request overview

Adds opt-in durable storage and an async run pipeline backed by Postgres, enabling persistent run history/results and a queue-driven worker while keeping the default in-memory workflow unchanged.

Changes:

Introduces storage abstractions (models + repos) with memory and postgres backends, plus SQL migrations and an asyncpg pool.
Adds /api/runs endpoints + in-process async worker to claim/execute queued runs and persist results.
Updates CLI, docs, Makefile, Docker image, and Helm chart to support Postgres-backed deployments; adds comprehensive tests for the new behavior.

Reviewed changes

Copilot reviewed 43 out of 47 changed files in this pull request and generated 4 comments.


File	Description
uv.lock	Adds `postgres` extra and locks `asyncpg`.
pyproject.toml	Declares `postgres` optional dependency + ensures migrations are packaged in wheels.
Dockerfile	Installs `postgres` extra in the container image.
Makefile	Adds local Postgres + migrate + pg-backed dev server targets.
README.md	Documents Helm usage for Postgres backend and `/api/runs`.
DEVELOPMENT.md	Documents local dev flow for Postgres backend.
charts/agentevals/values.yaml	Adds chart values for storage backend + Postgres configuration.
charts/agentevals/templates/service.yaml	Scopes Service selector labels to avoid matching bundled Postgres pods.
charts/agentevals/templates/deployment.yaml	Wires env vars for Postgres backend (DSN/urlFile/bundled) into the app deployment.
charts/agentevals/templates/_helpers.tpl	Adds helper templates for app selectors and bundled Postgres resources.
charts/agentevals/templates/postgresql.yaml	Adds an optional bundled Postgres Deployment/Service/PVC.
charts/agentevals/templates/postgresql-secret.yaml	Adds bundled Postgres password Secret.
src/agentevals/storage/repos/postgres.py	Implements asyncpg-backed Session/Run/Result repositories.
src/agentevals/storage/repos/memory.py	Implements in-memory Session/Run/Result repositories for OSS default/testing.
src/agentevals/storage/repos/__init__.py	Defines repository protocols and `Repos` bundle.
src/agentevals/storage/postgres/pool.py	Adds asyncpg pool factory with readiness retry.
src/agentevals/storage/postgres/migrator.py	Adds migration discovery + advisory-lock-protected migrator.
src/agentevals/storage/postgres/migrations/000001_init.up.sql	Adds baseline schema for sessions/runs/results and supporting tables/indexes.
src/agentevals/storage/postgres/migrations/000001_init.down.sql	Adds destructive rollback (drop schema).
src/agentevals/storage/postgres/__init__.py	Documents Postgres backend intent.
src/agentevals/storage/models.py	Adds Pydantic models for persisted Run/Result and trace targets + result_id hashing.
src/agentevals/storage/config.py	Adds env-driven storage settings and validation.
src/agentevals/storage/__init__.py	Adds `build_repos()` factory for selecting backend.
src/agentevals/runner.py	Extends `RunResult` with optional `run_id` for persistence linkage.
src/agentevals/run/worker.py	Adds async worker pool for claiming/running queued work with heartbeat/cancellation.
src/agentevals/run/sinks.py	Adds sink fanout (stdout/file/http webhook) with best-effort delivery.
src/agentevals/run/service.py	Adds `RunService` for idempotent submit/list/cancel and `/api/evaluate` persistence.
src/agentevals/run/result_builder.py	Adds pure helpers to build persisted Result rows + run summary.
src/agentevals/run/fetcher.py	Adds inline/http trace fetchers for worker execution.
src/agentevals/run/__init__.py	Documents run pipeline modules.
src/agentevals/cli.py	Adds `agentevals migrate` CLI group (up/down/version/force/create).
src/agentevals/api/runs_routes.py	Adds `/api/runs` router (submit/get/list/results/cancel).
src/agentevals/api/routes.py	Persists `/api/evaluate` runs/results when run_service is configured and returns `runId`.
src/agentevals/api/app.py	On startup: loads storage settings, runs migrations, builds repos, wires RunService, starts worker.
tests/storage/test_models.py	Unit tests for storage models and deterministic result_id hashing.
tests/storage/test_migrator.py	Tests migration discovery/schema substitution + optional live PG tests behind env var.
tests/storage/test_memory_repos.py	Contract tests for memory repo behavior (runs/results/sessions).
tests/storage/test_config.py	Tests env loading + validation for StorageSettings.
tests/storage/__init__.py	Adds storage test package.
tests/run/test_sinks.py	Tests stdout/file/webhook sink behavior and fanout failure isolation.
tests/run/test_service.py	Tests RunService submit/idempotency/conflicts + `/api/evaluate` persistence path.
tests/run/test_result_builder.py	Tests result projection, status mapping, summary building, and evaluator classification.
tests/run/test_fetcher.py	Tests fetcher dispatch and validation paths.
tests/run/__init__.py	Adds run test package.
tests/api/test_runs_routes.py	HTTP-level tests for `/api/runs` endpoints (503 when unconfigured + happy paths via stub service).
tests/api/test_evaluate_persistence.py	HTTP-level tests that `/api/evaluate` persists when run_service is injected.
tests/api/__init__.py	Adds API test package.
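As an illustration of the migration discovery mentioned for migrator.py, the file names 000001_init.up.sql and 000001_init.down.sql suggest a <version>_<name>.<direction>.sql convention. The sketch below pairs files by version under that assumption; discover_migrations, the Migration tuple, and the rule that every version needs both an up and a down file are hypothetical and not taken from the PR's code.

```python
import re
from typing import Dict, Iterable, List, NamedTuple

# Assumed naming convention: <digits>_<name>.<up|down>.sql
MIGRATION_RE = re.compile(r"^(\d+)_([A-Za-z0-9_]+)\.(up|down)\.sql$")


class Migration(NamedTuple):
    version: int
    name: str
    up: str
    down: str


def discover_migrations(filenames: Iterable[str]) -> List[Migration]:
    """Pair up/down files by version and return them sorted by version."""
    found: Dict[int, Dict[str, str]] = {}
    for filename in filenames:
        match = MIGRATION_RE.match(filename)
        if match is None:
            continue  # not a migration file
        version, name, direction = int(match.group(1)), match.group(2), match.group(3)
        entry = found.setdefault(version, {"name": name})
        if entry["name"] != name:
            raise ValueError(f"version {version} has conflicting names")
        entry[direction] = filename
    migrations: List[Migration] = []
    for version in sorted(found):
        entry = found[version]
        if "up" not in entry or "down" not in entry:
            raise ValueError(f"version {version} is missing an up or down file")
        migrations.append(Migration(version, entry["name"], entry["up"], entry["down"]))
    return migrations
```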


krisztianfekete force-pushed the feature/add-durable-storage branch from 18785bd to 99247be on May 4, 2026 16:39

add foundational support for durable storage

5c6d499

krisztianfekete force-pushed the feature/add-durable-storage branch from 99247be to 5c6d499 on May 4, 2026 16:43

cleanup and simplify

5ee5122

krisztianfekete requested a review from Copilot May 5, 2026 12:55

Copilot started reviewing on behalf of krisztianfekete on May 5, 2026 12:55

Copilot AI reviewed May 5, 2026


Comment thread src/agentevals/storage/config.py

Comment thread src/agentevals/storage/config.py Outdated

Comment thread src/agentevals/storage/repos/postgres.py

Comment thread src/agentevals/api/runs_routes.py Outdated

address review feedback

2e3636a

krisztianfekete marked this pull request as ready for review May 5, 2026 14:55

krisztianfekete requested a review from peterj May 5, 2026 20:41

krisztianfekete added 2 commits May 6, 2026 14:04

add preview disclaimers

4d3e41b

switch to pg 18.3-alpine

6b41538

krisztianfekete merged commit 78f0d68 into main May 6, 2026
5 checks passed

krisztianfekete deleted the feature/add-durable-storage branch May 6, 2026 13:54

krisztianfekete mentioned this pull request May 6, 2026

Schema + APIs for durable storage #139

Closed

krisztianfekete merged 5 commits into main from feature/add-durable-storage