Test plan: open-issue coverage for staging/api-hardening #355

@norrietaylor

Test Plan — staging/api-hardening Open-Issue Coverage

Context

staging/api-hardening is 152 commits ahead of main. It bundles two waves of work:

  1. MCP API hardening (epic #245, "Harden MCP interface: tool descriptions, error codes, validation, docs") — standardised error codes, validation helpers, tool-description rewrites, default output_mode shrinkage, default-status filter, group-by additions, new distillery_status/store_batch tools.
  2. Feed/sync hardening — gh-sync project/tag/author backfill, RSS/GitHub real author, watch URL probe, watch liveness metadata, async background sync jobs, Jina truncation.

Plus security follow-up #112 and CI/CVE #271.

This plan enumerates the 34 open issues tabulated below that have code landed on the branch, each with a self-contained test scenario a Claude subagent can execute. The intent is parallel dispatch: one worker subagent per group, with Groups A–D split by isolation requirement (pure unit, in-memory store, real HTTP MCP, CI/manifest) and Groups E and F defined under Group key.

Branch under test: staging/api-hardening
Base for diff: main
Commit range: git log main..staging/api-hardening (152 commits)

Open issues with landed work

| # | Title | Surface | Group |
| --- | --- | --- | --- |
| 112 | Security follow-up (TLS, ownership, CORS, pin deps, log retention) | embedding/*, mcp/auth.py, mcp/middleware.py, pyproject.toml, mcp/webhooks.py | C |
| 232 | distillery_store enum omits github | models.py:EntryType, mcp/tools/crud.py:_VALID_ENTRY_TYPES | A |
| 238 | distillery_store output_mode="summary" | mcp/tools/crud.py:_handle_store | B |
| 240 | /gh-sync invalid output_mode="metadata" | mcp/tools/feeds.py:_handle_gh_sync | B |
| 241 | label→tag sanitiser fails on underscored labels | feeds/github_sync.py:_sanitize_tag_segment | A |
| 244 | bulk ingest: store_batch + watch --sync-history | mcp/tools/crud.py:_handle_store_batch, feeds.py | B |
| 245 | MCP hardening epic | mcp/tools/_errors.py, _common.py, all handlers | A+B |
| 266 | CASCADE on dropping FTS schema | store/duckdb.py:_rebuild_fts_index | B |
| 269 | setup/watch CronCreate uses MCP not webhook | skills/setup/, skills/watch/ | D |
| 271 | suppress upstream CVEs in Docker base | .grype.yaml | D |
| 274 | history sync exceeds Jina 8194-token limit | feeds/truncation.py, embedding/jina.py | B |
| 276 | --sync history async to avoid timeout | feeds/sync_jobs.py, github_sync.py:sync_batched | B+C |
| 278 | gh-sync use store_batch async pipeline | feeds/sync_jobs.py, github_sync.py | B+C |
| 283 | group_by='tags' in distillery_list | mcp/tools/crud.py:_handle_list | B |
| 286 | stale distillery_tag_tree permission | .claude/settings.local.json | D |
| 301 | classify --batch with filters | mcp/tools/classify.py, cli.py, skills/classify | B |
| 302 | sync uses real author | feeds/github_sync.py, feeds/rss.py, feeds/poller.py | B |
| 303 | dynamic MCP transport for SessionStart hook | scripts/hooks/session_start_briefing.py | C |
| 307 | distillery_stale missing → route to list | skills/briefing/SKILL.md, mcp/tools/crud.py | D |
| 308 | watch accepts invalid/unreachable URLs | mcp/tools/feeds.py:_validate_url_syntax/_probe_url | B+C |
| 309 | distillery_list(source=feed_url) returns 0 | mcp/tools/crud.py:_build_filters_from_arguments | B |
| 310 | watch list omits liveness metadata | feeds/poller.py, store/duckdb.py, migration 12 | B |
| 311 | list default output_mode=full floods context | mcp/tools/crud.py:_handle_list | B |
| 312 | gh-sync project=null tags=[] | feeds/github_sync.py, cli.py:backfill_github_metadata | B |
| 313 | no distillery_status MCP tool | mcp/tools/meta.py, mcp/server.py | A |
| 314 | ghost entry_ids on dedup-skip | mcp/tools/crud.py:_handle_store dedup branch | B |
| 315 | resolve_review reviewer ignored | mcp/tools/classify.py, mcp/server.py | B |
| 316 | resolve_review reclassify leaves pending_review | mcp/tools/classify.py | B |
| 317 | list default includes archived | mcp/tools/crud.py:_apply_default_status_filter | B |
| 330 | docs: stale tool count + self-host guidance in plugin-install.md | docs/install/plugin-install.md | D |
| 332 | dedup_action="merged" returns independent new entry_id (PR #341 merged) | mcp/tools/crud.py:_handle_store merge branch | B |
| 333 | resolve_review double-approve silently bumps version (PR #339 merged) | mcp/tools/classify.py:_handle_resolve_review | B |
| 334 | watch list liveness fields exposed but never populated (PR #338 merged) | feeds/poller.py, store/duckdb.py liveness writes | B |
| 335 | source=<url> vs feed_url=<url> diverge (PR #340 merged) | mcp/tools/crud.py:_build_filters_from_arguments | B |

Update 2026-04-19: PRs #352, #353, #354, #358, #359, #360, #361 (issues 345–351) are now merged to staging/api-hardening. Group E scenarios run against staging directly — no separate branch checkouts required. PRs #356/#357 against main are superseded; staging itself is landing on main tonight.

Group key

  • A — pure schema/enum/static check. No runtime needed. Subagent reads source + runs targeted pytest.
  • B — in-memory async store + handler. Subagent runs pytest -k <pattern> against in-memory DuckDB fixture or invokes handler directly.
  • C — needs running MCP HTTP server (TLS/CORS/transport probe).
  • D — manifest/skill text/CI config. Read-only assertion against repo files.
  • E — issues 345–351 (now on staging). Subagent runs the listed pytest + direct calls against the staging worktree.
  • F — agent-driven E2E user journeys against a live MCP. Subagent acts as a real client, chaining tool calls across multiple issues per scenario.

Subagent dispatch strategy

Spawn 6 worker subagents in parallel, each owning one group (A, B, C, D, E, F). A 7th orchestrator (this conversation) collects reports and aggregates pass/fail. All groups share the staging/api-hardening worktree; Groups C and F additionally need a running HTTP MCP server.

Per-subagent prompt template:

You are testing the staging/api-hardening branch of distillery2. Checkout staging/api-hardening in a worktree before starting (or operate in the existing worktree if provided). For each scenario in the assigned group, execute the listed steps, capture actual output, and report PASS/FAIL with one-line evidence (test name, error message, or response snippet). Do not modify source. If a test fixture is missing, mark BLOCKED. Final report: markdown table | issue | scenario | result | evidence |.

Common prerequisite for groups B and C:

git worktree add /tmp/distillery-test staging/api-hardening
cd /tmp/distillery-test
pip install -e ".[dev]" --quiet

For group C, additionally:

distillery-mcp --transport http --port 8765 &

Group A — Schema / enum / static (1 subagent)

#232 — github entry type

  • Read src/distillery/models.py → assert EntryType.GITHUB == "github" and TYPE_METADATA_SCHEMAS["github"] exists with required keys {repo, ref_type, ref_number}.
  • Read src/distillery/mcp/tools/crud.py → assert _VALID_ENTRY_TYPES contains "github".
  • Run: pytest tests/ -k "github_entry_type or entry_type_github" -v
  • Pass: all assertions hold + tests green.

#241 — sanitiser

  • Run: pytest tests/ -k "sanitize_tag or sanitiser or sanitizer" -v
  • Direct call test (subagent inlines):
    from distillery.feeds.github_sync import _sanitize_tag_segment
    assert _sanitize_tag_segment("github_actions") == "github-actions"
    assert _sanitize_tag_segment("__a__b__") == "a-b"
    assert _sanitize_tag_segment("Foo.Bar") == "foo-bar"
    assert _sanitize_tag_segment("123abc") == "123abc"
  • Pass: no exception, all four equalities hold.
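The four equalities can also be run table-driven so that a failure reports every mismatching case instead of stopping at the first assert. A sketch only: `check_sanitiser` is a hypothetical helper, and `stand_in_sanitiser` is a placeholder that happens to satisfy the four cases. It is not the project's `_sanitize_tag_segment` and says nothing about how that function is implemented.

```python
import re
from typing import Callable

# Input/expected pairs copied from the direct-call test above.
CASES = [
    ("github_actions", "github-actions"),
    ("__a__b__", "a-b"),
    ("Foo.Bar", "foo-bar"),
    ("123abc", "123abc"),
]


def check_sanitiser(fn: Callable[[str], str]) -> list[str]:
    """Return one message per failing case; an empty list means PASS."""
    failures = []
    for raw, expected in CASES:
        actual = fn(raw)
        if actual != expected:
            failures.append(f"{raw!r}: expected {expected!r}, got {actual!r}")
    return failures


def stand_in_sanitiser(segment: str) -> str:
    # Hypothetical stand-in used only to exercise the checker:
    # lowercase, collapse non-alphanumeric runs to "-", trim edge dashes.
    return re.sub(r"[^a-z0-9]+", "-", segment.lower()).strip("-")
```

In a real run the subagent would pass the imported `_sanitize_tag_segment` to `check_sanitiser` and paste any returned messages into the evidence column.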

#245 — error code surface

  • Read src/distillery/mcp/tools/_errors.py → assert ToolErrorCode has exactly: INVALID_PARAMS, NOT_FOUND, CONFLICT, INTERNAL, FORBIDDEN, BUDGET_EXCEEDED, RATE_LIMITED.
  • Run: pytest tests/test_mcp_errors.py -v
  • Run: pytest tests/ -k "validate_required or validate_enum or validate_limit" -v
  • Pass: enum members match + suite green.
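"Has exactly" is a two-sided check: no member missing and none extra. A minimal sketch of that comparison, using a stand-in enum because the real `ToolErrorCode` lives in the repo; the member values here (value equal to name) are an assumption:

```python
from enum import Enum

# Member names taken from the scenario above.
EXPECTED_CODES = {
    "INVALID_PARAMS", "NOT_FOUND", "CONFLICT", "INTERNAL",
    "FORBIDDEN", "BUDGET_EXCEEDED", "RATE_LIMITED",
}


def diff_members(enum_cls, expected=EXPECTED_CODES):
    """Return (missing, unexpected) member names for enum_cls."""
    actual = set(enum_cls.__members__)
    expected = set(expected)
    return expected - actual, actual - expected


class StandInCodes(str, Enum):
    # Stand-in for illustration only; values are assumed, not read from the repo.
    INVALID_PARAMS = "INVALID_PARAMS"
    NOT_FOUND = "NOT_FOUND"
    CONFLICT = "CONFLICT"
    INTERNAL = "INTERNAL"
    FORBIDDEN = "FORBIDDEN"
    BUDGET_EXCEEDED = "BUDGET_EXCEEDED"
    RATE_LIMITED = "RATE_LIMITED"
```

Reporting both sets gives the one-line evidence the report needs, for example which code was added without the plan being updated.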

#313 — distillery_status registered

  • Run: pytest tests/test_mcp_meta.py -v (or tests/ -k status)
  • Direct check: import distillery.mcp.server and assert distillery_status is in the registered tool list (introspect FastMCP instance).
  • Pass: tool registered, returns dict with keys {status, version, transport, tool_count, store, embedding_provider}.

Group B — In-memory store + handler (1 subagent, runs full pytest by topic)

Use tests/conftest.py fixtures: store, make_entry, deterministic_embedding_provider.

Each scenario: subagent runs the listed pytest pattern AND inlines a direct handler call to verify response shape against the issue's acceptance criteria.

#238 / #311 / #317 / #309 / #283 — distillery_list extensions

#232 / #238 / #314 — distillery_store

#244 — store_batch + watch sync_history

  • pytest tests/ -k "store_batch or sync_history" -v
  • Direct: _handle_store_batch({"entries":[{...},{...},{...}]}) → response has entry_ids (3), count==3, results list of 3 with persisted=True per entry.
  • _handle_watch({"action":"add","source_type":"github","url":"https://github.com/python/cpython","sync_history":True}) → response includes sync_job with job_id.

#266 — FTS CASCADE

  • pytest tests/ -k "fts_cascade or rebuild_fts" -v
  • Direct: open store, force _rebuild_fts_index() twice in sequence → no exception.

#283 — covered above.

#301 — classify --batch

  • pytest tests/test_mcp_classify.py -k "batch or filter" -v
  • CLI smoke: distillery classify --batch (no filter) → exit non-zero, stderr contains at least one filter.
  • distillery classify --batch --inbox → exits 0; processes ≤50 entries.

#302 — real author

  • pytest tests/test_real_author.py -v
  • Direct: feed a fake GitHub issue payload with user.login=alice through GitHubSyncAdapter → resulting Entry has author=="alice" and metadata["imported_by"]=="gh-sync".

#308 — watch URL validation (handler-level)

  • pytest tests/test_mcp_watch.py -v
  • Direct calls:
    • _handle_watch({"action":"add","source_type":"rss","url":"not a url"}) → INVALID_PARAMS (or INVALID_URL); no DB row.
    • _handle_watch({"action":"add","source_type":"github","url":"owner/repo"}) → accepted (bare slug allowed for github).
    • _handle_watch({"action":"add","source_type":"rss","url":"owner/repo"}) → rejected.

#310 — watch liveness metadata

  • pytest tests/test_mcp_feeds.py -k "liveness or last_polled or item_count" -v
  • pytest tests/test_poller.py -k "record_poll_status" -v
  • Direct: add a source, poll once via FeedPoller, then _handle_watch({"action":"list"}) → entry includes last_polled_at, last_item_count, last_error (null on success), next_poll_at.

#312 — gh-sync project + tags backfill

  • pytest tests/test_mcp_feeds.py -k "project or backfill" -v
  • Direct: sync a single GitHub issue payload → resulting Entry has project=="<repo-name>", tags contains source/github, repo/<name>, ref-type/issue, state/<x>.
  • CLI: distillery maintenance backfill-github-metadata --dry-run → reports counts of entries it WOULD update.

#315 — reviewer parameter

  • pytest tests/test_mcp_classify.py -k "reviewer or actor or on_behalf_of" -v
  • Direct: call _handle_resolve_review({"entry_id": id, "action":"approve","reviewer":"bob"}) from server context with actor="alice" → entry metadata gains reviewed_by="alice", reviewed_on_behalf_of="bob".
  • Same call without delegation (reviewer="alice", actor="alice") → no *_on_behalf_of field.
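The two expectations above reduce to one rule: the authenticated actor is always recorded, and the delegation field appears only when the named reviewer differs. A sketch of that rule under those assumptions; `review_audit_fields` is a hypothetical helper, not the handler's code:

```python
from typing import Optional


def review_audit_fields(actor: str, reviewer: Optional[str] = None) -> dict:
    """Build the audit metadata the #315 scenario expects to find on the entry."""
    fields = {"reviewed_by": actor}
    # Delegation is recorded only when someone other than the actor is named.
    if reviewer and reviewer != actor:
        fields["reviewed_on_behalf_of"] = reviewer
    return fields
```

The subagent can compare this expected dict against the relevant keys of the stored entry's metadata after each call.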

#316 — reclassify status

  • pytest tests/test_mcp_classify.py -k "reclassify_status or reclassify_pending" -v
  • Direct: seed entry with status="pending_review", call _handle_resolve_review({"action":"reclassify",...}) → entry status=="active". Repeat with seed status="archived" → status remains archived.

#274 — Jina truncation

  • pytest tests/test_truncation.py -v
  • Direct: truncate_content("x" * 60_000) returns ≤ 30_000 chars + [truncated] suffix.

#276 / #278 — async sync jobs

  • pytest tests/test_async_sync_pipeline.py -v
  • Direct: kick off run_sync_job_async(...) against a stub adapter → SyncJobTracker.get(job_id) transitions PENDING→RUNNING→COMPLETED with pages_processed > 0.

#332 — dedup_action="merged" ghost id (PR #341 merged on staging)

  • pytest tests/ -k "merge or fold or dedup_action_merged" -v
  • Direct: configure dedup thresholds so a second store call lands in the merge band (≥0.80, <0.95). Call _handle_store(...) twice → second response: dedup_action == "merged", entry_id == first_entry_id (true fold; no fresh row). Verify with _handle_get(entry_id=first_entry_id) — content/refs folded into existing row, version incremented.
  • Negative: a "stored" path (similarity < 0.60) must NOT report dedup_action="merged".

#333 — resolve_review double-approve idempotency (PR #339 merged on staging)

  • pytest tests/test_mcp_classify.py -k "double_approve or idempotent or no_op" -v
  • Direct: seed entry with status="active", capture version=N. Call _handle_resolve_review({"action":"approve","entry_id":id}) → response indicates no-op (e.g. changed: false); entry version still N.
  • Repeat for archive on already-archived entry: no version bump, no audit-log duplicate.

#334 — watch list liveness actually populated (PR #338 merged on staging)

  • pytest tests/test_mcp_feeds.py -k "liveness or populate" -v
  • Direct: add a feed source, run FeedPoller.poll_once() against a stub adapter, then _handle_watch({"action":"list"}) → row has non-null last_polled_at AND non-null last_item_count (not just exposed-but-null). After a forced failure, last_error non-null and ≤200 chars.
  • Sync path: kick off a sync_history job; while RUNNING and after COMPLETED, list shows liveness fields update from sync writes too.

#335 — source=<url> aliases to feed_url (PR #340 merged on staging)

  • pytest tests/ -k "source_alias or feed_url_alias" -v
  • Direct: seed feed source https://x.test/rss with 3 entries. _handle_list({"source": "https://x.test/rss"}) and _handle_list({"feed_url": "https://x.test/rss"}) MUST return identical entry sets. _handle_list({"source": "manual"}) (enum value) still works as a source-type filter — alias only kicks in when the value parses as a URL.
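The aliasing rule can be stated as a small pure function, which is useful for deciding what the handler should return before calling it. This is a sketch: how the project decides that a value "parses as a URL" is not stated in the plan, so the http/https-plus-host test below is an assumption, and `build_source_filter` with its output keys is hypothetical.

```python
from urllib.parse import urlparse


def looks_like_url(value: str) -> bool:
    # Assumed rule: an http(s) scheme and a non-empty host.
    parsed = urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def build_source_filter(args: dict) -> dict:
    """Map source=/feed_url= arguments to a filter dict (illustrative keys)."""
    filters = {}
    if "feed_url" in args:
        filters["feed_url"] = args["feed_url"]
    source = args.get("source")
    if source is not None:
        if looks_like_url(source):
            # URL-shaped source values alias to the feed_url filter.
            filters.setdefault("feed_url", source)
        else:
            # Enum values such as "manual" stay a source-type filter.
            filters["source"] = source
    return filters
```

With this rule, `{"source": "https://x.test/rss"}` and `{"feed_url": "https://x.test/rss"}` produce the same filter, while `{"source": "manual"}` is untouched.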

Group C — Real HTTP MCP server (1 subagent, more setup)

#112 — security follow-up

Subagent runs:

  1. TLS verify=True audit — grep all httpx.Client( and httpx.AsyncClient( callsites; assert each constructed with verify=True (or default which is True; flag any explicit verify=False).
    grep -rn "httpx.Client\|httpx.AsyncClient" src/ | grep -v "verify=True"
    
    Expected: empty output OR every match passes default verify (no verify=False anywhere).
  2. Ownership on classify — pytest tests/ -k "ownership and classify" -v. Direct: as user-A, store entry; as user-B, call distillery_classify on it → FORBIDDEN.
  3. CORS — start HTTP server with default config; curl -H 'Origin: https://evil.test' -i http://localhost:8765/mcp → response must NOT echo Access-Control-Allow-Origin: https://evil.test. Then start with cors_allowed_origins=["https://ok.test"] and confirm allowed origin echoes back.
  4. Dep pinning — open pyproject.toml and assert upper bounds present on pyyaml, httpx, fastmcp, defusedxml.
  5. Log retention — invoke /api/maintenance with bearer token; assert response includes search_logs_pruned: <n> field. Verify config defaults search_log_retention_days == 90.
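For step 1, the grep pipeline also lists every call that relies on the default and every call whose verify= keyword sits on a later line, so its output needs manual triage. An AST scan that flags only an explicit verify=False is one alternative. A sketch under stated limits: it recognises only the `httpx.Client(...)` / `httpx.AsyncClient(...)` attribute form, not clients imported by bare name or verify values computed at runtime; `find_insecure_clients` is a hypothetical helper.

```python
import ast

CLIENT_NAMES = {"Client", "AsyncClient"}


def find_insecure_clients(source: str, filename: str = "<src>") -> list[int]:
    """Return line numbers of httpx client constructions passing verify=False."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_httpx_client = (
            isinstance(func, ast.Attribute)
            and func.attr in CLIENT_NAMES
            and isinstance(func.value, ast.Name)
            and func.value.id == "httpx"
        )
        if not is_httpx_client:
            continue
        for kw in node.keywords:
            # Flag only a literal False; the httpx default (True) is fine.
            if (
                kw.arg == "verify"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is False
            ):
                hits.append(node.lineno)
    return sorted(hits)
```

Running it over each file under src/ and reporting `path:line` for every hit gives the same evidence as the grep, without the false positives.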

#303 — dynamic transport

  • pytest tests/test_session_start_briefing.py -v
  • Manual: with DISTILLERY_MCP_URL=http://localhost:8765/mcp set, run python scripts/hooks/session_start_briefing.py → exits 0, prints briefing.
  • Unset env, place a .mcp.json at cwd with stdio entry → re-run, hook resolves stdio.
  • Both env unset and no manifest → fallback to localhost:8000; hook reports unreachable cleanly.
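The three manual cases describe a resolution chain: environment variable, then manifest, then fallback. A sketch of that chain for orientation only. Assumptions are marked in the comments: the `mcpServers` layout follows the usual `.mcp.json` shape, the `distillery` server key and the `/mcp` path on the fallback URL are guesses, and `resolve_transport` is not the hook's actual function.

```python
import json
from pathlib import Path
from typing import Mapping

# The plan says "fallback to localhost:8000"; the /mcp path is assumed.
FALLBACK_URL = "http://localhost:8000/mcp"


def resolve_transport(env: Mapping[str, str], cwd: Path) -> dict:
    """Resolve the MCP transport: env var, then .mcp.json, then fallback."""
    url = env.get("DISTILLERY_MCP_URL")
    if url:
        return {"kind": "http", "url": url, "via": "env"}

    manifest = cwd / ".mcp.json"
    if manifest.is_file():
        servers = json.loads(manifest.read_text()).get("mcpServers", {})
        entry = servers.get("distillery")  # hypothetical server key
        if entry:
            if "url" in entry:
                return {"kind": "http", "url": entry["url"], "via": "manifest"}
            if "command" in entry:
                return {
                    "kind": "stdio",
                    "command": entry["command"],
                    "args": entry.get("args", []),
                    "via": "manifest",
                }

    return {"kind": "http", "url": FALLBACK_URL, "via": "fallback"}
```

Taking `env` and `cwd` as parameters lets each of the three cases be exercised from a temp directory without touching the real environment.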

#308 — watch URL probe (HTTP layer)

  • _handle_watch add against an unreachable host (e.g. https://nonexistent.invalid/feed.xml, which never resolves because .invalid is a reserved TLD) → returns UNREACHABLE_URL unless force=True.
  • HEAD-405 host (subagent stands up tiny aiohttp stub on :9001 returning 405 for HEAD, 200 for GET) → watch add succeeds (GET fallback).
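The HEAD-405 stub does not need a third-party dependency. The plan names aiohttp; the sketch below swaps in the standard library's `http.server` to stay self-contained. Handler and helper names are made up for this example, and the RSS body is a placeholder.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Head405Handler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        # Reject HEAD so the probe has to fall back to GET.
        self.send_response(405)
        self.send_header("Allow", "GET")
        self.end_headers()

    def do_GET(self):
        body = b"<rss version='2.0'><channel><title>stub</title></channel></rss>"
        self.send_response(200)
        self.send_header("Content-Type", "application/rss+xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_stub(port: int = 0) -> ThreadingHTTPServer:
    """Start the stub in a daemon thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), Head405Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def status_of(method: str, port: int, path: str = "/feed.xml") -> int:
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(method, path)
        return conn.getresponse().status
    finally:
        conn.close()
```

Call `start_stub(9001)` to match the port in the scenario, point the watch add at `http://127.0.0.1:9001/feed.xml`, and call `server.shutdown()` afterwards.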

#276 / #278 — async pipeline end-to-end

  • Start server. POST distillery_watch action=add url=<small repo> sync_history=true via MCP client → response contains job_id.
  • Poll distillery_sync_status job_id=<id> until status=="completed" or 60s timeout. Assert entries_created > 0 and errors == [].
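The poll step is a bounded wait. A minimal sketch with the status fetch injected, so the same loop works whether the subagent calls the tool over HTTP or through a client library; `poll_until_completed` and its parameters are hypothetical, and only the "completed" status string comes from the plan.

```python
import time
from typing import Callable


def poll_until_completed(
    fetch_status: Callable[[], dict],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until it reports completed, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status.get("status") == "completed":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job not completed after {timeout_s}s: {status}")
        sleep(interval_s)
```

After it returns, the scenario's assertions apply to the returned dict: `entries_created > 0` and `errors == []`. A timeout should be reported as FAIL with the last observed status as evidence.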

Group D — Skills, manifests, CI (1 subagent, read-only)

#269 — CronCreate uses MCP tool calls

  • Read skills/setup/SKILL.md and skills/setup/references/cron-payloads.md. Assert no occurrences of POST /hooks/poll, POST /api/maintenance, or HTTP-only references in cron sections.
  • Assert presence of distillery_list, distillery_watch, distillery_store tool calls in payload examples.
  • Read skills/watch/SKILL.md. Same assertions.

#271 — CVE suppression

  • Read .grype.yaml. Assert ≥40 entries.
  • Each suppression has vulnerability: and a justification: (or reason:) field with non-empty content.
  • Spot-check that CVE-2026-31790, CVE-2026-4786 are present.

#286 — stale permission

  • Read .claude/settings.local.json. Assert distillery_tag_tree does NOT appear.

#307 — stale section routing

  • Read skills/briefing/SKILL.md. Assert it references distillery_list with stale_days parameter (not a missing distillery_stale tool).
  • Grep skills/ for distillery_stale → no matches outside historical changelogs.

#330 — docs: stale tool count + self-host guidance

  • Read docs/install/plugin-install.md (or docs/skills/index.md — wherever total tool count is published).
  • Assert published count matches the actual registered tool count (introspect distillery.mcp.server or count tool decorators in src/distillery/mcp/tools/).
  • Assert presence of a self-host section covering (a) DISTILLERY_CONFIG env var, (b) HTTP transport with GitHub OAuth, (c) plugin user-scope override.
  • If the doc still hardcodes the old count (e.g. "12 tools" when current is higher) → FAIL.

Group E — Issues 345–351 (now on staging)

All seven PRs are merged to staging/api-hardening. Tests run against the single staging worktree.

| Issue | PR | Merge commit | Key files |
| --- | --- | --- | --- |
| #345 | #353 | 8c5f2be / 7a199e8 | mcp/tools/classify.py, mcp/tools/crud.py, tests/test_entry_type_suggestions.py |
| #346 | #360 | 00c1698 / 6d70436 | store/duckdb.py, tests/test_store_wal_durability.py |
| #347 | #359 | 69a49e8 / c75231e | scripts/hooks/session_start_briefing.py, scripts/hooks/session-start-briefing.sh |
| #348 | #354 | 24c6cc8 / 188591a | mcp/tools/crud.py, tests/test_conflict.py |
| #349 | #358 | df25c15 / 4c96143 | store/duckdb.py, tests/test_duckdb_store.py |
| #350 | #352 | 921aab1 / 8fe4382 | .github/workflows/staging-deploy.yml |
| #351 | #361 | 5e4f924 / ab842ec + CR rounds | embedding/{jina,openai,errors}.py, mcp/budget.py, mcp/tools/{crud,search}.py |

#345 (PR #353) — entry_type alias suggestions

  • pytest tests/test_entry_type_suggestions.py tests/test_mcp_classify.py tests/test_corrections.py tests/test_bulk_ingest.py -v
  • Direct calls (one each):
    • _handle_store({"content":"x","entry_type":"note"}) → INVALID_PARAMS with details.suggestion == "inbox", details.allowed is the 12-element canonical list, details.field == "entry_type", message contains Did you mean 'inbox'?.
    • Repeat for task→idea, pr→github, article→bookmark, summary→digest, doc→reference, contact→person, repo→project.
    • entry_type="ZzUnknownZz" → INVALID_PARAMS, details present but no suggestion key.
    • entry_type="NOTE" (case) and " note " (whitespace) both → suggestion inbox.
    • Reclassify path: _handle_resolve_review({"action":"reclassify","new_entry_type":"note",...}) → same suggestion.
    • Regression guard: no alias key collides with a canonical EntryType value (asserted in test_entry_type_suggestions.py).
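The expected error shape can be written down as a reference function to compare handler output against. A sketch with several assumptions: the alias pairs are read from the bullets above as alias→canonical, the canonical list here is only the eight alias targets (the plan says the real list has 12 elements, and the other four are not guessed), the message wording beyond "Did you mean" is invented, and `entry_type_error` is a hypothetical name.

```python
from typing import Optional

# Alias -> canonical pairs as listed in the scenario.
ALIASES = {
    "note": "inbox", "task": "idea", "pr": "github", "article": "bookmark",
    "summary": "digest", "doc": "reference", "contact": "person", "repo": "project",
}
# Illustrative subset only; the real canonical list has 12 entries.
KNOWN_CANONICAL = sorted(set(ALIASES.values()))


def entry_type_error(provided: str, allowed: list) -> Optional[dict]:
    """Return None for a valid type, else an INVALID_PARAMS-style error dict."""
    if provided in allowed:
        return None
    details = {"field": "entry_type", "provided": provided, "allowed": list(allowed)}
    message = f"Invalid entry_type {provided!r}."
    # Case and surrounding whitespace are ignored when looking up an alias.
    suggestion = ALIASES.get(provided.strip().lower())
    if suggestion is not None:
        details["suggestion"] = suggestion
        message += f" Did you mean {suggestion!r}?"
    return {"code": "INVALID_PARAMS", "message": message, "details": details}


def alias_collisions(allowed: list) -> set:
    """Regression guard: alias keys must never shadow a canonical value."""
    return set(ALIASES) & set(allowed)
```

Unknown types still return structured details without a `suggestion` key, which is the distinction the `ZzUnknownZz` case checks.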

#346 (PR #360) — checkpoint-after-write WAL durability

  • pytest tests/test_store_wal_durability.py tests/test_duckdb_store.py -v (target ≥118 passing)
  • Direct: open store, _handle_store(...), then inspect <db>.wal size — should be 0 or near-0 after the write (CHECKPOINT flushed it).
  • Recovery: simulate replay failure (mock _sync_initialize to raise), then trigger recovery → assert WAL renamed to *.wal.corrupt.<ts>, NOT unlinked. The original WAL bytes remain on disk under the new name.
  • Failure swallowing: monkeypatch the connection so CHECKPOINT raises → _handle_store still returns persisted=true; warning logged.
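The recovery expectation is rename, not delete. A minimal sketch of quarantining a WAL file that way, handy as a reference when checking what ends up on disk. The epoch-seconds timestamp format is an assumption (the plan only shows `<ts>`), and `quarantine_wal` is a hypothetical helper rather than the store's recovery code.

```python
import time
from pathlib import Path
from typing import Optional


def quarantine_wal(wal_path: Path, now: Optional[float] = None) -> Optional[Path]:
    """Rename <db>.wal to <db>.wal.corrupt.<ts>, keeping its bytes on disk."""
    if not wal_path.exists():
        return None
    ts = int(time.time() if now is None else now)  # timestamp format assumed
    target = wal_path.with_name(f"{wal_path.name}.corrupt.{ts}")
    wal_path.rename(target)
    return target
```

The scenario's assertion then has two halves: the original `.wal` path no longer exists, and a `*.wal.corrupt.*` sibling holds the original bytes.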

#347 (PR #359) — briefing hook tools/list probe

  • bash scripts/hooks/test-hooks.sh → 34/34 pass.
  • Manual reproduction:
    • Stand up a stub HTTP server on :9100 returning 404 on /health and a valid JSON-RPC tools/list response on POST /mcp. Run hook with DISTILLERY_MCP_URL=http://localhost:9100/mcp → exit 0, briefing rendered (no longer no-ops on the 404).
    • Stub returning 401 on /mcp → hook exits 0 with [Distillery] briefing disabled — auth failed on stderr.
    • With DISTILLERY_BRIEFING_QUIET=1 set, the diagnostic stderr line MUST be suppressed.

#348 (PR #354) — include_conflict_prompt flag

  • pytest tests/test_conflict.py -v
  • Direct calls (seed 3 near-duplicates first):
    • _handle_store({"content":"new...", ...}) (default) → response conflicts[*] carries entry_id, content_preview, similarity_score ONLY. NO conflict_prompt key. Total response bytes ≤ ~1KB.
    • _handle_store({"content":"new...","include_conflict_prompt":true, ...}) → each conflict carries conflict_prompt. Response size approx 3x larger (~3KB+ per docs).
    • output_mode="summary" (existing bulk-store fast path) still skips dedup+conflict entirely — unchanged behaviour, no regression.
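Both direct calls check the same two things: the key set on each conflict object and the serialized size of the response. A sketch of those checks as reusable helpers; the names are hypothetical, and measuring size as compact-enough `json.dumps` output in UTF-8 is an assumption about what "response bytes" means.

```python
import json

# Keys the default response is expected to carry on each conflict.
BASE_KEYS = {"entry_id", "content_preview", "similarity_score"}


def conflict_shape_errors(response: dict, expect_prompt: bool) -> list[str]:
    """Return one message per conflict whose key set is not exactly as expected."""
    expected = BASE_KEYS | ({"conflict_prompt"} if expect_prompt else set())
    errors = []
    for i, conflict in enumerate(response.get("conflicts", [])):
        if set(conflict) != expected:
            errors.append(
                f"conflicts[{i}] keys {sorted(conflict)} != {sorted(expected)}"
            )
    return errors


def response_bytes(response: dict) -> int:
    # Size as serialized JSON in UTF-8 (assumed definition of "response bytes").
    return len(json.dumps(response).encode("utf-8"))
```

Comparing `response_bytes` for the default call and the `include_conflict_prompt` call gives the size ratio the scenario asks about.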

#349 (PR #358) — FTS WAL replay with overwrite=1

  • pytest tests/test_duckdb_store.py::TestWalFtsReplayHardening -v
  • Direct: store._rebuild_fts_index() twice in sequence → no Cannot drop entry "fts_main_entries" error.
  • Reproduce subprocess SIGKILL test: spawn child that opens store, writes, calls rebuild, then SIGKILLs itself before clean shutdown. Reopen store in parent → no WAL replay error, FTS searchable.
  • Inspect rebuild path: confirm _rebuild_fts_index calls PRAGMA create_fts_index(..., overwrite=1) (no manual DROP SCHEMA ... CASCADE left in the code). Confirm a CHECKPOINT follows.

#350 (PR #352) — staging-deploy comment escaping

  • Read .github/workflows/staging-deploy.yml. Assert:
    • Both PR-comment blocks use gh pr comment --body-file - with a <<'EOF' heredoc (not --body "...\`url\`...").
    • Comment text mentions GET /mcp returns 405 (not 404), and bare hostname returns 404.
  • Validate parses: python3 -c "import yaml; yaml.safe_load(open('.github/workflows/staging-deploy.yml'))" exits 0.
  • Optional live check: after PR merges, the next /deploy-staging PR comment renders backticked URLs cleanly (no %5C%60 in Fly access logs).

#351 (PR #361) — embedding budget default unlimited

  • pytest tests/test_budget.py tests/test_embedding.py tests/test_mcp_errors.py tests/test_mcp_coverage_gaps.py -v (target 261 passing)
  • Direct:
    • Read src/distillery/config.py → embedding_budget_daily default == 0.
    • With default config, run 600 embed calls in a loop (mock provider returning fast) → no EmbeddingBudgetError. Set embedding_budget_daily=10 and run 11 calls → 11th raises EmbeddingBudgetError.
    • 429 path: monkeypatch Jina/OpenAI client to return HTTP 429 with Retry-After: 12 after retry exhaustion → EmbeddingRateLimitError raised; MCP tool surfaces INVALID_PARAMS with details.provider, details.endpoint, details.http_status==429, details.retry_after==12. WARNING line in logs includes provider name.
    • 429 without Retry-After header → error still raised, retry_after field absent or null.
    • Follow-on commits (b247f3e, 516b694, b983903): OpenAI.embed() routes through embed_batch() (structured errors), non-finite Retry-After values pinned, provider errors propagate through store dedup precheck. Spot-check: inject Retry-After: inf → error surfaces with retry_after clamped/omitted (no traceback).
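The Retry-After cases (12, header absent, inf) can be captured in one parsing rule. A sketch that takes the "omitted" branch of "clamped/omitted"; whether the project clamps instead is not stated, so that choice is an assumption, and `parse_retry_after` is a hypothetical name. The HTTP-date form of the header is deliberately not handled.

```python
import math
from typing import Optional


def parse_retry_after(header: Optional[str]) -> Optional[float]:
    """Parse a delay-seconds Retry-After value; return None when unusable."""
    if header is None:
        return None
    try:
        seconds = float(header.strip())
    except ValueError:
        # Covers the HTTP-date form and garbage; both are treated as absent here.
        return None
    # float() accepts "inf" and "nan", so non-finite values need an explicit check.
    if not math.isfinite(seconds) or seconds < 0:
        return None
    return seconds
```

The explicit `math.isfinite` check matters because `float("inf")` parses without error and would otherwise flow into `details.retry_after`.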

Group F — Agent-driven E2E user journeys (1 subagent, live MCP)

Each scenario drives a live staging MCP as a real client would: sequential tool calls across multiple issues, verifying behaviour observable from outside the server. No pytest — the subagent speaks MCP JSON-RPC (or uses the distillery CLI / Python client) and inspects responses.

Setup (once per run):

git worktree add /tmp/distillery-e2e staging/api-hardening
cd /tmp/distillery-e2e
pip install -e ".[dev]" --quiet
rm -f /tmp/distillery-e2e.db*
DISTILLERY_DB_PATH=/tmp/distillery-e2e.db \
DISTILLERY_AUTH_ALLOW_LOOPBACK=1 \
DISTILLERY_EMBEDDING_PROVIDER=deterministic \
distillery-mcp --transport http --port 8765 &
export MCP=http://localhost:8765/mcp
sleep 2 && curl -sf $MCP -X POST -H 'content-type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' | jq '.result.tools | length'
# expect: tool count > 12 (includes distillery_status, distillery_store_batch)

Agent-driver protocol. The subagent MUST call MCP tools via JSON-RPC over HTTP (or the Python FastMCP client), NOT by importing handlers. Every scenario must close with a cleanup that drops or archives every entry/source it created.

Each scenario's pass criterion is a cross-cutting predicate — not just "a single field equals X", but "the chained workflow an agent would run actually works".
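For the JSON-RPC route, each tool invocation is a `tools/call` request with the tool name and arguments under `params`. A minimal standard-library sketch of building such a request. It covers only request construction: whether the server first requires an initialize handshake or a session header is not addressed here, and the Accept header value is an assumption about what an HTTP MCP server wants. Helper names are hypothetical.

```python
import itertools
import json
import urllib.request

_ids = itertools.count(1)  # monotonically increasing JSON-RPC request ids


def tool_call_payload(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 tools/call payload for one MCP tool invocation."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Wrap the payload in a POST request; send it with urllib.request.urlopen."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed: accept either a plain JSON body or an event stream.
            "Accept": "application/json, text/event-stream",
        },
    )
```

For example, `build_request("http://localhost:8765/mcp", tool_call_payload("distillery_status", {}))` produces the request for step 1 of F1; the response body still has to be decoded and checked for a JSON-RPC `error` member before the scenario's assertions run.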


F1 — Capture-to-classify round trip (covers #245, #311, #313, #314, #317, #332, #348)

  1. distillery_status → returns {status:"ok", tool_count, transport:"http"}. Record tool_count.
  2. distillery_store content="Research note about Claude prompt caching TTL" entry_type="inbox" → response has persisted:true, dedup_action:"stored", entry_id set, conflicts present but each conflict object has NO conflict_prompt key (default off).
  3. distillery_store content="Research note about Claude prompt caching TTL" entry_type="inbox" (same content) → persisted:false, dedup_action:"skipped", entry_id == existing_entry_id (no ghost). Then distillery_get entry_id=<returned> succeeds.
  4. distillery_store content="Research note about Claude prompt caching time-to-live" entry_type="inbox" (near-dup, merge band) → dedup_action:"merged", entry_id == first_entry_id. Version on first entry incremented by 1 (verify with get).
  5. distillery_list (no args) → default output_mode="summary" (response bytes < 1.5KB for 1 entry; no conflicts/versions/metadata on rows). No archived entries included.
  6. distillery_resolve_review entry_id=<id> action="approve" → on already-active entry, response is no-op (changed:false), version NOT bumped (#333 regression guard).
  7. distillery_store content="Similar research on cache" entry_type="inbox" include_conflict_prompt=true → each conflict object NOW has conflict_prompt (~1KB string). Response bytes ≥ 2x a default call.
  8. Cleanup: archive created entries.

Pass: every assertion above holds. Fail: any response shape or field missing.


F2 — Entry-type alias suggestion flow (covers #232, #245, #345)

  1. distillery_store content="todo: wire radar digest" entry_type="note" → error code:"INVALID_PARAMS", message contains Did you mean 'inbox'?, details.suggestion == "inbox", details.allowed is a 12-element array including "github", details.provided == "note".
  2. Retry with entry_type="inbox" → success.
  3. distillery_store content="gh-17" entry_type="pr" → suggestion "github". Retry with "github" + required metadata {repo, ref_type:"pr", ref_number:17} → success.
  4. distillery_store content="x" entry_type="ZzZz" → INVALID_PARAMS, details present but details.suggestion key absent.
  5. distillery_store content="x" entry_type=" NOTE " (case + whitespace) → still suggests inbox.
  6. Cleanup.

Pass: alias map works on both store and reclassify paths; unknown types still get structured details.


F3 — Watch: URL validation → liveness → async sync (covers #276, #278, #302, #308, #310, #312, #334)

  1. distillery_watch action="add" source_type="rss" url="not a url" → INVALID_PARAMS (or INVALID_URL), nothing persisted.
  2. distillery_watch action="add" source_type="rss" url="https://nonexistent.invalid.test/feed.xml" → UNREACHABLE_URL (or similar), not persisted. Retry with force=true → persists.
  3. distillery_watch action="add" source_type="github" url="https://github.com/norrietaylor/distillery" sync_history=true → response includes sync_job.job_id. Remember the id.
  4. Poll distillery_sync_status job_id=<id> every 3s until status == "completed" or 90s timeout. Assert entries_created > 0 and errors == [].
  5. distillery_watch action="list" → the GitHub source has non-null last_polled_at AND non-null last_item_count (#334: fields actually populated, not just exposed), non-null next_poll_at. last_error is null.
  6. distillery_list source="https://github.com/norrietaylor/distillery" AND distillery_list feed_url="https://github.com/norrietaylor/distillery" → identical result sets (#335 alias).
  7. Pick one entry; assert entry.project == "distillery" (#312), entry.tags contains source/github and repo/distillery and a ref-type/*, entry.author is a real GitHub login (#302: not the "gh-sync" literal).
  8. Cleanup: distillery_watch action="remove" url=..., archive ingested entries.

Pass: the full ambient-intel path works end-to-end; the agent can trust the liveness table and the real-author metadata for downstream skills.


F4 — Classify batch + review queue (covers #301, #315, #316)

  1. Seed 5 entries via distillery_store_batch with entry_type="inbox", distinct content, all with status="pending_review" forced (or via classification that routes to review).
  2. distillery_classify batch endpoint (via CLI: distillery classify --batch) with no filter → exit non-zero, stderr contains at least one filter.
  3. distillery classify --batch --inbox → exits 0; processes all 5; output reports counts by disposition.
  4. distillery_resolve_review entry_id=<id-1> action="approve" reviewer="bob" called as actor alice → entry metadata: reviewed_by:"alice", reviewed_on_behalf_of:"bob" (#315).
  5. Same call with reviewer="alice" (= actor) → no *_on_behalf_of field written.
  6. distillery_resolve_review entry_id=<id-2> action="reclassify" new_entry_type="reference" (entry is pending_review) → post-state status == "active" (#316), not still pending.
  7. distillery_list (default) → the reclassified entry appears (no longer hidden from default view).
  8. Cleanup.

Pass: review-queue exits align with reviewer/actor audit expectations; batch CLI composes filters cleanly.


F5 — WAL durability + FTS replay (covers #266, #346, #349)

  1. Start staging MCP against a fresh on-disk DB (not in-memory).
  2. Store 10 entries in rapid succession.
  3. Trigger FTS rebuild (either via a distillery_search call that forces rebuild, or direct CLI maintenance: distillery maintenance rebuild-fts). Do it twice in a row — no Cannot drop entry "fts_main_entries" error.
  4. Kill the server with SIGKILL (not SIGTERM). Restart it against the same DB path.
  5. distillery_list → all 10 entries present (no WAL discarded by the recovery path, #346).
  6. Look in the DB directory: any .wal.corrupt.<ts> files are preserved (if recovery fired). No silently-unlinked WALs.
  7. distillery_search query="..." → FTS operational, returns expected hits.
  8. Cleanup.

Pass: a hard kill mid-write does not lose committed entries; operators retain the corrupt WAL for forensics.


F6 — Briefing hook dynamic transport (covers #303, #347)

The subagent runs the hook itself, not inside a Claude Code session.

  1. With DISTILLERY_MCP_URL=http://localhost:8765/mcp set, run python scripts/hooks/session_start_briefing.py → exit 0, briefing text on stdout (recent entries, corrections, radar).
  2. Unset env. Create a temp dir with a .mcp.json pointing at the same HTTP URL. Run the hook from that dir → resolves via .mcp.json, exit 0.
  3. Stand up a stub on :9100 that returns 404 on /health and a valid JSON-RPC tools/list on POST /mcp. Set DISTILLERY_MCP_URL=http://localhost:9100/mcp → hook exits 0 with briefing rendered (no silent no-op on /health 404).
  4. Stub returning 401 on /mcp → hook exits 0, stderr has [Distillery] briefing disabled.
  5. Re-run step 4 with DISTILLERY_BRIEFING_QUIET=1 → stderr silent.

Pass: hook resolves transport from the full env/manifest chain and no longer requires a /health sibling.
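
The stub in step 3 could be stood up with the standard library alone. This is a minimal sketch, not part of the repository: the names StubMCPHandler and start_stub are hypothetical, the stub answers every POST /mcp with an empty tools list regardless of the JSON-RPC method, and a real hook may send other requests (for example an initialize handshake) that a fuller stub would need to handle.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubMCPHandler(BaseHTTPRequestHandler):
    """Stub: every GET (including /health) is 404; POST /mcp returns a JSON-RPC result."""

    def do_GET(self):
        # Mimic a server that has no /health sibling endpoint.
        self.send_error(404)

    def do_POST(self):
        if self.path != "/mcp":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        # Echo the request id back with an empty tool catalog.
        body = json.dumps(
            {"jsonrpc": "2.0", "id": request.get("id"), "result": {"tools": []}}
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep scenario output quiet


def start_stub(port: int = 9100) -> HTTPServer:
    """Serve the stub on 127.0.0.1 in a daemon thread; call .shutdown() to stop it."""
    server = HTTPServer(("127.0.0.1", port), StubMCPHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

For step 4, the same handler could be varied to call self.send_error(401) in do_POST.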


F7 — Embedding 429 surfacing (covers #245, #351)

  1. Start the server with DISTILLERY_EMBEDDING_PROVIDER=openai and a stub OpenAI endpoint configured to always return HTTP 429 with Retry-After: 12.
  2. distillery_store content="..." entry_type="inbox" → error, code INVALID_PARAMS, details.provider == "openai", details.endpoint set, details.http_status == 429, details.retry_after == 12. No stack trace leaked in message.
  3. Server logs a WARNING line with provider context.
  4. Flip stub to return 429 without Retry-After header → details.retry_after absent or null; error still structured.
  5. Flip stub to return Retry-After: inf (non-finite) → error surfaces, retry_after clamped/omitted, no exception (regression guard for b247f3e).
  6. Restore normal stub. Confirm embedding_budget_daily == 0 in config — run 500 sequential stores, none hit EmbeddingBudgetError.
  7. Set embedding_budget_daily=5; on the 6th store → EmbeddingBudgetError surfaced as a structured MCP error.

Pass: upstream provider throttling is the rate limiter; the local budget is opt-in.
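
The retry_after expectations in steps 2, 4 and 5 amount to a small normalisation rule. The sketch below shows one way to express it so the subagent can compute the expected details.retry_after for a given header. It is not the code from b247f3e, the function name normalize_retry_after is hypothetical, and it handles only the delay-in-seconds form of Retry-After.

```python
import math
from typing import Optional


def normalize_retry_after(header_value: Optional[str]) -> Optional[float]:
    """Parse a Retry-After value given in seconds; return None if missing or unusable."""
    if header_value is None:
        return None
    try:
        seconds = float(header_value.strip())
    except ValueError:
        return None  # e.g. the HTTP-date form, which this sketch does not handle
    if not math.isfinite(seconds) or seconds < 0:
        return None  # "inf", "nan" and negative values are omitted, not raised
    return seconds
```

With the headers used above, "12" yields 12.0, while a missing header and "inf" both yield None.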


F8 — Security perimeter (covers #112)

  1. curl -i -X POST -H 'Origin: https://evil.test' -H 'content-type: application/json' -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' $MCP → response does NOT echo Access-Control-Allow-Origin: https://evil.test.
  2. Restart server with DISTILLERY_CORS_ORIGINS=https://ok.test. Repeat curl with Origin: https://ok.test → response echoes Access-Control-Allow-Origin: https://ok.test.
  3. As user-A (one OAuth identity or API key), store entry. As user-B, call distillery_classify against that entry_id → FORBIDDEN (Security Review Follow-Up: Audit of Issue #51 Remediation Status #112 P2).
  4. POST /api/maintenance with valid bearer → response body includes search_logs_pruned: <n>. Verify default retention: grep search_log_retention_days src/distillery/config.py shows default 90.
  5. Grep for verify=False in src/ → zero hits (TLS pin — Security Review Follow-Up: Audit of Issue #51 Remediation Status #112 P1).
  6. Open pyproject.toml → pyyaml, httpx, fastmcp, defusedxml all have upper bounds (Security Review Follow-Up: Audit of Issue #51 Remediation Status #112 P4).

Pass: the server does not echo unconfigured origins, enforces ownership on classify, prunes search logs, pins transitive deps.
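
The upper-bound check in step 6 can be automated once the dependency strings have been read out of pyproject.toml. The sketch below works on plain requirement strings and treats <, <=, == and ~= as upper-bounding, which is an assumption about what counts as a pin for #112 P4. The name missing_upper_bounds is hypothetical.

```python
import re
from typing import Dict, Iterable, List

PINNED = ("pyyaml", "httpx", "fastmcp", "defusedxml")


def missing_upper_bounds(
    requirements: Iterable[str], names: Iterable[str] = PINNED
) -> List[str]:
    """Return the entries of `names` whose requirement has no upper-bounding specifier."""
    bounded: Dict[str, bool] = {}
    for req in requirements:
        match = re.match(r"\s*([A-Za-z0-9._-]+)(.*)", req)
        if not match:
            continue
        name = match.group(1).lower()
        # Drop environment markers (after ';') so 'python_version < ...' is ignored.
        spec = match.group(2).split(";")[0]
        # '<', '<=', '==' and '~=' cap the version; '>=' alone does not.
        bounded[name] = bool(re.search(r"(<=?|==|~=)", spec))
    return [n for n in names if not bounded.get(n.lower(), False)]
```

An empty result means all four packages are bounded. A package that is absent from the list is reported as missing.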


F9 — Bulk ingest + dedup contract (covers #238, #244, #311, #314, #332, #348)

  1. distillery_store_batch entries=[{...} x 20] with mixed content (some near-dup, most distinct). Response: entry_ids length 20, count == 20, results[i].persisted varies per item, results[i].dedup_action ∈ {"stored","skipped","merged"}.
  2. Per-item error isolation: inject one invalid entry (missing required metadata for entry_type="github") into the batch → batch returns partial success; other 19 persist; the bad one has results[i].error populated, no exception leaked.
  3. Call distillery_list with default paging → reflects only the unique/non-merged entries (no ghosts).
  4. Call distillery_store_batch with output_mode="summary" against the same content → each item's response object is minimal (no conflicts, no dedup preview). Measured response bytes ≤ 30% of full-mode.
  5. Cleanup.

Pass: bulk path correctly isolates per-item failures and honours the summary contract.
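
The response-shape checks in steps 1 and 4 can be bundled into two small validators. This is a sketch under the field names used above (entry_ids, count, results, dedup_action, error). The function names are hypothetical, and whether count and entry_ids include a failed item in step 2 is not specified here, so check_batch_response is meant for step 1.

```python
import json
from typing import Any, Dict, List

DEDUP_ACTIONS = {"stored", "skipped", "merged"}


def check_batch_response(response: Dict[str, Any], expected_items: int) -> List[str]:
    """Return a list of contract violations; an empty list means the response passes."""
    problems: List[str] = []
    results = response.get("results", [])
    if response.get("count") != expected_items:
        problems.append("count mismatch")
    if len(response.get("entry_ids", [])) != expected_items:
        problems.append("entry_ids length mismatch")
    if len(results) != expected_items:
        problems.append("results length mismatch")
    for i, item in enumerate(results):
        # An item that failed carries `error`; all others need a known dedup_action.
        if item.get("error"):
            continue
        if item.get("dedup_action") not in DEDUP_ACTIONS:
            problems.append(f"results[{i}] has unknown dedup_action")
    return problems


def summary_is_small_enough(
    summary: Dict[str, Any], full: Dict[str, Any], ratio: float = 0.30
) -> bool:
    """True when the serialized summary response is at most `ratio` of the full one."""

    def size(payload: Dict[str, Any]) -> int:
        return len(json.dumps(payload).encode("utf-8"))

    return size(summary) <= ratio * size(full)
```

Measuring bytes on the re-serialized JSON is an approximation. Comparing the raw HTTP body lengths would be closer to what the client actually receives.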


F10 — Docs/skills/catalog alignment (covers #232, #245, #269, #286, #307, #330)

  1. distillery_status → records tool_count.
  2. Parse docs/install/plugin-install.md (or equivalent published doc) — assert the published count matches (docs: fix stale tool count and add self-host guidance in plugin-install.md #330).
  3. Call tools/list via JSON-RPC — assert distillery_stale is NOT in the returned tools (distillery_stale tool missing from MCP catalog — briefing hook silently drops stale section #307 — routed to list instead). Assert distillery_status IS present (No distillery_status or distillery_metrics MCP tool — /setup cannot verify connectivity in-protocol #313).
  4. Read skills/briefing/SKILL.md — references distillery_list stale_days=30, not distillery_stale.
  5. Read skills/setup/SKILL.md and skills/watch/SKILL.md — all CronCreate examples use MCP tool calls, not POST /hooks/* (fix(skills): setup/watch CronCreate prompts use webhook POSTs instead of MCP tool calls for local transport #269).
  6. Read .claude/settings.local.json — no distillery_tag_tree permission (bug(skills): stale distillery_tag_tree permission in settings.local.json #286).
  7. Introspect each entry-type-accepting tool's description string via tools/list — every value of EntryType appears in the description (distillery_store tool description enum omits github entry type (used by gh-sync) #232 regression guard; doc drift = fail).

Pass: surfaces agents rely on (docs, skill prompts, tool catalog, permissions) agree with the runtime.
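
Step 7 is a plain containment check over the tools/list result. The sketch below assumes each tool object carries name and description fields, as in an MCP tools/list response, and uses a naive substring match. The function name missing_entry_types is hypothetical, and the caller supplies both the EntryType values and the list of entry-type-accepting tools.

```python
from typing import Dict, Iterable, List


def missing_entry_types(
    tools: Iterable[dict], entry_types: Iterable[str], tool_names: Iterable[str]
) -> Dict[str, List[str]]:
    """Map each named tool to the entry-type values absent from its description."""
    wanted = set(tool_names)
    types = list(entry_types)
    gaps: Dict[str, List[str]] = {}
    for tool in tools:
        if tool.get("name") not in wanted:
            continue
        description = tool.get("description", "")
        # Naive substring match: a value that is a substring of another can mask a gap.
        missing = [t for t in types if t not in description]
        if missing:
            gaps[tool["name"]] = missing
    return gaps
```

An empty dict means no drift. Any entry, such as a tool whose description omits github, is a fail for #232.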


Group F teardown:

kill %1                         # stop MCP server
rm -f /tmp/distillery-e2e.db*
git worktree remove /tmp/distillery-e2e

Group F subagent prompt template:

You are the Group F agent-driver. A staging Distillery MCP is running on $MCP and you are authenticated as loopback. For each F-scenario, execute every step AS IF YOU WERE A REAL MCP CLIENT: call tools via JSON-RPC over HTTP (curl or requests), NEVER by importing Python handlers. Before each scenario, snapshot distillery_list(limit=0) count. After each scenario, run the documented cleanup and confirm the count returns to snapshot ±0. Any unhandled exception, non-2xx response code on a step expected to succeed, or schema mismatch is a FAIL. Report | scenario | issues | result | evidence |, where evidence is the single curl/Python line that failed (or "all steps ok") per scenario.
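
For the "call tools via JSON-RPC over HTTP" requirement, the request body for a tool invocation can be built as below. This sketch follows the JSON-RPC 2.0 tools/call shape used by MCP. The helper name build_tool_call is hypothetical, and depending on the server's transport a client may also need an initialize handshake, a session header, or specific Accept headers, none of which are shown.

```python
import json
from typing import Any, Dict


def build_tool_call(name: str, arguments: Dict[str, Any], request_id: int = 1) -> bytes:
    """Serialize a JSON-RPC 2.0 `tools/call` request body for an MCP tool."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(payload).encode("utf-8")
```

The returned bytes can be passed to curl via --data-binary or to an HTTP client as the POST body, for example build_tool_call("distillery_list", {"limit": 0}) for the pre-scenario snapshot.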


Critical files reference

| Purpose | Path |
| --- | --- |
| Error codes | src/distillery/mcp/tools/_errors.py |
| Validation helpers | src/distillery/mcp/tools/_common.py |
| Store/list/update handlers | src/distillery/mcp/tools/crud.py |
| Classify/resolve_review handler | src/distillery/mcp/tools/classify.py |
| Watch/gh-sync/store-batch handlers | src/distillery/mcp/tools/feeds.py |
| Status tool | src/distillery/mcp/tools/meta.py |
| Server registration | src/distillery/mcp/server.py |
| Auth | src/distillery/mcp/auth.py |
| Middleware (CORS, rate-limit) | src/distillery/mcp/middleware.py |
| Webhooks (incl. log pruning) | src/distillery/mcp/webhooks.py |
| DuckDB store + migrations | src/distillery/store/duckdb.py |
| GitHub sync adapter | src/distillery/feeds/github_sync.py |
| RSS adapter | src/distillery/feeds/rss.py |
| Poller | src/distillery/feeds/poller.py |
| Background jobs | src/distillery/feeds/sync_jobs.py |
| Truncation | src/distillery/feeds/truncation.py |
| Embedding (Jina, OpenAI) | src/distillery/embedding/{jina,openai}.py |
| SessionStart hook | scripts/hooks/session_start_briefing.py |
| Skill files | skills/{setup,watch,briefing,classify}/SKILL.md |
| CVE suppressions | .grype.yaml |
| Pyproject pins | pyproject.toml |

Verification of the test plan itself

Before dispatching subagents, run a smoke check on the orchestrator:

git worktree add /tmp/distillery-test staging/api-hardening
cd /tmp/distillery-test
pip install -e ".[dev]" --quiet
pytest --collect-only -q | tail -5            # confirm pytest finds suite
ruff check src/                                # confirm tree is lint-clean

If both pass, dispatch the four group subagents in parallel and aggregate | issue | scenario | result | evidence | tables into a single coverage matrix. Any FAIL or BLOCKED triggers a follow-up task on the originating issue.
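
The aggregation step can be sketched as a parser over the pipe-delimited rows each subagent reports. It assumes the column order given in the paragraph above (issue, scenario, result, evidence). The names parse_rows and follow_ups are hypothetical, and pipes inside the evidence cell are kept as part of the evidence.

```python
from typing import Dict, List

Row = Dict[str, str]
COLUMNS = ("issue", "scenario", "result", "evidence")


def parse_rows(table: str) -> List[Row]:
    """Parse `| issue | scenario | result | evidence |` lines, skipping header/separator rows."""
    rows: List[Row] = []
    for line in table.splitlines():
        # Split into at most four cells so pipes inside the evidence survive.
        cells = [c.strip() for c in line.strip().strip("|").split("|", 3)]
        if len(cells) != len(COLUMNS):
            continue
        is_header = tuple(c.lower() for c in cells) == COLUMNS
        is_separator = bool(cells[0]) and set(cells[0]) <= set("-:")
        if is_header or is_separator:
            continue
        rows.append(dict(zip(COLUMNS, cells)))
    return rows


def follow_ups(rows: List[Row]) -> List[Row]:
    """Rows whose result is FAIL or BLOCKED, each of which needs a follow-up task."""
    return [r for r in rows if r["result"].upper() in {"FAIL", "BLOCKED"}]
```

Concatenating the four subagent tables and passing the result through parse_rows gives the coverage matrix. follow_ups then lists the issues that need a task filed.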
