diff --git a/docs/plan/2026-05-03-dld-search-evaluation.md b/docs/plan/2026-05-03-dld-search-evaluation.md
new file mode 100644
index 0000000..b7908b8
--- /dev/null
+++ b/docs/plan/2026-05-03-dld-search-evaluation.md
@@ -0,0 +1,299 @@
+# `dld-search` skill — design and evaluation
+
+Date: 2026-05-03
+Test corpus: `~/dev/gillerkvitter` — 162 decisions across 5 namespaces (backend 63, web 59, field-station 16, field-sync 16, birdnet 8). Real production DLD project.
+
+## Background
+
+Some DLD-using projects are approaching ~200 decisions. The user wanted to evaluate adding a retrieval layer beyond standard agent grep. Up-front research (see same-day session) concluded: at this scale, embedding-heavy semantic search is over-engineering — Cursor / Claude Code / Aider have all converged on agentic-grep as the primary pattern, with reranking/embeddings as optional assists for very large corpora. The user agreed to skip the FTS5 / vector-search path for now.
+
+The remaining question: can we get a meaningful win from a **methodology skill** that wraps the existing grep-based retrieval, runs on a cheap fast model (Haiku 4.5), and encodes a search playbook? Specifically, one designed to be invoked by `dld-plan` and `dld-implement` so they reliably surface prior decisions before planning or coding.
+
+## Skill design
+
+`SKILL.md` lives at `.claude/skills/dld-search/SKILL.md` (Claude Code variant) and `skills/dld-search/SKILL.md` (tessl variant per the dual-directory convention).
+
+### Frontmatter (Claude Code variant)
+
+```yaml
+name: dld-search
+description: Find decisions relevant to a query, feature, or code path. Returns a ranked,
+  structured list (IDs + titles + one-line relevance) — does NOT dump file contents.
+  Designed to be invoked by other DLD skills (especially `dld-plan` and `dld-implement`)…
+user_invocable: true
+model: claude-haiku-4-5-20251001
+effort: low
+context: fork
+agent: Explore
+```
+
+The Claude-Code-specific fields encode the cost/isolation profile:
+
+- `model: claude-haiku-4-5-20251001` — Haiku for speed and cost
+- `effort: low` — search is mostly mechanical
+- `context: fork` — runs in an isolated subagent so the parent's conversation isn't polluted with grep noise
+- `agent: Explore` — read-only tools, optimised for codebase exploration
+
+The tessl variant uses only spec-compliant fields (`name`, `description`, `compatibility: Requires ripgrep (rg) on PATH.`) since `model`/`effort`/`context`/`agent` are Claude-Code extensions outside the agentskills.io spec.
+
+### Playbook structure
+
+The skill defines three modes:
+
+- `mode:plan ` — wide net, recall over precision
+- `mode:implement DL-NNN [extra context]` — narrow scope, supersedes/amends chain mandatory
+- `mode:lookup ` — default, balanced
+
+And a 5-step playbook:
+
+1. **Cheap exact-match filters first** — ID match, path filter (`paths=` hint), tag filter (`tags=` hint). If a small pool falls out, skip ahead.
+2. **INDEX.md as cheatsheet** — the denormalised table of all decisions; metadata-only ranking before opening files.
+3. **Targeted ripgrep** with `-l` (file list only) — never `-C/-A/-B` in the candidate-discovery phase.
+4. **Rank and trim** — read frontmatter + `## Decision` paragraph for top candidates only. Status filtering (drop superseded/rejected unless asked).
+5. **Sanity-check before returning** — if 0 hits, retry with most distinctive term; if >50, trim and note in output.
+
+Output is a strict structured block (markdown table) optimised for LLM consumption but human-readable.
+
+## Test methodology
+
+All tests run in a worktree of `~/dev/gillerkvitter` (clean, real corpus).
Sessions launched via `claude -p --output-format stream-json --verbose --no-session-persistence --dangerously-skip-permissions`. Stream-JSON enables ground-truth tool/skill invocation logging — earlier `--output-format json` runs in this evaluation gave only the final result with no tool trace, which led to one false inference (later corrected).
+
+Two paired conditions per test: **with-skill** (`dld-search` installed at `.claude/skills/dld-search/`) and **without-skill** (skill removed). For Part 2, separate worktrees so the runs could execute in parallel without colliding on file mutations.
+
+The model in the parent session is the user's default (Sonnet/Opus class) in both conditions; `dld-search` internally forks to Haiku.
+
+## Part 1 — Direct skill calls (4 prompts × 2 conditions)
+
+Realistic developer prompts; agent free to invoke `dld-search` or not based on the skill's description alone.
+
+### Tool/cost trace
+
+| Prompt | Condition | Skill invs | Bash | Read | Grep | Glob | Turns | Duration | Cost |
+|---|---|---|---|---|---|---|---|---|---|
+| P1 | with | 1 (`dld-search`) | 3 | 19 | 0 | 0 | 2 | 60s | $0.18 |
+| P1 | without | 1 (`tessl__dld-lookup`) | 3 | 0 | 2 | 1 | 9 | 37s | $0.22 |
+| P2 | with | 2 (`tessl__dld-lookup` then `dld-search`) | 12 | 15 | 0 | 0 | 7 | 103s | $0.34 |
+| P2 | without | 0 | 3 | 3 | 0 | 0 | 7 | 29s | $0.16 |
+| P3 | with | 1 (`dld-search`) | 6 | 8 | 0 | 0 | 2 | 48s | $0.12 |
+| P3 | without | 0 | 6 | 14 | 3 | 0 | 3 | 68s | $0.47 |
+| P4 | with | 1 (`dld-search`) | 4 | 12 | 0 | 0 | 2 | 35s | $0.12 |
+| P4 | without | 1 (`tessl__dld-lookup`) | 2 | 0 | 4 | 0 | 9 | 40s | $0.22 |
+
+**Important context: the without-skill baseline already had `tessl__dld-lookup` available** (the existing tessl-installed lookup skill). It was reached for in 2 of 4 prompts.
So `dld-search` competes with `tessl__dld-lookup`, not bare grep — both add value over raw search but the lookup skill is keyword/ID/path-oriented while `dld-search` is methodology-oriented.
+
+### Decision IDs surfaced
+
+| Prompt | with-skill (n) | without-skill (n) | Overlap |
+|---|---|---|---|
+| P1 (audio playback in matrix) | 15 | 10 | 7 |
+| P2 (implement DL-147) | 8 | 4 | 4 |
+| P3 (idempotency / dedup) | 8 | 3 | 3 |
+| P4 (audio enhancement pipeline) | 12 | 13 | 12 |
+
+### What the with-skill condition actually does
+
+In every case where with-skill invoked `dld-search`, the parent then **followed up with substantial Read activity** (8–19 reads). So delegation isn't a hard handoff — the parent uses the skill as a curated seed list and then opens individual files itself. The hoped-for "fewer parent tool calls" benefit didn't materialise: the parent does *more* Reads (it has more known-relevant candidates to open), not fewer.
+
+### Sample outputs (P3 — "How does this codebase prevent duplicate work?")
+
+**with-skill** (after invoking `dld-search` with args `idempotency, deduplication, preventing duplicate work, dedup, unique constraints, retry safety`):
+
+```
+Idempotency in this codebase is built around three coupled decisions added 2026-04-17
+after a 31× duplication incident:
+
+- DL-156 (field-sync) — sync agent attaches a per-row externalId idempotency key
+- DL-157 (backend) — external_id unique column + ON CONFLICT DO NOTHING
+- DL-158 (backend) — alerts and clip pulls fire only for newly-inserted rows
+
+Supporting context:
+- DL-027 — original push retry mechanism (the reason idempotency was needed); amended by DL-156
+- DL-022 — original batch ingest endpoint; amended by DL-157
+- DL-068 (alerts) and DL-105 (clip pull) — both amended by DL-158
+- DL-078 — per-species trigger interval, a related dedup pattern in the alert engine
+
+Run /dld-lookup DL-156 (or 157/158) to read any of them in full.
+```
+
+**without-skill** (no skill invocation, did its own grep + read):
+
+```
+Three decisions cover this end-to-end:
+
+- DL-156 — Per-row idempotency key externalId = "-", generated by
+  field-sync, station-scoped, survives BirdNET DB rebuilds.
+- DL-157 — Backend dedup via partial unique index on (station_id, external_id)
+  + ON CONFLICT DO NOTHING. Retried pushes are silent no-ops.
+- DL-158 — Side effects (alerts, clip pull) fire only for newly inserted rows;
+  response returns { inserted, duplicates }.
+
+Implementation pointers:
+- Key generation: apps/field-sync/src/detection-reader.ts:62
+- Unique index: apps/backend/migrations/015_detection_external_id.sql:5
+- ON CONFLICT insert: apps/backend/src/components/detection/internal/detection.repository.ts:105
+- Side-effects gated on inserted rows: apps/backend/src/components/detection/detection.controller.ts:102
+- Clip pull skip-if-fetched: apps/backend/src/components/audio/internal/clip-pull.ts:29
+- Alert per-species cooldown: apps/backend/src/components/notification/internal/notification.repository.ts:48
+```
+
+The without-skill version is **arguably more useful** — same core 3 decisions, plus line-numbered implementation pointers across 6 files. The with-skill version surfaced 5 extra related decisions (DL-022, DL-027, DL-068, DL-078, DL-105) that the user didn't ask about and probably didn't need.
+
+### Sample outputs (P1 — "What decisions should I be aware of for per-cell audio playback?")
+
+Both conditions surfaced the core matrix decisions and audio-playback decisions. Both correctly flagged the conflict with the existing cell-click drill-down (DL-053).
Differences are subtle:
+
+- **with-skill** surfaced **DL-049** (the canonical detail-page audio player) and **DL-050** (Tailscale, not WireGuard — corrected stale CLAUDE.md mention)
+- **without-skill** surfaced **DL-064** (a superseded matrix decision) and **DL-123** (admin enhance trigger pattern)
+- Both flagged the cell-click conflict; both flagged the "which clip plays" question
+
+This is a wash. Both reached the right conclusions through different paths.
+
+### Part 1 verdict
+
+The skill **is** auto-invoked first from its description alone in 3 of 4 prompts (P2 was a known-DL-NNN lookup where the parent correctly chose `tessl__dld-lookup` first and only then invoked `dld-search`). Output quality is **comparable, not clearly better** than baseline. Cost is in the same range ($0.12–$0.34 with the skill, $0.16–$0.47 without). Wall-clock is mixed (faster on P3 & P4, slower on P1 & P2). The clearest finding is that **the without-skill baseline is already strong** because `tessl__dld-lookup` exists and Sonnet+grep handles small markdown corpora well.
+
+The skill's output discipline (compact structured table, explicit "why relevant") is a real qualitative advantage, but the parent agent always re-summarises in its own voice for the user, so the structure mostly benefits intermediate consumption — i.e. the planning/implementing skill use case in Part 2.
+
+## Part 2 — Skill orchestration via `/dld-plan` and `/dld-implement`
+
+### Critical mid-test finding
+
+The **first** Part 2 attempt was uninformative: with-skill `/dld-plan` did **not** invoke `dld-search`. Both runs reached for `tessl__dld-plan` (the existing planner skill), which then did its own grep/Read. The two plans differed only in non-deterministic sampling — same tools, same paths.
+
+This showed that **auto-invocation from skill descriptions alone is unreliable for inter-skill calls**. To test the actual integration, we had to wire `dld-search` into `dld-plan` and `dld-implement` explicitly.
+
+### Wiring change
+
+Patched the "Check for related existing decisions" section in `dld-plan/SKILL.md` and the "Understand the decision(s)" section in `dld-implement/SKILL.md` (both Claude Code and tessl variants in dld-kit, plus the `tessl__dld-plan`/`tessl__dld-implement` copies in the gillerkvitter test repo).
+
+Pattern:
+
+> **Preferred:** invoke the `dld-search` skill in `mode:plan` (or `mode:implement`), passing rich context.
+>
+> **Fallback** (when `dld-search` is not installed): scan decision files directly using grep/Read.
+
+This makes the call explicit while keeping the skill optional.
+
+### `/dld-plan` rerun (post-wiring)
+
+| Condition | dld-search calls | Bash | Read | Grep | Turns | Duration | Cost |
+|---|---|---|---|---|---|---|---|
+| with | 2 (both `mode:plan` with rich args) | 18 | 35 | 4 | 35 | 212s | $1.29 |
+| without | 0 | 17 | 11 | 0 | 35 | 146s | $1.25 |
+
+The with-skill run **did** invoke `dld-search` twice — once with full context including code paths and likely tags, once apparently retried. The actual args passed:
+
+```
+mode:plan Per-cell audio playback in homepage activity matrix. Each cell currently
+aggregates detections for one species in one hour and links to a filtered observations
+list. Want users to play a representative clip (highest-confidence detection) directly
+from the cell. Touches: web frontend homepage activity matrix component, audio playback
+UI, backend API serving audio clips, detection clip URL/metadata. Possibly related tags:
+activity-matrix, homepage, audio, audio-playback, detection-clip, clip-export,
+clip-serving, detection-audio
+```
+
+That's exactly the rich-context pattern the playbook prescribes. Wiring works.
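
The argument string above has a simple shape: mode prefix, feature description, touched areas, candidate tags. A minimal sketch of assembling it is shown here, assuming a hypothetical helper `build_search_args` that is not part of dld-kit. The three mode names and the `Touches:` / `Possibly related tags:` labels are taken from this document; everything else is illustrative.

```python
from typing import Sequence

# Mode names as listed in the skill's playbook section.
VALID_MODES = {"plan", "implement", "lookup"}


def build_search_args(
    mode: str,
    description: str,
    touches: Sequence[str] = (),
    tags: Sequence[str] = (),
) -> str:
    """Compose a rich-context argument string for a dld-search call.

    Hypothetical helper: the real callers are prompt-driven skills, so this
    only illustrates the shape of the string they were observed to pass.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # Normalise the description so it always ends with a single full stop.
    parts = [f"mode:{mode} {description.strip().rstrip('.')}."]
    if touches:
        parts.append("Touches: " + ", ".join(touches) + ".")
    if tags:
        parts.append("Possibly related tags: " + ", ".join(tags))
    return " ".join(parts)
```

Calling `build_search_args("plan", "Per-cell audio playback in homepage activity matrix", touches=["audio playback UI"], tags=["activity-matrix", "audio"])` yields a single-line string in the same order as the trace. Optional parts are simply omitted when empty, which keeps a bare `mode:lookup` query short.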
+
+### Resulting plans (post-wiring)
+
+Both runs produced 3–4 well-structured proposed decisions, both correctly:
+
+- Concluded **no backend decision needed** (existing endpoint sufficient)
+- Reused `fetchAndNormalize` / DL-159 hipass parity
+- Preserved DL-053/DL-143 drill-down semantics
+
+**with-skill (3 decisions):**
+| ID | Title |
+|---|---|
+| DL-161 | Pick highest-confidence detection as cell's representative clip |
+| DL-162 | Single global cell-audio player reusing normalization and hipass pipeline |
+| DL-163 | Per-cell play icon with cell-body drill-down preserved |
+
+**without-skill (4 decisions):**
+| ID | Title |
+|---|---|
+| DL-161 | Pick highest-confidence detection with clip as cell representative |
+| DL-162 | Composite cell with corner play button overlay |
+| DL-163 | Shared single-instance HTMLAudioElement for matrix cell playback |
+| DL-164 | Amend DL-053 — matrix cell becomes composite, not a single anchor |
+
+The without-skill plan is arguably **more explicit** — it carved out a separate amendment decision for DL-053. The with-skill plan is more concise, capturing the amendment in notes. Both are reasonable. **Quality is comparable.**
+
+Cost premium for with-skill: ~$0.04 (3%). Wall-clock premium: ~66 seconds (45%).
+
+### `/dld-implement DL-161` (chained from plan output)
+
+| Condition | dld-search calls | Bash | Read | Grep | Edit/Write | Turns | Duration | Cost |
+|---|---|---|---|---|---|---|---|---|
+| with | 1 (`mode:implement` with rich args) | 19 | 34 | 3 | 5 | 31 | 217s | $1.47 |
+| without | 0 | 10 | 17 | 0 | 4 | 31 | 155s | $1.32 |
+
+The with-skill `/dld-implement` invoked `dld-search` once with the prescribed `mode:implement DL-161 — implementing "pick highest-confidence representative detection per matrix cell". Touches apps/web/src/lib/components/ActivityMatrix.svelte and the Detection grouping client-side. Related: DL-045 (matrix), DL-053/DL-143 (drill-down), DL-108 (clip variant), DL-139 (polling).
…`
+
+Both implementations:
+- Modified `apps/web/src/components/ActivityMatrix.svelte` to extend client-side grouping with representative-detection tracking
+- Annotated with `@decision(DL-161)`
+- Updated `decisions/INDEX.md` and the DL-161 record's `references`
+- Correctly noted that template wiring is deferred to DL-162/DL-163
+
+The with-skill version refined the tiebreak rule during implementation (earliest detectedAt, for cross-poll stability) and aligned eligibility wording to the actual `Detection` shape. The without-skill version independently picked a different tiebreak (latest detectedAt). Both reasonable; both updated their respective DL-161 record inline (the "small refinement" pattern from the implement playbook).
+
+Cost premium for with-skill: ~$0.15 (11%). Wall-clock premium: ~62 seconds (40%).
+
+## Findings
+
+### What works
+
+1. **The skill auto-invokes reliably from its description for direct user queries.** P1, P3, P4 all triggered it without prompting. The description format ("designed to be invoked by other DLD skills…") is matched by Claude appropriately.
+
+2. **The wiring pattern works for inter-skill calls.** Once `dld-plan` and `dld-implement` explicitly mention `/dld-search` in their playbooks, they invoke it consistently with rich context (full feature description, code paths, candidate tags). Auto-invocation from description alone is **not reliable** between skills.
+
+3. **`context: fork` + Haiku does isolate the search work** — the search subagent's tool calls don't appear in the parent's tool count. But this doesn't translate into fewer parent tool calls overall, because the parent re-reads candidates regardless.
+
+4. **The mode arg system carries through correctly.** Callers used `mode:plan` and `mode:implement` as prescribed.
+
+### What doesn't materialise
+
+1. 
**No clear quality win at this corpus size.** On 162 decisions across 5 namespaces, the without-skill baseline (Sonnet/Opus + `tessl__dld-lookup` + grep) is already very capable. Outputs differ in detail and ordering, not in correctness.
+
+2. **No cost or latency win.** In Part 2, wall-clock is 40–45% slower with the skill and cost is 3–11% higher; Part 1 was mixed on both. The fork+Haiku overhead exceeds whatever the parent saves on its own reasoning at this scale.
+
+3. **No "fewer parent context" win.** The parent does more Read calls in the with-skill condition, not fewer — it treats the skill output as a curated seed list, not a final answer.
+
+4. **The structured-output discipline mostly benefits the inter-skill case.** Direct human users get the parent's re-summarisation either way.
+
+### Honest conclusions
+
+For a corpus of 162 decisions, **`dld-search` is not delivering a measurable improvement on real workflows** beyond the existing `tessl__dld-lookup` + grep baseline. Quality, cost, and latency are all approximately a wash, with cost and latency slightly worse with the skill in Part 2.
+
+The skill **is** technically sound:
+- Wiring works
+- Modes work
+- Output discipline is correct
+- Auto-invocation works for direct queries
+
+But the **value proposition isn't there yet** at this scale. Every original argument for a retrieval layer (the up-front research and discussion) was for *much larger* corpora where unaided grep starts to fail. At 162 decisions, the failure mode hasn't appeared.
+
+## Recommendations
+
+1. **Do not ship `dld-search` as a default DLD skill** at this point. The cost/complexity addition isn't justified by measurable benefit on real corpora at this scale.
+
+2. **Keep the skill files in dld-kit but mark them experimental.** They're well-designed and the methodology is correct; they may pay off at larger scales (500+ decisions) or in projects where the user reports retrieval failures with the current setup.
+
+3. 
**Revert the `dld-plan` and `dld-implement` wiring edits**. They reference `dld-search` as the preferred path, which would degrade UX for users who don't have the skill installed (the fallback works but adds prompt clutter).
+
+4. **Re-evaluate when concrete drift appears.** If a project hits 300+ decisions and users start reporting that `/dld-plan` misses obvious prior decisions, that's the trigger to re-test the wiring with real failure cases as the test set.
+
+5. **If `dld-search` is kept**, consider **tightening its description** so it's only auto-invoked when the query is genuinely conceptual ("what have we decided about X" framings), not for direct ID lookups where `tessl__dld-lookup` is already the right tool. This would reduce the cases where it competes uselessly with the lookup skill.
+
+## Appendix — Test artefacts
+
+Streams and responses preserved at `/tmp/dld-search-tests/`:
+- `with-p{1..4}.stream.jsonl`, `without-p{1..4}.stream.jsonl` — Part 1
+- `with-plan.stream.jsonl`, `without-plan.stream.jsonl` — Part 2 first attempt (pre-wiring; uninformative)
+- `with-plan2.stream.jsonl`, `without-plan2.stream.jsonl` — Part 2 with wiring
+- `with-impl1.stream.jsonl`, `without-impl1.stream.jsonl` — Part 2 implement chain
+- `results/*.txt` — extracted final responses
+
+Worktrees: `/tmp/gillerkvitter-test`, `/tmp/gillerkvitter-test2` (both with mutated decision logs from the test runs — discard).
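
The per-run tool counts, turns, duration, and cost in the tables could in principle be recomputed from the preserved `*.stream.jsonl` files. Below is a minimal sketch of such a tally, not the script used for this evaluation. It assumes each line is one JSON event, that assistant events carry `message.content` blocks with `type: "tool_use"` and a `name`, and that the final `result` event exposes `num_turns`, `duration_ms`, and `total_cost_usd`. Those field names are assumptions about the stream format, and `summarise_stream` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Iterable


def summarise_stream(lines: Iterable[str]) -> dict:
    """Tally tool calls by name and pull run totals from stream-json lines."""
    tools: Counter = Counter()
    totals: dict = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the trace
        event = json.loads(raw)
        if event.get("type") == "assistant":
            # Assumed shape: message.content is a list of content blocks.
            for block in event.get("message", {}).get("content", []):
                if isinstance(block, dict) and block.get("type") == "tool_use":
                    tools[block.get("name", "unknown")] += 1
        elif event.get("type") == "result":
            # Assumed field names on the final summary event.
            totals = {
                "turns": event.get("num_turns"),
                "duration_s": (event.get("duration_ms") or 0) / 1000,
                "cost_usd": event.get("total_cost_usd"),
            }
    return {"tools": dict(tools), **totals}
```

A file can be passed directly, for example `summarise_stream(open("with-p1.stream.jsonl", encoding="utf-8"))`, since the function only needs an iterable of lines. Because `dld-search` runs in a forked subagent, a tally like this would presumably reflect only the parent's tool calls, which matches how the tables above are described.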