You wrote a prompt. It works. But is it good? Simmer runs your artifact through multiple rounds of criteria-driven refinement. Each round, a panel of judges reads the artifact, works out what problem it is solving, and proposes specific improvements.
Read the story behind Simmer →
Iterative artifact refinement — take any artifact or workspace and hone it over multiple rounds using criteria-driven feedback.
/plugin marketplace add 2389-research/claude-plugins
/plugin install simmer@2389-research
One skill (simmer) with four subskills that run the refinement loop:
- Setup — identify the artifact (file or workspace), elicit 2-3 quality criteria, determine evaluation method
- Generator — produce an improved version based on the judge's ASI (Actionable Side Information — the single highest-leverage fix)
- Judge — score the candidate 1-10 per criterion, produce the ASI
- Reflect — record the trajectory, track the best candidate across iterations
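The four subskills above can be pictured as a small loop. The following is a minimal sketch, not the plugin's actual implementation: `Verdict`, `simmer_loop`, the `judge` and `generate` callables, and the unweighted-mean composite are all hypothetical names and assumptions chosen for illustration. It also assumes the generator always works from the latest candidate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """What a (hypothetical) judge returns for one candidate."""
    scores: dict[str, int]  # criterion name -> score from 1 to 10
    asi: str                # the single highest-leverage fix

    @property
    def composite(self) -> float:
        # Assumption: composite is the unweighted mean of the criterion scores.
        return sum(self.scores.values()) / len(self.scores)


def simmer_loop(
    seed: str,
    judge: Callable[[str], Verdict],
    generate: Callable[[str, str], str],
    iterations: int = 3,
):
    """Run judge -> generate rounds; return (best_candidate, trajectory)."""
    candidate = seed
    verdict = judge(candidate)  # iteration 0 scores the seed
    best, best_score = candidate, verdict.composite
    trajectory = [(0, verdict.composite, "seed")]

    for i in range(1, iterations + 1):
        asi = verdict.asi
        candidate = generate(candidate, asi)  # generator receives the ASI only
        verdict = judge(candidate)
        trajectory.append((i, verdict.composite, asi))
        if verdict.composite > best_score:    # reflect: keep the best so far
            best, best_score = candidate, verdict.composite

    return best, trajectory
```

Setup is represented here only by the arguments passed in: the seed artifact, the criteria baked into `judge`, and the iteration count.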
Say any of these to start a simmer loop:
"simmer this", "refine this", "hone this", "iterate on this", "make this better", "improve this over a few rounds", "polish this", "tighten this up", "optimize this pipeline", "find the best model for this task", "tune this configuration"
Any request to iteratively improve an artifact triggers simmer.
| Situation | Use | Why |
|---|---|---|
| You have an artifact and want it better | simmer | Iterative refinement against criteria |
| You want parallel code implementations competing | cookoff | Parallel generation + fixed criteria + pick winner |
| You don't know what you want yet | omakase-off | Parallel designs + react + discover preference |
Rule of thumb: Simmer = serial iteration. Cookoff/omakase-off = parallel exploration.
User: "Simmer this" [pastes a pitch email]
Claude: I'm using the simmer skill to set up iterative refinement.
What does "better" mean for this? I'd suggest:
1. Value prop clarity
2. Tone match
3. Call to action strength
User: "Those work"
[Iteration 0: Judge scores seed — 4.0/10]
[Iteration 1: Generator fixes value prop → 5.3/10]
[Iteration 2: Generator fixes CTA → 6.3/10]
[Iteration 3: Generator fixes tone → 7.3/10]
Trajectory:
| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0 | 4 | 5 | 3 | 4.0 | seed |
| 1 | 7 | 5 | 4 | 5.3 | specific problem statement |
| 2 | 7 | 6 | 6 | 6.3 | low-friction CTA |
| 3 | 7 | 7 | 8 | 7.3 | peer-sharing tone |
Best candidate: iteration 3 (7.3/10)
3 iterations complete. Run 3 more?
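The composite column in the trajectory above is consistent with an unweighted mean of the three criterion scores, rounded to one decimal place. That reading is an inference from the numbers, not a documented formula. A short sketch that recomputes the table under that assumption:

```python
# Criterion scores copied from the trajectory table: (value prop, tone, CTA).
rows = {
    0: (4, 5, 3),
    1: (7, 5, 4),
    2: (7, 6, 6),
    3: (7, 7, 8),
}


def composite(scores):
    """Unweighted mean rounded to one decimal (assumed, not documented)."""
    return round(sum(scores) / len(scores), 1)


composites = {i: composite(s) for i, s in rows.items()}
best_iteration = max(composites, key=composites.get)

for i, value in composites.items():
    print(f"iteration {i}: {value}")
print(f"best candidate: iteration {best_iteration}")
```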
User: "Simmer this pipeline — find the best setup for this task"
Claude inspects the workspace, finds evaluate.sh and config.json,
and proposes criteria + evaluation method:
Evaluator: ./evaluate.sh
Criteria: accuracy, cost efficiency, latency
Search space: models, prompt text, pipeline topology
User: "Looks good, accuracy is the priority. 5 iterations."
[Iteration 0: Run evaluator on seed, judge scores — 3.7/10]
[Iteration 1: Generator swaps to cheaper model → 5.3/10]
[Iteration 2: Generator splits into 2-step chain → 7.0/10]
[Iteration 3: Generator adds few-shot examples → 7.7/10]
...
Best candidate: iteration 4 (8.1/10)
| Artifact type | Suggested criteria |
|---|---|
| Document / spec | clarity, completeness, actionability |
| Creative writing | narrative tension, specificity, voice consistency |
| Email / comms | value prop clarity, tone match, call to action strength |
| Prompt / instructions | instruction precision, output predictability, edge case coverage |
| API design | contract completeness, developer ergonomics, consistency |
| Pipeline / workflow | coverage, efficiency, noise |
| Configuration / infra | correctness, resource efficiency, maintainability |
| Mode | When to use |
|---|---|
| Judge-only (default) | Text artifacts — judge scores against criteria |
| Runnable | Code/pipelines — judge interprets script output |
| Hybrid | Both — run script AND judge results against criteria |
No format contract on evaluator output. The judge reads whatever your script produces — test results, metrics, error logs, anything.
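Because the judge reads free-form output, an evaluator can be as simple as a script that prints whatever is useful. The sketch below is a hypothetical `evaluate.py`; the `evaluate` function, the sample data, and the report wording are illustrative assumptions, not anything simmer requires.

```python
def evaluate(predictions, expected):
    """Compare predictions to ground truth and return a free-form text report."""
    pairs = list(zip(predictions, expected))
    correct = sum(p == e for p, e in pairs)
    lines = [f"accuracy: {correct}/{len(pairs)}"]
    # List each miss so a reader of the report can see what to fix.
    for i, (p, e) in enumerate(pairs):
        if p != e:
            lines.append(f"MISS case {i}: expected {e!r}, got {p!r}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical data; a real script would run the pipeline under test.
    print(evaluate(["a", "b", "x"], ["a", "b", "c"]))
```

Listing individual misses rather than a bare number gives the judge more to work with when it picks the next fix.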
Simmer auto-selects between a single judge and a multi-judge board based on complexity:
- Simple (short email, tweet, ≤2 criteria) → single judge, fast
- Complex (3+ criteria, long artifact, code, pipelines) → judge board with deliberation
The board constructs three judges tailored to your specific problem — not from a fixed menu, but by reading your artifact, criteria, and constraints and designing judges with diverse perspectives. An extraction prompt gets different judges than a DND adventure hook.
Judges investigate before scoring — they read the evaluator script, ground truth, prior candidates, and config files to understand the problem deeply. A judge who reads the evaluator discovers scoring mechanics on iteration 0 instead of learning them through 3 iterations of trial and error.
If a single-judge run hits a plateau (3 iterations without improvement), simmer offers to upgrade to the board mid-run with 2 extra iterations.
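The selection and plateau rules above could be sketched roughly as follows. This is an illustration under stated assumptions: `pick_judge_mode`, `has_plateaued`, and the `short_limit` threshold are hypothetical, the source does not define how long "short" is, and "without improvement" is read here as "did not beat the best earlier score".

```python
def pick_judge_mode(n_criteria, artifact_chars, is_runnable, short_limit=2000):
    """Return "single" for simple cases, otherwise "board".

    short_limit is a made-up cutoff for what counts as a short artifact.
    """
    simple = n_criteria <= 2 and artifact_chars <= short_limit and not is_runnable
    return "single" if simple else "board"


def has_plateaued(composites, patience=3):
    """True when none of the last `patience` scores beats the best score before them."""
    if len(composites) <= patience:
        return False  # not enough history to call it a plateau
    best_before = max(composites[:-patience])
    return all(score <= best_before for score in composites[-patience:])
```

For example, `has_plateaued([4.0, 5.3, 5.3, 5.0, 5.1])` is true under this reading, because the last three scores never exceed the earlier best of 5.3.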
Default iteration count: 3 rounds per batch. After each batch, simmer asks whether to continue. You can request a specific count ("simmer this for 10 rounds") or stop early at any prompt.
Regression safety: The reflect subskill tracks the best candidate seen so far. If a new iteration scores lower than the current best, the best-so-far is preserved — the loop never loses progress. At the end, result.md always contains the highest-scoring candidate, not just the latest one.
| Feature | When you need it |
|---|---|
| Workspace targets | Refining a multi-file directory — iterations tracked as git commits so you can diff any two rounds |
| Runnable evaluators | Your artifact has a test script — point simmer at it (python evaluate.py) and the judge interprets output |
| Background constraints | The generator needs to know what's available (models, budget, latency targets) to make realistic choices |
| Output contracts | Valid output has a defined shape (e.g., JSON schema) — violations score 1/10, forcing format fixes first |
| Validation commands | A cheap pre-check (./validate.sh) catches broken pipelines in seconds before the full evaluator runs |
| Search space tracking | Explicit bounds on what to explore — reflect tracks tried vs. untried regions so the judge steers toward gaps |
See the v2 design spec for full details on each feature.
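As an illustration of the output-contract row, the sketch below floors every criterion to 1 when the output is not valid JSON with the expected keys. The contract itself (`REQUIRED_KEYS`), the function name, and the choice to floor all criteria rather than only the composite are assumptions made for the example.

```python
import json

# Hypothetical contract: output must be a JSON object with these keys.
REQUIRED_KEYS = {"label", "confidence"}


def apply_output_contract(raw_output, judged_scores):
    """Return the judged scores, or 1 for every criterion on a contract violation."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        parsed = None  # not JSON at all

    valid = isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)
    if valid:
        return dict(judged_scores)
    # Violation: floor every score so the next round has to fix the format first.
    return {criterion: 1 for criterion in judged_scores}
```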
Single-file mode (default output dir: docs/simmer):
docs/simmer/
  iteration-0-candidate.md   # Seed (original artifact)
  iteration-1-candidate.md   # Each improved candidate
  iteration-2-candidate.md
  iteration-3-candidate.md
  trajectory.md              # Running score table
  result.md                  # Final best candidate (highest score, not necessarily latest)
Workspace mode:
./pipeline/            # Target directory (modified in place)
  [project files]      # Tracked via git commits per iteration
docs/simmer/           # Tracking files (separate from workspace)
  trajectory.md        # Running score table
Workspace iterations are tracked as git commits rather than separate files.
- Focused improvement — each iteration targets one direction (the ASI), not everything at once. Compound gains over scattered edits.
- Context isolation — generator doesn't see scores, judge doesn't see previous scores. Each role gets only the context it needs to avoid bias.
- The generator is the search strategy — in workspace mode, the generator decides what to change (swap a model, restructure a pipeline, tune a prompt). The ASI guides direction, the generator executes.
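The context-isolation principle above can be made concrete with a small sketch. The record type and the two context builders are hypothetical names for illustration; the only behavior taken from the text is that the generator gets the ASI but no scores, and the judge may read prior candidates but not their scores.

```python
from dataclasses import dataclass


@dataclass
class IterationRecord:
    """Everything the orchestrator knows about one round (hypothetical shape)."""
    candidate: str
    scores: dict
    asi: str


def generator_context(history, criteria):
    """Generator sees the latest candidate, the criteria, and the ASI; no scores."""
    latest = history[-1]
    return {"candidate": latest.candidate, "criteria": criteria, "asi": latest.asi}


def judge_context(candidate, history, criteria):
    """Judge sees the new candidate and prior candidates, with scores stripped."""
    return {
        "candidate": candidate,
        "criteria": criteria,
        "prior_candidates": [record.candidate for record in history],
    }
```

Keeping the full record in one place and handing each role a filtered view makes it hard to leak scores by accident.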
See the design spec for the full architecture.
Part of the test-kitchen family, but independently installable:
- test-kitchen:omakase-off — parallel design exploration
- test-kitchen:cookoff — parallel implementation competition
- simmer — iterative refinement
- CLAUDE.md — full plugin instructions
- Simmer skill — orchestrator
- v2 Design — design spec
- Integration tests — test scenarios
If Simmer helped you ship something better than your first draft, a ⭐ helps us know it's landing.
Built by 2389 · Part of the Claude Code plugin marketplace