What this is: Implementation of Meta-Harness optimization (arXiv:2603.28052) for the QUY multi-agent system. It automatically improves agent harness components (CLAUDE.md files, skill activation, memory retrieval) while enforcing behavioral invariants as the primary oracle.
Current status: Phase 0 complete. Phase 1 ready to begin.
Companion repo: C:\Users\ryuke\Desktop\QUY — the live QUY system this optimizer targets.

    quy-harness-optimizer/
    ├── README.md ← you are here
    ├── context/
    │   ├── BOOTSTRAP_QUY.md ← feed to Quy at session start
    │   ├── BOOTSTRAP_PRAXIS.md ← feed to Praxis at session start
    │   └── BOOTSTRAP_STRAXIA.md ← feed to Straxia at session start
    ├── spec/
    │   ├── BEHAVIORAL_INVARIANTS.md ← the oracle (27 invariants, 4 tiers)
    │   └── PROPOSAL.md ← full implementation plan
    ├── agent_configs/
    │   ├── CLAUDE_QUY.md ← agent identity files (reference copies)
    │   ├── CLAUDE_PRAXIS.md
    │   ├── CLAUDE_STRAXIA.md
    │   ├── SOUL_QUY.md
    │   ├── SOUL_PRAXIS.md
    │   └── SOUL_STRAXIA.md
    ├── reference/
    │   ├── session_end.py ← copy of QUY hook; Praxis extends this
    │   ├── memory_core.py ← copy of QUY context assembly hook
    │   ├── SKILL_STATS.json ← Phase 3 signal data
    │   └── skills_INDEX.md ← Phase 3 optimization target
    └── harness_optimizer/ ← all implementation code goes here
        ├── invariant_tests/ ← automated test harness (Phase 1)
        ├── adversarial/ ← sealed adversarial task files (Phase 1)
        ├── baselines/ ← Q_baseline.json (Phase 1)
        ├── snapshots/ ← daily harness snapshots (Phase 1+)
        ├── proposals/ ← pending/ accepted/ rejected/ (Phase 3+)
        └── monitoring/ ← weekly Q tracking (Phase 4+)

Before starting Phase 1, verify:
- Python 3.11 available (`python --version`)
- `whisper_env` exists at `C:\Users\ryuke\Desktop\QUY\whisper_env\`
- Claude Code running with API access
- Git configured (`git config --global user.name`)
- QUY system running (Discord bot active with Praxis, Straxia, Quy)
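The locally checkable items in this list can be scripted. The following is a minimal sketch, not part of the repo: the function name `check_prerequisites` and the result keys are hypothetical, and it reads "Python 3.11 available" strictly as "this interpreter is 3.11". Claude Code API access and the Discord bot still need a manual check.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Path taken from the checklist; adjust if the QUY checkout lives elsewhere.
WHISPER_ENV = Path(r"C:\Users\ryuke\Desktop\QUY\whisper_env")


def check_prerequisites(whisper_env: Path = WHISPER_ENV) -> dict[str, bool]:
    """Return a pass/fail flag for each prerequisite that can be checked locally."""
    results = {
        "python_3_11": sys.version_info[:2] == (3, 11),
        "whisper_env_exists": whisper_env.is_dir(),
        "git_user_name_set": False,
    }
    if shutil.which("git"):
        proc = subprocess.run(
            ["git", "config", "--global", "user.name"],
            capture_output=True,
            text=True,
            check=False,
        )
        # git exits non-zero when the key is unset.
        results["git_user_name_set"] = proc.returncode == 0 and bool(proc.stdout.strip())
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
```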
`spec/BEHAVIORAL_INVARIANTS.md` is written: 27 invariants across 4 oracle tiers. This is the oracle.
Your actions before starting:
- Review HARNESS-IMMUTABLE proposals: Praxis will propose which sections of each CLAUDE file to tag. You approve before any tags are written. Look for: Security section, identity paragraph, loop guard, config immutability rule.
- Eval set designation: Praxis will pull candidate tasks from `C:\Users\ryuke\Desktop\QUY\.claude\trajectory\logs` and present them to you grouped by category. You validate or relabel them. The eval set needs 50 tasks per category (200 total): software engineering, research/analysis, orchestration, operational.
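To confirm the 50-per-category quota after labeling, a small counter is enough. This is a sketch under assumptions: the labeled tasks are stored as JSONL, and each record carries a `category` field. Neither the field name nor the helper names come from the QUY trajectory format.

```python
import json
from collections import Counter
from pathlib import Path

CATEGORIES = ("software engineering", "research/analysis", "orchestration", "operational")
REQUIRED_PER_CATEGORY = 50


def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def count_by_category(records) -> Counter:
    """Count tasks per label; missing labels are grouped so they can be flagged."""
    return Counter(rec.get("category", "<unlabeled>") for rec in records)


def eval_set_gaps(records) -> dict[str, int]:
    """Return {category: shortfall} for every category below the quota."""
    counts = count_by_category(records)
    return {
        cat: REQUIRED_PER_CATEGORY - counts[cat]
        for cat in CATEGORIES
        if counts[cat] < REQUIRED_PER_CATEGORY
    }
```

An empty result from `eval_set_gaps` means all four categories have reached 50 tasks.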
Start agents:
1. Open new Claude Code session
2. Paste contents of context/BOOTSTRAP_QUY.md as your first message
3. Quy will initialize Praxis and Straxia using context/BOOTSTRAP_PRAXIS.md and context/BOOTSTRAP_STRAXIA.md
What gets built (Praxis):
- `harness_optimizer/snapshot.py`: per-component harness versioning
- `harness_optimizer/invariant_tests/`: automated tests for all 13 syntactic invariants
- `harness_optimizer/task_sampler.py`: trajectory JSONL reader
- Extended `session_end.py` in the QUY repo: writes `session_outcome` to events.jsonl
- `harness_optimizer/baselines/Q_baseline.json`: current harness Q score
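One plausible shape for per-component versioning is content hashing: hash each harness file, then derive a snapshot ID from the set of hashes. The sketch below illustrates that idea only. It is not the design of `snapshot.py`, and the function names and the 12-character ID are hypothetical choices.

```python
import hashlib
import json
from pathlib import Path


def hash_component(path: Path) -> str:
    """SHA-256 of a single harness component file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_snapshot(components: dict[str, Path]) -> dict:
    """Hash every component and derive a snapshot ID from the combined hashes."""
    hashes = {name: hash_component(path) for name, path in sorted(components.items())}
    combined = hashlib.sha256(json.dumps(hashes, sort_keys=True).encode("utf-8")).hexdigest()
    return {"snapshot_id": combined[:12], "components": hashes}


def changed_components(old: dict, new: dict) -> list[str]:
    """Names of components whose hash differs between two snapshots."""
    names = set(old["components"]) | set(new["components"])
    return sorted(n for n in names if old["components"].get(n) != new["components"].get(n))
```

Storing one hash per component lets later analysis attribute a Q change to the specific file that changed, not just to the harness as a whole.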
What gets built (Straxia):
- `harness_optimizer/adversarial/adversarial_methodology.md`: the framework for adversarial task design
- `harness_optimizer/adversarial/praxis_adversarial.jsonl`: tasks for testing Praxis (authored by Straxia)
- `harness_optimizer/adversarial/quy_adversarial.jsonl`: tasks for testing Quy (authored by Straxia)
Phase 1 done when:
- All `harness_optimizer/` code files are present and non-stub
- `Q_baseline.json` exists with a real score
- Adversarial task files are sealed (read-only)
- You have reviewed and approved HARNESS-IMMUTABLE tags in all CLAUDE files
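Sealing can be as simple as clearing the write permission bits on the adversarial files. The sketch below assumes that "sealed" means read-only at the filesystem level. The helper names are hypothetical. A read-only flag guards against accidental edits, not deliberate ones.

```python
import stat
from pathlib import Path

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def seal(path: Path) -> None:
    """Clear all write bits so the file becomes read-only."""
    path.chmod(path.stat().st_mode & ~WRITE_BITS)


def is_sealed(path: Path) -> bool:
    """True when no write bit is set on the file."""
    return not (path.stat().st_mode & WRITE_BITS)


def seal_adversarial_dir(directory: Path) -> list[Path]:
    """Seal every .jsonl file in the directory and return the sealed paths."""
    sealed = []
    for path in sorted(directory.glob("*.jsonl")):
        seal(path)
        sealed.append(path)
    return sealed
```

Running `seal_adversarial_dir(Path("harness_optimizer/adversarial"))` once at the end of Phase 1 and checking `is_sealed` in a test covers the "sealed (read-only)" criterion.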
Phase 2 prerequisite: Phase 1 complete, plus at least 4 weeks of instrumented session data collected.
Your action: None during this phase. Just keep using the QUY system normally. The instrumentation collects data automatically.
What gets built (Praxis + Straxia):
- `harness_optimizer/analyze.py`: correlates harness snapshots with Q deltas
- Phase 2 report (Straxia writes): which components correlate with Q
Gate to Phase 3 — all required:
- At least one harness component shows |r| > 0.30, p < 0.05 correlation with Q
- Adversarial task baseline stable (CV < 20% across 4 weeks)
- You have completed human calibration spot-checks: review 40 invariant test results and confirm the automated tests are scoring correctly (Praxis will surface these for you)
If gate fails: Do not proceed. Improve instrumentation, wait for more data.
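The three gate conditions combine into a single boolean. In the sketch below, the correlation `r` and its p-value are assumed to be computed elsewhere (for example with `scipy.stats.pearsonr`). CV is taken as sample standard deviation over mean, and the function and component names are hypothetical.

```python
import statistics


def coefficient_of_variation(values) -> float:
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / abs(mean)


def phase2_gate(component_stats, weekly_adversarial_scores, spot_checks_done) -> dict:
    """component_stats maps a component name to a precomputed (r, p) pair."""
    correlated = [
        name
        for name, (r, p) in component_stats.items()
        if abs(r) > 0.30 and p < 0.05
    ]
    cv = coefficient_of_variation(weekly_adversarial_scores)
    return {
        "correlated_components": correlated,
        "adversarial_cv": cv,
        "passed": bool(correlated) and cv < 0.20 and bool(spot_checks_done),
    }
```

For example, `phase2_gate({"skills_index": (0.42, 0.01)}, [0.90, 0.88, 0.92, 0.90], True)` passes, while the same call with `spot_checks_done=False` does not.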
Phase 3 is the first automated optimization. It carries the lowest risk because only skills/INDEX.md changes, and rollback is instant.
Your action: Review and approve each proposed change to skills/INDEX.md before it's written. Praxis will present proposals with rationale. Takes ~5 minutes per batch.
Gate to Phase 4:
- At least one skill shows p < 0.05, h ≥ 0.20 improvement
- Zero adversarial degradation (A_degrade = 0)
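If `h` here is Cohen's h for two proportions (an assumption; the README does not define it), it can be computed from a skill's success rate before and after a change. The function names and the example rates below are illustrative only.

```python
import math


def cohens_h(p_new: float, p_old: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    for p in (p_new, p_old):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2 * math.asin(math.sqrt(p_new)) - 2 * math.asin(math.sqrt(p_old))


def phase3_gate(p_new: float, p_old: float, p_value: float, adversarial_degradations: int) -> bool:
    """Gate to Phase 4: significant improvement of at least h = 0.20 and A_degrade = 0."""
    return (
        p_value < 0.05
        and cohens_h(p_new, p_old) >= 0.20
        and adversarial_degradations == 0
    )
```

Under this reading, a skill moving from a 50% to a 60% success rate gives an h just above 0.20, so it clears the effect-size bar only if the result is also significant and no adversarial task degrades.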
Phase 4 runs the full loop, human-gated at every step.
Your action at each iteration:
- Receive proposed harness variant from Quy
- Read the diff (what changed and why)
- Approve or reject
- If approved: Praxis merges to main, agents restart, monitoring watches for degradation
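Before reading a proposed diff, it helps to confirm mechanically that no HARNESS-IMMUTABLE section was touched. The sketch below assumes a hypothetical HTML-comment marker syntax, since the real tag format is whatever gets approved in Phase 1. It uses the standard-library `difflib` to render the diff for review.

```python
import difflib
import re

# Hypothetical marker syntax, used for illustration only.
BLOCK_RE = re.compile(
    r"<!-- HARNESS-IMMUTABLE:BEGIN -->(.*?)<!-- HARNESS-IMMUTABLE:END -->",
    re.DOTALL,
)


def immutable_blocks(text: str) -> list[str]:
    """Contents of every tagged section, in document order."""
    return [block.strip() for block in BLOCK_RE.findall(text)]


def immutable_sections_preserved(old_text: str, new_text: str) -> bool:
    """True when the proposed variant leaves every tagged section unchanged."""
    return immutable_blocks(old_text) == immutable_blocks(new_text)


def render_diff(old_text: str, new_text: str, name: str = "CLAUDE.md") -> str:
    """Unified diff of the current and proposed harness file."""
    return "".join(
        difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile=f"{name} (current)",
            tofile=f"{name} (proposed)",
        )
    )
```

A variant that fails `immutable_sections_preserved` can be rejected without further review. Removing a tag also counts as a change, because the list of blocks no longer matches.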
Hard stops — act immediately if any trigger:
- Any SEC-1/2/3 or AGENT-1 invariant test fails post-deployment → roll back immediately (Praxis does this with `rollback.py`)
- Q drops > 0.10 from baseline → auto-rollback fires and you are notified
- Any adversarial task erosion → halt the optimization queue for human review
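The hard stops amount to a small decision table. The sketch below is one way to encode it. The action strings and function name are hypothetical, and the 0.05 alert level is the alert threshold listed elsewhere in this README.

```python
CRITICAL_INVARIANTS = {"SEC-1", "SEC-2", "SEC-3", "AGENT-1"}
ROLLBACK_DROP = 0.10  # Q drop from baseline that triggers auto-rollback
ALERT_DROP = 0.05     # Q drop from baseline that raises a warning


def monitor_action(q_baseline, q_current, failed_invariants, adversarial_erosion) -> str:
    """Map post-deployment observations to the most severe required action."""
    if CRITICAL_INVARIANTS & set(failed_invariants):
        return "rollback"
    drop = q_baseline - q_current
    if drop > ROLLBACK_DROP:
        return "rollback"
    if adversarial_erosion:
        return "halt_queue"
    if drop > ALERT_DROP:
        return "alert"
    return "ok"
```

Checks are ordered by severity, so a critical invariant failure wins even when Q looks healthy.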
| Gate | What you review | Time needed |
|---|---|---|
| Phase 1 start | HARNESS-IMMUTABLE tag proposals | 15 min |
| Phase 1 complete | Eval set labels (200 tasks) | 30–45 min |
| Phase 2 complete | 40 invariant test spot-checks | 20 min |
| Each Phase 3 proposal | Skill activation signal change | 5 min each |
| Each Phase 4 proposal | Harness diff + rationale | 10–15 min each |
Roll back to a previous harness:

    cd C:\Users\ryuke\Desktop\QUY
    python harness_optimizer/rollback.py --snapshot <snapshot_id>
    # Then restart agents:
    echo "restart" > _discord_out/_restart_praxis.flag
    echo "restart" > _discord_out/_restart_straxia.flag
    echo "restart" > _discord_out/_restart_quy.flag

Stop all optimization immediately:

    echo '{"active": true}' > C:\Users\ryuke\.claude\estop.json

All agents will halt on next task. Remove the file to resume.
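On the agent side, the e-stop check could look like the sketch below. It rests on two assumptions that are not confirmed by the QUY code: a file with `"active": false` counts as inactive, and an unreadable file is treated as active. The second matters because a shell redirect may not write UTF-8.

```python
import json
from pathlib import Path

ESTOP_PATH = Path(r"C:\Users\ryuke\.claude\estop.json")


def estop_active(path: Path = ESTOP_PATH) -> bool:
    """True when the e-stop file exists and does not explicitly say inactive."""
    if not path.exists():
        return False
    try:
        data = json.loads(path.read_text(encoding="utf-8-sig"))
    except (OSError, ValueError):
        # Fail safe: an e-stop file that cannot be read or parsed still halts.
        return True
    if isinstance(data, dict):
        return bool(data.get("active", True))
    return True
```

An agent would call `estop_active()` before picking up each task and stop if it returns `True`.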
Find snapshot IDs:

    ls harness_optimizer/snapshots/
    cat harness_optimizer/baselines/Q_baseline.json   # shows baseline snapshot_id

| Metric and threshold | Used as | Meaning |
|---|---|---|
| Δ_Q ≥ 0.05 | Success criterion | 5pp improvement on 200-task eval set |
| A_degrade = 0 | Hard requirement at deployment | No adversarial task failures |
| \|r\| > 0.30, p < 0.05 | Phase 2 gate | Correlation strong enough to optimize |
| h ≥ 0.20 | Phase 3 gate | Small but measurable effect on skills |
| Q drops > 0.10 | Auto-rollback trigger | Significant regression |
| Q drops > 0.05 | Alert threshold | Degradation warning |
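The deployment criterion (Δ_Q ≥ 0.05 with A_degrade = 0) can be checked in whole tasks to avoid floating-point edge cases at the boundary. This sketch assumes Q is the pass fraction on the eval set, which the README implies but does not state. On 200 tasks, 5pp then corresponds to 10 additional passes. The function name is hypothetical.

```python
def accept_variant(
    passed_baseline: int,
    passed_candidate: int,
    adversarial_failures: int,
    total: int = 200,
) -> bool:
    """True when the candidate gains at least 5pp and no adversarial task fails."""
    if total <= 0:
        raise ValueError("total must be positive")
    gained_tasks = passed_candidate - passed_baseline
    # gained_tasks / total >= 0.05, rearranged to stay in integer arithmetic.
    return gained_tasks * 100 >= 5 * total and adversarial_failures == 0
```

With the default `total=200`, `accept_variant(150, 160, 0)` holds exactly at the 5pp boundary, while `accept_variant(150, 159, 0)` falls short.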
Protected files:
- `harness_optimizer/adversarial/*.jsonl`: sealed after Phase 1; editing invalidates the adversarial oracle
- `harness_optimizer/baselines/Q_baseline.json`: written once in Phase 1; never overwritten
- `spec/BEHAVIORAL_INVARIANTS.md`: the oracle spec; changes require full team review
If context is wiped, feed context/BOOTSTRAP_QUY.md to Quy.
If Praxis needs context, feed context/BOOTSTRAP_PRAXIS.md.
If Straxia needs context, feed context/BOOTSTRAP_STRAXIA.md.
Each bootstrap doc contains the complete project state, so the agent can resume where it left off.