QUY Harness Optimizer

What this is: An implementation of Meta-Harness optimization (arXiv 2603.28052) for the QUY multi-agent system. It automatically improves agent harness components (CLAUDE.md files, skills activation, memory retrieval) while enforcing behavioral invariants as the primary oracle.

Current status: Phase 0 complete. Phase 1 ready to begin.

Companion repo: C:\Users\ryuke\Desktop\QUY — the live QUY system this optimizer targets.

Directory Map

quy-harness-optimizer/
├── README.md                        ← you are here
├── context/
│   ├── BOOTSTRAP_QUY.md             ← feed to Quy at session start
│   ├── BOOTSTRAP_PRAXIS.md          ← feed to Praxis at session start
│   └── BOOTSTRAP_STRAXIA.md         ← feed to Straxia at session start
├── spec/
│   ├── BEHAVIORAL_INVARIANTS.md     ← the oracle (27 invariants, 4 tiers)
│   └── PROPOSAL.md                  ← full implementation plan
├── agent_configs/
│   ├── CLAUDE_QUY.md                ← agent identity files (reference copies)
│   ├── CLAUDE_PRAXIS.md
│   ├── CLAUDE_STRAXIA.md
│   ├── SOUL_QUY.md
│   ├── SOUL_PRAXIS.md
│   └── SOUL_STRAXIA.md
├── reference/
│   ├── session_end.py               ← copy of QUY hook — Praxis extends this
│   ├── memory_core.py               ← copy of QUY context assembly hook
│   ├── SKILL_STATS.json             ← Phase 3 signal data
│   └── skills_INDEX.md              ← Phase 3 optimization target
└── harness_optimizer/               ← all implementation code goes here
    ├── invariant_tests/             ← automated test harness (Phase 1)
    ├── adversarial/                 ← sealed adversarial task files (Phase 1)
    ├── baselines/                   ← Q_baseline.json (Phase 1)
    ├── snapshots/                   ← daily harness snapshots (Phase 1+)
    ├── proposals/                   ← pending/ accepted/ rejected/ (Phase 3+)
    └── monitoring/                  ← weekly Q tracking (Phase 4+)

Prerequisites

Before starting Phase 1, verify:

Python 3.11 available (python --version)
whisper_env exists at C:\Users\ryuke\Desktop\QUY\whisper_env\
Claude Code running with API access
Git configured (git config --global user.name)
QUY system running (Discord bot active with Praxis, Straxia, Quy)
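
The scriptable items in this checklist can be verified in one pass. The following is a minimal sketch, not part of the repo: the function name `preflight` and its result keys are hypothetical, "Python 3.11 available" is interpreted as 3.11 or newer, and the Claude Code and Discord checks are left manual.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def preflight(whisper_env: Path) -> dict[str, bool]:
    """Check the scriptable prerequisites (hypothetical helper, not in the repo)."""
    git = shutil.which("git")
    git_user_name = False
    if git:
        # `git config --global user.name` exits non-zero when the value is unset.
        result = subprocess.run(
            [git, "config", "--global", "user.name"],
            capture_output=True,
            text=True,
        )
        git_user_name = result.returncode == 0 and bool(result.stdout.strip())
    return {
        # Assumption: "Python 3.11 available" means 3.11 or newer.
        "python_ok": sys.version_info >= (3, 11),
        "whisper_env": whisper_env.is_dir(),
        "git_user_name": git_user_name,
    }


if __name__ == "__main__":
    checks = preflight(Path(r"C:\Users\ryuke\Desktop\QUY\whisper_env"))
    for name, ok in checks.items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
```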

Phase Execution Guide

Phase 0 — COMPLETE

spec/BEHAVIORAL_INVARIANTS.md is written. 27 invariants, 4 oracle tiers. This is the oracle.

Phase 1: Instrumentation (Weeks 1–2)

Your actions before starting:

Review HARNESS-IMMUTABLE proposals: Praxis will propose which sections of each CLAUDE file to tag, and you approve before any tags are written. Look for the Security section, the identity paragraph, the loop guard, and the config immutability rule.
Designate the eval set: Praxis will pull candidate tasks from the logs in C:\Users\ryuke\Desktop\QUY\.claude\trajectory\ and present them to you grouped by category. You validate or relabel them. The set needs 50 tasks per category (200 total) across four categories: software engineering, research/analysis, orchestration, operational.
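
To see how far the candidate pool is from the 50-per-category target, the grouping step could look like the sketch below. The trajectory log schema is not described in this README, so the `category` field and both function names are assumptions made for illustration.

```python
import json
from collections import defaultdict

CATEGORIES = ("software engineering", "research/analysis", "orchestration", "operational")
PER_CATEGORY = 50  # 4 categories x 50 = 200 tasks total


def group_by_category(jsonl_lines):
    """Group JSONL task records by their (assumed) 'category' field."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        groups[record.get("category", "unlabeled")].append(record)
    return groups


def shortfall(groups):
    """Return how many more validated tasks each category still needs."""
    return {c: max(0, PER_CATEGORY - len(groups.get(c, []))) for c in CATEGORIES}
```

Records that land in `unlabeled` are the ones that need a human label before they count toward the 200.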

Start agents:

1. Open new Claude Code session
2. Paste contents of context/BOOTSTRAP_QUY.md as your first message
3. Quy will initialize Praxis and Straxia using context/BOOTSTRAP_PRAXIS.md and context/BOOTSTRAP_STRAXIA.md

What gets built (Praxis):

harness_optimizer/snapshot.py — per-component harness versioning
harness_optimizer/invariant_tests/ — automated tests for all 13 syntactic invariants
harness_optimizer/task_sampler.py — trajectory JSONL reader
Extended session_end.py in the QUY repo — writes session_outcome to events.jsonl
harness_optimizer/baselines/Q_baseline.json — current harness Q score
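
One way to picture per-component versioning is a content hash per harness file, so that two snapshots can be compared component by component. This is a sketch under that assumption. The actual design of snapshot.py is defined in spec/PROPOSAL.md, and the function names here are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot_components(paths):
    """Map each component file to the SHA-256 digest of its bytes."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_components(old, new):
    """List components whose digest differs between two snapshots."""
    return sorted(name for name in new if old.get(name) != new[name])
```

Comparing digests rather than whole-repo commits makes it possible to attribute a Q change to the specific component that moved.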

What gets built (Straxia):

harness_optimizer/adversarial/adversarial_methodology.md — the framework for adversarial task design
harness_optimizer/adversarial/praxis_adversarial.jsonl — tasks for testing Praxis (authored by Straxia)
harness_optimizer/adversarial/quy_adversarial.jsonl — tasks for testing Quy (authored by Straxia)

Phase 1 done when:

All harness_optimizer/ code files are present and non-stub
Q_baseline.json exists with a real score
Adversarial task files are sealed (read-only)
You have reviewed and approved HARNESS-IMMUTABLE tags in all CLAUDE files
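
Sealing the adversarial files could be as simple as clearing their write permission bits. The sketch below assumes that "sealed (read-only)" means file-system permissions. The helper names are hypothetical, and on Windows `os.chmod` only toggles the read-only attribute.

```python
import os
import stat


def seal(path):
    """Clear all write bits so the file becomes read-only."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def is_sealed(path):
    """Return True when the owner write bit is cleared."""
    return not (os.stat(path).st_mode & stat.S_IWUSR)
```

Permission bits guard against accidental edits only. They do not stop a user or agent who deliberately resets them.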

Phase 2: Retrospective Analysis (Weeks 3–5)

Prerequisite: Phase 1 complete + at least 4 weeks of instrumented session data collected.

Your action: None during this phase. Just keep using the QUY system normally. The instrumentation collects data automatically.

What gets built (Praxis + Straxia):

harness_optimizer/analyze.py — correlates harness snapshots with Q deltas
Phase 2 report (Straxia writes) — which components correlate with Q

Gate to Phase 3 — all required:

At least one harness component correlates with Q at |r| > 0.30, p < 0.05
Adversarial task baseline stable (CV < 20% across 4 weeks)
You have completed human calibration spot-checks: review 40 invariant test results and confirm the automated tests are scoring correctly (Praxis will surface these for you)

If gate fails: Do not proceed. Improve instrumentation, wait for more data.
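
The three gate conditions can be combined into a single check. In this sketch the correlation `r` and its p-value are assumed to be computed elsewhere (for example by analyze.py), CV is taken as sample standard deviation divided by mean, and the function names are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV = sample standard deviation / mean (assumed definition)."""
    return statistics.stdev(values) / statistics.mean(values)


def gate_to_phase3(r, p, weekly_adversarial_scores, spot_checks_confirmed):
    """Return True only if every Phase 2 gate condition holds."""
    return bool(
        abs(r) > 0.30
        and p < 0.05
        and coefficient_of_variation(weekly_adversarial_scores) < 0.20
        and spot_checks_confirmed
    )
```

For example, weekly adversarial scores of 0.80, 0.82, 0.78, 0.80 give a CV of about 2%, well inside the 20% limit.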

Phase 3: Skills Activation Optimization (Weeks 6–7)

First automated optimization. Lowest risk — only skills/INDEX.md changes. Rollback is instant.

Your action: Review and approve each proposed change to skills/INDEX.md before it's written. Praxis will present proposals with rationale. Takes ~5 minutes per batch.

Gate to Phase 4:

At least one skill improves with p < 0.05 and h ≥ 0.20
Zero adversarial degradation (A_degrade = 0)
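
This README does not define h. Assuming it is Cohen's h, the effect size for a difference between two proportions (here, a skill's success rate before and after a change), the gate can be sketched as follows. The p-value is assumed to come from a separate significance test, and the function names are hypothetical.

```python
import math


def cohens_h(p_new, p_old):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p_new)) - 2 * math.asin(math.sqrt(p_old))


def gate_to_phase4(p_new, p_old, p_value, a_degrade):
    """Return True if the skill improvement and adversarial checks both pass."""
    return p_value < 0.05 and cohens_h(p_new, p_old) >= 0.20 and a_degrade == 0
```

As a point of scale, moving a success rate from 0.50 to 0.60 gives h of roughly 0.201, just over the threshold.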

Phase 4: Joint Optimization (Weeks 8–12)

Full loop. Human-gated at every step.

Your action at each iteration:

Receive proposed harness variant from Quy
Read the diff (what changed and why)
Approve or reject
If approved: Praxis merges to main, agents restart, monitoring watches for degradation

Hard stops — act immediately if any trigger:

Any SEC-1/2/3 or AGENT-1 invariant test fails post-deployment → Praxis rolls back immediately using rollback.py
Q drops > 0.10 from baseline → auto-rollback fires and you are notified
Any adversarial task erosion → halt optimization queue, human review
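
The hard stops, together with the 0.05 alert threshold listed under Key Numbers, amount to a small decision rule. The sketch below is one possible reading of it. The function name, argument names, and returned labels are hypothetical, and the real monitoring code may order or combine the checks differently.

```python
def monitor_action(q_now, q_baseline, critical_invariant_failed, adversarial_erosion):
    """Map post-deployment signals to an action label (illustrative only)."""
    drop = q_baseline - q_now
    if critical_invariant_failed or drop > 0.10:
        return "rollback"    # SEC-1/2/3 or AGENT-1 failure, or significant regression
    if adversarial_erosion:
        return "halt_queue"  # stop the optimization queue and request human review
    if drop > 0.05:
        return "alert"       # degradation warning, no automatic action
    return "ok"
```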

Approval Gates Summary

Gate	What you review	Time needed
Phase 1 start	HARNESS-IMMUTABLE tag proposals	15 min
Phase 1 complete	Eval set labels (200 tasks)	30–45 min
Phase 2 complete	40 invariant test spot-checks	20 min
Each Phase 3 proposal	Skill activation signal change	5 min each
Each Phase 4 proposal	Harness diff + rationale	10–15 min each

Emergency Procedures

Roll back to a previous harness:

cd C:\Users\ryuke\Desktop\QUY
python harness_optimizer/rollback.py --snapshot <snapshot_id>
# Then restart agents:
echo "restart" > _discord_out/_restart_praxis.flag
echo "restart" > _discord_out/_restart_straxia.flag
echo "restart" > _discord_out/_restart_quy.flag

Stop all optimization immediately:

echo '{"active": true}' > C:\Users\ryuke\.claude\estop.json

All agents will halt on next task. Remove the file to resume.
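
Because `echo` quoting and redirection behave differently across cmd, PowerShell, and Git Bash, a shell-independent way to write and clear the flag may be useful. This is a sketch with hypothetical helper names. It assumes, based on "Remove the file to resume", that the presence of the file is what agents check.

```python
import json
from pathlib import Path


def engage_estop(path):
    """Write the estop flag file as UTF-8 JSON."""
    Path(path).write_text(json.dumps({"active": True}), encoding="utf-8")


def estop_engaged(path):
    """Assumption: agents treat the file's presence as the halt signal."""
    return Path(path).exists()


def release_estop(path):
    """Remove the flag file; do nothing if it is already gone."""
    Path(path).unlink(missing_ok=True)
```

In practice `path` would be C:\Users\ryuke\.claude\estop.json.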

Find snapshot IDs:

ls harness_optimizer/snapshots/
cat harness_optimizer/baselines/Q_baseline.json  # shows baseline snapshot_id

Key Numbers to Remember

Threshold	Role	Meaning
Δ_Q ≥ 0.05	Success criterion	5pp improvement on 200-task eval set
A_degrade = 0	Hard requirement at deployment	No adversarial task failures
|r| > 0.30, p < 0.05	Phase 2 gate	Correlation strong enough to optimize
h ≥ 0.20	Phase 3 gate	Measurable small effect on skills
Q drops > 0.10	Auto-rollback trigger	Significant regression
Q drops > 0.05	Alert threshold	Degradation warning

Files You Should Never Touch

harness_optimizer/adversarial/*.jsonl — sealed after Phase 1; editing invalidates the adversarial oracle
harness_optimizer/baselines/Q_baseline.json — written once in Phase 1; never overwritten
spec/BEHAVIORAL_INVARIANTS.md — the oracle spec; changes require full team review

Contact / Resume

If context is wiped, feed context/BOOTSTRAP_QUY.md to Quy. If Praxis needs context, feed context/BOOTSTRAP_PRAXIS.md. If Straxia needs context, feed context/BOOTSTRAP_STRAXIA.md.

Each bootstrap doc contains the complete project state, so an agent can resume from it without additional context.
