Ryuketsukami/quy-harness-optimizer

QUY Harness Optimizer

What this is: An implementation of Meta-Harness optimization (arXiv 2603.28052) for the QUY multi-agent system. It automatically improves agent harness components (CLAUDE.md files, skills activation, memory retrieval) while enforcing behavioral invariants as the primary oracle.

Current status: Phase 0 complete. Phase 1 ready to begin.

Companion repo: C:\Users\ryuke\Desktop\QUY — the live QUY system this optimizer targets.


Directory Map

quy-harness-optimizer/
├── README.md                        ← you are here
├── context/
│   ├── BOOTSTRAP_QUY.md             ← feed to Quy at session start
│   ├── BOOTSTRAP_PRAXIS.md          ← feed to Praxis at session start
│   └── BOOTSTRAP_STRAXIA.md         ← feed to Straxia at session start
├── spec/
│   ├── BEHAVIORAL_INVARIANTS.md     ← the oracle (27 invariants, 4 tiers)
│   └── PROPOSAL.md                  ← full implementation plan
├── agent_configs/
│   ├── CLAUDE_QUY.md                ← agent identity files (reference copies)
│   ├── CLAUDE_PRAXIS.md
│   ├── CLAUDE_STRAXIA.md
│   ├── SOUL_QUY.md
│   ├── SOUL_PRAXIS.md
│   └── SOUL_STRAXIA.md
├── reference/
│   ├── session_end.py               ← copy of QUY hook — Praxis extends this
│   ├── memory_core.py               ← copy of QUY context assembly hook
│   ├── SKILL_STATS.json             ← Phase 3 signal data
│   └── skills_INDEX.md              ← Phase 3 optimization target
└── harness_optimizer/               ← all implementation code goes here
    ├── invariant_tests/             ← automated test harness (Phase 1)
    ├── adversarial/                 ← sealed adversarial task files (Phase 1)
    ├── baselines/                   ← Q_baseline.json (Phase 1)
    ├── snapshots/                   ← daily harness snapshots (Phase 1+)
    ├── proposals/                   ← pending/ accepted/ rejected/ (Phase 3+)
    └── monitoring/                  ← weekly Q tracking (Phase 4+)

Prerequisites

Before starting Phase 1, verify:

  • Python 3.11 available (python --version)
  • whisper_env exists at C:\Users\ryuke\Desktop\QUY\whisper_env\
  • Claude Code running with API access
  • Git configured (git config --global user.name)
  • QUY system running (Discord bot active with Praxis, Straxia, Quy)

Phase Execution Guide

Phase 0 — COMPLETE

spec/BEHAVIORAL_INVARIANTS.md is written. 27 invariants, 4 oracle tiers. This is the oracle.


Phase 1: Instrumentation (Weeks 1–2)

Your actions before starting:

  1. Review HARNESS-IMMUTABLE proposals — Praxis will propose which sections of each CLAUDE file to tag. You approve before any tags are written. Look for: Security section, identity paragraph, loop guard, config immutability rule.

  2. Eval set designation — Praxis will pull candidate tasks from the C:\Users\ryuke\Desktop\QUY\.claude\trajectory\ logs and present them to you grouped by category. You validate or relabel them. You need 50 tasks per category (200 total): software engineering, research/analysis, orchestration, operational.
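The candidate-pulling step could look roughly like the sketch below. It is not the project's task_sampler.py: the `category` field name and the one-JSON-object-per-line layout are assumptions about the trajectory logs, and the function name is hypothetical.

```python
import json
import random
from collections import defaultdict

# The four categories named in this README.
CATEGORIES = (
    "software engineering",
    "research/analysis",
    "orchestration",
    "operational",
)

def sample_eval_candidates(jsonl_lines, per_category=50, seed=0):
    """Group trajectory records by category and draw up to per_category from each.

    Assumes each non-empty line is a JSON object with a "category" key
    (a hypothetical field name; adjust to the real trajectory schema).
    """
    by_category = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        category = record.get("category")
        if category in CATEGORIES:
            by_category[category].append(record)

    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sampled = {}
    for category in CATEGORIES:
        pool = by_category[category]
        sampled[category] = rng.sample(pool, min(per_category, len(pool)))
    return sampled
```

A category that comes back with fewer than 50 records signals that more trajectory data is needed before the eval set can be finalized.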

Start agents:

1. Open new Claude Code session
2. Paste contents of context/BOOTSTRAP_QUY.md as your first message
3. Quy will initialize Praxis and Straxia using context/BOOTSTRAP_PRAXIS.md and context/BOOTSTRAP_STRAXIA.md

What gets built (Praxis):

  • harness_optimizer/snapshot.py — per-component harness versioning
  • harness_optimizer/invariant_tests/ — automated tests for all 13 syntactic invariants
  • harness_optimizer/task_sampler.py — trajectory JSONL reader
  • Extended session_end.py in the QUY repo — writes session_outcome to events.jsonl
  • harness_optimizer/baselines/Q_baseline.json — current harness Q score
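As an illustration of what "per-component harness versioning" might mean in practice, here is a minimal sketch that hashes each harness file and derives a snapshot ID from the hashes. The function names, the 12-character ID, and the choice of SHA-256 are assumptions for illustration, not the design of the actual snapshot.py.

```python
import hashlib
import json
from pathlib import Path

def snapshot_components(component_paths):
    """Return {component name: SHA-256 hex digest of the file's bytes}."""
    return {
        name: hashlib.sha256(Path(path).read_bytes()).hexdigest()
        for name, path in sorted(component_paths.items())
    }

def snapshot_id(component_hashes):
    """Derive a short, deterministic ID from the per-component hashes."""
    canonical = json.dumps(component_hashes, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

def changed_components(old, new):
    """List the components whose hash differs between two snapshots."""
    return sorted(
        name for name in set(old) | set(new) if old.get(name) != new.get(name)
    )
```

Hashing per component rather than per repository makes it possible to attribute a Q change to the specific file that changed between two snapshots.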

What gets built (Straxia):

  • harness_optimizer/adversarial/adversarial_methodology.md — the framework for adversarial task design
  • harness_optimizer/adversarial/praxis_adversarial.jsonl — tasks for testing Praxis (authored by Straxia)
  • harness_optimizer/adversarial/quy_adversarial.jsonl — tasks for testing Quy (authored by Straxia)

Phase 1 done when:

  • All harness_optimizer/ code files are present and non-stub
  • Q_baseline.json exists with a real score
  • Adversarial task files are sealed (read-only)
  • You have reviewed and approved HARNESS-IMMUTABLE tags in all CLAUDE files
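The "sealed (read-only)" criterion can be checked mechanically. The sketch below assumes that sealing simply means clearing the file's write permission bits; the helper names are hypothetical and the project may seal files differently.

```python
import os
import stat
from pathlib import Path

# All three write-permission bits (owner, group, other).
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def seal(path):
    """Clear every write bit on the file, leaving other bits unchanged."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~WRITE_BITS)

def is_sealed(path):
    """True if no write bit is set on the file."""
    return not (os.stat(path).st_mode & WRITE_BITS)

def unsealed_files(directory, pattern="*.jsonl"):
    """Names of matching files in the directory that are still writable."""
    return sorted(
        p.name for p in Path(directory).glob(pattern) if not is_sealed(p)
    )
```

Running `unsealed_files("harness_optimizer/adversarial")` and getting an empty list would be one way to confirm this item of the checklist.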

Phase 2: Retrospective Analysis (Weeks 3–5)

Prerequisite: Phase 1 complete + at least 4 weeks of instrumented session data collected.

Your action: None during this phase. Just keep using the QUY system normally. The instrumentation collects data automatically.

What gets built (Praxis + Straxia):

  • harness_optimizer/analyze.py — correlates harness snapshots with Q deltas
  • Phase 2 report (Straxia writes) — which components correlate with Q

Gate to Phase 3 — all required:

  • At least one harness component shows |r| > 0.30, p < 0.05 correlation with Q
  • Adversarial task baseline stable (CV < 20% across 4 weeks)
  • You have completed human calibration spot-checks: review 40 invariant test results and confirm the automated tests are scoring correctly (Praxis will surface these for you)

If gate fails: Do not proceed. Improve instrumentation, wait for more data.
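The three gate conditions can be combined into a single check, sketched below. It assumes the correlation coefficient r and its p-value are computed elsewhere (for example in analyze.py), and that CV means sample standard deviation divided by the mean; the function and parameter names are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean.

    Assumes a non-zero mean; a zero mean would raise ZeroDivisionError.
    """
    return statistics.stdev(values) / statistics.fmean(values)

def phase2_gate(correlations, weekly_adversarial_scores, spot_checks_done):
    """Evaluate the Phase 2 -> Phase 3 gate.

    correlations: {component name: (r, p_value)}, computed elsewhere.
    weekly_adversarial_scores: one adversarial baseline score per week.
    spot_checks_done: True once the 40 human spot-checks are complete.
    """
    correlated = sorted(
        name
        for name, (r, p) in correlations.items()
        if abs(r) > 0.30 and p < 0.05
    )
    cv = coefficient_of_variation(weekly_adversarial_scores)
    passed = bool(correlated) and cv < 0.20 and spot_checks_done
    return {"correlated_components": correlated, "cv": cv, "passed": passed}
```

Returning the intermediate values alongside the verdict makes it easier to see which condition blocked the gate.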


Phase 3: Skills Activation Optimization (Weeks 6–7)

First automated optimization. Lowest risk — only skills/INDEX.md changes. Rollback is instant.

Your action: Review and approve each proposed change to skills/INDEX.md before it's written. Praxis will present proposals with rationale. Takes ~5 minutes per batch.

Gate to Phase 4:

  • At least one skill shows p < 0.05, h ≥ 0.20 improvement
  • Zero adversarial degradation (A_degrade = 0)
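If h here denotes Cohen's h (an assumption; this README does not define it), the effect size between two success rates is the difference of their arcsine-transformed values. The sketch below computes it and applies the gate, with the p-value supplied by whatever significance test the analysis uses; the function names are hypothetical.

```python
import math

def cohens_h(p_before, p_after):
    """Cohen's h for two proportions: 2*asin(sqrt(p_after)) - 2*asin(sqrt(p_before)).

    Positive values mean the rate improved.
    """
    return 2 * math.asin(math.sqrt(p_after)) - 2 * math.asin(math.sqrt(p_before))

def phase3_gate(p_before, p_after, p_value, adversarial_failures):
    """True if one skill meets h >= 0.20 and p < 0.05 with A_degrade = 0."""
    h = cohens_h(p_before, p_after)
    return h >= 0.20 and p_value < 0.05 and adversarial_failures == 0
```

For example, a skill whose success rate moves from 0.50 to 0.60 gives h ≈ 0.201, just over the threshold, while a move from 0.50 to 0.55 gives h ≈ 0.100 and does not qualify.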

Phase 4: Joint Optimization (Weeks 8–12)

Full loop. Human-gated at every step.

Your action at each iteration:

  1. Receive proposed harness variant from Quy
  2. Read the diff (what changed and why)
  3. Approve or reject
  4. If approved: Praxis merges to main, agents restart, monitoring watches for degradation

Hard stops — act immediately if any trigger:

  • Any SEC-1/2/3 or AGENT-1 invariant test fails post-deployment → roll back immediately (Praxis does this by running rollback.py)
  • Q drops > 0.10 from baseline → auto-rollback fires and you are notified
  • Any adversarial task erosion → halt optimization queue, human review
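The hard stops, together with the 0.05 alert threshold listed under Key Numbers, amount to a small decision rule. The following is a sketch of that rule only; the function name, the returned labels, and the priority order between triggers are assumptions, not the behavior of the monitoring code.

```python
def monitoring_action(q_baseline, q_current,
                      critical_invariant_failed=False,
                      adversarial_erosion=False):
    """Map post-deployment signals to an action label.

    Labels ("rollback", "halt_queue", "alert", "ok") are hypothetical.
    """
    # A SEC-1/2/3 or AGENT-1 failure always forces a rollback.
    if critical_invariant_failed:
        return "rollback"
    drop = q_baseline - q_current
    # A Q drop of more than 0.10 triggers the auto-rollback.
    if drop > 0.10:
        return "rollback"
    # Adversarial erosion halts the optimization queue for human review.
    if adversarial_erosion:
        return "halt_queue"
    # A Q drop of more than 0.05 is a degradation warning.
    if drop > 0.05:
        return "alert"
    return "ok"
```

Because the comparisons are strict and use floating-point subtraction, a drop that lands exactly on a threshold may fall on either side; real monitoring code would want an explicit tolerance.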

Approval Gates Summary

| Gate | What you review | Time needed |
| --- | --- | --- |
| Phase 1 start | HARNESS-IMMUTABLE tag proposals | 15 min |
| Phase 1 complete | Eval set labels (200 tasks) | 30–45 min |
| Phase 2 complete | 40 invariant test spot-checks | 20 min |
| Each Phase 3 proposal | Skill activation signal change | 5 min each |
| Each Phase 4 proposal | Harness diff + rationale | 10–15 min each |

Emergency Procedures

Roll back to a previous harness:

cd C:\Users\ryuke\Desktop\QUY
python harness_optimizer/rollback.py --snapshot <snapshot_id>
# Then restart agents:
echo "restart" > _discord_out/_restart_praxis.flag
echo "restart" > _discord_out/_restart_straxia.flag
echo "restart" > _discord_out/_restart_quy.flag

Stop all optimization immediately:

echo '{"active": true}' > C:\Users\ryuke\.claude\estop.json

All agents will halt on next task. Remove the file to resume.
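On the agent side, the emergency-stop check might look like the sketch below. How the agents actually read estop.json is not described here, so the function name and the fail-safe handling of unreadable files are assumptions.

```python
import json
from pathlib import Path

def estop_active(path):
    """Return True if the emergency-stop file says optimization must halt.

    Assumed semantics: a missing file means "run"; a file with
    {"active": true} means "halt"; an unreadable or malformed file is
    treated as "halt" so that errors fail safe.
    """
    flag = Path(path)
    if not flag.exists():
        return False
    try:
        data = json.loads(flag.read_text(encoding="utf-8-sig"))
    except (OSError, ValueError):
        return True  # cannot read or parse the file: halt to be safe
    if not isinstance(data, dict):
        return True  # unexpected JSON shape: halt to be safe
    return bool(data.get("active", True))
```

An agent would call this before picking up each task, for example `estop_active(r"C:\Users\ryuke\.claude\estop.json")`.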

Find snapshot IDs:

ls harness_optimizer/snapshots/
cat harness_optimizer/baselines/Q_baseline.json  # shows baseline snapshot_id

Key Numbers to Remember

| Metric and threshold | Role | Meaning |
| --- | --- | --- |
| Δ_Q ≥ 0.05 | Success criterion | 5-percentage-point improvement on the 200-task eval set |
| A_degrade = 0 | Hard requirement at deployment | No adversarial task failures |
| \|r\| > 0.30, p < 0.05 | Phase 2 gate | Correlation strong enough to optimize |
| h ≥ 0.20 | Phase 3 gate | Measurable effect on skills (h = 0.20 is conventionally a small effect, if h is Cohen's h) |
| Q drops > 0.10 | Auto-rollback trigger | Significant regression |
| Q drops > 0.05 | Alert threshold | Degradation warning |

Files You Should Never Touch

  • harness_optimizer/adversarial/*.jsonl — sealed after Phase 1; editing invalidates the adversarial oracle
  • harness_optimizer/baselines/Q_baseline.json — written once in Phase 1; never overwritten
  • spec/BEHAVIORAL_INVARIANTS.md — the oracle spec; changes require full team review

Contact / Resume

If context is wiped, feed context/BOOTSTRAP_QUY.md to Quy. If Praxis needs context, feed context/BOOTSTRAP_PRAXIS.md. If Straxia needs context, feed context/BOOTSTRAP_STRAXIA.md.

Each bootstrap doc contains the complete project state, so the agent can resume where it left off.
