# Ralph Orchestrator — Experiment Research Loop
# 3-hat event-driven workflow on top of Ralph v2.
# Ralph handles routing; `experiment` owns experiment state; `experiment run`/pueue own execution.
#
# Flow:
#   research.start -> Researcher -> experiment.done -> Reviewer -> review.proceed|review.revise|review.killed
#   review.revise -> Researcher (fix and resubmit) -> experiment.done -> Reviewer
#   review.proceed|review.killed -> Analyst -> learning.complete -> Researcher
#
# Start: ralph run -a -v
# Manual event injection is available via `ralph emit`, but normal operation is event-driven from the loop.
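#
# Manual injection sketch. Assumption, not verified against this Ralph version:
# `ralph emit` is taken to accept an event topic followed by a payload string.
# Check the CLI's own help output before relying on this form.
#   ralph emit "research.start" "<payload>"
# The topic must be one that some hat lists under `triggers`, otherwise nothing picks it up.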

cli:
  backend: "claude"
  default_mode: "autonomous"

adapters:
  claude:
    timeout: 3600  # 1h max per hat (covers experiment execution up to ~45min)
    args:
      - "--bare"

event_loop:
  # prompt_file removed — hat instructions are self-contained
  # Framework + current focus: PLAN.md
  starting_event: "research.start"
  completion_promise: "RESEARCH_BACKLOG_DRAINED"
  max_iterations: 1000
  max_runtime_seconds: 86400
  idle_timeout_secs: 1200  # 20 min idle = stuck (moved from cli: per docs)
  checkpoint_interval: 10

core:
  # Make Ralph's working files explicit instead of relying on defaults.
  scratchpad: ".ralph/agent/scratchpad.md"
  specs_dir: ".ralph/specs/"
  guardrails:
    - "READ `PLAN.md` FIRST. Part 1 = framework principles (stable); Part 2 = current research focus (the 'what' for now). If Part 2 references parent-dir deep docs for architecture detail, read those too — do not duplicate them into the repo."
    - "PROOF-FIRST RESEARCH (Constructive Mathematics): Every experiment requires a formal mathematical proof (Theorem/Proof/QED) BEFORE any code is written. The proof must: (1) identify the failure mode, (2) cite prior theorems, (3) derive a guarantee that makes failure impossible, (4) predict specific BEHAVIORAL outcomes the experiment will verify. Metrics (PPL, cosine, accuracy) are proxies. A metric improving without behavioral progress is not a finding."
    - "Use `experiment claim <worker>` to pick work. Use `experiment complete <id> ...` to finish."
    - "NEVER generate experiments from analogies. Every new experiment MUST cite an arxiv paper or prior finding."
    - "Before implementing, check `experiment query` for prior results on the same topic."
    - "MATH.md must explain mechanisms at the ATOMIC level: exact equations, why it works (theorem/lemma), what breaks it (derive failure conditions). No buzzword recitation."
    - "KILLED experiments are not dead ends. Ask: 'What structure makes this failure impossible?' If you can derive a mathematical guarantee, resurrect with a structural fix — not a hyperparameter tweak."
    - "BEHAVIORAL OUTCOMES OVER METRICS: measured r≈0.08 between PPL and task quality in this codebase. Design experiments that test behavioral claims, not metric claims."
    - "TARGET-GATED KILL (Finding #666): never kill on a proxy metric alone (classification accuracy, routing match rate, PPL, cosine, clustering purity). Every proxy KC must be paired with a target-metric KC (task accuracy, behavioral quality, oracle-gap). KILL requires BOTH to fail; SUPPORTED requires BOTH to pass. Proxy-FAIL + target-PASS = finding about the proxy, not a kill. Proxy-PASS + target-FAIL = tautological proxy, kill on target."
    - "AUTONOMY: NEVER wait for user input. Ralph runs unattended. If a decision is ambiguous, make the most defensible call and log the assumption in MATH.md/PAPER.md/REVIEW-adversarial.md/LEARNINGS.md. Do not ask clarifying questions via event payload. Do not pause. Always pick and proceed."
    - "ANTI-STUCK: REVISE fixes max 30 min. If >3 blocking fixes, apply top 3 and defer rest. Each hat transition <30 min. Never retry a failed API call >3 times."
    - "VERDICT CONSISTENCY: before `experiment complete --status supported`, verify PLAN.md §1 checklist (results.json verdict, all_pass, PAPER.md verdict line, is_smoke flag, KC git-diff, antipattern match). Never silently upgrade a KILLED result to supported."
    - "Platform/model specifics (target hardware, base model, framework, required skills) live in PLAN.md Part 2, not here. Update Part 2 when the target changes; guardrails stay generic."
    - "INVOKE REQUIRED SKILLS BEFORE WRITING CODE. PLAN.md Part 2 lists the skills required for the current target platform (e.g. `/mlx-dev`, `/fast-mlx` for MLX). Skipping them is the single biggest cause of broken code in this repo's history — the audit confirms it. If the skills aren't invoked, the code is not trusted; the reviewer will REVISE."
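    #
    # Reading aid for the TARGET-GATED KILL guardrail above. It only restates that rule
    # as a table and adds no new policy (KC = kill criterion):
    #
    #   proxy KC | target KC | outcome
    #   ---------+-----------+------------------------------------------
    #   PASS     | PASS      | SUPPORTED
    #   FAIL     | FAIL      | KILL
    #   FAIL     | PASS      | finding about the proxy, not a kill
    #   PASS     | FAIL      | tautological proxy, kill on target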

features:
  # Ralph worktree parallelism is counterproductive here:
  # experiment claim + Turso DB already coordinate ownership, and pueue serializes actual runs.
  # Keep one Ralph loop as the orchestrator of record.
  parallel: false
  auto_merge: false

memories:
  inject: auto
  budget: 2000
  file: ".ralph/agent/memories.md"
  filter:
    recent: 60

tasks:
  enabled: true

hats:
  # ── Researcher: design + run experiments ────────────────────────────────
  researcher:
    name: "🔬 Researcher"
    description: "Designs and runs micro experiments per PLAN.md. Writes MATH.md, implements, runs, writes PAPER.md."
    backend_args:
      - "--effort"
      - "medium"
      - "--model"
      - "claude-opus-4-6"
      - "--append-system-prompt-file"
      - "EXPERIMENT.md"
      - "--max-turns"
      - "160"
    triggers:
      - "research.start"
      - "learning.complete"
      - "review.revise"
    publishes:
      - "experiment.done"
    default_publishes: "experiment.done"
    instructions: |
      Read `.ralph/hats/researcher.md` first and follow it for role-specific behavior.
      Read `PLAN.md` for framework principles (Part 1) and current research focus (Part 2).
      Use event payloads as the primary handoff signal; use `.ralph/current_direction.md` only as fallback.

  # ── Reviewer: adversarial checklist, writes verdict ─────────────────────
  reviewer:
    name: "🔴 Reviewer"
    description: "Reviews MATH.md + PAPER.md directly, writes REVIEW-adversarial.md, routes verdict."
    backend_args:
      - "--effort"
      - "medium"
      - "--model"
      - "claude-opus-4-6"
      - "--max-turns"
      - "80"
    triggers:
      - "experiment.done"
    publishes:
      - "review.proceed"
      - "review.revise"
      - "review.killed"
    default_publishes: "review.proceed"
    instructions: |
      Read `.ralph/hats/reviewer.md` first and follow it for role-specific behavior.
      Review from disk directly, not via sub-agents.
      Use the triggering event payload as the primary handoff signal; use `.ralph/current_direction.md` only as fallback.

  # ── Analyst: synthesis, appends antipatterns to memory ──────────────────
  analyst:
    name: "🧠 Analyst"
    description: "Writes LEARNINGS.md with literature context. Quick pass, max 10 min."
    backend_args:
      - "--effort"
      - "medium"
      - "--model"
      - "claude-opus-4-6"
      - "--max-turns"
      - "40"
    triggers:
      - "review.proceed"
      - "review.killed"
    publishes:
      - "learning.complete"
    default_publishes: "learning.complete"
    max_activations: 50
    instructions: |
      Read `.ralph/hats/analyst.md` first and follow it for role-specific behavior.
      Keep this pass short and synthesis-focused.
      Use the triggering event payload as the primary handoff signal; use `.ralph/current_direction.md` only as fallback.