
Trace v5 — Local Treasure Hunt Full Arena Run (Mock Backend)

Date: 2026-02-09
Task: examples/local_treasure_hunt
Mode: SAS + all 4 MAS architectures, mock (--allow-mock)
Model: claude-sonnet-4-5
Instances: 5 (hunt_001 .. hunt_005)
Outcome: COMPLETED — SAS P=1.000, MAS mixed results (mock-mode limitations)


Objective

Execute the full arena skill for local_treasure_hunt: validate evaluator, run SAS baseline, run all MAS architecture variants, recompute derived metrics, and summarise results.


Pre-Run Checks

1. Task Discovery — PATH MISMATCH (recovered)

First attempt: Searched for brainqub3/tasks/treasure_hunt/ — no files found.

Glob: brainqub3/tasks/treasure_hunt/**/*  -> No files found

Root Cause: The task lives under brainqub3/tasks/examples/local_treasure_hunt/, not brainqub3/tasks/treasure_hunt/. The examples/ prefix and local_ prefix were both required.

Recovery: Listed brainqub3/tasks/examples/ and found local_treasure_hunt/.

2. Evaluator Tests — PASSED

> python -m pytest brainqub3/tasks/examples/local_treasure_hunt/tests -q
..........                                                               [100%]

All 10 evaluator tests passed.

3. Python Path Issue

The first pytest invocation failed:

> python -m pytest brainqub3/tasks/treasure_hunt/tests -q
Exit code 1
C:\Users\johna\AppData\Local\Programs\Python\Python38-32\python.exe: No module named pytest

Root Cause: Bare python resolved to Python 3.8 (32-bit), which doesn't have pytest installed. The project requires Python 3.11.

Fix: All subsequent commands used the explicit Python 3.11 path:

C:\Users\johna\AppData\Local\Programs\Python\Python311\python.exe
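A small version guard like the following (a generic sketch, not project code) would surface this interpreter mismatch before pytest even runs:

```python
import sys

def check_interpreter(min_version=(3, 11)):
    """Return True when the running interpreter meets the version floor.

    Generic sketch: the (3, 11) floor mirrors the project requirement
    noted above; it is not read from any project configuration.
    """
    return sys.version_info[:2] >= min_version
```

Calling this at the top of a driver script and exiting with a pointer to the explicit Python311 path fails fast instead of dying inside pytest discovery.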

4. Task Structure — VERIFIED

local_treasure_hunt/
  __init__.py
  evaluator.py
  task.md
  instances.jsonl
  prompts/
    sas.md
    orchestrator.md
    worker.md
  tests/
    __init__.py
    test_evaluator.py
  fixtures/
    inst_001/  (clue_base64.txt, clue_int.txt, clue_word.txt, noise.txt)
    inst_002/  ...
    inst_003/  ...
    inst_004/  ...
    inst_005/  ...

Phase 1: CLI Discovery

Ran --help on the CLI and sub-commands to determine exact flags:

brainqub3 run sas --help
  --task TASK  --model MODEL  --instances INSTANCES  --allow-mock  --require-live

brainqub3 run mas --help
  --task TASK  --arch ARCH  --model MODEL  --n-agents N_AGENTS  --instances INSTANCES  --allow-mock  --require-live

Confirmed available architectures from brainqub3/config/architectures.yaml:

| Architecture  | n_agents | orchestrator_rounds | debate_rounds | peer_rounds | aggregation_policy              |
|---------------|----------|---------------------|---------------|-------------|---------------------------------|
| sas           | 1        | 1                   | 0             | 0           | single                          |
| independent   | 3        | 0                   | 0             | 0           | best_of_n                       |
| centralised   | 3        | 3                   | 0             | 0           | orchestrator_synthesis          |
| decentralised | 3        | 0                   | 2             | 2           | consensus_vote                  |
| hybrid        | 3        | 3                   | 1             | 2           | orchestrator_plus_peer_exchange |

Task name resolution: TaskRegistry.resolve() checks tasks/<name> then examples/<name>, so --task local_treasure_hunt resolves correctly.
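That fallback behaviour can be sketched as follows (a hypothetical reimplementation for illustration; the real TaskRegistry.resolve() may differ in detail):

```python
from pathlib import Path

def resolve_task(name, tasks_root):
    """Resolve a task name, checking tasks/<name> first and then
    tasks/examples/<name>.

    Hypothetical sketch of the fallback described above, not the
    actual TaskRegistry implementation.
    """
    root = Path(tasks_root)
    for candidate in (root / name, root / "examples" / name):
        if candidate.is_dir():
            return candidate
    raise KeyError(f"Unknown task: {name}")
```

With this fallback, --task local_treasure_hunt works even though the directory sits under examples/.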


Phase 2: SAS Baseline

Error 1 — Unknown Model

Command:

python -m brainqub3.cli run sas --task local_treasure_hunt --allow-mock

Error:

Exit code 1
'Unknown model: claude-opus-4-6'

Root Cause: No --model flag was specified. _default_model() reads BRAINQUB3_DEFAULT_MODEL env var, which resolved to claude-opus-4-6 (likely inherited from the Claude Code session environment). This model is not in brainqub3/config/models.yaml, which only contains:

  • claude-sonnet-3-7 (intelligence_index: 42)
  • claude-sonnet-4-0 (intelligence_index: 48)
  • claude-sonnet-4-5 (intelligence_index: 55)

The runner validates the model name against models.yaml at runner.py:247:

raise KeyError(f"Unknown model: {model_name}")

Fix: Added explicit --model claude-sonnet-4-5 to all subsequent commands.
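The validation step mirrors a plain dictionary lookup; a sketch of the behaviour quoted from runner.py:247 (the registry contents below are taken from this trace, the function itself is illustrative):

```python
def resolve_model(model_name, registry):
    """Look up a model entry in a models.yaml-style mapping,
    raising KeyError for unregistered names.

    Illustrative sketch of the validation described above, not the
    runner's actual _resolve_model() code.
    """
    if model_name not in registry:
        raise KeyError(f"Unknown model: {model_name}")
    return registry[model_name]

# Contents of models.yaml as reported earlier in this trace.
MODELS = {
    "claude-sonnet-3-7": {"intelligence_index": 42},
    "claude-sonnet-4-0": {"intelligence_index": 48},
    "claude-sonnet-4-5": {"intelligence_index": 55},
}
```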

Successful SAS Run

Command:

"C:\...\Python311\python.exe" -m brainqub3.cli run sas --task local_treasure_hunt --model claude-sonnet-4-5 --allow-mock

Output:

SAS run complete: 2026-02-09T23-39-59Z__local_treasure_hunt__sas__claude-sonnet-4-5
Backend mode: mock
Backend reason: SDK missing query or ClaudeAgentOptions
Success rate P: 1.000

Analysis: The mock backend's _mock_treasure_hunt() static method correctly reads fixture files, extracts markers, and computes the MD5 flag — producing deterministic correct output for all 5 instances.
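The flag computation can be illustrated roughly as follows. The concatenation order and exact recipe are assumptions for illustration; the real marker extraction lives in _mock_treasure_hunt() and is not shown in this trace:

```python
import base64
import hashlib

def compute_flag(clue_b64, clue_int, clue_word):
    """Illustrative flag recipe: decode the base64 clue, concatenate
    the three markers, and MD5 the result.

    Assumption: the ordering and concatenation scheme here are
    hypothetical; only "extract markers, compute MD5 flag" comes
    from the trace.
    """
    decoded = base64.b64decode(clue_b64).decode("utf-8")
    combined = f"{decoded}{clue_int}{clue_word}"
    return hashlib.md5(combined.encode("utf-8")).hexdigest()
```

Because the fixture files are static, any such recipe is deterministic, which is why the mock scores P=1.000 on every instance it can reach.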


Phase 3: MAS Variants

All 4 MAS architectures were run. Centralised + hybrid ran in parallel, then independent + decentralised in parallel.

Run 1 — Centralised (3-agent)

MAS run complete: 2026-02-09T23-40-09Z__local_treasure_hunt__centralised__claude-sonnet-4-5
Backend mode: mock
Backend reason: SDK missing query or ClaudeAgentOptions
Success rate P: 0.000

Run 2 — Hybrid (3-agent)

MAS run complete: 2026-02-09T23-40-15Z__local_treasure_hunt__hybrid__claude-sonnet-4-5
Backend mode: mock
Backend reason: SDK missing query or ClaudeAgentOptions
Success rate P: 0.000

Run 3 — Independent (3-agent)

MAS run complete: 2026-02-09T23-40-28Z__local_treasure_hunt__independent__claude-sonnet-4-5
Backend mode: mock
Backend reason: SDK missing query or ClaudeAgentOptions
Success rate P: 1.000

Run 4 — Decentralised (3-agent)

MAS run complete: 2026-02-09T23-40-34Z__local_treasure_hunt__decentralised__claude-sonnet-4-5
Backend mode: mock
Backend reason: SDK missing query or ClaudeAgentOptions
Success rate P: 0.000

MAS Failure Analysis

| Architecture  | P     | Why |
|---------------|-------|-----|
| Independent   | 1.000 | Each worker runs independently; _mock_treasure_hunt() fires for each because the worker prompt contains fixture_dir. best_of_n aggregation picks the correct answer. |
| Centralised   | 0.000 | Workers produce correct output, but the orchestrator synthesis prompt contains worker outputs, NOT the original fixture_dir. _mock_treasure_hunt() returns None (no fixture_dir match), and the fallback mock doesn't produce valid treasure-hunt JSON. |
| Hybrid        | 0.000 | Same synthesis-prompt issue as centralised; the orchestrator prompt regex can't find fixture_dir. |
| Decentralised | 0.000 | Peer-exchange and consensus-vote prompts contain prior agent outputs, not raw fixture paths, causing the same mock extraction failure. |

Root Cause: _mock_treasure_hunt() (runner.py:155-213) uses regex to extract fixture_dir from the prompt:

fixture_match = re.search(r"'fixture_dir'\s*:\s*'([^']+)'", prompt)

This only matches the raw instance dict in the initial worker prompt. Synthesis/consensus prompts contain worker outputs (the JSON flags), not the original instance — so the regex returns None and the mock falls through to the generic hello_world handler, producing invalid output.

This is expected mock-mode behaviour. The mock is designed for single-hop determinism, not multi-round orchestration. Live backend runs would not have this limitation.
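The mismatch is easy to reproduce with the regex quoted above (the prompt wordings here are assumed for illustration):

```python
import re

# Regex quoted from _mock_treasure_hunt() in runner.py.
FIXTURE_RE = re.compile(r"'fixture_dir'\s*:\s*'([^']+)'")

# Initial worker prompt: the raw instance dict is present (wording assumed).
worker_prompt = "Instance: {'id': 'hunt_001', 'fixture_dir': 'fixtures/inst_001'}"

# Synthesis prompt: only worker outputs appear (wording assumed).
synthesis_prompt = 'Worker 1 answered: {"flag": "0f343b0931126a20"}'

assert FIXTURE_RE.search(worker_prompt).group(1) == "fixtures/inst_001"
assert FIXTURE_RE.search(synthesis_prompt) is None  # falls through to fallback mock
```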


Phase 4: Metrics Recomputation

All 5 runs were recomputed:

SAS:           P=1.000, T=3,  O%=0.00
Centralised:   P=0.000, T=12, O%=300.00
Hybrid:        P=0.000, T=30, O%=900.00
Independent:   P=1.000, T=9,  O%=200.00
Decentralised: P=0.000, T=27, O%=800.00
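The O% column follows directly from turn counts relative to the SAS baseline. A sketch of the recomputation, assuming O% = (T - T_sas) / T_sas * 100, which matches all four MAS rows above:

```python
def overhead_pct(turns, sas_turns=3):
    """Turn overhead relative to the SAS baseline, in percent.

    The formula O% = (T - T_sas) / T_sas * 100 is inferred from the
    recomputed numbers in this trace, not read from the metrics code.
    """
    return (turns - sas_turns) / sas_turns * 100

# Turn counts from the recomputation above.
RUNS = {"sas": 3, "centralised": 12, "hybrid": 30,
        "independent": 9, "decentralised": 27}
```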

Results Summary

| Architecture  | Run ID                                                                   | P     | T  | O%   | Backend |
|---------------|--------------------------------------------------------------------------|-------|----|------|---------|
| SAS           | 2026-02-09T23-39-59Z__local_treasure_hunt__sas__claude-sonnet-4-5          | 1.000 | 3  | 0%   | mock    |
| Independent   | 2026-02-09T23-40-28Z__local_treasure_hunt__independent__claude-sonnet-4-5  | 1.000 | 9  | 200% | mock    |
| Centralised   | 2026-02-09T23-40-09Z__local_treasure_hunt__centralised__claude-sonnet-4-5  | 0.000 | 12 | 300% | mock    |
| Decentralised | 2026-02-09T23-40-34Z__local_treasure_hunt__decentralised__claude-sonnet-4-5 | 0.000 | 27 | 800% | mock    |
| Hybrid        | 2026-02-09T23-40-15Z__local_treasure_hunt__hybrid__claude-sonnet-4-5       | 0.000 | 30 | 900% | mock    |

Key Observations

  1. SAS is the optimal mock-mode architecture — perfect accuracy, minimal turns, zero overhead.
  2. Independent matches SAS accuracy but at 3x the cost (each of 3 workers solves independently).
  3. Orchestrated architectures all fail under mock — the mock backend cannot handle multi-round synthesis prompts where fixture_dir is absent from the prompt text.
  4. Overhead scales with coordination complexity: hybrid (900%) > decentralised (800%) > centralised (300%) > independent (200%) > SAS (0%).
  5. Turn counts: the observed totals fit T = n_agents * 3 * (1 + peer_rounds) + orchestrator_rounds (hybrid: 3*3*3 + 3 = 30; decentralised: 3*3*3 = 27; centralised: 3*3 + 3 = 12; independent: 3*3 = 9; SAS: 3, with no separate orchestrator turn). This is an empirical fit to the five runs, not a formula read from the runner.

Errors Encountered (Chronological)

| # | Error                                              | Source                             | Root Cause                                                                                        | Resolution                                    |
|---|----------------------------------------------------|------------------------------------|---------------------------------------------------------------------------------------------------|-----------------------------------------------|
| 1 | No files found (treasure_hunt glob)                | Glob tool                          | Task path is examples/local_treasure_hunt, not treasure_hunt                                      | Listed tasks/examples/ to find correct name   |
| 2 | No module named pytest                             | Bare python command                | Default python resolves to Python 3.8 (32-bit), not 3.11                                          | Used explicit Python311\python.exe path       |
| 3 | Unknown model: claude-opus-4-6                     | runner.py:247 (_resolve_model)     | BRAINQUB3_DEFAULT_MODEL env var set to an unregistered model                                      | Added --model claude-sonnet-4-5 explicitly    |
| 4 | MAS P=0.000 (centralised, hybrid, decentralised)   | runner.py:161 (_mock_treasure_hunt) | Synthesis/consensus prompts don't contain fixture_dir; regex returns None; fallback mock produces wrong output | Expected mock limitation, not a bug           |

Notes

  • All runs used the mock backend (SDK missing query or ClaudeAgentOptions). The claude-agent-sdk package is not installed and no ANTHROPIC_API_KEY is configured.
  • To get meaningful MAS architecture comparison, re-run with --require-live after installing the SDK and setting the API key.
  • The arena skill was invoked but the task name needed to be local_treasure_hunt (the registry resolves it under examples/).
  • models.yaml should be updated to include claude-opus-4-6 if future runs target that model.