These notes are empirical. Treat them as run observations, not engine guarantees.
For code-grounded defaults and file paths, use:
- references/workflow.md
- references/debugging.md
- references/experiment-protocol.md
- references/evaluation-rubric.md
Patterns that consistently showed up in experiments:
- source material quality dominated downstream quality;
- model quality mattered much more for report depth than for raw runtime speed;
- round count alone was a poor proxy for actual reasoning depth;
- artifact inspection was more reliable than guessing from the final report text.
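On that last point, a quick script is usually enough to check what a run actually did. This is a minimal sketch, not part of the pipeline: the `runs/latest` directory, the `round_*.json` naming, and the `actions` field are all assumptions; substitute the artifact paths documented in references/workflow.md.

```python
import json
from pathlib import Path


def summarize_run(run_dir: str) -> None:
    """Print per-round action counts from a run's artifact directory."""
    round_files = sorted(Path(run_dir).glob("round_*.json"))
    total = 0
    for path in round_files:
        with path.open() as f:
            data = json.load(f)
        # Assumed field: each round file carries an "actions" list.
        actions = data.get("actions", [])
        total += len(actions)
        print(f"{path.name}: {len(actions)} actions")
    print(f"{len(round_files)} rounds, {total} actions total")


if __name__ == "__main__":
    summarize_run("runs/latest")  # hypothetical run directory
```

A run that reports many rounds but only a handful of actions per round is the "completed but shallow" pattern described below.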
- Model: Gemini Flash, free tier
  - Rounds: 20
  - Actions: about 116
  - Result: shallow report; generation was slow because of free-tier limits
  - Takeaway: free or very small models can finish the pipeline, but the report usually stays superficial

- Model: Gemini Flash, free tier
  - Rounds: 30
  - Actions: about 187
  - Result: the report completed, but depth stayed limited
  - Takeaway: extra rounds can add more data, but they do not compensate for a weak model

- Model: Claude Opus 4 via API
  - Rounds: 10
  - Actions: about 134
  - Estimated cost: about $9-10 per run
  - Result: much stronger reasoning and more convincing report sections
  - Takeaway: better models improve both persona richness and report quality
  - Risk: on an 8 GB machine, 10 rounds was the practical ceiling before memory pressure became a problem

- Model: GPT-5-class model through an OpenAI-compatible proxy
  - Rounds: 10
  - Actions: only 8 initial actions across both platforms
  - Result: the runtime finished almost immediately; report generation succeeded after proxy fixes
  - Takeaway: a completed run does not necessarily mean meaningful per-round reasoning happened
- Assistant content type mismatch: some OpenAI-compatible layers expect assistant output as `output_text`. A mismatch can break runs even when the model itself is fine.
- Environment variable override confusion: root `.env` loading with override semantics can defeat values injected by a parent Flask process or wrapper script (see the sketch after this list).
- Backend interpreter mismatch: running outside the managed virtual environment caused dependency and startup problems.
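The override pitfall is easy to reproduce. The sketch below assumes the root `.env` is loaded with python-dotenv; the variable name and proxy URL are hypothetical, and the point is only the difference the `override` flag makes.

```python
import os
from dotenv import load_dotenv

# Value injected by a parent Flask process or wrapper script before startup.
os.environ["OPENAI_BASE_URL"] = "http://localhost:4000/v1"  # hypothetical name and URL

# Default behavior: existing environment variables win over the root .env file.
load_dotenv(".env", override=False)

# Override semantics: whatever the root .env contains silently replaces the
# injected value, which is the confusion described above.
# load_dotenv(".env", override=True)

print(os.environ.get("OPENAI_BASE_URL"))
```

If a setting keeps reverting no matter what the wrapper injects, check for an `override=True` load before blaming the proxy or the model.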
| Model tier | Approx. cost per run | Quality | Speed |
|---|---|---|---|
| Free or small | $0 | Low | Fast |
| Mid-range | $2-5 | Medium | Medium |
| Large frontier model | $8-12 | High | Slow |
| Subscription-backed proxy | Varies | Medium to high | Medium |
When you add new experiments, record:
- model name and route type
- round count
- action count
- whether runtime behavior looked substantive
- whether report generation succeeded
- cost estimate
- simulation score from references/evaluation-rubric.md
- report score from references/evaluation-rubric.md
- the artifact paths you actually checked
- the single most useful takeaway
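If it helps to keep those records uniform, here is one possible template. This is only a sketch, not part of the pipeline: the class, field names, and `experiments.jsonl` path are illustrative and simply mirror the checklist above.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ExperimentRecord:
    # Field names mirror the checklist above; rename as needed.
    model_name: str
    route_type: str                  # e.g. "direct API" or "OpenAI-compatible proxy"
    rounds: int
    actions: int
    runtime_looked_substantive: bool
    report_generated: bool
    cost_estimate: str               # keep free-form, e.g. "about $9-10 per run"
    simulation_score: int            # from references/evaluation-rubric.md
    report_score: int                # from references/evaluation-rubric.md
    artifact_paths_checked: list = field(default_factory=list)
    takeaway: str = ""


record = ExperimentRecord(
    model_name="Claude Opus 4",
    route_type="direct API",
    rounds=10,
    actions=134,
    runtime_looked_substantive=True,
    report_generated=True,
    cost_estimate="about $9-10 per run",
    simulation_score=0,              # fill in from the rubric
    report_score=0,                  # fill in from the rubric
    artifact_paths_checked=["runs/latest/round_001.json"],  # hypothetical path
    takeaway="better models improve both persona richness and report quality",
)

# Append one JSON line per experiment to a hypothetical log file.
with open("experiments.jsonl", "a") as f:
    f.write(json.dumps(asdict(record)) + "\n")
```

One record per run keeps the comparison table above easy to regenerate later.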