This is a benchmark project. Every claim must be backed by evidence. Every run must be documented honestly.
- Name what you tested. If floop-the-binary didn't run, don't call it a "floop arm." Call it what it is: prompt injection, hardcoded behaviors, manual heuristics — whatever is accurate.
- Separate the tool from the technique. "3 hand-written sentences improved resolve rate" is a different finding than "floop improved resolve rate." Both are valuable. Don't conflate them.
- Report bad results. A run where floop hurts performance is just as publishable as one where it helps. The goal is truth, not marketing.
- Document the full pipeline. For each run: what binary ran, what version, where behaviors came from, how they were delivered to the agent, and what the agent actually saw (a run-metadata sketch follows this list).
- Statistical honesty. Report confidence intervals and p-values; don't cherry-pick metrics. A +15pp result with p=0.45 is not significant, so say so (an analysis sketch follows this list).
- Version everything. floop version, model version, mini-SWE-agent version, config files — all go in the RUNBOOK.
- Reproducibility. Another person should be able to re-run any experiment from the RUNBOOK + committed configs.
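The pipeline-documentation and versioning points are easiest to keep honest if every run emits a small metadata record that gets pasted into the RUNBOOK. The sketch below is illustrative only: `RunRecord`, `runbook_snippet`, and the field names are not part of the existing harness and would need to match whatever `harness/db.py` actually stores.

```python
# Illustrative only: a per-run metadata record to paste into docs/RUNBOOK.md.
# None of these names exist in the harness; adapt to what harness/db.py records.
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class RunRecord:
    arm: str                   # e.g. "floop" vs. "prompt-injection baseline"
    floop_binary: str | None   # exact binary path + version, or None if floop never ran
    behaviors_source: str      # where the behaviors came from (generated, hand-written, ...)
    delivery: str              # how behaviors were delivered to the agent (system prompt, file, ...)
    model: str                 # model name and version string
    mswea_version: str         # mini-SWE-agent version
    config_path: str           # committed config file used for the run
    started_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def runbook_snippet(record: RunRecord) -> str:
    """Render the record as a JSON block ready to paste into the RUNBOOK."""
    return json.dumps(asdict(record), indent=2)
```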
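For the statistics, here is a minimal sketch of the kind of analysis `analysis/analyze.py` is described as doing: a paired bootstrap CI on the resolve-rate difference and McNemar's exact test on discordant pairs. The function names and the input format (paired per-task booleans for two arms) are assumptions, not the script's actual interface.

```python
# Sketch, not analysis/analyze.py itself. Inputs: paired per-task outcomes
# (resolved / not resolved) for two arms over the same task list.
import random
from scipy.stats import binomtest

def bootstrap_diff_ci(a: list[bool], b: list[bool], n_boot: int = 10_000, alpha: float = 0.05):
    """95% bootstrap CI for mean(a) - mean(b), resampling task indices (paired)."""
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] for i in idx) / n - sum(b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot) - 1]

def mcnemar_exact_p(a: list[bool], b: list[bool]) -> float:
    """Exact McNemar p-value: two-sided binomial test on the discordant pairs."""
    a_only = sum(1 for x, y in zip(a, b) if x and not y)   # arm A solved, arm B didn't
    b_only = sum(1 for x, y in zip(a, b) if y and not x)   # arm B solved, arm A didn't
    n_disc = a_only + b_only
    return 1.0 if n_disc == 0 else binomtest(a_only, n_disc, 0.5).pvalue
```

Report the CI and p-value alongside the raw counts; a +15pp point estimate whose CI spans zero gets written up as not significant, not buried.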
Repo layout:
- `config/`: YAML configs for each arm; `splits.json` for task splits
- `scripts/run_mswea.py`: CLI wrapper bridging mini-SWE-agent to the floop-bench pipeline
- `harness/`: DB layer (`db.py`) and SWE-bench eval integration (`swebench_eval.py`)
- `analysis/analyze.py`: statistical analysis (bootstrap CIs, McNemar's test)
- `docs/RUNBOOK.md`: experiment log with per-run results and findings
- `results/`: predictions, trajectories, eval output (mostly gitignored)
mini-SWE-agent / SWE-bench pitfalls:
- mini-SWE-agent uses the `swebench_xml.yaml` base config (XML action parsing, NOT tool calls).
- The `-c` flag replaces config sections rather than merging them, so a floop config must include the full `system_template` (a sanity-check sketch follows this list).
- Use `model_class: litellm` (not `openrouter`) for Gemini models.
- Gemini's TPM rate limit is 1M input tokens/min; use `--delay 60` between tasks.
- SWE-bench eval's per-instance `report.json` is more reliable than the top-level summary (see the collector sketch below).
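Because `-c` replaces whole sections instead of merging, a cheap pre-flight guard can catch an arm config that silently dropped its `system_template`. The sketch below is an assumption-light check: `check_arm_config` is a hypothetical helper, and it searches the loaded YAML recursively rather than assuming where mini-SWE-agent nests the key.

```python
# Pre-flight guard: fail fast if an arm config has no system_template anywhere,
# since -c replaces config sections and will not merge in the base template.
import sys
import yaml

def has_key(node, key: str) -> bool:
    if isinstance(node, dict):
        return key in node or any(has_key(v, key) for v in node.values())
    if isinstance(node, list):
        return any(has_key(v, key) for v in node)
    return False

def check_arm_config(path: str) -> None:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    if not has_key(cfg, "system_template"):
        sys.exit(f"{path}: no system_template found; -c replaces sections, it does not merge")

if __name__ == "__main__":
    check_arm_config(sys.argv[1])
```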
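For scoring, a sketch of collecting resolve status from per-instance `report.json` files rather than trusting the top-level summary. The directory layout under `results/eval`, the glob pattern, and the assumed schema (`instance_id -> {"resolved": bool, ...}`) are assumptions to adjust against the actual eval output.

```python
# Sketch: aggregate per-instance report.json files instead of the top-level summary.
# Assumes each report maps instance_id -> {"resolved": bool, ...}; adjust as needed.
import json
from pathlib import Path

def collect_resolved(eval_dir: Path) -> dict[str, bool]:
    resolved: dict[str, bool] = {}
    for report_path in sorted(eval_dir.glob("**/report.json")):
        report = json.loads(report_path.read_text())
        for instance_id, result in report.items():
            resolved[instance_id] = bool(result.get("resolved", False))
    return resolved

if __name__ == "__main__":
    by_instance = collect_resolved(Path("results/eval"))  # hypothetical path
    print(f"{sum(by_instance.values())}/{len(by_instance)} resolved")
```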