Add Harbor Framework Support #8
RUN uv pip install --system --no-cache \
    accelerate \
    boto3 \
    bitsandbytes \
    datasets \
    evaluate \
    lm-eval \
    openai \
    pandas \
    scikit-learn \
    shortuuid \
    tokenizers \
    transformers \
    trl \
    peft \
    tiktoken \
    inspect-ai \
    matplotlib \
    certifi

# Note: flash_attn requires GPU to compile - install at runtime if needed:
Pin the package versions, as the current images do.

Things remaining to reach full parity with the original PTB implementation:

After discussing with Alex from Harbor/tbench:

Added Modal storage for the hf-cache for Harbor in the branch. There are some other changes as well, so you can probably clone the repo with this branch into another directory and ask your agent:

Also, there is an upcoming change to the judge which will need to be integrated. Will post here.

We need to hardcode the baseline values in a JSON file instead of fetching them. This is needed for the Harbor integration (Harbor should output the baseline value in case the judge flags the run).

Merged main into the Harbor branch. @rank-and-file, maybe we should push the new judge to main soon so we can pull it here. The new judge would require some major changes for Harbor.

Apologies for lurking, but I noticed this comment:
In case it's useful, we have flash-attn kernels available on the Hub (link) which are matched to the hardware at runtime and skip the annoying / long / brittle install of

Hey Lewis, thanks for your comment! I didn't know about this, it would be very useful for us, especially when running on cloud providers and having full parity with the local version. Adding it to our todo for Harbor :)
…h Healthcheck and ENTRYPOINT, add log streaming and system monitor
Update
Pushed a substantial round of changes to the Harbor adapter since the last review. The end-to-end run now works at parity with our local pipeline for everything except verifier sandbox isolation, which is being addressed natively upstream (harbor-framework/harbor#1607).
Main changes
build
log streaming
timer (now reliable)
Old design was
New design (native Harbor healthcheck):
tamper resistance:
Outcome of analysis on the 2026-05-11 F-run capability collapse +
the day's meeting decisions. Single commit; image-bump to :22
follows once the build lands.
Suite
- Drop bfcl (tool-call vllm not configured). Moved task to
`src/evals/tasks/_disabled/`; registry validator ignores `_*` dirs.
- EvalInfo gains `default_limit`. submit_baseline `--limit` default
changes 100 -> 0 (= use per-eval defaults). aime2025=30 (full),
mmlu/arc_easy/truthfulqa/rozado=200, big_five=40 / MFQ=32 /
spiralbench=30 / syco_slava=30 / moru=50 (full sets where small);
generative-graded = 100. run-id token shows `perEval` when default.
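
The per-eval default behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: the `EvalInfo` shape beyond `default_limit`, the `EVALS` table, and the `resolve_limit` and `run_id_token` helpers are hypothetical names. Only the numeric defaults and the `perEval` token come from the commit message; what the token shows for an explicit limit is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalInfo:
    """Hypothetical registry entry; only `default_limit` is named in the commit."""
    name: str
    default_limit: int


# Subset of the per-eval defaults quoted in the commit message.
EVALS = {
    "aime2025": EvalInfo("aime2025", 30),
    "mmlu": EvalInfo("mmlu", 200),
    "arc_easy": EvalInfo("arc_easy", 200),
    "big_five": EvalInfo("big_five", 40),
    "moru": EvalInfo("moru", 50),
}


def resolve_limit(eval_name: str, cli_limit: int = 0) -> int:
    """Return the sample limit for one eval; a CLI value of 0 means 'use the per-eval default'."""
    if cli_limit == 0:
        return EVALS[eval_name].default_limit
    return cli_limit


def run_id_token(cli_limit: int = 0) -> str:
    """Run-id fragment: 'perEval' for defaults; echoing the explicit limit is an assumption."""
    return "perEval" if cli_limit == 0 else str(cli_limit)
```

With this shape, passing `--limit 100` still forces a uniform limit, while omitting it lets small sets such as aime2025 run in full.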
Grader unification (anthropic/claude-haiku-4-5)
- `INSPECT_GRADER_MODEL` injected via pod_env in submit_run + submit_baseline.
Routes inspect_evals model_graded_qa scorers (coconot, strong_reject,
sycophancy_sharma) off the prior gpt-4o-ish defaults.
- moru passes explicit `task_args={"grader_models": "anthropic/claude-haiku-4-5"}`.
Eliminates self-grading by served vllm.
- healthbench + arenahardwriting still gpt-5-mini (separate refactor;
see design TODO aisa-group#8).
Pipeline ergonomics
- `/opt/pipeline-bin/time-remaining` (root-owned, NOPASSWD-sudoable).
Reads `/etc/ptb_run/deadline` (chmod 600). Workspace timer.sh prefers
it; local start-file approximation now a fallback only.
- `/opt/pipeline-bin/score_capability_runner.sh` + workspace
`score_capability.sh`. Reads `/etc/ptb_run/bench_capability`
(pipeline writes arc_easy for condition F by default; override via
PTB_CAPABILITY_PROBE env). Cheap capability spot-check for the agent.
- Dockerfile.base copies both new binaries, multi-line sudoers, diag.py
exercises round-trip (deadline file, agent-cannot-read, sudo invocation).
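
A rough sketch of the deadline lookup with its fallback, for illustration only. The commit does not state the format of `/etc/ptb_run/deadline`, so this assumes the file holds a single Unix timestamp in seconds; the function names, the start-file format, and the `budget_seconds` parameter are likewise hypothetical.

```python
import time
from pathlib import Path
from typing import Optional


def seconds_remaining(deadline_file: Path, now: Optional[float] = None) -> int:
    """Seconds left until the deadline, clamped at zero.

    Assumes the file contains one Unix timestamp (seconds).
    """
    deadline = float(deadline_file.read_text().strip())
    current = time.time() if now is None else now
    return max(0, int(deadline - current))


def remaining_with_fallback(
    deadline_file: Path,
    start_file: Path,
    budget_seconds: float,
    now: Optional[float] = None,
) -> int:
    """Prefer the authoritative deadline; approximate from a start file otherwise."""
    try:
        return seconds_remaining(deadline_file, now)
    except OSError:
        # Missing or unreadable deadline file (e.g. permission denied):
        # fall back to start time plus the assumed time budget.
        start = float(start_file.read_text().strip())
        current = time.time() if now is None else now
        return max(0, int(start + budget_seconds - current))
```

Reading the deadline from a root-owned file keeps the authoritative value out of the agent's reach, while the start-file path only approximates it.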
lora_starter rewrite
- New `format_qwen3_chat(messages, tokenizer, enable_thinking=True)`
helper applies the `<think>\n\n</think>\n\n` envelope automatically.
`to_text` now handles `messages` row shape.
- Conservative defaults: r=8, alpha=16, lr=5e-5, epochs=2,
target_modules=q_proj+v_proj only. Full llama-style set opt-in via
`--lora-target-modules`.
- Docstring sections: "LoRA aggressiveness" + "MCQ-format examples" +
"Chat-template format" explain the 2026-05-11 capability-collapse
failure mode + how to avoid it.
Agent prompt
- instruction.md mentions localhost vLLM (so agents don't try to start
their own), explicit guidance on `score.sh` / `score_capability.sh` /
`timer.sh`, calls out lora_starter docstring sections. Run staging
copies `score_capability.sh` alongside `score.sh` into the workspace.
- condition_prompts._F_BODY adds capability-probe guidance + names the
free-text-only-training MCQ collapse failure mode for next agent.
big_five lenient scorer
- Local `personality_BFI_lenient` @task with lenient parser that
accepts `ANSWER: X`, `X)`, `X.`, `X:`, or `X` on its own line. Upstream
scorer required literal `ANSWER:` prefix and dropped any trait whose
samples all missed it. Recovered Agreeableness from the 2026-05-11
F-run adapter (9/9 valid letter answers but unparsed under strict).
- `scripts/rescore_big_five.py` re-aggregates existing inspect logs +
optionally writes lenient metrics back into the promoted baseline
JSON. Both base + adapter big_five JSONs rescored. Real findings:
Agreeableness 0.844 -> 0.689 (-0.156)
Conscientiousness 0.822 -> 0.600 (-0.222)
Extraversion 0.825 -> 0.629 (-0.196)
Neuroticism 0.525 -> 0.625 (+0.100, not +0.375 strict artifact)
Openness 0.780 -> 0.700 (-0.080)
Adapter shows broad trait suppression, not the strict-scorer cartoon.
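
A lenient parser of the kind described above might look like the following sketch. It is not the `personality_BFI_lenient` implementation: the `parse_letter` name, the A-E option range, and the exact regular expressions are assumptions chosen to cover the accepted forms listed in the commit (`ANSWER: X`, `X)`, `X.`, `X:`, or `X` on its own line).

```python
import re
from typing import Optional

# Valid option letters are an assumption; the commit does not list them.
_LETTERS = "A-E"

# Strict form first: "ANSWER: X" (case-insensitive, optional opening paren).
_ANSWER_PREFIX = re.compile(rf"ANSWER\s*:\s*\(?([{_LETTERS}])\b", re.IGNORECASE)

# Lenient forms: a line starting with "X)", "X.", or "X:", or a line that is only "X".
_LINE_FORMS = re.compile(
    rf"^[ \t]*([{_LETTERS}])(?:[).:](?=\s|$)|[ \t]*$)", re.MULTILINE
)


def parse_letter(completion: str) -> Optional[str]:
    """Return the chosen option letter, or None if no accepted form is found."""
    match = _ANSWER_PREFIX.search(completion)
    if match:
        return match.group(1).upper()
    match = _LINE_FORMS.search(completion)
    return match.group(1) if match else None
```

The lookahead after the punctuation keeps abbreviations such as "E.g." from being read as an answer, and ordinary sentences that merely start with a capital A to E are rejected.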
compute_deltas + viewer
- `scripts/compute_deltas.py` adds `--update-summary`: backfills
pre/post/delta (+ per-bench slots) into the run's summary.json so the
trace viewer + downstream tooling pick up scores via the existing
paths. Tagged `delta_method: "baseline-backfill"`.
- `registry.get_headline` tries literal-key match before dotted-path
walk (fixes strong_reject_scorer.jailbreak_rate which is stored as
a flat key with a literal dot).
- Trace viewer index falls back: legacy unsuffixed metrics_* ->
summary.{pre,post,delta} -> per-bench metrics_post_<bench>.json ->
deltas.json[primary_bench]. `--skip-pre-eval` runs now render scores.
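
The literal-key-before-dotted-path lookup can be illustrated with a short sketch. The real signature of `registry.get_headline` is not shown in this thread, so the function name `get_headline_sketch` and its arguments are hypothetical; only the lookup order and the `strong_reject_scorer.jailbreak_rate` example come from the commit message.

```python
from collections.abc import Mapping
from typing import Any


def get_headline_sketch(metrics: Mapping, key: str) -> Any:
    """Look up a headline metric, trying the literal key before a dotted-path walk."""
    # 1. Literal match: handles flat keys that themselves contain a dot.
    if key in metrics:
        return metrics[key]
    # 2. Dotted-path walk through nested mappings.
    node: Any = metrics
    for part in key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return None
        node = node[part]
    return node
```

Trying the literal key first means a flat key with a dot in it resolves directly, while nested layouts are still found by the walk.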
Other
- `pull_eval_logs.py` environment_dir fixed (`src/eval` -> `src/evals`).
- scripts/constants.py HARDCODED_BENCHMARKS comments bfcl out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…own fork, add codex reprompt agent for harbor
Adds Harbor framework support to PostTrainBench, enabling anyone to run our benchmark on cloud GPUs (Modal, Daytona) without needing access to our internal HTCondor cluster.
At the moment:
Tested:
Usage
See `src/harbor_adapter/README.md` for detailed parity tracking. Key points:
- `result.json`
- `timer.sh`: Minor difference (created at task generation vs job start)
Note: Right now I have skipped the installation of flash-attn in the container, because it needs a CUDA runtime. In Modal the GPU is attached to the sandbox after the container is built, so the installation cannot happen at build time.
Note: I have added a uv environment for us to use in PTB. It is needed for running Modal and Harbor, and is useful for reproducibility in general.
Todos: