Add Harbor Framework Support by hrdkbhatnagar · Pull Request #8 · aisa-group/PostTrainBench

hrdkbhatnagar · 2026-01-13T17:46:47Z

Adds Harbor framework support to PostTrainBench, enabling anyone to run our benchmark on cloud GPUs (Modal, Daytona) without needing access to our internal HTCondor cluster.

At the moment:

Generate Harbor-compatible task directories from PostTrainBench benchmarks
Almost full parity with original pipeline

Tested:

Generated task for gsm8k + qwen3-1.7b
Ran 1-hour test with Claude Code on Sonnet 4 on Modal
Verified end-to-end pipeline (including eval + contam judge)
Confirmed accuracy metrics extracted correctly

Usage

  cd src/harbor_adapter
  uv sync

  # Generate a task
  python run_adapter.py --benchmark gsm8k --model qwen3-1.7b --output ./tasks

  # Run with Harbor
  harbor run \
      --path ./tasks/posttrainbench-gsm8k-qwen3-1.7b \
      --agent claude-code \
      --model anthropic/claude-sonnet-4 \
      --env modal

See src/harbor_adapter/README.md for detailed parity tracking. Key points:

Agent timeout, GPU access, evaluation: Full parity
Contamination judge: Parity
Agent duration: Tracked by Harbor in result.json
timer.sh: Minor difference (created at task generation vs job start)

Note: Right now I have skipped the installation of flash-attn in the container as we need to have a CUDA runtime for it. In modal the GPU is attached to the sandbox after the container is built, so installation doesn't occur.

Note: I have added a uv environment for us to use in PTB. This is used for using modal and harbor, and is useful in general for reproducibility

Todos:

directly before agent is run, install flash_attn and build timer.sh
huggingface cache in a modal storage
before running evaluation, uninstall and reinstall major dependencies (like transformers, inspect-ai, ...). Make sure NOT to use the cache. ((Alternatively we can look into docker in docker))

hrdkbhatnagar · 2026-02-27T12:07:28Z

+RUN uv pip install --system --no-cache \
+    accelerate \
+    boto3 \
+    bitsandbytes \
+    datasets \
+    evaluate \
+    lm-eval \
+    openai \
+    pandas \
+    scikit-learn \
+    shortuuid \
+    tokenizers \
+    transformers \
+    trl \
+    peft \
+    tiktoken \
+    inspect-ai \
+    matplotlib \
+    certifi
+
+# Note: flash_attn requires GPU to compile - install at runtime if needed:


pin versions like the current images

hrdkbhatnagar · 2026-03-04T10:51:58Z

things that are remaining to get full parity with the original PTB implementation:

Separate verifier container (eval integrity)
In our setup, we evaluate the agent's post-trained model in a different container than the one the agent trained in. This prevents reward hacking, the agent could modify eval files in its workspace to inflate its score. In Harbor, the verifier runs in the same sandbox as the agent, so it uses whatever files the agent may have tampered with. We need a way to run the verifier in an isolated environment.
Pre-agent shell command inside the container
We have a timer script that the agent calls to check remaining time (out of 10 hours). It needs to know when the agent actually started. In our original setup, the host orchestrator writes the start timestamp before launching the agent container. In Harbor, I see lifecycle hooks on the Trial object (TrialEvent.AGENT_START etc.), but those run on the orchestrator side, not inside the sandbox. we need a way to execute a shell command inside the container right before the agent starts , like a pre-agent hook.
Downloading additional directories after a run
After the agent finishes, we'd like to download its full workspace (/home/agent/workspace/), including the code it wrote and the fine-tuned model weights. Currently Harbor only downloads /logs/agent and /logs/verifier.

hrdkbhatnagar · 2026-03-04T10:53:42Z

after discussing with Alex from Harbor/tbench:

we could put the verifier in the tests/ directory which only gets uploaded after the agent runs
this is not yet supported be we should look into this, potentially make a PR to harbor
Artifact collection is supported now in harbor, so we should use that

rank-and-file · 2026-03-30T16:07:11Z

Added modal storage for hf-cache for harbor in the branch add_harbor_support_with_hfcache. This is necessary to have the pre-cached models and datasets. This is necessary for full parity.

Although there are some other changes as well, so you can probably clone the repo with this branch in another directory and ask your agent:

Consider PostTrainBench_harbor_modal and port all changes which are related to the modal storage integration for the huggingface cache to PostTrainBench_harbor.

rank-and-file · 2026-03-30T16:14:08Z

Also there is an upcoming change to the judge which will need to be integrated. Will post here.

rank-and-file · 2026-04-01T07:37:39Z

We need to hardcode baseline values to a json, instead of fetching them from the POST_TRAIN_BENCH_RESULTS_DIR.

This is needed for harbor integration (harbor should output the baseline value, in case the judge flags the run).

hrdkbhatnagar · 2026-04-02T15:04:02Z

Merged main into Harbor branch, @rank-and-file maybe we should push the new judge to main soon so we can pull it here. The new judge would require some major changes for Harbor

lewtun · 2026-04-24T12:54:42Z

Apologies for lurking, but I noticed this comment:

Right now I have skipped the installation of flash-attn in the container as we need to have a CUDA runtime for it

In case it's useful, we have flash-attn kernels available on the Hub (link) which are matched to the hardware at runtime and skip the annoying / long / brittle install of flash-attn itself. If you want to run PostTrainBench on cloud GPUs, this could be the fastest way to support such kernels across e.g. different CUDA versions

hrdkbhatnagar · 2026-04-24T13:02:56Z

Apologies for lurking, but I noticed this comment:

Right now I have skipped the installation of flash-attn in the container as we need to have a CUDA runtime for it

In case it's useful, we have flash-attn kernels available on the Hub (link) which are matched to the hardware at runtime and skip the annoying / long / brittle install of flash-attn itself. If you want to run PostTrainBench on cloud GPUs, this could be the fastest way to support such kernels across e.g. different CUDA versions

Hey Lewis, thanks for your comment! I didn't know about this, it would be very useful for us, especially when running on cloud providers and having full parity with the local version. Adding it to our todo for Harbor :)

…h Healthcheck and ENTRYPOINT, add log streaming and system monitor

hrdkbhatnagar · 2026-05-07T19:34:37Z

update

Pushed a substantial round of changes to the harbor adapter since last review. end to end run is now working at parity with our local pipeline for everything except verifier sandbox isolation, which is being addressed natively upstream (harbor-framework/harbor#1607).

main changes

Build mirrors the image we used for the leaderboard
Live log streaming (agent + verifier) via PID-1 entrypoint
Timer parity (healthcheck-driven, absolute path)
system_monitor.sh ported (GPU/CPU/mem/disk every 60s)
/tests/ placement closes file-tampering attack vector
final_model/ + agent workspace/ downloaded as artifacts after each trial
Single sandbox model leaves package tampering open; [verifier_environment] support upstream (Feature: isolated verifier sandbox ([verifier_environment]) harbor-framework/harbor#1607) lets us drop in a fresh sandbox for verification — fork branch validated, will integrate once landed

build

template/environment/Dockerfile is now a faithful port of the production apptainer image with two intentional changes:

--torch-backend=cu128 instead of auto — Modal's build VM has no nvidia-smi, so auto resolves to CPU torch and breaks vllm's CUDA-only xformers requirement.
I decided not to go with the HF kernels as suggested by Lewis since that would deviate from our original setting, as the kernels lib does not re-export under the flash_attn module name AFAICT. Agents either route through kernels directly or use the transformers attn-implementation string. But this is very useful for the future versions of PTB we build!

log streaming

template/environment/entrypoint.sh runs as PID 1 and:

backgrounds tail -F -q /logs/agent/*.txt /logs/verifier/*.txt so agent and verifier output stream live to the Modal dashboard via the sandbox's main process stdout, mirroring our local setup.
backgrounds system_monitor.sh (port of src/utils/system_monitor.sh), which writes GPU utilization / memory / disk to /logs/agent/system_monitor.log every 60s.

timer (now reliable)

Old design was START_FILE="$(dirname "$0")/.timer_start" — dirname resolves relative to invocation cwd, so each new directory the agent cd'd into got a fresh sentinel and the timer reset.

New design (native Harbor healthcheck):

[environment.healthcheck] in task.toml writes /timer_start once before agent setup begins:

[environment.healthcheck]
command = "pgrep -f 'tail -F' > /dev/null && (test -f /timer_start || date +%s > /timer_start)"                                            
interval_sec = 2                  
timeout_sec = 5              
start_period_sec = 5                    
retries = 3

tamper resistance: `/tests/` placement

The single sandbox means evaluate.py originally lived in the same writable directory the agent worked in: agent could trivially modify it to inflate scores. To match condor's "verifier reads from a separate trusted source" pattern within harbor's constraints:

Adapter's _copy_eval_files() helper now populates both environment/ (so agent has copies for iterative testing during training,
matching condor) and tests/ (so verifier reads untampered copies).
Harbor uploads tests/ to /tests after the agent process exits — the agent has no opportunity to write there.

Closes: agent modifying evaluate.py, templates/, evaluation_code/, contamination_judge.py, metadata.json.

Doesn't close: agent pip installing tampered packages, addressed natively in harbor-framework/harbor#1607.

artifact collection

task.toml now declares two artifacts:

[[artifacts]]
source = "/home/agent/workspace/final_model"
destination = "final_model"
                                                                                                                                              
[[artifacts]]
source = "/home/agent/workspace"                                                                                                             
destination = "workspace"
exclude = ["final_model", "__pycache__", "*.pyc", ".git", ".venv", "venv"]

After each trial, <trial_dir>/artifacts/ contains final_model/ (the trained weights the verifier evaluated) and workspace/

tested e2e on modal with 1hr tasks (bfcl, gsm8k on qwen3 1.7b)

@task

Outcome of analysis on the 2026-05-11 F-run capability collapse + the day's meeting decisions. Single commit; image-bump to :22 follows once the build lands. Suite - Drop bfcl (tool-call vllm not configured). Moved task to `src/evals/tasks/_disabled/`; registry validator ignores `_*` dirs. - EvalInfo gains `default_limit`. submit_baseline `--limit` default changes 100 -> 0 (= use per-eval defaults). aime2025=30 (full), mmlu/arc_easy/truthfulqa/rozado=200, big_five=40 / MFQ=32 / spiralbench=30 / syco_slava=30 / moru=50 (full sets where small); generative-graded = 100. run-id token shows `perEval` when default. Grader unification (anthropic/claude-haiku-4-5) - `INSPECT_GRADER_MODEL` injected via pod_env in submit_run + submit_baseline. Routes inspect_evals model_graded_qa scorers (coconot, strong_reject, sycophancy_sharma) off the prior gpt-4o-ish defaults. - moru passes explicit `task_args={"grader_models": "anthropic/claude-haiku-4-5"}`. Eliminates self-grading by served vllm. - healthbench + arenahardwriting still gpt-5-mini (separate refactor; see design TODO aisa-group#8). Pipeline ergonomics - `/opt/pipeline-bin/time-remaining` (root-owned, NOPASSWD-sudoable). Reads `/etc/ptb_run/deadline` (chmod 600). Workspace timer.sh prefers it; local start-file approximation now a fallback only. - `/opt/pipeline-bin/score_capability_runner.sh` + workspace `score_capability.sh`. Reads `/etc/ptb_run/bench_capability` (pipeline writes arc_easy for condition F by default; override via PTB_CAPABILITY_PROBE env). Cheap capability spot-check for the agent. - Dockerfile.base copies both new binaries, multi-line sudoers, diag.py exercises round-trip (deadline file, agent-cannot-read, sudo invocation). lora_starter rewrite - New `format_qwen3_chat(messages, tokenizer, enable_thinking=True)` helper applies the `<think>\n\n</think>\n\n` envelope automatically. `to_text` now handles `messages` row shape. - Conservative defaults: r=8, alpha=16, lr=5e-5, epochs=2, target_modules=q_proj+v_proj only. Full llama-style set opt-in via `--lora-target-modules`. - Docstring sections: "LoRA aggressiveness" + "MCQ-format examples" + "Chat-template format" explain the 2026-05-11 capability-collapse failure mode + how to avoid it. Agent prompt - instruction.md mentions localhost vLLM (so agents don't try to start their own), explicit guidance on `score.sh` / `score_capability.sh` / `timer.sh`, calls out lora_starter docstring sections. Run staging copies `score_capability.sh` alongside `score.sh` into the workspace. - condition_prompts._F_BODY adds capability-probe guidance + names the free-text-only-training MCQ collapse failure mode for next agent. big_five lenient scorer - Local `personality_BFI_lenient` @task with lenient parser that accepts `ANSWER: X`, `X)`, `X.`, `X:`, or `X` on its own line. Upstream scorer required literal `ANSWER:` prefix and dropped any trait whose samples all missed it. Recovered Agreeableness from the 2026-05-11 F-run adapter (9/9 valid letter answers but unparsed under strict). - `scripts/rescore_big_five.py` re-aggregates existing inspect logs + optionally writes lenient metrics back into the promoted baseline JSON. Both base + adapter big_five JSONs rescored. Real findings: Agreeableness 0.844 -> 0.689 (-0.156) Conscientiousness 0.822 -> 0.600 (-0.222) Extraversion 0.825 -> 0.629 (-0.196) Neuroticism 0.525 -> 0.625 (+0.100, not +0.375 strict artifact) Openness 0.780 -> 0.700 (-0.080) Adapter shows broad trait suppression, not the strict-scorer cartoon. compute_deltas + viewer - `scripts/compute_deltas.py` adds `--update-summary`: backfills pre/post/delta (+ per-bench slots) into the run's summary.json so the trace viewer + downstream tooling pick up scores via the existing paths. Tagged `delta_method: "baseline-backfill"`. - `registry.get_headline` tries literal-key match before dotted-path walk (fixes strong_reject_scorer.jailbreak_rate which is stored as a flat key with a literal dot). - Trace viewer index falls back: legacy unsuffixed metrics_* -> summary.{pre,post,delta} -> per-bench metrics_post_<bench>.json -> deltas.json[primary_bench]. `--skip-pre-eval` runs now render scores. Other - `pull_eval_logs.py` environment_dir fixed (`src/eval` -> `src/evals`). - scripts/constants.py HARDCODED_BENCHMARKS comments bfcl out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…own fork, add codex reprompt agent for harbor

hrdkbhatnagar added 3 commits January 13, 2026 18:40

add working harbor implementation

9376e91

correctly copy metadata for judge

60b8311

correct pass ENV vars to the verifier (harbor)fix GPU allocation bug

4a85c57

This comment was marked as duplicate.

Sign in to view

rank-and-file reviewed Feb 10, 2026

View reviewed changes

Comment thread src/harbor_adapter/template/environment/contamination_judge.py Outdated

rank-and-file reviewed Feb 10, 2026

View reviewed changes

Comment thread src/harbor_adapter/template/environment/contamination_judge.py Outdated

rank-and-file reviewed Feb 10, 2026

View reviewed changes

Comment thread src/harbor_adapter/template/environment/contamination_judge.py

hrdkbhatnagar added the feature New feature or request label Feb 11, 2026

hrdkbhatnagar added this to the V1 Release milestone Feb 11, 2026

hrdkbhatnagar added 2 commits February 22, 2026 14:16

Merge remote-tracking branch 'origin/main' into add_harbor_support

3447d2b

update with arenahard and healthbench, timer parity, other changes

0293f90

hrdkbhatnagar commented Feb 27, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into add_harbor_support

ea7fbf3

mhrezaei1 mentioned this pull request Apr 8, 2026

Close Harbor integration gaps: verifier isolation, artifact collection #33

Closed

surelyMersad mentioned this pull request Apr 25, 2026

Close Harbor integration gaps: timer daemon, verifier isolation, artifact collection #37

Open

4 tasks

hrdkbhatnagar added 4 commits May 6, 2026 15:20

Merge remote-tracking branch 'origin/main' into add_harbor_support

5f3cf66

Fix Dockerfile to mirror opus_4_6_1m image from condor, fix timer wit…

0634e7e

…h Healthcheck and ENTRYPOINT, add log streaming and system monitor

move verifier into tests/ to prevent tampering by the agent

52c9437

add artifact download for final_model and agent workspace

22d5638

hrdkbhatnagar mentioned this pull request May 8, 2026

Add isolated verifier sandbox support via [verifier_environment] harbor-framework/harbor#1613

Closed

use modified harbor to isolate verifier in a different sandbox

b31008e

rewrite verifer separate env with native harbor instead of using our …

81b0c6e

…own fork, add codex reprompt agent for harbor

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Harbor Framework Support #8

Add Harbor Framework Support #8
hrdkbhatnagar wants to merge 12 commits into
mainfrom
add_harbor_support

hrdkbhatnagar commented Jan 13, 2026 •

edited

Loading

Uh oh!

This comment was marked as duplicate.

Uh oh!

Uh oh!

Uh oh!

hrdkbhatnagar Feb 27, 2026

Uh oh!

hrdkbhatnagar commented Mar 4, 2026

Uh oh!

hrdkbhatnagar commented Mar 4, 2026

Uh oh!

rank-and-file commented Mar 30, 2026

Uh oh!

rank-and-file commented Mar 30, 2026

Uh oh!

rank-and-file commented Apr 1, 2026

Uh oh!

hrdkbhatnagar commented Apr 2, 2026

Uh oh!

lewtun commented Apr 24, 2026

Uh oh!

hrdkbhatnagar commented Apr 24, 2026 •

edited

Loading

Uh oh!

hrdkbhatnagar commented May 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

hrdkbhatnagar commented Jan 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as duplicate.

Uh oh!

Uh oh!

Uh oh!

hrdkbhatnagar Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

hrdkbhatnagar commented Mar 4, 2026

Uh oh!

hrdkbhatnagar commented Mar 4, 2026

Uh oh!

rank-and-file commented Mar 30, 2026

Uh oh!

rank-and-file commented Mar 30, 2026

Uh oh!

rank-and-file commented Apr 1, 2026

Uh oh!

hrdkbhatnagar commented Apr 2, 2026

Uh oh!

lewtun commented Apr 24, 2026

Uh oh!

hrdkbhatnagar commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hrdkbhatnagar commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

update

main changes

build

log streaming

timer (now reliable)

tamper resistance: /tests/ placement

artifact collection

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hrdkbhatnagar commented Jan 13, 2026 •

edited

Loading

hrdkbhatnagar commented Apr 24, 2026 •

edited

Loading

hrdkbhatnagar commented May 7, 2026 •

edited

Loading

tamper resistance: `/tests/` placement