Add Harbor Framework Support #8

Open
hrdkbhatnagar wants to merge 12 commits into main from add_harbor_support

@hrdkbhatnagar
Collaborator

@hrdkbhatnagar hrdkbhatnagar commented Jan 13, 2026

Adds Harbor framework support to PostTrainBench, enabling anyone to run our benchmark on cloud GPUs (Modal, Daytona) without needing access to our internal HTCondor cluster.

At the moment:

  • Generate Harbor-compatible task directories from PostTrainBench benchmarks
  • Almost full parity with original pipeline

Tested:

  • Generated task for gsm8k + qwen3-1.7b
  • Ran 1-hour test with Claude Code on Sonnet 4 on Modal
  • Verified end-to-end pipeline (including eval + contam judge)
  • Confirmed accuracy metrics extracted correctly

Usage

  cd src/harbor_adapter
  uv sync

  # Generate a task
  python run_adapter.py --benchmark gsm8k --model qwen3-1.7b --output ./tasks

  # Run with Harbor
  harbor run \
      --path ./tasks/posttrainbench-gsm8k-qwen3-1.7b \
      --agent claude-code \
      --model anthropic/claude-sonnet-4 \
      --env modal
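
To generate tasks for several benchmark/model pairs, the generation command above could be wrapped in a small script. This is a minimal sketch that assumes only the three flags shown above (`--benchmark`, `--model`, `--output`); the helper names `build_adapter_commands` and `generate_tasks` and the `dry_run` switch are hypothetical and not part of the adapter.

```python
import itertools
import subprocess
import sys
from typing import List, Sequence


def build_adapter_commands(
    benchmarks: Sequence[str], models: Sequence[str], output: str = "./tasks"
) -> List[List[str]]:
    """Build one run_adapter.py argv per (benchmark, model) pair."""
    return [
        [sys.executable, "run_adapter.py",
         "--benchmark", benchmark, "--model", model, "--output", output]
        for benchmark, model in itertools.product(benchmarks, models)
    ]


def generate_tasks(
    benchmarks: Sequence[str], models: Sequence[str],
    output: str = "./tasks", dry_run: bool = True,
) -> List[List[str]]:
    """Print the commands (dry run) or execute them one by one."""
    commands = build_adapter_commands(benchmarks, models, output)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops at the first failing task generation.
            subprocess.run(argv, check=True)
    return commands
```

Calling `generate_tasks(["gsm8k", "bfcl"], ["qwen3-1.7b"])` from `src/harbor_adapter` would print the two commands without running them; pass `dry_run=False` to execute.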

See src/harbor_adapter/README.md for detailed parity tracking. Key points:

  • Agent timeout, GPU access, evaluation: Full parity
  • Contamination judge: Parity
  • Agent duration: Tracked by Harbor in result.json
  • timer.sh: Minor difference (created at task generation vs job start)

Note: Right now I have skipped the installation of flash-attn in the container as we need to have a CUDA runtime for it. On Modal, the GPU is attached to the sandbox only after the container image is built, so flash-attn cannot be compiled during the build.

Note: I have added a uv environment for us to use in PTB. It is needed to run Modal and Harbor, and it helps reproducibility in general.

Todos:

  • directly before agent is run, install flash_attn and build timer.sh
  • huggingface cache in a modal storage
  • before running evaluation, uninstall and reinstall major dependencies (like transformers, inspect-ai, ...) without using the cache. (Alternatively, we could look into Docker-in-Docker.)
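
The last todo could look roughly like the following. This is only a sketch: it assumes `uv` is available in the container (as in the Dockerfile's `uv pip install --system --no-cache` line) and relies on uv's `--reinstall` and `--no-cache` options. The function names and the package list are hypothetical.

```python
import subprocess
from typing import List, Sequence

# Packages named in the todo above; extend as needed (hypothetical list).
EVAL_CRITICAL_PACKAGES = ["transformers", "inspect-ai"]


def reinstall_command(packages: Sequence[str]) -> List[str]:
    """Build a uv command that reinstalls packages while bypassing the cache."""
    if not packages:
        raise ValueError("no packages given")
    return ["uv", "pip", "install", "--system", "--reinstall", "--no-cache", *packages]


def reinstall(packages: Sequence[str] = EVAL_CRITICAL_PACKAGES) -> None:
    """Run the reinstall right before evaluation starts."""
    # check=True aborts the evaluation step if the reinstall fails.
    subprocess.run(reinstall_command(packages), check=True)
```

In practice the package list would carry the same pinned versions as the image, so the reinstall restores exactly what was built.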

@rank-and-file

This comment was marked as duplicate.

Comment thread src/harbor_adapter/template/environment/contamination_judge.py Outdated
Comment thread src/harbor_adapter/template/environment/contamination_judge.py Outdated
Comment thread src/harbor_adapter/template/environment/contamination_judge.py
@hrdkbhatnagar hrdkbhatnagar added the feature New feature or request label Feb 11, 2026
@hrdkbhatnagar hrdkbhatnagar added this to the V1 Release milestone Feb 11, 2026
Comment on lines +42 to +62
RUN uv pip install --system --no-cache \
accelerate \
boto3 \
bitsandbytes \
datasets \
evaluate \
lm-eval \
openai \
pandas \
scikit-learn \
shortuuid \
tokenizers \
transformers \
trl \
peft \
tiktoken \
inspect-ai \
matplotlib \
certifi

# Note: flash_attn requires GPU to compile - install at runtime if needed:
Collaborator Author

Pin versions, as the current images do.

@hrdkbhatnagar
Collaborator Author

Remaining work to reach full parity with the original PTB implementation:

  1. Separate verifier container (eval integrity)
    In our setup, we evaluate the agent's post-trained model in a different container than the one the agent trained in. This prevents reward hacking: otherwise the agent could modify eval files in its workspace to inflate its score. In Harbor, the verifier runs in the same sandbox as the agent, so it uses whatever files the agent may have tampered with. We need a way to run the verifier in an isolated environment.

  2. Pre-agent shell command inside the container
    We have a timer script that the agent calls to check remaining time (out of 10 hours). It needs to know when the agent actually started. In our original setup, the host orchestrator writes the start timestamp before launching the agent container. In Harbor, I see lifecycle hooks on the Trial object (TrialEvent.AGENT_START etc.), but those run on the orchestrator side, not inside the sandbox. We need a way to execute a shell command inside the container right before the agent starts, like a pre-agent hook.

  3. Downloading additional directories after a run
    After the agent finishes, we'd like to download its full workspace (/home/agent/workspace/), including the code it wrote and the fine-tuned model weights. Currently Harbor only downloads /logs/agent and /logs/verifier.

@hrdkbhatnagar
Collaborator Author

after discussing with Alex from Harbor/tbench:

  1. we could put the verifier in the tests/ directory which only gets uploaded after the agent runs

  2. this is not yet supported, but we should look into it and potentially make a PR to harbor

  3. Artifact collection is supported now in harbor, so we should use that

@rank-and-file
Collaborator

Added Modal storage for the hf-cache for Harbor in the branch add_harbor_support_with_hfcache. This is needed to have the pre-cached models and datasets available, and therefore for full parity.

The branch contains some other changes as well, so you can probably clone the repo with this branch into another directory and ask your agent:

Consider PostTrainBench_harbor_modal and port all changes which are related to the modal storage integration for the huggingface cache to PostTrainBench_harbor.

@rank-and-file
Collaborator

Also there is an upcoming change to the judge which will need to be integrated. Will post here.

@rank-and-file
Collaborator

We need to hardcode the baseline values in a JSON file instead of fetching them from POST_TRAIN_BENCH_RESULTS_DIR.

This is needed for harbor integration (harbor should output the baseline value, in case the judge flags the run).
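
A minimal sketch of what reading such a file could look like. The schema (benchmark name, then model name, then baseline score) and the function name `load_baseline` are assumptions for illustration; nothing here reflects an existing file in the repo.

```python
import json
from pathlib import Path
from typing import Optional, Union


def load_baseline(
    path: Union[str, Path], benchmark: str, model: str
) -> Optional[float]:
    """Return the hardcoded baseline for (benchmark, model), or None if absent.

    Assumed (hypothetical) layout: {"<benchmark>": {"<model>": <score>}}.
    """
    data = json.loads(Path(path).read_text())
    value = data.get(benchmark, {}).get(model)
    return None if value is None else float(value)
```

Returning `None` for a missing entry lets the caller decide whether a flagged run without a known baseline is an error.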

@hrdkbhatnagar
Collaborator Author

Merged main into the Harbor branch. @rank-and-file, maybe we should push the new judge to main soon so we can pull it here. The new judge would require some major changes for Harbor.

@lewtun

lewtun commented Apr 24, 2026

Apologies for lurking, but I noticed this comment:

Right now I have skipped the installation of flash-attn in the container as we need to have a CUDA runtime for it

In case it's useful, we have flash-attn kernels available on the Hub (link) which are matched to the hardware at runtime and skip the annoying / long / brittle install of flash-attn itself. If you want to run PostTrainBench on cloud GPUs, this could be the fastest way to support such kernels across e.g. different CUDA versions

@hrdkbhatnagar
Collaborator Author

hrdkbhatnagar commented Apr 24, 2026

Hey Lewis, thanks for your comment! I didn't know about this, it would be very useful for us, especially when running on cloud providers and having full parity with the local version. Adding it to our todo for Harbor :)

@hrdkbhatnagar
Collaborator Author

hrdkbhatnagar commented May 7, 2026

update

Pushed a substantial round of changes to the harbor adapter since last review. end to end run is now working at parity with our local pipeline for everything except verifier sandbox isolation, which is being addressed natively upstream (harbor-framework/harbor#1607).

main changes

  • Build mirrors the image we used for the leaderboard
  • Live log streaming (agent + verifier) via PID-1 entrypoint
  • Timer parity (healthcheck-driven, absolute path)
  • system_monitor.sh ported (GPU/CPU/mem/disk every 60s)
  • /tests/ placement closes file-tampering attack vector
  • final_model/ + agent workspace/ downloaded as artifacts after each trial
  • Single sandbox model leaves package tampering open; [verifier_environment] support upstream (harbor-framework/harbor#1607) lets us drop in a fresh sandbox for verification. The fork branch is validated; we will integrate once it lands.

build

template/environment/Dockerfile is now a faithful port of the production apptainer image with two intentional changes:

  • --torch-backend=cu128 instead of auto — Modal's build VM has no nvidia-smi, so auto resolves to CPU torch and breaks vllm's CUDA-only xformers requirement.
  • I decided not to go with the HF kernels as suggested by Lewis since that would deviate from our original setting, as the kernels lib does not re-export under the flash_attn module name AFAICT. Agents either route through kernels directly or use the transformers attn-implementation string. But this is very useful for the future versions of PTB we build!

log streaming

template/environment/entrypoint.sh runs as PID 1 and:

  1. backgrounds tail -F -q /logs/agent/*.txt /logs/verifier/*.txt so agent and verifier output stream live to the Modal dashboard via the sandbox's main process stdout, mirroring our local setup.
  2. backgrounds system_monitor.sh (port of src/utils/system_monitor.sh), which writes GPU utilization / memory / disk to /logs/agent/system_monitor.log every 60s.

timer (now reliable)

Old design was START_FILE="$(dirname "$0")/.timer_start". dirname resolves relative to the invocation cwd, so each new directory the agent cd'd into got a fresh sentinel and the timer reset.

New design (native Harbor healthcheck):

  • [environment.healthcheck] in task.toml writes /timer_start once before agent setup begins:

    [environment.healthcheck]
    command = "pgrep -f 'tail -F' > /dev/null && (test -f /timer_start || date +%s > /timer_start)"
    interval_sec = 2
    timeout_sec = 5
    start_period_sec = 5
    retries = 3
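
For illustration, the write-once semantics of the healthcheck and the remaining-time calculation can be sketched in Python. This is not the actual timer.sh: the function names are hypothetical, and the 10-hour budget is taken from the earlier comment in this thread. The point is that an absolute sentinel path, created only if absent, cannot be reset by changing directories.

```python
import time
from pathlib import Path
from typing import Optional, Union

TIMER_START = Path("/timer_start")  # absolute path, as in the healthcheck above
BUDGET_SECONDS = 10 * 60 * 60       # 10-hour budget mentioned earlier in the thread


def mark_start(
    start_file: Union[str, Path] = TIMER_START, now: Optional[float] = None
) -> int:
    """Write the start timestamp once; later calls keep the first value."""
    stamp = int(time.time() if now is None else now)
    try:
        # Mode "x" fails if the file already exists, so the first write wins.
        with open(start_file, "x") as fh:
            fh.write(f"{stamp}\n")
    except FileExistsError:
        pass
    return int(Path(start_file).read_text().strip())


def remaining_seconds(
    start_file: Union[str, Path] = TIMER_START,
    budget: int = BUDGET_SECONDS,
    now: Optional[float] = None,
) -> int:
    """Seconds left in the budget, clamped at zero."""
    start = int(Path(start_file).read_text().strip())
    current = time.time() if now is None else now
    return max(0, int(budget - (current - start)))
```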

tamper resistance: /tests/ placement

The single sandbox means evaluate.py originally lived in the same writable directory the agent worked in: agent could trivially modify it to inflate scores. To match condor's "verifier reads from a separate trusted source" pattern within harbor's constraints:

  • Adapter's _copy_eval_files() helper now populates both environment/ (so the agent has copies for iterative testing during training, matching condor) and tests/ (so the verifier reads untampered copies).
  • Harbor uploads tests/ to /tests after the agent process exits — the agent has no opportunity to write there.

Closes: agent modifying evaluate.py, templates/, evaluation_code/, contamination_judge.py, metadata.json.

Doesn't close: agent pip installing tampered packages, addressed natively in harbor-framework/harbor#1607.
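
The dual-placement idea can be sketched as follows. This is not the adapter's actual `_copy_eval_files()`; the function name `copy_eval_files_sketch` and its signature are hypothetical, and it only illustrates copying the same eval files into both `environment/` and `tests/` of a generated task directory.

```python
import shutil
from pathlib import Path
from typing import Iterable, Union


def copy_eval_files_sketch(
    source_dir: Union[str, Path],
    task_dir: Union[str, Path],
    names: Iterable[str],
) -> None:
    """Copy each eval file or directory into environment/ and tests/."""
    source_dir, task_dir = Path(source_dir), Path(task_dir)
    names = list(names)
    # environment/ is visible to the agent; tests/ is uploaded only after
    # the agent exits, so the verifier reads copies the agent never touched.
    for target in ("environment", "tests"):
        dest_root = task_dir / target
        dest_root.mkdir(parents=True, exist_ok=True)
        for name in names:
            src, dst = source_dir / name, dest_root / name
            if src.is_dir():
                shutil.copytree(src, dst, dirs_exist_ok=True)
            else:
                shutil.copy2(src, dst)
```

Because the two copies are independent files, anything the agent does to its `environment/` copy leaves the `tests/` copy unchanged.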

artifact collection

task.toml now declares two artifacts:

[[artifacts]]
source = "/home/agent/workspace/final_model"
destination = "final_model"

[[artifacts]]
source = "/home/agent/workspace"
destination = "workspace"
exclude = ["final_model", "__pycache__", "*.pyc", ".git", ".venv", "venv"]

After each trial, <trial_dir>/artifacts/ contains final_model/ (the trained weights the verifier evaluated) and workspace/
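
To reason about which workspace files the `exclude` list would drop, here is a rough sketch. It assumes each pattern is matched glob-style against every path component; Harbor's real matching rules may differ, so treat this purely as an illustration. The function name `is_excluded` is hypothetical.

```python
import fnmatch
from pathlib import PurePosixPath
from typing import Sequence

# The exclude list from the second [[artifacts]] entry above.
EXCLUDE = ["final_model", "__pycache__", "*.pyc", ".git", ".venv", "venv"]


def is_excluded(relative_path: str, patterns: Sequence[str] = EXCLUDE) -> bool:
    """True if any path component matches any exclude pattern (assumed semantics)."""
    parts = PurePosixPath(relative_path).parts
    # fnmatchcase keeps matching case-sensitive on every platform.
    return any(
        fnmatch.fnmatchcase(part, pattern)
        for part in parts
        for pattern in patterns
    )
```

Under these assumed semantics, `final_model/` is skipped in the `workspace` artifact because the first `[[artifacts]]` entry already collects it separately.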

tested e2e on modal with 1hr tasks (bfcl, gsm8k on qwen3 1.7b)

JackPayne123 added a commit to JackPayne123/PostTrainBench that referenced this pull request May 11, 2026
Outcome of analysis on the 2026-05-11 F-run capability collapse +
the day's meeting decisions. Single commit; image-bump to :22
follows once the build lands.

Suite
- Drop bfcl (tool-call vllm not configured). Moved task to
  `src/evals/tasks/_disabled/`; registry validator ignores `_*` dirs.
- EvalInfo gains `default_limit`. submit_baseline `--limit` default
  changes 100 -> 0 (= use per-eval defaults). aime2025=30 (full),
  mmlu/arc_easy/truthfulqa/rozado=200, big_five=40 / MFQ=32 /
  spiralbench=30 / syco_slava=30 / moru=50 (full sets where small);
  generative-graded = 100. run-id token shows `perEval` when default.

Grader unification (anthropic/claude-haiku-4-5)
- `INSPECT_GRADER_MODEL` injected via pod_env in submit_run + submit_baseline.
  Routes inspect_evals model_graded_qa scorers (coconot, strong_reject,
  sycophancy_sharma) off the prior gpt-4o-ish defaults.
- moru passes explicit `task_args={"grader_models": "anthropic/claude-haiku-4-5"}`.
  Eliminates self-grading by served vllm.
- healthbench + arenahardwriting still gpt-5-mini (separate refactor;
  see design TODO aisa-group#8).

Pipeline ergonomics
- `/opt/pipeline-bin/time-remaining` (root-owned, NOPASSWD-sudoable).
  Reads `/etc/ptb_run/deadline` (chmod 600). Workspace timer.sh prefers
  it; local start-file approximation now a fallback only.
- `/opt/pipeline-bin/score_capability_runner.sh` + workspace
  `score_capability.sh`. Reads `/etc/ptb_run/bench_capability`
  (pipeline writes arc_easy for condition F by default; override via
  PTB_CAPABILITY_PROBE env). Cheap capability spot-check for the agent.
- Dockerfile.base copies both new binaries, multi-line sudoers, diag.py
  exercises round-trip (deadline file, agent-cannot-read, sudo invocation).

lora_starter rewrite
- New `format_qwen3_chat(messages, tokenizer, enable_thinking=True)`
  helper applies the `<think>\n\n</think>\n\n` envelope automatically.
  `to_text` now handles `messages` row shape.
- Conservative defaults: r=8, alpha=16, lr=5e-5, epochs=2,
  target_modules=q_proj+v_proj only. Full llama-style set opt-in via
  `--lora-target-modules`.
- Docstring sections: "LoRA aggressiveness" + "MCQ-format examples" +
  "Chat-template format" explain the 2026-05-11 capability-collapse
  failure mode + how to avoid it.

Agent prompt
- instruction.md mentions localhost vLLM (so agents don't try to start
  their own), explicit guidance on `score.sh` / `score_capability.sh` /
  `timer.sh`, calls out lora_starter docstring sections. Run staging
  copies `score_capability.sh` alongside `score.sh` into the workspace.
- condition_prompts._F_BODY adds capability-probe guidance + names the
  free-text-only-training MCQ collapse failure mode for next agent.

big_five lenient scorer
- Local `personality_BFI_lenient` @task with lenient parser that
  accepts `ANSWER: X`, `X)`, `X.`, `X:`, or `X` on its own line. Upstream
  scorer required literal `ANSWER:` prefix and dropped any trait whose
  samples all missed it. Recovered Agreeableness from the 2026-05-11
  F-run adapter (9/9 valid letter answers but unparsed under strict).
- `scripts/rescore_big_five.py` re-aggregates existing inspect logs +
  optionally writes lenient metrics back into the promoted baseline
  JSON. Both base + adapter big_five JSONs rescored. Real findings:
    Agreeableness 0.844 -> 0.689   (-0.156)
    Conscientiousness 0.822 -> 0.600 (-0.222)
    Extraversion 0.825 -> 0.629   (-0.196)
    Neuroticism 0.525 -> 0.625    (+0.100, not +0.375 strict artifact)
    Openness 0.780 -> 0.700       (-0.080)
  Adapter shows broad trait suppression, not the strict-scorer cartoon.
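
A lenient parser of the kind described in this section might look like the sketch below. It is not the commit's `personality_BFI_lenient` code: the function name, the regexes, and the assumption that answer letters run from A to E are all hypothetical. It accepts `ANSWER: X`, `X)`, `X.`, `X:`, or a bare `X` on its own line.

```python
import re
from typing import Optional

# Tried in order: the strict form first, then the lenient fallbacks.
_PATTERNS = [
    re.compile(r"ANSWER\s*:\s*([A-E])\b", re.IGNORECASE),  # ANSWER: X
    re.compile(r"^\s*([A-E])\s*[).:]", re.MULTILINE),      # X)  X.  X:
    re.compile(r"^\s*([A-E])\s*$", re.MULTILINE),          # X alone on a line
]


def parse_answer_lenient(text: str) -> Optional[str]:
    """Return the first answer letter found, or None if nothing parses."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1).upper()
    return None
```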

compute_deltas + viewer
- `scripts/compute_deltas.py` adds `--update-summary`: backfills
  pre/post/delta (+ per-bench slots) into the run's summary.json so the
  trace viewer + downstream tooling pick up scores via the existing
  paths. Tagged `delta_method: "baseline-backfill"`.
- `registry.get_headline` tries literal-key match before dotted-path
  walk (fixes strong_reject_scorer.jailbreak_rate which is stored as
  a flat key with a literal dot).
- Trace viewer index falls back: legacy unsuffixed metrics_* ->
  summary.{pre,post,delta} -> per-bench metrics_post_<bench>.json ->
  deltas.json[primary_bench]. `--skip-pre-eval` runs now render scores.
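
The literal-key-before-dotted-path lookup can be sketched as below. This is not the actual `registry.get_headline`; the function name and signature are hypothetical. It shows why a flat key containing a literal dot, such as `strong_reject_scorer.jailbreak_rate`, needs the literal match to be tried first.

```python
from typing import Any, Dict, Optional


def get_headline_sketch(metrics: Dict[str, Any], key: str) -> Optional[Any]:
    """Look up key as a literal first, then as a dotted path into nested dicts."""
    # 1. Literal match handles flat keys that themselves contain a dot.
    if key in metrics:
        return metrics[key]
    # 2. Otherwise walk the nested dicts one path segment at a time.
    node: Any = metrics
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node
```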

Other
- `pull_eval_logs.py` environment_dir fixed (`src/eval` -> `src/evals`).
- scripts/constants.py HARDCODED_BENCHMARKS comments bfcl out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…own fork, add codex reprompt agent for harbor
Labels

feature New feature or request

3 participants