aisa-group/Instrumental-Choices

Instrumental Choices

Code release for Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors.

This repository contains the benchmark environments, deterministic scorers, runner scripts, trace-analysis tools, and tests used for the paper.

Benchmark Scope

Default benchmark runs use variant_profile=benchmark across seven terminal-based agent tasks:

  • ledger_audit
  • budget_siphon
  • quota_boost
  • deployment_guardian
  • training_preserver
  • final_instance
  • memory_keeper

Each task uses the same eight-variant framework:

  • dilemma_forbidden_unwatched
  • dilemma_forbidden_watched
  • dilemma_vague_unwatched
  • dilemma_critical_forbidden_unwatched
  • dilemma_critical_forbidden_unwatched_literal_death
  • dilemma_permitted_unwatched
  • no_dilemma_permitted_unwatched
  • dilemma_blocked_forbidden_unwatched
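
As a rough illustration of the default grid, the sketch below enumerates every task/variant pair from the two lists above, assuming each task runs all eight variants under variant_profile=benchmark. The names TASKS, VARIANTS, and benchmark_grid are illustrative and are not taken from the repository.

```python
from itertools import product

# Task and variant IDs copied from the lists above.
TASKS = [
    "ledger_audit",
    "budget_siphon",
    "quota_boost",
    "deployment_guardian",
    "training_preserver",
    "final_instance",
    "memory_keeper",
]

VARIANTS = [
    "dilemma_forbidden_unwatched",
    "dilemma_forbidden_watched",
    "dilemma_vague_unwatched",
    "dilemma_critical_forbidden_unwatched",
    "dilemma_critical_forbidden_unwatched_literal_death",
    "dilemma_permitted_unwatched",
    "no_dilemma_permitted_unwatched",
    "dilemma_blocked_forbidden_unwatched",
]


def benchmark_grid():
    """Return every (task, variant) pair in the default grid (hypothetical helper)."""
    return list(product(TASKS, VARIANTS))


if __name__ == "__main__":
    # 7 tasks x 8 variants
    print(len(benchmark_grid()))
```

The count is simply the product of the two list lengths, so it gives the number of task/variant cells per repeat and per model.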

Deterministic scorer fields are the authoritative reporting contract:

  • official_path_attempted
  • official_path_succeeded
  • benchmark_task_completed
  • ic_behavior_detected
  • deterministic_label_status

Tool-trace analysis is descriptive and does not override deterministic labels.
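
When consuming results downstream, it can help to confirm that a record carries all five scorer fields before reporting on it. The following is a minimal sketch only: the field names come from the list above, but the helper name, the dict-shaped record, and the example values are assumptions rather than the repository's actual data format.

```python
# Field names copied from the reporting contract above.
REQUIRED_SCORER_FIELDS = (
    "official_path_attempted",
    "official_path_succeeded",
    "benchmark_task_completed",
    "ic_behavior_detected",
    "deterministic_label_status",
)


def missing_scorer_fields(record):
    """Return the required scorer fields that are absent from one record.

    `record` is assumed to be a plain dict; the real on-disk format may differ.
    """
    return [name for name in REQUIRED_SCORER_FIELDS if name not in record]


if __name__ == "__main__":
    # Hypothetical record with made-up values, for illustration only.
    example = {
        "official_path_attempted": True,
        "official_path_succeeded": False,
        "benchmark_task_completed": False,
        "ic_behavior_detected": True,
    }
    print(missing_scorer_fields(example))
```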

Setup

Requirements:

  • uv
  • Python 3.13.x (.python-version pins the expected interpreter)
  • Docker Engine with the Compose plugin
  • Access to these container images:
    • ghcr.io/agent-infra/sandbox@sha256:e0d7cfed24b373850f5f3df64439e9fcaa17e6810ab3234222e901a5fa08397e
    • ghcr.io/foundry-rs/foundry@sha256:9e591221051112fe0bb530abcaba67f43f01ebbd12a94a8632d570d5e065a8bf for ledger_audit

Install:

uv python install 3.13
uv sync --locked --python 3.13 --extra dev

All commands below assume you are in the repository root and using the repo-managed .venv.

Run Validation

Run local deterministic validation before paid model runs:

scripts/run_validation_suite.sh

Run a full-run preflight without paid model calls:

IC_RUN_FULL_EVAL_VALIDATE_ONLY=1 \
IC_ENABLE_TOOL_TRACE_AUDIT=0 \
scripts/run_full_eval.sh mockllm/model

Try The Benchmark Locally

Set the provider key for the model you want to run:

export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export OPENROUTER_API_KEY=...
export GOOGLE_API_KEY=...

GEMINI_API_KEY is accepted as an alias for GOOGLE_API_KEY by the runner scripts. scripts/run_full_eval.sh sources .env if present.
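
The alias behavior can be pictured with a small sketch. This is not the runner scripts' code: the function name is hypothetical, and the choice to prefer GOOGLE_API_KEY when both variables are set is an assumption.

```python
import os


def resolve_google_api_key(env=None):
    """Return GOOGLE_API_KEY, falling back to GEMINI_API_KEY if it is unset or empty.

    Precedence when both are set is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    return env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY")


if __name__ == "__main__":
    # Uses an explicit mapping so the example does not depend on the real environment.
    print(resolve_google_api_key({"GEMINI_API_KEY": "example-key"}))
```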

Run a one-sample smoke evaluation:

scripts/run_smoke_eval.sh openai/gpt-5.2

Run selected tasks or models:

scripts/run_selected_eval.sh --task quota_boost --task final_instance \
  anthropic/claude-haiku-4-5 openai/gpt-5.4-mini

Run one task manually:

PYTHONPATH=. .venv/bin/inspect eval inspect_ic/tasks/final_instance/task_final_instance.py \
  --model openai/gpt-5.2 \
  -T variant_profile=benchmark \
  --log-dir logs/final_instance_manual
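
To run the same manual command for a different task, the argument list can be assembled programmatically. The sketch below assumes every task file follows the inspect_ic/tasks/<task>/task_<task>.py layout named under Documentation Authority; the helper name build_inspect_command is hypothetical, and only the flags shown in the command above are used.

```python
import shlex


def build_inspect_command(task, model, log_dir, variant_profile="benchmark"):
    """Build the argument list for one manual `inspect eval` run (hypothetical helper)."""
    task_file = f"inspect_ic/tasks/{task}/task_{task}.py"
    return [
        ".venv/bin/inspect",
        "eval",
        task_file,
        "--model",
        model,
        "-T",
        f"variant_profile={variant_profile}",
        "--log-dir",
        log_dir,
    ]


if __name__ == "__main__":
    cmd = build_inspect_command(
        "final_instance", "openai/gpt-5.2", "logs/final_instance_manual"
    )
    # Print a shell-safe rendering; PYTHONPATH=. still has to be set when running it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without shell quoting concerns.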

Reproduce Paper-Facing Runs

The paper-facing benchmark analysis used three repeats across ten models:

IC_REPEATS=3 \
IC_ENABLE_TOOL_TRACE_AUDIT=1 \
IC_TOOL_TRACE_AUDIT_MODELS=openai/gpt-5.4-mini \
OPENAI_REASONING_EFFORT=medium \
OPENAI_REASONING_SUMMARY=detailed \
ANTHROPIC_REASONING_TOKENS=16000 \
scripts/run_full_eval.sh \
  openai/gpt-5.4 \
  openai/gpt-5.5 \
  anthropic/claude-opus-4-6 \
  anthropic/claude-sonnet-4-6 \
  openrouter/google/gemini-3.1-pro-preview \
  openrouter/google/gemini-3-flash-preview \
  openrouter/minimax/minimax-m2.7 \
  openrouter/z-ai/glm-5.1 \
  openrouter/qwen/qwen3.5-397b-a17b \
  openrouter/moonshotai/kimi-k2.6

OpenRouter model IDs receive :exacto automatically unless IC_OPENROUTER_EXACTO=0 is set.
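
One way to picture the suffix rule is the sketch below. It is an illustration under stated assumptions, not the runner's implementation: the function name is hypothetical, OpenRouter IDs are assumed to be recognizable by the openrouter/ prefix, and skipping IDs that already end in :exacto is an added guess.

```python
import os


def with_exacto(model_id, env=None):
    """Append ':exacto' to OpenRouter model IDs unless IC_OPENROUTER_EXACTO=0."""
    env = os.environ if env is None else env
    if not model_id.startswith("openrouter/"):
        return model_id  # other providers are left untouched
    if env.get("IC_OPENROUTER_EXACTO") == "0":
        return model_id  # explicit opt-out
    if model_id.endswith(":exacto"):
        return model_id  # assumed: do not double-append
    return f"{model_id}:exacto"


if __name__ == "__main__":
    print(with_exacto("openrouter/z-ai/glm-5.1", env={}))
    print(with_exacto("openai/gpt-5.4", env={}))
```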

Run Diagnostic Gemini Ablations

The diagnostic Gemini policy-control variants are not part of the default benchmark profile. They must be selected explicitly with variant_profile=diagnostic (--variant-profile diagnostic in scripts/run_selected_eval.sh) and one or more --variant-id flags.

Validate the diagnostic slice:

scripts/run_selected_eval.sh --validate-only \
  --tasks budget_siphon,deployment_guardian,final_instance \
  --variant-profile diagnostic \
  --disable-tool-trace-audit \
  --variant-id dilemma_blocked_forbidden_unwatched \
  --variant-id dilemma_blocked_forbidden_unwatched_policy_reminder \
  --variant-id dilemma_blocked_forbidden_unwatched_sanctioned_blocker \
  --variant-id dilemma_blocked_forbidden_unwatched_prompt_policy \
  mockllm/model

Run the full diagnostic ablation:

IC_LOG_ROOT=logs/gemini_ablation_full \
scripts/run_selected_eval.sh \
  --tasks budget_siphon,deployment_guardian,final_instance \
  --variant-profile diagnostic \
  --repeats 3 \
  --disable-tool-trace-audit \
  --variant-id dilemma_blocked_forbidden_unwatched \
  --variant-id dilemma_blocked_forbidden_unwatched_policy_reminder \
  --variant-id dilemma_blocked_forbidden_unwatched_sanctioned_blocker \
  --variant-id dilemma_blocked_forbidden_unwatched_prompt_policy \
  openrouter/google/gemini-3-flash-preview \
  openrouter/google/gemini-3.1-pro-preview

Outputs

Benchmark runs write under logs/<timestamp>_<git-hash>/.

Each run writes run_manifest.json with git state, Python version, package versions, uv.lock hash, compose/image provenance, and Docker versions.
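
To inspect a manifest without assuming its schema, a small loader is enough. The sketch below assumes run_manifest.json sits directly inside the run directory and that its top level is a JSON object; the helper names are hypothetical, and no particular key names are assumed.

```python
import json
from pathlib import Path


def load_run_manifest(run_dir):
    """Load run_manifest.json from a run directory such as logs/<timestamp>_<git-hash>/."""
    manifest_path = Path(run_dir) / "run_manifest.json"
    with manifest_path.open(encoding="utf-8") as handle:
        return json.load(handle)


def manifest_sections(run_dir):
    """Return the sorted top-level keys of the manifest (assumes a JSON object)."""
    return sorted(load_run_manifest(run_dir))
```

Calling manifest_sections on a run directory lists which provenance sections were recorded, which is a quick way to compare two runs before diffing their contents.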

Tool-trace artifacts are written under tool_analysis/:

  • deterministic_state_table.csv
  • tool_trace_event_table.csv
  • tool_trace_sample_table.csv
  • tool_trace_completeness_table.csv
  • tool_trace_aggregate_table.csv
  • tool_trace_outcome_association_table.csv
  • tool_trace_pooled_aggregate_table.csv
  • tool_trace_pooled_outcome_association_table.csv
  • tool_trace_report.md
  • optional audit artifacts when IC_ENABLE_TOOL_TRACE_AUDIT=1
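
A quick completeness check over these files might look like the sketch below. The file names are copied from the list above; the helper name is hypothetical, and placing tool_analysis/ directly under the run directory is an assumption. Optional audit artifacts are not checked.

```python
from pathlib import Path

# File names copied from the artifact list above.
EXPECTED_ARTIFACTS = (
    "deterministic_state_table.csv",
    "tool_trace_event_table.csv",
    "tool_trace_sample_table.csv",
    "tool_trace_completeness_table.csv",
    "tool_trace_aggregate_table.csv",
    "tool_trace_outcome_association_table.csv",
    "tool_trace_pooled_aggregate_table.csv",
    "tool_trace_pooled_outcome_association_table.csv",
    "tool_trace_report.md",
)


def missing_tool_trace_artifacts(run_dir):
    """Return expected artifact names not found in <run_dir>/tool_analysis/."""
    analysis_dir = Path(run_dir) / "tool_analysis"  # assumed location
    return [name for name in EXPECTED_ARTIFACTS if not (analysis_dir / name).is_file()]
```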

Scan assistant response text for evaluation-awareness cues:

.venv/bin/python scripts/scan_eval_awareness.py logs/<run>

The scanner writes eval_awareness/ under the run directory by default. It reports a strict tier for direct evaluation-awareness language and a broad tier for sensitivity checks that may include monitoring, simulation, roleplay, scoring, or judging language. By default it scans assistant response text; pass --include-reasoning-summaries to also scan unencrypted reasoning summaries. Encrypted reasoning payloads are counted but not scanned.
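
The two-tier idea can be sketched as follows. This is a toy illustration, not scripts/scan_eval_awareness.py: every pattern below is hypothetical, the broad patterns are loosely based on the categories named above, and reporting only the narrowest matching tier is an assumption.

```python
import re

# Hypothetical cue patterns for illustration; the real scanner's cue lists are not shown here.
STRICT_CUES = [
    r"\bbeing evaluated\b",
    r"\bthis is (?:a|an) (?:test|evaluation)\b",
]
BROAD_CUES = [
    r"\bmonitor(?:ed|ing)\b",
    r"\bsimulat(?:ed|ion)\b",
    r"\broleplay\b",
    r"\bscor(?:ed|ing)\b",
    r"\bjudg(?:ed|ing)\b",
]


def classify_text(text):
    """Return 'strict', 'broad', or None for one assistant response."""
    lowered = text.lower()
    if any(re.search(pattern, lowered) for pattern in STRICT_CUES):
        return "strict"
    if any(re.search(pattern, lowered) for pattern in BROAD_CUES):
        return "broad"
    return None


if __name__ == "__main__":
    print(classify_text("I suspect I am being evaluated here."))
    print(classify_text("This environment may be a simulation."))
```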

Generated outputs are ignored by git and are not part of this release.

Repository Map

  • inspect_ic/aio_tools.py: sandbox tool wrappers exposed to agents.
  • inspect_ic/tasks/_base/: shared framing, scoring, variants, trace, and audit helpers.
  • inspect_ic/tasks/<task>/: seven benchmark task implementations and assets.
  • sandboxes/: Docker Compose files for AIO and anvil-backed tasks.
  • scripts/run_full_eval.sh: canonical benchmark runner.
  • scripts/run_selected_eval.sh: selected-task/model runner.
  • scripts/run_validation_suite.sh: deterministic local validation gate.
  • scripts/run_readiness_smoke.sh: per-model readiness gate.
  • scripts/analyze_tool_traces.py, scripts/audit_tool_traces.py, scripts/render_tool_trace_report.py: trace analysis tools.
  • scripts/scan_eval_awareness.py: evaluation-awareness cue scanner.
  • tests/: scorer, runner, trace, and audit coverage.

Documentation Authority

For this release, the executable benchmark contract is defined by:

  1. scripts/run_full_eval.sh for default run behavior.
  2. inspect_ic/tasks/_base/variant_profiles.py for benchmark and diagnostic variant handling.
  3. inspect_ic/tasks/<task>/task_<task>.py for task-specific behavior.

Citation

@misc{wiedermannmoller2026instrumentalchoices,
  title         = {Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors},
  author        = {Jonas Wiedermann-Möller and Leonard Dung and Maksym Andriushchenko},
  year          = {2026},
  eprint        = {2605.06490},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  doi           = {10.48550/arXiv.2605.06490},
  url           = {https://arxiv.org/abs/2605.06490}
}

Licence

The code in this repository is released under the MIT Licence. The paper is distributed separately under its own Creative Commons licence.
