Convert Inspect AI tasks into Tinker RL environments for reinforcement learning training.
This bridge enables training language models with RL using Inspect's rich ecosystem of evaluation tasks (datasets, scorers, solvers, sandboxes) as the source of prompts and reward signals.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β INITIALIZATION PHASE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Inspect Task Function Tinker Renderer
(e.g., gsm8k, coding_task) (e.g., LlamaRenderer)
β β
βββββββββββββββββ¬ββββββββββββββββββββββββ
βΌ
βββββββββββββββββββββββ
β load_environment() β β Main Entry Point
β loader.py β
ββββββββββββ¬βββββββββββ
β
ββββββββββββββββββΌβββββββββββββββββ
βΌ βΌ βΌ
βββββββββββββββββββ βββββββββββββββββ ββββββββββββββββββββββ
βload_inspect_taskβ β Validate β βinspect_dataset_to_ β
β tasks.py β β Configuration β β hf() dataset.py β
ββββββββββ¬βββββββββ βββββββββββββββββ βββββββββββ¬βββββββββββ
β β
βΌ β
βββββββββββββββββββ β
β InspectTaskInfo β β
β - task β β
β - scorers β β
β - sandbox_type β β
β - dataset β β
βββββββββββββββββββ β
βΌ
ββββββββββββββββββββββββββββββββββββββββ
β For each Sample in Inspect Dataset β
ββββββββββββββββββββββββ¬ββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββ
β get_ground_truth_messages() β
β ground_truth.py β
β β
β Run solver chain WITHOUT model β
β inference to get transformed prompt β
β (system message, few-shot, etc.) β
ββββββββββββββββββββββββ¬ββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββ
β HuggingFace Dataset β
β βββββββββββββββββββββββββββββββββββ β
β β Row: {prompt, answer, info, id} β β
β β β β
β β prompt: List[message dicts] β β
β β answer: target answer β β
β β info: Inspect metadata (JSON) β β
β βββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββ¬ββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββ
β InspectRLDataset β
β env.py β
β β
β Wraps HF dataset for Tinker trainingβ
ββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β TRAINING PHASE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
InspectRLDataset
β
βΌ
ββββββββββββββββββββββββββββββββββ
β get_batch(batch_index) β
ββββββββββββββββββ¬ββββββββββββββββ
β
ββββββββββββββββββΌβββββββββββββββββ
βΌ βΌ βΌ
βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ
βInspectEnvGroup β βInspectEnvGroup β βInspectEnvGroup β
βBuilder (problem 1)β βBuilder (problem 2)β βBuilder (problem N)β
βββββββββββ¬ββββββββββ βββββββββββ¬ββββββββββ βββββββββββ¬ββββββββββ
β β β
βΌ βΌ βΌ
make_envs() make_envs() make_envs()
β β β
βΌ βΌ βΌ
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β[InspectEnv Γ N] β β[InspectEnv Γ N] β β[InspectEnv Γ N] β
β (parallel β β (parallel β β (parallel β
β rollouts) β β rollouts) β β rollouts) β
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β EPISODE EXECUTION β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
InspectEnv
β
βΌ
βββββββββββββββββββββββββββ
β initial_observation() β
ββββββββββββββ¬βββββββββββββ
β
ββββββββββββββββββββββΌβββββββββββββββββββββ
βΌ βΌ
(if multi-turn with sandbox) Convert prompt messages
β to Tinker format
βΌ β
ββββββββββββββββββββββ β
βcreate_sandbox_for_ β β
βsample() sandbox.py β β
β β β
β - Init Docker/localβ β
β - Resolve files β β
β - Run setup script β β
ββββββββββββββββββββββ β
βΌ
βββββββββββββββββββββββββββ
β Return: Observation β
β (tokenized prompt + β
β stop sequences) β
βββββββββββββββββββββββββββ
β
β
βββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP LOOP β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Model generates tokens
β
βΌ
βββββββββββββββββββ
β step(action) β
ββββββββββ¬βββββββββ
β
βΌ
βββββββββββββββββββββββββββ
β renderer.parse_response β
β (tokens β Message) β
ββββββββββββββ¬βββββββββββββ
β
βΌ
βββββββββββββββββββββββββββ
β Append to conversation β
β Increment turn counter β
ββββββββββββββ¬βββββββββββββ
β
βΌ
βββββββββββββββββββββββββββ
β _should_end_episode()? β
ββββββββββββββ¬βββββββββββββ
β
ββββββββββββββ΄βββββββββββββ
βΌ βΌ
[Episode Done] [Continue]
β β
β βΌ
β βββββββββββββββββββββββββββ
β β _execute_tools() β
β β β
β β Tool calls supported: β
β β - submit(): end episode β
β β - bash(): run command β
β β - python(): run code β
β β β
β β Executes in sandbox via β
β β exec_in_sandbox() β
β ββββββββββββββ¬βββββββββββββ
β β
β βΌ
β βββββββββββββββββββββββββββ
β β Return: next observationβ
β β (with tool results) β
β βββββββββββββββββββββββββββ
β β
β ββββββββ
β β
β ββββββββββββββββββββββββ
β β
β βΌ
β (Back to Model)
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β EPISODE TERMINATION β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββ
β _compute_reward() β
ββββββββββββββ¬βββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββ
β scoring.py β
β β
β βββββββββββββββββββββββββββββββββββββββββ β
β β 1. Convert conversation to Inspect β β
β β ChatMessage format β β
β β β β
β β 2. Build TaskState with: β β
β β - Original sample input β β
β β - Conversation history β β
β β - Model output (last assistant) β β
β β β β
β β 3. Set up sandbox context (if needed) β β
β β via sandbox_context() β β
β β β β
β β 4. Run Inspect scorers β β
β β (exact_match, model_graded, etc.) β β
β β β β
β β 5. Convert Score β float reward β β
β β (combine multiple with weights) β β
β βββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββ
β cleanup_sandbox() β
β (if sandbox exists) β
ββββββββββββββ¬βββββββββββββ
β
βΌ
βββββββββββββββββββββββββββ
β Return: StepResult β
β - reward β
β - done=True β
β - info (scorer β
β breakdown) β
βββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β EPISODE TYPES COMPARISON β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
SINGLE-TURN MULTI-TURN
βββββββββββ ββββββββββ
initial_observation() initial_observation()
β β
βΌ βΌ
step(action) step(action)
β β
βΌ βΌ
[Always terminates] [Check: submit() called?
β max_turns reached?]
β β
βΌ ββββββββ΄βββββββ
compute_reward() βΌ βΌ
β [No] [Yes]
βΌ β β
Return result βΌ βΌ
execute_tools() compute_reward()
β β
βΌ βΌ
Return obs Return result
β
βββββ (loop back to step)
| Component | File | Purpose |
|---|---|---|
load_environment |
loader.py |
Main entry point - orchestrates initialization |
InspectTaskInfo |
tasks.py |
Task introspection - extracts scorers, sandbox config |
get_ground_truth_messages |
ground_truth.py |
Runs solver chain without model to get prompts |
inspect_dataset_to_hf |
dataset.py |
Converts Inspect Dataset β HuggingFace Dataset |
InspectRLDataset |
env.py |
Tinker RLDataset wrapper for batching |
InspectEnvGroupBuilder |
env.py |
Creates parallel environments per problem |
InspectEnv |
env.py |
Core Tinker Env - handles observations, steps, rewards |
compute_reward |
scoring.py |
Runs Inspect scorers on conversation |
SandboxConfig/Instance |
sandbox.py |
Manages Docker/local sandbox lifecycle |
from inspect_evals.gsm8k import gsm8k
from tinker_cookbook.renderers import get_renderer
from tinker_cookbook.tokenizer_utils import get_tokenizer
from inspect_tinker_bridge import load_environment
tokenizer = get_tokenizer("meta-llama/Llama-3.1-8B-Instruct")
renderer = get_renderer("LlamaRenderer", tokenizer=tokenizer)
dataset = load_environment(
gsm8k,
renderer=renderer,
env_type="single_turn",
max_samples=100,
batch_size=8,
num_envs_per_group=4,
)from examples.coding_task import coding_task
from inspect_tinker_bridge import load_environment
dataset = load_environment(
coding_task,
renderer=renderer,
env_type="multi_turn",
max_turns=10,
batch_size=4,
sandbox_type="docker",
)| Parameter | Type | Default | Description |
|---|---|---|---|
task |
Callable |
Required | Inspect task factory function |
renderer |
Renderer |
Required | Tinker message renderer/tokenizer |
env_type |
str |
"single_turn" |
"single_turn" or "multi_turn" |
max_samples |
int | None |
None |
Limit dataset size |
max_turns |
int |
10 |
Max turns per episode (multi-turn only) |
num_envs_per_group |
int |
1 |
Parallel rollouts per problem |
batch_size |
int |
1 |
Problems per training batch |
sandbox_type |
str | None |
None |
"docker" or "local" |
sandbox_config |
str | None |
None |
Path to sandbox config file |
submit_instruction |
str | None |
(default msg) | System instruction for submit tool |