Summary
Inspect AI's plugin system supports @scorer, @solver, @task, @tool, @hooks, and custom model providers — but there is no extension point that lets a plugin modify a Task at load time before its scorers are bound to the runner. This means deployments that maintain cross-cutting quality metrics across many tasks (rubric judges, length budgets, refusal detectors, compliance scorers) cannot add those metrics centrally — they must edit each task's @task factory or fork the upstream package.
For deployments that consume tasks from multiple sources (inspect_evals, inspect_swe, third-party benchmark suites, in-house tasks), this becomes especially painful: scorers can't be injected without per-task edits to packages we don't own.
Concrete motivation
A few use cases that hit this gap:
- Cross-cutting quality metrics. Pair instruction-following benchmarks (e.g. inspect_evals/ifbench) with a complete/helpful/concise rubric judge scored by a strong LLM. The pairing should be configurable per-deployment without forking inspect_evals.
- Centralized policy/compliance scorers. Security teams want to add a refusal-rate or content-filter scorer to every task without coordinating with task maintainers.
- Multi-tenant deployments composing tasks from multiple packages. Scorer injection should compose across inspect_evals + inspect_swe + in-house tasks without forking any of them.
- Telemetry-adjacent observers that need scoring participation. Some quality observers need to be participants in the scoring pipeline (so failures fail the eval), not pure observers.
Why the existing extension points are not sufficient
I traced the current API and found no in-process path for a plugin to add scorers to a task it does not own:
- Hooks are observe-only at scoring time. SampleScoring carries only IDs (eval_set_id, run_id, eval_id, sample_id); a hook handler there cannot run a scorer because it has no TaskState. SampleEnd fires after log_sample() has serialized the sample, so mutations to data.sample.scores do not persist. TaskStart carries EvalSpec (immutable metadata), not the Task object.
- The CLI's --scorer flag overrides the task's scorers rather than appending to them. There is no --scorer-add or equivalent.
- inspect score is out-of-process. It works post-hoc but is a separate invocation; failures don't fail the original eval, and operators have to wire two separate runs.
For deployments where a rubric scoring failure should fail the eval, inspect score is too loose.
Proposed extension point
A @task_modifier plugin registered via the existing inspect_ai entry-point group:
    from inspect_ai import Task, task_modifier  # task_modifier is the proposed API


    @task_modifier
    def my_overlay(task: Task) -> Task:
        """Apply deployment-wide overlays to any loaded task."""
        extras = resolve_overlays_for(task.name)  # e.g., from a checked-in YAML
        if not extras:
            return task
        return task.copy(scorer=[*task.scorer, *extras])
Inspect's task loader collects all registered modifiers and applies them in registration order after the task's own factory has run, before the task is bound to the runner. This sits naturally alongside the existing plugin types and keeps the scorer-modification logic in plugin code rather than in the CLI parsing layer.
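To make the loading order concrete, here is a minimal sketch of how the collection and application step might look. It runs without Inspect installed: Task below is a stand-in dataclass, not inspect_ai.Task, and the module-level registry, apply_modifiers, and the scorer names are all hypothetical illustrations of the proposal rather than anything Inspect ships today.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


# Stand-in for inspect_ai.Task so the sketch is self-contained (assumption).
@dataclass(frozen=True)
class Task:
    name: str
    scorer: Tuple[str, ...] = ()


TaskModifier = Callable[[Task], Task]

# Hypothetical registry. In the proposal this would be populated from the
# inspect_ai entry-point group rather than from a module-level list.
_MODIFIERS: List[TaskModifier] = []


def task_modifier(fn: TaskModifier) -> TaskModifier:
    """Register fn as a modifier and return it unchanged."""
    _MODIFIERS.append(fn)
    return fn


def apply_modifiers(task: Task) -> Task:
    """Apply modifiers in registration order; any exception propagates."""
    for modify in _MODIFIERS:
        result = modify(task)
        if not isinstance(result, Task):
            raise TypeError(f"{modify.__name__} must return a Task")
        task = result
    return task


@task_modifier
def add_refusal_detector(task: Task) -> Task:
    return replace(task, scorer=(*task.scorer, "refusal_detector"))


@task_modifier
def add_length_budget(task: Task) -> Task:
    return replace(task, scorer=(*task.scorer, "length_budget"))


# The loader would call this after the @task factory returns and before
# the task is handed to the runner.
original = Task(name="ifbench", scorer=("instruction_following",))
loaded = apply_modifiers(original)
```

Because each modifier returns a new Task, the object produced by the task's own factory is left untouched, and an exception raised inside any modifier surfaces before a single sample runs.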
Notes on the design:
- Pure transformation, not in-place mutation: modifiers receive a Task and return a Task. This avoids ordering surprises if multiple modifiers run.
- Modifiers run before any sample executes: there is no race with the runner, and failures during modification fail eval startup, which is what we want.
- Composes with everything else: modifiers can inspect task.name, task.metadata, the existing scorer list, etc., and decide whether and how to extend.
- Discoverable: registered through the same inspect_ai entry-point group as the other plugin types.
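As an illustration of a modifier deciding per task, the resolve_overlays_for helper used in the example above could be a simple pattern lookup. This is only a sketch under assumptions: the overlay table, its glob patterns, and the scorer names are hypothetical, and it assumes task names look like "inspect_evals/ifbench". A real deployment would likely load the table from its checked-in YAML and map names to actual scorer objects.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Hypothetical overlay table: glob pattern on the task name -> extra scorer
# names. Stands in for the contents of a checked-in YAML file.
OVERLAYS: Dict[str, List[str]] = {
    "inspect_evals/*": ["rubric_judge"],
    "*": ["refusal_detector"],
}


def resolve_overlays_for(
    task_name: str, overlays: Dict[str, List[str]] = OVERLAYS
) -> List[str]:
    """Collect scorer names from every pattern that matches task_name."""
    extras: List[str] = []
    for pattern, scorer_names in overlays.items():
        if fnmatchcase(task_name, pattern):
            for name in scorer_names:
                if name not in extras:  # keep first occurrence, drop repeats
                    extras.append(name)
    return extras
```

Returning an empty list for unmatched tasks lets the modifier return the task unchanged, so tasks outside the policy are not affected.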
A simpler shape — a CLI-level --scorer-add flag — would solve a subset of the use cases (interactive runs) but wouldn't help programmatic inspect_ai.eval(...) callers or deployments that want the modification policy as code rather than as command-line arguments. A @task_modifier plugin can express the CLI flag's behavior as one entry-point registration, but not vice versa.
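For example, append-style behavior could be written as a single modifier that reads its scorer list from configuration. The sketch below is hypothetical throughout: Task is a stand-in dataclass rather than inspect_ai.Task, EXTRA_SCORERS is an invented environment variable, and scorers are represented by name only.

```python
import os
from dataclasses import dataclass, replace
from typing import Callable, Mapping, Tuple


# Stand-in for inspect_ai.Task so the sketch is self-contained (assumption).
@dataclass(frozen=True)
class Task:
    name: str
    scorer: Tuple[str, ...] = ()


def scorer_add_modifier(
    env: Mapping[str, str] = os.environ,
) -> Callable[[Task], Task]:
    """Build a modifier that appends the scorers named in EXTRA_SCORERS.

    EXTRA_SCORERS is a hypothetical comma-separated variable, not an
    Inspect setting.
    """
    raw = env.get("EXTRA_SCORERS", "")
    names = tuple(part.strip() for part in raw.split(",") if part.strip())

    def modify(task: Task) -> Task:
        # Skip names the task already scores with, so appending is idempotent.
        new = tuple(n for n in names if n not in task.scorer)
        if not new:
            return task
        return replace(task, scorer=(*task.scorer, *new))

    return modify


modify = scorer_add_modifier({"EXTRA_SCORERS": "rubric_judge, length_budget"})
task = modify(Task("ifbench", ("instruction_following",)))
```

The same modifier would apply to both CLI runs and programmatic callers, since it hooks into task loading rather than argument parsing.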
What I'd find most useful in a response
- Whether the shape above (or some variant) feels consistent with the plugin-system direction Inspect is taking.
- Whether there's an existing recommended pattern I missed — happy to use the supported path if there is one.
Thanks — Inspect's plugin model is in general one of the cleanest I've used, and @task_modifier (or the equivalent) feels like a natural addition.