Add a plugin extension point for modifying tasks at load time (e.g., injecting cross-cutting scorers)

**Summary**
  
Inspect AI's plugin system supports `@scorer`, `@solver`, `@task`, `@tool`, `@hooks`, and custom model providers — but there is no extension point that lets a plugin **modify a `Task` at load time** before its scorers are bound to the runner. This means deployments that maintain cross-cutting quality metrics across many tasks (rubric judges, length budgets, refusal detectors, compliance scorers) cannot add those metrics centrally — they must edit each task's `@task` factory or fork the upstream package.

For deployments that consume tasks from multiple sources (`inspect_evals`, `inspect_swe`, third-party benchmark suites, in-house tasks), this becomes especially painful: scorers can't be injected without per-task edits to packages we don't own.
  
**Concrete motivation**
  
A few use cases that hit this gap:
  
1. **Cross-cutting quality metrics.** Pair instruction-following benchmarks (e.g. `inspect_evals/ifbench`) with a complete/helpful/concise rubric judge scored by a strong LLM. The pairing should be configurable per-deployment without forking `inspect_evals`.
2. **Centralized policy/compliance scorers.** Security teams want to add a refusal-rate or content-filter scorer to every task without coordinating with task maintainers.
3. **Multi-tenant deployments composing tasks from multiple packages.** Scorer injection should compose across `inspect_evals` + `inspect_swe` + in-house tasks without forking any of them.
4. **Telemetry-adjacent observers that need scoring participation.** Some quality observers need to be participants in the scoring pipeline (so failures fail the eval), not pure observers.
  
**Why the existing extension points are not sufficient**
  
I traced the current API and confirmed there is no path:
  
- **Hooks are observe-only at scoring time.** `SampleScoring` carries only IDs (`eval_set_id, run_id, eval_id, sample_id`); a hook handler there cannot run a scorer because it has no `TaskState`. `SampleEnd` fires *after* `log_sample()` has serialized the sample, so mutations to `data.sample.scores` do not persist. `TaskStart` carries `EvalSpec` (immutable metadata), not the `Task` object.
- **The CLI's `--scorer` flag overrides the task's scorers, not appends.** There is no `--scorer-add` or equivalent.
- **`inspect score` is out-of-process.** It works post-hoc but is a separate invocation; failures don't fail the original eval, and operators have to wire two separate runs.
  
For deployments where a rubric scoring failure should fail the eval, `inspect score` is too loose.
  
**Proposed extension point**
  
A `@task_modifier` plugin registered via the existing `inspect_ai` entry-point group:
```  
  from inspect_ai import Task, task_modifier
  
  @task_modifier
  def my_overlay(task: Task) -> Task:
      """Apply deployment-wide overlays to any loaded task."""
      extras = resolve_overlays_for(task.name)  # e.g., from a checked-in YAML
      if not extras:
          return task
      return task.copy(scorer=[*task.scorer, *extras])
```  
Inspect's task loader collects all registered modifiers and applies them in registration order after the task's own factory has run, before the task is bound to the runner. This sits naturally alongside the existing plugin types and keeps the scorer-modification logic in plugin code rather than in the CLI parsing layer.
  
Notes on the design:
  
- **Pure transformation, not in-place mutation** — modifiers receive a `Task` and return a `Task`. Avoids ordering surprises if multiple modifiers run.
- **Modifiers run before any sample executes** — no race with the runner; failures during modification fail eval startup, which is what we want.
- **Composes with everything else** — modifiers can inspect `task.name`, `task.metadata`, the existing scorer list, etc., and decide whether/how to extend.
- **Discoverable** — registered through the same `inspect_ai` entry-point group as the other plugin types.
  
A simpler shape — a CLI-level `--scorer-add` flag — would solve a subset of the use cases (interactive runs) but wouldn't help programmatic `inspect_ai.eval(...)` callers or deployments that want the modification policy as code rather than as command-line arguments. A `@task_modifier` plugin can express the CLI flag's behavior as one entry-point registration, but not vice versa.
  
**What I'd find most useful in a response**
  
1. Whether the shape above (or some variant) feels consistent with the plugin-system direction Inspect is taking.
2. Whether there's an existing recommended pattern I missed — happy to use the supported path if there is one.
  
Thanks — Inspect's plugin model is in general one of the cleanest I've used, and `@task_modifier` (or the equivalent) feels like a natural addition.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a plugin extension point for modifying tasks at load time (e.g., injecting cross-cutting scorers) #3973

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Add a plugin extension point for modifying tasks at load time (e.g., injecting cross-cutting scorers) #3973

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions