Add a plugin extension point for modifying tasks at load time (e.g., injecting cross-cutting scorers) #3973

@surgepower

Summary

Inspect AI's plugin system supports @scorer, @solver, @task, @tool, @hooks, and custom model providers — but there is no extension point that lets a plugin modify a Task at load time before its scorers are bound to the runner. This means deployments that maintain cross-cutting quality metrics across many tasks (rubric judges, length budgets, refusal detectors, compliance scorers) cannot add those metrics centrally — they must edit each task's @task factory or fork the upstream package.

For deployments that consume tasks from multiple sources (inspect_evals, inspect_swe, third-party benchmark suites, in-house tasks), this becomes especially painful: scorers can't be injected without per-task edits to packages we don't own.

Concrete motivation

A few use cases that hit this gap:

  1. Cross-cutting quality metrics. Pair instruction-following benchmarks (e.g. inspect_evals/ifbench) with a complete/helpful/concise rubric judge scored by a strong LLM. The pairing should be configurable per-deployment without forking inspect_evals.
  2. Centralized policy/compliance scorers. Security teams want to add a refusal-rate or content-filter scorer to every task without coordinating with task maintainers.
  3. Multi-tenant deployments composing tasks from multiple packages. Scorer injection should compose across inspect_evals + inspect_swe + in-house tasks without forking any of them.
  4. Telemetry-adjacent observers that need scoring participation. Some quality observers need to be participants in the scoring pipeline (so failures fail the eval), not pure observers.

Why the existing extension points are not sufficient

I traced the current API and found no path that achieves this:

  • Hooks are observe-only at scoring time. SampleScoring carries only IDs (eval_set_id, run_id, eval_id, sample_id); a hook handler there cannot run a scorer because it has no TaskState. SampleEnd fires after log_sample() has serialized the sample, so mutations to data.sample.scores do not persist. TaskStart carries EvalSpec (immutable metadata), not the Task object.
  • The CLI's --scorer flag replaces the task's scorers rather than appending to them. There is no --scorer-add or equivalent.
  • inspect score is out-of-process. It works post-hoc but is a separate invocation; failures don't fail the original eval, and operators have to wire two separate runs.

For deployments where a rubric scoring failure should fail the eval, inspect score is too loosely coupled to the original run.

Proposed extension point

A @task_modifier plugin registered via the existing inspect_ai entry-point group:

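  # task_modifier is the proposed decorator; it is not part of inspect_ai today.
  from inspect_ai import Task, task_modifier

  @task_modifier
  def my_overlay(task: Task) -> Task:
      """Apply deployment-wide overlays to any loaded task."""
      # resolve_overlays_for is a deployment-specific helper, not an Inspect API.
      extras = resolve_overlays_for(task.name)  # e.g., from a checked-in YAML
      if not extras:
          return task
      # Guard against a task that defines no scorer of its own.
      return task.copy(scorer=[*(task.scorer or []), *extras])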
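
To make the overlay lookup concrete, here is a minimal sketch of what a helper like resolve_overlays_for could do. Everything in it is an assumption for illustration: the OVERLAYS table stands in for the checked-in YAML, the function is named resolve_overlay_names because it returns scorer names rather than scorer objects, and the glob-style matching on task names is one possible policy, not something Inspect defines.

```python
from fnmatch import fnmatchcase

# Hypothetical overlay table: task-name pattern -> scorer names to append.
# A deployment might load this mapping from a checked-in YAML file instead.
OVERLAYS: dict[str, list[str]] = {
    "inspect_evals/ifbench": ["rubric_judge"],
    "inspect_evals/*": ["refusal_detector"],
    "*": ["length_budget"],
}


def resolve_overlay_names(task_name: str) -> list[str]:
    """Return the scorer names whose pattern matches task_name.

    Patterns are checked in table order and duplicates are dropped,
    so the result is stable across runs.
    """
    names: list[str] = []
    for pattern, scorers in OVERLAYS.items():
        if fnmatchcase(task_name, pattern):
            for name in scorers:
                if name not in names:
                    names.append(name)
    return names
```

Turning the returned names into scorer instances is left out here, since that step depends on how the deployment constructs its scorers.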

Inspect's task loader collects all registered modifiers and applies them in registration order after the task's own factory has run, before the task is bound to the runner. This sits naturally alongside the existing plugin types and keeps the scorer-modification logic in plugin code rather than in the CLI parsing layer.

Notes on the design:

  • Pure transformation, not in-place mutation — modifiers receive a Task and return a Task. Avoids ordering surprises if multiple modifiers run.
  • Modifiers run before any sample executes — no race with the runner; failures during modification fail eval startup, which is what we want.
  • Composes with everything else — modifiers can inspect task.name, task.metadata, the existing scorer list, etc., and decide whether/how to extend.
  • Discoverable — registered through the same inspect_ai entry-point group as the other plugin types.
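
The loader behavior described above can be sketched without touching Inspect itself. The snippet below is a stand-alone model under stated assumptions: Task is a frozen dataclass standing in for inspect_ai.Task, scorers are represented by plain strings, and task_modifier, _MODIFIERS, and apply_modifiers are hypothetical names for the proposed registry, not existing Inspect APIs.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Stand-in for inspect_ai.Task; frozen so modifiers cannot mutate it."""
    name: str
    scorer: tuple[str, ...] = ()


TaskModifier = Callable[[Task], Task]
_MODIFIERS: list[TaskModifier] = []  # hypothetical registry, in registration order


def task_modifier(fn: TaskModifier) -> TaskModifier:
    """Register fn as a modifier and return it unchanged."""
    _MODIFIERS.append(fn)
    return fn


def apply_modifiers(task: Task) -> Task:
    """Run every registered modifier in order; a bad return value fails startup."""
    for modifier in _MODIFIERS:
        result = modifier(task)
        if not isinstance(result, Task):
            raise TypeError(f"{modifier.__name__} must return a Task")
        task = result
    return task


@task_modifier
def add_refusal(task: Task) -> Task:
    # Applies to every task.
    return replace(task, scorer=(*task.scorer, "refusal_detector"))


@task_modifier
def add_rubric(task: Task) -> Task:
    # Applies only to tasks from one package; others pass through untouched.
    if not task.name.startswith("inspect_evals/"):
        return task
    return replace(task, scorer=(*task.scorer, "rubric_judge"))


original = Task(name="inspect_evals/ifbench", scorer=("instruction_following",))
loaded = apply_modifiers(original)
```

Because each modifier returns a new Task, original is left as the factory produced it, and the scorer order in loaded follows registration order.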

A simpler shape — a CLI-level --scorer-add flag — would solve a subset of the use cases (interactive runs) but wouldn't help programmatic inspect_ai.eval(...) callers or deployments that want the modification policy as code rather than as command-line arguments. A @task_modifier plugin can express the CLI flag's behavior as one entry-point registration, but not vice versa.
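
As a rough illustration of that last point, append-only behavior like a --scorer-add flag can be written as a small modifier factory. This is a sketch under the same assumptions as the proposal: Task here is a stand-in dataclass with string scorers, and scorer_add is a hypothetical helper, not an Inspect function or flag.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Stand-in for inspect_ai.Task, used only to keep the sketch runnable."""
    name: str
    scorer: tuple[str, ...] = ()


def scorer_add(*extra: str) -> Callable[[Task], Task]:
    """Build a modifier that appends scorers, skipping any already present."""
    def modifier(task: Task) -> Task:
        missing = tuple(s for s in extra if s not in task.scorer)
        # Return the same object when there is nothing to add.
        return replace(task, scorer=task.scorer + missing) if missing else task
    return modifier


append_policy = scorer_add("refusal_detector", "length_budget")
result = append_policy(Task("in_house/qa", scorer=("accuracy", "length_budget")))
```

A programmatic caller and a CLI wrapper could both register the same modifier, which is why the plugin shape covers the flag's use case.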

What I'd find most useful in a response

  1. Whether the shape above (or some variant) feels consistent with the plugin-system direction Inspect is taking.
  2. Whether there's an existing recommended pattern I missed — happy to use the supported path if there is one.

Thanks — Inspect's plugin model is in general one of the cleanest I've used, and @task_modifier (or the equivalent) feels like a natural addition.
