Two evaluation mechanisms run during training:

- Perplexity / eval loss: computed on the held-out eval split of the training data distribution by running `_eval_step()`. Logged as `eval_loss` and `eval_accuracy` (per-token accuracy). A minimal sketch of this step follows the list.
- Task evaluators (e.g. HellaSwag): loaded from `data.eval_datasets` in config. Each evaluator is a `Task` subclass that runs independently of the training data.
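To make the first mechanism concrete, here is a minimal sketch of an eval-loss step for a causal LM. It assumes the forward pass returns next-token logits of shape `(batch, seq, vocab)`; the actual `_eval_step()` in ironcore is not shown here and may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_step(model, input_ids):
    # Causal LM: the logits at position t predict the token at position t+1.
    logits = model(input_ids)[:, :-1, :]        # (batch, seq-1, vocab), assumed shape
    targets = input_ids[:, 1:]                  # (batch, seq-1)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),    # flatten to (N, vocab)
        targets.reshape(-1),                    # flatten to (N,)
    )
    # Per-token accuracy: fraction of positions where the argmax prediction
    # matches the actual next token.
    accuracy = (logits.argmax(dim=-1) == targets).float().mean()
    return {"eval_loss": loss.item(), "eval_accuracy": accuracy.item()}
```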
`Task` in `ironcore/eval/tasks/base_task.py` is the abstract base class for all evaluators.

`process(model)` → metrics dict

Internal steps (a code sketch follows the list):

- `_preprocess()` — load and transform the dataset (using HuggingFace `datasets.map()`)
- `_get_dataloader()` — create a `DataLoader` from the preprocessed dataset
- For each batch: `_get_batch(batch)` → `_do_predict(model, inputs)` → accumulate
- `_get_score(...)` → return metrics dict
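A simplified skeleton of that lifecycle, for orientation only. The signatures beyond `process(model)` (whether `_get_dataloader()` takes the dataset as an argument, how per-batch results accumulate) are assumptions, not the actual `base_task.py`.

```python
from abc import ABC, abstractmethod

class Task(ABC):
    """Illustrative skeleton of the evaluator lifecycle described above."""

    def process(self, model) -> dict:
        dataset = self._preprocess()                 # HF datasets.map() transform
        dataloader = self._get_dataloader(dataset)   # wrap in a DataLoader
        results = []
        for batch in dataloader:
            inputs, labels = self._get_batch(batch)
            results.append((self._do_predict(model, inputs), labels))
        return self._get_score(results)              # e.g. {"score": ...}

    @abstractmethod
    def _preprocess(self): ...

    @abstractmethod
    def _get_dataloader(self, dataset): ...

    @abstractmethod
    def _get_batch(self, batch): ...

    @abstractmethod
    def _do_predict(self, model, inputs): ...

    @abstractmethod
    def _get_score(self, results) -> dict: ...
```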
`get_evaluators()` in `ironcore/eval/evaluator.py` loads evaluators dynamically:

```python
module = importlib.import_module(f"ironcore.eval.tasks.{task_name}")
evaluator_class = next(
    cls
    for cls in vars(module).values()
    if isinstance(cls, type) and issubclass(cls, Task) and cls != Task
)
```

The task name must match the module filename (lowercase). For example, `{name: hellaswag}` loads `ironcore/eval/tasks/hellaswag.py` and finds the `HellaSwag` class.
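For illustration, wiring the loaded class up might look like the following; the constructor signature is an assumption, since only `process(model)` is documented above. Note also that the `next(...)` expression raises `StopIteration` if the module defines no `Task` subclass, and `importlib.import_module()` raises `ModuleNotFoundError` if the name matches no module file.

```python
# Hypothetical wiring; the constructor arguments are an assumption.
evaluator = evaluator_class()
metrics = evaluator.process(model)   # e.g. {"score": 0.41}
```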
`HellaSwag` in `ironcore/eval/tasks/hellaswag.py` measures commonsense NLI accuracy.

Format: prompt = `ctx_a + " " + ctx_b`, plus four candidate continuations. Each (prompt, continuation) pair is scored independently.

Scoring: per-token average cross-entropy loss. The candidate with the lowest loss is the model's prediction. Accuracy = fraction of questions where the predicted continuation matches the label.

`_do_predict()` calls `model(input_ids, labels=None)` to get logits, then computes `F.cross_entropy(..., reduction="mean")` per sample.
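As a sketch of that scoring loop (not the actual `hellaswag.py`): the masking convention below, scoring only the continuation tokens, is an assumption, since the text above does not specify whether prompt tokens contribute to the per-sample loss. It also assumes the model call returns a raw logits tensor and that the prompt tokenizes identically with and without a continuation appended.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_continuation(model, tokenizer, prompt, candidates):
    """Return the index of the candidate with the lowest per-token CE loss."""
    losses = []
    prompt_len = len(tokenizer(prompt)["input_ids"])
    for candidate in candidates:
        input_ids = torch.tensor([tokenizer(prompt + " " + candidate)["input_ids"]])
        logits = model(input_ids, labels=None)        # (1, seq, vocab), assumed
        # Position t predicts token t+1; keep only continuation positions.
        cont_logits = logits[0, prompt_len - 1 : -1]  # (cont_len, vocab)
        cont_targets = input_ids[0, prompt_len:]      # (cont_len,)
        losses.append(F.cross_entropy(cont_logits, cont_targets, reduction="mean"))
    return int(torch.stack(losses).argmin())          # lowest loss wins
```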
To add a new task evaluator:

- Create `ironcore/eval/tasks/<name>.py` (lowercase, matches the config `name` field).
- Define a class that subclasses `Task`.
- Implement (see the sketch after this list):
  - `_preprocess(examples)` — HF datasets `map()` function; returns expanded dict.
  - `_get_batch(batch)` — extracts inputs and labels from a dataloader batch.
  - `_do_predict(model, tokenized_inputs)` — runs the model and returns per-sample scores.
  - `_get_score(...)` — aggregates scores and returns a metrics dict with at least `"score"`.
- Add the task to `data.eval_datasets` in config.
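A minimal sketch of such a file, say a hypothetical `ironcore/eval/tasks/mytask.py` registered as `{name: mytask}`. The dataset fields, model output shape, and exact method signatures are assumptions, mirroring the skeleton shown earlier.

```python
import torch
from ironcore.eval.tasks.base_task import Task  # import path per this page

class MyTask(Task):
    """Hypothetical accuracy task; field names below are illustrative."""

    def _preprocess(self, examples):
        # datasets.map()-style function: expand raw fields into model-ready text.
        return {
            "text": [q + " " + a for q, a in zip(examples["question"], examples["answer"])],
            "label": examples["label"],
        }

    def _get_batch(self, batch):
        # Pull model inputs and gold labels out of a dataloader batch.
        return batch["input_ids"], batch["label"]

    def _do_predict(self, model, tokenized_inputs):
        with torch.no_grad():
            # Assumes (batch, num_classes) logits for this toy task.
            logits = model(tokenized_inputs, labels=None)
        return logits.argmax(dim=-1)   # per-sample predictions

    def _get_score(self, results):
        # Aggregate accumulated (predictions, labels) pairs into metrics.
        correct = total = 0
        for preds, labels in results:
            correct += (preds == labels).sum().item()
            total += labels.numel()
        return {"score": correct / total}
```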
| Field | Default | Description |
|---|---|---|
| `trainer.do_eval` | `false` | Enable evaluation during training |
| `trainer.eval_batch_size` | `null` | Batch size for evaluators (falls back to `micro_batch_size`) |
| `operation.eval_interval` | `100` | Evaluate every N steps (gated by `trainer.do_eval`) |
| `operation.eval_samples` | `100` | Number of samples for eval-loss evaluation |
| `data.eval_datasets` | `[]` | List of task evaluator configs |
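Putting these together, a config fragment enabling both mechanisms might look like the following; the values are illustrative, not defaults. The `data.eval_datasets` entry format is shown in the example below.

```yaml
trainer:
  do_eval: true          # evaluation is off by default
  eval_batch_size: 32    # null falls back to micro_batch_size
operation:
  eval_interval: 500     # evaluate every 500 steps
  eval_samples: 200      # samples drawn for eval loss
```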
```yaml
data:
  eval_datasets:
    - name: hellaswag
      source: Rowan/hellaswag
      max_samples: 1000
```

Each entry is a `DatasetConfig`; `name` selects the `Task` module under `ironcore/eval/tasks/`.