Two evaluation mechanisms run during training:

- Perplexity / eval loss: computed on the held-out eval split of the training data distribution by running `_eval_step()`. Logged as `eval_loss` and `eval_accuracy` (per-token accuracy). A minimal sketch of this step follows the list.
- Task evaluators (e.g. HellaSwag): loaded from `data.eval_datasets` in config. Each evaluator is a `Task` subclass that runs independently of the training data.
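To make the first mechanism concrete, here is a minimal sketch of an eval-loss step for a causal LM. It assumes the forward pass returns next-token logits of shape `(batch, seq, vocab)`; the actual `_eval_step()` in ironcore is not shown here and may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_step(model, input_ids):
    # Causal LM: the logits at position t predict the token at position t+1.
    logits = model(input_ids)[:, :-1, :]        # (batch, seq-1, vocab), assumed shape
    targets = input_ids[:, 1:]                  # (batch, seq-1)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),    # flatten to (N, vocab)
        targets.reshape(-1),                    # flatten to (N,)
    )
    # Per-token accuracy: fraction of positions where the argmax prediction
    # matches the actual next token.
    accuracy = (logits.argmax(dim=-1) == targets).float().mean()
    return {"eval_loss": loss.item(), "eval_accuracy": accuracy.item()}
```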
`Task` in `ironcore/eval/tasks/base_task.py` is the abstract base class for all evaluators.

`process(model)` → metrics dict

Internal steps (a code sketch follows the list):

- `_preprocess()` — load and transform the dataset (using HuggingFace `datasets.map()`)
- `_get_dataloader()` — create a `DataLoader` from the preprocessed dataset
- For each batch: `_get_batch(batch)` → `_do_predict(model, inputs)` → accumulate
- `_get_score(...)` → return metrics dict
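A simplified skeleton of that lifecycle, for orientation only. The signatures beyond `process(model)` (whether `_get_dataloader()` takes the dataset as an argument, how per-batch results accumulate) are assumptions, not the actual `base_task.py`.

```python
from abc import ABC, abstractmethod

class Task(ABC):
    """Illustrative skeleton of the evaluator lifecycle described above."""

    def process(self, model) -> dict:
        dataset = self._preprocess()                 # HF datasets.map() transform
        dataloader = self._get_dataloader(dataset)   # wrap in a DataLoader
        results = []
        for batch in dataloader:
            inputs, labels = self._get_batch(batch)
            results.append((self._do_predict(model, inputs), labels))
        return self._get_score(results)              # e.g. {"score": ...}

    @abstractmethod
    def _preprocess(self): ...

    @abstractmethod
    def _get_dataloader(self, dataset): ...

    @abstractmethod
    def _get_batch(self, batch): ...

    @abstractmethod
    def _do_predict(self, model, inputs): ...

    @abstractmethod
    def _get_score(self, results) -> dict: ...
```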
`get_evaluators()` in `ironcore/eval/evaluator.py` loads evaluators dynamically:

```python
module = importlib.import_module(f"ironcore.eval.tasks.{task_name}")
evaluator_class = next(
    cls
    for cls in vars(module).values()
    if isinstance(cls, type) and issubclass(cls, Task) and cls != Task
)
```

The task name must match the module filename (lowercase). For example, `{name: hellaswag}` loads `ironcore/eval/tasks/hellaswag.py` and finds the `HellaSwag` class.
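For illustration, wiring the loaded class up might look like the following; the constructor signature is an assumption, since only `process(model)` is documented above. Note also that the `next(...)` expression raises `StopIteration` if the module defines no `Task` subclass, and `importlib.import_module()` raises `ModuleNotFoundError` if the name matches no module file.

```python
# Hypothetical wiring; the constructor arguments are an assumption.
evaluator = evaluator_class()
metrics = evaluator.process(model)   # e.g. {"score": 0.41}
```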
`HellaSwag` in `ironcore/eval/tasks/hellaswag.py` measures commonsense NLI accuracy.

Format: prompt = `ctx_a + " " + ctx_b`, plus four candidate continuations. Each (prompt, continuation) pair is scored independently.

Scoring: per-token average cross-entropy loss. The candidate with the lowest loss is the model's prediction. Accuracy = fraction of questions where the predicted continuation matches the label.

`_do_predict()` calls `model(input_ids, labels=None)` to get logits, then computes `F.cross_entropy(..., reduction="mean")` per sample.
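As a sketch of that scoring loop (not the actual `hellaswag.py`): the masking convention below, scoring only the continuation tokens, is an assumption, since the text above does not specify whether prompt tokens contribute to the per-sample loss. It also assumes the model call returns a raw logits tensor and that the prompt tokenizes identically with and without a continuation appended.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_continuation(model, tokenizer, prompt, candidates):
    """Return the index of the candidate with the lowest per-token CE loss."""
    losses = []
    prompt_len = len(tokenizer(prompt)["input_ids"])
    for candidate in candidates:
        input_ids = torch.tensor([tokenizer(prompt + " " + candidate)["input_ids"]])
        logits = model(input_ids, labels=None)        # (1, seq, vocab), assumed
        # Position t predicts token t+1; keep only continuation positions.
        cont_logits = logits[0, prompt_len - 1 : -1]  # (cont_len, vocab)
        cont_targets = input_ids[0, prompt_len:]      # (cont_len,)
        losses.append(F.cross_entropy(cont_logits, cont_targets, reduction="mean"))
    return int(torch.stack(losses).argmin())          # lowest loss wins
```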
To add a new task evaluator:

- Create `ironcore/eval/tasks/<name>.py` (lowercase, matches the config `name` field).
- Define a class that subclasses `Task`.
- Implement (see the sketch after this list):
  - `_preprocess(examples)` — HF datasets `map()` function; returns expanded dict.
  - `_get_batch(batch)` — extracts inputs and labels from a dataloader batch.
  - `_do_predict(model, tokenized_inputs)` — runs the model and returns per-sample scores.
  - `_get_score(...)` — aggregates scores and returns a metrics dict with at least `"score"`.
- Add the task to `data.eval_datasets` in config.
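A minimal sketch of such a file, say a hypothetical `ironcore/eval/tasks/mytask.py` registered as `{name: mytask}`. The dataset fields, model output shape, and exact method signatures are assumptions, mirroring the skeleton shown earlier.

```python
import torch
from ironcore.eval.tasks.base_task import Task  # import path per this page

class MyTask(Task):
    """Hypothetical accuracy task; field names below are illustrative."""

    def _preprocess(self, examples):
        # datasets.map()-style function: expand raw fields into model-ready text.
        return {
            "text": [q + " " + a for q, a in zip(examples["question"], examples["answer"])],
            "label": examples["label"],
        }

    def _get_batch(self, batch):
        # Pull model inputs and gold labels out of a dataloader batch.
        return batch["input_ids"], batch["label"]

    def _do_predict(self, model, tokenized_inputs):
        with torch.no_grad():
            # Assumes (batch, num_classes) logits for this toy task.
            logits = model(tokenized_inputs, labels=None)
        return logits.argmax(dim=-1)   # per-sample predictions

    def _get_score(self, results):
        # Aggregate accumulated (predictions, labels) pairs into metrics.
        correct = total = 0
        for preds, labels in results:
            correct += (preds == labels).sum().item()
            total += labels.numel()
        return {"score": correct / total}
```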
| Field | Default | Description |
|---|---|---|
| `trainer.do_eval` | `false` | Enable evaluation during training |
| `trainer.eval_batch_size` | `null` | Batch size for evaluators (falls back to `micro_batch_size`) |
| `operation.eval_interval` | `100` | Evaluate every N steps (gated by `trainer.do_eval`) |
| `operation.eval_samples` | `100` | Number of samples for eval-loss evaluation |
| `data.eval_datasets` | `[]` | List of task evaluator configs |
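Putting these together, a config fragment enabling both mechanisms might look like the following; the values are illustrative, not defaults. The `data.eval_datasets` entry format is shown in the example below.

```yaml
trainer:
  do_eval: true          # evaluation is off by default
  eval_batch_size: 32    # null falls back to micro_batch_size
operation:
  eval_interval: 500     # evaluate every 500 steps
  eval_samples: 200      # samples drawn for eval loss
```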
```yaml
data:
  eval_datasets:
    - name: hellaswag
      source: Rowan/hellaswag
      max_samples: 1000
```

Each entry is a `DatasetConfig`; `name` selects the `Task` module under `ironcore/eval/tasks/`.