Learner.checkpoint_interval should be Meta, not Param #3

@bpiwowar

Issue

Learner.checkpoint_interval is currently declared as a Param:

https://github.com/experimaestro/xpm-torch/blob/main/src/xpm_torch/learner.py#L146

checkpoint_interval: Param[int] = field(default=1, ignore_default=True)

This means changing it (e.g. from 1 to 15 to reduce checkpoint I/O on short epochs) invalidates the task hash and forces a fresh task directory. Since checkpoint_interval only controls how often state is persisted to disk — not the optimisation trajectory — it should be Meta so it can be tuned across runs without losing cached training state.

Suggested fix

checkpoint_interval: Meta[int] = field(default=1, ignore_default=True)
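
To illustrate why the distinction matters, here is a toy model of an identity hash that covers only parameter-like fields. It is a sketch using only the standard library, not experimaestro's actual hashing code: the `param`/`meta` helpers, the `identity` function, and the `max_epochs` field are all hypothetical, and the `ignore_default` option is not modeled.

```python
import hashlib
import json
from dataclasses import dataclass, field, fields


# Hypothetical stand-ins for Param / Meta: a flag stored in the field metadata.
def param(default):
    return field(default=default, metadata={"in_identity": True})


def meta(default):
    return field(default=default, metadata={"in_identity": False})


def identity(config) -> str:
    """Hash only the fields flagged as part of the task identity."""
    payload = {
        f.name: getattr(config, f.name)
        for f in fields(config)
        if f.metadata["in_identity"]
    }
    encoded = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


@dataclass
class LearnerWithParam:
    max_epochs: int = param(100)  # hypothetical field, for illustration only
    checkpoint_interval: int = param(1)  # current behaviour: part of the hash


@dataclass
class LearnerWithMeta:
    max_epochs: int = param(100)  # hypothetical field, for illustration only
    checkpoint_interval: int = meta(1)  # suggested behaviour: excluded from the hash


# Param: changing the interval yields a different identity (fresh task directory).
param_changed = identity(LearnerWithParam(checkpoint_interval=1)) != identity(
    LearnerWithParam(checkpoint_interval=15)
)

# Meta: changing the interval leaves the identity untouched (cached state is reused).
meta_unchanged = identity(LearnerWithMeta(checkpoint_interval=1)) == identity(
    LearnerWithMeta(checkpoint_interval=15)
)
```

In this model, only fields that influence the result of training belong in the identity; a persistence cadence does not, which is the argument for `Meta`.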

Context

Hit while tuning the checkpoint cadence on a multi-day distillation run: after switching steps_per_epoch from 8000 to 200, checkpointing every epoch wrote to disk far too often, but raising checkpoint_interval from 1 to 15 forced a re-submission with a fresh hash.
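
The cadence arithmetic behind this can be sketched as follows, assuming a checkpoint is written once every `checkpoint_interval` epochs (an assumption about the semantics; the helper name is hypothetical):

```python
def steps_between_checkpoints(steps_per_epoch: int, checkpoint_interval: int) -> int:
    """Training steps elapsed between two consecutive checkpoints,
    assuming one checkpoint every `checkpoint_interval` epochs."""
    return steps_per_epoch * checkpoint_interval


original = steps_between_checkpoints(steps_per_epoch=8000, checkpoint_interval=1)
short_epochs = steps_between_checkpoints(steps_per_epoch=200, checkpoint_interval=1)
tuned = steps_between_checkpoints(steps_per_epoch=200, checkpoint_interval=15)

print(original, short_epochs, tuned)  # prints: 8000 200 3000
```

Under that assumption, shortening the epoch makes checkpoints 40 times more frequent, and an interval of 15 brings the cadence back to one checkpoint every 3000 steps.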
