training is not properly restartable #37

@dhdaines

If you run sphinxtrain train and it fails (as it does, because the Perl scripts have bugs, but also as it might when using spot instances on $CLOUD), there's no easy way to restart training.

Back in the old days we would just sit in our offices at CMU all night running scripts_pl/NN.step/s***ve_confg.pl manually, but I would prefer to be in the forest picking mushrooms these days.

This isn't rocket science. At the very least it could just restart from the step and iteration where it failed, though of course it would be much better to rerun just the parts that failed. We're not even using GPUs, so there's no issue with repeatability when doing that.
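
To make the "restart from the step and iteration" idea concrete, here is a minimal sketch in Python. It is not SphinxTrain's actual code: the state file, the `run_resumable` function, and the `(step, iteration)` task tuples are all hypothetical names, and it assumes each step/iteration can be wrapped in a callable that raises on failure.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Set, Tuple

# Hypothetical task identity: (step name, iteration number).
Task = Tuple[str, int]


def load_done(state_file: Path) -> Set[Task]:
    """Return the (step, iteration) pairs recorded as completed."""
    if not state_file.exists():
        return set()
    with state_file.open() as fh:
        # JSON stores each pair as a list, so convert back to tuples.
        return {tuple(item) for item in json.load(fh)}


def save_done(state_file: Path, done: Set[Task]) -> None:
    """Write the state to a temp file, then rename it into place, so an
    interruption mid-write leaves the previous state file intact."""
    fd, tmp = tempfile.mkstemp(dir=str(state_file.parent))
    with os.fdopen(fd, "w") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, state_file)


def run_resumable(
    tasks: Iterable[Task],
    run: Callable[[Task], None],
    state_file: Path,
) -> None:
    """Run tasks in order, skipping any that a previous run completed."""
    done = load_done(state_file)
    for task in tasks:
        if task in done:
            continue  # finished before the last failure; do not redo it
        run(task)  # assumed to raise on failure, leaving the state untouched
        done.add(task)
        save_done(state_file, done)
```

Calling `run_resumable` again with the same task list and state file after a crash picks up at the first task that was not recorded. Recording state per task rather than per step is what would allow rerunning only the parts that failed, provided the tasks are deterministic, which the lack of GPUs makes plausible here.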
