Experimental designs vary widely, as do the metrics used to compare results.
The results we present use, for the moment, the simplest design: all training and validation is done on the default train split, and we evaluate once on the test set.
There are alternatives: we could instead perform stratified resamples or cross-validate.
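
To make the two alternatives concrete, here is a minimal sketch in plain Python. It is illustrative only and is not the procedure used for the results reported here. It assumes that a stratified resample means pooling the default train and test cases and redrawing a split that keeps the original number of training cases per class, and that cross-validation means stratified k-fold over the train split. The function names `stratified_resample` and `stratified_kfold` are hypothetical.

```python
import random
from collections import defaultdict


def _indices_by_class(labels):
    """Group positions in `labels` by their class label."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    return groups


def stratified_resample(y_train, y_test, seed):
    """Pool train and test labels and redraw a train/test split.

    Assumption: the new split keeps the original per-class train counts.
    Returns (train_idx, test_idx) as sorted indices into the pooled labels.
    """
    pooled = list(y_train) + list(y_test)
    rng = random.Random(seed)
    train_counts = {c: len(ix) for c, ix in _indices_by_class(y_train).items()}
    train_idx, test_idx = [], []
    groups = _indices_by_class(pooled)
    # Sorting the labels keeps the result reproducible for a given seed
    # (labels must be mutually comparable, e.g. all ints or all strings).
    for label in sorted(groups):
        idx = groups[label]
        rng.shuffle(idx)
        k = train_counts.get(label, 0)
        train_idx.extend(idx[:k])
        test_idx.extend(idx[k:])
    return sorted(train_idx), sorted(test_idx)


def stratified_kfold(y, k, seed):
    """Yield (train_idx, val_idx) pairs for k folds.

    Each class is shuffled and dealt round-robin across the folds, so
    class proportions are roughly preserved in every fold.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    groups = _indices_by_class(y)
    for label in sorted(groups):
        idx = groups[label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


if __name__ == "__main__":
    y_train = [0, 0, 0, 1, 1]
    y_test = [0, 0, 1, 1, 1, 1]
    train_idx, test_idx = stratified_resample(y_train, y_test, seed=1)
    print("resampled train indices:", train_idx)
    for train, val in stratified_kfold(y_train, k=2, seed=0):
        print("cv train:", train, "cv val:", val)
```

Under either scheme the reported metric would typically be averaged over the resamples or folds, at the cost of one full training run per resample or fold.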