Labels: question (further information is requested)
Description
Thanks for your work! I wanted to follow up on a question I posted on HF earlier, which may have been overlooked.
While reading your paper, I noticed that in the comparison between the SuperBPE and standard BPE models, the SuperBPE model is trained with a shorter context length and more training steps (with FLOPs held constant). I'm particularly interested in how the model performs under these two scenarios:
- The current SuperBPE model evaluated at the same number of training steps as the baseline (i.e., early-stopped checkpoints).
- Re-training with a context length of 4096 for the same number of training steps.
I think these might better reflect real training practices.
Looking forward to your insights!
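
To make the trade-off behind my question concrete, here is a minimal back-of-the-envelope sketch (my own, not from the paper) using the common ~6·N·D approximation for training FLOPs. The compute budget, model size, batch size, and the shorter SuperBPE context length of 3072 are all placeholder assumptions for illustration.

```python
# Rough sketch: with FLOPs held constant, a shorter context length means fewer
# tokens per step, so the budget is spent over proportionally more steps.
# Uses the approximate rule FLOPs ~= 6 * N * D (N = parameters, D = tokens seen).
# All concrete numbers are illustrative assumptions, not values from the paper.

def training_steps(flops_budget: float, n_params: float,
                   batch_size: int, context_length: int) -> float:
    """Number of steps that fit in the budget when each step processes
    batch_size * context_length tokens."""
    tokens_per_step = batch_size * context_length
    flops_per_step = 6 * n_params * tokens_per_step
    return flops_budget / flops_per_step

FLOPS_BUDGET = 1e21   # hypothetical fixed compute budget
N_PARAMS = 1e9        # hypothetical model size
BATCH_SIZE = 512      # sequences per step (assumed)

baseline_steps = training_steps(FLOPS_BUDGET, N_PARAMS, BATCH_SIZE, context_length=4096)
superbpe_steps = training_steps(FLOPS_BUDGET, N_PARAMS, BATCH_SIZE, context_length=3072)

print(f"context 4096 -> {baseline_steps:,.0f} steps")
print(f"context 3072 -> {superbpe_steps:,.0f} steps")
```

Under these assumptions the shorter-context run gets roughly 4/3 as many steps, which is exactly the discrepancy I'd like to see controlled for in the two scenarios above.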