More comparison #4

@huyiwen

Description

Thanks for your work! I wanted to follow up on a question I posted on HF earlier, which may have been overlooked.

While reading your paper, I noticed that in the comparison between the SuperBPE and standard BPE models, the SuperBPE model is trained with a shorter context length and more training steps (with total training FLOPs held constant). I'm particularly interested in how the model performs under these two scenarios:

  1. The current SuperBPE model evaluated after the same number of training steps as the BPE baseline (i.e., early-stopped checkpoints).
  2. A SuperBPE model re-trained with a context length of 4096 for the same number of training steps as the BPE baseline.

I think these might better reflect real training practices.
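To make the setup I have in mind concrete, here is a rough sketch of the FLOPs-matching arithmetic as I understand it. The 6 * N * D approximation and all of the numbers (model size, batch size, FLOPs budget, context lengths) are placeholders I made up for illustration, not values from the paper.

```python
# Rough, assumption-laden sketch: at a fixed FLOPs budget, a shorter context
# length means fewer tokens per step, and therefore more optimizer steps.

def flops_per_step(n_params: float, batch_size: int, context_len: int) -> float:
    # Standard ~6 * N * D approximation for dense transformer training,
    # where D = tokens processed in one step; ignores the attention term.
    return 6.0 * n_params * batch_size * context_len

def steps_for_budget(total_flops: float, n_params: float,
                     batch_size: int, context_len: int) -> int:
    return int(total_flops // flops_per_step(n_params, batch_size, context_len))

N_PARAMS = 8e9    # hypothetical model size
BATCH    = 1024   # hypothetical sequences per step
BUDGET   = 1e22   # hypothetical fixed training FLOPs budget

bpe_steps      = steps_for_budget(BUDGET, N_PARAMS, BATCH, context_len=4096)
superbpe_steps = steps_for_budget(BUDGET, N_PARAMS, BATCH, context_len=3072)

print(bpe_steps, superbpe_steps)  # shorter context -> more steps at equal FLOPs
```

That gap in step count (and in tokens seen per step) is exactly what makes me curious about the two step-matched comparisons above.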

Looking forward to your insights!
