More comparison #4

@huyiwen

Description

Thanks for your work! I wanted to follow up on a question I posted on HF earlier, which may have been overlooked.

While reading your paper, I noticed that in the comparison between the SuperBPE and standard BPE models, the SuperBPE model is trained with a shorter context length and more training steps (with total training FLOPs held constant). I'm particularly interested in how the model performs under these two scenarios:

  1. The current SuperBPE model evaluated after the same number of training steps as the BPE baseline (i.e., early-stopped checkpoints).
  2. A SuperBPE model re-trained with a context length of 4096 for the same number of training steps as the BPE baseline.

I think these might better reflect real training practices.
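To make the setup I have in mind concrete, here is a rough sketch of the FLOPs-matching arithmetic as I understand it. The 6 * N * D approximation and all of the numbers (model size, batch size, FLOPs budget, context lengths) are placeholders I made up for illustration, not values from the paper.

```python
# Rough, assumption-laden sketch: at a fixed FLOPs budget, a shorter context
# length means fewer tokens per step, and therefore more optimizer steps.

def flops_per_step(n_params: float, batch_size: int, context_len: int) -> float:
    # Standard ~6 * N * D approximation for dense transformer training,
    # where D = tokens processed in one step; ignores the attention term.
    return 6.0 * n_params * batch_size * context_len

def steps_for_budget(total_flops: float, n_params: float,
                     batch_size: int, context_len: int) -> int:
    return int(total_flops // flops_per_step(n_params, batch_size, context_len))

N_PARAMS = 8e9    # hypothetical model size
BATCH    = 1024   # hypothetical sequences per step
BUDGET   = 1e22   # hypothetical fixed training FLOPs budget

bpe_steps      = steps_for_budget(BUDGET, N_PARAMS, BATCH, context_len=4096)
superbpe_steps = steps_for_budget(BUDGET, N_PARAMS, BATCH, context_len=3072)

print(bpe_steps, superbpe_steps)  # shorter context -> more steps at equal FLOPs
```

That gap in step count (and in tokens seen per step) is exactly what makes me curious about the two step-matched comparisons above.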

Looking forward to your insights!
