Thank you for your excellent work!
I'm very interested in whether you have explored the impact of different tokenizers on model performance.
Relatedly, the drawbacks of the BPE tokenizer have been a long-standing problem. Some recent work has begun to address this by processing byte sequences directly, either with attention mechanisms over raw bytes or with prediction-based dynamic grouping:
Byte Latent Transformer: Patches Scale Better Than Tokens
https://arxiv.org/abs/2412.09871
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
https://arxiv.org/abs/2506.14761
H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
https://arxiv.org/abs/2508.05628
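To illustrate roughly what I mean by "prediction-based dynamic grouping" (in the spirit of BLT's entropy-based patching), here is a toy sketch; the small byte LM, the threshold value, and the function name are placeholders of my own, not taken from these papers or from yours:

```python
# Toy illustration: start a new byte patch wherever a small byte-level model
# is "surprised" (high next-byte entropy). The threshold is a placeholder.
import torch
import torch.nn.functional as F

def entropy_patch_boundaries(byte_logits: torch.Tensor, threshold: float = 3.0) -> torch.Tensor:
    """byte_logits: (seq_len, 256) next-byte logits from any small byte LM.
    Returns a boolean mask marking positions where a new patch should start."""
    probs = F.softmax(byte_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # (seq_len,)
    return entropy > threshold                                # high surprise => patch boundary

# Random logits stand in for a real byte LM's output here.
logits = torch.randn(32, 256)
boundaries = entropy_patch_boundaries(logits)
print(boundaries.nonzero().flatten())
```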
I would like to know whether you think it is possible to build a byte-level continuous autoregressive language model, or to apply an autoencoder directly to byte sequences.
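To make the second part of the question concrete, here is a minimal sketch of the kind of byte-patch autoencoder I have in mind, assuming a PyTorch-style setup; the module names, patch size, and dimensions are all hypothetical and not from your paper:

```python
# Hypothetical sketch: compress a fixed-size patch of K raw bytes into one
# continuous latent vector, and reconstruct the bytes from it.
import torch
import torch.nn as nn

class BytePatchAutoencoder(nn.Module):
    def __init__(self, patch_size: int = 8, d_model: int = 256, d_latent: int = 64):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(256, d_model)           # one embedding per byte value
        self.encoder = nn.Sequential(
            nn.Linear(patch_size * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_latent),                      # continuous patch latent z
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, d_model),
            nn.GELU(),
            nn.Linear(d_model, patch_size * 256),              # logits for each byte in the patch
        )

    def forward(self, bytes_in: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # bytes_in: (batch, patch_size) integer byte values in [0, 255]
        b = bytes_in.shape[0]
        x = self.byte_embed(bytes_in).reshape(b, -1)           # (batch, patch_size * d_model)
        z = self.encoder(x)                                    # (batch, d_latent)
        logits = self.decoder(z).reshape(b, self.patch_size, 256)
        return z, logits

# Reconstruction objective on random byte patches.
model = BytePatchAutoencoder()
patch = torch.randint(0, 256, (4, 8))
z, logits = model(patch)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), patch.reshape(-1))
```

The continuous autoregressive model would then predict the next patch latent z instead of the next discrete token, which is why I am curious whether you see byte patches as a viable replacement for BPE tokens in your framework.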