Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Modern tokenizers employ deterministic algorithms to map text into a single “canonical” token sequence, yet the same string …
Brian Siyuan Zheng,
Alisa Liu,
Orevaoghene Ahia,
Jonathan Hayase,
Yejin Choi,
Noah A. Smith