Alisa Liu
tokenization
Sampling from Your Language Model One Byte at a Time
Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or …
Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh
Paper
BibTeX
Code
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Modern tokenizers employ deterministic algorithms to map text into a single “canonical” token sequence, yet the same string …
Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith
Paper
BibTeX
Code
SuperBPE: Space Travel for Language Models
The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within …
Alisa Liu*, Jonathan Hayase*, Sewoong Oh, Noah A. Smith, Yejin Choi
Paper
BibTeX
Code
Blog
HF Collection