Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Jonathan Hayase*
Alisa Liu*