Publications

Sampling from Your Language Model One Byte at a Time.
Preprint 2025.

Paper BibTeX Code

Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations.
Preprint 2025.

Paper BibTeX Code

SuperBPE: Space Travel for Language Models.
COLM 2025.

Paper BibTeX Code Blog HF Collection

Tulu 3: Pushing Frontiers in Open Language Model Post-Training.
COLM 2025.

Paper BibTeX Code HF Collection

Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models.
ACL 2025.

Paper BibTeX Dataset

LlamaPIE: Proactive In-Ear Conversation Assistants.
ACL Findings 2025.

Paper BibTeX

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
NeurIPS 2024.

Paper BibTeX Code

Tuning Language Models by Proxy.
COLM 2024 (Spotlight 🌟, top 7%).

Paper BibTeX Code

We're Afraid Language Models Aren't Modeling Ambiguity.
EMNLP 2023.

Paper BibTeX Code Dataset

That was the last straw, we need more: Are Translation Systems Sensitive to Disambiguating Context?
EMNLP Findings 2023.

Paper BibTeX Code

Inverse Scaling: When Bigger Isn't Better.
TMLR 2023 (Featured 🌟).

Paper BibTeX Code

How Language Model Hallucinations Can Snowball.
ICML 2024.

Paper BibTeX Code

Self-Instruct: Aligning Language Models with Self-Generated Instructions.
ACL 2023.

Paper BibTeX Code

Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts.
ACL 2023.

Paper BibTeX

Generated Knowledge Prompting for Commonsense Reasoning.
ACL 2022.

Paper BibTeX Code

DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts.
ACL 2021.

Paper BibTeX Code Slides News

Model Selection for Deep Audio Source Separation via Clustering Analysis.
DCASE 2020 (Best Student Paper Award).

Paper BibTeX Slides Talk

Incorporating Music Knowledge in Continual Dataset Augmentation for Music Generation.
ML4MD Workshop at ICML 2020.

Paper BibTeX Code Poster

Bach or Mock? A Grading Function for Chorales in the Style of J.S. Bach.
ML4MD Workshop at ICML 2020.

Paper BibTeX Code Poster

CODAH: An Adversarially-Authored Question Answering Dataset for Common Sense.
RepEval Workshop at NAACL 2019.

Paper BibTeX Dataset