Open Data for Language Models
I am co-leading the data effort for OLMo with Luca Soldaini. We’ve released:
- Dolma, the largest open dataset for language model pretraining to date.
- peS2o, a transformation of S2ORC optimized for pretraining language models on scientific text.
Prior to this, I co-led the data curation efforts behind: