Open Data for Language Models
I am co-leading the data effort for OLMo with Luca Soldaini. We’ve released:
- Dolma, the largest open dataset for language model pretraining to date.
- peS2o, a transformation of S2ORC optimized for pretraining language models on scientific text.
Prior to this, I co-led the data curation efforts behind: