I’m a researcher at the Allen Institute for AI on the Semantic Scholar Research team, where I work on NLP and text mining over scientific literature. Before that, I spent a couple of years working as a data scientist in Seattle, and a year as an applied probability researcher at Academia Sinica in Taiwan. I graduated in 2015 with an MS in Statistics from the University of Washington.
There’s too much scientific literature being published for people to make sense of. It’d be great if NLP models could improve access to & understanding of the knowledge contained in those papers. Yet, NLP models that work well on news or Wikipedia articles often perform poorly when applied to scientific text. I’m interested in understanding why that is & how we can get these systems to perform better.
Language modeling for science
One of the best ways to improve performance on many scientific NLP tasks is to adapt the underlying language models to the scientific domain:
- SciBERT - basically BERT but for scientific text (code) (EMNLP 2019 paper)
- Don’t Stop Pretraining 🎶 your language models (code) (ACL 2020 paper) - 🎉 Runner-up for Best Paper
Scientific NLP tasks & datasets
We need new challenging scientific tasks & datasets for evaluating these models:
- Generating short TLDRs that summarize machine learning/AI papers (demo) (code) (arXiv preprint)
- Scientific fact checking! Can we verify claims using biomedical papers? (demo) (code) (arXiv preprint)
Resources for scientific NLP
Scientific text is difficult to access (copyright restrictions 😤). We need large, machine-readable, open-access corpora to support scientific NLP research:
- S2ORC: The Semantic Scholar Open Research Corpus (download) (ACL 2020 paper)
- CORD-19: The COVID-19 Open Research Corpus (download) (arXiv preprint) - Accepted to NLP-COVID at ACL 2020 (OpenReview)
Tools that make research less painful
- arXiv paper recommender w/ actionable explanations (link) (arXiv preprint) - Adopted into production. Live on Semantic Scholar
Science of science
I’m interested in (and concerned about) bias in scientific papers/publishing. Can we use NLP to study these biases?
It’d be great if more researchers in the NLP & text mining communities worked on scientific text. To promote this, I’ve co-organized workshops & shared tasks:
All of my projects have been collaborations with other awesome researchers. Many thanks to:
Waleed Ammar (Google), Iz Beltagy (AI2), Isabel Cachola (AI2), Arman Cohan (AI2), Doug Downey (AI2/Northwestern), Sergey Feldman (AI2), Suchin Gururangan (UW/AI2), Rodney Kinney (AI2), Ben Lee (UW), Ana Marasović (UW/AI2), Mark Neumann (AI2), Swabha Swayamdipta (UW/AI2), Dave Wadden (UW), Lucy Lu Wang (AI2).