I’m a researcher at the Allen Institute for AI on the Semantic Scholar Research team, where I work on NLP for scientific literature. Before that, I spent a couple of years working as a data scientist in Seattle, and a year as a researcher in the Applied Probability group at Academia Sinica in Taiwan. I graduated in 2015 with an MS in Statistics from the University of Washington.
My research interests
It’s important yet tough for scientists to keep up with the rapid pace of publication. It’d be great if NLP models could improve access to & understanding of the valuable knowledge contained in academic literature. Yet, NLP models that work well on news or Wikipedia articles often perform poorly when applied to scientific text. What makes scientific text challenging? Why do existing models do so poorly on it? How can we overcome these limitations?
Adapting language models for science
One of the best ways to improve performance across many scientific NLP tasks is to adapt large language models to the scientific domain:
- Don’t Stop Pretraining 🎶 (code) (ACL 2020 paper) - 🎉 Runner-up for Best Paper
- SciBERT (code) (EMNLP 2019 paper)
Scientific NLP tasks & datasets
It’s hard to make progress without challenging tasks & datasets for evaluating our models:
- TLDR: Extreme summarization of scientific documents (demo) (code/download) (arXiv preprint) - 🎉 Accepted to EMNLP 2020 (Findings)
- Scientific claim verification (demo) (code/download) (arXiv preprint) - 🎉 Accepted to EMNLP 2020
Resources for scientific NLP research
Scientific papers can be difficult to access (paywalls, copyright 😤). We need large, machine-readable, open-access corpora to support scientific NLP research:
- S2ORC: The Semantic Scholar Open Research Corpus (download) (ACL 2020 paper)
- CORD-19: The COVID-19 Open Research Corpus (download) (arXiv preprint) - Accepted to NLP-COVID at ACL 2020 (OpenReview)
Helping researchers do research
What does γ mean again? Hate flipping back to page 2 to find the definition? Our ScholarPhi tool provides just-in-time definitions of terms & math symbols right on the PDF (arXiv preprint).
LIME gives you post-hoc explanations of arbitrary model predictions. But what if a user says “Show me more/less of that” for an explanation? Tuning a linear model for this is easy, but for neural models, our solution is LIMEADE (arXiv preprint).
Science of science
I’m interested in (and concerned about) bias in scientific research. How can NLP help us identify & quantify these biases?
It’d be great if more researchers in the NLP & text mining communities worked on scientific text. To promote this, I’ve co-organized workshops & shared tasks.
All of my projects have been collaborations with other awesome researchers ❤️. Many thanks to:
Waleed Ammar (Google), Iz Beltagy (AI2), Isabel Cachola (JHU), Arman Cohan (AI2), Doug Downey (AI2/Northwestern), Sergey Feldman (AI2), Suchin Gururangan (UW/AI2), Andrew Head (UC Berkeley), Dongyeop Kang (UC Berkeley), Rodney Kinney (AI2), Ben Lee (UW), Ana Marasović (UW/AI2), Mark Neumann (AI2), Swabha Swayamdipta (UW/AI2), Dave Wadden (UW), Lucy Lu Wang (AI2).