About me

I’m a researcher at the Allen Institute for AI on the Semantic Scholar Research team, where I work on NLP for scientific literature. Before that, I spent a couple of years working as a data scientist in Seattle, and a year as a researcher in the Applied Probability group at Academia Sinica in Taiwan. I graduated in 2015 with an MS in Statistics from the University of Washington.

My research interests

It’s important yet tough for scientists to keep up with the rapid pace of publication. It’d be great if NLP models could improve access to, and understanding of, the valuable knowledge contained in academic literature. Yet NLP models that work well on news or Wikipedia articles often perform poorly when applied to scientific text. What makes scientific text challenging? Why do existing models do so poorly on it? How can we overcome these limitations?

Adapting language models for science

One of the best ways to improve performance across many scientific NLP tasks is to adapt large language models to the scientific domain:

Scientific NLP tasks & datasets

It’s hard to make progress without challenging tasks & datasets for evaluating our models:

Resources for scientific NLP research

Scientific papers can be difficult to access (paywalls, copyright 😤). We need large, machine-readable, open-access corpora to support scientific NLP research:

Helping researchers do research

  • What does γ mean again? Hate flipping back to page 2 to find the definition? Our ScholarPhi tool provides just-in-time definitions of terms & math symbols right on the PDF (arXiv preprint)

  • LIME gives you post-hoc explanations of arbitrary model predictions. But what if a user says “show me more/less of that” for an explanation? Tuning a linear model to such feedback is easy, but for neural models it’s harder; our solution is LIMEADE (arXiv preprint)

  • Prototype recommender system for arXiv papers (demo). Now adopted into production on Semantic Scholar (link)

Science of science

I’m interested in (and concerned about) bias in scientific research. How can NLP help us identify & quantify these biases?

Professional organizations

It’d be great if more researchers in the NLP & text mining communities worked on scientific text. To promote this, I’ve co-organized the following workshops & shared tasks:

  • Shared tasks

    • EPIC-QA at TAC 2020 - Open domain question answering challenge: Can systems handle a mixture of questions from experts as well as consumers? (link)
    • TREC-COVID at TREC 2020 - Information retrieval challenge over an evolving CORD-19 corpus (link) (JAMIA 2020 paper) (SIGIR Forum 2020 paper)
  • Workshops

    • 1st SciNLP workshop at AKBC 2020 (link) (recorded talks) - What a success! 166 of 422 AKBC attendees signed up for our workshop! Stay tuned for the next one ;)

My collaborators

All of my projects have been collaborations with other awesome researchers ❤️. Many thanks to:

Waleed Ammar (Google), Iz Beltagy (AI2), Isabel Cachola (JHU), Arman Cohan (AI2), Doug Downey (AI2/Northwestern), Sergey Feldman (AI2), Suchin Gururangan (UW/AI2), Andrew Head (UC Berkeley), Dongyeop Kang (UC Berkeley), Rodney Kinney (AI2), Ben Lee (UW), Ana Marasović (UW/AI2), Mark Neumann (AI2), Swabha Swayamdipta (UW/AI2), Dave Wadden (UW), Lucy Lu Wang (AI2).