About me
I’m a researcher at the Allen Institute for AI on the Semantic Scholar Research team. Before that, I did some statistics in Seattle and some applied probability at Academia Sinica in Taiwan. I graduated in 2015 with an MS in Statistics from the University of Washington.
Research
I work on NLP over scientific literature, focusing on the challenges scientists face with information overload and keeping up to date. I’m currently interested in:
- Language modeling for scientific text
- ✨Summarization of scientific papers
- ✨Fact checking claims about scientific phenomena using published literature
- ✨Corpora and resources to support other researchers interested in NLP for scientific text
- ✨Augmented reading of scientific papers to support definitions and note-taking
and I’ve also done some work in:
Language modeling
- Don’t Stop Pretraining 🎶: Adapt language models to domains and tasks (ACL 2020) (GitHub) - 🎉 Runner-up for Best Paper
- SciBERT: A pretrained language model for scientific text (EMNLP 2019) (GitHub)
Summarization
- TLDR: Extreme summarization of scientific documents (EMNLP 2020 - Findings) (GitHub)
  - ✨ See live in production on Semantic Scholar
  - In the news: Nature, MIT Tech Review, TNW
Fact checking
- SciFact: Scientific claim verification (EMNLP 2020) (GitHub)
  - ✨ Follow progress on our public leaderboard and live demo
  - In the news: MIT Tech Review, VentureBeat, ZDNet
Corpora and resources
- S2ORC: The Semantic Scholar Open Research Corpus (ACL 2020) (GitHub)
- ✨ s2orc-doc2json for parsing PDFs and LaTeX to JSON format (GitHub)
- CORD-19: The COVID-19 Open Research Corpus (NLP-COVID at ACL 2020) (OpenReview) (GitHub)
  - SIIRH 2020 at ECIR 2020 keynote (18 min) (April 14, 2020)
  - AWS Education: Research Seminar talk (60 min) (July 29, 2020)
  - In the news: White House OSTP, Science, Nature, TechCrunch, GeekWire [1] [2]
Augmented reading
- ✨ ScholarPhi: Just-in-Time, Position-Sensitive Definitions of Terms and Symbols (Accepted to CHI 2021)
  - Try our live demo
Generation
- Citation text generation (arXiv 2020; under submission)
Explainable AI
- Explanation-based tuning of opaque machine learners (arXiv 2020; under submission)
Information extraction
- Document-level definition detection in scholarly documents (SDP at EMNLP 2020)
- Combining distant and direct supervision for neural relation extraction (NAACL 2019)
- Construction of the literature graph in Semantic Scholar (NAACL 2018)
Science of science
- ✨ Text mining approaches for dealing with the rapidly expanding literature on COVID-19 (Briefings in Bioinformatics 2020)
- Quantifying sex bias in clinical trial participation (JAMA 2019)
  - In the news: Quartz article
- Citation count analysis for papers with preprints (arXiv 2018)
Shared tasks
- SciVER at SDP 2021 (NAACL 2021) - Scientific claim verification (link)
- EPIC-QA at TAC 2020 - Open domain question answering challenge: Can systems handle a mixture of questions from experts as well as consumers? (link)
- TREC-COVID at TREC 2020 - Information retrieval challenge over an evolving CORD-19 corpus (link) (JAMIA 2020 paper) (SIGIR Forum 2020 paper)
Workshops
The 2nd SDP workshop will be at NAACL 2021! Stay tuned! (link)
1st SciNLP workshop at AKBC 2020 (link) (recorded talks) - What a success! 166 of 422 AKBC attendees signed up for our workshop! Stay tuned for the next one ;)
My collaborators
All of my projects have been collaborations with other awesome researchers ❤️. Many thanks to:
Waleed Ammar (Google), Iz Beltagy (AI2), Isabel Cachola (JHU), Arman Cohan (AI2), Doug Downey (AI2/Northwestern), Sergey Feldman (AI2), Suchin Gururangan (UW/AI2), Andrew Head (UC Berkeley), Dongyeop Kang (UC Berkeley), Rodney Kinney (AI2), Ben Lee (UW), Ana Marasović (UW/AI2), Mark Neumann (AI2), Swabha Swayamdipta (UW/AI2), Dave Wadden (UW), Lucy Lu Wang (AI2).