About me
I’m a researcher at the Allen Institute for AI on the Semantic Scholar Research team. Before that, I was a statistician in Seattle and a researcher at Academia Sinica in Taiwan. I graduated in 2015 with an MS in Statistics from the University of Washington.
My research interests
I work on NLP problems motivated by the challenges of building applications that help scholars with their research. My main current interests are:
- Language modeling for scientific text
- ✨Summarization of scientific papers
- ✨Fact checking claims using the broader scientific literature
- Corpora and resources to support other researchers interested in this problem space
- ✨Augmented reading of scientific papers to support definitions and note-taking
and I’ve also done some work in:
Language modeling
- Don’t Stop Pretraining 🎶 (code) (ACL 2020 paper) - 🎉 Runner-up for Best Paper
- SciBERT (code) (EMNLP 2019 paper)
Summarization
- ✨ TLDR: Extreme summarization of scientific documents (demo) (code/download) (Findings of EMNLP 2020 paper)
  - ✨ See live in production on Semantic Scholar
  - Video (12 min) by Henry AI Labs
  - In the news: Nature, MIT Tech Review, TNW
Fact checking
- ✨ SciFact: Scientific claim verification (demo) (code/download) (EMNLP 2020 paper)
  - In the news: MIT Tech Review, VentureBeat, ZDNet
Corpora and resources
- S2ORC: The Semantic Scholar Open Research Corpus (download) (ACL 2020 paper)
- CORD-19: The COVID-19 Open Research Corpus (download) (NLP-COVID at ACL 2020 paper) (OpenReview)
  - SIIRH 2020 at ECIR 2020 keynote (18 min) (April 14, 2020)
  - NY-NLP meetup talk (30 min) (April 27, 2020)
  - AWS Education: Research Seminar talk (60 min) (July 29, 2020)
  - In the news: White House OSTP, Science, Nature, TechCrunch, Geekwire [1] [2]
Augmented reading
- ✨ ScholarPhi: Just-in-Time, Position-Sensitive Definitions of Terms and Symbols (arXiv preprint; 2020; under submission)
Generation
- Citation text generation (arXiv preprint; 2020; under submission)
Explainable AI
- Explanation-based tuning of opaque machine learners (arXiv preprint; 2020; under submission)
Information extraction
- Document-Level definition detection in scholarly documents (SDP at EMNLP 2020)
- Combining distant and direct supervision for neural relation extraction (NAACL 2019)
- Construction of the literature graph in Semantic Scholar (NAACL 2018)
Scientometrics and science of science
- ✨ Text mining approaches for dealing with the rapidly expanding literature on COVID-19 (Briefings in Bioinformatics 2020)
- Quantifying sex bias in clinical trial participation (JAMA 2019)
  - In the news: Quartz article
- Citation count analysis for papers with preprints (arXiv preprint; 2018)
Shared tasks
- SciVer at SDP 2021 (NAACL 2021) - Scientific fact checking (link)
- EPIC-QA at TAC 2020 - Open domain question answering challenge: Can systems handle a mixture of questions from experts as well as consumers? (link)
- TREC-COVID at TREC 2020 - Information retrieval challenge over an evolving CORD-19 corpus (link) (JAMIA 2020 paper) (SIGIR Forum 2020 paper)
Workshops
The 2nd SDP workshop will be at NAACL 2021! Stay tuned! (link)
1st SciNLP workshop at AKBC 2020 (link) (recorded talks) - What a success! 166 of 422 AKBC attendees signed up for our workshop! Stay tuned for the next one ;)
My collaborators
All of my projects have been collaborations with other awesome researchers ❤️. Many thanks to:
Waleed Ammar (Google), Iz Beltagy (AI2), Isabel Cachola (JHU), Arman Cohan (AI2), Doug Downey (AI2/Northwestern), Sergey Feldman (AI2), Suchin Gururangan (UW/AI2), Andrew Head (UC Berkeley), Dongyeop Kang (UC Berkeley), Rodney Kinney (AI2), Ben Lee (UW), Ana Marasović (UW/AI2), Mark Neumann (AI2), Swabha Swayamdipta (UW/AI2), Dave Wadden (UW), Lucy Lu Wang (AI2).