Abstract: Recent advances in Natural Language Processing (NLP) have revolutionized the process of identifying products, people, and places in unstructured text data. This task is referred to as Named Entity Recognition (NER) and forms the basis of many downstream NLP applications (e.g. AI assistants, search engines). Whereas extracting this type of information previously relied on complex rule engines tailored to specific data sources, open-source libraries leveraging pre-trained large language models can now achieve state-of-the-art performance on even noisy social media data.
In this workshop, I will demonstrate an end-to-end NER application to identify medications in social media data using the spaCy NLP library. We will begin with an overview of neural NER models, including recent transformer-based models (e.g. BERT). We will then dig into spaCy's project structure and how to set up its base neural model for NER. Our first objective will be to identify medication entities in social media conversations on Twitter. Once we have run and examined the results of the model, we will see how we can achieve significant performance boosts by tweaking project parameters and using a transformer architecture.
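As a preview of the kind of setup covered in the workshop, the sketch below builds a minimal spaCy NER pipeline from scratch and trains it on a single toy example. The label name "MEDICATION" and the example tweet are illustrative, not the workshop's actual dataset, and a real project would use spaCy's config-driven training rather than a hand-rolled loop.

```python
import spacy
from spacy.training import Example

# Minimal sketch: a blank English pipeline with spaCy's built-in NER component.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("MEDICATION")  # illustrative label name

# One toy training example: raw text plus character-offset entity spans.
train_data = [
    ("took some tylenol for my headache", {"entities": [(10, 17, "MEDICATION")]}),
]
examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data
]

# Initialize the model weights, then run a few update steps.
nlp.initialize(lambda: examples)
for _ in range(20):
    losses = {}
    nlp.update(examples, losses=losses)

# Apply the (under-trained) model to new text and inspect predicted entities.
doc = nlp("took some tylenol for my headache")
print([(ent.text, ent.label_) for ent in doc.ents])
```

In practice, spaCy projects drive this same process from a `config.cfg` file, which is also where a transformer component can be swapped in for the default token-to-vector layer.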
In the next part of the workshop, we will walk through how to adapt our project to a larger, more complex dataset which includes multiple entity types (i.e. medications, dosages, and frequencies). The strategies we will review in this section are broadly applicable to real-world use cases outside of the medical context. We will conclude with a review of example applications of this technology and a discussion of next steps for improving our performance.
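Extending a spaCy NER component to multiple entity types mainly means registering additional labels and supplying annotations that use them. A small sketch, assuming the workshop's three example types (the label names and the sample sentence are illustrative):

```python
import spacy

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Register one label per entity type in the dataset.
for label in ("MEDICATION", "DOSAGE", "FREQUENCY"):
    ner.add_label(label)

# Annotations now carry character-offset spans for each type, e.g.:
text = "take 200mg of ibuprofen twice daily"
annots = {
    "entities": [
        (5, 10, "DOSAGE"),       # "200mg"
        (14, 23, "MEDICATION"),  # "ibuprofen"
        (24, 35, "FREQUENCY"),   # "twice daily"
    ]
}
```

Training then proceeds exactly as in the single-label case; the model learns to assign the correct label to each predicted span.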
Participants in this workshop can expect to gain an understanding of NER pipelines in general and how to implement one in spaCy specifically. They will leave with a project template that can be applied to their own use cases.
Bio: Ben is a Senior Data Scientist at the Institute for Experiential AI. He obtained his Master of Public Health (MPH) from Johns Hopkins and his PhD in Policy Analysis from the Pardee RAND Graduate School. Since 2014, he has worked in data science across government, academia, and the private sector. His major focus has been on Natural Language Processing (NLP) technology and applications. Throughout his career, he has pursued opportunities to contribute to the larger data science community. He has spoken at data science conferences, taught courses in data science, and helped organize the Boston chapter of PyData. He also contributes to volunteer projects applying data science tools for public good.