Build a Question Answering System using DistilBERT in Python


Comprehending natural language text, with its inherent challenges of ambiguity, synonymy, and co-reference, has been a long-standing problem in Natural Language Processing (NLP). The field has seen a tremendous amount of research and innovation in the past few years, aimed both at tackling this problem and at making high-quality machine learning and AI solutions for natural text easier to implement by abstracting away the underlying workings of the algorithms. This has allowed pre-trained models to be applied quickly and integrated into real-world industry use cases. Question answering (QA) is one such area, crucial in sectors such as finance, media, and chatbots for exploring large text datasets and finding insights quickly. You can either build a closed-domain QA system for a specific use case or work with open-domain systems using open-source language models that have been pre-trained on terabytes of general-knowledge data. Fine-tuning such a model on the problem at hand to add domain-specific information is an efficient way to implement a machine learning solution. The general idea is to identify the K sentences in the training corpus that are most relevant to a question, and then to find the span of text within those sentences that answers it.
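The retrieval step described above can be illustrated with a minimal sketch. It is not the code from the talk: it assumes a simple TF-IDF cosine-similarity ranking, uses only the Python standard library, and the names `tokenize` and `top_k_sentences` are hypothetical. In a full system, a reader model such as a fine-tuned DistilBERT would then look for the answer span inside the sentences returned here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_sentences(question, sentences, k=2):
    """Return the k sentences most similar to the question (TF-IDF cosine)."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

    def vectorize(counts):
        # Terms that never occur in the corpus carry no weight.
        return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q_vec = vectorize(Counter(tokenize(question)))
    scored = [(cosine(q_vec, vectorize(d)), s) for d, s in zip(docs, sentences)]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


if __name__ == "__main__":
    # Toy corpus made up for illustration only.
    corpus = [
        "DistilBERT is a smaller, faster version of BERT.",
        "The Eiffel Tower is located in Paris.",
        "BERT was pre-trained on a large text corpus.",
    ]
    for sentence in top_k_sentences("Where is the Eiffel Tower located?", corpus, k=2):
        print(sentence)
```

TF-IDF is used here only because it keeps the example self-contained; a production retriever might instead use BM25 or dense sentence embeddings.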

This talk will highlight the general concepts behind the DistilBERT language model and ways of implementing it, and will show how to fine-tune the base model to build an efficient question-answering model. Building on available open-source platforms in this way can lead to better business outcomes as well as a smaller environmental footprint, since, by some estimates, training a single large AI model can emit as much carbon as five cars do over their lifetimes. A basic understanding of Python is desirable. Code can be made available via GitHub for everyone to examine after the talk.

Session Outline
Phase 1: Understand the NLP based concepts
Familiarize yourself with the NLP terminology and process flow needed to retrieve information from an unstructured text corpus.

Phase 2: Deep-dive on BERT
Understand the BERT architecture and its workings. Explore the problem statement and the steps to solve it.

Phase 3: Walkthrough of the Code
A Colab notebook walkthrough that goes step by step through the process of building the question-answering model.

Background Knowledge
Basic understanding of Python


Jayeeta is a Data Scientist with 5+ years of industry experience. She recently led six-week NLP workshops in association with Women Who Code, Data Science track. Jayeeta has also been a speaker at the International Conference on Machine Learning (ICML 2020), MLConf EU, WomenTech Global Conference, and Data Summit Connect. She works extensively on NLP projects, where she gets to explore many state-of-the-art models and build cool products, and she firmly believes that data is the best storyteller. Recently, Jayeeta joined MediaMath, a leader in the programmatic AdTech domain. Prior to this, she worked at Indellient, Omnicom, Deloitte, and Volvo Group. Jayeeta is also engaged with several organizations that promote and inspire more women to take up STEM. Jayeeta received her Master of Science in Quantitative Methods and Modeling from City University of New York, NY, and her Bachelor of Science in Economics and Statistics from West Bengal State University, India.

Open Data Science
One Broadway
Cambridge, MA 02142
