Creating a Custom Vocabulary for NLP Tasks Using exBERT and spaCy


For NLP tasks, the first step is to pre-process text for training. Suppose you start with an English language model: it comes with a vocabulary of over 1 million items, many entity-recognition classes, and extensive compound-noun recognition. But what happens when you need to add new terms and customize the vocabulary? In this tutorial, we show an approach to creating a custom vocabulary that can then be used for any NLP task.

Session Outline:

1. Introduction to Language Models - terminology such as vocabulary, and common language models
2. Why do we need a custom vocabulary - examples of scenarios where custom terms are needed
3. How to add custom terms to a vocabulary - exBERT and the spaCy tokenizer
- step-by-step approach to creating a custom vocabulary in Python

Learning Objective:
By the end of the session, participants will understand how to create their own custom vocabularies that can then be used for NLP tasks such as sentence completion and sentiment analysis. The module covers the exBERT approach of adding terms to an existing vocabulary and walks through the steps using an example from the Hugging Face library. Participants will also learn about any pitfalls of this approach and how the spaCy tokenizer, an open-source tool, can be used to achieve this customization.
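
To make the idea of adding terms to an existing vocabulary concrete, here is a minimal, library-free sketch. It is not the exBERT implementation, nor the Hugging Face or spaCy API; it assumes a toy WordPiece-style vocabulary and a greedy longest-match tokenizer, and the names `wordpiece_tokenize`, `extend_vocab`, and `base_vocab` are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, List

UNK = "[UNK]"  # placeholder token for words the vocabulary cannot cover


def wordpiece_tokenize(word: str, vocab: Dict[str, int]) -> List[str]:
    """Split one word into sub-word pieces using greedy longest-match-first."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def extend_vocab(vocab: Dict[str, int], new_terms: Iterable[str]) -> Dict[str, int]:
    """Return a copy of vocab with new terms appended; existing IDs are unchanged."""
    extended = dict(vocab)
    for term in new_terms:
        if term not in extended:
            extended[term] = len(extended)
    return extended


# Toy general-purpose vocabulary (an assumption for this example).
base_vocab = {
    tok: i
    for i, tok in enumerate([UNK, "hyper", "##tension", "##glyc", "##emia", "tension"])
}

before = wordpiece_tokenize("hyperglycemia", base_vocab)
custom_vocab = extend_vocab(base_vocab, ["hyperglycemia"])
after = wordpiece_tokenize("hyperglycemia", custom_vocab)

print(before)  # the domain term is fragmented into generic pieces
print(after)   # the domain term is kept as a single token
```

With the base vocabulary the domain term is broken into generic pieces; once the term is appended, it survives as a single token while all original token IDs stay the same. In a real model, each appended token would also need a new embedding row that has to be trained before it carries any meaning.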


Swagata is a Data Professional with over 6 years of experience in the Healthcare, Retail and Platform Integration industries. She is an avid blogger and writes about state-of-the-art developments in the AI space. She is particularly interested in Natural Language Processing, and focuses on researching how to make NLP models work in practical settings. In her spare time, she loves to play her guitar, sip masala chai and find new spots for doing yoga. Connect with her here –

Open Data Science




