
Abstract: The lack of a readily available dataset is common in industry projects involving NLP. Researchers venturing into new problems or new languages often face the same situation. However, traditional textbooks, tutorials, and workshops focus primarily on building and deploying models. In this workshop, I will introduce some strategies for creating labeled datasets for a new task and building your first models with that data. By the end of the session, participants should have some ideas for solving the data bottleneck in their organization. The target audience is data scientists as well as those involved in requirements gathering for a given NLP problem.
Session Outline
Lesson 1: Overview of different means of collecting labeled data for NLP, and ethical and other challenges involved.
Lesson 2: For a given problem description, what tools can we use to create annotated data? (an overview of tools, with specific examples using Doccano).
Lesson 3: How can we create labeled data automatically? - data labeling and augmentation for NLP. Tools used: Snorkel
Lesson 4: How to build a model using automatically labeled data and evaluate it with the gold-standard manually labeled data. Tools used: sklearn/huggingface
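As a rough illustration of the idea behind Lessons 3 and 4, the sketch below labels text with a few keyword-based labeling functions and then checks the resulting labels against a small manually labeled set. It does not use Snorkel's API: Snorkel combines labeling-function outputs with a learned label model, whereas this sketch substitutes a plain majority vote. The labeling functions, keyword lists, and example sentences are all hypothetical.

```python
from collections import Counter

ABSTAIN, NEG, POS = -1, 0, 1

# Hypothetical labeling functions: each one votes POS, NEG, or ABSTAIN.
def lf_positive_words(text):
    words = ("great", "love", "excellent")
    return POS if any(w in text.lower() for w in words) else ABSTAIN

def lf_negative_words(text):
    words = ("terrible", "hate", "awful")
    return NEG if any(w in text.lower() for w in words) else ABSTAIN

def lf_refund(text):
    return NEG if "refund" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_positive_words, lf_negative_words, lf_refund]

def weak_label(text):
    """Majority vote over non-abstaining functions; ABSTAIN on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

def evaluate(gold):
    """Return (coverage, accuracy on covered examples) for (text, label) pairs."""
    predictions = [(weak_label(text), label) for text, label in gold]
    covered = [(p, y) for p, y in predictions if p != ABSTAIN]
    coverage = len(covered) / len(gold)
    accuracy = sum(p == y for p, y in covered) / len(covered) if covered else 0.0
    return coverage, accuracy

# Hypothetical gold-standard examples, labeled by hand.
GOLD = [
    ("I love this phone, excellent battery", POS),
    ("Terrible service, I want a refund", NEG),
    ("It arrived on Tuesday", NEG),
    ("Great screen but I hate the keyboard", POS),
]

coverage, accuracy = evaluate(GOLD)
print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Coverage (the share of examples that receive any label) and accuracy on the covered examples are reported separately because labeling functions often abstain. In a fuller pipeline, the weakly labeled examples would be used to train a classifier, with the gold set held out for evaluation.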
Background Knowledge
The preferred audience is people who have already used NLP in their past work and are familiar with the typical NLP system development pipeline, e.g., how to represent text as a vector, how to use machine learning methods for NLP, and how to evaluate them.
Bio: Sowmya Vajjala currently works as a researcher in Digital Technologies at the National Research Council, Canada’s largest federal research and development organization. She has worked in the area of Natural Language Processing (NLP) over the past decade in various roles – as a software developer, researcher, educator, and senior data scientist. She recently co-authored a book, “Practical Natural Language Processing: A Comprehensive Guide to Building Real World NLP Systems”, published by O’Reilly Media (June 2020), which was also translated into Chinese. Her research interests lie in multilingual computing and the relevance of NLP beyond research, both in industry practice and in other disciplines, through interdisciplinary research.

Sowmya Vajjala, PhD
Title
Research Officer, Digital Technologies | National Research Council
