
Abstract: We are in the age of data. In recent years, many companies have started collecting large amounts of data about their business, and many others are starting now. However, before you can train any decent supervised model, you need ground truth data. And this is the ugly truth: before proceeding, you need a sufficiently large set of correctly labeled data records to describe your problem. And data labeling - especially in sufficiently large amounts - is … expensive. In this presentation we explain the main parts of the guided labeling procedure and show a blueprint web application, based on active learning and weak supervision, for interactively labeling any document set while investing only a fraction of the time in manual labeling. Additionally, the user can provide labeling functions or rules that label portions of the dataset. Both the labels and the labeling functions provided by the human-in-the-loop are processed by the guided labeling application to train a machine learning model, to which the boring and expensive task of data labeling can then be delegated.
Bio: Paolo Tamagnini is a data scientist at KNIME. He holds a master’s degree in data science from Sapienza University of Rome and has research experience at NYU in data visualization techniques for machine learning interpretability.