
Abstract: Assembling comprehensive datasets is often the biggest challenge in creating effective analytics and machine learning models. Doing so usually means enriching existing datasets with external sources; product catalogs or company records, for example, can fill gaps in existing data. However, external data sources can be time-consuming and expensive to acquire. Instead, what if you could use machine learning to extract previously inaccessible information from the data you already have? A solution that reliably extracts important data from text can enrich datasets in a repeatable fashion. With a relatively small amount of manual labelling, a Named Entity Recognition (NER) model can be trained to identify and extract entities.
In this session, we discuss how Named Entity Recognition algorithms can expand the accessible information in a dataset by extracting known entities from unstructured attributes. As an example, we’ll use a Recurrent Neural Network (RNN) to identify and retrieve product properties from supply chain datasets. These extracted attributes can be used for downstream data curation and analytics, such as unit-of-measure standardization or price comparison. We benchmark the performance of our algorithms against more traditional extraction methods, including regular expressions. Finally, we show how the RNN models can be exposed through a simple API for data enrichment at scale.
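To make the regular-expression baseline concrete, here is a minimal sketch of what such a traditional extractor might look like. It is not the benchmark used in the session: the pattern, the unit list, and the function name extract_measurements are all hypothetical, chosen only to illustrate pulling a value and unit of measure out of a free-text product description.

```python
import re

# Hypothetical pattern (an assumption, not from the session): a number
# followed by one of a small, hand-picked set of unit abbreviations.
UNIT_PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mm|cm|kg|lb|in|m|g)\b",
    re.IGNORECASE,
)


def extract_measurements(description):
    """Return (value, unit) pairs found in a free-text product description."""
    return [
        (float(match.group("value")), match.group("unit").lower())
        for match in UNIT_PATTERN.finditer(description)
    ]


print(extract_measurements("Hex bolt M8 x 40 mm, zinc plated, box of 2.5 kg"))
# prints [(40.0, 'mm'), (2.5, 'kg')]
```

A hand-written pattern like this only finds the formats its author anticipated; a description such as "length forty millimetres" yields nothing. That brittleness is the kind of gap a trained NER model is meant to close.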
Bio: Ian is a DataOps Engineer at Tamr, where he works on designing and implementing Tamr’s data solutions for clients. Before Tamr, Ian applied machine learning models to research properties of materials. His previous work includes high-throughput modeling of superelasticity in a novel class of intermetallic crystals. Ian has a PhD in Mechanical Engineering from Colorado State University.