
Abstract: Assembling comprehensive datasets is often the biggest challenge in creating effective analytics and machine learning models. Doing so frequently means enriching existing datasets with external sources; product catalogs or company records, for example, can fill gaps in existing data. However, external data sources can be time-consuming and expensive to acquire. Instead, what if you could use machine learning to extract previously inaccessible information from the data you already have? A solution that reliably extracts important data from text can enrich datasets in a repeatable fashion. With a relatively small amount of manual labelling, a Named Entity Recognition (NER) model can be trained to identify and extract entities.
In this session, we discuss how Named Entity Recognition algorithms can expand the accessible information in a dataset by extracting known entities from unstructured attributes. As an example, we’ll use a Recurrent Neural Network (RNN) to identify and retrieve product properties from supply chain datasets. These extracted attributes can be used for downstream data curation and analytics, such as unit-of-measure standardization or price comparison. We benchmark the performance of our algorithms against more traditional extraction methods, including regular expressions. Finally, we show how the RNN models can be exposed through a simple API for data enrichment at scale.
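To make the regular-expression baseline concrete, here is a minimal sketch of how a quantity and its unit of measure might be pulled from a free-text product description. It is an illustration only, not the method presented in the session: the unit list, the pattern, and the function name extract_measures are all assumptions.

```python
import re
from typing import List, Tuple

# Hypothetical unit vocabulary (an assumption for illustration);
# a real supply chain dataset would need a far larger list.
UNIT_PATTERN = re.compile(
    r"(?P<quantity>\d+(?:\.\d+)?)\s*(?P<unit>kg|g|mm|cm|m|ml|l|oz|lb)\b",
    re.IGNORECASE,
)


def extract_measures(description: str) -> List[Tuple[float, str]]:
    """Return (quantity, unit) pairs found in a product description."""
    return [
        (float(match.group("quantity")), match.group("unit").lower())
        for match in UNIT_PATTERN.finditer(description)
    ]


# Example call on a made-up product description.
measures = extract_measures("Steel bolt M8 x 25mm, 2.5 kg box")
```

A pattern like this only finds what its author anticipated: a description that spells out "twenty-five millimetres" yields nothing, which is the kind of gap a trained NER model is meant to close.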
Bio: Julia is the Director of Analytics at Tamr, where she is expanding the company's analytics and data science solutions. Before joining Tamr, she led end-to-end modeling and development of data science products at Aon's Intellectual Property Solutions group. Her previous experience includes technology-focused litigation consulting, quantitative finance, and private equity. Julia has a PhD in Physics from Harvard.