Abstract: Data deduplication, or entity resolution, is a common problem for anyone working with data, especially public datasets. Many real-world datasets do not contain unique IDs; instead, we often use a combination of fields to identify unique entities across records by linking and grouping. This talk will show how we can use active learning techniques to train learnable similarity functions that outperform standard similarity metrics (such as edit or cosine distance) for deduplicating data in a graph database. Active learning is a semi-supervised machine learning technique that incorporates user feedback at each training iteration, asking the user to label the data point expected to be most informative for training. Further, we show how these techniques can be enhanced by inspecting the structure of the graph to inform the linking and grouping processes. We will demonstrate how to use open source tools to perform entity resolution on a dataset of campaign finance contributions loaded into the Neo4j graph database. We will make use of Neo4j, Cypher (the query language for graphs), and Python data science tools.
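To make the active learning idea above concrete, here is a minimal, illustrative sketch in plain Python. It is not the tooling used in the talk: the function names (`field_similarities`, `fit`, `most_uncertain`, `active_learn`), the record fields, and the choice of a small logistic regression over per-field string similarities with uncertainty sampling are all assumptions made for illustration. The `ask_user` callback stands in for a person confirming whether two records refer to the same entity.

```python
import math
from difflib import SequenceMatcher


def field_similarities(a, b, fields):
    """One string-similarity feature per field, each in [0, 1]."""
    return [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
            for f in fields]


def predict(weights, bias, x):
    """Learned similarity: estimated probability that a pair is a duplicate."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit(examples, n_features, epochs=500, lr=0.5):
    """Train a tiny logistic regression with stochastic gradient descent."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in examples:
            err = predict(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def most_uncertain(weights, bias, pool):
    """Uncertainty sampling: the pair whose score is closest to 0.5."""
    return min(pool, key=lambda x: abs(predict(weights, bias, x) - 0.5))


def active_learn(pool, seed, ask_user, rounds=5):
    """Repeatedly ask the user to label the most uncertain pair, then retrain.

    pool: unlabeled feature vectors; seed: list of (features, label) pairs;
    ask_user: hypothetical callback returning 1 (duplicate) or 0 (distinct).
    """
    labeled, pool = list(seed), list(pool)
    n_features = len(seed[0][0])
    weights, bias = fit(labeled, n_features)
    for _ in range(rounds):
        if not pool:
            break
        x = most_uncertain(weights, bias, pool)
        pool.remove(x)
        labeled.append((x, ask_user(x)))
        weights, bias = fit(labeled, n_features)
    return weights, bias
```

In a graph setting, the feature vectors would come from pairs of candidate nodes (for example, two contributor records), and pairs scoring above a chosen threshold would be linked and then grouped into entities. Graph structure, such as shared neighbors, could be added as extra features under the same scheme.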
Bio: William Lyon is a Developer Relations Engineer at Neo4j, the open source graph database, where he builds tools for integrating Neo4j with other technologies and helps users be successful with graphs. He also leads Neo4j's Data Journalism Accelerator Program. Prior to Neo4j, he worked as a software engineer for a variety of startups, building quantitative trading tools, predictive APIs, and software for the real estate industry. William holds a master's degree in Computer Science from the University of Montana. You can find him online at lyonwj.com.