Applying an Active Learning Algorithm For Entity Deduplication In Graph Data


Data deduplication, or entity resolution, is a common problem for anyone working with data, especially public datasets. Many real-world datasets do not contain unique IDs; instead, we often use a combination of fields to identify unique entities across records by linking and grouping. This talk will show how we can use active learning techniques to train learnable similarity functions that outperform standard similarity metrics (such as edit distance or cosine distance) for deduplicating data in a graph database. Active learning is a semi-supervised machine learning technique that incorporates user feedback at each training iteration, so that the most informative data point is labeled and used for training. Further, we show how these techniques can be enhanced by inspecting the structure of the graph to inform the linking and grouping processes. We will demonstrate how to use open source tools to perform entity resolution on a dataset of campaign finance contributions loaded into the Neo4j graph database. We will make use of Neo4j, Cypher (the query language for graphs), and Python data science tools.
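
As a rough illustration of the active learning loop described above, the following minimal Python sketch trains a learnable similarity function (a small logistic regression over per-field string similarities) and, at each iteration, asks for a label on the record pair the model is least sure about. It is not the talk's implementation: the records, the field names (`name`, `city`), the `true_entity` lookup that stands in for a human labeler, and every function name are assumptions made for this example. In practice the records would be read from the graph database, and the labels would come from a person.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical contributor records; the field names are assumptions.
records = [
    {"id": 0, "name": "William Lyon", "city": "Missoula"},
    {"id": 1, "name": "Will Lyon", "city": "Missoula"},
    {"id": 2, "name": "Jane Smith", "city": "Helena"},
    {"id": 3, "name": "Jane E. Smith", "city": "Helena"},
    {"id": 4, "name": "Robert Jones", "city": "Billings"},
    {"id": 5, "name": "Maria Garcia", "city": "Bozeman"},
]
# Ground truth, used only to simulate the user's answers.
true_entity = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "d"}
FIELDS = ("name", "city")


def features(r1, r2):
    """Bias term plus one string similarity in [0, 1] per field."""
    sims = [SequenceMatcher(None, r1[f].lower(), r2[f].lower()).ratio()
            for f in FIELDS]
    return [1.0] + sims


def predict(weights, x):
    """Learned similarity: probability that a pair is the same entity."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled, epochs=500, lr=0.5):
    """Fit logistic regression weights by stochastic gradient descent."""
    weights = [0.0] * (len(FIELDS) + 1)
    for _ in range(epochs):
        for x, y in labeled:
            error = predict(weights, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights


def ask_user(r1, r2):
    """Stand-in for interactive feedback: 1.0 if duplicates, else 0.0."""
    return 1.0 if true_entity[r1["id"]] == true_entity[r2["id"]] else 0.0


def active_learn(records, rounds=6):
    pool = list(combinations(records, 2))  # unlabeled candidate pairs
    labeled = []
    weights = [0.0] * (len(FIELDS) + 1)
    for _ in range(rounds):
        # Uncertainty sampling: query the pair whose score is closest to 0.5.
        pair = min(pool, key=lambda p: abs(predict(weights, features(*p)) - 0.5))
        pool.remove(pair)
        labeled.append((features(*pair), ask_user(*pair)))
        weights = train(labeled)  # retrain on all labels gathered so far
    return weights, labeled


weights, labeled = active_learn(records)
for r1, r2 in combinations(records, 2):
    score = predict(weights, features(r1, r2))
    if score > 0.5:
        print(r1["name"], "<->", r2["name"], round(score, 2))
```

Unlike a fixed edit or cosine distance, the weights here are fitted to the user's answers, so fields that matter more for a given dataset end up contributing more to the score. Uncertainty sampling is only one way to choose the next pair to label; it was picked here because it is short to write down.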


William Lyon is a Developer Relations Engineer at Neo4j, the open source graph database, where he builds tools for integrating Neo4j with other technologies and helps users be successful with graphs. He also leads Neo4j's Data Journalism Accelerator Program. Prior to Neo4j, he worked as a software engineer for a variety of startups, building quantitative trading tools, predictive APIs, and software for the real estate industry. William holds a master's degree in Computer Science from the University of Montana. You can find him online at
