A New Indexing Technique for Quickly Fuzzy-Matching Entire Dataset Records


One method for approximate ("fuzzy") matching of two strings is to compute the Levenshtein distance between them and accept the pair as a match when the distance is suitably low. Deletion Neighborhoods is an indexing technique that makes this type of comparison fast enough for time-sensitive use.
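As a rough illustration of the idea, here is a minimal Python sketch of string-oriented deletion neighborhoods. It is not taken from the talk: the function names (`levenshtein`, `deletion_neighborhood`, `build_index`, `lookup`) and the threshold parameter `k` are assumptions chosen for this example. The sketch relies on the fact that two strings within Levenshtein distance `k` always share at least one variant formed by deleting at most `k` characters from each, so candidates can be found by exact hash lookups and then verified.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def deletion_neighborhood(word: str, k: int) -> set:
    """All strings obtained by deleting at most k characters from word."""
    variants = set()
    for d in range(min(k, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - d):
            variants.add("".join(word[i] for i in keep))
    return variants


def build_index(words, k):
    """Map every deletion variant to the set of words that produce it."""
    index = defaultdict(set)
    for w in words:
        for v in deletion_neighborhood(w, k):
            index[v].add(w)
    return index


def lookup(index, query, k):
    """Collect candidates by exact variant lookup, then verify each one."""
    candidates = set()
    for v in deletion_neighborhood(query, k):
        candidates |= index.get(v, set())
    # Sharing a variant only bounds the distance by 2k, so verify explicitly.
    return {w for w in candidates if levenshtein(query, w) <= k}


index = build_index(["smith", "smyth", "jones"], k=1)
matches = lookup(index, "smith", k=1)
```

The trade-off in this sketch is index size against query speed: each word is stored under every one of its deletion variants, but a query needs only a handful of dictionary lookups followed by a distance check on the few candidates found.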

In this talk, we review string-oriented Deletion Neighborhoods and present a novel application in which a similar technique is applied to entire dataset records. Careful application of both string- and record-oriented indexing techniques allows for powerful searching and record deduplication capabilities.
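The abstract does not say how the record-oriented variant works, so the following Python sketch is only one plausible reading, not the speaker's method. It assumes a record is a fixed-length tuple of fields and that a "deletion" masks a whole field; the names `MASKED`, `record_neighborhood`, and `near_duplicate_pairs` are hypothetical. Under that assumption, two records share a masked key exactly when they differ in at most `k` fields.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical placeholder marking a masked field; a unique object
# cannot collide with any real field value.
MASKED = object()


def record_neighborhood(record, k):
    """All keys formed by masking at most k fields of the record."""
    n = len(record)
    keys = set()
    for d in range(min(k, n) + 1):
        for masked in combinations(range(n), d):
            keys.add(tuple(MASKED if i in masked else value
                           for i, value in enumerate(record)))
    return keys


def near_duplicate_pairs(records, k):
    """Pairs of record positions whose records differ in at most k fields.

    Assumes every record has the same number of fields.
    """
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for key in record_neighborhood(rec, k):
            index[key].add(rid)
    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


records = [
    ("Dan", "Smith", "Boston"),
    ("Dan", "Smith", "Cambridge"),
    ("Ann", "Jones", "Boston"),
]
pairs_k1 = near_duplicate_pairs(records, k=1)
pairs_k2 = near_duplicate_pairs(records, k=2)
```

In this sketch fields are compared exactly; combining it with a string-level fuzzy comparison per field would be a further step that the abstract hints at but does not specify.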


Dan has been with LexisNexis Risk Solutions Group since 2014 and is an Enterprise Architect in the Solutions Lab Group. He has worked for Apple as well as Dun & Bradstreet, and he ran his own custom programming shop for a decade. He's been writing software professionally for more than 40 years and has worked on a myriad of systems, using many different programming languages.

Open Data Science




