
Abstract: Our research into applying fuzzy matching to surveys provides a real-world context for understanding machine learning algorithms, one that appeals to technical and non-technical audiences alike. With more and more data being collected from consumers, an efficient way to align survey changes over time is necessary if the data is to be used before it becomes outdated. In the real world, naming conventions change all the time, either on purpose or by accident, so how do we match that data over time? With the right domain expertise, the solution might be simple enough for one or two variables, but some surveys have thousands of variables. A survey with hundreds of questions also has thousands of responses associated with it, and slight modifications, such as changing a response choice from '5 or more' to '5+', are impossible to match using traditional lookup algorithms, which require exact matches. In our talk, we will introduce the complexities of aligning survey data across time, in particular the time and effort required to match responses as surveys are updated. Specifically, we will discuss a survey that collects information on consumers' attitudes, usage, and purchases for over 6,000 products. The survey data, which contains 20,000 variables across 26 categories, is refreshed twice a year. While approximately 80% of the questions remain the same, the remaining 20% represent new questions and modified answer choices, so approximately 4,000 variables must be examined to align the new data with the previous survey responses. Our talk will include an explanation of why this problem lends itself to a 'fuzzy' solution, and we will show how we leveraged the Levenshtein algorithm to match responses from one period to the next. Attendees will walk away with a high-level understanding of fuzzy matching algorithms and learn how they can be effectively applied to solve a business problem.
If you would like an overview of how we applied fuzzy matching to surveys, a snapshot of our work is also available online: http://www.nielsen.com/us/en/insights/journal-of-measurement/volume-1-issue-3/fuzzy-matching-to-the-rescue-aligning-survey-design-across-time.html
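To make the idea in the abstract concrete, here is a minimal Python sketch of matching a new response label against the previous period's labels using Levenshtein (edit) distance. It is an illustration only, not the implementation used in the talk: the function names, the lowercase/strip normalization, the length-normalized similarity score, and the 0.5 threshold are all assumptions chosen for this example.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(label: str, candidates: Iterable[str],
               threshold: float = 0.5) -> Optional[Tuple[str, float]]:
    """Return the (candidate, score) pair most similar to `label`,
    or None if no candidate reaches `threshold` (hypothetical cutoff)."""
    normalized = label.strip().lower()  # assumed normalization step
    best: Optional[Tuple[str, float]] = None
    for candidate in candidates:
        score = similarity(normalized, candidate.strip().lower())
        if best is None or score > best[1]:
            best = (candidate, score)
    if best is None or best[1] < threshold:
        return None
    return best
```

Note that raw edit distance penalizes length differences: `levenshtein("5 or more", "5+")` is 8, so a short relabeling like this scores low on normalized similarity. One way to handle such cases is to rank candidates within the same question rather than rely on a fixed threshold alone.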
Bio: Jennifer Shin is a Senior Principal Data Scientist at The Nielsen Company and the Founder of 8 Path Solutions, a data science, analytics, and technology company in NYC. Jennifer is an experienced data scientist and management consultant who has led complex, large-scale, and high-profile projects for corporate, public, and private clients, including GE Capital, the Carlyle Group, Fortress Investment Group, the City of New York, and Columbia University. A recognized thought leader, her expertise has been featured in numerous publications, including USA Today, VentureBeat, and Reuters, and she has been identified by IBM as a Big Data & Analytics Hero and an Industry Influencer. Jennifer is also on the faculty of UC Berkeley's Master of Information and Data Science program, where she lectures in statistics and data science. She is also an instructor at Columbia Business School and serves on the Advisory Board for the M.S. in Data Analytics program at the City University of New York. Jennifer earned both her undergraduate degree in Economics, Mathematics, and Creative Writing and her graduate degree in Statistics from Columbia University.

Jennifer Shin
Title
Senior Principal Data Scientist at The Nielsen Company
