Unsupervised Learning Approach for Identifying Retail Store Employees Using Footfall Data


Analyzing customer visits to a store (footfall), traced via geolocation-enabled devices, helps digital firms better understand customers and their buying behavior. Insights from geo-footfall analysis help clients and advertisers make informed decisions, choose profitable regions, recognize relevant advertising opportunities, and analyze their competitors to increase their success rate. But all this information can be misleading if people who walk past the store without entering, and the store's own staff, are not excluded. These two groups, passers-by and store employees, can therefore be treated as outliers in the footfall data, and their behavior is expected to differ from that of actual customers.

The data collected by geofencing the stores, together with the pings from the SDK on geo-enabled devices, does not by itself tag these outliers, so they are not obvious and cannot be removed by extreme value analysis alone. To tackle this problem, we formulated a multivariate approach to identify and remove these outliers from our source data. Because we have no labeled data marking a footfall as an employee or a customer, we use an unsupervised outlier detection model based on the DBSCAN algorithm to produce a coherent and complete dataset with labeled outliers. Along the way, we considered different techniques for assessing how effective each feature is. Features such as the time a visitor spends in and around the stores compared with other locations, monthly visit frequency, and daily visit frequency were dominant in tagging the outliers.
Discovering the structure of the data was another key step in optimizing the DBSCAN parameters for our use case, namely epsilon (the neighborhood radius) and the minimum number of points.
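
To make the idea concrete, here is a minimal pure-Python sketch of DBSCAN-based outlier tagging on footfall-style features. Everything in it is an illustrative assumption, not the pipeline used in this work: the synthetic visitor rows, the three feature columns (dwell minutes per visit, visits per month, visits per day), the helper names `standardize`, `k_distances`, `knee_eps` and `dbscan`, and the crude "largest jump in the sorted k-distance curve" rule for picking epsilon.

```python
import math

def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def k_distances(points, k):
    """Sorted distance from every point to its k-th nearest other point."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)

def knee_eps(kd):
    """Crude knee heuristic: the k-distance just before the largest jump."""
    jumps = [b - a for a, b in zip(kd, kd[1:])]
    return kd[jumps.index(max(jumps))]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    # Each neighbourhood includes the point itself (distance 0).
    neighbours = [[j for j in range(n)
                   if math.dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels

# Synthetic rows: (dwell minutes per visit, visits per month, visits per day).
customers = [(d, m, 1) for d in (5, 10, 15, 20, 25, 30) for m in (1, 2, 3, 4)]
employees = [(480, 22, 2), (300, 15, 3), (540, 26, 4)]  # hypothetical staff
visitors = customers + employees

min_pts = 5
X = standardize(visitors)
eps = knee_eps(k_distances(X, min_pts - 1))
labels = dbscan(X, eps, min_pts)
suspected = [i for i, label in enumerate(labels) if label == -1]
print("eps:", round(eps, 3), "suspected outlier rows:", suspected)
```

Standardizing first matters because dwell time in minutes would otherwise swamp the visit counts in the Euclidean distance. On real data the k-distance curve is rarely this clean, so the epsilon chosen by any automatic rule is best checked by eye.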

Finally, we evaluated the results against those obtained with the k-means algorithm. The comparison showed that DBSCAN has a higher detection rate and a low rate of false positives in discovering outliers for this problem.
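
For readers unfamiliar with the two metrics, the sketch below shows how detection rate and false-positive rate can be computed when some ground truth is available, for example a small hand-verified sample. That sample is an assumption; the abstract does not say how ground truth was obtained. The function name `detection_and_fp_rate` and the toy flags for `method_a` and `method_b` are hypothetical and are not results from this work.

```python
def detection_and_fp_rate(is_employee, flagged):
    """Return (detection rate, false-positive rate) for boolean lists.

    detection rate      = flagged employees / all employees
    false-positive rate = flagged customers / all customers
    """
    tp = sum(1 for t, f in zip(is_employee, flagged) if t and f)
    fp = sum(1 for t, f in zip(is_employee, flagged) if not t and f)
    positives = sum(1 for t in is_employee if t)
    negatives = len(is_employee) - positives
    detection = tp / positives if positives else 0.0
    false_pos = fp / negatives if negatives else 0.0
    return detection, false_pos

# Toy ground truth for six visitors: the first two are employees.
truth = [True, True, False, False, False, False]

# Hypothetical flags from two outlier detectors (illustrative values only).
method_a = [True, True, False, False, False, False]
method_b = [True, False, True, False, False, False]

rates_a = detection_and_fp_rate(truth, method_a)
rates_b = detection_and_fp_rate(truth, method_b)
print("method_a:", rates_a)
print("method_b:", rates_b)
```

A detector is preferable when it flags more of the true employees while flagging fewer genuine customers, which is the sense in which the two algorithms are compared above.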


Soumya Jain is currently a Data Scientist II at MiQ. She holds an engineering degree in computer science from BIT Durg and an MTech from IIIT Bangalore with a specialization in data science. She has been employed for the past 1.5 years. She has a keen interest in data science; finding valuable information in a dataset and making great stories out of it is what drives her to learn more.

Open Data Science




