
Abstract: Many retail giants suffer heavy inventory loss and shrinkage because they process a huge number of transactions every day across the U.S. and offer liberal shopping policies to make shopping and returns convenient for customers. Investigating customers’ return behaviors and preventing fraudulent activities is challenging for two reasons: 1) the transaction data are complex and highly imbalanced, since the volume of transactions is enormous while each type of anomaly is exceedingly rare; 2) predefined labels are seldom available, since it is not feasible for human experts to manually review every transaction and identify anomalies. It is therefore imperative to have an efficient and generalized anomaly detection system that is capable of: 1) automatically detecting previously unknown anomalous customer activities instead of depending on manual reviews; 2) enhancing real-time fraud detection by enriching the existing labels with more meaningful signals.
Traditional anomaly detection methods are built on a predetermined set of assumptions and target one specific type of anomaly, such as point, contextual, or collective anomalies. In this work, by contrast, we propose a systematic, flexible, extensible, and holistic anomaly detection architecture that augments the existing labels and detects anomalies at low cost. The system can incorporate deep learning-based anomaly detection models as well as traditional machine learning models, and it uses an ensemble stacking algorithm to generate a unified anomaly score that addresses different types of anomalies simultaneously. Specifically, the system consumes transaction data and features from different sources and is engineered to learn normal behavior from good customers, which helps it identify and separate fraudulent customer activities at run time. Instead of building one anomaly detection model with limited functionality, we build a plurality of individual anomaly detection models. For example, individual models can be deep learning-based models such as autoencoders, statistical models such as Gaussian-based models, or tree-based models. Each individual model’s score and its anomaly detection performance are taken into account to generate the unified anomaly score. Next, new anomalies are identified based on the distribution of unified anomaly scores and tagged as anomalies for label augmentation. Enriched with the newly introduced fraud signals from this anomaly detection system, along with business knowledge from human experts, our data science team has developed a series of sophisticated, automated real-time fraud detection engines to detect fraudulent transaction activities.
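To make the score-combination idea concrete, here is a minimal Python sketch, not the system described above. It assumes two toy detectors (a Gaussian z-score model and a nearest-neighbor distance model) standing in for the autoencoder, Gaussian, and tree-based models, and it replaces the stacking algorithm with a plain performance-weighted average of percentile ranks. The feature names, the weights, and the 99% cutoff are all hypothetical.

```python
import math
import random
import statistics

def fit_gaussian(rows):
    """Per-feature mean and standard deviation learned from normal customers."""
    cols = list(zip(*rows))
    return [(statistics.fmean(c), statistics.pstdev(c) or 1.0) for c in cols]

def standardize(params, row):
    """Convert raw features to z-scores using the fitted parameters."""
    return [(x - m) / s for x, (m, s) in zip(row, params)]

def gaussian_score(z):
    """Mean squared z-score: larger means further from normal behavior."""
    return sum(v * v for v in z) / len(z)

def knn_score(train_z, z):
    """Distance to the nearest normal training point in z-score space."""
    return min(math.dist(z, t) for t in train_z)

def percentile_rank(reference, value):
    """Fraction of reference scores <= value; puts every model on a [0, 1] scale."""
    return sum(r <= value for r in reference) / len(reference)

# Hypothetical features: (return rate, average return amount in dollars).
random.seed(0)
def normal_customer():
    return (random.gauss(0.10, 0.02), random.gauss(50.0, 10.0))

train = [normal_customer() for _ in range(200)]   # used to fit the detectors
calib = [normal_customer() for _ in range(300)]   # held-out normal customers

params = fit_gaussian(train)
train_z = [standardize(params, r) for r in train]

def model_scores(row):
    """Raw anomaly score from each individual model."""
    z = standardize(params, row)
    return [gaussian_score(z), knn_score(train_z, z)]

# One reference distribution of scores per model, taken from held-out normals.
reference = list(zip(*(model_scores(r) for r in calib)))

# Hypothetical weights standing in for each model's measured performance.
weights = [0.6, 0.4]

def score(row):
    """Unified anomaly score: weighted average of per-model percentile ranks."""
    ranks = [percentile_rank(ref, s) for ref, s in zip(reference, model_scores(row))]
    return sum(w * r for w, r in zip(weights, ranks)) / sum(weights)

# Tag as anomalous anything at or above the 99th percentile of normal scores.
calib_unified = sorted(score(r) for r in calib)
threshold = calib_unified[int(0.99 * len(calib_unified))]

new_customers = [(0.10, 52.0), (0.90, 400.0)]
flags = [score(r) >= threshold for r in new_customers]
```

Here `flags` marks which of the two new customers would be tagged for label augmentation. The first customer resembles the training population and the second is far outside it, so the second should be the one flagged. In a real stacking setup, the fixed weights would be replaced by a meta-model trained on the individual model scores.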
Bio: Chuying (Annie) Ma is a senior data scientist on the Walmart Inkiru team, where she develops and implements machine learning models and strategies for real-time fraud detection and risk mitigation. She holds a Master of Science degree in Biostatistics from Harvard University and a bachelor’s degree in Statistics and Mathematics from the University of Michigan – Ann Arbor. In her spare time, she enjoys playing the violin and ukulele.