
Abstract: This session focuses on identifying rare events in text using positive unlabeled (PU) data. PU learners are widely used for one-class classification, but the challenge becomes far steeper when the event of interest has a low probability of occurrence. We will discuss a novel algorithm (iCASSTLe), included in IEEE ICMLA 2018, which uses a two-stage semi-supervised approach to extract the relevant recall set using core components of NLP. By the end of this workshop, you will have a basic understanding of the following:
- Difference between rare events & anomalies
- Basics of Text Mining
- Motivation behind Semi-Supervised Learners
- Training PU Learners for Rare Events
Session Outline
Module I: Rare Events & how they differ from Anomalies
- Examples
- Major differences
- Degree of rarity
Module II: Rare Events in Text
- Examples
- Sentiment Inclination
- Token Sensitivity
- Data availability
Module III: Text Mining
- Text Cleaning & Pre-processing for Rare Events
- Numeric Representation of Text
- Live Exercise (R/Python script will be provided)
Module IV: Positive Unlabeled Learning
- Motivation & Examples
- Live Exercise
Module V: Semi-Supervised Learning
- Motivation
- Entropy Regularization
- Logistic Regression (Binary Classification) with semi-supervision
Module VI: iCASSTLe
- Example use case
- Quantifying Degree of Severity
- Metric Formation & Stage I Classification
- Stage II Classification with ERLG
- Live Exercise
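Illustrative sketch for Module V: the topics "Entropy Regularization" and "Logistic Regression (Binary Classification) with semi-supervision" can be pictured with a small NumPy example. This is a generic sketch and not the iCASSTLe implementation or its Stage II classifier. It assumes a small labeled set plus a larger unlabeled set, and minimizes the labeled cross-entropy plus a weight (lam) times the mean prediction entropy on the unlabeled points, which nudges the decision boundary away from dense unlabeled regions. The function names and the hyperparameters lam, lr and n_iter are hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping real scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def add_bias(X):
    """Append a constant column so the model learns an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_entropy_regularized_lr(X_lab, y_lab, X_unl, lam=0.1, lr=0.1, n_iter=500):
    """Gradient descent on: labeled cross-entropy + lam * mean unlabeled entropy.

    All hyperparameter defaults are illustrative assumptions.
    """
    Xl, Xu = add_bias(X_lab), add_bias(X_unl)
    w = np.zeros(Xl.shape[1])
    for _ in range(n_iter):
        # Supervised part: gradient of the mean logistic loss on labeled data.
        p_l = sigmoid(Xl @ w)
        grad_sup = Xl.T @ (p_l - y_lab) / Xl.shape[0]

        # Unsupervised part: for p = sigmoid(z), the binary entropy H(p)
        # has derivative dH/dz = -z * p * (1 - p).
        z_u = Xu @ w
        p_u = sigmoid(z_u)
        grad_ent = Xu.T @ (-z_u * p_u * (1.0 - p_u)) / Xu.shape[0]

        w -= lr * (grad_sup + lam * grad_ent)
    return w


def predict_proba(X, w):
    """Predicted probability of the positive class."""
    return sigmoid(add_bias(X) @ w)
```

Setting lam=0 reduces this to ordinary logistic regression on the labeled points alone, which makes it easy to compare the two fits on the same data.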
Background Knowledge
Basics of statistical learning, linear algebra (matrix factorization, vector spaces and distance), probability, logistic regression, entropy, Monte Carlo simulations, and NLP basics; fair exposure to coding in R/Python.
Bio: Debanjana is a Senior Data Scientist at Walmart Labs with 4+ years of experience in tech. At Walmart, she has been instrumental in developing ML-driven solutions in the compliance space, working extensively with Natural Language Processing, Mixture Models and Rare Time Series. Currently, her focus is on building an AI to enable automated shelf curation for creative content on Walmart.com. She has filed 5 US patents in the fields of Clustering & Anomaly Detection, Imbalanced Text Classification and Stochastic Processes. In addition, she has three published papers to her credit. Debanjana has a master's degree in Statistics from the Indian Institute of Technology (Kanpur).

Debanjana Banerjee
Senior Data Scientist | Walmart Labs
