Abstract: Target leakage is one of the most difficult problems in developing real-world machine learning models. Leakage occurs when the training data gets contaminated with information that will not be known at prediction time. Additionally, there can be multiple sources of leakage, from data collection and feature engineering to partitioning and model validation. As a result, even experienced data scientists can inadvertently introduce leaks and become overly optimistic about the performance of the models they deploy. In this talk, we will look through real-life examples of data leakage at different stages of the data science project lifecycle, and discuss various countermeasures and best practices for model validation.
Bio: Yuriy Guts is a Machine Learning Engineer at DataRobot with over 10 years of industry experience in data science and software architecture. His primary interests are productionalizing data science, automated machine learning, time series and multiseries forecasting, processing spoken and written language. He teaches AI and ML at UCU, and has led multiple data science and engineering teams in Ukraine.