Abstract: Our job as data scientists is to demand answers from the data, even if these answers are sometimes not in line with what we would like to hear. As they say, “The only thing worse than not knowing is not wanting to know”. There are many ways in which our data, the models we build with it, or the laws of statistics can mislead us into drawing the wrong conclusions. To be successful, we have to navigate our way through common pitfalls ranging from outliers to overfitting, selection biases, etc. And even that is sometimes not enough: we also have to be wary of more subtle phenomena, such as Simpson’s paradox.
So how do we figure out whether the results of an analysis are unbiased and indeed depict a proper view of reality? How do we check whether a model is accurate? And how do we check whether the data we are using are even correct? Validation might just be the most important part of the analytical process, yet it is often the most overlooked one. Thankfully, generations of statisticians have developed methods to confirm or refute their results, and it is usually possible to catch your data in a lie before the model starts impacting the business irrevocably.
In this talk, I will discuss not only the many ways, both human-driven and technical, in which data can deceive analysts, but also some of the tools for avoiding that deception and the consequences that can result if you don’t ensure that your data are actually telling you the truth, the whole truth, and nothing but the truth.
Bio: Dr. Jennifer Prendki is the Head of Data Science at Atlassian, where she leads all Search and Machine Learning initiatives and is in charge of leveraging the massive amount of data collected by the company to load the suite of Atlassian products with smart features. She received her PhD in Particle Physics from University UPMC - La Sorbonne in 2009 and has since worked as a data scientist in many different industries. Prior to joining Atlassian, Jennifer was a Senior Data Science Manager in the Search team of Walmart eCommerce. She enjoys addressing both technical and non-technical audiences at conferences and sharing her knowledge and experience with aspiring data scientists.