
Abstract: Suppose you are an ecologist studying a rare species like a pangolin. You can use motion-triggered camera traps to collect data about the presence and abundance of species in the wild, but for every video showing a pangolin, you have 100 that show other species, and 100 more that are blank. You might have to watch hours of video to find one pangolin.
Deep learning can help. Project Zamba (https://zamba.drivendata.org) provides models that classify camera trap videos and identify the species that appear in them. Of course, the results are not perfect, but we can often remove 80% of the videos we don’t want while losing only 10-20% of the videos we want.
But there’s a problem. The output from deep learning classifiers is generally a “confidence”, not a probability. If a classifier assigns a label with 80% confidence, that doesn’t mean there is an 80% chance it is correct. However, with a modest number of human-generated labels, we can often calibrate the output to produce more accurate probabilities, and make better predictions.
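As an illustration of the idea (not necessarily the method used in the talk), the calibration step can be sketched with isotonic regression from scikit-learn: we take a modest set of human-verified labels, learn a monotonic mapping from reported confidence to empirical accuracy, and use that mapping to convert confidences into probabilities. All data here is synthetic.

```python
# Hedged sketch: calibrating classifier confidences against a small set of
# human-verified labels, using isotonic regression -- one common calibration
# technique. The data is simulated; the talk's actual models and methods
# may differ.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(42)

# Simulate an overconfident classifier: the actual chance that a label is
# correct is lower than the reported confidence score.
confidences = rng.uniform(0.5, 1.0, size=1000)
true_prob = confidences ** 2                 # actual accuracy at each score
labels = rng.random(1000) < true_prob        # human-verified correct / incorrect

# Learn a monotonic mapping from reported confidence to empirical accuracy.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(confidences, labels)

# A reported 80% confidence maps to a lower calibrated probability,
# which is a better basis for deciding which videos to review.
calibrated = calibrator.predict([0.8])[0]
print(f"confidence 0.80 -> calibrated probability {calibrated:.2f}")
```

With the calibrated probabilities in hand, a user can choose a review threshold that trades off the fraction of unwanted videos removed against the fraction of wanted videos lost, as described above.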
In this talk, I’ll present use cases based on data from Africa, Hawaii, and New Zealand, and show how we can use deep learning and calibration to save the pangolin… or at least the pangolin videos. This real-world problem shows how users of ML models can tune the results to improve performance on their applications.
Background Knowledge:
No specific knowledge is required, just a general understanding of what a classifier is.
Bio: Allen Downey is a Staff Scientist at DrivenData and professor emeritus at Olin College. He is the author of several books related to computer science and data science, including Think Python, Think Stats, Think Bayes, and Think Complexity. His blog, Probably Overthinking It, features articles about Bayesian statistics. He received his Ph.D. in Computer Science from U.C. Berkeley, and M.S. and B.S. degrees from MIT.