Abstract: Regardless of where you are in your data science career, you will eventually be confronted with datasets that cannot fit into the memory of a single machine, and with the problems that often accompany them. In this talk, we will review key strategies that will help you adapt to your growing datasets. Importantly, we will consider when you might choose one strategy over another.
We will discuss different approaches you can take to adapt your data so that it fits in your existing analysis framework. Then we will review the steps you can take when the data are simply too big to fit in the RAM of a single machine. We will examine how you might speed up calculations using parallel processes and/or GPUs, with frameworks such as Python's Dask and R's future package.
This discussion will equip you with strategies to tackle larger datasets. More data does not have to mean more problems!
Two broad problems with larger datasets: memory and speed
An overview of strategies to address these problems
Strategy 1: Make the data smaller - sample, summarize, and/or optimize your data to make it fit on your machine(s).
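A minimal pandas sketch of this strategy (the column names and sizes are made up for illustration): downcast numeric dtypes to shrink the in-memory footprint, then work with a random sample instead of every row.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame standing in for a dataset that barely fits in RAM.
df = pd.DataFrame({
    "id": np.arange(1_000_000, dtype=np.int64),
    "value": np.random.rand(1_000_000),  # float64 by default
})
before = df.memory_usage(deep=True).sum()

# Optimize: downcast int64 -> int32 and float64 -> float32,
# assuming the reduced precision is acceptable for the analysis.
df["id"] = pd.to_numeric(df["id"], downcast="integer")
df["value"] = pd.to_numeric(df["value"], downcast="float")
after = df.memory_usage(deep=True).sum()

# Sample: analyze a 10% subset instead of the full data.
sample = df.sample(frac=0.1, random_state=42)
```

Downcasting alone roughly halves the footprint here; combined with sampling or pre-aggregation, a dataset can often be made to fit without changing the analysis code at all.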
Strategy 2: Buy your way out - use cloud resources (e.g., a machine with more RAM) to solve the problem while keeping your code the same.
Strategy 3: Analyze the data in smaller chunks on a single node and combine the results.
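This works for any aggregate that can be built from per-chunk partial results. A minimal pandas sketch, with an in-memory StringIO standing in for a large CSV on disk:

```python
import io
import pandas as pd

# Hypothetical "big" file: only one chunk is ever held in memory.
big_csv = io.StringIO("x\n" + "\n".join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(big_csv, chunksize=3):  # 3 rows per chunk
    total += chunk["x"].sum()  # combine per-chunk partial sums

print(total)  # same answer as summing the whole column at once
```

Sums, counts, min/max, and group-wise aggregates all decompose this way; statistics like medians need more care because they cannot be combined exactly from chunk results.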
Strategy 4: Analyze the data in smaller chunks on a multi-node cluster using a big data framework.
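Frameworks like Dask automate the same chunk-and-combine idea and can distribute the chunks across threads, processes, or cluster nodes. A minimal sketch, assuming dask is installed; the same code scales out unchanged when connected to a multi-node scheduler:

```python
import dask.array as da

# The array is split into 10 chunks; Dask builds a lazy task graph
# whose pieces its scheduler can run in parallel.
x = da.arange(1_000_000, chunks=100_000)
total = x.sum().compute()  # nothing executes until .compute()
print(total)
```

The key design point is laziness: operations build a task graph rather than running immediately, so the scheduler can decide where and in what order chunks are processed.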
An example workflow in Saturn Cloud illustrating these strategies
An overview of other strategies to note - JIT compilation and code optimization
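For hot loops, a JIT compiler such as Numba can compile Python to machine code; often, though, plain code optimization (vectorization) gives a comparable speedup. A sketch of the optimization side, replacing a Python loop with a single NumPy call:

```python
import numpy as np

def slow_sum_of_squares(arr):
    # Pure-Python loop: the kind of function a JIT compiler
    # (e.g., Numba's @njit decorator) could accelerate.
    total = 0.0
    for x in arr:
        total += x * x
    return total

def fast_sum_of_squares(arr):
    # Vectorized equivalent: one call into optimized native code.
    return float(np.dot(arr, arr))

data = np.arange(10, dtype=np.float64)
assert slow_sum_of_squares(data) == fast_sum_of_squares(data)
```

Both return the same result; the vectorized version avoids per-element Python interpreter overhead entirely.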
Bio: Nathan Ballou is a Senior Data Scientist at Saturn Cloud, a cloud workspace for the whole data science team. Prior to working at Saturn Cloud, Nathan worked as a data science consultant and as an operations research analyst. When Nathan's not evangelizing machine learning at Saturn Cloud, he can be found rowing on the Patapsco River in Baltimore.