Abstract: Moving data, transforming data types, taking small samples so they’ll fit in your sandbox – these are all things every data scientist puts up with as part of the routine of getting their job done. And when you’re finished, of course, some data engineer has to build full production pipelines to reproduce all that work at scale. But what if you could leave all the data where it is and analyze it in place? What if you could jump straight to the meat of the work, and when you’re finished, a single line of code would push it all into production?
In this talk, you’ll learn an alternative data science method that uses Jupyter notebooks and modern in-database machine learning. Using familiar pandas and scikit-learn code, you’ll build a churn reduction model and put it to work in an environment exactly like production.
You’ll familiarize yourself with how data is stored and accessed, how machine learning works in the database, and what functionality is available, and you’ll see how to set up a Jupyter environment. A step-by-step walkthrough of a churn reduction example using a publicly available data set will demonstrate how to explore and visualize correlations, outliers, distributions, and statistics; convert categorical variables to Boolean; and determine which variables are likely to contribute most to accuracy. While modifying the data set as needed for algorithm compatibility in a familiar environment, you’ll see how to use a distributed database to do the work across far larger data sets.
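The preparation steps described above can be sketched in plain pandas. Note this is a minimal illustration, not the talk's actual code: the column names and tiny inline data set are assumptions standing in for the public churn data.

```python
import pandas as pd

# Hypothetical churn data; column names are illustrative assumptions.
df = pd.DataFrame({
    "intl_plan": ["yes", "no", "no", "yes"],        # categorical
    "day_minutes": [265.1, 161.6, 243.4, 299.4],
    "svc_calls": [1, 1, 0, 2],
    "churn": [False, False, False, True],            # label
})

# Convert a categorical column to Boolean for algorithm compatibility.
df["intl_plan"] = df["intl_plan"].map({"yes": True, "no": False})

# Explore: summary statistics, plus correlation with the churn label --
# a quick first pass at spotting variables likely to contribute to accuracy.
print(df.describe())
print(df.corr()["churn"].sort_values(ascending=False))
```

In the in-database approach the same operations run as pushed-down queries against the full table rather than against an in-memory sample.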
Using the prepared data, you’ll see how to train and validate a couple of different churn probability models, evaluate each model, and compare results. You’ll save the model in the database and apply it to new data, and see the relatively short path from a trained model to an in-production model. You’ll even get a basic idea of how to compare models, version them, and retrain.
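The train-validate-compare loop follows a pattern like the scikit-learn sketch below. This is an assumption-laden stand-in: the talk's workflow runs these steps with the database's own ML functions, and `make_classification` substitutes synthetic data for the prepared churn set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared churn data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a couple of candidate churn models and compare results
# on held-out validation data.
results = {}
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    results[type(model).__name__] = accuracy_score(
        y_test, model.predict(X_test))

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

In the in-database version, the saved model is a database object, so "push to production" amounts to calling it from SQL against new rows rather than redeploying a separate service.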
• Modern in-database data preparation and machine learning
• How to use Python code and a Jupyter notebook to work with large scale data inside a database
• How to manage, train, and evaluate models inside a database
• What makes your model ready for production and how to get it there
Bio: In two decades in the data management industry, Paige Roberts has worked as an engineer, a trainer, a support technician, a technical writer, a marketer, a product manager, and a consultant.
She has built data engineering pipelines and architectures, documented and tested large scale open source analytics implementations, spun up Hadoop clusters from bare metal, picked the brains of some of the stars in the data analytics and engineering industry, championed data quality when that was supposedly passé, worked with a lot of companies in a lot of different industries, and questioned a lot of people's assumptions.
Now, she promotes understanding of Vertica, MPP data processing, open source, high scale data engineering, and how the analytics revolution is changing the world.