Abstract: Moving data, transforming data types, taking small samples so they’ll fit in your sandbox – these are all things every data scientist puts up with as routine. And when you’re finished, a data engineer has to build full production pipelines to reproduce all that work at scale. But what if you could leave all the data where it is and analyze it in place?
What if you could jump straight to the meat of the work, and when you’re finished, a single line of code would push it all into production?
In this tutorial, you'll use familiar Pandas and scikit-learn code to build a churn reduction model without ever moving data. You'll learn:
• Modern in-database machine learning
• How to use Python code and a Jupyter notebook inside a database
• How to manage, train, and evaluate models inside a database
• What makes your model ready for production and how to get it there
Lesson 1: Set up environment and load data
Familiarize yourself with how data is stored and accessed in a Vertica analytical database. Set up your environment with Matplotlib for visualization, and get a quick tour of what is possible in a VerticaPy notebook.
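As a sketch of the setup step, the snippet below configures a non-interactive Matplotlib backend and assembles hypothetical connection details for a Vertica cluster. The host, credentials, and connection name are placeholders, and the VerticaPy calls are shown in comments only, since they require a live database:

```python
import matplotlib

# Use a non-interactive backend so plots render inside a notebook/server session.
matplotlib.use("Agg")

# Hypothetical connection details -- replace with your cluster's actual values.
conn_info = {
    "host": "vertica.example.com",
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "demo",
}

# With a live cluster, you would then register and open the connection, e.g.:
#   import verticapy as vp
#   vp.new_connection(conn_info, name="churn_tutorial")
#   vp.connect("churn_tutorial")
print(sorted(conn_info))
```

From here, queries against in-database tables return results without pulling the full data set into local memory.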
Lesson 2: Prepare Data
Load the data. Explore and visualize correlations, outliers, distributions, and summary statistics. Convert categorical variables to Boolean indicator variables, and identify the variables likely to contribute most to model accuracy. Modify the data set as needed for algorithm compatibility.
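In the tutorial these steps run in-database with VerticaPy; as a local analogue, the sketch below walks the same steps with Pandas on a made-up miniature churn table (all column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical miniature churn data set -- columns are illustrative only.
df = pd.DataFrame({
    "tenure_months":   [1, 24, 6, 48, 3, 36],
    "monthly_charges": [70.0, 20.0, 55.0, 25.0, 80.0, 30.0],
    "contract":        ["monthly", "yearly", "monthly", "yearly", "monthly", "yearly"],
    "churn":           [1, 0, 1, 0, 1, 0],
})

# Summary statistics for a first look at distributions and outliers.
stats = df.describe()

# Convert the categorical column to Boolean (0/1) indicator variables.
df = pd.get_dummies(df, columns=["contract"], drop_first=True, dtype=int)

# Correlation with the target hints at which variables contribute most.
corr_with_churn = df.corr(numeric_only=True)["churn"].drop("churn")
print(corr_with_churn.sort_values(ascending=False))
```

The same ideas carry over in-database, where the exploration and encoding happen without moving rows out of Vertica.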
Lesson 3: Train and Evaluate Model
Train and validate a couple of different churn probability models. Evaluate each model and compare the results. Save the models in the database and apply them to new data. Practice model comparison, retraining, and versioning.
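The train-and-compare loop above can be sketched with scikit-learn on synthetic stand-in data (the features, labels, and model choices here are assumptions for illustration; the tutorial itself trains in-database):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for churn data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))  # e.g., tenure, charges, support calls
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train two candidate churn models and compare them on a common metric.
models = {
    "logistic": LogisticRegression(),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # churn probability
    scores[name] = roc_auc_score(y_test, proba)
print(scores)
```

Saving a model in the database, rather than serializing it to a local file, is what lets a trained model be applied to new rows in place and versioned alongside the data.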
Prerequisites:
Basic Python - familiarity with Pandas and scikit-learn is helpful
A basic understanding of Jupyter notebooks
Bio: Pranjal Singh has been a Data Scientist at Vertica since May 2020. Prior to joining Vertica, Pranjal received his Bachelor's degree in Data Science from Northeastern University in Boston, MA. He has experience with Software Engineering, Data Analytics, and Machine Learning. Pranjal has a passion for ML and Predictive Analytics, and helping organizations make better decisions with data. He's an avid sports fan, with a special interest in sports analytics and data.