Abstract: Tools like Jupyter notebooks are great for getting started with data science and doing exploratory analysis, but they don’t make great reusable software. If you want to rebuild a model, change parameters and compare results, configure your model for different settings, or generally write data science software, you need to expand your toolkit beyond notebooks.
This talk takes you through the process of formalizing a quick notebook-based data analysis and turning it into something more like modularized, tested, reusable data science software. We will start with a quick sprint through a data science problem from a popular online competition website, using primarily pandas and scikit-learn to do some data exploration, modeling, and validation.
Then we’ll take this code and begin to break it apart and rebuild it, this time as a more formal set of Python scripts. We’ll introduce some simple best practices for writing reusable Python code, rewriting the stream-of-consciousness exploratory data science code as a series of functions that are modular and configurable for reuse. We’ll also add some bells and whistles like light data governance, a command-line interface, and unit testing of the data science code.
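As a rough illustration of that refactoring step, here is a minimal sketch rather than material from the talk itself. It uses only the Python standard library as a stand-in for the pandas and scikit-learn steps, and every name in it (`split_rows`, `accuracy`, `build_parser`, `--test-fraction`, `--seed`) is a hypothetical chosen for this example. It shows a notebook-style step pulled out into configurable functions with a small command-line interface on top.

```python
import argparse
import random


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded Random instance keeps the split repeatable across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def build_parser():
    """Expose the configurable parameters as command-line options."""
    parser = argparse.ArgumentParser(description="Hypothetical train/test split script")
    parser.add_argument("--test-fraction", type=float, default=0.25)
    parser.add_argument("--seed", type=int, default=0)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    rows = list(range(20))  # placeholder data; a real script would load a dataset here
    train, test = split_rows(rows, args.test_fraction, args.seed)
    print(f"train={len(train)} test={len(test)}")
    return train, test


if __name__ == "__main__":
    main()
```

Because each step is a plain function with explicit parameters, it can be imported and checked in isolation. A unit test might, for example, assert that two calls to `split_rows` with the same seed return identical splits.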
This talk is ideal for intermediate data scientists who have some experience using Python in a notebook and want to write more mature, professional data science code. Familiarity with Python and the Python data science stack (pandas, scikit-learn) is assumed, as is some familiarity with machine learning algorithms and best practices. We will briefly cover several algorithms, metrics, and validation methods, but the talk is aimed at participants who already know those basic ideas and want to build more robust and user-friendly tools with them.
Bio: Katie Malone is Director of Data Science at Civis Analytics, a data science software and services company. At Civis she leads the Data Science Research and Development department, which tackles some of Civis' most novel and challenging data science consulting engagements and writes the core data science code that powers the Civis Data Science Platform. A physicist by training, Katie spent her PhD searching for the Higgs boson at CERN and is also the instructor for Udacity's Introduction to Machine Learning course. As a side project she hosts a weekly podcast about data science and machine learning, Linear Digressions.