General Training Session: Turning a data science brain dump into software


Tools like Jupyter notebooks are great for getting started with data science and doing exploratory analysis, but they don’t make great reusable software. If you want to rebuild a model, change parameters and compare results, configure your model for different settings, or generally write data science software, you need to expand your toolkit beyond notebooks.

This talk takes you through the process of formalizing a quick notebook-based data analysis and turning it into something more like modularized, tested, reusable data science software. We will start with a quick sprint through a data science problem from a popular online competition website, using primarily pandas and scikit-learn to do some data exploration, modeling, and validation.

Then we’ll take this code, break it apart, and rebuild it as a more formal set of Python scripts. We’ll introduce some simple best practices for writing reusable Python code, rewriting the stream-of-consciousness exploratory code as a series of functions that are modular and configurable for reuse. We’ll also add some bells and whistles: light data governance, a command-line interface, and unit tests for the data science code.
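As a rough illustration of that refactoring, here is a minimal sketch of what notebook cells might look like once they are split into small functions with a command-line entry point and a unit test. It is not the code from the session: every function name, the `--target` and `--test-fraction` options, and the CSV layout are hypothetical, and a majority-class baseline stands in for the pandas and scikit-learn steps so that the sketch runs with only the standard library.

```python
import argparse
import csv
import io
from collections import Counter


def load_rows(handle, target):
    """Read CSV rows and split off the target column (hypothetical layout)."""
    rows = list(csv.DictReader(handle))
    labels = [row.pop(target) for row in rows]
    return rows, labels


def split(rows, labels, test_fraction):
    """Hold out the last `test_fraction` of the rows for validation."""
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], labels[:cut], rows[cut:], labels[cut:]


def fit_baseline(labels):
    """Stand-in for a real estimator: remember the most common label."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(prediction, labels):
    """Fraction of held-out labels that match the constant prediction."""
    return sum(label == prediction for label in labels) / len(labels)


def run(handle, target, test_fraction):
    """The whole pipeline as one configurable, testable function."""
    rows, labels = load_rows(handle, target)
    _, train_y, _, test_y = split(rows, labels, test_fraction)
    model = fit_baseline(train_y)
    return accuracy(model, test_y)


def parse_args(argv=None):
    """Command-line options; the names here are illustrative only."""
    parser = argparse.ArgumentParser(description="Train and score a baseline.")
    parser.add_argument("csv_path")
    parser.add_argument("--target", default="label")
    parser.add_argument("--test-fraction", type=float, default=0.25)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    with open(args.csv_path, newline="") as handle:
        score = run(handle, args.target, args.test_fraction)
    print(f"accuracy: {score:.3f}")


def test_run_scores_baseline():
    """Unit test on a tiny in-memory dataset; no files needed."""
    data = io.StringIO("x,label\n1,a\n2,a\n3,b\n4,a\n")
    assert run(data, "label", 0.25) == 1.0
```

Because `run` takes an open file handle and plain parameters, the same function serves the command line, a test runner such as pytest, and a notebook. Calling `main()` under the usual `if __name__ == "__main__":` guard would make the module runnable from a shell.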

This course is ideal for intermediate data scientists who have some experience using Python in a notebook and want to write more mature, professional data science code. Familiarity with Python and the Python data science stack (pandas, scikit-learn) is assumed, as is some familiarity with machine learning algorithms and best practices. We will briefly cover several algorithms, metrics, and validation methods, but the talk is aimed at participants who already know those basic ideas and want to build more robust and user-friendly tools with them.


Katie Malone is Director of Data Science at Civis Analytics, a data science software and services company. At Civis she leads the Data Science Research and Development department, which tackles some of Civis' most novel and challenging data science consulting engagements as well as writing the core data science code that powers the Civis Data Science Platform. A physicist by training, Katie spent her PhD searching for the Higgs boson at CERN and is also the instructor for Udacity's Introduction to Machine Learning course. As a side project she hosts a weekly podcast about data science and machine learning, Linear Digressions.

Open Data Science
One Broadway
Cambridge, MA 02142
