Integrating Pandas with Scikit-Learn, an Exciting New Workflow


For Python data scientists, a typical workflow consists of using Pandas for exploratory data analysis before turning to Scikit-Learn for machine learning. Pandas and Scikit-Learn arose independently, each focusing on its own tasks, and were never designed to be integrated with each other. There was never a clearly defined, standardized process for transitioning between the two libraries. This lack of a concrete handoff led practitioners to create a variety of markedly different workflows to make the transition.

One of the main hurdles in the Pandas to Scikit-Learn transition was the handling of string columns. Scikit-Learn's machine learning models accept only numeric arrays as input. The common scenario of taking a Pandas DataFrame with string columns and converting it to an array of only numeric values was quite painful. Another hurdle was processing separate groups of columns with separate functions.

With the recent release of Scikit-Learn version 0.20, many workflows will start looking similar. The brand new ColumnTransformer allows for direct Pandas integration with Scikit-Learn by applying separate transformations to specific subsets of columns. The upgraded OneHotEncoder standardizes the encoding of string columns; previously, it could only encode columns containing numeric categorical data.
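
To make the idea concrete, here is a minimal sketch of how ColumnTransformer and OneHotEncoder might be combined on a small DataFrame. The data and the column names (`neighborhood`, `sq_ft`, `year_built`) are made up for illustration, and pairing the encoder with StandardScaler is just one reasonable choice, not necessarily the workflow taught in the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one string column and two numeric columns.
df = pd.DataFrame({
    "neighborhood": ["East", "West", "East", "North"],
    "sq_ft": [1200.0, 1500.0, 900.0, 2000.0],
    "year_built": [1990.0, 2005.0, 1975.0, 2015.0],
})

# Each tuple is (name, transformer, columns). Columns are selected
# by their DataFrame labels, so no manual conversion to NumPy is needed.
ct = ColumnTransformer(
    transformers=[
        # One-hot encode the string column directly.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
        # Standardize the numeric columns to zero mean and unit variance.
        ("num", StandardScaler(), ["sq_ft", "year_built"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# The result is a purely numeric array ready for a Scikit-Learn model:
# three one-hot columns followed by two scaled numeric columns.
X = ct.fit_transform(df)
print(X.shape)  # (4, 5)
```

The fitted transformer can also be placed as the first step of a Scikit-Learn Pipeline, so that the same column-wise preprocessing is applied consistently at training and prediction time.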

In this hands-on tutorial, we will use these new additions to Scikit-Learn to build a modern, robust, and efficient workflow for those starting from a Pandas DataFrame. Ample practice problems and detailed notes will be available so that you can apply the workflow immediately upon completion.


Ted Petrou is the author of Pandas Cookbook and founder of both Dunder Data and the Houston Data Science Meetup group. He worked as a data scientist at Schlumberger where he spent the vast majority of his time exploring data. Ted received his Master’s degree in statistics from Rice University and used his analytical skills to play poker professionally and teach math before becoming a data scientist.

Open Data Science



