Salted Graphs – A (Delicious) Approach to Repeatable Data Science

Abstract: 

Data Science is easy* when your data fit in memory, your functions are stateless, and everything is version controlled. As things grow, however, the data pipeline can become its own tangled mess – if you change a preprocessing step, does your model need to be retrained? What about predictions – did you rerun them after finding the best parameters for the model? Which teammates have which versions of your output?

Developers have juggled “Dependency Hell” issues for decades, and many tools exist to help keep our environments functional and complete. Yet there are few or no standard tools for the artifacts of a data science workflow, or even outputs from an ETL pipeline. Functions are expected to get “the most recent” input and we rely on eventual consistency, despite the obvious risks and prevalent failures.

At Solaria Labs, these problems are even more real as we are constantly developing brand new products from the ground up. In this workshop, I will discuss how we leverage dataflow programming libraries, such as Dask and Luigi, to structure and simplify our approach. I will also introduce the Salted Graph – a concept which allows rigorous tracking of data lineage within the code framework of choice, in a manner similar to Git, to provide a ‘controlled version’ for our data outputs.
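To make the idea concrete, here is a minimal sketch of how a "salt" could be derived. The abstract does not describe the actual mechanism, so everything below is an assumption for illustration: the `salt` and `output_path` helpers, the task names, the version strings, and the file layout are all hypothetical. Each task's salt is a hash of its own version and parameters plus the salts of its upstream tasks, much as a Git commit hash covers its parents.

```python
import hashlib
import json


def salt(name, version, params, upstream_salts):
    """Hypothetical helper: hash a task's identity together with its upstream salts."""
    payload = json.dumps(
        {
            "name": name,
            "version": version,
            "params": params,
            "upstream": list(upstream_salts),
        },
        sort_keys=True,  # stable key order so the same inputs always hash the same
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


def output_path(name, task_salt):
    """Hypothetical layout: embed the salt in the output file name."""
    return f"data/{name}-{task_salt}.parquet"


# Original pipeline: preprocess -> train -> predict
preprocess_v1 = salt("preprocess", "1.0", {"scale": True}, [])
train_v1 = salt("train", "1.0", {"alpha": 0.1}, [preprocess_v1])
predict_v1 = salt("predict", "1.0", {}, [train_v1])

# Change only the preprocessing version; downstream code is untouched
preprocess_v2 = salt("preprocess", "1.1", {"scale": True}, [])
train_v2 = salt("train", "1.0", {"alpha": 0.1}, [preprocess_v2])
predict_v2 = salt("predict", "1.0", {}, [train_v2])

print(output_path("predict", predict_v1))
print(output_path("predict", predict_v2))
```

Under these assumptions, changing the preprocessing step changes every downstream salt, so the stale model and predictions no longer match the expected output paths and are visibly out of date. Outputs that share a salt were produced by the same lineage, which gives teammates a way to check that they hold the same version.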

Bio: 

Scott Gorlin is the Director of Applied Science for Solaria Labs, an incubation arm within Liberty Mutual Innovation focused on exploring emerging technologies and non-traditional business opportunities. Prior to joining Solaria in June of 2017, he led research and development for an ad-tech startup focused on automated campaign management and performance optimization. Scott earned a Ph.D. in Systems and Computational Neuroscience from the Massachusetts Institute of Technology and has long been an advocate and practitioner of repeatable and scalable science through code.

Open Data Science

 

 

 

