Abstract: Practicing data scientists typically spend the bulk of their time developing models for a particular inference or prediction application, likely giving substantially less time to the equally complex problems stemming from system infrastructure. We might simply think of these two often orthogonal concerns as the modeling problem and the engineering problem. The typical data scientist is trained to solve the former, often in an extremely rigorous manner, but can wind up developing a series of ad hoc solutions to the latter.
This talk will discuss Docker as a tool for the data scientist, in particular in conjunction with the popular interactive programming platform, Jupyter, and the cloud computing platform, Amazon Web Services (AWS). Using Docker, Jupyter, and AWS, the data scientist can take control of their environment configuration, prototype scalable data architectures, and easily clone their work for the sake of replicability and communication. This talk will work toward developing a set of best practices for Engineering for Data Science.