Clean as You Go: Basic Hygiene in the Modern Data Stack

Abstract: 

When my children walk around the house, they generally leave a trail of mess behind them. They sometimes realize that they shouldn't be doing this, but they’re so excited to move on to the next thing that catches their eye that they’ll say “Oh, I’ll clean it up later.”

As grown adults with wisdom gained from experience, my wife and I know that this means either:

They’ve just signed themselves up for a massive future cleaning job, or …
… that someone else will have to clean up after them.

We know that this is not good behavior for a child, so why do we so often do this as Data Engineers?

The culture of “Move Fast and Break Things” has pressured us into closing tickets as quickly as possible, frequently pushing us towards the “Oh, I’ll clean it up later” mindset. While this may save us a few minutes in the short term, we are creating long-term headaches such as:

Piles of small cleanup tasks for later
Confusion among peers who try to use incomplete data assets
Lack of metadata to activate throughout the Modern Data Stack

Session Outline:

In this session, we’ll discuss best practices for keeping your data clean. While these practices are tool-agnostic and can be implemented entirely with open source tools, we'll walk through examples using specific tools that make the concepts easy to illustrate:
Data Contracts with Gable
Data Tests and Descriptions with dbt
Data Observability with Metaplane
Data Cataloging with Secoda
Visual Sanity Checks with Pickaxe
A few minutes of extra time up front can save us a bunch of time down the road if we work smarter, not harder.

Background Knowledge:

Attendees should leave with an understanding of how to modify their workflows to prioritize hygiene without disrupting the speed of business.

Bio: 

Eric Callahan serves as Principal, Data Solutions at Pickaxe Foundry. With over 15 years of experience, he assists clients in resolving various data challenges. His expertise encompasses data engineering, analytics, machine learning, and experimentation, gained through roles focused on both product and marketing. This diverse background provides him with a unique perspective on the interconnectedness of the data ecosystem. He actively participates as a panelist and speaker, sharing his insights on best practices for modern data stacks.

Open Data Science
One Broadway
Cambridge, MA 02142
info@odsc.com
