Abstract: Ensuring proper data quality is critical to the effective implementation of data pipelines for ML, data science, geospatial analysis, or general analytics.
Most engineering teams address data quality and pipeline orchestration as two separate tasks. In this presentation, Sandy Ryza will explain the benefits of a model in which arbitrary checks are included in the data orchestration logic, resulting in better control and integration of data quality checks at various steps in the pipeline.
To remain versatile, the orchestrator should not determine what “data quality” means to an organization, but rather facilitate the implementation and observability of data quality checks, no matter how data quality is defined. Checks should be intuitive to implement and the outcome of the checks should inform the pipeline logic.
Achieving this degree of flexibility without impacting performance requires careful design, and Sandy will share best practices and lessons learned on creating data quality checks that provide actionable insights for data engineering and ML teams.
Sandy's presentation will share how we designed and implemented data quality checks in the Python-based open-source platform Dagster, bringing data quality capabilities to the orchestration layer.
Participants will benefit most if they have a working understanding of data orchestration, ML and data science pipelines, and a general grasp of data quality techniques.
Bio: Sandy is a lead engineer, author, and thought leader in the domain of data engineering. Sandy co-wrote “Advanced Analytics with PySpark” and “Advanced Analytics with Spark”. He led ML and data science teams at Cloudera, Remix, Clover Health, and KeepTruckin.
Sandy is currently the lead engineer on the Dagster project, an open-source data orchestration platform used in MLOps, data science, IoT, and analytics. Sandy is a regular speaker at data engineering and ML conferences.