Abstract: Data practitioners use orchestrators to schedule and run the computations that keep data assets, like datasets and ML models, up to date.
Traditional orchestrators think in terms of “tasks”. This talk discusses an alternative, declarative approach to data orchestration that puts data assets at the center. This approach, called “software-defined assets”, is implemented in Dagster, an open-source data orchestrator.
In traditional data platforms, code and data are only loosely coupled. As a consequence, deploying changes to data feels dangerous, backfills are error-prone and irreversible, and it’s difficult to trust data, because you don’t know where it comes from or how it’s intended to be maintained. Each time you run a job that mutates a data asset, you add a new variable to account for when debugging problems.
Dagster proposes an alternative approach to data management that tightly couples data assets to code: each table or ML model corresponds to the function that’s responsible for generating it. This results in a “Data as Code” approach that mimics the “Infrastructure as Code” approach that’s central to modern DevOps. Your git repo becomes the source of truth for your data, so pushing data changes feels as safe as pushing code changes. Backfills become easy to reason about. You trust your data assets because you know how they’re computed and can reproduce them at any time. The role of the orchestrator is to ensure that physical assets in the data warehouse match the logical assets that are defined in code, so each job run is a step towards order.
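To make the idea of one function per asset concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not Dagster's implementation: the `ASSETS` registry, the `asset` decorator, the `materialize` helper, and the asset names `raw_orders` and `order_total` are all hypothetical and exist only for illustration. The assumption is that an asset's upstream dependencies are the assets named by its function's parameters.

```python
import inspect

# Hypothetical registry: asset name -> the function that computes it.
ASSETS = {}


def asset(fn):
    """Register fn as the definition of the asset named after it."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    # Stand-in for loading source data from a warehouse or file.
    return [{"id": 1, "amount": 20.0}, {"id": 2, "amount": 5.0}]


@asset
def order_total(raw_orders):
    # Depends on raw_orders because it names that asset as a parameter.
    return sum(row["amount"] for row in raw_orders)


def materialize(name, cache=None):
    """Compute an asset after recursively computing its upstream assets."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        # Parameter names identify the upstream assets (an assumption of this sketch).
        upstream = [materialize(dep, cache) for dep in inspect.signature(fn).parameters]
        cache[name] = fn(*upstream)
    return cache[name]
```

Calling `materialize("order_total")` first computes `raw_orders` and then the total. Because each asset is defined entirely by code in the repo, it can be recomputed at any time from its definition.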
Asset-based orchestration works well with modern data stack tools like dbt, Meltano, Airbyte, and Fivetran, because those tools already think in terms of assets.
Attendees of this session will learn how to build and maintain data pipelines in a way that makes their datasets and ML models dramatically easier to trust and evolve.
Bio: Sandy works at Elementl as the lead engineer for the Dagster project. Previously, he led machine learning and data science teams at KeepTruckin and Clover Health. He's a committer on Spark and Hadoop, and co-authored O'Reilly's Advanced Analytics with Spark.