Abstract: Data Lakes are a mainstream part of our data platforms these days. They underpin a lot of our data science and engineering workflows, but it’s hard to talk about Data Lakes without someone mentioning the Data Swamp, simply because lakes used to be hard to get right. That’s no longer the case thanks to a wave of next-generation file formats, one of which is Delta. Whether you’re a data scientist, machine learning engineer, or a hardcore data engineer, there is a whole host of Delta features that completely change the landscape of how we’re building data platforms.
In this workshop, I will introduce the Delta file format and how it works, before taking a tour of the many features available in the open source project. I will show you how to get started with Delta in a Spark environment, covering a range of features from simple merge statements and temporal querying, right down to some deeper performance tuning. You will leave this workshop ready to work in a Delta Lake architecture, confident that you will avoid the dreaded swamp!
Bio: Simon is the Director of Engineering for Advancing Analytics, a Microsoft Data Platform MVP and one of the few Databricks Beacons globally. Simon has pioneered Lakehouse Architectures for some of the world’s largest companies, challenging traditional analytical solutions and pushing for the very best for the data industry. Simon runs the Advancing Spark YouTube channel, where he can often be found digging into Spark features, investigating new Microsoft technologies and cheering on the Delta Lake project.