The Future is Composable: How to Build a Data Lakehouse from Open Source and Survive

Abstract: 

The design of data stacks has been moving from monolithic applications to more composable systems. Over the past decade, data lakes have become ubiquitous in enterprises, and an increasing number of them are built on cloud object storage (e.g., AWS S3).

This trend favored the rise of more flexible architectures built on the separation of storage from compute, such as the Data Lakehouse (DLH). The DLH aims to provide the user experience of monolithic OLAP systems with the flexibility of the data lake. It also promises to go beyond analytics and BI, enabling a first-class developer experience for data transformation, data science, and ML.

In this talk, we describe how we built a serverless DLH by leveraging open standards and existing open source frameworks.

In recent years, open source tools for data systems have become more reliable, and the industry has started to think in terms of composability of data systems. Instead of simply trying to improve the usability of all-purpose Big Data technologies such as Spark, we chose to repurpose open components to support different use cases. This approach allowed us to build a more flexible system and to focus on developer experience.

We built upon the following foundational principles:

Storage is built on cloud object storage and open formats (Parquet and Iceberg), to ensure interoperability with other systems.
Different computational engines should be available depending on the use case: for instance, it is desirable to avoid the JVM for medium-sized workloads.
User interfaces vary depending on the user: for instance, developers should be able to use SQL, Python, or both, depending on what they are trying to accomplish (a sketch illustrating these points follows this list).
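
To make the first and third principles concrete, here is a minimal sketch of the same Parquet file on object storage being accessed through two different interfaces: DuckDB for SQL users and PyArrow for Python users. The bucket path and column name are hypothetical, and we assume S3 credentials have already been configured (e.g., via DuckDB secrets and the AWS environment); this is an illustration of the pattern, not our production code.

import duckdb
import pyarrow.parquet as pq

# Hypothetical dataset; any Parquet file on S3 works the same way.
PATH = "s3://my-bucket/events.parquet"

# SQL interface: DuckDB queries the file straight off object storage.
# The httpfs extension provides S3 support; credentials must be
# configured separately (e.g., via DuckDB secrets).
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
daily = con.sql(f"SELECT event_date, COUNT(*) AS n FROM '{PATH}' GROUP BY 1")
print(daily)

# Python interface: PyArrow reads the very same file into an Arrow table,
# with no JVM involved and no copy into a proprietary format.
table = pq.read_table(PATH)
print(table.schema)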

We were able to leverage open standards, such as Apache Arrow for in-memory columnar representation, Iceberg for table metadata, and Parquet for storage, as well as open source projects such as DuckDB as a query engine with SQL support.
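
As a sketch of how these pieces compose, the example below runs a SQL query in DuckDB over a Parquet file and hands the result over as an Apache Arrow table, which any Arrow-aware engine or library can then consume without serialization. The file and column names ("trips.parquet", passenger_count, fare_amount) are placeholders, not part of our stack.

import duckdb
import pyarrow as pa

# Aggregate a Parquet file with DuckDB's in-process engine.
# "trips.parquet" and its columns are placeholders.
rel = duckdb.sql(
    "SELECT passenger_count, AVG(fare_amount) AS avg_fare "
    "FROM 'trips.parquet' GROUP BY passenger_count"
)

# Materialize the result as an Arrow table: the same in-memory columnar
# format understood by pandas, Polars, DataFusion, and others, so data
# crosses engine boundaries without copies or format conversions.
arrow_table: pa.Table = rel.arrow()
print(arrow_table.schema)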

While building an entire DLH remains extremely hard, the progress made by open formats and open source frameworks allowed us to build faster, narrow the problem scope, and focus our resources on the differentiating features and components that truly needed to be built from scratch.

Bio: 

Ciro Greco - founder and CEO at Bauplan, a serverless computing platform for complex data workloads. Developers can write complex, multi-language data pipelines with zero environment management or infrastructure configuration, and the same instantaneous feedback loop as running code locally. Previously, he founded Tooso, an NLP startup based in San Francisco. Tooso was acquired by Coveo in 2019, and Ciro was on the management team that brought Coveo to IPO in 2021. In a previous life, he earned a PhD in Neuroscience at Milan-Bicocca, held a postdoctoral fellowship at Ghent University, and was a visiting scientist at MIT.
