Editor’s note: Christina is a speaker for ODSC East 2024 this April 23-25. Be sure to check out her talk, “Dive into Data: The Future of the Single Source of Truth is an Open Data Lake,” there!

Organizations that strive for high-performance data systems are increasingly turning towards the ELT (Extract, Load, Transform) model. This approach provides scalability, flexibility, and low latency: ELT allows for horizontal scaling of the target data store and enables users to perform analysis and transformations on the fly. However, choosing the right provider and target can be a daunting task, as it carries high stakes and high rewards. A successful implementation can deliver significantly faster insights, but a poorly executed one can result in costly infrastructure maintenance and vendor lock-in.

The Proliferation of EL Technology

Are you torn between building and buying software for your EL needs? Join the club. With limited engineering resources and a tight GTM timeline, my team started with a vendor. It helped us scale up to hundreds of SaaS connectors quickly, but the cost model and lack of customization became problematic over time. (I wrote about our learning curve here.) We decided to explore other options for our ingestion needs: Airbyte and Meltano. I made a side-by-side comparison in the appendix.

I am not going to recommend the “best” technology to you because there is no silver bullet. I do want to introduce a framework that can help you evaluate your options quantitatively, instead of making decisions based on what is popular right now. It is called the Analytic Hierarchy Process, and it was a game-changer for my team.
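To make this concrete, here is a minimal sketch of the Analytic Hierarchy Process in plain Python, using the geometric-mean approximation of priority weights. The criteria names and pairwise judgments are hypothetical examples, not my team's actual scores.

```python
# Minimal AHP sketch: derive priority weights for decision criteria from a
# pairwise comparison matrix, using the geometric-mean approximation.
from math import prod

def ahp_weights(matrix):
    """Approximate AHP priority weights via the geometric-mean method."""
    n = len(matrix)
    gmeans = [prod(row) ** (1.0 / n) for row in matrix]
    total = sum(gmeans)
    return [g / total for g in gmeans]

# Pairwise comparisons on Saaty's 1-9 scale (illustrative values only):
# rows/columns are cost, customization, developer satisfaction.
criteria = [
    [1,     3,   5],    # cost judged more important than the other two
    [1 / 3, 1,   2],
    [1 / 5, 1 / 2, 1],
]

weights = ahp_weights(criteria)
print([round(w, 3) for w in weights])
```

You would repeat the same computation for each candidate tool under each criterion, then combine the scores with the criterion weights to rank the tools.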

Remember, developers are a key part of this equation. So don’t forget to consider their satisfaction with the technology. After all, happy developers deliver successful projects and maintain the stack for the long term.

The One True Destination

With the ability to accommodate both structured and unstructured data, data lakes are the perfect fit for a wide range of analytics, machine learning, and product use cases. Modern data lakehouses take full advantage of the infinite scalability of cloud storage, while also offering a range of advanced features such as ACID transactions, time travel, and schema evolution. An open data lake, rather than a proprietary target, is the inevitable EL destination. By choosing open formats, you can dramatically reduce costs, prevent vendor lock-in, retain control over your data, and gain the freedom to innovate. By decoupling compute from storage, you can substantially improve data pipeline efficiency. (I shared my experience in this post.)

Be aware that an EL framework’s write-to-target is often overly general and inefficient. This is because the primary objective of EL is to deliver bronze data, not to optimize the user’s compute resources. Although it is possible to make the process more efficient, doing so requires additional infrastructure. Unfortunately, this is not always transparent to the user, especially when the EL provider is a closed-source service. In such cases, the user has to trust the vendor’s autoscaling mechanism, which may not always work optimally. 

Here is an example from Airbyte’s Redshift connector documentation. The default – inserting data into columnar storage as row operations – is suboptimal.

The Redshift Destination offers two replication strategies. The first inserts via a typical SQL Insert statement. Although less efficient, this requires less user setup. 

The second inserts via streaming the data to an S3 bucket, and Copy-ing the data into Redshift. This is more efficient, and recommended for production workloads, but does require users to set up an S3 bucket and pass in additional credentials.
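The staged-load pattern the documentation recommends boils down to: land files in S3 first, then bulk-load them with a single COPY. The sketch below builds such a COPY statement; the table, bucket, and IAM role names are placeholders, and in practice the statement would be executed over a Redshift connection after the files are staged.

```python
# Sketch of the S3-staging strategy: instead of row-by-row INSERTs, stage
# files in S3 and bulk-load them into Redshift with a single COPY statement.
def build_copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement that bulk-loads staged Parquet files."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET;"
    )

stmt = build_copy_statement(
    table="bronze.events",                                  # placeholder table
    s3_prefix="s3://my-staging-bucket/events/",             # placeholder bucket
    iam_role="arn:aws:iam::123456789012:role/redshift-copy",# placeholder role
)
print(stmt)
```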

It is often beneficial to separate the process of writing to a target from data extraction, by using cloud storage as the initial landing zone. You can create a data extractor (e.g. a REST API service) that asynchronously pulls data from the source in increments. The service then stores the data in its raw format in S3, without performing heavy-duty data transformation. Instead, complex transformation tasks are delegated downstream to the silver/gold layers.
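A minimal sketch of that landing-zone extractor, assuming a cursor-paginated source API: the `fetch` callable and the file layout are illustrative, and a production version would write to S3 (e.g. via boto3) rather than a local path.

```python
# Incremental extractor sketch: pull one increment of raw data from a source
# API and land it untransformed (bronze) in a landing zone, returning the
# cursor for the next run. Heavy transformation is deferred to silver/gold.
import json
import tempfile
from pathlib import Path

def extract_increment(fetch, cursor: str, landing_dir: Path) -> str:
    """Pull one page of records after `cursor` and store the raw payload."""
    payload = fetch(cursor)                   # raw JSON from the source API
    out = landing_dir / f"raw_{cursor}.json"  # partition files by cursor value
    out.write_text(json.dumps(payload))       # no transformation: bronze data
    return payload["next_cursor"]             # resume point for the next run

# Example with a stubbed API (stands in for a real REST source):
def fake_fetch(cursor):
    return {"records": [{"id": 1}], "next_cursor": "2024-01-02"}

landing = Path(tempfile.mkdtemp())            # stands in for an S3 prefix
next_cursor = extract_increment(fake_fetch, "2024-01-01", landing)
print(next_cursor)
```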

The Open Format Contest

Open table formats such as Delta, Iceberg, and Hudi have grown tremendously in popularity. All of them use Parquet, an open-source, high-performance, columnar file format, as the underlying storage layer. However, the metastores, which store data type, structure, and statistical information, are still vendor-specific. Using Delta on one of the most advanced enterprise data warehouses is still unreliable and considered experimental; the best-managed Spark compute platform does not yet natively support Iceberg. The battle continues.
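To illustrate what these table formats add on top of Parquet, here is a toy metadata manifest (not any real format's spec): a list of data files with column-level min/max statistics, which is what lets a query engine skip files without reading them.

```python
# Toy illustration of table-format metadata: a manifest recording each
# Parquet data file plus per-column min/max statistics, enabling file
# pruning. Real formats (Delta, Iceberg, Hudi) are far richer than this.
manifest = {
    "table": "events",
    "format_version": 1,
    "files": [
        {"path": "part-000.parquet", "rows": 1000,
         "stats": {"event_ts": {"min": "2024-01-01", "max": "2024-01-15"}}},
        {"path": "part-001.parquet", "rows": 800,
         "stats": {"event_ts": {"min": "2024-01-16", "max": "2024-01-31"}}},
    ],
}

def prune_files(manifest, column, lower):
    """Keep only files whose max value for `column` reaches `lower`."""
    return [f["path"] for f in manifest["files"]
            if f["stats"][column]["max"] >= lower]

# A query filtering on event_ts >= 2024-01-20 can skip part-000 entirely:
pruned = prune_files(manifest, "event_ts", "2024-01-20")
print(pruned)
```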

Although overshadowed by challenges, hope finds a way: one promising step forward is Delta UniForm (Universal Format). UniForm enables users to work with open tables and access them from multiple clients, regardless of format. However, it is important to note that there are limitations to the current reader/writer versions, which may result in performance issues. I am excited about the development of these kinds of projects and welcome contributions from interested individuals.

Conclusion: The Hope for Future Data Lakes

We love modern EL technology and its ability to offer greater flexibility and faster insights. However, I believe that the productivity of these tools relies heavily on the skills of the developers who maintain them. From my perspective, cloud storage is the most reliable EL destination since it eliminates the need for inefficient transformations, vendor-specific “black boxes,” and proprietary targets. I foresee data science and analytics moving towards high-throughput, low-latency open tables. I look forward to a future where open-format wars are a thing of the past and metastores are unified. Would you like to join me in a dialogue on this topic at Open Data Science East 2024?

Appendix – ELT Choices Comparison

| | Vendor | Airbyte (self-host) | Meltano (self-host) |
|---|---|---|---|
| Origin | Closed source | Airbyte (2020) | GitLab (2018) |
| Source | Several hundred connectors and growing | Hundreds of connectors, custom connector development kit | Hundreds of connectors, custom REST API, and tap development kit |
| Target | Supports S3 (limited formats), data lake (requires metastore), data warehouses (requires separate compute) | Supports S3 in Avro, Parquet, JSON Lines. Response is converted multiple times. Multiple compaction strategies. | Supports S3 in JSON and Parquet formats. Automatic compaction. |
| Ease of use | Easy to get started, restricted customization | Easy to get started, has a UI to help iterate through configuration | Steeper learning curve. Once learned, easier to override classes and methods. |
| Customization | Subject to product roadmap | Low-code SDK, SDK, UI. Core is written in Java. | Custom SDK. Written in Python. |
| Speed | Depends on target compute | Can be slow; read source code for implementation and data type conversion. | Faster; write your own code for more efficiency. |
| Transformation | Separate setup with formulas or dbt Core/Cloud | With limitations | With limitations |
| Support | Paid support plan | Active community Slack | Active community Slack |

Our requirements:

- We want further filtering to avoid bringing in unnecessary data.
- We want customization of REST API endpoints and requests.
- We need control over how to handle data quality issues (e.g. duplicate handling and alerting).

  1. We are NOT looking for a low or no code solution.
    a. We need to be able to use time-based filters and inject them into both the request params and request body.
    b. We want to know/control the schema, and filter only for necessary fields.
  2. We are NOT looking for the number of connectors supported.
    a. We need an open standard that can extend to any REST API: GET or POST.
    b. We seek depth, not breadth, of data, in a structure we want. (One row per sentence is not helpful; raw sentences are not useful, summaries are.)
  3. We are NOT looking for DB targets or transformations.
    a. Most EL providers’ DB targets are not recommended for production. They tend to default to row-by-row operations, which is extremely inefficient for any analytical data warehouse.
    b. Our API data extractor will do as little work as possible. S3 will be our target.
  4. We need an easy integration with our cloud and scheduler
    a. We need to be able to use Airflow to schedule the work and supply parameters.
    b. We want to host the worker in our cloud environment, preferably EKS.

About the Author:

Christina is passionate about open-source, multi-cloud, scalable, and efficient data pipelines. She makes data-informed architectural decisions to build modern systems that support advanced analytics, machine learning, and customer-facing product use cases. She has a keen interest in interdisciplinary areas such as DevOps, MLOps, and Cloud FinOps. She loves to learn, share, and contribute to the open-source community.