Tathagata Das

Staff Software Engineer at Databricks

    Tathagata Das is a Staff Software Engineer at Databricks, an Apache Spark™ committer, and a member of the Apache Spark Project Management Committee (PMC). He is one of the original developers of Apache Spark, the lead developer of Spark Streaming (DStreams), and currently one of the core developers of Structured Streaming and Delta Lake. Previously, he was a graduate student in the AMPLab at UC Berkeley, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica. He is also a co-author of the book "Learning Spark, 2nd Edition", published by O'Reilly: https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/

    All Sessions by Tathagata Das

    Day 1 04/23/2024
    10:00 am - 10:30 am

    Designing ETL Pipelines with Delta Lake and Structured Streaming — How to Architect Things Right

    Track: Data Engineering

    DE Summit: Structured Streaming has proven to be the best framework for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, they make it very easy to build pipelines for many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine that handles both batch and stream processing, often provides multiple ways to solve the same problem, so understanding the requirements carefully helps you architect a pipeline that meets your business needs in the most resource-efficient manner. In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions:
    - WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
    - WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
    - HOW are you going to architect the solution? And how much are you willing to pay for it?
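
    For a taste of the kind of pipeline the abstract describes, here is a minimal PySpark sketch (not taken from the talk) of a Structured Streaming job that reads from one Delta table, applies a windowed aggregation with built-in functions, and writes the result to another Delta table. The table paths, column names, and checkpoint location are hypothetical placeholders, and the sketch assumes a Spark session configured with the Delta Lake connector.

        # Minimal Structured Streaming + Delta Lake sketch; assumes the Delta Lake
        # connector is available on the cluster. Paths and columns are hypothetical.
        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.appName("delta-streaming-sketch").getOrCreate()

        # Read the source Delta table as a stream.
        events = spark.readStream.format("delta").load("/delta/raw_events")

        # Express the business logic with built-in functions:
        # per-minute event counts by type, with a watermark to bound state.
        counts = (
            events
            .withWatermark("event_time", "10 minutes")
            .groupBy(F.window("event_time", "1 minute"), "event_type")
            .count()
        )

        # Write the result to a Delta table; the checkpoint location lets the
        # query restart with exactly-once, fault-tolerant output.
        query = (
            counts.writeStream
            .format("delta")
            .outputMode("append")
            .option("checkpointLocation", "/delta/_checkpoints/event_counts")
            .start("/delta/event_counts")
        )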
