
Abstract: Data integration, the practice of synthesizing data from diverse sources into a single, coherent view, is pivotal in today's multifaceted data ecosystem. While many companies rely on third-party services such as Fivetran, Airbyte, or AWS Glue for data integration, Apache SeaTunnel has emerged as a compelling open-source alternative. Our team at JPMorgan Chase has harnessed SeaTunnel's efficient and scalable design to build a robust foundational data ingestion service. This workshop offers a hands-on experience, showcasing how we integrate data from various sources (such as Amazon DynamoDB and Amazon RDS) into our data lake (Amazon S3 and Snowflake). Participants will gain insights into crafting and deploying SeaTunnel code across various environments and will learn about our contributions to this groundbreaking project.
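To make the ingestion pattern concrete, here is a minimal sketch of a SeaTunnel v2 job definition (HOCON) that moves rows from an RDS MySQL database into S3 as Parquet. The hostnames, credentials, bucket, and table names are placeholders, and connector option names (e.g. file_format_type) vary across SeaTunnel releases, so treat this as illustrative rather than copy-paste ready:

```
# Hedged sketch of a SeaTunnel v2 batch job; option names follow the
# 2.3.x connector docs and may differ in other releases. All endpoints
# and credentials below are placeholders.

env {
  parallelism = 2
  job.mode = "BATCH"
}

source {
  Jdbc {
    # Hypothetical RDS MySQL endpoint
    url      = "jdbc:mysql://my-rds-host:3306/sales"
    driver   = "com.mysql.cj.jdbc.Driver"
    user     = "etl_user"
    password = "etl_password"
    query    = "SELECT order_id, amount, updated_at FROM orders"
  }
}

sink {
  S3File {
    bucket           = "s3a://my-data-lake"
    path             = "/raw/sales/orders"
    file_format_type = "parquet"   # "file_format" in some older releases
  }
}
```

The declarative format is the point of the exercise: SeaTunnel also ships connectors for sources and sinks such as Amazon DynamoDB and Snowflake, so swapping targets generally means editing the source and sink blocks rather than rewriting pipeline code.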
Session Outline:
Establishing Fundamental Data Infrastructure: Utilizing Java, Gradle, Docker, and Kubernetes.
Project Scalability: Enhancing data processing capabilities with Apache Spark.
Diverse Data Source Connectivity: Simplifying integration with Apache SeaTunnel.
Advanced Stream Processing: Delving deeper with Apache Flink (a streaming sketch follows this outline).
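The Spark and Flink items above both rely on SeaTunnel's pluggable execution engines: the same declarative job can be submitted to the built-in Zeta engine, Spark, or Flink. As a hedged illustration, the sketch below switches the job to streaming mode using the MySQL-CDC source; option names are assumptions based on the 2.3.x docs, and the launcher script names in the comments vary by SeaTunnel and engine version:

```
# Hedged sketch: a streaming SeaTunnel job capturing change data from
# MySQL and printing it to the console for demonstration. Hosts,
# credentials, and table names are placeholders.

env {
  parallelism = 1
  job.mode = "STREAMING"
}

source {
  MySQL-CDC {
    base-url    = "jdbc:mysql://my-rds-host:3306/sales"
    username    = "etl_user"
    password    = "etl_password"
    table-names = ["sales.orders"]
  }
}

sink {
  Console {}
}

# Submitting a job config to different engines (script names vary by
# SeaTunnel/engine version):
#   ./bin/seatunnel.sh --config streaming.conf                          # Zeta engine
#   ./bin/start-seatunnel-flink-15-connector-v2.sh --config streaming.conf
# A batch job like the one in the abstract can likewise be submitted to
# Spark via ./bin/start-seatunnel-spark-3-connector-v2.sh
```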
Learning Objectives:
This session aims to provide a practical demonstration of constructing comprehensive end-to-end data pipelines, grounded in real-world industrial applications and the efficiencies they demand.
Bio: Hao Xu currently serves as a Lead Software Engineer at JPMorgan Chase, where he plays a pivotal role on the payments team. His expertise lies in developing reliable data and machine learning infrastructure and architecting systems that ingest and transform data across millions of domains. Before JPMorgan Chase, he was a key contributor to the ML Platform team at Affirm, where he applied machine learning to complex challenges in loan underwriting and fraud detection.