Building a near-real-time data pipeline in the cloud with Spark & Python


Learn how we built, tested and delivered a near-real-time data pipeline using Apache Spark in the cloud in two weeks, and still saw our families. We faced a looming deadline and real-time analytics requirements. Using a cloud-based platform with Spark and Impala running on Microsoft Azure, and armed with a few hundred lines of Python code, we designed, tested and deployed an end-to-end data pipeline and analytics infrastructure in that time. The project had its challenges, both technical and operational; come hear what we learned and our tips for success.


Dan Stair is a Senior Engineer with Cazena’s data engineering platform team. He has years of experience developing and running high-performance databases and Hadoop clusters. At Cazena, Dan focuses on developing software and processes that make cloud data platform deployments automated, efficient and secure. He also works closely with companies to configure and optimize cloud data environments for unique requirements.
