Abstract: Real-time data applications are often intimidating for junior data scientists and (even) data engineers, yet demand for building and managing real-time systems is growing rapidly. For most data scientists and researchers from non-engineering backgrounds, working with real-time data applications is not trivial. In this 90-minute workshop, we will build a real-time data application to give the audience hands-on experience. The application is simple yet complete, and it is designed to be friendly and approachable. To serve a broader audience, conceptual content will be mixed with practical coding.
The workshop will follow the conventional four stages of a data pipeline: ingestion, storage, transformation, and delivery. I will also touch on machine learning DevOps to demystify the buzzwords in the context of this pipeline.
The project starts with the public Twitter API, which streams live data from Twitter feeds. The data will be stored in a document database with a proper schema, as well as in a relational database for persistence. A machine learning model, initially trained offline on batch data, will be updated iteratively as new data arrives. Bridging between the API, the databases, and the model is handled by Kafka or Faust. Basic concepts of the pub/sub model will be introduced for beginner-level audience members with no prior knowledge.
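To preview the pub/sub concept covered in the workshop, here is a minimal, dependency-free Python sketch. This is not the actual Kafka or Faust code used in the project; the topic name, message fields, and `Broker` class are all illustrative assumptions meant only to show the pattern.

```python
# Minimal in-process sketch of the publish/subscribe pattern.
# In the real pipeline this role is played by Kafka (or Faust);
# here a plain dict of topic -> subscriber callbacks shows the idea.

from collections import defaultdict
from typing import Callable


class Broker:
    def __init__(self) -> None:
        self.subscribers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        # Register a handler to be called for every message on `topic`.
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Deliver the message to every subscriber of `topic`.
        # The publisher never needs to know who is listening --
        # that decoupling is the core of pub/sub.
        for handler in self.subscribers[topic]:
            handler(message)


broker = Broker()
received = []

# A "consumer" that could, e.g., store tweets in the document database.
broker.subscribe("tweets", received.append)

# A "producer" streaming from the Twitter API would publish like this:
broker.publish("tweets", {"id": 1, "text": "hello, real-time world"})
```

In Kafka or Faust the broker runs as a separate service and handlers become consumers or agents, but the publish/subscribe relationship is the same.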
At the final stage, Streamlit is used to query the document database and serve users' requests on the client-facing end. I will cover the life cycle of the request/response cycle and its implementation.
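As a rough sketch of the query step on the delivery side: a user request carries a keyword, the app queries the document store, and matching documents come back as the response. Here the store is mocked as a list of dicts, and the field names and `query_tweets` helper are illustrative assumptions, not the workshop's actual code.

```python
# Sketch of the server side of a request/response cycle:
# the request is a keyword, the "database" is a list of tweet
# documents, and the response is the list of matching documents.

def query_tweets(store: list[dict], keyword: str) -> list[dict]:
    # Case-insensitive substring match on the tweet text.
    kw = keyword.lower()
    return [doc for doc in store if kw in doc["text"].lower()]


mock_store = [
    {"id": 1, "text": "Streaming with Kafka is fun"},
    {"id": 2, "text": "Real-time dashboards in Streamlit"},
    {"id": 3, "text": "Batch training tonight"},
]

results = query_tweets(mock_store, "stream")
```

In a Streamlit app the keyword would come from a widget such as `st.text_input`, and the results would be rendered with `st.write`; the query logic itself stays the same.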
Bio: Ron Li is a data science instructor and senior data scientist at Galvanize, Inc. Before that, he worked on machine learning and knowledge graphs at the Information Sciences Institute. Ron has published a 4.5-star-rated book, Essential Statistics for Non-STEM Data Analysts. He has also authored or co-authored several academic papers, taught data science to non-STEM professionals on a pro bono basis, and given talks at conferences such as PyData.