Abstract: Generating datasets for ML models is a tried-and-true (and mundane) workflow: fetch the data from somewhere, sanitize it, normalize it, and output it as CSV or another format.
Some folks may have certain pieces of this workflow automated, but the automation is usually brittle, which makes automating the whole thing an arduous task.
But there is a way to do this without the gymnastics: the pattern is called “replays”.
In this talk, I am going to detail the steps our team took to generate granular, real-time datasets directly from Kafka streams.
Attendees will leave with concrete, actionable steps for implementing self-serve, real-time dataset generation using open-source software.
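As a rough preview of the kind of pipeline the talk covers, here is a minimal sketch in Go of the sanitize/normalize/CSV step applied to replayed records. This is an illustration only: the record fields and values are hypothetical, and the replay source is simulated with an in-memory slice rather than an actual Kafka consumer seeking to an offset.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strings"
)

// Record stands in for a message re-read ("replayed") from a Kafka
// topic at a chosen offset; the broker plumbing is omitted here.
type Record struct {
	UserID string
	Email  string
	Amount float64
}

// toRow sanitizes (trims whitespace) and normalizes (lowercases the
// email, formats the amount to two decimals) one replayed record.
func toRow(r Record) []string {
	return []string{
		strings.TrimSpace(r.UserID),
		strings.ToLower(strings.TrimSpace(r.Email)),
		fmt.Sprintf("%.2f", r.Amount),
	}
}

func main() {
	// Hypothetical raw events, as if replayed from a topic.
	events := []Record{
		{UserID: " u-1 ", Email: "A@Example.com", Amount: 19.99},
		{UserID: "u-2", Email: "b@example.com ", Amount: 5},
	}

	w := csv.NewWriter(os.Stdout)
	w.Write([]string{"user_id", "email", "amount"})
	for _, e := range events {
		w.Write(toRow(e))
	}
	w.Flush()
}
```

Because the replay source re-reads the same immutable log, re-running this pipeline with different offsets or transforms regenerates the dataset on demand, which is what makes the pattern self-serve.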
Bio: Daniel is the co-founder and CTO of Streamdal.com, a streaming data performance monitoring company. Prior to starting his own company, he was a principal engineer at companies like New Relic, InVision, Community and DigitalOcean, and before that spent over ten years in the data center space focusing on integrations and R&D. Dan specializes in distributed system design and is a huge advocate for all things asynchronous.
He has been writing Go since 2014, works primarily on backend systems, listens to a lot of black metal, and prefers Stellas over IPAs. He resides in Portland, Oregon but is originally from [Riga, Latvia](https://goo.gl/maps/9wruLSg4RBU2).