Abstract: At LinkedIn, we leverage Jupyter notebooks extensively for ad-hoc data analysis, and our data scientists, engineers, and developers spend a lot of time iterating over the query development lifecycle. We have built a hosted notebook platform for our internal users called Darwin (Data Analytics & Relevance Workbench at LinkedIn): a one-stop solution for the complete Jupyter notebook/query life cycle of query development, query testing, and query productionizing. Query development requires connecting to data sources such as HDFS, MySQL, and Kafka; using data engines such as Apache Hadoop, Apache Spark, and Trino; and using libraries like TensorFlow to build state-of-the-art machine learning models. Query testing requires viewing results in tabular form, pivoting over them and analyzing them with visualization tools, and collaborating with peers through Git and Review Board (RB). Query productionizing requires creating shareable report links for executions, scheduling queries at a set frequency, and publishing queries to other internal apps. We will also present the capabilities we have added to our hosted notebook platform to iterate, pivot, and visualize using a custom extension called a workbook, along with features for productionizing notebooks: creating custom schedules, building customized dashboards, and sharing seamlessly with other users.
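Darwin's workbook extension itself is internal to LinkedIn, but the core "pivot over tabular results" operation it provides can be sketched in a few lines of plain Python. The function and field names below are illustrative, not part of Darwin's API:

```python
from collections import defaultdict

def pivot(rows, index, column, value):
    """Pivot a list of dict rows (e.g. query results) into a nested dict:
    one outer key per `index` value, one inner key per `column` value,
    with cells taken from `value`."""
    table = defaultdict(dict)
    for row in rows:
        table[row[index]][row[column]] = row[value]
    return dict(table)

# Hypothetical query output: signups per country per quarter.
rows = [
    {"country": "US", "quarter": "Q1", "signups": 120},
    {"country": "US", "quarter": "Q2", "signups": 150},
    {"country": "IN", "quarter": "Q1", "signups": 200},
]
result = pivot(rows, index="country", column="quarter", value="signups")
# result["US"] → {"Q1": 120, "Q2": 150}
```

In a workbook-style UI, the same reshaping would feed a table or chart widget rather than a nested dict, but the transformation is the same.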
We will share the advances made in the Jupyter notebook ecosystem over the past few years that enable any user to be more productive with ad-hoc analysis. These learnings will help data scientists and machine learning engineers easily iterate and share their findings with the broader community.
LinkedIn is a data-driven company: every team consumes and produces data that improves the user experience on LinkedIn. Join Swasti Kakker and Manu Ram Pandit to explore the scalable, extensible unified platform LinkedIn is building on JupyterHub, Jupyter Notebook, Docker, Kubernetes, MySQL, Git, and Rest.li, which enables productive data science and improves the development experience.
Bio: Swasti Kakker is a senior software development engineer on the data analytics and infrastructure team at LinkedIn, where she worked on the design and implementation of Darwin, a hosted Jupyter notebook solution. She has built features such as scheduling notebooks from a cron expression, creating publishable reports from notebook executions, introducing language servers in notebooks, and integrating notebooks with various apps at LinkedIn. She works closely with stakeholders to understand their expectations and requirements for the platform so that it improves developer productivity. Her passion lies in increasing and improving developer productivity by designing and implementing scalable platforms. She has previously spoken at international conferences, including the Grace Hopper Celebration in Orlando and O'Reilly Strata in New York in 2019.