Abstract: At LinkedIn, we leverage Jupyter notebooks extensively for ad hoc data analysis, and our data scientists, engineers, and developers spend a lot of time iterating over the query development lifecycle. We have created a hosted notebook platform for our internal users called Darwin (Data Analytics & Relevance Workbench at LinkedIn): a one-stop solution for the complete Jupyter notebook/query life cycle (query development, query testing, and query productionizing). Requirements for query development include connecting to various data sources such as HDFS, MySQL, and Kafka; utilizing data engines such as Apache Hadoop, Apache Spark, and Trino; and using libraries such as TensorFlow to build state-of-the-art machine learning models. Requirements for query testing include viewing results in tabular format, pivoting over them and analyzing them with visualization tools, and collaborating with peers using Git and RB. Requirements for query productionizing include creating a shareable report link for executions, scheduling the query at a set frequency, and publishing the query to another internal app. We will also present the capabilities we have added to our hosted notebook platform for iterating, pivoting, and visualizing using a custom extension we developed called a workbook, along with features for productionizing notebooks: creating custom schedules, building customized dashboards, and sharing seamlessly with other users.
We will share the advancements made in the field of Jupyter notebooks over the past few years that enable users to be more productive with ad hoc analysis. These learnings will help data scientists and machine learning engineers easily iterate and share their findings with the broader community.
LinkedIn is a data-driven company: every team consumes and produces data that improves the user experience on LinkedIn. Join Swasti Kakker and Manu Ram Pandit to explore the scalable, extensible unified platform LinkedIn is building with JupyterHub, Jupyter Notebook, Docker, Kubernetes, MySQL, Git, and Rest.li, which enables productive data science and improves the development experience.
Bio: Manu Ram Pandit is a staff software engineer on the data analytics and infrastructure team at LinkedIn, where he has influenced the design and implementation of hosted notebooks, providing a seamless experience to end users. Manu has built multiple features in the platform, such as sharing notebooks and choosing custom Docker environments, and is currently involved with efforts to visualize big data effectively. He works closely with customers, engineers, and product managers to understand and define the requirements and design of the system. He has extensive experience building complex, scalable applications. Previously, he was with Paytm, Amadeus, and Samsung, where he built scalable applications for various domains.