Abstract: At LinkedIn, we use Jupyter notebooks extensively for ad-hoc data analysis, and our data scientists, engineers, and developers spend a lot of time iterating over the query development lifecycle. We have created a hosted notebook platform for our internal users called Darwin (Data Analytics & Relevance Workbench at LinkedIn), a one-stop solution for the complete Jupyter notebook/query lifecycle: query development, query testing, and query productionizing. Requirements for query development include connecting to various data sources such as HDFS, MySQL, and Kafka; using data engines such as Apache Hadoop, Apache Spark, and Trino; and using libraries such as TensorFlow to build state-of-the-art machine learning models. Requirements for query testing include viewing results in tabular format, pivoting over them, analyzing them with visualization tools, and collaborating with peers using Git and RB. Requirements for query productionizing include creating a shareable report link for the executions, scheduling the query at a set frequency, and publishing the query to another internal app. We will also present the capabilities we have added to our hosted notebook platform to iterate, pivot, and visualize using a custom extension called a workbook, along with features for productionizing workbooks: creating custom schedules, building customized dashboards, and sharing seamlessly with other users.
We will share the advancements made around Jupyter notebooks over the past few years that enable any user to be more productive with ad-hoc analysis. These learnings will help data scientists and machine learning practitioners iterate easily and share their findings with the broader community.
LinkedIn is a data-driven company. Every team consumes and produces data that improves the user experience on LinkedIn. Join Swasti Kakker and Manu Ram Pandit to explore the scalable, extensible, unified platform LinkedIn is building on JupyterHub, Jupyter Notebook, Docker, Kubernetes, MySQL, Git, and Rest.li, which supports productive data science and improves the development experience.
Bio: Balamurugan Gangadharan is a Senior Staff Software Engineer in the Data and Artificial Intelligence Platform group at LinkedIn, where he is the Tech Lead for DARWIN, a Data Science and Artificial Intelligence platform. He plays a key role in shaping the product roadmap by working with its various stakeholders, and in designing and implementing multiple features in DARWIN. He has extensive experience building highly scalable, complex distributed systems. Previously, he held senior roles at companies such as Qubole, Nutanix, and Dell, where he built enterprise and cloud-native solutions for complex problems.