Abstract: According to the 2021 Kaggle Machine Learning & Data Science Survey, SQL-based relational databases such as MySQL, PostgreSQL, and Microsoft SQL Server dominate the enterprise databases used by data scientists and machine learning engineers. The reasons are easy to see. In addition to the ease of creating data collection pipelines, there is an abundance of tooling in the form of open source software, such as Python packages designed for interfacing with relational databases. Training courses on the use of relational databases in data science and machine learning are plentiful, and online examples are ubiquitous.
However, the use of tabular data rests on a key assumption: that all data points, represented as rows, are independent of each other, or at least independently drawn from the same statistical distribution. But what if this is not the case? What if there are relationships between the data points, or even several different kinds of relationships between them? Such relationships show up in SQL queries that require multiple JOIN statements to produce a suitable output.
In reality, multiple JOIN statements are one of the key indicators that the data scientist is dealing with data that violates the assumption of independence between data points and might be better represented in a graph structure. In this talk we will explore the advantages and disadvantages of using graph structures for data science problems. We will discuss how to identify “graph-y” problems, examples of problems that are easier to solve with graphs, how to model the data as a graph, and how the resulting output can be used to further enhance data science and machine learning workflows.
Bio: Dr. Clair Sullivan is currently a graph data science advocate at Neo4j, working to expand the community of data scientists and machine learning engineers using graphs to solve challenging problems. She received her doctorate in nuclear engineering from the University of Michigan in 2002. After that, she began her career in nuclear emergency response at Los Alamos National Laboratory, where her research involved signal processing of spectroscopic data. She spent four years working in the federal government on related subjects and returned to academic research in 2012 as an assistant professor in the Department of Nuclear, Plasma, and Radiological Engineering at the University of Illinois at Urbana-Champaign. While there, her research focused on using machine learning to analyze the data from large sensor networks. Deciding to focus more on machine learning, she accepted a job at GitHub as a machine learning engineer while maintaining adjunct assistant professor status at the University of Illinois. In 2021 she joined Neo4j as a graph data science advocate. Additionally, she founded a company, La Neige Analytics, whose purpose is to provide data science expertise to the ski industry. She has authored four book chapters, over 20 peer-reviewed papers, and more than 30 conference papers. Dr. Sullivan was the recipient of the DARPA Young Faculty Award in 2014 and the American Nuclear Society’s Mary J. Oestmann Professional Women’s Achievement Award in 2015.