
Abstract: A hands-on tutorial integrating several disparate graph technologies available as Python open source, made to work efficiently with popular tooling for data science (PyData) and data engineering practices. This provides an abstraction layer for _Graph Data Science_ and, in particular, an introduction to _Graph Thinking_.
Python offers excellent libraries for working with graphs: semantic technologies, graph queries, interactive visualizations, graph algorithms, probabilistic graph inference, as well as embeddings and other integrations with deep learning. However, these approaches share little common ground, and few of them integrate effectively with popular data science tools (pandas, scikit-learn, spaCy, PyTorch) or efficiently with popular data engineering infrastructure such as Spark, RAPIDS, Ray, Parquet, fsspec, etc.
This workshop uses `kglab` – an open source project focused on integrating the priorities described above, and moreover on leveraging disparate graph technologies in ways that complement each other, to produce Hybrid AI solutions for industry use cases. The library is used within large enterprise firms and also serves as a teaching tool. At its core, this effort is about self-supervised learning in graph data science workflows. We'll cover some of the less intuitive lessons that have provided practical guidance in this work. For example, the notion of "Thinking Sparse and Dense": making the most of the available software and hardware subsystems, respectively, when working with graph data (see the sketch below). Similarly, how transforms and inverse transforms based on algebraic graph theory provide effective design patterns for this integration work. We'll also consider when to make trade-offs between more analytic methods and tools that allow for uncertainty in the data, and how to blend data-intensive machine learning with rule systems based on domain expertise.
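As a minimal sketch of the "sparse vs. dense" idea (assuming a recent networkx with NumPy/SciPy installed; the tiny edge list is a hypothetical stand-in for a projection of a real KG):

```python
import networkx as nx

# tiny hypothetical graph, standing in for a projection of a much larger KG
G = nx.DiGraph()
G.add_edges_from([
    ("egg", "pancake"),
    ("egg", "caesar_salad"),
    ("anchovy", "caesar_salad"),
])

# sparse representation: suits graph algorithms over large, mostly-empty adjacency structure
A_sparse = nx.to_scipy_sparse_array(G)   # SciPy CSR sparse array

# dense representation: suits vectorized linear algebra, e.g. on GPUs
A_dense = nx.to_numpy_array(G)           # NumPy ndarray

print(A_sparse.nnz, A_dense.shape)       # 3 non-zero entries; a (4, 4) dense matrix

# inverse transform: back from the algebraic object to a graph
# (node labels become integer indices in this direction)
G2 = nx.from_numpy_array(A_dense, create_using=nx.DiGraph)
```

The same graph data can move back and forth between graph-shaped and matrix-shaped representations, which is the essence of the transform/inverse-transform design pattern mentioned above.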
In this workshop, we'll alternate between brief lecture segments (using the public slides), hands-on coding based on Jupyter notebooks (see the GitHub repo and the online syllabus), and Q&A sessions.
Outline:
- Sources for data and controlled vocabularies: a progressive example based on a Kaggle food/recipes dataset
- KG construction in rdflib and serialization in TTL, JSON-LD, Parquet, etc. (see the first sketch after this list)
- Transformations between RDF graphs and algebraic objects (illustrated in the sparse/dense sketch above)
- Interactive visualization with PyVis (sketched below)
- Querying with SPARQL, with results in pandas (sketched below)
- Graph-based validation with SHACL constraint rules (sketched below)
- A sampler of graph algorithms in networkx and igraph (sketched below)
- Inference based on semantic closures: RDFS, OWL-RL, SKOS (sketched below)
- Inference and data quality checks based on probabilistic soft logic
- Embeddings (deep learning) for data preparation and KG construction
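For example, a minimal sketch of KG construction and serialization with rdflib (assuming a recent rdflib with built-in JSON-LD support; the `nr:` namespace and recipe triples are hypothetical placeholders, not the workshop dataset):

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import SKOS

# hypothetical namespace, for illustration only
NR = Namespace("http://example.org/recipe/")

g = Graph()
g.bind("nr", NR)
g.bind("skos", SKOS)

# construct a few triples describing one recipe
g.add((NR.pancake, RDF.type, NR.Recipe))
g.add((NR.pancake, SKOS.prefLabel, Literal("pancake", lang="en")))
g.add((NR.pancake, NR.hasIngredient, NR.egg))
g.add((NR.pancake, NR.hasIngredient, NR.flour))

# serialize as Turtle and JSON-LD
g.serialize(destination="recipes.ttl", format="turtle")
g.serialize(destination="recipes.jsonld", format="json-ld")
```

In the workshop, `kglab` wraps much of this boilerplate behind higher-level abstractions; the rdflib calls above are roughly what sits underneath.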
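A sketch of interactive visualization with PyVis, reusing the hypothetical rdflib graph `g` from the previous sketch (the labeling and styling choices here are arbitrary):

```python
from pyvis.network import Network

net = Network(notebook=True, directed=True)

# render each RDF triple as an edge between stringified subject and object
for s, p, o in g:
    net.add_node(str(s), label=str(s).split("/")[-1])
    net.add_node(str(o), label=str(o).split("/")[-1])
    net.add_edge(str(s), str(o), title=str(p))

net.show("recipes.html")   # writes an interactive HTML rendering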
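A sketch of SPARQL querying with results landing in a pandas DataFrame, again against the hypothetical graph `g` (kglab provides conveniences for this pattern; a plain rdflib + pandas version looks like):

```python
import pandas as pd

sparql = """
SELECT ?recipe ?ingredient
WHERE {
    ?recipe nr:hasIngredient ?ingredient .
}
"""

results = g.query(sparql, initNs={"nr": NR})

# each result row is a tuple, so it loads directly into a DataFrame
df = pd.DataFrame(results, columns=[str(v) for v in results.vars])
print(df)
```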
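A sketch of graph-based validation with SHACL via the pyshacl package (the shape below is a hypothetical constraint requiring every recipe to list at least one ingredient):

```python
from pyshacl import validate
from rdflib import Graph

shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix nr: <http://example.org/recipe/> .

nr:RecipeShape
    a sh:NodeShape ;
    sh:targetClass nr:Recipe ;
    sh:property [
        sh:path nr:hasIngredient ;
        sh:minCount 1 ;
    ] .
"""

shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")

conforms, report_graph, report_text = validate(
    g,                        # the data graph from the first sketch
    shacl_graph=shapes_graph,
    inference="rdfs",
)
print(conforms)
print(report_text)
```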
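A sketch of running classic graph algorithms by projecting the RDF graph into networkx, using rdflib's bundled converter (igraph offers analogous projections):

```python
import networkx as nx
from rdflib.extras.external_graph_libs import rdflib_to_networkx_digraph

# project RDF triples into a networkx DiGraph (predicates become edge data)
nx_graph = rdflib_to_networkx_digraph(g)

# e.g., rank nodes with PageRank
ranks = nx.pagerank(nx_graph)
top = sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top)
```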
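A sketch of inference via semantic closures, using the owlrl package to materialize RDFS entailments in place (an OWL-RL closure works the same way, via `owlrl.OWLRL_Semantics`):

```python
import owlrl

before = len(g)

# expand the graph in place with the RDFS deductive closure
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

print(f"{before} triples before closure, {len(g)} after")
```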
Background Knowledge:
- Some coding experience in Python (you can read a 20-line program)
- Interest in use cases that require knowledge graph representation
- Additionally, having completed Algebra 2 in secondary school and having some business experience with data analytics can both come in handy.
Bio: Known as a "player/coach", with core expertise in data science, natural language, machine learning, and cloud computing; 38+ years of tech industry experience, ranging from Bell Labs to early-stage start-ups. Advisor for Amplify Partners, IBM Data Science Community, Recognai, KUNGFU.AI, and Primer. Lead committer on PyTextRank. Formerly: Director of Community Evangelism at Databricks, for Apache Spark. Cited in 2015 as one of the Top 30 People in Big Data and Analytics by Innovation Enterprise.