Data visualization sometimes gets categorized as a field separate from machine learning or data science. Skill in designing effective, attractive plots and graphs doesn’t show up in job descriptions in the same way as experience with Keras or XGBoost. I think this is a mistake. In my years of data science practice, I have never built a model without visualizing some of the data, performance metrics, or output – or more often, all three. While not all data scientists are visual learners like me, it’s still true that visualization gives us a way to experience and understand our data that is distinct from what tables or text can provide.
This is a big reason why it took me longer than I’d like to develop my machine learning skills in Python. The data visualization ecosystem in Python has just never stood up to the tools R makes available, and visualizing is a non-negotiable part of machine learning development for me. I’ve complained about this before, and I’ll readily admit it. However, as my career has moved further into the Python sphere, I have had to bite the bullet and use Python tools for data visualization tasks too.
This progression has led me to think a lot about what it is that the Python libraries lack that the ggplot2 world in R provides, and what it is about a library for data visualization that creates a passionate, happy user base. This foundation is what I use to critique and assess the six Python libraries I’m going to be discussing at ODSC East 2021.
Ease of Use
The first make-or-break moment for any software, including a dataviz library, is when you first pick it up. Jan needs to generate a plot to see how some data is shaped, or needs to show something to a peer, and so googles or asks someone “what’s the best data visualization library in Python?”. How many steps, lines of code, or different docs pages will Jan put up with to get this visualization done? How many new concepts or paradigms will Jan want to learn before getting the one plot she needs? This process needs to be as easy and short as possible if you want someone to become a user. There are really three possible outcomes of that first interaction: “Hey, that was pretty good. I’ll use that again!”, “Ugh, that was tough, but I got it done. Not looking forward to the next time, though.”, or “This is a mess, I don’t have time for this, I’m going to find some other tool to use.”
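To make that first-contact moment concrete, here’s roughly what the shortest path to a first plot looks like in pandas and matplotlib – a minimal sketch with made-up data, not code from my talk:

```python
# A minimal first-plot sketch: how much code does the newcomer's
# "just show me the data" moment take? (Illustrative data only.)
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"height": [1.2, 1.5, 1.7, 1.6],
                   "weight": [40, 55, 68, 60]})

df.plot.scatter(x="height", y="weight")  # one line to a basic scatterplot
plt.show()
```

A couple of lines to a working plot is the kind of first impression that earns a second visit.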
Sensible, consistent grammar
If the library makes it past that first hurdle, then you have a user. They might have varying levels of enthusiasm about the tool, however. So, what is this user’s development going to look like? Grammar, in particular its intuitiveness and consistency, is key here. Once you learn a few key elements or functions, can you reasonably guess how to adapt them to a new use case? If you know how to create a scatterplot, does this give you any help when you need to create a line graph next? Or are you essentially starting over from scratch? People like to feel as though they are making progress and developing a more sophisticated understanding of the tool as they keep working with it. Discovering that every new use case requires a new skill set or more memorization is a big bummer. Think of it like learning a spoken or written language. Nobody likes learning English verbs because so many of them are irregular – you try to guess what the correct conjugation might be, and you’re wrong, AGAIN, and eventually this becomes really frustrating. Nobody wants this experience when making data visualizations.
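plotnine, one of the libraries I’ll be discussing, borrows ggplot2’s grammar and makes this concrete. A toy sketch, with illustrative data only:

```python
# Sketch of a consistent grammar: in a ggplot2-style API (plotnine here),
# moving from a scatterplot to a line graph is a one-token change.
import pandas as pd
from plotnine import aes, geom_line, geom_point, ggplot

df = pd.DataFrame({"day": [1, 2, 3, 4], "sales": [10, 14, 9, 20]})

scatter = ggplot(df, aes(x="day", y="sales")) + geom_point()
line = ggplot(df, aes(x="day", y="sales")) + geom_line()  # same grammar, new geom
```

Everything you learned building the scatterplot carries straight over to the line graph.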
Customization capability
Once the user has developed a robust understanding of the library and is getting comfortable making a variety of plots, they’re going to run up against customization needs. Perhaps they need a label placed just so to point out an outlier, or they want to remove just the y-axis tick marks, or fill just one small area with color. Perhaps there is a brand theme they must use for their company or organization. This kind of thing is the difference between a library that’s all right for exploratory analysis and messing around, and a library you can use to make plots that really illuminate important things – plots you’d be willing to show to your peers, boss, or clients.
If someone does all their exploratory analysis and then has to switch proverbial horses midstream, rewriting plots in another toolkit that gives them the specific features they want, they’re going to be annoyed, and they’re going to be less efficient. The user’s time is valuable, and even if they CAN do this kind of thing, that doesn’t mean they’ll want to. Within reason, a good plotting library will be full-featured and allow pretty detailed customizations.
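As a sketch of what those customization moments look like in practice, here’s a hypothetical matplotlib snippet that labels one outlier and strips the y-axis tick marks – two of the exact requests mentioned above (toy data):

```python
# Sketch of the fine-grained customization a maturing user reaches for:
# annotate one outlier and remove the y-axis tick marks.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2.1, 2.3, 2.2, 9.8, 2.4]  # one point is way off

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.annotate("outlier worth a look", xy=(4, 9.8), xytext=(2.2, 8),
            arrowprops=dict(arrowstyle="->"))  # label just this one point
ax.tick_params(axis="y", length=0)  # keep the labels, drop the tick marks
plt.show()
```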
Beautiful, readable results
In my experience, people either ONLY think about the aesthetics of plots, or they don’t think about them at all. The truth is, everyone deserves good-looking data visualizations – not just to beautify our environments, but because if a plot is appealing and easy on the eyes, more people will look at it, and the message it’s trying to convey will reach a larger audience. This matters!
However, attractive design shouldn’t take all day, and it shouldn’t be its own source of massive frustration for a non-designer user. A good data visualization library ought to have reasonably attractive design elements out of the gate, and those need to be customizable along with other aspects of the plot. Built-in themes, support for palettes like ColorBrewer, and font versatility, for example, all add to a plot’s effectiveness by making the result more pleasant to look at.
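For a sense of what “attractive out of the gate” can look like, here’s a hedged sketch using seaborn; the specific theme, palette, and font choices are just illustrations:

```python
# Sketch of out-of-the-box styling: a built-in theme, a ColorBrewer
# palette, and a font swap in a couple of lines.
import seaborn as sns

sns.set_theme(style="whitegrid",     # built-in theme
              palette="Set2",        # a ColorBrewer palette
              font="DejaVu Serif")   # font versatility

tips = sns.load_dataset("tips")      # small sample dataset fetched by seaborn
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
```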
In my opinion, these are the major considerations worth our attention when evaluating a data visualization library. I’d argue that R’s ggplot2 ecosystem does a remarkable job of hitting these marks, which is why it has such a strong and enthusiastic following. To make a comparison, I tested six Python libraries on these same criteria: matplotlib, seaborn, bokeh, altair, plotnine, and plotly. To find out more about my assessments, and to see samples of my code and plots from all these libraries, join me at ODSC East 2021 in my session, “Going Beyond Matplotlib and Seaborn: A survey of Python Data Visualization Tools”!
Stephanie Kirmer is a Senior Data Scientist at Saturn Cloud, a company making large-scale Python easy and accessible to the data community using Dask. Throughout her career, she’s used varied tools to make effective data visualizations, including as a DS Tech Lead at a travel data startup and as a Senior Data Scientist at Uptake, an industrial data science company. She holds Master’s degrees in sociology and education, and was formerly an adjunct faculty member at DePaul University in Chicago.
Sometimes, I’m stumped when people ask me what I do for a living. When I reply that I help organizations communicate better with and about data, or call myself a data translator, I see the gaze of the person I’m talking to fog over. You may be familiar with this look – it’s a common reaction many people have once data or numbers are introduced into a conversation.
A perfect example of what being a data translator means came up this past weekend. My partner and a bunch of his college buddies are huge UNC basketball fans, and on Saturday, the Tar Heels (UNC) played Duke – a fierce and famed rivalry. The game was tight – much closer than anyone who has been following either team this season could have expected. In the post-game recap conversation, our friend shared this analysis of the Tar Heels’ probability of winning the game with the group. I sensed an immediate opportunity to translate this chart for the members of our group chat who do not have a master’s degree in statistics.
Thanks to Luke Benz (@recspecs730) for putting out this visualization (and many others) about college hoops on his Twitter feed.
This chart demonstrates the hope, and ultimate defeat, that Tar Heel fans felt throughout the game. But if you’re unfamiliar with output like this, the heartbreak it depicts may not be immediately apparent.
My job as a data translator and storyteller is to take charts like the one above (from the ncaahoopR package developed by Luke Benz) and transform them to be accessible and easily understood by the general population – like a version of Google Translate dedicated to breaking down the work of data scientists.
I’d make a few changes to this chart to help interpret the game for my Tar Heel fan friends.
- Remove Duke from the equation – win probabilities in a two-team game are inverses of each other, so the other team’s values are not essential and are ultimately a bit confusing.
- Add in some annotations to call out key points in the game.
- Create a title that makes clear what the chart shows and captures the context a Tar Heel fan should walk away with.
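To sketch what those edits look like in code, here’s a hypothetical matplotlib version with made-up win-probability values standing in for the real game data:

```python
# A rough sketch of the three edits above: one team's line only,
# an annotation at a key moment, and an explanatory title.
# The numbers here are hypothetical, not the actual game data.
import matplotlib.pyplot as plt

minutes = [0, 10, 20, 30, 35, 40]
unc_win_prob = [0.35, 0.55, 0.48, 0.62, 0.41, 0.0]  # made-up values

fig, ax = plt.subplots()
ax.plot(minutes, unc_win_prob, color="#7BAFD4")  # UNC only: Duke's line is just 1 - p
ax.annotate("UNC takes the lead", xy=(30, 0.62), xytext=(15, 0.8),
            arrowprops=dict(arrowstyle="->"))    # call out a key point in the game
ax.set_title("So close: UNC's chances rose late, then collapsed at the end")
ax.set_xlabel("Game minute")
ax.set_ylabel("UNC win probability")
plt.show()
```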
My version below contains the same data and information but transformed for an audience of college basketball enthusiasts.
We are living in an incredible time for data – never before have we had access to so many data sets (big and small) or visualization tools. With this access comes the need to really put thought into the data we share – the why and the how – because we no longer need to put a ton of thought or work into creating visualizations. The most impressive data analysis is useless without the ability to clearly communicate essential takeaways and offer up persuasive recommendations.
I challenge you to think about the data and dataviz you create and distribute through the lens of a translator. Be particular and intentional about the data and visualizations you share. Determine their importance and how these data points help influence decision-makers and tell clear stories to your audience. Remember that not everyone who uses your data has your background; instead, help them learn through your expertise via clear insights and uncluttered visualizations.
I’m excited to share more about creating strong data stories at ODSC East. Please come check out my session “The Art (and Importance) of Data Storytelling” to learn more about strategic choices in visualization design and the influential power you can harness as a data translator.
Diedre Downing is a Lead Data Storytelling Trainer at StoryIQ where she helps organizations improve their communication with and about data. An accidental math teacher, Diedre learned the power of demystifying numbers in New York City classrooms and the power of influencing decision-makers with data during her time running WeTeachNYC.org for the NYC Department of Education. Diedre is an Adjunct Lecturer at Hunter College in New York and has spoken at NCTM, iNACOL, and Learning Forward about adult learning methodology and best practices in professional learning.
As data becomes increasingly interconnected and systems increasingly sophisticated, it’s essential to make use of the rich and evolving relationships within our data. Graphs are uniquely suited to this task because they are, very simply, a mathematical representation of a network. The objects that make up graphs are called nodes (or vertices), and the links between them are called relationships (or edges).
A property graph model consists of entities, often called nodes, and links between them, often called relationships. Nodes and relationships can also contain properties and attributes.
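As a minimal illustration of the property graph model, here’s a sketch using networkx as a stand-in for a graph database like Neo4j; the entities and properties are hypothetical:

```python
# A minimal property graph sketch: nodes and relationships
# both carry arbitrary properties.
import networkx as nx

G = nx.MultiDiGraph()  # directed, multiple relationships allowed

# Nodes (entities) with properties
G.add_node("alice", labels=["Account"], country="US")
G.add_node("order42", labels=["Order"], total=129.99)

# A relationship (edge) with its own properties
G.add_edge("alice", "order42", type="PLACED", channel="web")
```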
Graph algorithms are built to operate on relationships, and they are exceptionally capable of finding structures and revealing patterns in connected data. This is important because real-world networks tend to form highly dense groups with structure and “lumpy” distributions. We see this behavior in everything from IT and social networks to economic and transportation systems. Traditional statistical approaches don’t fully utilize the topology of the data itself and often “average out” distributions. Graph analytics differ from conventional analysis by calculating metrics based on the relationships between things.
Graph algorithms are used when we need to understand structures and relationships to answer questions about the pathways that things might take, how they flow, who influences that flow, and how groups interact. This is essential for tasks like forecasting behavior, understanding dynamic groups, or finding predictive components and patterns.
There are many types of graph algorithms, but the three classic categories consider the overall nature of the graph: pathfinding, centrality, and community detection. Other graph algorithms, such as similarity and link prediction, consider and compare specific nodes instead; a small code sketch of a few of these categories follows the list.
- Pathfinding algorithms are fundamental to graph analytics and explore routes between nodes.
- Centrality algorithms help us understand the impact of individual nodes on the overall network. They identify the most influential nodes and help us understand group dynamics.
- Community detection algorithms find communities whose members have more relationships within the group than outside it. This helps infer similar behavior or preferences, estimate resiliency, and prepare data for other analyses.
- Similarity algorithms look at how alike individual nodes are by comparing the properties and attributes of nodes.
- Link Prediction algorithms consider the proximity of nodes as well as structural elements, such as potential triangles, to estimate the formation of new relationships or the existence of undocumented connections.
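Here is that small sketch, using networkx as a stand-in for a graph platform and its bundled karate club network as toy data; it touches three of the categories above:

```python
# A toy sketch of three of the algorithm categories above, run on a
# classic small social network that ships with networkx.
import networkx as nx

G = nx.karate_club_graph()

# Pathfinding: explore routes between nodes
route = nx.shortest_path(G, source=0, target=33)

# Centrality: which node sits on the most shortest paths between others?
influence = nx.betweenness_centrality(G)

# Link prediction: neighborhood similarity suggests likely (or undocumented) ties
likely = list(nx.jaccard_coefficient(G, [(0, 33)]))

print(route, max(influence, key=influence.get), likely)
```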
Example: Combating Fraud
Let’s say we’re trying to combat fraud in online orders. We likely already have profile information or behavioral indicators that would flag fraudulent behavior. However, it can be difficult to differentiate between behaviors that indicate a minor offense, unusual activity, and a fraud ring. This can lead us into a lose-lose choice: Chase all suspicious orders—which is costly and slows business—or let most suspicious activity go by. Moreover, as criminal activity evolves, we could be blind to new patterns.
Graph algorithms, such as Louvain Modularity, can be used for more advanced community detection, finding groups that interact at different levels. For example, in a fraud scenario, we may want to correlate tightly knit groups of accounts with a certain threshold of returned products. Or perhaps we want to identify which accounts in each group have the most overall incoming transactions, including via indirect paths, using the PageRank algorithm.
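As a hedged sketch of that Louvain-plus-PageRank combination – shown here with networkx on a tiny made-up transaction graph, rather than the Neo4j tooling we describe below:

```python
# Sketch: find communities with Louvain, then rank influence within
# each community with PageRank. The accounts and transactions are
# entirely hypothetical.
import networkx as nx
from networkx.algorithms import community

edges = [("a1", "a2"), ("a2", "a3"), ("a3", "a1"),   # tightly knit trio
         ("a4", "a5"), ("a5", "a6"),                 # looser chain
         ("a3", "a4")]                               # bridge between groups
G = nx.Graph(edges)

# Louvain modularity: communities at different levels of interaction
groups = community.louvain_communities(G, seed=7)

# PageRank: overall influence, including via indirect paths
rank = nx.pagerank(G)

for g in groups:
    flagged = max(g, key=rank.get)  # most "influential" account per community
    print(sorted(g), "-> watch:", flagged)
```

At scale you would run these inside the graph platform itself, but the analytical pattern is the same: find the communities, then rank influence within each one.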
To illustrate these algorithms, below is a screenshot using Louvain and PageRank on season two of Game of Thrones. It finds community groups and the most influential characters using our experimental tool, the Graph Algorithms Playground. Notice how Jon is influential in a weakly-connected community but not overall, and that the Daenerys group is isolated. Interestingly, it’s been noted that highly connected “islands” of communities can signal fraud in certain financial networks.
We’ve quickly overviewed what graphs are and how graph algorithms are uniquely suited to today’s connected data, but we’ve just scratched the surface of what’s possible. If you’re interested in diving deeper, consider attending our training, “Reveal Predictive Patterns with Neo4j Graph Algorithms,” at ODSC West 2019 on Wednesday, October 30th.
We also recommend downloading a free copy of the O’Reilly book, “Graph Algorithms: Practical Examples in Apache Spark and Neo4j” while it’s still available. This book walks through hands-on examples of how to use graph algorithms in Apache Spark and Neo4j, including a chapter dedicated to machine learning.
Jennifer is a Developer Relations Engineer at Neo4j, conference speaker, blogger, and an avid developer and problem-solver. She has worked with a variety of commercial and open source tools and enjoys learning new technologies, sometimes on a daily basis! Her passion is finding ways to organize chaos and deliver software more effectively.
Amy E. Hodler
Amy is a network science devotee and a program director for AI and graph analytics at Neo4j. She is the co-author of Graph Algorithms: Practical Examples in Apache Spark and Neo4j, and tweets at @amyhodler.
Originally posted on OpenDataScience.com