Archetypal Analysis: Maintaining Contrastive Categories in Cluster Analysis


Cluster analysis is the task of finding and organizing data observations into groups that share characteristics. K-means is a common machine learning technique for this task, and it is great at finding compact clouds of homogeneous data. K-means works by finding “centroids”, or centers of average tendency in the data, and then classifying each observation according to which centroid is mathematically closest to it. Philosophically, we say that the centroids have the characteristics that define what it means to be a member of the given cluster.
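To make the centroid idea concrete, here is a minimal sketch of the assign-then-average loop in plain Python. It is an illustration only, not the talk's own code (the talk itself uses R); the function names, the toy points, and the starting centroids are all assumptions chosen for the example.

```python
import math

def assign(points, centroids):
    """For each point, return the index of the closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    moved = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            moved.append(old)  # keep an empty cluster's centroid where it was
        else:
            moved.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return moved

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and averaging until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        moved = update(points, labels, centroids)
        if moved == centroids:
            break
        centroids = moved
    return centroids, assign(points, centroids)

# Hypothetical toy data: two compact clouds in two dimensions.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, [(0, 0), (10, 10)])
print(labels)
print(centroids)
```

Each centroid ends up at the average of its cloud, which is exactly the "average tendency" that defines membership in a K-means cluster.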

In the real world, there are many categories that we do not define by their “average tendency”. For instance, in American politics, we define voters’ political ideology in the contrastive categories of liberal and conservative. We will still categorize a moderate liberal as liberal, even if their ideology has more in common with a moderate conservative than with an extreme liberal. We do this because they have qualities that at least begin to approach a stereotypical extreme. This works for many other categories as well (value shoppers, athletes, action films, etc.). K-means will often struggle to organize data this way, as it uses centroids rather than contrastive extremes to exemplify cluster membership.

Archetypal analysis, on the other hand, builds that contrastive way of thinking into its machine learning approach to clustering data. Like K-means, it finds artificial data points to exemplify cluster membership. But instead of centroids, archetypal analysis looks for extremal values on the periphery of the data distribution, called “archetypes”, and defines each observation’s cluster membership by its similarity to those archetypes. As such, archetypal analysis is a useful tool for maintaining contrastive categories in organizing data into classes.
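A deliberately simplified sketch of membership by similarity to archetypes follows, again in Python for illustration only. It assumes a hypothetical one-dimensional ideology score and takes the two archetypes to be the smallest and largest observed values. In one dimension those two extremes can reproduce every observation exactly as a weighted mix, so nothing needs to be fitted here; with more variables or more archetypes, the archetypes have to be estimated by an optimization routine, which this sketch does not attempt. The scores, labels, and function name are assumptions made for the example.

```python
def archetype_weights(x, low, high):
    """Express x as a mix of two archetypes; the weights are >= 0 and sum to 1."""
    t = (x - low) / (high - low)
    t = min(1.0, max(0.0, t))  # clamp so points outside the range stay valid mixes
    return (1.0 - t, t)

# Hypothetical scores: -1.0 = extreme liberal, +1.0 = extreme conservative.
scores = [-1.0, -0.6, -0.2, 0.2, 0.7, 1.0]
low, high = min(scores), max(scores)  # the two archetypes (assumption: 1-D data)

memberships = {}
for x in scores:
    w_liberal, w_conservative = archetype_weights(x, low, high)
    # Label each observation by the archetype it resembles most.
    label = "liberal" if w_liberal >= w_conservative else "conservative"
    memberships[x] = (w_liberal, w_conservative, label)
    print(f"{x:+.1f}  liberal={w_liberal:.2f}  conservative={w_conservative:.2f}  {label}")
```

The observation at -0.2 sits closer to the one at 0.2 than to the liberal archetype at -1.0, yet it still resembles the liberal archetype more than the conservative one, so it keeps the liberal label.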

My presentation will cover archetypal analysis in greater detail and compare it to more traditional clustering techniques. I will also cover the underlying math and machine learning approach to the technique, as well as its implementation in R, and share a few case studies from my own industry (market research) where the technique is used. Attendees should be at least somewhat familiar with cluster analysis generally, be able to follow along in R code, and have a rudimentary understanding of basic statistical and machine learning ideas.


Jacob Nelson is a data scientist working in the market research industry. Jacob designs research, collects survey data, and uses statistics and machine learning to find insights about consumer behavior that help steer clients' brand strategy. In his field, he is well appreciated for boldly finding new approaches to traditional tasks to improve research outcomes and predictive insights. He earned his master's degree in Political Science at Utah State University in 2016, acquiring skills in data science and quantitative research methodology along the way. He has been working as a data scientist in the market research industry ever since. Jacob currently works for Harris Poll, an American market research and analytics company.

Open Data Science




