Topic Modeling using pre-trained large language model embeddings


Finding the common topics discussed in a set of text responses is often done with techniques that learn how often words occur together in those responses.
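As a rough illustration of the co-occurrence counting these techniques build on, here is a minimal sketch using only the Python standard library. The responses and the helper name `cooccurrence_counts` are invented for this example and are not taken from the talk.

```python
from collections import Counter
from itertools import combinations

# Hypothetical short responses, invented for illustration only.
responses = [
    "great price and fast shipping",
    "fast shipping but poor packaging",
    "affordable and reliable",
    "cheap price",
]

def cooccurrence_counts(docs):
    """Count how many documents contain each unordered pair of distinct words."""
    counts = Counter()
    for doc in docs:
        # Sort the unique words so each pair is always stored in the same order.
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts

counts = cooccurrence_counts(responses)
print(counts[("fast", "shipping")])     # these words share two responses
print(counts[("affordable", "cheap")])  # these words never share a response
```

In this toy data, "affordable" and "cheap" never occur in the same response, so a method that relies only on these counts has no evidence that the two words are related.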

In real-world cases, each response may contain only a few words, as when looking for the common topics in social media comments that mention a particular brand. Short responses pose a problem for traditional topic modeling methods: if two words with similar meanings rarely appear together in the dataset, the model cannot learn that they represent a common topic.

Here, a pre-trained large language model (LLM) can help. Because LLMs are trained on much larger datasets, they contain richer information about when words typically appear together in the wild, beyond what a limited dataset can show. The LLM translates the responses into high-dimensional vector embeddings without requiring any expensive re-training of the model. Once the embeddings have been generated, a clustering algorithm such as K-means or HDBSCAN can group the data into discrete sets of documents that share semantic similarity.

Though measuring the distance between high-dimensional data points is easy, visualizing high-dimensional relationships is challenging. Luckily, there are several techniques for reducing high-dimensional data to a more digestible level. In particular, the UMAP algorithm does a good job of capturing both global and local structure in a 2D or 3D reduction that can be easily plotted and inspected.

In this talk, I will show how to find topics in brief text responses and create interactive visualizations of the results, using several free open-source Python packages.
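The clustering step can be sketched in a few lines. The example below is a simplified pure-Python K-means, not the packages used in the talk, and the 3-dimensional vectors are hand-made stand-ins for real LLM embeddings, which would come from a pre-trained model and have hundreds of dimensions. The names `embeddings`, `normalize`, and `kmeans` are hypothetical. Vectors are scaled to unit length so that Euclidean distance ranks pairs the same way cosine similarity does.

```python
import math

# Toy vectors standing in for LLM embeddings (assumption: invented values;
# a real pipeline would obtain these from a pre-trained model).
embeddings = {
    "cheap price": [0.9, 0.1, 0.0],
    "very affordable": [0.8, 0.2, 0.1],
    "fast shipping": [0.1, 0.9, 0.1],
    "arrived quickly": [0.0, 0.8, 0.2],
}

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Simplified K-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

texts = list(embeddings)
points = [normalize(embeddings[t]) for t in texts]
labels = kmeans(points, k=2)
for text, label in zip(texts, labels):
    print(label, text)
```

The two price-related responses land in one cluster and the two delivery-related responses in the other, even though no word is shared within either pair. That grouping comes entirely from the hand-made vectors, so it illustrates the mechanism rather than the quality of any real embedding model.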

Background Knowledge:

Light familiarity is okay; all of the tools will be introduced, and none of the Python coding will be complicated.


Matt Bezdek is a Senior Data Scientist at Elder Research. In his work, he empowers commercial clients to make better business decisions, with expertise in machine learning, forecast modeling, natural language processing, and visualization. He has a PhD in Cognitive Psychology from Stony Brook University and has conducted neuroimaging research at Georgia Tech and Washington University in St. Louis.

Open Data Science




One Broadway
Cambridge, MA 02142
