A Comparison of Topic Modeling Methods in Python


We compare three topic modeling methods in Python, using tools from the scikit-learn and gensim packages: (1) K-Means Clustering, (2) Latent Dirichlet Allocation, and (3) Non-negative Matrix Factorization. We apply each method to the same data set, together with the common preprocessing steps, and discuss the advantages and drawbacks of each, concentrating on the central question: "How many topics are contained in the documents in the data set?"


Russell Martin is a data scientist in residence at the Data Incubator, where he instructs fellows, teaches online courses, and leads training courses with corporate partners. Russ lived and worked in the UK for 17 years, including at Warwick University and the University of Liverpool, where he taught in the Department of Computer Science. He holds a PhD in applied mathematics from the Georgia Institute of Technology.

Open Data Science




