Mining “Concept Embeddings” from Open-Source Data to Classify Previously Unseen Log Messages


Given the verbosity with which modern software produces logs, it is useful to have many dimensions for filtering when looking for specific content. Often, groups of log messages relate to a general software concept (e.g. security, resource utilization, database access) and it can be useful to examine these messages as a group, or to pinpoint messages that fall within the intersection of two or more of these concepts. Here we describe our approach to classifying previously unseen log messages into these software concept categories. To handle the large domain-specific vocabulary used by log messages, we augmented the “continuous bag of words” (CBOW) embedding training process with an additional semi-supervised training step in which we create a “concept vector”. This vector of concept terms is produced by interrogating the initial embedding and manually filtering out terms that are out-of-concept or ambiguous. The vector is then used as the seed for a second “concept embedding” in which terms that associate strongly with each concept vector co-localize. This technique enabled us to minimize the amount of manual example labeling required for training our classifiers, while enabling them to correctly classify log messages containing terms unknown to the concept vector, the labeled training set, or even the model’s human creator. For example, the log message “There was an error getting a DBCP datasource.” is correctly classified as a database message because of the term “DBCP”. Our embeddings were trained on open-source data sets, including content from Stack Exchange, RFC data, and published sample software logs.
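
The core idea, that a term absent from every concept term list can still pull a message toward the right concept because it sits near that concept's terms in the embedding, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the three-dimensional vectors are hand-written toy values rather than trained CBOW output, the names `TOY_EMBEDDING`, `CONCEPT_TERMS` and `classify` are hypothetical, and nearest-centroid cosine similarity stands in for the second "concept embedding" step and the trained classifiers described above.

```python
import math
import re

# Toy 3-dimensional embedding, hand-written for illustration only.
# A real embedding would be learned with CBOW from a large corpus.
TOY_EMBEDDING = {
    "database":   (0.90, 0.10, 0.00),
    "sql":        (0.80, 0.20, 0.10),
    "query":      (0.85, 0.15, 0.10),
    "datasource": (0.85, 0.05, 0.10),  # in the embedding, not in any concept list
    "dbcp":       (0.80, 0.10, 0.05),  # in the embedding, not in any concept list
    "password":   (0.10, 0.90, 0.00),
    "login":      (0.05, 0.85, 0.10),
    "denied":     (0.10, 0.80, 0.20),
    "error":      (0.30, 0.30, 0.90),
}

# Hypothetical manually curated concept term lists ("concept vectors").
CONCEPT_TERMS = {
    "database": ["database", "sql", "query"],
    "security": ["password", "login", "denied"],
}


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def tokenize(message):
    """Lowercase the message and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", message.lower())


def classify(message):
    """Return the concept whose centroid is closest to the message vector.

    Returns None when no token of the message is in the embedding.
    """
    vectors = [TOY_EMBEDDING[t] for t in tokenize(message) if t in TOY_EMBEDDING]
    if not vectors:
        return None
    message_vector = mean(vectors)
    scores = {
        concept: cosine(message_vector, mean([TOY_EMBEDDING[t] for t in terms]))
        for concept, terms in CONCEPT_TERMS.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(classify("There was an error getting a DBCP datasource."))
```

In this toy setup neither "dbcp" nor "datasource" appears in a concept term list, yet the message still lands in the database concept because both terms lie close to the database terms in the embedding space.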


David Nellinger Adamson is the Chief Data Scientist at Zebrium. For the past six years he has applied machine learning towards automating the identification and diagnosis of problems within complex software products. At Zebrium, this means autonomously inferring the latent structure of log messages so that it can be made fully accessible for programmatic analysis. Prior to joining Zebrium, David was the Architect Data Scientist for the InfoSight team at HPE Nimble Storage. He earned his Ph.D. in Biophysics from UC Berkeley (2013) and bachelor’s degrees in Chemistry and Physics from the University of Chicago (2007).
