Abstract: GitHub repositories contain the vast majority of the code hosted at GitHub and represent the primary GitHub resource for developers, contributors and students. Repository characterization is therefore of great importance at GitHub, since it facilitates content search and discoverability, and promotes connections and collaborations among users. However, this characterization is difficult: GitHub hosts well over 100 million repositories, and their content varies greatly over time, both within and across repositories.
At GitHub, we borrow a Natural Language Processing (NLP) technique and leverage user activity to characterize repositories. We obtain this characterization without looking at the repositories' content and without extracting any features from them. The first step of our method, which makes it possible to use an NLP model, is to build “sentences of repositories”. Each sentence of repositories is a list of all the repositories visited by a given user over a certain time period. All these lists are then fed to a Doc2Vec model, which ingests them as if they were regular collections of natural language texts. In our approach, the list of repositories visited by the user can be interpreted as a “document”, while each repository, defined simply by its unique ID within the list, represents the equivalent of a “word” in that document. With millions of lists of repositories available each day, the NLP model has enough information to determine, in a fully unsupervised way, the “character” of each repository, and to store it in vector form, much in the same way that it would determine the meaning of words from a corpus of traditional text. This strategy allows us to associate each GitHub repository with a unique vector, called an “embedding”, which can later be used for multiple purposes, such as determining the degree of similarity between any pair of repositories or performing cluster analysis.
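To make the first step concrete, here is a minimal sketch of how “sentences of repositories” might be assembled from a visit log, together with a cosine similarity helper of the kind commonly used to compare embeddings. It is an illustration under stated assumptions, not the pipeline used at GitHub: the event format (user, repository ID, timestamp), the function names, the sample data and the choice of cosine similarity as the similarity measure are all hypothetical, and the Doc2Vec training step itself is not shown.

```python
from collections import defaultdict
from datetime import datetime
import math


def build_repository_sentences(events, window_start, window_end):
    """Group visits by user into time-ordered lists of repository IDs.

    `events` is an iterable of (user_id, repo_id, visited_at) tuples.
    This event format is an assumption made for illustration only.
    """
    visits = defaultdict(list)
    for user_id, repo_id, visited_at in events:
        # Keep only visits that fall inside the chosen time period.
        if window_start <= visited_at < window_end:
            visits[user_id].append((visited_at, repo_id))

    sentences = {}
    for user_id, user_visits in visits.items():
        user_visits.sort(key=lambda visit: visit[0])
        # Each repository ID plays the role of a "word" in the "document".
        sentences[user_id] = [str(repo_id) for _, repo_id in user_visits]
    return sentences


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical visit log: (user, repository ID, timestamp).
events = [
    ("alice", 101, datetime(2020, 1, 1, 9, 0)),
    ("bob", 202, datetime(2020, 1, 1, 9, 5)),
    ("alice", 303, datetime(2020, 1, 1, 8, 0)),
    ("alice", 404, datetime(2020, 1, 3, 8, 0)),  # outside the window
    ("bob", 101, datetime(2020, 1, 1, 10, 0)),
]

sentences = build_repository_sentences(
    events,
    window_start=datetime(2020, 1, 1),
    window_end=datetime(2020, 1, 2),
)
```

Each resulting list could then be passed as one document to a Doc2Vec implementation (gensim's `Doc2Vec` is one option, though the abstract does not say which library is used), and the learned repository vectors compared with a helper such as `cosine_similarity`.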
This technique is extremely flexible and can be used for more than just repository characterization. It can be transferred to any other domain where a given entity or event can be characterized by analyzing the sequences in which such entities or events occur.
Bio: Romano Foti is a Senior Machine Learning Engineer at GitHub. His work is primarily focused on characterizing the core entities of the GitHub platform, such as users, repositories, issues and code. Since joining GitHub in 2017, he has contributed to, among other things, the discover repositories feed and the trending repositories and trending users pages on github.com/explore. He has also worked on coding language and framework detection. He uses a variety of machine learning techniques and an end-to-end approach to data-driven problems, with quick passes through exploration, development, validation and deployment. Prior to GitHub, Romano worked as a Senior Data Scientist for OnDeck, specializing in marketing strategy and analytics. Romano Foti received his PhD in Civil and Environmental Engineering from Colorado State University in 2011.