
Abstract: Extracting structured information from CVs is one of the main problems that AI is addressing in the job market sector. CV parsing techniques are key to improving the user experience of a job portal, as they make it easier and faster to enter the data relevant for finding a job. For this reason, many players in the market are trying to develop this functionality in different ways. Even though this kind of task can usually be solved using Named Entity Recognition (NER) techniques, CVs follow a wide variety of formats and layouts and can be written in multiple languages, which complicates the adoption of out-of-the-box solutions.
To solve this problem at InfoJobs, the leading job board in Spain, the Machine Learning team has developed a Deep Learning solution that parses user CVs and automatically identifies the information relevant to the platform. The training pipeline starts with data augmentation techniques that generate a large number of synthetic CVs from real InfoJobs data. These are used to train a preliminary model, which is later fine-tuned on real CVs labelled by humans, avoiding overfitting to synthetic structures and greatly reducing the amount of labelled data needed for training.
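As an illustration only (the abstract does not describe the actual generator), synthetic labelled examples can be produced by filling layout templates with field values and emitting token-level BIO tags at the same time. The templates, field names and sample record below are hypothetical.

```python
import random

# Hypothetical layout templates: "{field}" placeholders are filled from a
# record, every other piece is literal text labelled "O" (outside any entity).
TEMPLATES = [
    ["{name}", "{job_title}", "at", "{company}"],
    ["{job_title}", "-", "{company}", "|", "{name}"],
]


def synthesize(record, rng):
    """Build one synthetic token sequence with matching BIO labels."""
    template = rng.choice(TEMPLATES)
    tokens, labels = [], []
    for piece in template:
        if piece.startswith("{") and piece.endswith("}"):
            field = piece[1:-1]
            words = record[field].split()
            if not words:  # skip empty fields so tokens and labels stay aligned
                continue
            tag = field.upper()
            tokens.extend(words)
            labels.extend(["B-" + tag] + ["I-" + tag] * (len(words) - 1))
        else:
            tokens.append(piece)
            labels.append("O")
    return tokens, labels


# Made-up record, standing in for real platform data.
record = {"name": "Ana Pérez", "job_title": "Data Scientist", "company": "Acme"}
tokens, labels = synthesize(record, random.Random(0))
```

Because the labels are produced together with the text, no human annotation is needed at this stage; the human-labelled CVs are reserved for the fine-tuning step.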
This new approach leverages BERT, a powerful model based on the transformer architecture that achieves strong results in NER tasks. We go beyond simple approaches by implementing a Nested NER solution, which allows us to identify the overarching structures within the original résumés. Furthermore, by combining different predictions, we are able to extract information from long spans of raw text, thus overcoming the original BERT limitation of 512 tokens.
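The abstract does not say how the predictions are combined. One common way to handle inputs longer than 512 tokens is to run the model on overlapping windows and average the per-token label scores where windows overlap. The sketch below assumes that strategy; the function names, the overlap size and the score format are illustrative, and the model call itself is left out.

```python
from collections import defaultdict


def make_windows(n_tokens, max_len=512, overlap=128):
    """Return (start, end) index pairs that cover n_tokens with overlap.

    Assumption: max_len counts content tokens only; special tokens such as
    [CLS] and [SEP] would reduce the usable length in practice.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    windows, start = [], 0
    while True:
        end = min(start + max_len, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            return windows
        start += step


def merge_predictions(windows, window_scores):
    """Average label scores per token across windows; return the best label.

    window_scores[i][j] is a dict mapping label -> score for token j of
    window i (for example, softmax outputs of a token classifier).
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for (start, _end), scores in zip(windows, window_scores):
        for offset, label_scores in enumerate(scores):
            pos = start + offset
            counts[pos] += 1
            for label, score in label_scores.items():
                totals[pos][label] += score
    merged = []
    for pos in sorted(totals):
        avg = {label: s / counts[pos] for label, s in totals[pos].items()}
        merged.append(max(avg, key=avg.get))
    return merged
```

Tokens near a window edge see little context on one side, so averaging with the neighbouring window, where the same token sits closer to the middle, tends to give more stable labels than simply cutting the text into disjoint chunks.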
This solution addresses two recurrently reported user pain points. The first is the inconvenience of being asked for too much information during the sign-up and CV editing processes, which results in a high dropout rate. With the CV parsing functionality, this process is considerably simplified. The second is the need for candidates to gain control over how and what they show about themselves to recruiters. This second pain point could be solved simply by allowing candidates to upload their own CV but, without a good CV parsing functionality, we would end up with much less information about candidates and, as a result, all matching functionalities (such as recommendations) would suffer.
Thanks to the increase in the quantity and quality of the gathered data, the CV parsing model directly improves the matching functionalities in InfoJobs, making it easier and faster for job seekers to find their perfect vacancy on our platform.
Bio: Didac, PhD in Physics and expert Data Scientist, is part of the Machine Learning team of InfoJobs (the job portal of Adevinta Spain). His work focuses on delivering product solutions based on Artificial Intelligence algorithms, most of them related to NLP. He has also driven several initiatives in Barcelona that apply the power of data and Machine Learning to social causes, for example as a board member of DataForGoodBCN and by organising data competitions to tackle problems with a clear social background.

Didac Fortuny Almiñana, PhD
Data Scientist | Adevinta Spain
