Online job ads contain valuable information that can help companies and policy makers better understand job markets. How have salaries for different occupations changed over time? How have regional labor markets evolved? To answer these questions well, skills mentioned in unstructured text need to be extracted and mapped to complex, changing taxonomies, presenting a challenging language processing task.
For instance, there are many different types of software development positions, all with different requirements, must-have experience and salary trends. We need to look at skills in much more detail to understand how job markets are changing.
In this post, I’ll show you how our Data Science team at Nesta built an NLP pipeline using spaCy and Prodigy to extract skills from 7 million online job ads to better understand UK skill demand. Core to the pipeline is our custom mapping step, which allows users to match extracted skills to any government standard, like the European Commission’s Skills, Competences, Qualifications and Occupations (ESCO) taxonomy.
Our flexible, transparent approach means that data science teams across the UK government, like the Cabinet Office, can easily use our system to understand evolving skill requirements, and discover internal mismatches.
Labor markets and Data Science at Nesta
Nesta’s Data Science team has worked with online job advertisements to make sense of the UK’s labor market for many years. Using purchased datasets, we’ve developed methodologies like the UK’s first data-driven skills taxonomy and occupational-similarity mappings for jobs at risk of automation.
While proprietary data vendors provide metadata like skill lists and Standard Occupational Classification (SOC) codes, their methodology for doing so is often a black box and the data comes with restrictions on what can be shared publicly. So we decided to build an open-source alternative for insights from job advertisements, both to be more transparent and to have more control over what we could share.
With initial funding from the UK’s Department for Education, we built out the infrastructure to scrape online job ads from job aggregator sites. To date, we’ve collected over 7 million job adverts from all over the UK.
We have also built tools and algorithms to:
- Extract relevant information from a given job advert, like company descriptions or qualification level;
- Standardize the extracted information to official, government-released standards by developing custom mapping steps; and
- Open-source our approaches.
Extracting skills from job ads
Understanding what skills are required for occupations is relevant for many groups like policy makers, local authorities and career advisors. This information means that they can make informed labor market policies, address regional skill shortages or advise job seekers. To aid in this, we developed a custom NLP solution to extract skills from unstructured job ads and map them onto official skill lists.
Organizations like the European Commission and Lightcast release their own structured skills lists. These taxonomies contain rich information like skills, skill definitions and classifications. Since the taxonomies reflect the skills demanded in the labor force, they contain thousands and thousands of labels. They are also updated to reflect skill changes in the labor market over time. As a result, training and maintaining a Named Entity Recognition (NER) model to extract specific skill labels would be a labeling nightmare.
Instead, we developed a custom NLP solution to predict generic entities and added a custom mapping step to handle different, official taxonomies. Our approach used spaCy pipelines and Prodigy for efficient annotation, model training, quality control and evaluation.
From unstructured job ads to structured skill lists
The end-to-end workflow we developed starts with identifying mentions of `SKILL`, `MULTISKILL` and `EXPERIENCE` in online job adverts. In this first step, we annotated 375 examples manually to train spaCy’s NER model. As part of our annotation process, the team labeled a handful of job adverts collectively to get a sense of the task and discuss edge cases as they came up. This approach allowed us to highlight frequent mentions of multiple skills that were not easily separable. For example, the skill “developing visualizations and apps” is in fact two skills: “developing visualizations” and “developing apps”. However, an out-of-the-box NER approach wouldn’t cleanly separate such compound skill entities.
Therefore, the second step aimed to separate multi-skill entities into their constituent skills. We trained a simple classifier to predict whether an entity was a multi-skill or not, using a training set of labeled skill and multi-skill entities. The features included the length of the entity, a boolean for whether the token `"and"` was in the entity and a boolean for whether the token `","` was in the entity. If the entity was a multi-skill, we used spaCy’s dependency parsing to split it based on a series of linguistic rules, such as splitting on noun phrases and the token `"and"`.
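As a rough illustration of those features and the split, here is a pure-Python sketch. The function names are hypothetical, and the real pipeline uses a trained classifier plus spaCy’s dependency parse rather than plain string rules:

```python
def multiskill_features(entity: str) -> dict:
    """The three features described above: entity length, plus booleans
    for whether the tokens 'and' and ',' appear in the entity."""
    tokens = entity.split()
    return {
        "length": len(tokens),
        "has_and": "and" in tokens,
        "has_comma": "," in entity,
    }

def split_multiskill(entity: str) -> list:
    """Simplified stand-in for the dependency-parse split: break on
    'and' and distribute the shared head verb over each conjunct."""
    head, _, rest = entity.partition(" ")
    conjuncts = [part.strip() for part in rest.split(" and ")]
    return [f"{head} {c}" for c in conjuncts]

print(multiskill_features("developing visualizations and apps"))
# → {'length': 4, 'has_and': True, 'has_comma': False}
print(split_multiskill("developing visualizations and apps"))
# → ['developing visualizations', 'developing apps']
```

The dependency parse handles harder cases than this string rule can, such as conjuncts with their own modifiers, but the example shows the basic idea of reattaching the shared verb to each conjunct.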
Once we had our extracted skills and separated multi-skills, we built a custom, flexible mapping step to map onto any official taxonomy.
Customized skills mapping
Our custom mapping step uses semantic similarity and the structure of a given taxonomy to standardize the extracted skills. This means that we can take advantage of metadata associated with skills, like skill definitions and categories.
We embed both the extracted skills and the taxonomy, then calculate the cosine similarity between each extracted skill and every taxonomy skill or skill group. If the cosine similarity is above a certain threshold, we map directly at the skill level. If it is below that threshold, we then calculate both the maximum share and the maximum cosine similarity with skills in a skill group.
For example, the skill “mathematics” may be too vague to be mapped at the skill level, but it maps nicely to a number of similar skills like “philosophy of mathematics” and “using mathematical tools and equipment”, which sit under the skill group “natural sciences, mathematics and statistics”. In this instance, the share of similar-ish skills is above a certain threshold and we are able to assign “mathematics” to the appropriate skill group. We move up a given taxonomy until a broad-enough skill group can be confidently assigned to the given entity.
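The threshold logic can be sketched with toy embeddings. The vectors, thresholds and helper names below are illustrative assumptions; the real pipeline uses sentence embeddings of the extracted skills and the actual taxonomy:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

SKILL_THRESHOLD = 0.7  # illustrative cut-off for a direct skill-level match
SOFT_THRESHOLD = 0.4   # illustrative "similar-ish" cut-off within a group
GROUP_SHARE = 0.5      # illustrative share of group skills that must be similar-ish

def map_skill(entity_vec, taxonomy):
    """taxonomy: {group_name: {skill_name: vector}}. Try a direct
    skill-level match first; otherwise fall back to the skill group
    with the largest share of moderately similar member skills."""
    best_skill, best_sim = None, -1.0
    for skills in taxonomy.values():
        for skill, vec in skills.items():
            sim = cosine(entity_vec, vec)
            if sim > best_sim:
                best_skill, best_sim = skill, sim
    if best_sim >= SKILL_THRESHOLD:
        return ("skill", best_skill)
    best_group, best_share = None, 0.0
    for group, skills in taxonomy.items():
        share = sum(
            cosine(entity_vec, v) >= SOFT_THRESHOLD for v in skills.values()
        ) / len(skills)
        if share > best_share:
            best_group, best_share = group, share
    if best_share >= GROUP_SHARE:
        return ("skill_group", best_group)
    return ("unmatched", None)

# Toy two-dimensional "embeddings" standing in for sentence embeddings.
taxonomy = {
    "natural sciences, mathematics and statistics": {
        "philosophy of mathematics": [0.5, 1.0],
        "using mathematical tools and equipment": [0.6, 1.0],
    },
    "arts": {"painting": [0.0, 1.0]},
}

print(map_skill([0.0, 1.0], taxonomy))  # confident direct match: ('skill', 'painting')
print(map_skill([1.0, 0.0], taxonomy))  # too vague for any one skill; falls back to the group
```

The second call mirrors the “mathematics” example: no single skill clears the direct-match threshold, but enough skills in one group are similar-ish, so the entity is assigned at the group level.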
To make this workflow as accessible as possible, we released Nesta’s most popular GitHub repository to date as a Python library so end-users could:
- Extract skills and experiences from job ads and;
- Map skills to pre-configured taxonomies (European Commission, Lightcast or a toy taxonomy for testing).
Since its initial release earlier last year, we retrained the NER model we used in the first step of the workflow using Prodigy. Prodigy is a modern data development and annotation tool for creating training and evaluation data for human-in-the-loop machine learning that provides a fully scriptable back-end and a web application for annotation.
We took advantage of Prodigy’s active learning functionality and used our initial model to rapidly label additional `SKILL`, `MULTISKILL` and `EXPERIENCE` mentions. During the annotation process, we were able to identify other categories that could be interesting for analysis, including job benefits like flexible working hours and paternity leave. In a single terminal command, we labeled more data to re-train our skills NER model and started to extract a new category, `BENEFIT`.
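That single command looks roughly like the sketch below, using Prodigy’s `ner.teach` active-learning recipe. The dataset name, model path and source file are placeholders, not our actual setup:

```shell
# Stream job adverts, let the current model suggest entities, and
# annotate the most informative examples in Prodigy's web UI.
prodigy ner.teach skills_annotations ./skills_ner_model ./job_ads.jsonl \
  --label SKILL,MULTISKILL,EXPERIENCE,BENEFIT
```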
Results and evaluation
We evaluated every component of the workflow in addition to the workflow as a whole. For the NER model, we used `nervaluate`, a Python module for evaluating NER that accounts for partial matches.
The lightweight NER model (3.66 MB) achieved 68% accuracy and 52% recall. Meanwhile, the multiskill classifier achieved 91% accuracy.
For skill extraction on the whole, we manually labeled a random sample of 171 extracted skills and felt that 75% were well extracted while 19% were quite well extracted. Finally, for skill mapping on the whole, we felt that 64% were well matched to standardized skills and 27% were quite well matched.
Although the NER model’s scores may not appear high, partial matches are still useful for our purposes. For example, if the model predicts “Excel” as a skill, as opposed to “Microsoft Excel”, it is still usefully mapped to an appropriate standard skill. This is similarly the case for extracted skills that might be missing the appropriate verb, like “industrial trucks” as opposed to “drive industrial trucks”.
The strengths of our end-to-end workflow include:
- We can extract skills that have not been seen before. For example, although the ESCO taxonomy does not contain the programming skill “React”, the model was able to detect “React” as a skill, and map it to “use scripting programming”.
- The library can be adapted to your chosen taxonomy. We have built the library in such a way that you can map skills to a custom taxonomy if desired.
- You can match to different levels of the taxonomy. This can be handy when a job advert mentions a broad skill group (e.g. “computer programming”) rather than a specific skill (e.g. “Python”).
Its limitations include:
- Metaphors: The phrase “understand the bigger picture” is matched to the ESCO skill “interpreting technical documentation and diagrams”.
- Multiple skills: We use basic semantic rules to split multi-skill sentences up, e.g. “developing visualizations and apps”, but the rules aren’t complete enough to split up more complex sentences.
With our skills pipeline and a rich database of online job ads, we are able to analyze skill demand across Nesta’s focus areas, like improving outcomes for young children from disadvantaged backgrounds and home decarbonization.
For example, we compared skills demanded for early year professionals to similarly paid roles in hospitality in light of the UK government’s expanded childcare hours. We also explored current skills gaps in the heat pump installer industry in the UK.
Future plans
We are about to release our current work on measuring the greenness of jobs at the skill, occupation and industry level, which relied heavily on Prodigy’s flexible custom recipes to incorporate Large Language Models (LLMs) in the labeling process. We have also kicked off work on making our suite of job advert algorithms as accessible as possible, allowing any end-user with job adverts to seamlessly extract and standardize information like salaries, occupations and benefits. Both custom spaCy pipelines and Prodigy will play a central role in this process.
This research has been funded by the Office for National Statistics as part of the research program of the Economic Statistics Centre of Excellence (ESCoE). Liz Gallagher is the co-creator of the library. Cath Sleeman provided overall guidance on the project. Jack Vines built the data infrastructure to collect job adverts.