Methods
=======

In this article, we describe the analysis and machine learning model used to predict whether a job is predominantly a software job or not. We cover data collection, preparation of the training data, and training of the model. We then discuss the performance metrics and use the model to classify jobs from 2014--2019.

The article is divided into the following sections:

.. contents::
   :local:

Data collection
---------------

Data for jobs was obtained from https://jobs.ac.uk by scraping the website. Each job has a unique id of the form AAANN. The data is then cleaned, and we ensure that at least the following attributes are present for each job in our dataset:

* **Job Title**: Title of the job, such as Research Associate, Research Software Engineer, or Lecturer. We use text features from the job title as one of our features.
* **Description**: This key contains the description of the job. We use text features from the description as one of our features.
* **Date of publishing**: The date the job was published. This is essential for analysis over time.
* **Salary**: This information is essential for our analysis. The salary can be a single value or a range of values depending on experience.
* **Employer**: The employer that posted the job. We are only interested in universities in the United Kingdom, so we use a list of all UK universities and keep only the jobs whose employer matches an element of that list.
* **Type of role**: This field is an array giving the type of job. It can contain one or more of these values: [Academic or Research, Professional or Managerial, Technical, Clerical, Craft or Manual, PhD, Masters]. We use this to ignore PhD or Masters positions, as we are not interested in them.

The cleaned data is stored in a database.

Training data
-------------

We use supervised classification, so we need labels. To classify the jobs into two categories we needed a training dataset. We asked experts to read a subset of jobs, presented as on the website (without pictures), and indicate which category each job falls into. They had the choice between four options:

* This job is **mostly** for a software development position (*most*).
* This job requires **some** software development (*some*).
* This job does **not** require software development (*none*).
* There is not enough information to decide (*NA*).

Each job was shown several times (up to three times) to different experts until a consensus emerged. A job is classified as a software job if two participants assigned *most* or *some* to the question: how much of this person's time would be spent developing software? If no consensus emerged, a third rater was used to derive a majority rating. Only jobs with a clear classification were kept for building the model.

We performed an inter-rater reliability calculation using Krippendorff's alpha, and obtained alpha = 0.6774 for the first two raters. We do not include the third rater, since they did not rate all the data; if we do include them, we obtain an alpha of 0.6116. Thus the data just crosses the commonly cited minimum acceptable threshold for data analysis (alpha above 0.667), although it is generally recommended to have alpha above 0.800. We can therefore only rely on this training data as a tentative signal for a software job. Ideally we should try to obtain better, more consistent ratings, either by using more expert raters with a clearly defined set of questions, or many raters, such as in crowdsourcing.
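The agreement calculation can be reproduced along the lines of the following minimal sketch. It assumes the ``krippendorff`` Python package and numerically coded ratings (one row per rater, one column per job, ``NaN`` where a rater did not see a job); the package choice and the toy ratings are illustrative, not the exact data or tooling used.

.. code-block:: python

   # Minimal sketch of the inter-rater reliability check. Categories are
   # coded numerically (0 = none, 1 = some, 2 = most); missing ratings
   # are NaN. These example ratings are purely illustrative.
   import numpy as np
   import krippendorff  # pip install krippendorff

   ratings = np.array([
       [2, 1, 0, 0, np.nan, 1],   # rater 1
       [2, 1, 0, 1, 0,      1],   # rater 2
   ])

   alpha = krippendorff.alpha(reliability_data=ratings,
                              level_of_measurement="nominal")
   print(f"Krippendorff's alpha: {alpha:.4f}")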
Features
--------

We use text features from the description and job title to train our model. The other job attributes (such as employer or salary) are not expected to have any relationship with whether a job is a software job or not. This is particularly true for academic software jobs, as many of them are paid on the same UK salary scale as postdoctoral research associates.

For text features, the standard approach is to use TF-IDF over a mixture of n-grams. We chose unigrams and bigrams as they provide sufficient information without being noisy. We perform standard text cleaning operations such as removing stopwords, punctuation and currency symbols. We use information gain to understand which features are relevant. Most of the features are not very predictive, so we keep the first 24,000 n-grams for the description, out of a total of 133,300 n-grams. The job title contains relatively little text, so we keep all of its 4,308 n-grams. Altogether, we then have 28,308 features.

Model
-----

We train a set of supervised classifiers on our dataset and select the model with the best mean precision across 5 folds of the data. As our dataset is unbalanced, accuracy is not a good measure of model performance. Precision, defined as the ratio of true positives to all predicted positives, is a better metric for us because (i) we want a very low proportion of false positives, and (ii) we are less concerned about false negatives, i.e. classifying a software job as a non-software job. With a high precision we can then assert that the predicted number of software jobs is an underestimate of the true number. If we also get high recall (the proportion of actual software jobs predicted correctly), then we can additionally claim that the obtained number of software jobs is close to the actual number.

**Classifiers**. We trained the following supervised classifiers: support vector machines, logistic regression, random forests (an ensemble of decision trees), a single decision tree classifier and a gradient boosting classifier. The best model is selected and its hyperparameters tuned using nested cross validation. In simple cross validation, the dataset is split into K folds, with K-1 folds used for training and the remaining fold used for model evaluation. The average score across the folds is then compared across the different hyperparameters, and the hyperparameters with the best performance are chosen. However, this conflates model evaluation with parameter selection, since the same test set is used for both, and risks making our model appear better than it is (Cawley and Talbot 2010).

**Nested cross validation**. This issue is rectified by performing nested cross validation. The inner loop selects the best model and tunes the parameters, and the outer loop evaluates the model. We use 5 folds for both the inner and outer loops, using stratified folds which preserve the proportion of jobs in each class (our dataset is unbalanced, with more non-software jobs than software jobs). For each model, we obtain a set of metrics. While we want to optimise for precision, we also want a high enough recall. One way of accomplishing this is to use the F1 score, the harmonic mean of precision and recall. However, F1 gives equal weight to precision and recall. While we could have experimented with variations of a weighted F1 score, we opted for the following model selection criterion: select the model with a precision above 0.90 and the highest recall.
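The nested cross-validation described above can be set up along the following lines with scikit-learn. This is a minimal sketch: the TF-IDF pipeline mirrors the feature section, but the parameter grid, random seeds and the toy ``texts``/``labels`` data are illustrative stand-ins rather than the exact configuration used.

.. code-block:: python

   from sklearn.pipeline import Pipeline
   from sklearn.feature_extraction.text import TfidfVectorizer
   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                        cross_validate)

   # Toy stand-in data: in practice these are the job texts and the
   # expert labels (1 = software job, 0 = not).
   texts = (["develop research software and data pipelines"] * 10 +
            ["lecture and supervise undergraduate students"] * 10)
   labels = [1] * 10 + [0] * 10

   # Pipeline: TF-IDF over unigrams and bigrams, then a classifier.
   pipeline = Pipeline([
       ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
       ("clf", LogisticRegression(class_weight="balanced",
                                  solver="lbfgs", max_iter=1000)),
   ])

   # Inner loop: hyperparameter tuning (illustrative grid).
   param_grid = {"clf__C": [1, 100, 10000]}
   inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
   search = GridSearchCV(pipeline, param_grid,
                         scoring="precision", cv=inner_cv)

   # Outer loop: evaluation of the tuned model on held-out folds.
   outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
   scores = cross_validate(search, texts, labels, cv=outer_cv,
                           scoring=["precision", "recall", "f1"])

   print("precision:", scores["test_precision"].mean())
   print("recall:   ", scores["test_recall"].mean())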
Following this selection criterion, the chosen model is the logistic regression model with C = 10000, balanced class weights and the L-BFGS solver. It has the following metrics:

==================  ======
Metric              Value
==================  ======
Precision           0.9007
Recall              0.3549
Balanced accuracy   0.6688
F1                  0.4914
ROC AUC             0.9093
==================  ======

**Model ensemble**. To obtain confidence intervals for the probability estimates from logistic regression, we create a model ensemble by making 100 different splits of the training data and using each split to train the best model, keeping the hyperparameters fixed.

Prediction
----------

We predict using the model ensemble for a dataset collected from 2014--2019, containing 344,012 jobs. Of these, only 335,437 had both the description and job title correctly parsed from the jobs.ac.uk data. We further drop jobs based on the following criteria:

* After dropping jobs without a salary: 274,913
* After dropping jobs without a posted date: 274,912
* After dropping jobs at PhD level: 260,821

Using the ensemble we generate 100 different predictions for each job, from which we obtain bootstrap confidence intervals and probability estimates for each job. The probability bounds are used to generate upper and lower bounds on the total number of software jobs. Out of the 260,821 jobs, 33,704 (32,000--35,413, based on the 95% CI of the probability being greater than 0.5) were classified as software jobs. This translates to a proportion of 12.9% (95% CI 12.3--13.6%) of all jobs being classified as requiring some software development.

We note that the precision is high while the recall is low. The model is conservative: the target job type is precisely identified with few false positives, but in doing so the model fails to identify many jobs. The reported estimates should therefore be considered an underestimate for the target job type.

Out of the 33,704 jobs classified as software jobs, 513 (1.5% of all software jobs) had the words 'research' and 'software' in their job title, explicitly indicating their nature. This metric can be used to track adoption of the nomenclature of research software engineering in the UK academic job market. Out of the 33,704 software jobs, 25,634 (76.1%) were fixed term and 6,738 (20.0%) were permanent positions.
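For reference, the following minimal sketch shows how the ensemble's 100 probability estimates per job translate into interval estimates of the kind reported above. It assumes scikit-learn and uses randomly generated stand-ins for the feature matrices (``X_train``, ``X_jobs``) and labels; the resampling scheme (stratified 80% subsamples) is an illustrative choice, not necessarily the exact split strategy used.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import train_test_split

   # Toy stand-in data: in practice X_* are the TF-IDF feature matrices
   # and y_train the expert labels.
   rng = np.random.default_rng(0)
   X_train = rng.normal(size=(200, 20))
   y_train = (X_train[:, 0] > 0).astype(int)
   X_jobs = rng.normal(size=(1000, 20))

   # Ensemble: 100 models with fixed hyperparameters, each trained on a
   # different split of the training data.
   probs = []
   for seed in range(100):
       X_sub, _, y_sub, _ = train_test_split(
           X_train, y_train, train_size=0.8,
           stratify=y_train, random_state=seed)
       model = LogisticRegression(C=10000, class_weight="balanced",
                                  solver="lbfgs", max_iter=1000)
       probs.append(model.fit(X_sub, y_sub).predict_proba(X_jobs)[:, 1])
   probs = np.vstack(probs)                     # shape: (100, n_jobs)

   # Per-job 95% interval on the predicted probability, and the implied
   # bounds on the number of jobs classified as software jobs.
   lower, upper = np.percentile(probs, [2.5, 97.5], axis=0)
   point = probs.mean(axis=0)
   print("estimate:", int((point > 0.5).sum()),
         "range:", int((lower > 0.5).sum()), "-", int((upper > 0.5).sum()))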