In this project, we predict fake job postings from a set of posted jobs. The dataset comes from Kaggle and consists of 17,880 rows of job postings. This document describes a topic modelling technique used in conjunction with classification models to distinguish fake jobs from real ones with high accuracy.
The dataset contains:
- 17,880 rows
- 18 features
- 5 features (title, company_profile, description, requirements and benefits) are long texts
- The remaining 13 features are mainly numeric or categorical fields
The dataset is provided with a fraudulent column, where a value of 1 denotes a fake job and 0 a real one.
The dataset contains many missing values, which we treat as valid observations: missing fields may themselves be a signal that a post is fake.
Following are the steps performed for data engineering (a pandas sketch follows the list):
- Replace nulls with the string "missing" - instead of dropping missing values, we use them as valid observations, since fake posts often have missing data
- Separate country, state and city from location column
- Drop non-English text entries
- Clean text columns - separate sentences; remove URLs, non-ASCII characters, punctuation, extra spaces and other whitespace
- Redefine education bins - some rows have "some high school coursework", "high school or equivalent", etc., which are replaced with "less than high school" to generalize them
- Drop the salary column: it is very often missing, the units used in foreign countries are unclear, and the time frame is inconsistent. There is no way to standardize this column across such a wide range of values
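A minimal sketch of these steps in pandas. The file name, the exact replacement values, and the column names `location`, `required_education` and `salary_range` are assumptions based on the Kaggle dataset:

```python
import re
import pandas as pd

df = pd.read_csv("fake_job_postings.csv")  # assumed file name

# Keep missing values as a valid observation instead of dropping rows
text_cols = ["title", "company_profile", "description", "requirements", "benefits"]
df[text_cols] = df[text_cols].fillna("missing")

# Split the location column ("US, NY, New York") into country / state / city
loc = df["location"].fillna("missing,missing,missing").str.split(",", n=2, expand=True)
df["country"], df["state"], df["city"] = loc[0].str.strip(), loc[1], loc[2]

def clean_text(s: str) -> str:
    s = re.sub(r"http\S+|www\.\S+", " ", s)          # remove URLs
    s = s.encode("ascii", errors="ignore").decode()  # drop non-ASCII characters
    s = re.sub(r"[^\w\s]", " ", s)                   # remove punctuation
    return re.sub(r"\s+", " ", s).strip()            # collapse extra whitespace

for col in text_cols:
    df[col] = df[col].astype(str).map(clean_text)

# Collapse fine-grained education values into one bin (values illustrative)
df["required_education"] = df["required_education"].replace(
    {"Some High School Coursework": "Less than High School",
     "High School or equivalent": "Less than High School"})

# Salary cannot be standardized across currencies and time frames, so drop it
df = df.drop(columns=["salary_range"])
```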
The exploratory data analysis of this dataset can be found at this URL, along with a detailed description of the analyses and insights found about the data.
We used Latent Dirichlet Allocation (LDA) topic modelling to find the number of topics and to generate topic probabilities for each row. These probabilities are later added as extra features to the classification models described below. Sections of the code are adapted from link.
Here are the steps performed in topic modelling (a gensim sketch follows the list):
- Combine text fields into single string
- Tokenize, remove stop words, lemmatize based on POS
- Build term frequency corpus
- Build LDA model
- Tune based on topic coherence, specifically the C_V coherence value
- Add topic probabilities as metadata to the dataset
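A minimal sketch of this pipeline in gensim, assuming `docs` holds the tokenized, stop-word-filtered, lemmatized texts from the preprocessing step (the search range is illustrative; the document reports a maximum at 22 topics):

```python
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

# `docs` is one list of tokens per posting, produced in the steps above
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]   # term-frequency corpus

best_k, best_score, best_model = None, -1.0, None
for k in range(5, 31):                               # illustrative search range
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   alpha="auto", eta="auto",         # learn asymmetric priors
                   passes=10, random_state=42)
    score = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    if score > best_score:
        best_k, best_score, best_model = k, score, lda
```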
Topic coherence is the degree of semantic similarity between high-scoring words in a topic. It is a modern alternative to perplexity, which measures how surprised a model is by new data (the normalized log-likelihood of held-out test data). C_V coherence is a "measure based on a sliding window, one-set segmentation of the top words and an indirect confirmation measure that uses Normalized Pointwise Mutual Information (NPMI) and cosine similarity." Link
Parameters:
- Number of topics
- Alpha
- Eta
- Alpha and eta are selected with the built-in auto method, which learns an asymmetric prior from the data
From Link (assuming a symmetric prior): alpha represents document-topic density - with higher alpha, documents are made up of more topics, and with lower alpha, documents contain fewer topics. Beta (gensim's eta) represents topic-word density - with high beta, topics are made up of most of the words in the corpus, and with low beta, they consist of few words.
The maximum coherence score was achieved with 22 topics.
The topics visualization can be found at [link].
Following are the steps performed (a sketch follows the list):
- Created a blank dataframe, initialized with a single column of value 0
- Looped through the LDA results, created a series of topic probabilities for each document and appended it onto the dataframe
- Merged with the original dataset
- Finally, replaced missing values with 0s: if a topic is missing, that row has 0% probability in that topic
- All remaining variables are categorical, so we created dummies and dropped one level to avoid collinearity
- There are many distinct countries, so dummies were created only for countries with more than 100 posts
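A minimal sketch of these steps, assuming the `best_model`, `corpus` and `df` objects from the sketches above; the categorical column names are illustrative:

```python
import pandas as pd

# Collect per-document topic probabilities from the trained LDA model;
# minimum_probability=0 makes get_document_topics return every topic.
rows = [dict(best_model.get_document_topics(bow, minimum_probability=0.0))
        for bow in corpus]
topic_df = pd.DataFrame(rows).fillna(0.0)            # missing topic -> 0% probability
topic_df.columns = [f"topic_{i}" for i in topic_df.columns]

# Merge topic probabilities back onto the original dataset
df = pd.concat([df.reset_index(drop=True), topic_df], axis=1)

# Dummies for categorical features, dropping one level to avoid collinearity
df = pd.get_dummies(df, columns=["employment_type", "required_experience",
                                 "required_education"], drop_first=True)

# For country, one way to keep dummies only for countries with > 100 posts:
counts = df["country"].value_counts()
common = counts[counts > 100].index
df["country"] = df["country"].where(df["country"].isin(common), other="other")
df = pd.get_dummies(df, columns=["country"], drop_first=True)
```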
We apply SMOTE sampling to the training data so that there is an equal number of observations in each class. The same function also performs the 80/20 train/test split.
SMOTE: synthetic minority over-sampling technique.
SMOTE synthesizes new examples for the minority class rather than simply duplicating existing ones, which would not add any new information.
"… SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b"
SMOTE sampling on the training data:
- Original number of fraudulent posts in the data: 687
- Length of the oversampled data: 26,956
- Number of real posts in the oversampled data: 13,478
- Number of fraudulent posts in the oversampled data: 13,478
- Proportion of real data in the oversampled data: 0.5
- Proportion of fraudulent data in the oversampled data: 0.5
We fit three families of classification models:
- Unregularized Logistic Regression
- Regularized (Lasso) Logistic Regression with Cross-Validation
- Ensemble Tree Models
For each, we used three versions of the data:
- original imbalanced dataset
- balanced weighting
- SMOTE
Original imbalanced dataset:

Metric | Value
---|---
Accuracy | 0.9551
TPR (recall) | 0.1965
TNR | 0.9941
FPR | 0.0059
FNR | 0.8035
Precision | 0.6296
Area under ROC | 0.5953
Area under PR | 0.1630
Balanced weighting:

Metric | Value
---|---
Accuracy | 0.8216
TPR (recall) | 0.8092
TNR | 0.8222
FPR | 0.1778
FNR | 0.1908
Precision | 0.1894
Area under ROC | 0.8157
Area under PR | 0.1626
SMOTE:

Metric | Value
---|---
Accuracy | 0.8204
TPR (recall) | 0.7977
TNR | 0.8216
FPR | 0.1784
FNR | 0.2023
Precision | 0.1867
Area under ROC | 0.8096
Area under PR | 0.1588
We can see that the imbalanced model is heavily biased toward accuracy and precision at the expense of recall. SMOTE and class weighting perform very similarly.
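For reference, the metrics in these tables follow directly from the confusion matrix; a minimal sketch (the `report` helper is hypothetical, and the area metrics assume predicted probabilities are available):

```python
from sklearn.metrics import (confusion_matrix, roc_auc_score,
                             average_precision_score)

def report(y_true, y_pred, y_prob):
    # ravel() returns the confusion matrix as (tn, fp, fn, tp)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("Accuracy      ", (tp + tn) / (tp + tn + fp + fn))
    print("TPR (recall)  ", tp / (tp + fn))
    print("TNR           ", tn / (tn + fp))
    print("FPR           ", fp / (fp + tn))
    print("FNR           ", fn / (fn + tp))
    print("Precision     ", tp / (tp + fp))
    print("Area under ROC", roc_auc_score(y_true, y_prob))
    print("Area under PR ", average_precision_score(y_true, y_prob))
```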
- Cross Validation to choose regularization parameter
- No need to scale or normalize since all features are categorical or probabilities between 0 and 1
- refit = True means the model is refit with the best selected parameters after CV
- Fit without an intercept so that all topic levels (which sum to 1) can be included. One level still needs to be removed from the other dummies.
- Increasing max_iter even to 5000 does not eliminate the convergence warning
Iterations: repeat the following with each of these 4 scoring metrics: ROC AUC, accuracy, precision, recall (a sketch follows this list)
- Balanced weighting
  - class_weight = 'balanced'
- penalty = 'l1'
- fit(X_train, y_train)
- SMOTE
- penalty = 'l1'
- fit(os_data_X, os_data_y)
- SMOTE with elastic net penalty
  - SMOTE performed better than balanced weighting, so we keep it
- penalty = 'elasticnet'
- l1_ratios = [0, .25, .5, .75, 1]
- fit(os_data_X, os_data_y)
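A minimal sketch of these iterations with scikit-learn's `LogisticRegressionCV`, assuming the `X_train`/`y_train` and `os_data_X`/`os_data_y` objects from the SMOTE step:

```python
from sklearn.linear_model import LogisticRegressionCV

for metric in ["roc_auc", "accuracy", "precision", "recall"]:
    # Balanced class weighting on the original training data
    clf_bal = LogisticRegressionCV(
        penalty="l1", solver="saga", scoring=metric,
        class_weight="balanced", fit_intercept=False,
        max_iter=5000, refit=True, cv=5)
    clf_bal.fit(X_train, y_train)

    # SMOTE-resampled training data (no class weighting needed)
    clf_smote = LogisticRegressionCV(
        penalty="l1", solver="saga", scoring=metric,
        fit_intercept=False, max_iter=5000, refit=True, cv=5)
    clf_smote.fit(os_data_X, os_data_y)

# Elastic net variant on the SMOTE data
clf_enet = LogisticRegressionCV(
    penalty="elasticnet", solver="saga", l1_ratios=[0, .25, .5, .75, 1],
    scoring="roc_auc", fit_intercept=False, max_iter=5000, cv=5)
clf_enet.fit(os_data_X, os_data_y)
```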
Insights:
- Accuracy scoring: FPR = 0, FNR = 1 - the model almost always predicts real.
- ROC, precision and recall are all fairly similar.
- Recall does the best at minimizing FNR (as is its purpose), and everything else is only slightly worse
- Precision is more balanced; unclear which is better
- The baseline is only slightly worse than the tuned results
Best: Recall or precision
Insights:
- Best model: ROC (all are effectively the same, including the baseline)
Best: ROC
Insights:
- Not materially different from Lasso with SMOTE, so we do not use it
The models are very similar, with some tradeoffs. Ultimately we chose SMOTE with ROC scoring as the best model because it is the most balanced across the tradeoffs. The benefits of the class-weighting precision model are small (precision, FPR) and it is worse in many areas (FNR, TPR).
Also, SMOTE was consistent across all 4 metrics, making it a very robust model that would likely perform well on new data.
The lines show the 0.5 threshold. The threshold is appropriate because it reaches close to the top left of the graph and thus gives a good tradeoff between FPR and TPR.
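A minimal sketch of how such a plot can be produced with scikit-learn and matplotlib, assuming a fitted model such as `clf_smote` from the sketch above:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve

y_prob = clf_smote.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# Mark the point on the curve closest to the 0.5 decision threshold
idx = np.argmin(np.abs(thresholds - 0.5))

plt.plot(fpr, tpr, label="ROC curve")
plt.axvline(fpr[idx], linestyle="--")
plt.axhline(tpr[idx], linestyle="--")
plt.xlabel("FPR"); plt.ylabel("TPR"); plt.legend(); plt.show()
```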
Code modified from https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/ and https://machinelearningmastery.com/xgboost-for-imbalanced-classification/
Specifically, the guidance on how and in what order to tune parameters comes from https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/.
We did the train/test split plus SMOTE sampling. There is no need to drop one level of the dummies in this case, since tree models are not affected by collinearity.
Results with all default values, for 3 iterations (a sketch follows this list):
- Unbalanced original data
- SMOTE
- Balanced class weighting.
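A minimal sketch of the three baseline fits, assuming xgboost's scikit-learn wrapper; `scale_pos_weight` is the usual way to apply balanced class weighting in XGBoost:

```python
from xgboost import XGBClassifier

# 1) Unbalanced original data, all defaults
xgb_plain = XGBClassifier().fit(X_train, y_train)

# 2) SMOTE-resampled training data
xgb_smote = XGBClassifier().fit(os_data_X, os_data_y)

# 3) Balanced class weighting: ratio of negatives to positives
ratio = (y_train == 0).sum() / (y_train == 1).sum()
xgb_weighted = XGBClassifier(scale_pos_weight=ratio).fit(X_train, y_train)
```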
Unbalanced original data:

Metric | Value
---|---
Accuracy | 0.9757
TPR (recall) | 0.5896
TNR | 0.9955
FPR | 0.0045
FNR | 0.4104
Precision | 0.8718
Area under ROC | 0.7926
Area under PR | 0.5341
SMOTE:

Metric | Value
---|---
Accuracy | 0.9718
TPR (recall) | 0.6936
TNR | 0.9860
FPR | 0.0140
FNR | 0.3064
Precision | 0.7186
Area under ROC | 0.8398
Area under PR | 0.5134
Balanced class weighting:

Metric | Value
---|---
Accuracy | 0.9599
TPR (recall) | 0.8150
TNR | 0.9673
FPR | 0.0327
FNR | 0.1850
Precision | 0.5618
Area under ROC | 0.8912
Area under PR | 0.4669
Insights:
- The imbalanced model does better than it did with logistic regression - it makes more fake predictions
- Class weighting relatively poor precision but good recall/TPR
- SMOTE fairly balanced
Ideally we would do a full grid search over all parameters, but the resource requirements are too high, so we tune sequentially instead. Most parameters are tuned with a high learning rate (0.3) and a low number of estimators (100) so that each search runs in a reasonable amount of time. The last step selects the final learning rate and number of estimators.
Parameters (a tuning sketch follows this list):
- Max depth: maximum tree depth. Larger makes trees more complex, more likely to overfit
- Min child weight: minimum sum of weight needed in a child. Will stop splitting nodes if result is below this minimum
- Gamma: minimum loss reduction required for a tree split
- Subsample: percent of data sampled to grow trees at each iteration. Smaller subsamples prevent overfitting
- Colsample_bytree: percent of features used when constructing each tree
- Alpha: L1 regularization
- Lambda: L2 regularization
- Learning rate: how quickly trees learn/update in iterations.
- Number of estimators: number of trees/iterations
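A minimal sketch of the sequential tuning described above, using scikit-learn's `GridSearchCV` with the class-weighted setup from the baseline sketch (the grids and stage boundaries are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Fix a high learning rate and few estimators so each search is cheap,
# then tune parameter pairs one stage at a time.
base = dict(learning_rate=0.3, n_estimators=100, scale_pos_weight=ratio)

stage1 = GridSearchCV(
    XGBClassifier(**base),
    param_grid={"max_depth": [3, 5, 7, 9],
                "min_child_weight": [1, 3, 5]},
    scoring="roc_auc", cv=5)
stage1.fit(X_train, y_train)
base.update(stage1.best_params_)

# Subsequent stages (gamma; subsample / colsample_bytree; reg_alpha /
# reg_lambda) follow the same pattern. The final stage lowers the
# learning rate while raising the number of estimators.
stage_final = GridSearchCV(
    XGBClassifier(**{**base, "n_estimators": 1000}),
    param_grid={"learning_rate": [0.01, 0.05, 0.1, 0.3]},
    scoring="roc_auc", cv=5)
stage_final.fit(X_train, y_train)
```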
We fit models with each of the 4 scoring metrics for both the SMOTE and the class-weighted data.
Insights:
- All very similar, ROC slightly better across most metrics
- Recall results in worse precision
- Concerns about overfitting: during training, ROC was 0.999 but only 0.83 on the test data. If given new data, the model may not perform well because it is too variable.
Best: ROC
Insights:
- ROC is the most balanced overall, and very similar to precision
- Recall has much higher TPR and FPR, and low precision
- The baseline has better ROC and FNR than the tuned models, but worse precision
Best: ROC
Class weighting is better in all categories and less overfit.
Insights: XGBoost is overall better for accuracy, TNR, precision, FPR and ROC. It is worse for TPR and FNR, but by small amounts.
Best model: XGBoost, class weighting, ROC scoring
Thoughts on tradeoffs:
We originally thought we wanted to minimize FNR / maximize recall so that job seekers don't mistake a fake job for a real one. However, no model ever does a good job of truly minimizing FNR.
However, we can do a very good job of maximizing precision and minimizing FPR, so the model very rarely predicts that a real job is fake. This benefits both job seekers (they don't miss out on opportunities) and companies (their posts aren't labeled as fake).
We would need disclaimers that this does not guarantee a post is not fake; it just provides a first pass to filter some out. Users should still be vigilant.
Final model (XGBoost, class weighting, ROC scoring):

Metric | Value
---|---
Accuracy | 0.9777
TPR (recall) | 0.7168
TNR | 0.9911
FPR | 0.0089
FNR | 0.2832
Precision | 0.8052
Area under ROC | 0.8539
Area under PR | 0.5910