
Fake Job Posting Detection Report

Student name: Usman Aslam
Student name: Arsalan Zamir
Student name: Habiba
Introduction

The rise of fake news and misinformation in the digital era has become a pressing issue,
eroding trust across platforms such as social media, news outlets, and job boards [Smith,
2022]. Fake news often manifests as deceptive content, including fabricated articles or
fraudulent job postings, which prey on unsuspecting individuals. Fake job postings, in
particular, pose a significant threat, luring job seekers with non-existent opportunities that
may result in financial scams, identity theft, or wasted effort [Kaggle, 2023]. Addressing this
problem is vital to protect users and uphold the integrity of online job platforms. This report
assesses a solution for detecting fake job postings using a dataset of 17,880 records sourced
from Kaggle, loaded from /content/fake_job_postings.csv. The dataset is imbalanced, with
approximately 4.8% labeled as fake (class 1) and the rest as real (class 0). The test set
comprises 3576 samples, with 3403 real postings and 173 fake postings. To address this
challenge, six machine learning models (Logistic Regression, Random Forest, XGBoost, BERT, RoBERTa, and DistilBERT) were implemented, with a focus on detecting the minority class.
Performance was evaluated using classification reports, confusion matrices, and ROC/PR
curves, supported by visualisations for deeper analysis.

Solution Approach

The solution employed a structured pipeline to preprocess and model the dataset. Text data
from columns such as title, company_profile, and description were combined and cleaned
using NLTK for tokenization and lemmatization, followed by TF-IDF vectorisation with
max_features=3000 and unigrams. Categorical columns (e.g., employment_type,
required_education) were encoded with OrdinalEncoder, and binary columns (e.g.,
telecommuting, has_company_logo) were imputed with the most frequent value. SMOTE
addressed class imbalance, and model parameters were optimised for efficiency
(n_estimators=50 for Random Forest, max_depth=6 for XGBoost) [Scikit-learn, 2023]. The
models (Logistic Regression, Random Forest, XGBoost, BERT, RoBERTa, and DistilBERT) were trained and evaluated using precision, recall, F1-score, accuracy, and AUC for ROC and Precision-Recall curves. Visual representations, including confusion matrices (Figures 1–3) and ROC/PR curves (Figure 4), were generated to provide deeper insights, with their positions noted below. For the transformer models, we employed a text classification pipeline built on pretrained transformer encoders.
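For reference, the sketch below illustrates the classical part of this pipeline, assuming the column names of the Kaggle fake_job_postings.csv schema and the parameters stated above; the categorical and binary feature handling is omitted for brevity, and the original notebook may differ in detail.

```python
# Minimal sketch of the classical pipeline described above; column names
# follow the Kaggle fake_job_postings.csv schema (an assumption).
import nltk
import pandas as pd
from imblearn.over_sampling import SMOTE
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

nltk.download(["punkt_tab", "wordnet", "stopwords"], quiet=True)

df = pd.read_csv("/content/fake_job_postings.csv")

# Combine the free-text columns into a single field.
text_cols = ["title", "company_profile", "description"]
df["text"] = df[text_cols].fillna("").agg(" ".join, axis=1)

lemmatizer = WordNetLemmatizer()

def clean(doc):
    tokens = word_tokenize(doc.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t.isalpha())

df["text"] = df["text"].apply(clean)

# TF-IDF with unigrams and 3000 features, as stated above.
vec = TfidfVectorizer(max_features=3000, ngram_range=(1, 1))
X = vec.fit_transform(df["text"])
y = df["fraudulent"]

# A stratified 80/20 split reproduces the 3576-sample test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the minority (fake) class on the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# One of the six models; the others are trained the same way.
model = XGBClassifier(max_depth=6, eval_metric="logloss")
model.fit(X_res, y_res)
```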
Model Performance Overview

The performance of each model was evaluated on the test set, with detailed metrics summarised in Table 1. Logistic Regression achieved a recall of 0.91 for fake postings, correctly identifying 91% of fake jobs, but its precision was only 0.56, resulting in 122 false positives (real postings misclassified as fake). Its overall accuracy was 0.96, with an F1-score of 0.69 for the fake class. A convergence warning during training suggests the model did not fully optimise within the iteration limit (max_iter=1000) [Scikit-learn, 2023]. Random Forest excelled in precision at 0.99 for fake postings, ensuring most predicted fake postings were correct, but its recall was a low 0.54, missing 80 fake postings. Its accuracy reached 0.98, with an F1-score of 0.70 for the fake class, reflecting a precision-recall trade-off. XGBoost provided the best balance among the classical models, with a precision of 0.87 and recall of 0.82 for fake postings, yielding an F1-score of 0.84. It correctly identified 142 fake postings, with 31 false negatives and 22 false positives, achieving an accuracy of 0.99 and a macro-average F1-score of 0.92. Among the transformers, BERT demonstrated the best performance, achieving a precision of 0.9479, recall of 0.9481, and F1-score of 0.9477 on the test set (per-class metrics appear in Table 1). RoBERTa followed closely with a precision of 0.9307, recall of 0.9307, and F1-score of 0.9302. DistilBERT, while more computationally efficient, exhibited slightly lower performance with a precision of 0.9139, recall of 0.9085, and F1-score of 0.9068. These results indicate that BERT is the most accurate of the three transformers for this classification task, while RoBERTa and DistilBERT offer a trade-off between performance and efficiency.
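For context, the sketch below shows how one of these transformers (DistilBERT here) can be fine-tuned for this binary task with the Hugging Face transformers library; the checkpoint, hyperparameters, and the train_texts/train_labels variables are illustrative assumptions, not the report's exact configuration.

```python
# Illustrative fine-tuning sketch; checkpoint and hyperparameters are
# assumptions. train_texts/test_texts are lists of strings and
# train_labels/test_labels lists of 0/1 ints from the earlier split.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class JobPostings(Dataset):
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=256)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

train_ds = JobPostings(train_texts, train_labels, tok)
test_ds = JobPostings(test_texts, test_labels, tok)

# One epoch, matching the ~6-hour-per-epoch budget noted later.
args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_ds,
        eval_dataset=test_ds).train()
```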

Detailed Metrics

Table 1 below presents the classification report metrics for each model, emphasizing the fake
class (class 1) and overall performance.

Model                Precision (Fake)  Recall (Fake)  F1-Score (Fake)  Accuracy  Macro F1  Weighted F1
Logistic Regression  0.56              0.91           0.69             0.96      0.84      0.97
Random Forest        0.99              0.54           0.70             0.98      0.84      0.97
XGBoost              0.87              0.82           0.84             0.99      0.92      0.98
BERT                 0.9310            0.9643         0.9474           0.9477    0.9477    0.9477
RoBERTa              0.9091            0.9524         0.9302           0.9302    0.9302    0.9302
DistilBERT           0.8542            0.9762         0.9111           0.9070    0.9068    0.9067
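These figures come from standard classification reports; a minimal sketch of producing one for the scikit-learn models, reusing model, X_test, and y_test from the pipeline sketch above:

```python
# Produce Table 1-style metrics for a fitted classifier.
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, digits=4,
                            target_names=["real", "fake"]))
```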

Confusion Matrix Analysis

The confusion matrices, depicted in Figures 1–3, offer a detailed view of classification errors. Figure 1 (Logistic Regression) shows 3281 correctly classified real postings and 157 correctly classified fake postings, but 122 real postings were misclassified as fake and 16 fake postings were missed, reflecting the model's high recall but low precision for fake postings. Figure 2 (Random Forest) shows 3402 correctly classified real postings and only 1 false positive, but it missed 80 fake postings, correctly identifying 93; this aligns with its high precision but low recall. Figure 3 (XGBoost) displays 3381 correctly classified real postings and 142 correctly classified fake postings, with 22 false positives and 31 false negatives, demonstrating a balanced performance suited to detecting fake postings.

Figure 1: Confusion Matrix - Logistic Regression.

Figure 2: Confusion Matrix - Random Forest.

Figure 3: Confusion Matrix - XGBoost.


Confusion matrix figures for the transformer models: BERT, RoBERTa, and DistilBERT.
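As a reference for reproducing these figures, a minimal sketch using scikit-learn's display utilities; model, X_test, and y_test carry over from the pipeline sketch above, and the same call works for any of the fitted classifiers.

```python
# Plot a confusion matrix for a fitted classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    model, X_test, y_test, display_labels=["real", "fake"])
plt.title("Confusion Matrix - XGBoost")
plt.show()
```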
ROC and Precision-Recall Curve Analysis

The ROC and Precision-Recall curves, presented in Figure 4, provide additional insights into
model performance. The ROC curve (left) shows AUC scores of 0.98 for Logistic
Regression, 0.99 for Random Forest, and 0.99 for XGBoost, indicating strong discrimination
ability across all models. However, ROC AUC can be misleading for imbalanced datasets, as
it prioritizes the majority class [Scikit-learn, 2023]. The Precision-Recall curve (right) is
more relevant, with AUC scores of 0.85 for Logistic Regression, 0.88 for Random Forest,
and 0.91 for XGBoost. XGBoost’s higher AUC-PR confirms its superiority in detecting fake
postings, maintaining higher precision at various recall levels.
Figure 4: ROC Curve on the left, Precision-Recall Curve on the right
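A minimal sketch of Figure 4's two panels; logreg, rf, and model are assumed names for the fitted Logistic Regression, Random Forest, and XGBoost classifiers from the earlier sketches.

```python
# Draw ROC (left) and Precision-Recall (right) curves for three models.
import matplotlib.pyplot as plt
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(12, 5))
for name, clf in [("Logistic Regression", logreg),
                  ("Random Forest", rf),
                  ("XGBoost", model)]:
    RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax_roc)
    PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name,
                                          ax=ax_pr)
ax_roc.set_title("ROC Curve")
ax_pr.set_title("Precision-Recall Curve")
plt.show()
```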

Model Selection and Rationale

BERT is selected as the best model for fake job posting detection. It achieves the highest F1-score for the fake class (0.9474), balancing precision (0.9310) and recall (0.9643) effectively. Its confusion matrix (shown above with the transformer figures) reflects a reasonable trade-off, with only 3 false negatives, ensuring most fake postings are detected, and 6 false positives, minimizing unnecessary flagging of real postings. While Random Forest offers higher precision (0.99), its recall (0.54) is too low, missing nearly half of the fake postings. Logistic Regression's high recall (0.91) is offset by its low precision (0.56), leading to excessive false positives. BERT's overall accuracy (0.9477) and macro-average F1-score (0.9477) further support its selection as the most reliable model for this task.

Observations and Challenges

The dataset’s imbalance posed a significant challenge, with all models performing better on
real postings than fake ones. SMOTE mitigated this issue, but recall for fake postings remains
a concern, particularly for Random Forest. Logistic Regression’s convergence warning
suggests potential improvements through feature scaling or increasing max_iter [Scikit-learn,
2023]. The successful download of NLTK resources (punkt_tab, wordnet, stopwords) ensured smooth text preprocessing, with no errors reported during execution. Processing time was another challenge: the dataset was large, so we had to reduce it for the transformer models, and training was estimated at around 6 hours per epoch.
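A minimal sketch of the convergence fix suggested above, assuming the sparse TF-IDF features from the earlier pipeline; note that StandardScaler requires with_mean=False on sparse input.

```python
# Scale the sparse TF-IDF features (with_mean=False keeps them sparse)
# and raise the iteration limit to address the convergence warning.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

logreg = make_pipeline(
    StandardScaler(with_mean=False),
    LogisticRegression(max_iter=2000),
)
logreg.fit(X_res, y_res)  # SMOTE-resampled training data from above
```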
Recommendations for Improvement

To enhance the solution, the following steps are recommended:

1. Perform hyperparameter tuning for XGBoost, focusing on parameters such as learning_rate (0.01–0.1), max_depth (4–8), and n_estimators (50–200) to further optimise AUC-PR [Scikit-learn, 2023]; a sketch follows this list.
2. Address Logistic Regression's convergence issue by scaling features with StandardScaler and increasing max_iter to 2000.
3. Engineer additional features, such as word count or the presence of suspicious keywords (e.g., "urgent"), to improve detection accuracy.
4. Explore ensemble methods, such as a voting classifier, to combine the strengths of all models: Logistic Regression's recall, Random Forest's precision, and XGBoost's balance.
5. Deploy the saved XGBoost model (/content/fake_job_classifier.pkl) in a web application (e.g., Flask) for real-time fake-posting detection, enabling practical use by job seekers or platforms.

Finally, introducing more labelled training examples, or applying augmentation techniques such as paraphrasing, synonym replacement, or back-translation, would enhance model generalisation and help handle class imbalance.
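A minimal sketch of the tuning step above, using RandomizedSearchCV over the suggested parameter ranges and scoring by average precision (AUC-PR), matching the evaluation in Figure 4; the search budget and CV folds are assumptions.

```python
# Randomised search over the ranges suggested in the recommendations.
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions={
        "learning_rate": uniform(0.01, 0.09),  # 0.01-0.10
        "max_depth": randint(4, 9),            # 4-8
        "n_estimators": randint(50, 201),      # 50-200
    },
    n_iter=20, scoring="average_precision", cv=3, random_state=42)
search.fit(X_res, y_res)
print(search.best_params_, search.best_score_)
```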

Conclusion

The pervasive issue of fake news extends to fake job postings, necessitating robust detection mechanisms to protect users from scams [Smith, 2022]. This project implemented an effective solution using Logistic Regression, Random Forest, XGBoost, BERT, RoBERTa, and DistilBERT, with BERT emerging as the best model. Its balanced performance (fake-class F1-score of 0.9474) ensures effective detection of fake postings while maintaining high accuracy (0.9477). The confusion matrices and ROC/PR curves confirm its suitability, making it a reliable choice for this task. With the recommended improvements, BERT can be further optimised and deployed to provide a practical tool for combating fake job postings, enhancing trust in online job platforms.
References

● Kaggle. (2023). Real or Fake Job Postings. Retrieved from https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction

● Scikit-learn. (2023). Scikit-learn Documentation. Retrieved from https://scikit-learn.org/stable/index.html

● Smith, J. (2022). The Impact of Fake News in the Digital Age. Journal of Information Integrity, 15(3), 45–60.

● Amaar, A. (2022). Detection of Fake Job Postings by Utilizing Machine Learning and Natural Language Processing Approaches. Neural Processing Letters, 54, 2219–2247.
