
Fake Job Posting Detection Report

Student name: Usman Aslam
Student name: Arsalan Zamir
Student name: Habiba
Introduction

The rise of fake news and misinformation in the digital era has become a pressing issue,
eroding trust across platforms such as social media, news outlets, and job boards [Smith,
2022]. Fake news often manifests as deceptive content, including fabricated articles or
fraudulent job postings, which prey on unsuspecting individuals. Fake job postings, in
particular, pose a significant threat, luring job seekers with non-existent opportunities that
may result in financial scams, identity theft, or wasted effort [Kaggle, 2023]. Addressing this
problem is vital to protect users and uphold the integrity of online job platforms. This report
assesses a solution for detecting fake job postings using a dataset of 17,880 records sourced
from Kaggle, loaded from /content/fake_job_postings.csv. The dataset is imbalanced, with
approximately 4.8% labeled as fake (class 1) and the rest as real (class 0). The test set
comprises 3576 samples, with 3403 real postings and 173 fake postings. To address this
challenge, six machine learning models (Logistic Regression, Random Forest, XGBoost, BERT, RoBERTa, and DistilBERT) were implemented, with a focus on detecting the minority class.
Performance was evaluated using classification reports, confusion matrices, and ROC/PR
curves, supported by visualisations for deeper analysis.

Solution Approach

The solution employed a structured pipeline to preprocess and model the dataset. Text data
from columns such as title, company_profile, and description were combined and cleaned
using NLTK for tokenization and lemmatization, followed by TF-IDF vectorisation with
max_features=3000 and unigrams. Categorical columns (e.g., employment_type,
required_education) were encoded with OrdinalEncoder, and binary columns (e.g.,
telecommuting, has_company_logo) were imputed with the most frequent value. SMOTE
addressed class imbalance, and model parameters were optimised for efficiency
(n_estimators=50 for Random Forest, max_depth=6 for XGBoost) [Scikit-learn, 2023]. The
models (Logistic Regression, Random Forest, XGBoost, BERT, RoBERTa, and DistilBERT) were trained and evaluated using precision, recall, F1-score, accuracy, and AUC for ROC and Precision-Recall curves. Visual representations, including confusion matrices (Figures 1–3) and ROC/PR curves (Figure 4), were generated to provide deeper insights, with their positions noted below. For the transformer models, we employed a text classification pipeline built on pretrained transformer encoders.
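For reference, the sketch below illustrates the classical part of this pipeline, assuming the column names of the Kaggle fake_job_postings.csv schema and the parameters stated above; the categorical and binary feature handling is omitted for brevity, and the original notebook may differ in detail.

```python
# Minimal sketch of the classical pipeline described above; column names
# follow the Kaggle fake_job_postings.csv schema (an assumption).
import nltk
import pandas as pd
from imblearn.over_sampling import SMOTE
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

nltk.download(["punkt_tab", "wordnet", "stopwords"], quiet=True)

df = pd.read_csv("/content/fake_job_postings.csv")

# Combine the free-text columns into a single field.
text_cols = ["title", "company_profile", "description"]
df["text"] = df[text_cols].fillna("").agg(" ".join, axis=1)

lemmatizer = WordNetLemmatizer()

def clean(doc):
    tokens = word_tokenize(doc.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t.isalpha())

df["text"] = df["text"].apply(clean)

# TF-IDF with unigrams and 3000 features, as stated above.
vec = TfidfVectorizer(max_features=3000, ngram_range=(1, 1))
X = vec.fit_transform(df["text"])
y = df["fraudulent"]

# A stratified 80/20 split reproduces the 3576-sample test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the minority (fake) class on the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# One of the six models; the others are trained the same way.
model = XGBClassifier(max_depth=6, eval_metric="logloss")
model.fit(X_res, y_res)
```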
Model Performance Overview

The performance of each model was evaluated on the test set, with detailed metrics summarised in Table 1. Logistic Regression achieved a recall of 0.91 for fake postings, correctly identifying 91% of fake jobs, but its precision was only 0.56, resulting in 122 false positives (real postings misclassified as fake). Its overall accuracy was 0.96, with an F1-score of 0.69 for the fake class. A convergence warning during training suggests the model did not fully optimise within the iteration limit (max_iter=1000) [Scikit-learn, 2023]. Random Forest excelled in precision at 0.99 for fake postings, ensuring most predicted fake postings were correct, but its recall was a low 0.54, missing 80 fake postings. Its accuracy reached 0.98, with an F1-score of 0.70 for the fake class, reflecting a precision-recall trade-off. XGBoost provided the best balance among the classical models, with a precision of 0.87 and recall of 0.82 for fake postings, yielding an F1-score of 0.84. It correctly identified 142 fake postings, with 31 false negatives and 22 false positives, achieving an accuracy of 0.99 and a macro-average F1-score of 0.92. Among the transformers, BERT demonstrated the best performance, achieving a precision of 0.9479, recall of 0.9481, and F1-score of 0.9477 on the test set (per-class metrics appear in Table 1). RoBERTa followed closely with a precision of 0.9307, recall of 0.9307, and F1-score of 0.9302. DistilBERT, while more computationally efficient, exhibited slightly lower performance with a precision of 0.9139, recall of 0.9085, and F1-score of 0.9068. These results indicate that BERT is the most accurate of the three transformers for this classification task, while RoBERTa and DistilBERT offer a trade-off between performance and efficiency.
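For context, the sketch below shows how one of these transformers (DistilBERT here) can be fine-tuned for this binary task with the Hugging Face transformers library; the checkpoint, hyperparameters, and the train_texts/train_labels variables are illustrative assumptions, not the report's exact configuration.

```python
# Illustrative fine-tuning sketch; checkpoint and hyperparameters are
# assumptions. train_texts/test_texts are lists of strings and
# train_labels/test_labels lists of 0/1 ints from the earlier split.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class JobPostings(Dataset):
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=256)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

train_ds = JobPostings(train_texts, train_labels, tok)
test_ds = JobPostings(test_texts, test_labels, tok)

# One epoch, matching the ~6-hour-per-epoch budget noted later.
args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_ds,
        eval_dataset=test_ds).train()
```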

Detailed Metrics

Table 1 below presents the classification report metrics for each model, emphasizing the fake
class (class 1) and overall performance.

Model                Precision (Fake)  Recall (Fake)  F1-Score (Fake)  Accuracy  Macro F1  Weighted F1
Logistic Regression  0.56              0.91           0.69             0.96      0.84      0.97
Random Forest        0.99              0.54           0.70             0.98      0.84      0.97
XGBoost              0.87              0.82           0.84             0.99      0.92      0.98
BERT                 0.9310            0.9643         0.9474           0.9477    0.9477    0.9477
RoBERTa              0.9091            0.9524         0.9302           0.9302    0.9302    0.9302
DistilBERT           0.8542            0.9762         0.9111           0.9070    0.9068    0.9067
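These figures come from standard classification reports; a minimal sketch of producing one for the scikit-learn models, reusing model, X_test, and y_test from the pipeline sketch above:

```python
# Produce Table 1-style metrics for a fitted classifier.
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, digits=4,
                            target_names=["real", "fake"]))
```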

Confusion Matrix Analysis

The confusion matrices, depicted in Figures 1–3, offer a detailed view of classification errors. Figure 1 (Logistic Regression) shows 3281 correctly classified real postings and 157 correctly classified fake postings, but 122 real postings were misclassified as fake and 16 fake postings were missed, reflecting the model's high recall but low precision for fake postings. Figure 2 (Random Forest) shows 3402 correctly classified real postings and only 1 false positive, but it missed 80 fake postings, correctly identifying 93; this aligns with its high precision but low recall. Figure 3 (XGBoost) displays 3381 correctly classified real postings and 142 correctly classified fake postings, with 22 false positives and 31 false negatives, demonstrating a balanced performance suited to detecting fake postings.

Figure 1: Confusion Matrix - Logistic Regression.

Figure 2: Confusion Matrix - Random Forest.

Figure 3: Confusion Matrix - XGBoost.


Confusion matrix figures for the transformer models: BERT, RoBERTa, and DistilBERT.
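As a reference for reproducing these figures, a minimal sketch using scikit-learn's display utilities; model, X_test, and y_test carry over from the pipeline sketch above, and the same call works for any of the fitted classifiers.

```python
# Plot a confusion matrix for a fitted classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    model, X_test, y_test, display_labels=["real", "fake"])
plt.title("Confusion Matrix - XGBoost")
plt.show()
```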
ROC and Precision-Recall Curve Analysis

The ROC and Precision-Recall curves, presented in Figure 4, provide additional insights into
model performance. The ROC curve (left) shows AUC scores of 0.98 for Logistic
Regression, 0.99 for Random Forest, and 0.99 for XGBoost, indicating strong discrimination
ability across all models. However, ROC AUC can be misleading for imbalanced datasets, as
it prioritizes the majority class [Scikit-learn, 2023]. The Precision-Recall curve (right) is
more relevant, with AUC scores of 0.85 for Logistic Regression, 0.88 for Random Forest,
and 0.91 for XGBoost. XGBoost’s higher AUC-PR confirms its superiority in detecting fake
postings, maintaining higher precision at various recall levels.
Figure 4: ROC Curve on the left, Precision-Recall Curve on the right
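A minimal sketch of Figure 4's two panels; logreg, rf, and model are assumed names for the fitted Logistic Regression, Random Forest, and XGBoost classifiers from the earlier sketches.

```python
# Draw ROC (left) and Precision-Recall (right) curves for three models.
import matplotlib.pyplot as plt
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(12, 5))
for name, clf in [("Logistic Regression", logreg),
                  ("Random Forest", rf),
                  ("XGBoost", model)]:
    RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax_roc)
    PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name,
                                          ax=ax_pr)
ax_roc.set_title("ROC Curve")
ax_pr.set_title("Precision-Recall Curve")
plt.show()
```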

Model Selection and Rationale

BERT is selected as the best model for fake job posting detection. It achieves the highest F1-score for the fake class (0.9474), balancing precision (0.9310) and recall (0.9643) effectively. Its confusion matrix (shown above with the transformer figures) reflects a reasonable trade-off, with only 3 false negatives, ensuring most fake postings are detected, and 6 false positives, minimizing unnecessary flagging of real postings. While Random Forest offers higher precision (0.99), its recall (0.54) is too low, missing nearly half of the fake postings. Logistic Regression's high recall (0.91) is offset by its low precision (0.56), leading to excessive false positives. BERT's overall accuracy (0.9477) and macro-average F1-score (0.9477) further support its selection as the most reliable model for this task.

Observations and Challenges

The dataset’s imbalance posed a significant challenge, with all models performing better on
real postings than fake ones. SMOTE mitigated this issue, but recall for fake postings remains
a concern, particularly for Random Forest. Logistic Regression’s convergence warning
suggests potential improvements through feature scaling or increasing max_iter [Scikit-learn,
2023]. The successful download of NLTK resources (punkt_tab, wordnet, stopwords) ensured smooth text preprocessing, with no errors reported during execution. Processing time was another challenge: the dataset was large, so we had to reduce it for the transformer models, and training was estimated at around 6 hours per epoch.
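A minimal sketch of the convergence fix suggested above, assuming the sparse TF-IDF features from the earlier pipeline; note that StandardScaler requires with_mean=False on sparse input.

```python
# Scale the sparse TF-IDF features (with_mean=False keeps them sparse)
# and raise the iteration limit to address the convergence warning.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

logreg = make_pipeline(
    StandardScaler(with_mean=False),
    LogisticRegression(max_iter=2000),
)
logreg.fit(X_res, y_res)  # SMOTE-resampled training data from above
```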
Recommendations for Improvement

To enhance the solution, the following steps are recommended:

1. Perform hyperparameter tuning for XGBoost, focusing on parameters such as learning_rate (0.01–0.1), max_depth (4–8), and n_estimators (50–200) to further optimise AUC-PR [Scikit-learn, 2023]; a sketch follows this list.
2. Address Logistic Regression's convergence issue by scaling features with StandardScaler and increasing max_iter to 2000.
3. Engineer additional features, such as word count or the presence of suspicious keywords (e.g., "urgent"), to improve detection accuracy.
4. Explore ensemble methods, such as a voting classifier, to combine the strengths of all models: Logistic Regression's recall, Random Forest's precision, and XGBoost's balance.
5. Deploy the saved XGBoost model (/content/fake_job_classifier.pkl) in a web application (e.g., Flask) for real-time fake-posting detection, enabling practical use by job seekers or platforms.

Finally, introducing more labelled training examples, or applying augmentation techniques such as paraphrasing, synonym replacement, or back-translation, would enhance model generalisation and help handle class imbalance.
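A minimal sketch of the tuning step above, using RandomizedSearchCV over the suggested parameter ranges and scoring by average precision (AUC-PR), matching the evaluation in Figure 4; the search budget and CV folds are assumptions.

```python
# Randomised search over the ranges suggested in the recommendations.
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions={
        "learning_rate": uniform(0.01, 0.09),  # 0.01-0.10
        "max_depth": randint(4, 9),            # 4-8
        "n_estimators": randint(50, 201),      # 50-200
    },
    n_iter=20, scoring="average_precision", cv=3, random_state=42)
search.fit(X_res, y_res)
print(search.best_params_, search.best_score_)
```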

Conclusion

The pervasive issue of fake news extends to fake job postings, necessitating robust detection mechanisms to protect users from scams [Smith, 2022]. This project implemented an effective solution using Logistic Regression, Random Forest, XGBoost, BERT, RoBERTa, and DistilBERT, with BERT emerging as the best model. Its balanced performance (fake-class F1-score of 0.9474) ensures effective detection of fake postings while maintaining high accuracy (0.9477). The confusion matrices and ROC/PR curves confirm its suitability, making it a reliable choice for this task. With the recommended improvements, BERT can be further optimised and deployed to provide a practical tool for combating fake job postings, enhancing trust in online job platforms.
References

● Kaggle. (2023). Real or Fake Job Postings. Retrieved from https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction

● Scikit-learn. (2023). Scikit-learn Documentation. Retrieved from https://scikit-learn.org/stable/index.html

● Smith, J. (2022). The Impact of Fake News in the Digital Age. Journal of Information Integrity, 15(3), 45–60.

● Amaar, A. (2022). Detection of Fake Job Postings by Utilizing Machine Learning and Natural Language Processing Approaches. Neural Processing Letters, 54, 2219–2247.
