International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
Fake Job Posting Detection Using Machine Learning: A
Comparative Study
Shaik Mohammed Imran¹, Gadupudi Mokshagna²
¹B.Tech Student, Department of Computer Science and Engineering (AI & ML), Pragati Engineering College, East
Godavari, India
²B.Tech Student, Department of Computer Science and Engineering (AI & ML), Pragati Engineering College, East
Godavari, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Online job portals have become primary platforms for job
seekers, but they are increasingly targeted by fraudsters posting fake
job listings. This study presents a comprehensive comparison of three
machine learning approaches for automated fake job posting detection:
TF-IDF with Logistic Regression, XGBoost, and BERT-based models. Using
the Employment Scam Dataset from Kaggle containing 17,880 job postings,
we evaluate these models on accuracy, precision, recall, F1-score, and
computational efficiency. Our results demonstrate that XGBoost achieves
the highest accuracy of 97.2%, while TF-IDF with Logistic Regression
provides the fastest processing time suitable for real-time
applications. This research contributes to protecting job seekers from
employment scams and can be integrated into job portal platforms for
automated fraud detection.

Key Words: Fake Job Detection, Machine Learning, TF-IDF, XGBoost, BERT,
Text Classification, Employment Fraud, Natural Language Processing

1. INTRODUCTION

1.1 Problem Statement

Fake job scams are increasing rapidly across online job portals,
affecting millions of job seekers with emotional and financial
consequences. These portals process a large number of postings daily,
making manual review difficult. While machine learning has been applied
to text classification and spam detection, limited comparative studies
exist for fake job posting detection.

1.2 Research Objectives

The primary objective of this research is to evaluate both traditional
machine learning models and deep learning models for the task of fake
job posting detection. This includes a detailed performance comparison
in terms of prediction accuracy, speed, and computational efficiency.
Another key goal is to identify and analyze the most significant
features that contribute to detecting fraudulent postings, such as
missing job details or specific keyword patterns. Finally, the study
aims to propose a practical and scalable fraud detection framework that
can be integrated into real-time job portal systems to enhance user
safety and trust.

2. LITERATURE REVIEW

2.1 Related Work

TF-IDF and n-gram based models are widely used in spam filtering.
XGBoost has been successfully applied in financial fraud detection. BERT
and transformer models have shown strong performance in text
classification. However, specific studies on fake job detection using a
comparative model analysis are limited.

2.2 Research Gap

While several studies have explored the use of machine learning for
detecting fraudulent content, there is a lack of comprehensive
comparison between traditional machine learning techniques and modern
deep learning models specifically for fake job posting detection.
Existing research often overlooks the practical aspects of deployment,
including the computational requirements and real-time applicability of
these models. Additionally, there is limited investigation into which
features are most indicative of fraudulent job postings, leaving a gap
in understanding the underlying patterns that distinguish legitimate
listings from scams.
© 2025, IRJET | Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal | Page 78
3. METHODOLOGY
3.1 Dataset Description
The dataset used is the Kaggle Employment Scam Dataset
containing 17,880 job postings. It defines a binary
classification task in which about 4.8 percent of the
postings are labeled as fraudulent. Key features
include job title, description, requirements, benefits, and
company details.
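The class imbalance described above has a direct consequence for evaluation, which a short calculation makes concrete. The sketch below uses only the two figures given in this section (17,880 postings, about 4.8 percent fraudulent). The variable names are illustrative, and the counts are approximate because the fraud rate is rounded.

```python
# Approximate class counts implied by the figures in Section 3.1.
total_postings = 17_880
fraud_rate = 0.048  # "about 4.8 percent"; rounded, so counts are approximate

fraud_count = round(total_postings * fraud_rate)
legit_count = total_postings - fraud_count

# A trivial classifier that labels every posting as legitimate is correct on
# every legitimate posting and wrong on every fraudulent one.
baseline_accuracy = legit_count / total_postings
baseline_recall = 0.0  # it never flags a fraudulent posting

print(f"fraudulent: ~{fraud_count}, legitimate: ~{legit_count}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Because this baseline already reaches roughly 95 percent accuracy while detecting no fraud at all, precision, recall, and F1 score on the fraudulent class are more informative than accuracy alone.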
3.2 Data Preprocessing
Preprocessing steps include cleaning the text to remove HTML
and special characters, combining the title and description,
extracting word count features, and adding binary indicators
for missing information. The data was split into training and
testing sets using stratified sampling.
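The preprocessing steps listed above could look roughly like the following minimal sketch, which uses only the Python standard library. The function names (`clean_text`, `preprocess`) and the field names (`salary_range`, `location`, `company_profile`) are assumptions chosen for illustration, and lowercasing is an added convenience that the text does not mention. The exact cleaning rules used in the study are not specified.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")            # HTML tags such as <p> or <br/>
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # anything except letters, digits, whitespace
SPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Strip HTML tags and entities, lowercase, and drop special characters."""
    text = TAG_RE.sub(" ", raw or "")
    text = html.unescape(text).lower()
    text = NON_ALNUM_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(posting, optional_fields=("salary_range", "location", "company_profile")):
    """Turn one posting (a dict of raw fields) into text plus simple features."""
    # Combine title and description into a single text field.
    combined = f"{posting.get('title') or ''} {posting.get('description') or ''}"
    text = clean_text(combined)
    features = {"text": text, "word_count": len(text.split())}
    # Binary indicator per optional field: 1 if absent or blank, else 0.
    for field in optional_fields:
        value = posting.get(field)
        features[f"missing_{field}"] = int(not (value and str(value).strip()))
    return features


posting = {
    "title": "Data Entry Clerk",
    "description": "<p>Work from home &amp; earn $500/day!!!</p>",
    "location": "",
    "salary_range": None,
}
features = preprocess(posting)
print(features)
```

The stratified split itself is not shown here; with scikit-learn it is typically done by passing the label array to the `stratify` argument of `train_test_split`.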
3.3 Models Used
TF-IDF with Logistic Regression
TF-IDF vectors were extracted and combined with
numerical features. Logistic Regression was
applied with L2 regularization. It is fast and
interpretable.
BERT (Simplified)
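A pretrained Twitter-RoBERTa model was used to
generate text embeddings. An enhanced TF-IDF
representation with Random Forest was also tested
as a fallback. The transformer embeddings are
intended to capture deeper semantic patterns than
word-frequency features.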
XGBoost
This model builds an ensemble of decision trees
with gradient boosting. It handles complex feature
interactions, and its built-in regularization helps
limit overfitting.
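The resistance to overfitting noted above can be traced to the regularized objective that XGBoost minimizes, as formulated by Chen and Guestrin [6]. The notation below follows that paper rather than anything defined in this study, and the hyperparameter values used in the experiments are not reported here.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}
```

Here l is a differentiable loss (logistic loss would be the usual choice for a binary task such as this one), f_k is the k-th tree, T is the number of leaves in a tree, and w is its vector of leaf weights. Larger values of γ and λ penalize deeper trees and larger leaf weights, which discourages the ensemble from fitting noise in the minority class.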
3.4 Evaluation Metrics

Models were evaluated using accuracy, precision, recall, and F1 score.
Training time and prediction time were recorded. Five-fold
cross-validation was used.

4. RESULTS AND ANALYSIS

4.1 Performance Comparison

The three models were evaluated based on several performance metrics,
including accuracy, precision, recall, F1 score, training time, and
prediction time. The TF-IDF with Logistic Regression model achieved an
accuracy of 96.34 percent, a precision of 85.23 percent, a recall of
78.92 percent, and an F1 score of 81.95 percent. It was also the fastest
model, with a training time of 2.45 seconds and a prediction time of
just 0.08 seconds. XGBoost delivered the best overall performance, with
an accuracy of 97.21 percent, a precision of 88.91 percent, a recall of
84.56 percent, and an F1 score of 86.68 percent. However, it required
more training time at 15.23 seconds and had a prediction time of 0.21
seconds. The BERT-based approach (or the Enhanced TF-IDF fallback)
showed an accuracy of 95.98 percent, a precision of 82.34 percent, a
recall of 80.12 percent, and an F1 score of 81.21 percent. This model
had the highest computational cost, with a training time of 45.67
seconds and a prediction time of 1.34 seconds. These results highlight
the trade-offs between model accuracy and efficiency for different
deployment scenarios.

4.2 Key Findings

XGBoost had the highest performance overall. TF-IDF with Logistic
Regression had the lowest latency and is suitable for real-time systems.
BERT showed good results but required more resources.

4.3 Feature Importance

Important indicators include text length, missing fields like salary or
location, keywords such as "urgent" and "work from home", and incomplete
company profiles.

4.4 Error Analysis

False positives included legitimate remote jobs. Some sophisticated
scams were false negatives. The class imbalance was handled using
stratified sampling and imbalance-aware metrics (precision, recall, and
F1 score).

5. DISCUSSION

5.1 Practical Implications

XGBoost is recommended for batch fraud detection. Logistic Regression is
best for real-time filtering. A hybrid system can combine speed and
accuracy.

5.2 Limitations

This study is subject to certain limitations. The dataset used
originates from a single source, which may limit the generalizability of
the findings across different job platforms or regions. Moreover, the
analysis is restricted to English-language postings, excluding fraud
patterns that may exist in non-English job markets. Additionally, the
manual feature engineering applied to traditional machine learning
models might not capture deeper, complex patterns that automated or
neural approaches could potentially identify.

6. CONCLUSION

This study presents a detailed comparison of machine learning models for
detecting fake job postings. XGBoost showed the best performance, while
Logistic Regression was best for fast applications. The results can be
directly applied to improve trust and safety in online job portals.

7. CODE AND DATA AVAILABILITY

Code implemented in Python using scikit-learn and transformers
Dataset available publicly on Kaggle
Modular pipeline supports easy integration and testing

8. REFERENCES

[1] Kaggle, "Fake-job postings dataset", Available at:
https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction

[2] J. Friedman, "Greedy Function Approximation: A Gradient Boosting
Machine", The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.

[3] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding", arXiv
preprint, arXiv:1810.04805, 2018.

[4] F. Pedregosa et al., "Scikit-learn: Machine Learning in Python",
Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[5] Hugging Face, "Transformers Library", Available at:
https://huggingface.co/transformers

[6] T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System",
Proceedings of the 22nd ACM SIGKDD Conference, pp. 785–794, 2016.