Certificate

This is to certify that the work presented in this project report, entitled "Identifying Duplicate Questions in Q&A Communities Using Machine Learning", is the original work of Ishak Gauri, Rahul Kumar, Subhash Yadav, Mukul Sharma and Aayush Kumar Jha. The project has been carried out under my supervision and guidance as part of the requirements for the Bachelor of Technology (B. Tech) degree in Computer Science Engineering.
To the best of my knowledge, this work has not been submitted previously, in part or in full, to any other university or institution for the award of any degree or diploma. The research and implementation presented in this report adhere to academic and ethical standards and contribute to advancements in natural language processing and information retrieval.
I commend the candidates for their dedication, analytical approach, and contributions to this field of study.
Declaration

We, the undersigned, hereby declare that the project report titled "Identifying Duplicate Questions in Q&A Communities Using Machine Learning" is an original work carried out by us as part of the requirements for the Bachelor of Technology (B. Tech) degree in Computer Science Engineering at Quantum University, Roorkee, India. This project was conducted under the supervision of Asst. Prof. Amit Kumar and adheres to academic and ethical standards.
To the best of our knowledge, this work is free from any form of plagiarism and has not been submitted, in part or in full, to any other institution or university for the award of a degree or diploma. All sources of information, contributions, and references have been duly acknowledged following proper citation norms.
We take full responsibility for the authenticity and integrity of the research findings, methodologies, and conclusions presented in this report.
Submitted by:
Ishak Gauri (QID: 22030522)
Contents

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Applications
  1.4 Objectives of the Study
  1.5 Scope and Limitations
  1.6 Organization of the Thesis
2 Literature Review
  2.1 Summary of Existing Research
    2.1.1 Traditional Approaches
    2.1.2 Machine Learning Era
    2.1.3 Deep Learning Revolution
  2.2 Gaps in Current Knowledge / Existing Works
  2.3 Justification for the Proposed Research
3 Implementation Methodology
  3.1 Methodology Design
  3.2 Data Collection Methods/Sources
    3.2.1 Primary Data Sources
    3.2.2 Data Characteristics
  3.3 Tools and Techniques Used for Analysis
    3.3.1 Machine Learning Algorithms
    3.3.2 Deep Learning Approaches
    3.3.3 Feature Extraction Methods
  3.4 System Architecture
    3.4.1 Data Preprocessing Pipeline
    3.4.2 Feature Extraction Module
    3.4.3 Model Training and Evaluation
  3.5 Implementation Details and Results
    3.5.1 Logistic Regression
    3.5.2 Support Vector Machine (SVM)
    3.5.3 Extreme Gradient Boosting (XG-Boost)
    3.5.4 Siamese Manhattan LSTM (MALSTM)
    3.5.5 Results and Discussion
    3.5.6 Example Analysis
Chapter 1
Introduction
1.1 Background
Community Question Answering (CQA) platforms facilitate knowledge sharing by
allowing users to post and answer questions. However, a common challenge these
platforms face is the presence of duplicate questions—queries that have already
been asked and answered. These redundancies create noise in the system, making
it difficult for users to retrieve relevant answers efficiently.
Programming Community Question Answering (PCQA) platforms, a specialized subset of CQA, focus on programming-related queries. Unlike general CQA forums, PCQA contains not only natural language text but also programming code snippets, making duplicate detection more complex. Traditional approaches to duplicate question detection in PCQA have primarily relied on supervised learning techniques that analyse textual similarities. However, these methods often neglect the structural and semantic differences between programming code and natural language, limiting their effectiveness.
The presence of duplicate questions negatively impacts the efficiency of PCQA
platforms by increasing search complexity and leading to redundant discussions.
Existing methods predominantly extract textual features without considering programming-
specific elements, leading to suboptimal duplicate detection.
To address this gap, our research aims to introduce novel feature extraction techniques that incorporate both textual and code-related characteristics. By integrating deep learning-based continuous word representations, probabilistic models from information retrieval, and association pairs derived from machine translation, we seek to improve the accuracy of duplicate detection in PCQA platforms.
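To make the notion of continuous word representations concrete, the sketch below builds a fixed-length question vector by averaging pre-trained GloVe embeddings. The file name, embedding dimension, and helper functions are illustrative assumptions, not the exact feature pipeline used in this work.

```python
# Sketch: represent a question as the average of pre-trained GloVe vectors.
# The file path and 100-dimensional embeddings are illustrative assumptions.
import numpy as np

def load_glove(path: str) -> dict[str, np.ndarray]:
    """Parse a GloVe text file: one word followed by its vector per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def question_vector(tokens: list[str], vectors: dict[str, np.ndarray],
                    dim: int = 100) -> np.ndarray:
    """Average the embeddings of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim, dtype=np.float32)

# Example usage (assumes glove.6B.100d.txt has been downloaded locally):
# glove = load_glove("glove.6B.100d.txt")
# vec = question_vector(["reverse", "string", "python"], glove)
```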
1.2 Motivation
The exponential growth of online Q&A platforms has created an unprecedented volume of user-generated content. While this democratization of knowledge sharing has numerous benefits, it has also introduced significant challenges in content management and information retrieval. The primary motivation for this research stems from several critical issues:
1.3 Applications
The proposed approach can be utilized in various domains, including:

1.5 Scope and Limitations
• The study primarily considers English-based PCQA platforms and does not address multilingual challenges
• Computational complexity and scalability issues for very large datasets are not extensively addressed
• The study focuses on textual similarity and does not consider user behavior patterns or temporal factors
Chapter 2
Literature Review
• Data scarcity and annotation costs: Deep learning models require large-
scale labelled datasets, which are often expensive and time-consuming to
annotate manually
Chapter 3
Implementation Methodology
4. Stop Word Removal: Filtering out common words that don't contribute to meaning (see the sketch below)
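A minimal sketch of this cleaning step is shown below; it assumes NLTK's English stop-word list, which may differ from the exact list used in this work.

```python
# Minimal text-cleaning sketch: lowercasing, punctuation stripping, and
# stop word removal. Assumes NLTK's English stop-word list; the exact
# preprocessing used in this report may differ.
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def preprocess(question: str) -> list[str]:
    question = question.lower()
    question = re.sub(r"[^a-z0-9\s]", " ", question)  # drop punctuation/symbols
    tokens = question.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("How do I reverse a string in Python?"))
# ['reverse', 'string', 'python']
```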
Performance metrics of the evaluated models:

Model                                    Accuracy   Precision   Recall   F1-Score
Logistic Regression                      78.2%      0.76        0.74     0.75
Support Vector Machine (SVM)             81.5%      0.79        0.78     0.785
Extreme Gradient Boosting (XG-Boost)     84.3%      0.82        0.81     0.815
Siamese Manhattan LSTM (MALSTM)          87.1%      0.85        0.84     0.845
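For the best-performing configuration, the following is a minimal sketch of a Siamese Manhattan LSTM in Keras: both questions pass through a shared embedding and LSTM, and similarity is scored as exp(-L1 distance) between the two encodings. The vocabulary size, sequence length, embedding dimension, and LSTM units are illustrative assumptions rather than the exact settings used in this work.

```python
# Minimal Siamese Manhattan LSTM (MaLSTM) sketch in TensorFlow/Keras.
# Hyperparameters below are illustrative assumptions, not the exact
# values used in this report.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 30         # assumed maximum question length (tokens)
EMB_DIM = 100        # assumed embedding dimension (e.g., GloVe 100d)
LSTM_UNITS = 50      # assumed number of LSTM units

# Shared layers: both questions use the same embedding and LSTM weights.
embedding = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)
encoder = layers.LSTM(LSTM_UNITS)

q1_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="question1")
q2_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="question2")

h1 = encoder(embedding(q1_in))
h2 = encoder(embedding(q2_in))

# Manhattan similarity: exp(-L1 distance), mapped into (0, 1].
similarity = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True))
)([h1, h2])

model = Model(inputs=[q1_in, q2_in], outputs=similarity)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```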
The class-distribution graph shows that unique questions outnumber repeated (duplicate) questions, which reflects the natural distribution on Q&A platforms.
The example shows how the system successfully identifies duplicate questions even
when they are phrased differently, demonstrating the effectiveness of semantic
similarity detection.
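As a toy illustration of scoring a differently-phrased question pair, the sketch below uses a simple TF-IDF cosine-similarity baseline. The example questions and the 0.8 decision threshold are hypothetical, and the actual system relies on the trained models described above rather than this lexical measure.

```python
# Toy illustration of scoring a question pair for near-duplication.
# The example questions and the 0.8 threshold are hypothetical; the actual
# system uses the trained models described above, not this lexical baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

q1 = "How do I reverse a string in Python?"
q2 = "What is the best way to reverse a string using Python?"

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([q1, q2])            # 2 x V sparse matrix
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]   # similarity in [0, 1]

print(f"similarity = {score:.3f}")
print("duplicate" if score >= 0.8 else "not duplicate")
```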
Chapter 4
Expected Outcomes and Significance
• Active Learning: Implementing systems that can learn from user feedback
to improve detection accuracy
4.4 Conclusion
This research presents a comprehensive approach to duplicate question detection in Q&A communities using machine learning techniques. Through systematic evaluation of multiple algorithms and feature extraction methods, we have demonstrated the effectiveness of combining traditional machine learning with modern deep learning approaches.
The Siamese Manhattan LSTM model achieved the highest performance with
87.1% accuracy, followed by XG-Boost with 84.3% accuracy. These results indicate
that deep learning models, particularly those designed for similarity detection, are
well-suited for this task.
The developed system provides a practical solution for Q&A platforms to automatically identify and manage duplicate questions, thereby improving user experience and platform efficiency. The comprehensive feature engineering approach and systematic evaluation methodology contribute to the broader field of natural language processing and information retrieval.
Future work will focus on extending the approach to handle multilingual content, improving real-time performance, and developing more interpretable models for better user trust and adoption.