DON BOSCO INSTITUTE OF TECHNOLOGY
Premier Automobiles Road, Kurla (W), Mumbai - 400070
Department of Computer Engineering
(Session 2025-2026 Odd)
MINI PROJECT PROPOSAL
on
“Multi-label Text Classification”
Subject : Natural Language Processing
Semester : VII (BE Computers)
Subject In-Charge : Ms. Pooja Bansode
Group Members:
Name Roll No.
Multi-label Text Classification
1) Abstract:
As the volume of available data keeps growing, there is a pressing need to
organize it. Many modern classification problems require predicting several
labels for a single instance at once. This task, known as multi-label
classification, arises in many real-world problems: each sample is assigned a
set of target labels, which can be thought of as predicting properties of a
data point that are not mutually exclusive. Almost every developer, engineer,
or student has used Stack Overflow at some point. Widely considered one of the
largest and most trusted websites for developers to learn and share their
knowledge, it presently hosts in excess of 10,000,000 questions. In this
project we try to predict a question's tags from the text of the question
asked on Stack Overflow. The most common tags on Stack Overflow include Java,
JavaScript, C#, PHP, and Android, among others.
2) Design/Workflow Diagram:
3) Algorithms/Methodology Used:
Data cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset. When
combining multiple data sources, there are many opportunities for data to be
duplicated or mislabeled. If data is incorrect, outcomes and algorithms are
unreliable, even though they may look correct.
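As a rough illustration of the cleaning step, the sketch below normalizes a question body and drops empty or duplicate rows. The function names (`clean_question`, `clean_dataset`) and the exact rules (strip HTML tags, lowercase, keep characters such as `#` and `+` so that tags like C# survive) are assumptions made for this example, not a fixed part of the project.

```python
import html
import re


def clean_question(text: str) -> str:
    """Normalize one raw question body into lowercase plain text."""
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags such as <p>
    text = html.unescape(text)                    # "&amp;" -> "&"
    text = text.lower()
    text = re.sub(r"[^a-z0-9#+.\s-]", " ", text)  # keep letters, digits, # + . -
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace


def clean_dataset(rows):
    """Clean (text, tags) pairs; skip empty texts, untagged rows and duplicates."""
    seen = set()
    cleaned = []
    for text, tags in rows:
        body = clean_question(text)
        if not body or not tags or body in seen:
            continue
        seen.add(body)
        cleaned.append((body, sorted(set(tags))))
    return cleaned
```

Duplicates are detected on the cleaned text, so two rows that differ only in markup or spacing count as the same question.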
TF-IDF
TF-IDF, short for term frequency–inverse document frequency, is a
numerical statistic intended to reflect how important a word is to a
document in a collection or corpus.[1] It is often used as a weighting
factor in information retrieval, text mining, and user modeling. The
tf–idf value increases proportionally to the number of times a word
appears in the document and is offset by the number of documents in the
corpus that contain the word, which helps to adjust for the fact that some
words appear more frequently in general. Variations of the tf–idf weighting
scheme are often used by search engines as a central tool in scoring and
ranking a document's relevance given a user query. tf–idf can also be
used successfully for stop-word filtering in various subject fields,
including text summarization and classification.
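The weighting described above can be sketched in a few lines. Many tf–idf variants exist; this example assumes the plain form tf = count / document length and idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the pre-tokenized input are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = count / len(doc); idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

With this variant a word that occurs in every document gets weight zero, which is the behaviour that makes tf–idf usable for filtering very common words.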
Logistic regression
Logistic regression is a classification algorithm used when the target
variable is categorical. It is most commonly used when the output is binary,
that is, when each sample belongs to one of two classes (labelled 0 or 1).
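Because logistic regression is binary, one common way to apply it to several tags is to train one independent classifier per tag (often called one-vs-rest or binary relevance). The sketch below assumes that approach and uses plain batch gradient descent. All names, the learning rate, the epoch count, and the 0.5 threshold are illustrative choices rather than the project's settled design.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(X, y, lr=0.5, epochs=1000):
    """Fit weights and bias for one binary label with batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: p(y=1 | x) - y
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step along the negative average gradient of the log loss.
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def train_one_vs_rest(X, Y):
    """Train one binary classifier per label column of the 0/1 matrix Y."""
    n_labels = len(Y[0])
    return [train_logistic(X, [row[k] for row in Y]) for k in range(n_labels)]


def predict_labels(models, x, threshold=0.5):
    """Return a 0/1 decision for every label."""
    return [
        int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= threshold)
        for w, b in models
    ]
```

Each row of `X` would be a feature vector for one question (for example its tf–idf weights) and each row of `Y` the 0/1 indicator of its tags. Each tag gets its own decision, so a question can receive several tags or none.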
SVM
A support vector machine (SVM) is a supervised machine learning model
for two-group classification problems. After being given sets of labeled
training data for each category, an SVM model is able to categorize new
text.
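A minimal sketch of a linear SVM for a single tag is given below, trained by subgradient descent on the L2-regularized hinge loss. This is one simple way to fit a linear SVM, chosen here for brevity; the function names and the values of `lr`, `reg`, and `epochs` are assumptions. Labels are +1 (tag present) and -1 (tag absent), and for multiple tags one such model would be trained per tag.

```python
def train_linear_svm(X, y, lr=0.1, reg=0.01, epochs=200):
    """Fit a linear SVM for one tag; y holds +1 (present) or -1 (absent)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wj - lr * reg * wj for wj in w]
            if margin < 1:
                # Inside the margin or misclassified: hinge-loss subgradient step.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def svm_predict(model, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Only samples whose margin is below 1 trigger an update, so the final boundary is determined by the points closest to it.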
4) Possible input and expected outcome:
We will develop a text classification model that takes the textual
description of a question as input and predicts the multiple labels (tags)
associated with that question as output.
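To make the output format concrete, the sketch below shows one possible way to represent a question's tags as a 0/1 vector for training and to turn a predicted vector back into tag names. The helper names and the sample tags are hypothetical.

```python
def build_tag_index(tag_sets):
    """Map every tag seen in the data to a fixed column position."""
    tags = sorted({tag for tags in tag_sets for tag in tags})
    return {tag: i for i, tag in enumerate(tags)}


def encode_tags(tags, tag_index):
    """Turn a list of tags into a 0/1 vector with one column per known tag."""
    vec = [0] * len(tag_index)
    for tag in tags:
        vec[tag_index[tag]] = 1
    return vec


def decode_tags(vec, tag_index):
    """Turn a predicted 0/1 vector back into tag names."""
    return [tag for tag, i in tag_index.items() if vec[i] == 1]


# Hypothetical example: three questions and their tags.
tag_sets = [["java", "android"], ["python"], ["java"]]
tag_index = build_tag_index(tag_sets)
targets = [encode_tags(tags, tag_index) for tags in tag_sets]
```

In this sketch `encode_tags` raises a `KeyError` for a tag that was not present when the index was built, so the index should be built from the full training set.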
5) References:
[1] https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5
[2] https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff