Department of Electronics & Communication Engineering MANIT Bhopal
Group No. : 24
Introduction:
Author identification of a text is a real problem encountered across many walks of
life. Grading an individual's work is a key component of distance education, yet it
is difficult for institutions to verify that the person who submitted the work is the
person enrolled. Academics continually check for plagiarism, while scholars often
uncover unattributed texts whose authors need to be identified. It is for these
reasons that this field of machine learning is particularly appealing.
Text mining is an emerging field that attempts to extract useful information from
natural language text. It may be defined as the process of analyzing text to extract
information that is useful for a particular purpose. Unlike the data stored in
databases, text is unstructured and ambiguous, and is therefore algorithmically
complex to deal with.
A standard approach to plagiarism detection first defines a set of style markers
and then either counts these markers manually in the text under study or finds
tools that can provide these counts accurately.
The initial approach was based on a Bayesian statistical analysis of the occurrences
of prepositions and conjunctions such as 'or', 'to', and 'and', which gives a way to
differentiate the identity of each author.
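As a concrete illustration, the sketch below counts how often a small set of function words occurs per thousand tokens, the raw input such a Bayesian analysis would work from. The word list is an illustrative assumption, not the full set used in the original studies.

```python
from collections import Counter

# Illustrative subset of function words; a real study would use a larger list.
FUNCTION_WORDS = ["and", "or", "to", "but", "of", "in", "for"]

def function_word_rates(text):
    """Occurrences of each function word per 1000 tokens."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = len(tokens) or 1
    return {w: 1000 * counts[w] / total for w in FUNCTION_WORDS}

print(function_word_rates("To be or not to be, that is the question."))
```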
Later approaches included the "Type-Token Ratio", which was traditionally used
for small datasets. Analysts have estimated the average repetition of words to be
around 40% of the original word count. Here, 'types' denotes the number of
unique words in the text and 'tokens' denotes the total number of words.
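The ratio itself is simple to compute; a minimal sketch using naive whitespace tokenization:

```python
# Type-Token Ratio: unique words (types) divided by total words (tokens).
def type_token_ratio(text):
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "the quick brown fox jumps over the lazy dog the fox"
print(type_token_ratio(sample))  # 8 types / 11 tokens ~= 0.727
```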
The more recent technique, and the basis of our attempt, is the "CUSUM"
technique, which assumes that every author tends to use a consistent set of words
in his or her writing style. The proper nouns in the text are treated as a single
symbol, the average sentence length is calculated, and each sentence of the
document is marked '+' or '-' relative to that average.
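A rough sketch of the sentence-length marking just described, using a naive full-stop sentence splitter (nltk.sent_tokenize would be more robust); the proper-noun substitution step is omitted here:

```python
# Mark each sentence '+' or '-' depending on whether its length is
# at or above the document's average sentence length.
def cusum_marks(text):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    avg = sum(lengths) / len(lengths)
    return ['+' if n >= avg else '-' for n in lengths], avg

marks, avg = cusum_marks("Short one. This sentence is quite a bit longer than the first. Tiny.")
print(avg, marks)  # average sentence length and one mark per sentence
```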
NLTK is a well-known platform for writing Python programs to work with human
language data. It provides user-friendly interfaces to over 50 lexical resources such
as WordNet, along with a suite of text processing libraries for classification,
tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for
industrial-strength NLP libraries, and an active discussion forum.
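The NLTK calls this project relies on look roughly as follows; the example sentence is a placeholder, and the two model downloads are one-time steps:

```python
import nltk

# One-time downloads of the tokenizer and POS-tagger models.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "The editorial criticised the new policy and its implementation."
tokens = nltk.word_tokenize(text)
tags = nltk.pos_tag(tokens)   # list of (word, POS-tag) pairs
print(tags)
```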
Our approach consists of the following four steps (a rough end-to-end sketch follows the list):
1. Data Collection
2. Data Cleaning and Preprocessing
3. Feature Extraction
4. Classification Methodology
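Under the assumption that labelled editorials are available as (text, publication) pairs, a scikit-learn sketch of the whole pipeline might look like this; plain word counts stand in for the POS-based features described later, and the Random Forest is one of the classifiers used in this project:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Placeholder training data: (editorial text, publication-house label) pairs.
texts = ["editorial text one ...", "editorial text two ..."]
labels = ["HouseA", "HouseB"]

# Steps 2-4: cleaning is folded into the vectorizer's lowercasing here.
model = make_pipeline(CountVectorizer(lowercase=True), RandomForestClassifier())
model.fit(texts, labels)
print(model.predict(["another unseen editorial ..."]))
```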
Expected Outcome:
In this project, we aim to contribute to this area of text analytics. Building on the
existing body of work, we train learning algorithms to identify patterns in the
editorial pieces published by a news publication house every day, and then attribute
any given editorial to its publishing house. Just as different authors have distinct
writing styles, which makes author identification an achievable task, the same idea
can be applied to associate a given text article with a publication house.
We therefore try to exploit these observations to extract features for our
analysis.
Data from different publication houses will be collected for training and testing our
model. Once the data has been collected from different sources in raw form, it
cannot be used directly by the algorithm and must first be cleaned.
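As an illustration only, collection from a publication's website might look like the following; the URL and the paragraph-based extraction are hypothetical placeholders, and each publication house would need its own scraping rules:

```python
import requests
from bs4 import BeautifulSoup

def fetch_editorial(url):
    """Download a page and return the raw text of its paragraphs."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    return "\n".join(p.get_text() for p in soup.find_all('p'))

# raw = fetch_editorial("https://example.com/editorial/2019/01/01")  # placeholder URL
```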
Data cleaning and pre-processing will involve processing the scraped data to
replace all special characters with spaces.
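A minimal sketch of that replacement with a regular expression:

```python
import re

def clean_text(raw):
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', raw)   # special characters -> spaces
    return re.sub(r'\s+', ' ', text).strip()     # collapse repeated whitespace

print(clean_text("Hello, world! (Testing: 1-2-3)"))  # "Hello world Testing 1 2 3"
```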
The feature extraction step will use different parts of speech, such as pronouns,
conjunctions, and prepositions, to determine the writing style and its uniqueness.
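A sketch of such features using NLTK's Penn Treebank tags (PRP/PRP$ for pronouns, CC for coordinating conjunctions, IN for prepositions); it assumes the tagger models from the earlier NLTK example are already downloaded:

```python
import nltk

def pos_features(text):
    """Fraction of pronouns, conjunctions, and prepositions in a document."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    total = len(tags) or 1
    return {
        'pronoun_ratio': sum(t in ('PRP', 'PRP$') for t in tags) / total,
        'conjunction_ratio': sum(t == 'CC' for t in tags) / total,
        'preposition_ratio': sum(t == 'IN' for t in tags) / total,
    }

print(pos_features("She wrote to him and he replied, but the letter was lost in the post."))
```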
Group Member: Diksha Sharma (161114106), Random Forest Classifier; Report and PPT.