
Department of Electronics & Communication Engineering

MANIT Bhopal

Major Project Proposal Document

Group No. : 24

Project Title: Identification of Publication House of any Article using ML

S.No. Scholar No. Name Signature

1 161114094 Shaily Jaloree

2 161114084 Sulabh Jain

3 161114075 Akshay Shrivastava

4 161114106 Diksha Sharma

5 161114115 Deepali Rawat


Name & Signature of faculty mentor: Dr. R.K. Baghel
Project Title: Identification of Publication House of any Article using ML

Project Category: Software

Introduction:
Author identification of a text is a real problem encountered across many walks of life. Grading of an individual's work is a key element of distance education, yet it is difficult for institutions to verify whether the person who submitted the work is the person enrolled. Academics continuously check for plagiarism, while scholars often recover unattributed texts whose authors need to be identified. It is for these reasons that this field of machine learning is particularly appealing.

Text mining is a relatively new field that attempts to extract useful information from natural language text. It may be described as the process of analysing text to extract information that is useful for a particular purpose. Unlike data stored in databases, text is unstructured and ambiguous, and is therefore algorithmically complex to deal with.

A standard approach to plagiarism detection first defines a set of style markers and then either counts these markers manually in the text under study or uses tools that can provide the counts accurately.

It is difficult for institutions to detect plagiarism when assessing answers submitted by students; hence the need for author identification. It also plays an important role in e-mails, short text messages and many other domains where duplication of an author's identity must be avoided.

The initial approach was based on Bayesian statistical analysis of the occurrences of prepositions and conjunctions such as 'or', 'to' and 'and', thereby giving a way to differentiate the identity of each author.
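As a minimal sketch of this idea (assuming scikit-learn is available; the texts and labels below are hypothetical placeholders, not project data), counts of a fixed set of function words can be fed to a Naive Bayes model:

    # Sketch: frequencies of function words ('and', 'or', 'to', ...) as features
    # for a Naive Bayes model. Texts and labels are hypothetical placeholders.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    FUNCTION_WORDS = ["and", "or", "to", "of", "in", "but", "for", "with"]

    texts = ["the board met and agreed to the plan ...",
             "markets rallied, but investors remain wary of risk ..."]
    labels = ["PaperA", "PaperB"]

    # Restrict the vocabulary to the chosen function words only.
    vectorizer = CountVectorizer(vocabulary=FUNCTION_WORDS)
    X = vectorizer.fit_transform(texts)

    model = MultinomialNB()
    model.fit(X, labels)

    print(model.predict(vectorizer.transform(["the committee agreed to meet and vote ..."])))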

Later approaches included the "Type-Token Ratio", which was traditionally used for small datasets. Analysts have estimated the average repetition of words to be around 40% of the original word count. Here 'types' refers to the number of unique words in the text and 'tokens' to the total number of words.
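For illustration, a small sketch of the type-token ratio computed over a naively tokenized text (pure Python, no external libraries):

    # Sketch: type-token ratio = unique words (types) / total words (tokens).
    def type_token_ratio(text):
        tokens = text.lower().split()      # naive whitespace tokenization
        types = set(tokens)                # unique word forms
        return len(types) / len(tokens) if tokens else 0.0

    sample = "the quick brown fox jumps over the lazy dog the fox"
    print(type_token_ratio(sample))        # 8 types / 11 tokens ~ 0.73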

A more recent technique, and the basis of our attempt, is the "CUSUM" technique, which assumes that every author tends to use a characteristic set of words in his or her writing style. The proper nouns in the text are treated as a single symbol, the average sentence length is calculated, and each sentence of the document is marked '+' or '-' according to whether it lies above or below that average.
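A rough sketch of the sentence-length marking step described above (the full CUSUM technique also tracks cumulative sums of habit-word counts; only the '+'/'-' marking is shown here, assuming NLTK for tokenization):

    # Sketch: mark each sentence '+' or '-' depending on whether its length is
    # above or below the document's average sentence length.
    import nltk
    nltk.download("punkt", quiet=True)

    def mark_sentences(document):
        sentences = nltk.sent_tokenize(document)
        lengths = [len(nltk.word_tokenize(s)) for s in sentences]
        average = sum(lengths) / len(lengths)
        return ["+" if n >= average else "-" for n in lengths]

    doc = "Short sentence. This one is considerably longer than the first sentence. Tiny."
    print(mark_sentences(doc))             # e.g. ['-', '+', '-']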

NLTK is a well-known platform for writing Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
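A brief example of the NLTK operations we expect to rely on (tokenization and part-of-speech tagging; a sketch, assuming the standard NLTK resources have been downloaded):

    # Sketch: tokenization and POS tagging with NLTK.
    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentence = "The editorial criticised the new policy in strong terms."
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)          # [('The', 'DT'), ('editorial', 'NN'), ...]
    print(tagged)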

Applications of Text Categorization are numerous. Similar techniques are also being used at sentence level rather than document level for word sense disambiguation.

The construction of the models can be divided into the following stages:

1. Data Collection
2. Data Cleaning and Preprocessing
3. Feature Extraction
4. Classification Methodology
Expected Outcome:
In this project we aim to contribute to this area of text analytics. Building on the existing body of work, we will train learning algorithms to identify patterns in the editorial pieces published by news publication houses every day and then attribute any given editorial to its publishing house. Just as different authors have distinct writing styles, which makes author identification an achievable task, the same idea can be applied to associating a given text article with a publication house.

Every newspaper editorial is written and supervised by a team of editors. The characteristic traits of an editorial are decided by its supervisor (author); identifying these traits is therefore akin to author identification.

Each publication aims at a specific demographic: "The Hindu", for example, is generally aimed at avid readers and intellectuals and therefore tends to use richer vocabulary, whereas "TOI" targets a more general readership and tends to use simpler vocabulary.

Therefore, we try to exploit the above two points to extract features for our
analysis.

Data from different publication houses will be collected for training and testing our model. Once the data has been collected from the different sources, it will be in raw form and will be cleaned and processed before being fed to the algorithms.

Data cleaning and pre-processing will involve scrubbing the raw data to replace all special characters with spaces.
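A minimal sketch of this cleaning step using regular expressions (the exact set of characters to keep is an assumption):

    # Sketch: replace every character other than letters, digits and whitespace
    # with a space, then collapse repeated spaces.
    import re

    def clean_text(raw):
        text = re.sub(r"[^A-Za-z0-9\s]", " ", raw)   # special characters -> space
        return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

    print(clean_text("Govt.'s new policy -- a \"bold\" move?"))
    # -> "Govt s new policy a bold move"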

The feature extraction step will make use of different parts of speech, such as pronouns, conjunctions and prepositions, to characterise the writing style and its uniqueness.
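As a sketch, per-article counts of pronouns, conjunctions and prepositions could be turned into a feature vector as below (the tag groupings are an assumption based on the Penn Treebank tagset used by NLTK's tagger):

    # Sketch: per-article feature vector of pronoun / conjunction / preposition
    # frequencies, normalised by article length.
    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    POS_GROUPS = {
        "pronoun": {"PRP", "PRP$", "WP", "WP$"},
        "conjunction": {"CC", "IN"},   # IN covers subordinating conjunctions too
        "preposition": {"IN", "TO"},
    }

    def pos_features(article):
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(article))]
        total = max(len(tags), 1)
        return [sum(t in group for t in tags) / total for group in POS_GROUPS.values()]

    print(pos_features("She wrote to the editor, and he replied within a week."))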

Using the matplotlib library, we will generate confusion matrices for the different classifiers. By examining these matrices we can study the percentage of text articles classified correctly for each newspaper.
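A sketch of how the classifier comparison and confusion matrices could look with scikit-learn and matplotlib (the feature matrix X and label vector y below are random placeholders standing in for the output of the earlier steps):

    # Sketch: train the candidate classifiers and plot a confusion matrix for each.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.linear_model import Perceptron
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import ConfusionMatrixDisplay

    X = np.random.rand(40, 3)                              # placeholder features
    y = np.random.choice(["The Hindu", "TOI"], size=40)    # placeholder labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    classifiers = {
        "AdaBoost": AdaBoostClassifier(),
        "Perceptron": Perceptron(),
        "Random Forest": RandomForestClassifier(),
        "Bernoulli Naive Bayes": BernoulliNB(),
    }

    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
        plt.title(name)
        plt.show()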
Project Time Schedule:
• Project Proposal Submission – October 9th, 2019
• Data Collection – October 2019
• Data Cleaning & Pre-processing – November 2019
• Design of Model – January 2020
• Training & Testing – February 2020
• Implementation – March 2020
• Final Realization & Presentation – April 2020

Individual Student Responsibility:

Scholar No.   Name                  Responsibility

161114094     Shaily Jaloree        Data Cleaning & Pre-processing

161114084     Sulabh Jain           AdaBoost Classifier & Data Collection and Extraction

161114075     Akshay Shrivastava    Perceptron Classifier & Testing

161114106     Diksha Sharma         Random Forest Classifier & Report and PPT

161114115     Deepali Rawat         Bernoulli Naïve Bayes Classifier & Training
