Department of Electronics & Communication Engineering MANIT Bhopal
Group No. : 24
Introduction:
Author identification of a text is a real problem encountered across many walks of
life. Grading an individual's work is a key component of distance education, yet it
is difficult for institutions to verify that the person who submitted the work is the
person enrolled. Academics continually check for plagiarism, while scholars often
uncover unattributed texts whose authors need to be identified. It is for these
reasons that this field of machine learning is particularly appealing.
Text mining is an emerging field that attempts to extract useful information from
natural language text. It may be defined as the process of analyzing text to extract
information that is useful for a particular purpose. Unlike the data stored in
databases, text is unstructured and ambiguous, and is therefore algorithmically
complex to deal with.
A standard approach to plagiarism detection first defines a set of style markers
and then either counts these markers manually in the text under study or finds
tools that can provide these counts accurately.
The initial approach was based on a Bayesian statistical analysis of the occurrences
of prepositions and conjunctions such as 'or', 'to', and 'and', which gives a way to
differentiate the identity of each author.
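As a concrete illustration, the sketch below counts how often a small set of function words occurs per thousand tokens, the raw input such a Bayesian analysis would work from. The word list is an illustrative assumption, not the full set used in the original studies.

```python
from collections import Counter

# Illustrative subset of function words; a real study would use a larger list.
FUNCTION_WORDS = ["and", "or", "to", "but", "of", "in", "for"]

def function_word_rates(text):
    """Occurrences of each function word per 1000 tokens."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = len(tokens) or 1
    return {w: 1000 * counts[w] / total for w in FUNCTION_WORDS}

print(function_word_rates("To be or not to be, that is the question."))
```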
Later approaches included the "Type-Token Ratio", which was traditionally used
for small datasets. Analysts have estimated the average repetition of words to be
around 40% of the original word count. Here, 'types' denotes the number of
unique words in the text and 'tokens' denotes the total number of words.
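The ratio itself is simple to compute; a minimal sketch using naive whitespace tokenization:

```python
# Type-Token Ratio: unique words (types) divided by total words (tokens).
def type_token_ratio(text):
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "the quick brown fox jumps over the lazy dog the fox"
print(type_token_ratio(sample))  # 8 types / 11 tokens ~= 0.727
```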
The more recent technique, and the basis of our attempt, is the "CUSUM"
technique, which assumes that every author tends to use a consistent set of words
in his or her writing style. The proper nouns in the text are treated as a single
symbol, the average sentence length is calculated, and each sentence of the
document is marked '+' or '-' relative to that average.
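A rough sketch of the sentence-length marking just described, using a naive full-stop sentence splitter (nltk.sent_tokenize would be more robust); the proper-noun substitution step is omitted here:

```python
# Mark each sentence '+' or '-' depending on whether its length is
# at or above the document's average sentence length.
def cusum_marks(text):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    avg = sum(lengths) / len(lengths)
    return ['+' if n >= avg else '-' for n in lengths], avg

marks, avg = cusum_marks("Short one. This sentence is quite a bit longer than the first. Tiny.")
print(avg, marks)  # average sentence length and one mark per sentence
```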
NLTK is a well-known platform for writing Python programs to work with human
language data. It provides user-friendly interfaces to over 50 lexical resources such
as WordNet, along with a suite of text processing libraries for classification,
tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for
industrial-strength NLP libraries, and an active discussion forum.
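The NLTK calls this project relies on look roughly as follows; the example sentence is a placeholder, and the two model downloads are one-time steps:

```python
import nltk

# One-time downloads of the tokenizer and POS-tagger models.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "The editorial criticised the new policy and its implementation."
tokens = nltk.word_tokenize(text)
tags = nltk.pos_tag(tokens)   # list of (word, POS-tag) pairs
print(tags)
```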
Our approach consists of the following four steps (a rough end-to-end sketch follows the list):
1. Data Collection
2. Data Cleaning and Preprocessing
3. Feature Extraction
4. Classification Methodology
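Under the assumption that labelled editorials are available as (text, publication) pairs, a scikit-learn sketch of the whole pipeline might look like this; plain word counts stand in for the POS-based features described later, and the Random Forest is one of the classifiers used in this project:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Placeholder training data: (editorial text, publication-house label) pairs.
texts = ["editorial text one ...", "editorial text two ..."]
labels = ["HouseA", "HouseB"]

# Steps 2-4: cleaning is folded into the vectorizer's lowercasing here.
model = make_pipeline(CountVectorizer(lowercase=True), RandomForestClassifier())
model.fit(texts, labels)
print(model.predict(["another unseen editorial ..."]))
```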
Expected Outcome:
In this project, we aim to contribute to this area of text analytics. Building on the
existing body of work, we train learning algorithms to identify patterns in the
editorial pieces published by a news publication house every day, and then attribute
any given editorial to its publishing house. Just as different authors have distinct
writing styles, which makes author identification an achievable task, the same idea
can be applied to associate a given text article with a publication house.
We therefore try to exploit these observations to extract features for our
analysis.
Data from different publication houses will be collected for training and testing our
model. Once the data has been collected from different sources in raw form, it
cannot be used directly by the algorithm and must first be cleaned.
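As an illustration only, collection from a publication's website might look like the following; the URL and the paragraph-based extraction are hypothetical placeholders, and each publication house would need its own scraping rules:

```python
import requests
from bs4 import BeautifulSoup

def fetch_editorial(url):
    """Download a page and return the raw text of its paragraphs."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    return "\n".join(p.get_text() for p in soup.find_all('p'))

# raw = fetch_editorial("https://example.com/editorial/2019/01/01")  # placeholder URL
```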
Data cleaning and pre-processing will involve processing the scraped data to
replace all special characters with spaces.
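A minimal sketch of that replacement with a regular expression:

```python
import re

def clean_text(raw):
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', raw)   # special characters -> spaces
    return re.sub(r'\s+', ' ', text).strip()     # collapse repeated whitespace

print(clean_text("Hello, world! (Testing: 1-2-3)"))  # "Hello world Testing 1 2 3"
```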
The feature extraction step will use different parts of speech, such as pronouns,
conjunctions, and prepositions, to determine the writing style and its uniqueness.
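A sketch of such features using NLTK's Penn Treebank tags (PRP/PRP$ for pronouns, CC for coordinating conjunctions, IN for prepositions); it assumes the tagger models from the earlier NLTK example are already downloaded:

```python
import nltk

def pos_features(text):
    """Fraction of pronouns, conjunctions, and prepositions in a document."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    total = len(tags) or 1
    return {
        'pronoun_ratio': sum(t in ('PRP', 'PRP$') for t in tags) / total,
        'conjunction_ratio': sum(t == 'CC' for t in tags) / total,
        'preposition_ratio': sum(t == 'IN' for t in tags) / total,
    }

print(pos_features("She wrote to him and he replied, but the letter was lost in the post."))
```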
Group Member: Diksha Sharma (161114106), Random Forest Classifier; Report and PPT.