Text Processing / Data Processing, Bag of Words, TFIDF-
We all know that the language of computers is numerical, so the very first step that comes to our mind is to convert our language into numbers. This conversion takes a few steps to happen.
The first step is Text Normalisation. Since human languages are complex, we first need to simplify them in order to make sure that understanding becomes possible. Text Normalization helps in cleaning up the textual data in such a way that its complexity comes down to a level lower than that of the actual data.
Text Normalization-
The original form of text is not suitable for a machine to process because of the many unnecessary words and symbols in it. The process of simplifying the text to make it suitable for machine processing is called text normalization. In this process we remove unnecessary text and break the remaining text down into smaller tokens. The entire text that comes from all the documents to be processed by a machine is called the corpus. Text Normalization is a process to reduce the variations in a text's word forms to a common form. It simplifies the text for further processing.
Raw text from the user → Text Normalization process → Normalized output
1. Sentence Segmentation----
In this first stage, all the text in the corpus is broken down into sentences. Each sentence is treated as a separate string of characters to be processed.
Example- "Aman went to a therapist. Anil went to download a health chatbot." is segmented into two sentences: "Aman went to a therapist" and "Anil went to download a health chatbot".
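A minimal sketch of sentence segmentation in Python, using only a simple rule (split wherever a full stop, question mark or exclamation mark is followed by a space). The helper name segment_sentences is just illustrative; real NLP toolkits handle abbreviations and other edge cases far better.

```python
import re

def segment_sentences(text):
    # Split at ., ? or ! followed by whitespace, then drop empty pieces.
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p.strip(" .?!") for p in parts if p.strip(" .?!")]

text = "Aman went to a therapist. Anil went to download a health chatbot."
print(segment_sentences(text))
# ['Aman went to a therapist', 'Anil went to download a health chatbot']
```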
2. Tokenisation---
In this step, each sentence is further broken into individual pieces of text called tokens. A token can be a word, a number or a symbol (punctuation, special character, etc.).
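A small tokenisation sketch: a regular expression keeps words and numbers together and turns every other non-space character (punctuation, special characters) into its own token. The function name tokenize is illustrative.

```python
import re

def tokenize(sentence):
    # \w+ matches a run of word characters; [^\w\s] matches any single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("Aman and Anil are stressed!"))
# ['Aman', 'and', 'Anil', 'are', 'stressed', '!']
```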
3. Eliminating Stop Words, Special Characters and Numbers—
There are certain tokens which occur many times in the corpus. Mostly these are auxiliary verbs (is, are, was), punctuation, prepositions (on, at, in), articles (a, an, the) and other such words like such, there, them, or, and, etc. All these words are called stop words because they add unnecessary processing effort without adding much meaning.
Note- if the corpus includes documents of financial transactions, then numbers are not stop words there.
Some examples of stop words are: a, an, the, is, are, was, and, or, to, on, at, in, such, there, them.
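A toy stop-word filter as a sketch. The stop-word set here is only illustrative (libraries such as NLTK ship much larger, language-specific lists), and numbers and special characters are dropped too; for a corpus of financial transactions you would keep the numbers.

```python
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "on", "at", "in",
              "and", "or", "to", "such", "there", "them"}

def remove_stop_words(tokens):
    # Keep a token only if it is purely alphabetic (drops numbers,
    # punctuation and special characters) and is not a stop word.
    return [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]

print(remove_stop_words(["Aman", "and", "Anil", "are", "stressed", "!"]))
# ['Aman', 'Anil', 'stressed']
```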
4.) Converting Text to a Common Case-:
Convert all the tokens into a common text case, preferably lower case. This eliminates the differences in the interpretation of the same token, such as Truth, truth and TRUTH.
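In Python this step is a one-line transformation over the token list:

```python
# Convert every token to lower case so "Truth", "truth" and "TRUTH"
# are all treated as the same token.
tokens = ["Truth", "truth", "TRUTH"]
tokens = [t.lower() for t in tokens]
print(tokens)   # ['truth', 'truth', 'truth']
```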
5.) Stemming-:
Certain words have affixes (letters that appear at the end of a word). In stemming these affixes are removed to keep only the root word. Some examples are-
Hours > Hour, Eating > Eat
Stemming does not take into account whether the stemmed word is meaningful or not. It just removes the affixes, hence it is faster.
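A toy suffix-stripping stemmer as a sketch, assuming a small fixed list of suffixes. Real stemmers (for example NLTK's PorterStemmer) apply many more rules, but the idea is the same: chop the affix off without checking that the result is a meaningful word.

```python
SUFFIXES = ["ing", "ly", "es", "ed", "s"]

def stem(word):
    # Remove the first matching suffix, leaving at least a 3-letter stem.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("hours"), stem("eating"), stem("studies"))
# hour eat studi   <- "studi" is not a meaningful word, which stemming allows
```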
6. Lemmatization-:
Lemmatization is the process of stemming as well as converting the stemmed word to its proper form to keep it meaningful. A word which is stemmed and converted to its meaningful form is called a lemma.
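A toy lemmatiser sketch: look the inflected form up in a small hand-made dictionary and return the proper root (the lemma). The dictionary here is purely illustrative; real lemmatisers (for example NLTK's WordNetLemmatizer) use a full vocabulary and part-of-speech information.

```python
LEMMA_DICT = {"went": "go", "studies": "study", "better": "good",
              "eating": "eat", "hours": "hour"}

def lemmatize(word):
    # Fall back to the lower-cased word itself if it is not in the dictionary.
    return LEMMA_DICT.get(word.lower(), word.lower())

print(lemmatize("went"), lemmatize("studies"))
# go study
```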
We have now normalised our text into tokens, which are the simplest form of words. It is time to convert the tokens into numbers. For extracting features of the text, we need to convert it into a suitable numeric form. This is done with the help of various algorithms:
1. Bag of Words
2. Term Frequency-Inverse Document Frequency (TFIDF)
Bag of Words (BoW)---
The Bag of Words algorithm is used to extract two features of the text in the corpus: vocabulary and frequency. Vocabulary refers to the unique words identified in the corpus, and frequency is the number of occurrences of each word in the corpus.
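As a small sketch, both features can be read off with Python's collections.Counter. The token list below is just an illustrative, already-normalised sample, not a full corpus.

```python
from collections import Counter

corpus_tokens = ["aman", "and", "anil", "are", "stressed",
                 "aman", "went", "to", "a", "therapist"]

frequency = Counter(corpus_tokens)   # word -> number of occurrences
vocabulary = sorted(frequency)       # the unique words in the corpus

print(vocabulary)
print(frequency["aman"])             # 2
```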
Here is the step-by-step approach to implement the Bag of Words algorithm:
1. Text Normalisation: Collect the data and pre-process it.
2. Create Dictionary: Make a list of all the unique words occurring in the corpus (the vocabulary).
3. Create document vectors: For each document in the corpus, find out how many times each word from the unique list of words has occurred.
4. Create document vectors for all the documents.
Let us go through all the steps with an example:
Step 1: Collecting data and pre-processing it.
Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health chatbot
Step 2: Create Dictionary
List down all the unique words which occur in the three documents:
Dictionary: aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot
Step 3: Create document vectors—
In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches a word in the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. And if the word does not occur in that document, put a 0 under it.
Step 4: Repeat for all documents
The same exercise has to be done for all the documents. Hence, the table becomes:

              aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1:    1     1    1     1      1        0    0  0      0          0        0       0
Document 2:    1     0    0     0      0        1    1  1      1          0        0       0
Document 3:    0     0    1     0      0        1    1  1      0          1        1       1

In this table, the header row contains the vocabulary of the corpus and the three rows correspond to the three different documents. Take a look at this table and analyse the positioning of 0s and 1s in it.
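A minimal sketch of how the document vector table above can be built in Python for the three example documents. Names like tokenized and vectors are only illustrative; stop words are kept so the result matches the dictionary listed above.

```python
documents = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]

# Normalise: lower-case each document and split it into tokens.
tokenized = [doc.lower().split() for doc in documents]

# Dictionary / vocabulary: every unique word, in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Document vectors: count how often each vocabulary word occurs per document.
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for vec in vectors:
    print(vec)
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
```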
Finally, this gives us the document vector table for our corpus. However, these raw counts do not tell us how valuable each word is to a document. This leads us to the final step of our algorithm: TFIDF.
TFIDF: Term Frequency & Inverse Document Frequency---
In this graph we plot the occurrence of words versus their value. As we can see, the words with the highest occurrence across all the documents of the corpus have negligible value; hence they are termed stop words. These words are removed at the pre-processing stage. Moving ahead from the stop words, the occurrence level drops drastically, and the words which occur a sufficient number of times in the corpus and carry some amount of value are termed frequent words. As the occurrence of words drops further, the value of such words rises. These words occur the least but add the most value to the corpus.
Term Frequency—
Term frequency is the frequency of a word in one document. Term frequency can easily be found from the document vector table, as that table mentions the frequency of each word of the vocabulary in each document.
Document Frequency---
Document Frequency is the number of documents in which the word occurs irrespective of how many
times it has occurred in those documents.
Inverse Document Frequency---
To get the inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator. Here, the total number of documents is 3, hence for any word W the inverse document frequency becomes:
IDF(W) = 3 / DF(W)
Finally, the formula of TFIDF for any word W becomes: TFIDF(W) = TF(W) * log( IDF(W) ).
Here, log is to the base of 10.
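Using the three example documents above (stop words kept, as in the document vector table), a few worked values:
- "aman" occurs in 2 of the 3 documents, so DF(aman) = 2, IDF(aman) = 3/2, and TFIDF(aman) in Document 1 = 1 * log(1.5) ≈ 0.18.
- "chatbot" occurs in only 1 document, so DF(chatbot) = 1, IDF(chatbot) = 3, and TFIDF(chatbot) in Document 3 = 1 * log(3) ≈ 0.48.
- If a word occurred in all 3 documents, its IDF would be 3/3 = 1 and its TFIDF would be TF * log(1) = 0, which is why the most common words carry the least value.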
Now, let's multiply the IDF values by the TF values. Note that the TF values are for each document, while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values by each row of the document vector table.
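A minimal sketch that computes the full TFIDF table for the example corpus directly from the formula above (log to base 10). The helper name tfidf is illustrative.

```python
import math

documents = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]
tokenized = [doc.lower().split() for doc in documents]

# Vocabulary: every unique word, in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

n_docs = len(documents)

def tfidf(word, tokens):
    tf = tokens.count(word)                          # term frequency in this document
    df = sum(1 for doc in tokenized if word in doc)  # document frequency in the corpus
    return tf * math.log10(n_docs / df)              # TFIDF(W) = TF(W) * log(IDF(W))

for doc, tokens in zip(documents, tokenized):
    print(doc)
    print([round(tfidf(word, tokens), 2) for word in vocabulary])
```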
Finally, the words have been converted to numbers. These numbers are the value of each word for each document.
Summarising the concept, we can say that:
1. Words that occur in all the documents with high term frequencies have the least values and are considered to be the stop words.
2. For a word to have a high TFIDF value, the word needs to have a high term frequency but a low document frequency, which shows that the word is important for one document but is not common across all documents.
3. These values help the computer understand which words are to be considered while processing the
natural language. The higher the value, the more important the word is for a given corpus.