
Introduction to NLP

1. Data Science: It is all about applying mathematical and statistical principles to data. In simple
words, Data Science is the study of data, and this data can be of 3 types – audio, visual and
textual.
○ Data Science works around numbers and tabular data.
2. Computer Vision: In simple words, it is identifying symbols and patterns from a given object
(pictures) captured through a camera, and using the learned patterns to recognise or predict
future objects.
○ Computer Vision is all about visual data like images and videos.

Natural Language Processing (NLP)


Natural Language Processing (NLP) is the sub-field of AI that focuses on the ability of a computer to
understand human language (commands), spoken or written, and to give an output by processing it.
It is a component of Artificial Intelligence.

Applications of Natural Language Processing

Some of the applications of Natural Language Processing used in real-life scenarios are:

Automatic Summarization
1. Automatic summarization is relevant not only for summarizing the meaning of documents
and information but also for understanding the emotional meaning within the information
(such as when collecting data from social media).
Sentiment Analysis
1. Definition: Identify sentiment among several posts or even in the same post where
emotion is not always explicitly expressed.
2. Companies use it to identify opinions and sentiments to understand what customers
think about their products and services.

Text classification
1. Text classification makes it possible to assign predefined categories to a document and
organize it to help you find the information you need or simplify some activities.
2. For example, an application of text categorization is spam filtering in email.

Virtual Assistants
1. Nowadays Google Assistant, Cortana, Siri, Alexa, etc have become an integral part of
our lives. Not only can we talk to them but they also have the ability to make our lives
easier.
2. By accessing our data, they can help us in keeping notes of our tasks, making calls for
us, sending messages, and a lot more.
3. With the help of speech recognition, these assistants can not only detect our speech
but can also make sense of it.
4. According to recent research, a lot more advancements are expected in this field in
the near future.

ChatBots
One of the most common applications of Natural Language Processing is a chatbot. Let us try
some of the chatbots and see how they work.

Types of ChatBots

With the help of this experience, we can understand that there are 2 types of chatbots around us:
Script-bot and Smart-bot. Let us understand what each of them means.

Script Bot
1. Script bots are easy to make
2. Script bots work around a script that is programmed in them
3. Mostly they are free and are easy to integrate into a messaging platform
4. No or little language processing skills
5. Limited functionality
6. Example: the bots which are deployed in the customer care section of various
companies

Smart Bot
1. Smart bots are flexible and powerful
2. Smart bots work on bigger databases and other resources directly
3. Smart bots learn with more data
4. Coding is required to take this up on board
5. Wide functionality
6. Example: Google Assistant, Alexa, Cortana, Siri, etc.

Human Language VS Computer Language


Human Language
1. Our brain keeps on processing the sounds that it hears around itself and tries to make
sense of them all the time.
○ Example: In the classroom, as the teacher delivers the session, our brain is
continuously processing everything and storing it someplace. Also, while this is
happening, when your friend whispers something, the focus of your brain
automatically shifts from the teacher’s speech to your friend’s conversation.
○ So now, the brain is processing both the sounds but is prioritizing the one on which
our interest lies.
2. The sound reaches the brain through a long channel. As a person speaks, the sound
travels from his mouth and goes to the listener’s eardrum. The sound striking the eardrum
is converted into neuron impulses, gets transported to the brain, and then gets processed.
3. After processing the signal, the brain gains an understanding of its meaning. If it is
clear, the signal gets stored. Otherwise, the listener asks for clarity from the speaker. This
is how human languages are processed by humans.

Computer Language
1. Computers understand the language of numbers. Everything that is sent to the machine
has to be converted to numbers.
2. While typing, if a single mistake is made, the computer throws an error and does not
process that part. The communications made by the machines are very basic and simple.
3. Now, if we want the machine to understand our language, how should this happen? What
are the possible difficulties a machine would face in processing natural language? Let us
take a look at some of them here:
Arrangement of the words and meaning

There are rules in human language. There are nouns, verbs, adverbs, and adjectives. A word can be a
noun at one time and an adjective some other time. There are rules to provide structure to a language.

Syntax: Syntax refers to the grammatical structure of a sentence.

Human communication is complex. There are multiple characteristics of human language that might
be easy for a human to understand but extremely difficult for a computer to understand.

Semantics: It refers to the meaning of the sentence.

et’s understand Semantics and Syntax with some examples:

1. Different syntax, same semantics: 2 + 3 = 3 + 2
○ Here the way these statements are written is different, but their meanings are
the same, that is, 5.
2. Different semantics, same syntax: 2/3 (Python 2.7) ≠ 2/3 (Python 3)
○ Here the statements written have the same syntax but their meanings are
different. In Python 2.7, this statement would result in 0 (integer division), while in
Python 3 it would give an output of approximately 0.67.
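As an illustrative sketch (Python 3 only), the floor-division operator // reproduces what Python 2.7's / did for integers, so both meanings of the same-looking expression can be seen side by side:

```python
# Same syntax, different semantics: the meaning of "2/3" depends on the
# language version. In Python 3, // mimics Python 2.7's integer division.
print(2 / 3)   # true division (Python 3)          -> 0.6666666666666666
print(2 // 3)  # floor division (Python 2.7's "/") -> 0
```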

Data Processing
Now it is time to see how Natural Language Processing makes it possible for machines to
understand and speak natural languages just like humans.

Since we all know that the language of computers is Numerical, the very first step that
comes to our mind is to convert our language to numbers. This conversion takes a few
steps to happen. The first step to it is Text Normalisation.

Text Normalisation helps in cleaning up the textual data in such a way that it comes down
to a level where its complexity is lower than the actual data.

1. Text Normalisation

In Text Normalisation, we undergo several steps to normalise the text to a lower level. That is, we
will be working on text from multiple documents, and the term used for the whole textual data from
all the documents together is corpus.

Corpus: A corpus is a large and structured set of machine-readable texts that have been
produced in a natural communicative setting.

OR

Corpus is a collection of text or audio files organised into data sets.


https://www.youtube.com/watch?v=kDRpyJSE_6s

A corpus can be defined as a collection of text documents. It can be thought of as just a bunch of
text files in a directory, often alongside many other directories of text files.

2. Sentence Segmentation

Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is taken
as a different piece of data, so the whole corpus gets reduced to sentences.

Example:

Before Sentence Segmentation

“You want to see the dreams with close eyes and achieve them? They’ll remain dreams, look for
AIMs and your eyes have to stay open for a change to be seen.”

After Sentence Segmentation

1. You want to see the dreams with close eyes and achieve them?
2. They’ll remain dreams, look for AIMs and your eyes have to stay open for a change to
be seen.
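As a minimal sketch (assuming NLTK as the tokenisation library, which the text itself does not prescribe), sentence segmentation can be done like this:

```python
# Sentence segmentation sketch using NLTK (assumed library choice).
# Requires: pip install nltk
import nltk

nltk.download("punkt", quiet=True)  # one-time download of the sentence tokenizer model
from nltk.tokenize import sent_tokenize

corpus = ("You want to see the dreams with close eyes and achieve them? "
          "They'll remain dreams, look for AIMs and your eyes have to stay "
          "open for a change to be seen.")

# Each sentence becomes a separate data item.
for i, sentence in enumerate(sent_tokenize(corpus), start=1):
    print(i, sentence)
```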

3. Tokenisation

After segmenting the sentences, each sentence is then further divided into tokens. A
“Token” is a term used for any word or number or special character occurring in a
sentence.

Under Tokenisation, every word, number, and special character is considered separately
and each of them is now a separate token.

Example:

1. You want to see the dreams with close eyes and achieve them?
○ Tokens: You | want | to | see | the | dreams | with | close | eyes | and | achieve | them | ?
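A minimal tokenisation sketch, again assuming NLTK (any tokenizer that separates words, numbers and special characters would work):

```python
# Tokenisation sketch using NLTK's word_tokenize (assumed library choice).
import nltk

nltk.download("punkt", quiet=True)
from nltk.tokenize import word_tokenize

sentence = "You want to see the dreams with close eyes and achieve them?"
tokens = word_tokenize(sentence)
print(tokens)
# ['You', 'want', 'to', 'see', 'the', 'dreams', 'with', 'close',
#  'eyes', 'and', 'achieve', 'them', '?']
```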
4. Removal of Stopwords

In this step, the tokens which are not necessary are removed from the token list. To make
it easier for the computer to focus on meaningful terms, these words are removed.

Along with these words, a lot of times our corpus might have special characters and/or
numbers.

Removal of special characters and/or numbers depends on the type of corpus that we are
working on and whether we should keep them in it or not.

For example, if you are working on a document containing email IDs, then you might not want to
remove the special characters and numbers, whereas in some other textual data, if these characters
do not make sense, you can remove them along with the stopwords.

Stopwords: Stopwords are the words that occur very frequently in the corpus but do not
add any value to it.

Examples: a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, there, to.

Example

1. You want to see the dreams with close eyes and achieve them?
○ The removed words would be:
○ to, the, and, ?
2. The outcome would be:
○ You want see dreams with close eyes achieve them
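The following sketch removes stopwords and special characters from the tokenised sentence and also converts it to lower case (step 5 below); the stopword list here is simply the one listed above, not a library-supplied list:

```python
# Stopword and special-character removal, plus common-case conversion.
# The stopword list is taken from the examples above (assumption: this
# small list is sufficient for the illustration).
stopwords = {"a", "an", "and", "are", "as", "for", "it", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to"}
special_characters = {"?", "!", ",", "."}

tokens = ["You", "want", "to", "see", "the", "dreams", "with", "close",
          "eyes", "and", "achieve", "them", "?"]

# Lower-case every token, then keep only the meaningful ones.
cleaned = [t.lower() for t in tokens
           if t.lower() not in stopwords and t not in special_characters]
print(cleaned)
# ['you', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
```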

5. Converting text to a common case

We convert the whole text into the same case, preferably lower case. This ensures that the machine
does not treat the same words as different just because they appear in different cases.
6. Stemming

Definition: Stemming is a technique used to extract the base form of the words by
removing affixes from them. It is just like cutting down the branches of a tree to its stems.

The stemmed words (words which we get after removing the affixes) might not be
meaningful.

Example:

Words      Affixes    Stem
healing    ing        heal
dreams     s          dream
studies    es         studi
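A minimal stemming sketch using NLTK's PorterStemmer (an assumed choice; different stemmers can give slightly different stems):

```python
# Stemming sketch with NLTK's PorterStemmer (assumed stemmer choice).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healing", "dreams", "studies"]:
    print(word, "->", stemmer.stem(word))
# healing -> heal
# dreams  -> dream
# studies -> studi
```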

7. Lemmatization

Definition: In lemmatization, the word we get after affix removal (also known as the lemma) is a
meaningful one, though it takes longer to execute than stemming.

Lemmatization makes sure that a lemma is a word with meaning.


Difference between stemming and lemmatization
Stemming

1. The stemmed words might not be meaningful.


2. Caring ➔ Car

Lemmatization

1. The lemma word is a meaningful one.


2. Caring ➔ Care
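A minimal lemmatization sketch using NLTK's WordNetLemmatizer (an assumed choice; it needs the WordNet data and a part-of-speech hint to lemmatize verbs correctly):

```python
# Lemmatization sketch with NLTK's WordNetLemmatizer (assumed choice).
import nltk

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat the word as a verb.
print(lemmatizer.lemmatize("caring", pos="v"))   # care
print(lemmatizer.lemmatize("studies", pos="n"))  # study
```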
Bag of Words Algorithm
Bag of Words is a Natural Language Processing model which helps in extracting features out of text
that can be helpful in machine learning algorithms. In a bag of words, we get the occurrences of
each word and construct the vocabulary for the corpus.

Bag of Words simply creates a set of vectors containing the count of word occurrences in each
document (for example, reviews). Bag of Words vectors are easy to interpret.

The bag of words gives us two things:

● A vocabulary of words for the corpus


● The frequency of these words (number of times it has occurred in the whole corpus).

Steps of the bag of words algorithm


1. Text Normalisation: Collecting data and pre-processing it
2. Create Dictionary: Making a list of all the unique words occurring in the corpus.
(Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many
times the word from the unique list of words has occurred.
4. Create document vectors for all the documents.
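These four steps are worked out by hand in the example that follows. As a hedged sketch, the same result can also be obtained with scikit-learn's CountVectorizer (an assumed library choice, not something the text prescribes):

```python
# Bag of Words sketch using scikit-learn's CountVectorizer (assumed library).
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]

# token_pattern keeps single-letter words such as "a" in the vocabulary,
# which the default pattern would drop. Note the vocabulary comes out in
# alphabetical order, not in the order of the hand-built table below.
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
vectors = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the dictionary / vocabulary
print(vectors.toarray())                   # one document vector per row
```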

Example:

Step 1: Collecting data and pre-processing it.

Raw Data

● Document 1: Aman and Anil are stressed


● Document 2: Aman went to a therapist
● Document 3: Anil went to download a health chatbot

Processed Data

● Document 1: [aman, and, anil, are, stressed ]


● Document 2: [aman, went, to, a, therapist]
● Document 3: [anil, went, to, download, a, health, chatbot]
Step 2: Create Dictionary

Definition of Dictionary:

Dictionary in NLP means a list of all the unique words occurring in the corpus. If some
words are repeated in different documents, they are all written just once while creating the
dictionary.

Dictionary:

aman and anil are stressed went

download health chatbot therapist a to

Some words are repeated in different documents, but they are written just once: while creating the
dictionary, we create a list of unique words.

Step 3: Create a document vector

Definition of Document Vector: The document Vector contains the frequency of each word
of the vocabulary in a particular document.

How to make a document vector table?

In the document vector table, the vocabulary is written in the top row. Now, for each word in the
document, if it matches the vocabulary, put a 1 under it. If the same word appears again,
increment the previous value by 1. And if the word does not occur in that document, put a
0 under it.

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
  1    1     1    1       1       0    0  0      0          0        0        0

Step 4: Creating a document vector table for all documents


aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
  1    1     1    1       1       0    0  0      0          0        0        0
  1    0     0    0       0       1    1  1      1          0        0        0
  0    0     1    0       0       1    1  1      0          1        1        1

In this table, the header row contains the vocabulary of the corpus and three rows
correspond to three different documents. Take a look at this table and analyze the
positioning of 0s and 1s in it.

Finally, this gives us the document vector table for our corpus. But the tokens have still not been
converted to numbers. This leads us to the final step of our algorithm: TFIDF.

TFIDF
TFIDF stands for Term Frequency & Inverse Document Frequency.

Term Frequency

Term Frequency: Term frequency is the frequency of a word in one document.

Term frequency can easily be found from the document vector table, as in that table we mention the
frequency of each word of the vocabulary in each document.

Example of Term Frequency:

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
  1    1     1    1       1       0    0  0      0          0        0        0
  1    0     0    0       0       1    1  1      1          0        0        0
  0    0     1    0       0       1    1  1      0          1        1        1

Inverse Document Frequency

To understand IDF (Inverse Document Frequency), we should understand DF (Document Frequency)
first.

DF (Document Frequency)

Definition of Document Frequency (DF): Document Frequency is the number of documents in which
the word occurs, irrespective of how many times it has occurred in those documents. (Source: CBSE)

Example of Document Frequency:

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
  2    1     2    1       1       2    2  2      1          1        1        1

From the table, we can observe:

● Document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2 as they have occurred in
two documents.
● Rest of them occurred in just one document hence the document frequency for
them is one.

IDF (Inverse Document Frequency)

Definition of Inverse Document Frequency (IDF): In the case of inverse document frequency, we put
the document frequency in the denominator while the total number of documents is the numerator.

Example of Inverse Document Frequency:

aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
 3/2  3/1   3/2  3/1    3/1      3/2  3/2  3/2     3/1        3/1      3/1     3/1

Formula of TFIDF

The formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log( IDF(W) )

We don’t need to calculate the log values by ourselves. We simply have to use the log function in
the calculator and find out!

Example of TFIDF:

aman        and       anil        are       stressed  went        to          a           therapist  download  health    chatbot
1*log(3/2)  1*log(3)  1*log(3/2)  1*log(3)  1*log(3)  0*log(3/2)  0*log(3/2)  0*log(3/2)  0*log(3)   0*log(3)  0*log(3)  0*log(3)
1*log(3/2)  0*log(3)  0*log(3/2)  0*log(3)  0*log(3)  1*log(3/2)  1*log(3/2)  1*log(3/2)  1*log(3)   0*log(3)  0*log(3)  0*log(3)
0*log(3/2)  0*log(3)  1*log(3/2)  0*log(3)  0*log(3)  1*log(3/2)  1*log(3/2)  1*log(3/2)  0*log(3)   1*log(3)  1*log(3)  1*log(3)

Here, we can see that the IDF value for a given word (for example, 'aman') is the same in every row,
and a similar pattern is followed for all the words in the vocabulary. After calculating all the values,
we get:
aman   and    anil   are    stressed  went   to     a      therapist  download  health  chatbot
0.176  0.477  0.176  0.477   0.477    0      0      0        0          0         0       0
0.176  0      0      0       0        0.176  0.176  0.176    0.477      0         0       0
0      0      0.176  0       0        0.176  0.176  0.176    0          0.477     0.477   0.477

Finally, the words have been converted to numbers. These numbers are the TFIDF values of each
word in each document. Here, we can see that since we have a small amount of data, words like
'are' and 'and' also have a high value. But as a word occurs in more documents, its value decreases.
That is, for example:

● Total Number of documents: 10


● Number of documents in which ‘and’ occurs: 10

Therefore, IDF(and) = 10/10 = 1

Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.

On the other hand, the number of documents in which ‘pollution’ occurs: 3

IDF(pollution) = 10/3 = 3.3333…

This means log(3.3333) ≈ 0.52, which shows that the word 'pollution' has considerable
value in the corpus.
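A minimal sketch that reproduces the table above using the convention followed here, TFIDF(W) = TF(W) × log10(total documents ÷ document frequency). Note that library implementations such as scikit-learn's TfidfVectorizer use a slightly different (smoothed) formula, so their numbers will not match exactly:

```python
# TFIDF sketch following the formula used above:
#   TFIDF(word) = TF(word in document) * log10(N / DF(word))
import math

documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

vocabulary = sorted({word for doc in documents for word in doc})
n_docs = len(documents)

# Document frequency: in how many documents does each word occur?
df = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}

for i, doc in enumerate(documents, start=1):
    tfidf = {w: doc.count(w) * math.log10(n_docs / df[w]) for w in vocabulary}
    non_zero = {w: round(v, 3) for w, v in tfidf.items() if v > 0}
    print(f"Document {i}:", non_zero)
# Document 1: {'aman': 0.176, 'and': 0.477, 'anil': 0.176, 'are': 0.477, 'stressed': 0.477}
```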

Important concepts to remember:

1. Words that occur in all the documents with high term frequencies have the least values
and are considered to be the stopwords.
2. For a word to have a high TFIDF value, the word needs to have a high term frequency but
less document frequency which shows that the word is important for one document but is
not a common word for all documents.
3. These values help the computer understand which words are to be considered while
processing the natural language. The higher the value, the more important the word is for a
given corpus.
Applications of TFIDF
TFIDF is commonly used in the Natural Language Processing domain. Some of its
applications are:

1. Document Classification – Helps in classifying the type and genre of a document.


2. Topic Modelling – It helps in predicting the topic for a corpus.
3. Information Retrieval System – To extract the important information out of a
corpus.
4. Stop word filtering – Helps in removing the unnecessary words from a text body.
