
Introduction to NLP

1. Data Science: It is all about applying mathematical and statistical principles to data. In simple
words, Data Science is the study of data, and this data can be of 3 types – audio, visual and
textual.
○ Data Science works around numbers and tabular data.
2. Computer Vision: In simple words, it is identifying symbols and patterns from a given object
(pictures) captured through a camera, and using the learned patterns to recognise or predict
future objects.
○ Computer Vision is all about visual data like images and videos.

Natural Language Processing (NLP)


Natural Language Processing (NLP) is the sub-field of AI that focuses on the ability of a computer to
understand human language (commands), spoken or written, and to give an output by processing it.
It is a component of Artificial Intelligence.

Applications of Natural Language Processing

Some of the applications of Natural Language Processing used in real-life scenarios are:

Automatic Summarization
1. Automatic summarization is relevant not only for summarizing the meaning of documents
and information but also for understanding the emotional meaning within the information
(such as when collecting data from social media).
Sentiment Analysis
1. Definition: Identify sentiment among several posts or even in the same post where
emotion is not always explicitly expressed.
2. Companies use it to identify opinions and sentiments to understand what customers
think about their products and services.

Text classification
1. Text classification makes it possible to assign predefined categories to a document and
organize it to help you find the information you need or simplify some activities.
2. For example, an application of text categorization is spam filtering in email.

Virtual Assistants
1. Nowadays Google Assistant, Cortana, Siri, Alexa, etc have become an integral part of
our lives. Not only can we talk to them but they also have the ability to make our lives
easier.
2. By accessing our data, they can help us in keeping notes of our tasks, making calls for
us, sending messages, and a lot more.
3. With the help of speech recognition, these assistants can not only detect our speech
but can also make sense of it.
4. According to recent research, a lot more advancements are expected in this field in
the near future.

ChatBots
One of the most common applications of Natural Language Processing is a chatbot. Let us try
some of the chatbots and see how they work.

Types of ChatBots

With the help of this experience, we can understand that there are 2 types of chatbots around us:
Script-bot and Smart-bot. Let us understand what each of them means.

Script Bot
1. Script bots are easy to make
2. Script bots work around a script that is programmed in them
3. Mostly they are free and are easy to integrate into a messaging platform
4. No or little language processing skills
5. Limited functionality
6. Example: the bots which are deployed in the customer care section of various
companies

Smart Bot
1. Smart bots are flexible and powerful
2. Smart bots work on bigger databases and other resources directly
3. Smart bots learn with more data
4. Coding is required to take this up on board
5. Wide functionality
6. Example: Google Assistant, Alexa, Cortana, Siri, etc.

Human Language VS Computer Language


Human Language
1. Our brain keeps on processing the sounds that it hears around itself and tries to make
sense of them all the time.
○ Example: In the classroom, as the teacher delivers the session, our brain is
continuously processing everything and storing it someplace. Also, while this is
happening, when your friend whispers something, the focus of your brain
automatically shifts from the teacher’s speech to your friend’s conversation.
○ So now, the brain is processing both the sounds but is prioritizing the one on which
our interest lies.
2. The sound reaches the brain through a long channel. As a person speaks, the sound
travels from his mouth and goes to the listener’s eardrum. The sound striking the eardrum
is converted into neuron impulses, gets transported to the brain, and then gets processed.
3. After processing the signal, the brain gains an understanding of its meaning. If it is
clear, the signal gets stored. Otherwise, the listener asks for clarity from the speaker. This
is how human languages are processed by humans.

Computer Language
1. Computers understand the language of numbers. Everything that is sent to the machine
has to be converted to numbers.
2. While typing, if a single mistake is made, the computer throws an error and does not
process that part. The communications made by the machines are very basic and simple.
3. Now, if we want the machine to understand our language, how should this happen? What
are the possible difficulties a machine would face in processing natural language? Let us
take a look at some of them here:
Arrangement of the words and meaning

There are rules in human language. There are nouns, verbs, adverbs, and adjectives. A word can be a
noun at one time and an adjective some other time. There are rules to provide structure to a language.

Syntax: Syntax refers to the grammatical structure of a sentence.

Human communication is complex. There are multiple characteristics of human language that might
be easy for a human to understand but extremely difficult for a computer to understand.

Semantics: It refers to the meaning of the sentence.

et’s understand Semantics and Syntax with some examples:

1. Different syntax, same semantics: 2 + 3 = 3 + 2
○ Here the way these statements are written is different, but their meanings are
the same, that is, 5.
2. Different semantics, same syntax: 2/3 (Python 2.7) ≠ 2/3 (Python 3)
○ Here the statements written have the same syntax but their meanings are
different. In Python 2.7, this statement would result in 0 (integer division), while in
Python 3 it would give an output of approximately 0.67.
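As an illustrative sketch (Python 3 only), the floor-division operator // reproduces what Python 2.7's / did for integers, so both meanings of the same-looking expression can be seen side by side:

```python
# Same syntax, different semantics: the meaning of "2/3" depends on the
# language version. In Python 3, // mimics Python 2.7's integer division.
print(2 / 3)   # true division (Python 3)          -> 0.6666666666666666
print(2 // 3)  # floor division (Python 2.7's "/") -> 0
```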

Data Processing
Now it is time to see how Natural Language Processing makes it possible for machines to
understand and speak natural languages just like humans.

Since we all know that the language of computers is Numerical, the very first step that
comes to our mind is to convert our language to numbers. This conversion takes a few
steps to happen. The first step to it is Text Normalisation.

Text Normalisation helps in cleaning up the textual data in such a way that it comes down
to a level where its complexity is lower than the actual data.

1. Text Normalisation

In Text Normalisation, we undergo several steps to normalise the text to a lower level. That is, we
will be working on text from multiple documents, and the term used for the whole textual data from
all the documents together is corpus.

Corpus: A corpus is a large and structured set of machine-readable texts that have been
produced in a natural communicative setting.

OR

Corpus is a collection of text or audio files organised into data sets.


https://www.youtube.com/watch?v=kDRpyJSE_6s

A corpus can be defined as a collection of text documents. It can be thought of as just a bunch of
text files in a directory, often alongside many other directories of text files.

2. Sentence Segmentation

Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is taken
as a different piece of data, so the whole corpus gets reduced to sentences.

Example:

Before Sentence Segmentation

“You want to see the dreams with close eyes and achieve them? They’ll remain dreams, look for
AIMs and your eyes have to stay open for a change to be seen.”

After Sentence Segmentation

1. You want to see the dreams with close eyes and achieve them?
2. They’ll remain dreams, look for AIMs and your eyes have to stay open for a change to
be seen.
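As a minimal sketch (assuming NLTK as the tokenisation library, which the text itself does not prescribe), sentence segmentation can be done like this:

```python
# Sentence segmentation sketch using NLTK (assumed library choice).
# Requires: pip install nltk
import nltk

nltk.download("punkt", quiet=True)  # one-time download of the sentence tokenizer model
from nltk.tokenize import sent_tokenize

corpus = ("You want to see the dreams with close eyes and achieve them? "
          "They'll remain dreams, look for AIMs and your eyes have to stay "
          "open for a change to be seen.")

# Each sentence becomes a separate data item.
for i, sentence in enumerate(sent_tokenize(corpus), start=1):
    print(i, sentence)
```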

3. Tokenisation

After segmenting the sentences, each sentence is then further divided into tokens. A
“Token” is a term used for any word or number or special character occurring in a
sentence.

Under Tokenisation, every word, number, and special character is considered separately
and each of them is now a separate token.

Example:

1. You want to see the dreams with close eyes and achieve them?
○ Tokens: You | want | to | see | the | dreams | with | close | eyes | and | achieve | them | ?
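A minimal tokenisation sketch, again assuming NLTK (any tokenizer that separates words, numbers and special characters would work):

```python
# Tokenisation sketch using NLTK's word_tokenize (assumed library choice).
import nltk

nltk.download("punkt", quiet=True)
from nltk.tokenize import word_tokenize

sentence = "You want to see the dreams with close eyes and achieve them?"
tokens = word_tokenize(sentence)
print(tokens)
# ['You', 'want', 'to', 'see', 'the', 'dreams', 'with', 'close',
#  'eyes', 'and', 'achieve', 'them', '?']
```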
4. Removal of Stopwords

In this step, the tokens which are not necessary are removed from the token list. To make
it easier for the computer to focus on meaningful terms, these words are removed.

Along with these words, a lot of times our corpus might have special characters and/or
numbers.

Removal of special characters and/or numbers depends on the type of corpus that we are
working on and whether we should keep them in it or not.

For example, if you are working on a document containing email IDs, then you might not want to
remove the special characters and numbers, whereas in some other textual data, if these characters
do not make sense, you can remove them along with the stopwords.

Stopwords: Stopwords are the words that occur very frequently in the corpus but do not
add any value to it.

Examples: a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, there, to.

Example

1. You want to see the dreams with close eyes and achieve them?
○ The removed words would be:
○ to, the, and, ?
2. The outcome would be:
○ You want see dreams with close eyes achieve them
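The following sketch removes stopwords and special characters from the tokenised sentence and also converts it to lower case (step 5 below); the stopword list here is simply the one listed above, not a library-supplied list:

```python
# Stopword and special-character removal, plus common-case conversion.
# The stopword list is taken from the examples above (assumption: this
# small list is sufficient for the illustration).
stopwords = {"a", "an", "and", "are", "as", "for", "it", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to"}
special_characters = {"?", "!", ",", "."}

tokens = ["You", "want", "to", "see", "the", "dreams", "with", "close",
          "eyes", "and", "achieve", "them", "?"]

# Lower-case every token, then keep only the meaningful ones.
cleaned = [t.lower() for t in tokens
           if t.lower() not in stopwords and t not in special_characters]
print(cleaned)
# ['you', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
```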

5. Converting text to a common case

We convert the whole text into the same case, preferably lower case. This ensures that the machine
does not treat the same words as different just because they appear in different cases.
6. Stemming

Definition: Stemming is a technique used to extract the base form of the words by
removing affixes from them. It is just like cutting down the branches of a tree to its stems.

The stemmed words (words which we get after removing the affixes) might not be
meaningful.

Example:

Words      Affixes    Stem
healing    ing        heal
dreams     s          dream
studies    es         studi
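A minimal stemming sketch using NLTK's PorterStemmer (an assumed choice; different stemmers can give slightly different stems):

```python
# Stemming sketch with NLTK's PorterStemmer (assumed stemmer choice).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healing", "dreams", "studies"]:
    print(word, "->", stemmer.stem(word))
# healing -> heal
# dreams  -> dream
# studies -> studi
```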

7. Lemmatization

Definition: In lemmatization, the word we get after affix removal (also known as the lemma) is a
meaningful one, though it takes longer to execute than stemming.

Lemmatization makes sure that a lemma is a word with meaning.


Difference between stemming and lemmatization
Stemming

1. The stemmed words might not be meaningful.


2. Caring ➔ Car

Lemmatization

1. The lemma word is a meaningful one.


2. Caring ➔ Care
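A minimal lemmatization sketch using NLTK's WordNetLemmatizer (an assumed choice; it needs the WordNet data and a part-of-speech hint to lemmatize verbs correctly):

```python
# Lemmatization sketch with NLTK's WordNetLemmatizer (assumed choice).
import nltk

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat the word as a verb.
print(lemmatizer.lemmatize("caring", pos="v"))   # care
print(lemmatizer.lemmatize("studies", pos="n"))  # study
```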
Bag of Words Algorithm
Bag of Words is a Natural Language Processing model which helps in extracting features out of text
that can be helpful in machine learning algorithms. In a bag of words, we get the occurrences of
each word and construct the vocabulary for the corpus.

Bag of Words simply creates a set of vectors containing the count of word occurrences in each
document (for example, reviews). Bag of Words vectors are easy to interpret.

The bag of words gives us two things:

● A vocabulary of words for the corpus


● The frequency of these words (number of times it has occurred in the whole corpus).

Steps of the bag of words algorithm


1. Text Normalisation: Collecting data and pre-processing it
2. Create Dictionary: Making a list of all the unique words occurring in the corpus.
(Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many
times the word from the unique list of words has occurred.
4. Create document vectors for all the documents.
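These four steps are worked out by hand in the example that follows. As a hedged sketch, the same result can also be obtained with scikit-learn's CountVectorizer (an assumed library choice, not something the text prescribes):

```python
# Bag of Words sketch using scikit-learn's CountVectorizer (assumed library).
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]

# token_pattern keeps single-letter words such as "a" in the vocabulary,
# which the default pattern would drop. Note the vocabulary comes out in
# alphabetical order, not in the order of the hand-built table below.
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
vectors = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the dictionary / vocabulary
print(vectors.toarray())                   # one document vector per row
```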

Example:

Step 1: Collecting data and pre-processing it.

Raw Data

● Document 1: Aman and Anil are stressed


● Document 2: Aman went to a therapist
● Document 3: Anil went to download a health chatbot

Processed Data

● Document 1: [aman, and, anil, are, stressed ]


● Document 2: [aman, went, to, a, therapist]
● Document 3: [anil, went, to, download, a, health, chatbot]
Step 2: Create Dictionary

Definition of Dictionary:

Dictionary in NLP means a list of all the unique words occurring in the corpus. If some
words are repeated in different documents, they are all written just once while creating the
dictionary.

Dictionary:

aman and anil are stressed went

download health chatbot therapist a to

Some words are repeated in different documents, but they are written just once: while creating the
dictionary, we create a list of unique words.

Step 3: Create a document vector

Definition of Document Vector: The document Vector contains the frequency of each word
of the vocabulary in a particular document.

How to make a document vector table?

In the document vector table, the vocabulary is written in the top row. Now, for each word in the
document, if it matches the vocabulary, put a 1 under it. If the same word appears again,
increment the previous value by 1. And if the word does not occur in that document, put a
0 under it.

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
  1    1     1    1       1       0    0  0      0          0        0        0

Step 4: Creating a document vector table for all documents


aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
  1    1     1    1       1       0    0  0      0          0        0        0
  1    0     0    0       0       1    1  1      1          0        0        0
  0    0     1    0       0       1    1  1      0          1        1        1

In this table, the header row contains the vocabulary of the corpus and three rows
correspond to three different documents. Take a look at this table and analyze the
positioning of 0s and 1s in it.

Finally, this gives us the document vector table for our corpus. But the tokens have still not been
converted to numbers. This leads us to the final step of our algorithm: TFIDF.

TFIDF
TFIDF stands for Term Frequency & Inverse Document Frequency.

Term Frequency

Term Frequency: Term frequency is the frequency of a word in one document.

Term frequency can easily be found from the document vector table, as in that table we mention the
frequency of each word of the vocabulary in each document.

Example of Term Frequency:

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
  1    1     1    1       1       0    0  0      0          0        0        0
  1    0     0    0       0       1    1  1      1          0        0        0
  0    0     1    0       0       1    1  1      0          1        1        1

Inverse Document Frequency

To understand IDF (Inverse Document Frequency), we should understand DF (Document Frequency)
first.

DF (Document Frequency)

Definition of Document Frequency (DF): Document Frequency is the number of documents in which
the word occurs, irrespective of how many times it has occurred in those documents. (Source: CBSE)

Example of Document Frequency:

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
  2    1     2    1       1       2    2  2      1          1        1        1

From the table, we can observe:

● Document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2 as they have occurred in
two documents.
● Rest of them occurred in just one document hence the document frequency for
them is one.

IDF (Inverse Document Frequency)

Definition of Inverse Document Frequency (IDF): In the case of inverse document frequency, we put
the document frequency in the denominator while the total number of documents is the numerator.

Example of Inverse Document Frequency:

aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
 3/2  3/1   3/2  3/1    3/1      3/2  3/2  3/2     3/1        3/1      3/1     3/1

Formula of TFIDF

The formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log( IDF(W) )

We don’t need to calculate the log values by ourselves. We simply have to use the log function in
the calculator and find out!

Example of TFIDF:

aman        and       anil        are       stressed  went        to          a           therapist  download  health    chatbot
1*log(3/2)  1*log(3)  1*log(3/2)  1*log(3)  1*log(3)  0*log(3/2)  0*log(3/2)  0*log(3/2)  0*log(3)   0*log(3)  0*log(3)  0*log(3)
1*log(3/2)  0*log(3)  0*log(3/2)  0*log(3)  0*log(3)  1*log(3/2)  1*log(3/2)  1*log(3/2)  1*log(3)   0*log(3)  0*log(3)  0*log(3)
0*log(3/2)  0*log(3)  1*log(3/2)  0*log(3)  0*log(3)  1*log(3/2)  1*log(3/2)  1*log(3/2)  0*log(3)   1*log(3)  1*log(3)  1*log(3)

Here, we can see that the IDF value for a given word (for example, 'aman') is the same in every row,
and a similar pattern is followed for all the words in the vocabulary. After calculating all the values,
we get:
aman   and    anil   are    stressed  went   to     a      therapist  download  health  chatbot
0.176  0.477  0.176  0.477   0.477    0      0      0        0          0         0       0
0.176  0      0      0       0        0.176  0.176  0.176    0.477      0         0       0
0      0      0.176  0       0        0.176  0.176  0.176    0          0.477     0.477   0.477

Finally, the words have been converted to numbers. These numbers are the TFIDF values of each
word in each document. Here, we can see that since we have a small amount of data, words like
'are' and 'and' also have a high value. But as a word occurs in more documents, its value decreases.
That is, for example:

● Total Number of documents: 10


● Number of documents in which ‘and’ occurs: 10

Therefore, IDF(and) = 10/10 = 1

Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.

On the other hand, the number of documents in which ‘pollution’ occurs: 3

IDF(pollution) = 10/3 = 3.3333…

This means log(3.3333) ≈ 0.52, which shows that the word 'pollution' has considerable
value in the corpus.
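A minimal sketch that reproduces the table above using the convention followed here, TFIDF(W) = TF(W) × log10(total documents ÷ document frequency). Note that library implementations such as scikit-learn's TfidfVectorizer use a slightly different (smoothed) formula, so their numbers will not match exactly:

```python
# TFIDF sketch following the formula used above:
#   TFIDF(word) = TF(word in document) * log10(N / DF(word))
import math

documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

vocabulary = sorted({word for doc in documents for word in doc})
n_docs = len(documents)

# Document frequency: in how many documents does each word occur?
df = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}

for i, doc in enumerate(documents, start=1):
    tfidf = {w: doc.count(w) * math.log10(n_docs / df[w]) for w in vocabulary}
    non_zero = {w: round(v, 3) for w, v in tfidf.items() if v > 0}
    print(f"Document {i}:", non_zero)
# Document 1: {'aman': 0.176, 'and': 0.477, 'anil': 0.176, 'are': 0.477, 'stressed': 0.477}
```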

Important concepts to remember:

1. Words that occur in all the documents with high term frequencies have the least values
and are considered to be the stopwords.
2. For a word to have a high TFIDF value, the word needs to have a high term frequency but
less document frequency which shows that the word is important for one document but is
not a common word for all documents.
3. These values help the computer understand which words are to be considered while
processing the natural language. The higher the value, the more important the word is for a
given corpus.
Applications of TFIDF
TFIDF is commonly used in the Natural Language Processing domain. Some of its
applications are:

1. Document Classification – Helps in classifying the type and genre of a document.


2. Topic Modelling – It helps in predicting the topic for a corpus.
3. Information Retrieval System – To extract the important information out of a
corpus.
4. Stop word filtering – Helps in removing the unnecessary words from a text body.
