Module-1 - Introduction To Natural Language Processing
Natural language processing, or NLP, is a subfield of linguistics and computer science
that studies the interactions between human language and computer systems. The field is
closely related to computational linguistics and to artificial intelligence applied to the
linguistic domain. NLP has primarily concerned applications that process natural languages
such as English or French for use by humans. But as NLP evolves, new potential applications
are emerging in fields such as law enforcement (analysis of criminal profiles) and medicine
(diagnosis and treatment with personalized medicine dashboards).
Human language is a very complex and unique ability that only humans possess. There are
thousands of human languages with millions of words in our vocabularies, where several words
have multiple meanings, which further complicates matters. Computers can perform several
high-level tasks, but the one thing they have lacked is the ability to communicate like human
beings. NLP is an interdisciplinary field of artificial intelligence and linguistics that bridges this
gap between computers and natural languages.
There are infinite possibilities for arranging words in a sentence. It is essentially impossible to
form a database of all sentences from a language and feed it to computers. Even if possible,
computers could not understand or process how we speak or write; language is unstructured to
machines.
Stages of NLP:
The first stage is called tokenization. A string of words or sentences is broken down into
components or tokens. This retains the essence of each word in the text.
The next step is stemming, where the affixes are removed from the words to derive the stem.
For example, “runs” and “running” both have the same stem, “run.”
Lemmatization is the next stage. The algorithm looks for the meaning of a word in a dictionary,
and its root word is determined to derive its significance in the relevant context. For example,
the root of better is not “bet” but good. Several words have multiple meanings, which depend
on the context of the text. For instance, in the phrase “give me a call,” “call” is a noun. But in
“call the doctor,” “call” is a verb. In this stage, NLP analyzes the position and context of the
token to derive the correct meaning of the words, which is called parts of speech tagging.
The next stage is known as “named entity recognition.” In this stage, the algorithm analyzes the
entity associated with a token. For example, the token “London” is associated with location, and
“Google” is associated with an organization. Chunking is the final stage of natural language
processing, which picks individual pieces of information and groups them into more significant
parts. All these functions are available in NLTK (the Natural Language Toolkit), a library
written in Python. Many NLP processes and text analysis tasks use this library.
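To make these stages concrete, here is a minimal sketch of the pipeline using NLTK. The example sentence is illustrative, and the resource names passed to nltk.download() can differ slightly between NLTK versions.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources (names may vary by NLTK version)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")
nltk.download("maxent_ne_chunker")
nltk.download("words")

text = "Google opened a new office in London, and the team was running a long meeting."

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

tokens = nltk.word_tokenize(text)                      # 1. tokenization
stems = [stemmer.stem(t) for t in tokens]              # 2. stemming: strip affixes
lemmas = [lemmatizer.lemmatize(t) for t in tokens]     # 3. lemmatization: dictionary roots
pos_tags = nltk.pos_tag(tokens)                        # 4. parts-of-speech tagging
entities = nltk.ne_chunk(pos_tags)                     # 5. named entity recognition + chunking

print(stems)
print(lemmas)
print(pos_tags)
print(entities)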
History of NLP:
The evolution of NLP is an ongoing process. The earliest work of NLP started as machine
translation, which was simplistic in approach. The idea was to convert one human language into
another, and it began with converting Russian into English. This led to converting human
language into computer language and vice versa. In 1952, Bell Labs created Audrey, the first
speech recognition system. It could recognize all ten numerical digits. However, it was
abandoned because it was faster to input telephone numbers with a finger. In 1962, IBM
demonstrated Shoebox, a shoebox-sized machine capable of recognizing 16 spoken words.
Carnegie Mellon University developed Harpy under a DARPA program launched in 1971; it was
the first system to recognize over a thousand words. The evolution of natural language processing gained
momentum in the 1980s when real-time speech recognition became possible due to
advancements in computing performances. There was also innovation in algorithms for
processing human languages, which discarded rigid rules and moved to machine learning
techniques that could learn from existing data of natural languages. Earlier, chatbots were rule-
based: experts would encode rules mapping what a user might say to what an appropriate
reply should be. However, this was a tedious process and yielded limited possibilities.
An early example of rule-based NLP was Eliza, created at MIT in the mid-1960s. Eliza used syntactic
rules to identify meaning in the written text, which it would turn around and ask the user
about. Of course, NLP has evolved considerably over the last fifty years. The branches of
computational grammar and statistics gave NLP a different direction, giving rise to statistical
language processing and information extraction fields.
With the evolution of NLP, speech recognition systems are using deep neural networks.
Different vowels or sounds have different frequencies, which are discernible on a spectrogram.
This allows computers to recognize spoken vowels and words. Each sound is called a
phoneme, and speech recognition software knows what these phonemes look like. Along with
analyzing different words, NLP helps discern where sentences begin and end. And ultimately,
speech is converted to text.
Speech synthesis gives computers the ability to output speech. However, early synthesized
sounds were discontinuous and seemed robotic. While this was very prominent in Bell Labs'
hand-operated Voder machine, today’s computer voices like Siri and Alexa have improved greatly.
We are now seeing an explosion of voice interfaces on phones and cars. This creates a positive
feedback loop with people using voice interaction more often, which gives companies more data
to work on. This enables better accuracy, leading to people using voice more, and the loop
continues.
NLP evolution has happened by leaps and bounds in the last decade. NLP integrated with deep
learning and machine learning has enabled chatbots and virtual assistants to carry out
complicated interactions.
Chatbots now operate beyond the domain of customer interactions. They can handle human
resources and healthcare, too. NLP in healthcare can monitor treatments and analyze reports
and health records. Cognitive analytics and NLP are combined to automate routine tasks.
The evolution of NLP has happened with time and advancements in language technology. Data
scientists developed some powerful algorithms along the way; some of them are as follows:
Bag of words: This model counts the frequency of each unique word in a document. This is done
to train machines to understand the similarity of documents based on the words they share. However,
with millions of individual words across millions of documents, maintaining such vast, sparse data is
practically unmanageable.
TF-IDF: TF (term frequency) is the number of times a certain term appears in a document divided by
the total number of terms in that document, and IDF (inverse document frequency) down-weights
terms that occur in many documents. This scheme also reduces the influence of “stop words,” like “is,”
“a,” “the,” etc.
Co-occurrence matrix: This model was developed since the previous models could not solve the
problem of semantic ambiguity. It tracked the context of the text but required a lot of memory to
store all the data.
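For intuition, a co-occurrence matrix can be built by sliding a context window over the text and counting which words appear near each other. The toy sentences and the window size below are my own choices for illustration.

from collections import defaultdict

sentences = ["he is a good boy", "she is a good girl", "boy and girl are good"]
window = 2   # context window size (an arbitrary choice for this sketch)

# Count how often each pair of words appears within the same context window.
cooc = defaultdict(int)
for sentence in sentences:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[(w, words[j])] += 1

print(cooc[("good", "boy")])   # how often "boy" appears near "good"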
Transformer models: This is the encoder and decoder model that uses attention to train the
machines that imitate human attention faster. BERT, developed by Google based on this model,
has been phenomenal in revolutionizing NLP.
Carnegie Mellon University and Google have developed XLNet, another attention-network-based
model that has reportedly outperformed BERT on 20 tasks. BERT has dramatically improved
search results in web search engines. Megatron and GPT-3 are also based on the transformer
architecture and are used in applications such as speech synthesis and image processing.
In this encoder-decoder model, the encoder tells the machine what it should think and
remember from the text. The decoder uses those thoughts to decide the appropriate reply and
action.
For example, consider the sentence “I would like some strawberry ___.” The ideal words for this blank
would be “cake” or “milkshake.” In this sentence, the encoder focuses on the word strawberry,
and the decoder pulls the right word from a cluster of terms related to strawberry.
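A quick way to see this behaviour is a masked language model filling in the blank. The sketch below assumes the Hugging Face transformers package and the bert-base-uncased checkpoint; both are choices made for illustration rather than anything prescribed above.

from transformers import pipeline  # requires the Hugging Face transformers package

# Load a pretrained masked-language model (BERT); the checkpoint name is illustrative.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Ask the model to fill in the blank from the example above.
for prediction in unmasker("I would like some strawberry [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))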
NLP evolves every minute as more and more unstructured data is accumulated. So, there is no
end to the evolution of natural language processing.
As more and more data is generated, NLP will take over to analyze, comprehend, and store the
data. This will help digital marketers analyze gigabytes of data in minutes and strategize
marketing policies accordingly.
NLP concerns itself with human language. However, NLP evolution will eventually bring into its
domain non-verbal communications, like body language, gestures, and facial expressions.
To analyze non-verbal communication, NLP must be able to use biometrics like facial
recognition and retina scanning. Just as NLP is adept at understanding sentiments behind
sentences, it will eventually be able to read the feelings behind expressions. If this integration
between biometrics and NLP happens, the interaction between humans and computers will take
on a whole new meaning.
The next massive step in AI is the creation of humanoid robots by integrating NLP with
biometrics. Through robots, computer-human interaction will move into computer-human
communication. Virtual assistants do not even begin to cover the scope of NLP in the future.
When coupled with advancements in biometrics, NLP evolution can create robots that can see,
touch, hear, and speak, much like humans. NLP will shape the communication technologies of
the future.
NLP solves the root problem of machines not understanding human language. With its
evolution, NLP has surpassed traditional applications, and AI is being used to replace human
resources in several domains.
Virtual assistants like Cortana, Siri, and Alexa are boons of NLP evolution. These assistants
comprehend what you say, give befitting replies, or take appropriate actions, and do all this
through NLP. Intelligent chatbots are taking the world of customer service by storm. They are
replacing human assistance and conversing with customers just like humans do. They interpret
written text and decide on actions accordingly. NLP is the working mechanism behind
such chatbots.
NLP also helps in sentiment analysis. It recognizes the sentiment behind posts. For instance, it
determines whether a review is positive, negative, serious or sarcastic. NLP mechanisms help
companies like Twitter remove tweets with foul language, etc.
NLP automatically sorts our emails into social, promotions, inbox, and spam categories. This
NLP task is known as text classification.
Other important applications of NLP include spell checking, keyword research, and information
extraction. Plagiarism checkers also run on NLP programs.
NLP also drives advertisement recommendations. It matches advertisements with our history.
NLP helps machines understand natural languages and perform language-related tasks. It
makes it possible for computers to analyze more language-based data than humans.
A language has millions of words, several dialects, and thousands of grammatical and structural
rules. It is essential to comprehend the syntactic and semantic context of human text, which
computers cannot do on their own.
1. Tokenization
The lexical phase in Natural Language Processing (NLP) involves scanning text and breaking it
down into smaller units such as paragraphs, sentences, and words. This process, known as
tokenization, converts raw text into manageable units called tokens or lexemes. Tokenization is
essential for understanding and processing text at the word level.
In addition to tokenization, various data cleaning and feature extraction techniques are applied,
such as lowercasing, removing punctuation and stop words, stemming, and lemmatization.
These steps enhance the comprehensibility of the text, making it easier to analyze and process.
2. Morphological Analysis
Morphological analysis studies the internal structure of words, breaking them down into
morphemes, the smallest units of meaning.
Types of Morphemes
i) Free Morphemes: Text elements that carry meaning independently and make sense
on their own. For example, "bat" is a free morpheme.
ii) Bound Morphemes: Elements that must be attached to free morphemes to convey
meaning, as they cannot stand alone. For instance, the suffix "-ing" is a bound
morpheme, needing to be attached to a free morpheme like "run" to form "running."
Morphological analysis also aids in predicting word forms, anticipating the different forms a
word can take based on its morphemes. By identifying and analyzing morphemes, the system
can interpret text correctly at the most fundamental level, laying the groundwork for more
advanced NLP applications.
Syntactic analysis, also known as parsing, is the second phase of Natural Language Processing
(NLP). This phase is essential for understanding the structure of a sentence and assessing its
grammatical correctness. It involves analyzing the relationships between words and ensuring
their logical consistency by comparing their arrangement against standard grammatical rules.
1) Role of Parsing
Parsing examines the grammatical structure and relationships within a given text. It
assigns Parts-Of-Speech (POS) tags to each word, categorizing them as nouns, verbs,
adverbs, etc. This tagging is crucial for understanding how words relate to each other
syntactically and helps in avoiding ambiguity. Ambiguity arises when a text can be
interpreted in multiple ways due to words having various meanings. For example, the
word "book" can be a noun (a physical book) or a verb (the action of booking
something), depending on the sentence context.
Examples of Syntax
During parsing, each word in the sentence is assigned a POS tag to indicate its
grammatical category. For instance, for the sentence "The dog chased the ball," a breakdown would be:
POS Tags: The (Determiner), dog (Noun), chased (Verb), the (Determiner), ball (Noun)
Assigning POS tags correctly is crucial for understanding the sentence structure and
ensuring accurate interpretation of the text.
By analyzing and ensuring proper syntax, NLP systems can better understand and generate
human language. This analysis helps in various applications, such as machine translation,
sentiment analysis, and information retrieval, by providing a clear structure and reducing
ambiguity.
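A lightweight way to see syntactic structure in practice is shallow parsing (chunking). The sketch below uses NLTK's RegexpParser with a single noun-phrase rule; the grammar is a simplified assumption chosen for this example, not a full English grammar.

import nltk  # assumes the punkt and averaged_perceptron_tagger resources are downloaded

sentence = "The red chair is for kids"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)           # assign POS tags (DT, JJ, NN, VBZ, IN, NNS, ...)

# A simple noun-phrase rule: optional determiner, any adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
parser = nltk.RegexpParser(grammar)

tree = parser.parse(tagged)             # group the tagged words into NP chunks
print(tree)                             # roughly: (S (NP The/DT red/JJ chair/NN) is/VBZ for/IN (NP kids/NNS))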
Semantic Analysis is the third phase of Natural Language Processing (NLP), focusing on
extracting the meaning from text. Unlike syntactic analysis, which deals with grammatical
structure, semantic analysis is concerned with the literal and contextual meaning of words,
phrases, and sentences.
Semantic analysis aims to understand the dictionary definitions of words and their usage in
context. It determines whether the arrangement of words in a sentence makes logical sense.
This phase helps in finding context and logic by ensuring the semantic coherence of sentences.
1) Named Entity Recognition (NER): NER identifies and classifies entities within the text,
such as names of people, places, and organizations. These entities belong to predefined
categories and are crucial for understanding the text's content.
2) Word Sense Disambiguation (WSD): WSD determines the correct meaning of ambiguous
words based on context. For example, the word "bank" can refer to a financial institution or the
side of a river. WSD uses contextual clues to assign the appropriate meaning.
For example, the sentence "An apple ate a person" is grammatically correct but does not make sense
semantically: an apple cannot eat a person, which highlights the importance of semantic analysis in
ensuring logical coherence.
In contrast, a phrase such as "What time is it?" is interpreted literally as someone asking for the
current time, demonstrating how semantic analysis helps in understanding the intended meaning.
Semantic analysis is essential for various NLP applications, including machine translation,
information retrieval, and question answering. By ensuring that sentences are not only
grammatically correct but also meaningful, semantic analysis enhances the accuracy and
relevance of NLP systems.
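As a small illustration of word sense disambiguation, NLTK ships a classic Lesk implementation. The two context sentences below are my own, and the simple Lesk heuristic will not always pick the intuitively correct sense.

import nltk
from nltk.wsd import lesk

nltk.download("wordnet")   # Lesk compares the context with WordNet sense glosses

sent1 = "I deposited my salary in the bank".split()
sent2 = "We sat on the bank of the river and fished".split()

for context, word in [(sent1, "bank"), (sent2, "bank")]:
    sense = lesk(context, word)         # may return None if no gloss overlap is found
    print(sense, "->", sense.definition() if sense else "no sense found")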
Discourse Integration is the fourth phase of Natural Language Processing (NLP). This phase
deals with comprehending the relationship between the current sentence and earlier sentences
or the larger context. Discourse integration is crucial for contextualizing text and understanding
the overall message conveyed.
Discourse integration examines how words, phrases, and sentences relate to each other within
a larger context. It assesses the impact a word or sentence has on the structure of a text and
how the combination of sentences affects the overall meaning. This phase helps in
understanding implicit references and the flow of information across sentences.
Importance of Contextualization
In conversations and texts, words and sentences often depend on preceding or following
sentences for their meaning. Understanding the context behind these words and
sentences is essential to accurately interpret their meaning.
Consider a standalone statement such as "I cannot believe this." To understand what "this" refers to,
we need to examine the preceding or following sentences. Without context, the statement's meaning
remains unclear.
Anaphora Resolution: "Taylor went to the store to buy some groceries. She realized she forgot
her wallet."
In this example, the pronoun "she" refers back to "Taylor" in the first sentence. Understanding
that "Taylor" is the antecedent of "she" is crucial for grasping the sentence's meaning.
Discourse integration is vital for various NLP applications, such as machine translation,
sentiment analysis, and conversational agents. By understanding the relationships and context
within texts, NLP systems can provide more accurate and coherent responses.
Pragmatic Analysis is the fifth and final phase of Natural Language Processing (NLP), focusing
on interpreting the inferred meaning of a text beyond its literal content. Human language is often
complex and layered with underlying assumptions, implications, and intentions that go beyond
straightforward interpretation. This phase aims to grasp these deeper meanings in
communication.
Pragmatic analysis goes beyond the literal meanings examined in semantic analysis, aiming to
understand what the writer or speaker truly intends to convey. In natural language, words and
phrases can carry different meanings depending on context, tone, and the situation in which
they are used.
In human communication, people often do not say exactly what they mean. For instance, the
word "Hello" can have various interpretations depending on the tone and context in which it is
spoken. It could be a simple greeting, an expression of surprise, or even a signal of anger.
Thus, understanding the intended meaning behind words and sentences is crucial.
"What time is it?" might be a straightforward request for the current time, but it could also imply
concern about being late.
Similarly, in the sentence "I am falling for you," the word "falling" literally means collapsing, but in
this context it means the speaker is expressing love for someone.
Pragmatic analysis is essential for applications like sentiment analysis, conversational AI, and
advanced dialogue systems. By interpreting the deeper, inferred meanings of texts, NLP
systems can understand human emotions, intentions, and subtleties in communication, leading
to more accurate and human-like interactions.
1. Language differences
Human language is rich and intricate, and there are thousands of languages spoken around the
world, each with its own grammar, vocabulary, and cultural nuances. No single person or system
can understand all of these languages, and the productivity of human language (its capacity to
generate new expressions) is very high.
There is also ambiguity in natural language, since the same words and phrases can have different
meanings in different contexts. This is one of the major challenges in understanding natural
language.
Natural languages have complex syntactic structures and grammatical rules, covering word order,
verb conjugation, tense, aspect, and agreement. They also carry rich semantic content that allows
speakers to convey a wide range of meanings through words and sentences. Language is pragmatic:
how it is used in context determines how communication goals are achieved. Finally, human
languages evolve over time through processes such as lexical change.
2. Training Data
Training data is a curated collection of input-output pairs, where the input represents the
features or attributes of the data and the output is the corresponding label or target. For NLP,
the features are usually text data, and the labels could be categories, sentiments, or any other
relevant annotations. Training data helps the model generalize patterns from the training set so
that it can make predictions or classifications on new, previously unseen data.
3. Development Time and Resource Requirements
Development time and resource requirements for Natural Language Processing (NLP) projects
depend on various factors, including task complexity, the size and quality of the data, the
availability of existing tools and libraries, and the team of experts involved. Here are some key
points:
● Task complexity: Simple tasks, such as detecting the sentiment of the text, may require less
time compared to more complex tasks such as machine translation or question answering.
● Data collection and preprocessing: It takes significant effort to collect, annotate, and
preprocess large text datasets, and this can be resource-intensive, especially when manual
annotations are needed.
● Algorithm selection: Choosing the machine learning algorithms that are best suited for the
Natural Language Processing task.
● Computational resources: Training large models requires powerful hardware (GPUs or
TPUs) and time for training the models.
4. Phrasing Ambiguities
Ambiguous phrasing can often be resolved by combining several kinds of evidence:
● Contextual analysis: Surrounding words, topic focus, or conversational cues can give
valuable clues for resolving ambiguities.
● Semantic analysis: Analyzing meaning based on word senses, lexical relationships, and
semantic roles; tools such as semantic parsers can help resolve phrasing ambiguities.
● Statistical methods: Statistical methods and machine learning models are used
to learn patterns from data and make predictions about the input phrase.
5. Misspellings and Grammatical Errors
Misspellings and grammatical errors add noise to text; common techniques to handle them include:
● Tokenization: The text is split into individual tokens, which helps isolate misspelled
words and grammatical errors and makes it easier to correct the phrase.
● Language models: Language models trained on a large corpus of data can predict the
likelihood that a word or phrase is correct.
6. Mitigating Innate Biases
Mitigating innate biases in NLP algorithms is a crucial step for ensuring fairness, equity, and
inclusivity in natural language processing applications. Here are some key points for mitigating
biases in NLP algorithms:
● Diverse and representative data: Ensure that the training data used to develop NLP
algorithms is diverse, representative, and free from biases.
● Analysis and detection of bias: Apply bias detection and analysis methods to the
training data to find biases based on demographic factors such as race, gender, and age.
● Data preprocessing: Preprocess the data and train models to learn fair representations
that are invariant to protected attributes like race or gender.
● Evaluation and auditing: Regularly evaluate NLP models for fairness and bias with the
help of metrics and audits.
7. Words with Multiple Meanings
Words with multiple meanings pose a lexical challenge in Natural Language Processing
because of their ambiguity. Such words, known as polysemous or homonymous words, have
different meanings depending on the context in which they are used. Here are some key points
for addressing the lexical challenge posed by words with multiple meanings in NLP:
● Semantic representations: Representations such as semantic networks capture semantic
context and constraints for determining the correct sense of the word.
8. Addressing Multilingualism
● Multilingual corpora and models: Multilingual corpora and pretrained multilingual models
cover many languages and serve as valuable resources for training NLP models and
systems.
● Machine translation: Machine translation enables information access across language
barriers and can be used as a preprocessing step in multilingual pipelines.
9. Reducing Uncertainty and False Positives
It is a crucial task to reduce uncertainty and false positives in Natural Language Processing
(NLP) in order to improve the accuracy and reliability of NLP models. Here are some key points
to approach the solution:
● Probabilistic models: Probabilistic models quantify the uncertainty associated with
predictions and support better decision making.
● Confidence scores: Confidence scores or probability estimates are calculated
for NLP predictions to assess the certainty of the model's output.
● Threshold adjustment: Decision thresholds can be adjusted to balance sensitivity (recall)
and specificity, reducing false positives and uncertainty.
10. Facilitating Continuous Conversations
Facilitating continuous conversations with NLP involves developing systems that understand
and respond to human language in real time, enabling seamless interaction between users and
machines. Implementing real-time natural language processing pipelines gives systems the
capability to analyze and interpret user input as it is received; this involves optimizing
algorithms and systems for low-latency processing to ensure quick responses to user queries
and inputs.
It also requires building NLP models that can maintain context throughout a conversation.
Understanding context enables systems to interpret user intent, track conversation history,
and generate relevant responses based on the ongoing dialogue. Intent recognition algorithms
are applied to find the underlying goals and intentions expressed by users in their messages.
More broadly, several practices help with the challenges above:
● Quantity and quality of data: High-quality, diverse data is used to train models so that they
cover a wide range of words and phrases.
● Vocabulary expansion: Continuously updating a model's vocabulary helps it handle new
words and phrases.
● Transfer learning: Pretrained models can be used to transfer knowledge from large datasets
to specific tasks with limited labeled data.
Types of Ambiguities
1. Lexical Ambiguity
Lexical means relating to the words of a language. During lexical analysis, given paragraphs are broken
down into words or tokens, and each token has a specific meaning. There can be instances where a
single word can be interpreted in multiple ways; the ambiguity that is caused by the word alone is
called lexical ambiguity.
Example: "He saw a bat."
In the above sentence, it is unclear whether bat refers to the nocturnal animal or a cricket bat. Just
looking at the word does not provide enough information about the meaning; hence we need to
consider the context in which it is used.
1. a) Polysemy
Polysemy refers to a single word having multiple related meanings.
· Thanks to the new windows, this room is now so light and airy = lit by the natural light of
day.
· She prefers light colours for the walls = pale, not dark.
In the above examples, light has different meanings, but they are related to each other.
1. b) Homonymy
Homonymy refers to words that share the same spelling or pronunciation but have unrelated meanings.
· Pole and Pole — The first Pole refers to a citizen of Poland who could either be referred to
as Polish or a Pole. The second Pole refers to a bamboo pole or any other wooden pole.
2. Syntactic Ambiguity
Syntactic meaning refers to the grammatical structure and rules that define how words should be
combined to form sentences and phrases. A sentence can be interpreted in more than one way due
to its structure; this is known as syntactic ambiguity. For example, "I saw the man with the telescope"
can mean either that the speaker used a telescope to see the man or that the man had a telescope.
3. Semantic Ambiguity
Semantics is nothing but “meaning.” The semantics of a word or phrase refers to the way it is
typically understood or interpreted by people, whereas syntax describes the rules by which words
can be combined.
Semantic ambiguity occurs when a sentence has more than one interpretation or meaning.
Example: "Seema loves her mother and Sriya does too."
The interpretations can be that Sriya loves Seema's mother or that Sriya loves her own mother.
Another example: "We threw away the burnt lasagna and pie."
The above sentence can be interpreted as either "the lasagna was burnt and the pie wasn't" or both
were burnt.
4. Anaphoric Ambiguity
A word that gets its meaning from a preceding word or phrase is called an anaphor. For example,
in "Susan said she would help," the word she is an anaphor and refers back to a preceding
expression, i.e., Susan.
The linguistic element or elements to which an anaphor refers is called the antecedent. The
ambiguity that arises when there is more than one possible antecedent is known as
anaphoric ambiguity.
Example 1: “The horse ran up the hill. It was very steep. It soon got tired.”
In this example, there are two occurrences of 'it', and it is unclear what each 'it' refers to; this leads
to anaphoric ambiguity. The sentence is only meaningful if the first 'it' refers to the hill and the
second 'it' refers to the horse. Anaphors may not be in the immediately previous sentence; they may
be present in sentences before the previous one or even in the same sentence.
Anaphoric references may also not be explicitly present in the previous sentence; rather, they might
refer to something only implied by it.
Example 2: “I went to the hospital, and they told me to go home and rest.”
In this sentence, 'they' does not explicitly refer to the hospital; instead, it refers to the doctors or
staff who work there.
Example 3: “A puppy drank the milk. The cute little dog was satisfied.”
Here, the noun phrase 'the cute little dog' refers back to 'a puppy', showing that anaphors need not
be pronouns.
5. Pragmatic ambiguity
Pragmatics focuses on the real-time usage of language: what the speaker wants to convey and
how the listener infers it. Situational context, the individuals’ mental states, the preceding dialogue,
and other elements play a major role in understanding what the speaker is trying to say and how
the listener interprets it.
Example: "Can you pass the salt?" is literally a question about ability, but pragmatically it is a request.
Text Encoding
Let’s try to understand a few basic rules first…
1. A machine doesn’t understand characters, words, or sentences.
2. Machines can only process numbers.
3. Text data must be encoded as numbers to be used as input to or output from any machine.
As mentioned in the above points, we cannot pass raw text into machines as input until and unless
we convert it into numbers.
Text encoding is a process to convert meaningful text into number / vector representation so as to
preserve the context and relationship between words and sentences, such that a machine can
understand the pattern associated in any text and can make out the context of sentences.
There are a lot of methods to convert Text into numerical vectors, they are:
- Index-Based Encoding
- TF-IDF Encoding
- Word2Vector Encoding
- BERT Encoding
As this is a basic explanation of NLP text encoding, we will be skipping the last two methods, i.e.
Word2Vector and BERT, as they are quite complex and powerful implementations of deep-learning-based
encodings.
Before we deep dive into each method let’s set some ground examples so as to make it easier to
follow through.
Document Corpus: This is the whole set of text we have, basically our text corpus, can be anything
Example: We have 5 sentences namely, [“this is a good phone” , “this is a bad mobile” , “she is a good
cat” , “he has a bad temper” , “this mobile phone is not good”]
Data Corpus: It is the collection of unique words in our document corpus, i.e. in our case it looks like
this:
[“a” , “bad” , “cat” , “good” , “has” , “he” , “is” , “mobile” , “not” , “phone” , “she” , “temper” , “this”]
This will make it easier to understand and grasp the intuition behind these methods.
1. Index-Based Encoding:
As the name suggests, in index-based encoding we need to give every unique word an index. Since we
have already separated out our Data Corpus, we can now index its words individually, like…
a : 1
bad : 2
cat : 3
good : 4
has : 5
he : 6
is : 7
mobile : 8
not : 9
phone : 10
she : 11
temper : 12
this : 13
Now that we have assigned a unique index to all the words so that based on the index we can
uniquely identify them, we can convert our sentences using this index-based method.
It is very trivial to understand, that we are just replacing the words in each sentence with their
respective indexes.
Now we have encoded all the words with index numbers, and this can be used as input to any
machine learning model.
But there is a tiny issue which needs to be addressed first, and that is the consistency of the
input. Every input to our model needs to be of the same length; it cannot vary. Sentence lengths do
vary in the real world, but this needs to be taken care of when we use the text as input to our model.
As we can see, the first sentence has 5 words but the last sentence has 6 words, and this will cause
a mismatch in input length.
So to take care of that issue what we do is max padding, which means we take the longest sentence
from our document corpus and we pad the other sentence to be as long. This means if all of my
sentences are of 5 words and one sentence is of 6 words I will make all the sentences of 6 words.
Now how do we add that extra word here? In our case how do we add that extra index here?
If you have noticed we didn’t use 0 as an index number, and preferably that will not be used
anywhere even if we have 100000 words long data corpus, hence we use 0 as our padding index.
This also means that we are appending nothing to our actual sentence, as 0 doesn’t represent any
word in our data corpus. The padded, index-encoded sentences are:
[ 13 7 1 4 10 0 ] ,
[ 13 7 1 2 8 0 ] ,
[ 11 7 1 4 3 0 ] ,
[ 6 5 1 2 12 0 ] ,
[ 13 8 10 7 9 4 ]
And this is how we keep our input’s integrity the same and without disturbing the context of our
sentences as well.
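Here is a minimal sketch of index-based encoding with zero padding for the five example sentences above. The variable names are my own; a real pipeline would typically use a library tokenizer.

sentences = [
    "this is a good phone",
    "this is a bad mobile",
    "she is a good cat",
    "he has a bad temper",
    "this mobile phone is not good",
]

# Build the data corpus: sorted unique words, indexed from 1 (0 is reserved for padding).
vocab = sorted({word for s in sentences for word in s.split()})
index = {word: i + 1 for i, word in enumerate(vocab)}

# Encode each sentence and pad every sequence to the length of the longest one.
encoded = [[index[w] for w in s.split()] for s in sentences]
max_len = max(len(seq) for seq in encoded)
padded = [seq + [0] * (max_len - len(seq)) for seq in encoded]

for row in padded:
    print(row)   # e.g. [13, 7, 1, 4, 10, 0] for "this is a good phone"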
2. Bag of Words (BOW) Encoding:
In this method, we use the fixed Data Corpus as a template and mark which of its words appear in
each of our sentences. It will make sense once we actually see how to do it.
Data Corpus:
[“a” , “bad” , “cat” , “good” , “has” , “he” , “is” , “mobile” , “not” , “phone” , “she” , “temper” , “this”]
As we know, our data corpus will never change, so if we use it as a baseline to create encodings for
our sentences, we have the advantage of not needing to pad any extra words.
So our first sentence, “this is a good phone”, becomes a combination of the words it does and does
not contain:
[1,0,0,1,0,0,1,0,0,1,0,0,1]
There are two variations of this approach:
1. Binary BOW
2. BOW
The difference between them is that in Binary BOW we encode 1 or 0 for each word depending on
whether it appears in the sentence or not; we do not take into consideration the frequency with
which the word appears in that sentence.
In BOW we also take into consideration the frequency of each word occurring in that sentence.
Let’s say our text sentence is “this is a good phone this is a good mobile” (just for reference). Its BOW
vector over the same data corpus is [2,0,0,2,0,0,2,1,0,1,0,0,2]. If you look carefully, we have counted
the number of times the words “this”, “a”, “is” and “good” have occurred.
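A minimal sketch of both variants over the same five sentences (the helper function name is my own):

sentences = [
    "this is a good phone", "this is a bad mobile", "she is a good cat",
    "he has a bad temper", "this mobile phone is not good",
]
vocab = sorted({w for s in sentences for w in s.split()})

def bow_vector(sentence, vocab, binary=False):
    # Binary BOW: 1/0 presence flags; plain BOW: raw word counts.
    words = sentence.split()
    counts = [words.count(w) for w in vocab]
    return [int(c > 0) for c in counts] if binary else counts

print(bow_vector("this is a good phone", vocab, binary=True))
# [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print(bow_vector("this is a good phone this is a good mobile", vocab))
# [2, 0, 0, 2, 0, 0, 2, 1, 0, 1, 0, 0, 2]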
3. TF-IDF Encoding:
Term Frequency — Inverse Document Frequency
As the name suggests, here we give every word a relative frequency weight with respect to the current
sentence and the whole document corpus.
Term Frequency (TF): the number of occurrences of the current word in the current sentence divided
by the total number of words in that sentence.
Inverse Document Frequency (IDF): the log of the total number of sentences in the document corpus
divided by the number of sentences containing the current word.
TF = (count of the word in the sentence) / (total words in the sentence)
IDF = log(total number of sentences / number of sentences containing the word)
TF-IDF = TF × IDF
One thing to note here is we have to calculate the word frequency of each word for that particular
sentence, because depending on the number of times a word occurs in a sentence the TF value can
change, whereas the IDF value remains constant, until and unless new sentences are getting added.
Data Corpus: [“a” , “bad” , “cat” , “good” , “has” , “he” , “is” , “mobile” , “not” , “phone” , “she” , “temper”
, “this”]
TF-IDF of “this” in sentence 1: TF is the number of times “this” appears in sentence 1 divided by the
total number of words in sentence 1, and IDF is log(total number of sentences / number of sentences
containing “this”).
TF : 1 / 5 = 0.2
IDF : log(5 / 3) ≈ 0.22 (using a base-10 logarithm)
TF-IDF : 0.2 × 0.22 ≈ 0.044
The rest of the process remains the same as BOW, except that here we replace each word not with the
frequency of its occurrence but with the TF-IDF value for that word.
Once this is done, every word appearing in a sentence is replaced with its respective TF-IDF value.
One thing to notice is that multiple words end up with similar TF-IDF values; this happens here
because we have very few documents and almost all words have similar frequencies.
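A minimal sketch of this calculation over the five example sentences. It follows the definitions above with a base-10 logarithm; scikit-learn's TfidfVectorizer uses a slightly different smoothed formula, so its numbers would differ.

import math

sentences = [
    "this is a good phone", "this is a bad mobile", "she is a good cat",
    "he has a bad temper", "this mobile phone is not good",
]
vocab = sorted({w for s in sentences for w in s.split()})

def tf_idf(word, sentence, corpus):
    words = sentence.split()
    tf = words.count(word) / len(words)                       # term frequency in this sentence
    containing = sum(1 for s in corpus if word in s.split())  # sentences containing the word
    idf = math.log10(len(corpus) / containing)                # inverse document frequency
    return tf * idf

# Vector for sentence 1: the TF-IDF value for every word of the data corpus.
sent1 = sentences[0]
print([round(tf_idf(w, sent1, sentences), 3) for w in vocab])
print(round(tf_idf("this", sent1, sentences), 3))   # ≈ 0.044, matching the worked example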
Regular Expressions are very popular among programmers and can be applied in many
programming languages like Java, JS, PHP, C++, etc. Regular Expressions are useful for
numerous practical day-to-day tasks that a data scientist encounters. They are one of the key
concepts of Natural Language Processing that every NLP expert should be proficient in.
Regular Expressions are used in various tasks such as data pre-processing, rule-based
information mining systems, pattern matching, text feature engineering, web scraping, data
extraction, etc.
What are Regular Expressions?
Regular expressions (RegEx) are sequences of characters that define a search pattern and are used to
find such patterns embedded in text. Let’s consider an example: suppose we have a list of friends:
Sunil, Ankit, Surjeet, Sumit, and Surabhi.
And if we want to select only those names on this list which match a certain pattern, such as
the names having the first two letters S and U, followed by only three positions that can be
taken up by any letter. Which names do you think fit this criterion? Let’s go one by one:
the name Sunil and Sumit fit this criterion as they have S and U in the beginning and three more
letters after that, while the rest of the names do not follow the given criterion: Ankit starts
with the letter A, whereas Surjeet and Surabhi have more than three characters after
S and U.
What we have done here is that we have a pattern(or criteria) and a list of names and we’re
trying to find the name that matches the given pattern. That’s exactly how regular expressions
work.
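A quick sketch of that exact check in Python with the built-in re module (the pattern and the list come from the example above; re.fullmatch is used so the whole name must match the pattern):

import re

friends = ["Sunil", "Ankit", "Surjeet", "Sumit", "Surabhi"]

# S, then u, then exactly three more characters of any kind.
pattern = r"Su..."

matches = [name for name in friends if re.fullmatch(pattern, name)]
print(matches)   # ['Sunil', 'Sumit']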
In RegEx, we’ve different types of patterns to recognize different strings of characters. Let’s
understand these terms in a bit more detail but first understand the concept of Raw Strings.
Now let’s start with the concept of Raw Strings. A Python raw string treats the backslash (\) as a
literal character. Let’s look at an example to understand. Suppose we have a path containing a
couple of backslashes:
path = "C:\desktop\nathan"
print("string:", path)
When this is printed, the \n is interpreted as a newline character and disappears from the path. This
is not what we want. Here we use the "r" prefix so that the backslashes are treated literally:
path = r"C:\desktop\nathan"
print("raw string:",path)
As you can see, we get the entire path printed out correctly by simply putting "r" in front of the string.
It is always recommended to use raw string while dealing with Regular expressions.
Python has a built-in module to work with regular expressions called “re”. Some commonly used
functions of the re module are:
● re.match()
● re.search()
● re.findall()
1. re.match(pattern, string)
The re.match function returns a match object on success and none on failure.
import re
result = re.match(r"Analytics", "Analytics Vidhya is the largest data science community of India")
print(result)
Here Pattern = ‘Analytics’ and String = ‘Analytics Vidhya is the largest data science
community of India’. Since the pattern is present at the beginning of the string, we got
a match object as the output. And since the output of re.match is an object, we
will use the group() function of the match object to get the matched expression.
print(result.group())
As you can see, we got our required output, 'Analytics', using the group() function. Now let us have a
look at a case where the pattern is not at the beginning of the string:
result = re.match(r"largest", "Analytics Vidhya is the largest data science community of India")
print(result)
Here, as you can notice, our pattern (largest) is not present at the beginning of the string, hence
re.match() returned None as the output.
2. re.search(pattern, string)
Matches the first occurrence of a pattern in the entire string(and not just at the beginning).
# search for the pattern "founded" in a given string
str = "Analytics Vidhya was founded in 2013. It was founded to help people learn data science"
result = re.search(r"founded", str)
print(result.group())
Since our pattern (founded) is present in the string, re.search() has matched its first occurrence
and returned the match.
3. re.findall(pattern, string)
It will return all the occurrences of the pattern from the string. I would recommend using
re.findall() in most cases, since it can serve the purpose of both re.match() and re.search().
x = re.findall(r"founded", str)
print(x)
Since we’ve ‘founded’ twice here in the string, re.findall() has printed it out twice in the output.
Now we’re going to look at some special sequences that come up with Regular expressions.
These are used to extract a different kind of information from a given text. Let’s take a look at
them-
1. \b
\b returns a match where the specified pattern is at the beginning or at the end of a word.
str = r'Analytics Vidhya is the largest Analytics community of India'
x = re.findall(r"est\b", str)
print(x)
As you can see it returned the last three characters of the word “largest”.
2. \d
\d returns a match where the string contains digits (numbers from 0-9). Let's use a string that
contains some digits:
str = "2 million monthly visits in Jan'19"
x = re.findall("\d", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
This function has generated all the digits from the string, i.e. 2, 1, and 9, separately. But is
this what we want? 1 and 9 were together in the string (as "19"), but in our output we got 1
and 9 separated. Let’s see how we can get our desired output:
x = re.findall("\d+", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
We can solve this problem by using the ‘+’ sign. Notice how we used ‘\d+’ instead of ‘\d’.
Adding ‘+’ after ‘\d’ will continue to extract digits till we encounter a non-digit character. We can
infer that \d+ matches one or more occurrences of \d until a non-matching character is found.
3. \D
\D returns a match where the string does not contain any digit. It is basically the
opposite of \d.
#Check if the word character does not contain any digits (numbers from 0-9):
x = re.findall("\D", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
We’ve got all the strings where there are no digits. But again we are getting individual
characters as output, and like this they really don’t make sense. By now I believe you know how
to fix this:
#Check if the word does not contain any digits (numbers from 0-9):
x = re.findall("\D+", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
Bingo! use \D+ instead of just \D to get characters that make sense.
4. \w
\w helps in the extraction of alphanumeric characters only (characters from a to Z, digits from 0-9,
and the underscore _ character).
#returns a match at every word character (characters from a to Z, digits from 0-9, and the
underscore _ character)
x = re.findall("\w+",str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
5. \W
#returns a match at every NON word character (characters NOT between a and Z. Like "!", "?" white-
space etc.):
x = re.findall("\W", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
1- (.) matches any single character
str = "rohan and rohit recently published a research paper"
#Search for "ro" followed by any characters
x = re.findall("ro.", str) #searches one character after ro
x2 = re.findall("ro...", str) #searches three characters after ro
print(x)
print(x2)
We got “roh” and “roh” as our first output since we used only one dot after “ro”. Similarly, “rohan”
and “rohit” as our second output since we used three dots after “ro” in the second statement.
2- (^) starts with
It checks whether the string starts with the given pattern or not.
str = "Data Science"
x = re.findall("^Data", str)
if (x):
print("Yes, the string starts with 'Data'")
else:
print("No match")
This caret(^) symbol checked whether the string started with “Data” or not. And since
our string is starting with the word Data, we got this output. Let’s check the other case
as well-
# try with a different string
str2 = "Big Data"
x2 = re.findall("^Data", str2)
if (x2):
print("Yes, the string starts with 'data'")
else:
print("No match")
Here in this case you can see that the new string is not starting with the word “Data”, hence we
got No match.
3- ($) ends with
It checks whether the string ends with the given pattern or not.
x = re.findall("Science$", str)
if (x):
print("Yes, the string ends with 'Science'")
else:
print("No match")
The dollar($) sign checks whether the string ends with the given pattern or not. Here, our
pattern is Science and since the string ends with Science we got this output.
4- (*) matches for zero or more occurrences of the pattern to the left of it
str = "easy easssy eay ey"
#Check if the string contains "ea" followed by 0 or more "s" characters and ending with y
x = re.findall("eas*y", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
The above code block basically checks if the string contains the pattern "eas*y", which means "ea"
followed by zero or more occurrences of "s" and ending with "y". We got these three strings as
output, "easy", "easssy", and "eay", because they match the given pattern. But the string "ey"
does not match, since it is missing the "a".
5- (+) matches one or more occurrences of the pattern to the left of it
x = re.findall("eas+y", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
One major difference between * and + is that + checks for one or more occurrences of the
pattern to the left of it. Like in the above example, we got "easy" and "easssy" as output but not
"eay" and "ey", because "eay" does not contain any instance of the character "s" and "ey" is missing
both "a" and "s".
6- (?) matches zero or one occurrence of the pattern to the left of it
x = re.findall("eas?y", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
The question mark(?) looks for zero or one occurrence of the pattern to the left of it. That is why
we got “easy” and “eay” as our output since only these two strings contains one and zero
occurrence of the character “s” respectively, along with the pattern starting with “ea” and ending
with “y”.
7- (|) either or
str = "Analytics Vidhya is the largest data science community of India"
x = re.findall("data|India", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
The pipe(|) operator checks whether any of the two patterns, to its left and right, is present in the
String or not. Here in the above example, we’re checking the String either contains data or
India. Since both of them are present in the String, we got both as the output.
Let’s look at another example, this time with a string that contains only one of the two patterns:
str = "Analytics Vidhya is the largest data science community"
x = re.findall("data|India", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
Here the pattern is the same but the String contains only “data” and hence we got only [‘data’]
as the output.
What is lexicon?
The lexicon refers to the collection of words, phrases, or symbols in a specific
language. It encompasses the vocabulary of a language and includes various
linguistic attributes associated with each word, such as part-of-speech tags,
semantic information, pronunciation, and more. It serves as a comprehensive
repository of linguistic knowledge, enabling NLP systems to process and
understand natural language text.
Components of a Lexicon
A lexicon comprises several components that provide rich information about
words and their properties. These components include:
1. Words and their Meanings:
The core component of a lexicon is the listing of words, each associated
with its corresponding meaning(s). This provides the fundamental
building blocks for language understanding.
2. Part-of-Speech (POS) Tags:
POS tags assign grammatical categories to words, such as noun, verb,
adjective, adverb, and more. POS tags play a vital role in syntactic
analysis and help disambiguate word meanings based on their context.
3. Pronunciation:
Lexicons often include information about the pronunciation of words,
helping in tasks such as text-to-speech synthesis and speech
recognition.
4. Semantic Information:
Some lexicons include semantic attributes associated with words, such
as word senses, synonyms, antonyms, and hypernyms. These semantic
relationships enable algorithms to infer deeper meaning from text.
Lexical Semantics
Lexical Processing encompasses various techniques and methods used to
handle and analyze words or lexemes in natural language. It involves tasks
such as normalizing word forms, disambiguating word meanings, and
establishing translation equivalences between different languages. Lexical
processing is an essential component in many language-related applications,
including information retrieval, machine translation, natural language
understanding, and text analysis.
1. Sentiment Analysis:
Lexicons play a crucial role in sentiment analysis, where the goal is to
determine the sentiment expressed in a given text. Lexicons contain
sentiment scores or polarity labels associated with words. For example,
the word "happy" might have a positive sentiment score, while "sad"
could have a negative sentiment score. Implementations involve using
lexicons to assign sentiment scores to words in a text and aggregating
them to determine the overall sentiment of the text.
2. Text Classification:
Lexicons serve as valuable resources for text classification tasks.
Lexicons can provide features for classification algorithms, aiding in
better feature representation and decision-making. For example, a
lexicon might contain words associated with specific topics or domains.
Implementations involve incorporating lexicon-based features into
classification algorithms to improve the accuracy of text classification.
3. Machine Translation:
Lexicons are utilized in machine translation systems to provide
translation equivalents for words or phrases. For example, a lexicon
might contain mappings between English and French words.
Implementations involve leveraging the lexicon to translate words or
phrases during the translation process.
4. Word Sense Disambiguation:
Lexicons with semantic information aid in word sense disambiguation,
where the correct meaning of a word in a specific context needs to be
determined. For example, a lexicon might contain multiple senses of the
word "bank" (financial institution vs. river bank). Implementations involve
using the lexicon to disambiguate the correct sense based on the
context in which the word appears.
5. Named Entity Recognition (NER):
Lexicons are used in NER to identify and classify named entities such
as person names, locations, organizations, etc. For example, a lexicon
might contain a list of known organization names. Implementations
involve matching the words in a text with the entries in the lexicon to
identify and extract named entities.
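As a concrete illustration of the lexicon-based sentiment analysis described in point 1 above, here is a minimal sketch using NLTK's VADER lexicon; the example sentences are my own.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")   # one-time download of the VADER sentiment lexicon

sia = SentimentIntensityAnalyzer()

# Each word in the lexicon carries a sentiment score; the analyzer aggregates
# them (with a few heuristics) into negative/neutral/positive/compound scores.
print(sia.polarity_scores("I am very happy with this phone"))
print(sia.polarity_scores("The battery life is disappointing"))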
Noun –
A noun is a word that names a person, place, thing, state, or quality. It can be
singular or plural. Nouns are a part of speech.
A noun is a type of word that stands for either a real thing or an idea. This can include living beings,
locations, actions, characteristics, conditions, and concepts. Nouns can act as either the subject or
the object in a sentence, phrase, or clause.
Pronoun –
A pronoun is a word that is used in place of a noun, such as he, she, it, they, or this.
● Sentences: She is a doctor, It is the best policy, They are playing outside.
Adjective –
An adjective is a word that describes or modifies a noun or pronoun.
● Sentences: Supercars are expensive, The red chair is for kids, Ram is intelligent.
Verb –
A verb is a word that expresses an action, occurrence, or state of being.
● Sentences: He runs every morning, She is writing poems.
Adverb –
An adverb is a type of word that usually changes or adds to the meaning of a verb, an adjective,
another adverb, a determiner, a clause, a preposition, or even a whole sentence. Adverbs often
describe how something is done, where it happens, when it occurs, how often it takes place, and to
what degree or certainty. They help answer questions like how, in what way, when, where, and to
what extent.
● Function: Describes a verb, adjective, or adverb
Preposition –
A preposition is a word that shows the relationship between a noun or pronoun and other words in
a sentence, such as in, on, at, by, or with.
● Sentences: The book is on the table, She lives in Delhi.
Conjunction –
A conjunction is a word that joins words, phrases, or clauses, such as and, but, though, or after.
● Sentences: I don’t have a car but I know how to drive, She failed the exam
though she worked hard, He will come after he finishes his match.
Interjection –
An interjection is a word that expresses a sudden emotion or feeling, such as oh, wow, or alas.
● Sentences: Oh! I failed again, Wow! I got the job, Alas! She is no more.
These are the main parts of speech, but there are additional subcategories
and variations within each. Understanding the different parts of speech can
help construct grammatically correct sentences and express ideas clearly.
POS tagging could be the very first task in text processing for further downstream tasks in
NLP, like speech recognition, parsing, machine translation, sentiment analysis, etc.
The particular POS tag of a word can be used as a feature by various Machine Learning
algorithms used in Natural Language Processing.
Introduction
Simply put, in parts-of-speech tagging for English text, we are given a sequence of English
words and we need to identify the part of speech of each word. For example, for the sentence
"Learn NLP from Scaler":
Learn -> VERB, NLP -> NOUN, from -> PREPOSITION, Scaler -> NOUN
Although it seems easy, Identifying the part of speech tags is much more complicated than
simply mapping words to their part of speech tags.
Why Difficult ?
Words often have more than one POS tag. Let’s understand this with an easy example using the
word "back":
● The back door -> back is an ADJECTIVE
● On my back -> back is a NOUN
● Win the voters back -> back is an ADVERB
● Promised to back the bill -> back is a VERB
The relationship of “back” with adjacent and related words in a phrase, sentence, or
paragraph changes its POS tag.
It is quite possible for a single word to have a different part of speech tag in different
sentences based on different contexts. That is why it is very difficult to have a generic
mapping for POS tags.
If it is difficult, then what approaches do we have?
Before discussing the tagging approaches, let us literate ourselves with the required
knowledge about the words, sentences, and different types of POS tags.
Word Classes
In grammar, a part of speech or part-of-speech (POS) is known as word class or
grammatical category, which is a category of words that have similar grammatical
properties.
The English language has four major word classes: Nouns, Verbs, Adjectives, and Adverbs.
Commonly listed English parts of speech are nouns, verbs, adjectives, adverbs, pronouns,
prepositions, conjunction, interjection, numeral, article, and determiners.
Closed Class
Closed classes are those with a relatively fixed number of words, and we rarely add new
words to these POS, such as prepositions. Closed class words are generally functional
words like of, it, and, or you, which tend to be very short, occur frequently, and often have
structuring uses in grammar.
Determiners: a, an, the
Pronouns: she, he, I, others
Prepositions: on, under, over, near, by, at, from, to, with
Open Class
Open Classes are mostly content-bearing, i.e., they refer to objects, actions, and features; it's
called open classes since new words are added all the time.
By contrast, nouns and verbs, adjectives, and adverbs belong to open classes; new nouns
and verbs like iPhone or to fax are continually being created or borrowed.
Tagset
The problem is (as discussed above) many words belong to more than one word class.
And to do POS tagging, a standard tag set needs to be chosen. We could pick a very
simple/coarse tagset such as Noun (NN), Verb (VB), Adjective (JJ), Adverb (RB), etc.
But to make the tags less ambiguous, the commonly used set is a finer-grained one, the University
of Pennsylvania’s Penn Treebank tagset, which has a total of 45 tags.
Tagging is a disambiguation task; words are ambiguous, i.e. they have more than one possible
part of speech, and the goal is to find the correct tag for the situation.
For example, a book can be a verb (book that flight) or a noun (hand me that book).
The goal of POS tagging is to resolve these ambiguities, choosing the proper tag for the
context.
The accuracy of existing State of the Art algorithms of part-of-speech tagging is extremely
high. The accuracy can be as high as ~ 97%, which is also about the human performance on
this task, at least for English.
We’ll discuss algorithms/techniques for this task in the upcoming sections, but first, let’s
explore the task. Exactly how hard is it?
Let's consider one of the popular electronic collections of text samples, Brown Corpus. It is
a general language corpus containing 500 samples of English, totaling roughly one million
words.
In the Brown Corpus, the great majority of word types are unambiguous and have only one possible
tag, but the remaining ambiguous word types are so frequent that they account for a large share of
the word tokens in running text.
Particularly ambiguous common words include that, back, down, put, and set.
The word back itself can have 6 different parts of speech (JJ, NN, VBP, VB, RP, RB)
depending on the context.
Nonetheless, many words are easy to disambiguate because their different tags aren’t
equally likely. For example, "a" can be a determiner or the letter "a", but the determiner
sense is much more likely.
This idea suggests a useful baseline, i.e., given an ambiguous word, choose the tag which is
most frequent in the corpus.
Let’s explore some common baseline and more sophisticated POS tagging techniques.
Rule-Based Tagging
Rule-based tagging is the oldest tagging approach where we use contextual information to
assign tags to unknown or ambiguous words.
The rule-based approach uses a dictionary to get possible tags for tagging each word. If the
word has more than one possible tag, then rule-based taggers use hand-written rules to
identify the correct tag.
Since rules are usually built manually, therefore they are also called Knowledge-driven
taggers. We have a limited number of rules, approximately around 1000 for the English
language.
Drawbacks of the rule-based approach:
● High development cost and high time complexity when applying it to a large corpus of
text
● Defining a set of rules manually is an extremely cumbersome process and is not
scalable at all
Stochastic POS Tagging
This tagger can use techniques like word frequency measurements and tag sequence
probabilities. It can either use one of these approaches or a combination of both. Let’s
discuss these techniques in detail.
1. Word frequency measurements: Suppose we need to tag the word “play” in a given sentence.
The word frequency method checks the most frequently used POS tag for “play” in the tagged
training corpus. Let’s say this most frequent POS tag happens to be VERB; then we assign the
POS tag of “play” = VERB. The main drawback of this approach is that it can yield invalid
sequences of tags.
2. Tag sequence probabilities: Here the tag assigned to a word depends on the tags of the
preceding words. Suppose that, for the current context in the training corpus, 10 tag sequences
have NOUN as the POS of the next word and 90 sequences have VERB; then the next word, say
w4, is tagged as w4 = VERB. The main drawback of this technique is that sometimes the
predicted sequence is not grammatically correct.
Now let’s discuss some properties and limitations of the Stochastic tagging approach :
1. This POS tagging is based on the probability of the tag occurring (either solo or in
sequence)
2. It requires labeled corpus, also called training data in the Machine Learning lingo
3. There would be no probability for the words that don’t exist in the training data
4. It uses a different testing corpus (unseen text) other than the training corpus
5. It is the simplest POS tagging because it chooses the most frequent tags associated
with a word in the training corpus
Transformation-Based Learning (TBL) Tagging
In layman's terms, the algorithm keeps searching for a new best set of rules, with a labeled
corpus as input, until its accuracy saturates on the labeled corpus.
The input to the algorithm is:
● a tagged corpus
● a dictionary of words with their most frequent tags
Step 1: Label every word with its most likely tag. For example, in the phrase “is expected to race
tomorrow”, consider a candidate rule “change NOUN to VERB when the previous tag is TO”:
● First tag race with NOUN (since its probability of being NOUN is 98%)
● Then apply the above rule and retag the POS of race with VERB (since the tag just
before the “race” word is TO)
Step 2: Check every possible transformation and select the one which most improves tagging
accuracy. Similar to the above sample rule, other possible (and perhaps worse) transformation rules
could be “change NOUN to VERB when the previous word is a modal” or “change VERB to NOUN
when the previous tag is a determiner”.
Step 3: Retag the corpus by applying the selected rule and add that rule to the learned rule set.
Repeat Steps 1, 2, and 3 as many times as needed until accuracy saturates or you reach some
predefined accuracy cutoff.
Advantages
● We can learn a small set of simple rules, and these rules are decent enough for basic
POS tagging
● Development, as well as debugging, is very easy in TBL because the learned rules
are easy to understand
● Complexity in tagging is reduced because, in TBL, there is a cross-connection
between machine-learned and human-generated rules
Drawbacks
Despite being a simple and somewhat effective approach to POS tagging, TBL has major
disadvantages.
● TBL algorithm training/learning time complexity is very high, and time increases
multi-fold when corpus size increases
● TBL does not provide tag probabilities
Hidden Markov Model (HMM) Tagging
A Hidden Markov Model (HMM) tagger treats the POS tags as hidden states and the words as
observations; it chooses the tag sequence that maximizes the product of tag transition
probabilities (a tag given the previous tag) and emission probabilities (a word given its tag).
This makes the HMM model a good and reliable probabilistic approach to finding POS tags for
a sequence of words.
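A small sketch of training a supervised HMM tagger with NLTK on the Penn Treebank sample that ships with NLTK; the corpus slice size and the test sentence are arbitrary choices for illustration.

import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

nltk.download("treebank")   # a small sample of the Penn Treebank included with NLTK

# Tagged sentences are lists of (word, tag) pairs used as supervised training data.
train_sents = treebank.tagged_sents()[:3000]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_sents)

# Tag an unseen sentence; words not seen in training may get unreliable tags.
print(tagger.tag("the book is on the table".split()))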
Reference:
1. https://www.peppercontent.io/blog/tracing-the-evolution-of-nlp/
2. https://www.geeksforgeeks.org/phases-of-natural-language-processing-nlp/
3. https://www.geeksforgeeks.org/major-challenges-of-natural-language-processing/
4. https://medium.com/womenintechnology/understanding-ambiguities-in-natural-language-processing-179212a23b55
5. https://bishalbose294.medium.com/nlp-text-encoding-a-beginners-guide-fa332d715854
6. https://www.analyticsvidhya.com/blog/2021/03/beginners-guide-to-regular-expressions-in-natural-language-processing/
7. https://iq.opengenus.org/lexicon-in-nlp/
8. https://www.geeksforgeeks.org/parts-of-speech/
9. https://www.scaler.com/topics/nlp/word-classes-and-part-of-speech-tagging-in-nlp/
10. https://ebooks.inflibnet.ac.in/engp13/chapter/phrase-structure-np/
11. https://www.kdnuggets.com/2018/08/understanding-language-syntax-and-structure-practitioners-guide-nlp-3.html