Module-1: Introduction to Natural Language Processing

Natural language processing:

Natural language processing or NLP is a subfield of natural languages and computer science
that studies the interactions between human language and computer systems. The field is also
known as computational linguistics and artificial intelligence in the linguistic domain. NLP
primarily relates to applications of natural language processing in languages like English or
French primarily for use by humans. But with NLP evolution, there are new potential
applications for natural language processing in fields such as law enforcement analysis with
criminal profiles, medical diagnosis, and treatment with personalized medicine dashboards.

What is Natural Language Processing?

Human language is a very complex and unique ability that only humans possess. There are
thousands of human languages with millions of words in our vocabularies, where several words
have multiple meanings, which further complicates matters. Computers can perform several
high-level tasks, but the one thing they have lacked is the ability to communicate like human
beings. NLP is an interdisciplinary field of artificial intelligence and linguistics that bridges this
gap between computers and natural languages.

There are infinite possibilities for arranging words in a sentence. It is essentially impossible to
form a database of all sentences from a language and feed it to computers. Even if possible,
computers could not understand or process how we speak or write; language is unstructured to
machines.

Therefore, it is essential to convert sentences into a structured form that computers can understand. Often there are words with multiple meanings (a dictionary is not sufficient to resolve this ambiguity, so computers also need to learn grammar), and the pronunciation of words also differs from region to region.

NLP's function is to translate between structured and unstructured text, thus helping machines understand human language. Going from the unstructured to the structured form (transforming natural language into an informative representation) is called natural language understanding (NLU). Going from structured to unstructured (producing meaningful phrases from an internal representation) is known as natural language generation (NLG).

Stages of NLP:
The first stage is called tokenization. A string of words or sentences is broken down into
components or tokens. This retains the essence of each word in the text.

The next step is stemming, where the affixes are removed from the words to derive the stem.
For example, “runs,” “ran,” and “running” all have the same stem, “run.”

Lemmatization is the next stage. The algorithm looks up the meaning of a word in a dictionary and determines its root word (lemma) in the relevant context. For example, the root of "better" is not "bet" but "good."

Several words have multiple meanings that depend on the context of the text. For instance, in the phrase "give me a call," "call" is a noun, but in "call the doctor," "call" is a verb. NLP analyzes the position and context of each token to derive the correct meaning of the word; this is called part-of-speech tagging.

The next stage is known as named entity recognition. In this stage, the algorithm identifies the entity associated with a token. For example, the token "London" is associated with a location, and "Google" with an organization. Chunking is the final stage of natural language processing; it picks individual pieces of information and groups them into larger, more meaningful parts. All of these functions can be performed with NLTK, the Natural Language Toolkit, a Python library widely used for NLP and text analysis.
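Here is a minimal sketch of these stages using NLTK (assuming the library is installed and its data packages, such as punkt, the POS tagger, WordNet, and the NE chunker, have been downloaded; exact outputs may vary by NLTK version):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time data download (uncomment on first run):
# nltk.download(["punkt", "averaged_perceptron_tagger", "wordnet", "maxent_ne_chunker", "words"])

text = "Google opened a new office in London while John was running home."

# 1. Tokenization: break the text into tokens
tokens = nltk.word_tokenize(text)

# 2. Stemming: strip affixes to get the stem (e.g. "running" -> "run")
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# 3. Lemmatization: dictionary-based root form ("better" -> "good" when tagged as an adjective)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))

# 4. Part-of-speech tagging: label each token with its grammatical category
tagged = nltk.pos_tag(tokens)

# 5. Named entity recognition and chunking: group tokens into larger units such as entities
print(nltk.ne_chunk(tagged))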

The Evolution of Natural Language Processing

History of NLP:

The evolution of NLP is an ongoing process. The earliest work of NLP started as machine
translation, which was simplistic in approach. The idea was to convert one human language into
another, and it began with converting Russian into English. This led to converting human
language into computer language and vice versa. In 1952, Bell Labs created Audrey, the first
speech recognition system. It could recognize all ten numerical digits. However, it was
abandoned because it was faster to input telephone numbers with a finger. In 1962, IBM demonstrated Shoebox, a shoebox-sized machine capable of recognizing 16 words.

Carnegie Mellon University developed Harpy, with DARPA funding, in the 1970s. It was the first system to
recognize over a thousand words. The evolution of natural language processing gained
momentum in the 1980s when real-time speech recognition became possible due to
advancements in computing performances. There was also innovation in algorithms for
processing human languages, which discarded rigid rules and moved to machine learning
techniques that could learn from existing data of natural languages. Earlier chatbots were rule-based: experts would encode rules mapping what a user might say to what an appropriate reply should be. However, this was a tedious process that yielded limited possibilities.

An early example of rule-based NLP was ELIZA, created at MIT in the mid-1960s. ELIZA used simple pattern-matching rules to identify keywords in the written text, which it would then turn around and ask the user about. Of course, NLP has evolved greatly over the last fifty years. The branches of
computational grammar and statistics gave NLP a different direction, giving rise to statistical
language processing and information extraction fields.

Current trends in NLP :

With the evolution of NLP, speech recognition systems are using deep neural networks.
Different vowels or sounds have different frequencies, which are discernible on a spectrogram.

This allows computers to recognize spoken vowels and words. Each sound is called a
phoneme, and speech recognition software knows what these phonemes look like. Along with
analyzing different words, NLP helps discern where sentences begin and end. And ultimately,
speech is converted to text.

Speech synthesis gives computers the ability to output speech. However, these sounds are
discontinuous and seem robotic. While this was very prominent in the hand-operated machine
from Bell Labs, today’s computer voices like Siri and Alexa have improved.

We are now seeing an explosion of voice interfaces on phones and cars. This creates a positive
feedback loop with people using voice interaction more often, which gives companies more data
to work on. This enables better accuracy, leading to people using voice more, and the loop
continues.

NLP evolution has happened by leaps and bounds in the last decade. NLP integrated with deep
learning and machine learning has enabled chatbots and virtual assistants to carry out
complicated interactions.
Chatbots now operate beyond the domain of customer interactions. They can handle human
resources and healthcare, too. NLP in healthcare can monitor treatments and analyze reports
and health records. Cognitive analytics and NLP are combined to automate routine tasks.

Various NLP Algorithms:

The evolution of NLP has happened with time and advancements in language technology. Data
scientists developed some powerful algorithms along the way; some of them are as follows:

Bag of words: This model counts the frequency of each unique word in an article. This is done to train machines to understand the similarity of words. However, there can be millions of individual words across millions of documents, so maintaining such vast vocabularies is practically unmanageable.

TF-IDF: TF (term frequency) is calculated as the number of times a certain term appears divided by the number of terms present in the document, while IDF (inverse document frequency) down-weights terms that appear in many documents. This scheme effectively discounts "stop words" like "is," "a," "the," etc.

Co-occurrence matrix: This model was developed since the previous models could not solve the
problem of semantic ambiguity. It tracked the context of the text but required a lot of memory to
store all the data.
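As a small illustration of the idea, here is a minimal sketch of counting word-word co-occurrences within a fixed context window (pure Python, with a made-up toy corpus):

from collections import defaultdict

corpus = ["this is a good phone", "this is a bad mobile", "she is a good cat"]
window = 2  # words within this distance of each other count as co-occurring

cooc = defaultdict(int)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[(w, words[j])] += 1

print(cooc[("good", "phone")])  # how often "good" appears near "phone"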

Transformer models: These encoder-decoder models use an attention mechanism that lets machines focus on the most relevant parts of the input, much as human attention does, and they train faster than earlier architectures. BERT, developed by Google on this architecture, has been phenomenal in revolutionizing NLP.

Carnegie Mellon University and Google developed XLNet, another attention-network-based model that has reportedly outperformed BERT on 20 tasks. BERT has dramatically improved search results in browsers. Megatron and GPT-3 are also based on this architecture and are used in areas such as speech synthesis and image processing.

In this encoder-decoder model, the encoder tells the machine what it should think and
remember from the text. The decoder uses those thoughts to decide the appropriate reply and
action.

For example, consider the sentence "I would like some strawberry ___." The ideal words for this blank would be "cake" or "milkshake." Here the encoder focuses on the word "strawberry," and the decoder pulls the right word from a cluster of terms related to strawberry.
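As an illustration of this idea, a pretrained masked language model can fill in such a blank. Here is a minimal sketch using the Hugging Face transformers library (assuming it is installed and a pretrained model such as bert-base-uncased can be downloaded):

from transformers import pipeline

# The model predicts the most likely words for the [MASK] position from context
fill = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill("I would like some strawberry [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))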

Future Predictions of NLP:

NLP evolves every minute as more and more unstructured data is accumulated. So, there is no
end to the evolution of natural language processing.

As more and more data is generated, NLP will take over to analyze, comprehend, and store the
data. This will help digital marketers analyze gigabytes of data in minutes and strategize
marketing policies accordingly.

NLP concerns itself with human language. However, NLP evolution will eventually bring into its
domain non-verbal communications, like body language, gestures, and facial expressions.
To analyze non-verbal communication, NLP must be able to use biometrics like facial recognition and retina scanning. Just as NLP is adept at understanding the sentiments behind
sentences, it will eventually be able to read the feelings behind expressions. If this integration
between biometrics and NLP happens, the interaction between humans and computers will take
on a whole new meaning.

The next massive step in AI is the creation of humanoid robots by integrating NLP with biometrics. Through robots, computer-human interaction will move toward computer-human communication. Virtual assistants do not even begin to cover the scope of NLP in the future. When coupled with advancements in biometrics, NLP evolution can create robots that can see, touch, hear, and speak much like humans. NLP will shape the communication technologies of the future.

Importance of Natural Language Processing:

NLP solves the root problem of machines not understanding human language. With its
evolution, NLP has surpassed traditional applications, and AI is being used to replace human
resources in several domains.

Let’s look at the importance of NLP in today’s digital world:


“Machine translation” is a significant application of NLP. NLP is behind the widely used
Google Translate, which converts one language into another in real-time. It assists
computers in understanding the context of sentences and the meaning of words.

Virtual assistants like Cortana, Siri, and Alexa are boons of NLP evolution. These assistants
comprehend what you say, give befitting replies, or take appropriate actions, and do all this
through NLP. Intelligent chatbots are taking the world of customer service by storm. They are
replacing human assistants and conversing with customers just like humans do. They interpret the written text and decide on actions accordingly. NLP is the working mechanism behind such chatbots.
NLP also helps in sentiment analysis. It recognizes the sentiment behind posts. For instance, it
determines whether a review is positive, negative, serious or sarcastic. NLP mechanisms help
companies like Twitter remove tweets with foul language, etc.
NLP automatically sorts our emails into social, promotions, inbox, and spam categories. This
NLP task is known as text classification.
NLP is also important for spell checking, keyword research, and information extraction. Plagiarism checkers also run on NLP programs.
NLP also drives advertisement recommendations. It matches advertisements with our history.
NLP helps machines understand natural languages and perform language-related tasks. It
makes it possible for computers to analyze more language-based data than humans.

It is impossible to comprehend these staggering volumes of unstructured data by conventional means. This is where NLP comes in. The evolution of NLP has enabled machines to structure and analyze text data tirelessly.

A language has millions of words, several dialects, and thousands of grammatical and structural rules. It is essential to comprehend the syntactic and semantic context of human text, which computers cannot do without NLP.

Phases/Stages of Natural Language Processing (NLP)
Natural Language Processing (NLP) is a field within artificial intelligence that allows computers
to comprehend, analyze, and interact with human language effectively. The process of NLP can
be divided into five distinct phases: Lexical Analysis, Syntactic Analysis, Semantic Analysis,
Discourse Integration, and Pragmatic Analysis. Each phase plays a crucial role in the overall
understanding and processing of natural language.

Phase of Natural Language Processing

First Phase of NLP: Lexical and Morphological Analysis

1. Tokenization

The lexical phase in Natural Language Processing (NLP) involves scanning text and breaking it
down into smaller units such as paragraphs, sentences, and words. This process, known as
tokenization, converts raw text into manageable units called tokens or lexemes. Tokenization is
essential for understanding and processing text at the word level.

In addition to tokenization, various data cleaning and feature extraction techniques are applied,
including:

i) Lemmatization: Reducing words to their base or root form.

ii) Stopwords Removal: Eliminating common words that do not carry significant meaning, such as "and," "the," and "is."

iii) Correcting Misspelled Words: Ensuring the text is free of spelling errors to maintain accuracy.

These steps enhance the comprehensibility of the text, making it easier to analyze and process.
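Here is a minimal sketch of the tokenization and cleaning steps with NLTK (assuming the library is installed along with its punkt, stopwords, and wordnet data; spelling correction is left out, since it usually needs a dedicated tool):

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# nltk.download(["punkt", "stopwords", "wordnet"])  # one-time data download

text = "The cats were running quickly and the dogs were barking!"

tokens = nltk.word_tokenize(text.lower())                            # tokenization
tokens = [t for t in tokens if t not in string.punctuation]          # drop punctuation
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stopword removal

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])                     # lemmatization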

2. Morphological Analysis

Morphological analysis examines how words are built from morphemes, the smallest meaning-bearing units of a language.

Types of Morphemes
i) Free Morphemes: Text elements that carry meaning independently and make sense
on their own. For example, "bat" is a free morpheme.

ii) Bound Morphemes: Elements that must be attached to free morphemes to convey
meaning, as they cannot stand alone. For instance, the suffix "-ing" is a bound
morpheme, needing to be attached to a free morpheme like "run" to form "running."

Importance of Morphological Analysis

Morphological analysis is crucial in NLP for several reasons:

Understanding Word Structure: It helps in deciphering the composition of complex words.

Predicting Word Forms: It aids in anticipating different forms of a word based on its morphemes.

Improving Accuracy: It enhances the accuracy of tasks such as part-of-speech tagging, syntactic parsing, and machine translation.

By identifying and analyzing morphemes, the system can interpret text correctly at the
most fundamental level, laying the groundwork for more advanced NLP applications.
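A purely illustrative sketch of splitting a word into a free morpheme and a bound morpheme, using a tiny hypothetical suffix list (real morphological analyzers handle spelling rules and many more affixes):

# Hypothetical suffix inventory, for illustration only
SUFFIXES = ["ing", "ed", "est", "er", "s"]

def split_morphemes(word):
    # Return (stem, suffix) if a known bound morpheme is attached, else (word, None)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, None

print(split_morphemes("running"))  # ('runn', 'ing') -- naive split, no spelling rules applied
print(split_morphemes("bat"))      # ('bat', None)   -- a free morpheme on its own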

Second Phase of NLP: Syntactic Analysis (Parsing)

Syntactic analysis, also known as parsing, is the second phase of Natural Language Processing
(NLP). This phase is essential for understanding the structure of a sentence and assessing its
grammatical correctness. It involves analyzing the relationships between words and ensuring
their logical consistency by comparing their arrangement against standard grammatical rules.

1) Role of Parsing

Parsing examines the grammatical structure and relationships within a given text. It
assigns Parts-Of-Speech (POS) tags to each word, categorizing them as nouns, verbs,
adverbs, etc. This tagging is crucial for understanding how words relate to each other
syntactically and helps in avoiding ambiguity. Ambiguity arises when a text can be
interpreted in multiple ways due to words having various meanings. For example, the
word "book" can be a noun (a physical book) or a verb (the action of booking
something), depending on the sentence context.

Examples of Syntax

Consider the following sentences:

Correct Syntax: "John eats an apple."

Incorrect Syntax: "Apple eats John an."


Despite using the same words, only the first sentence is grammatically correct and makes
sense. The correct arrangement of words according to grammatical rules is what makes the
sentence meaningful.

2) Assigning POS Tags

During parsing, each word in the sentence is assigned a POS tag to indicate its
grammatical category. Here’s an example breakdown:

Sentence: "John eats an apple."

POS Tags:

John: Proper Noun (NNP)

eats: Verb (VBZ)

an: Determiner (DT)

apple: Noun (NN)

Assigning POS tags correctly is crucial for understanding the sentence structure and
ensuring accurate interpretation of the text.
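A minimal sketch of tagging this sentence with NLTK (assuming its punkt and tagger data are installed; the exact tags can vary slightly between tagger versions):

import nltk

# nltk.download(["punkt", "averaged_perceptron_tagger"])  # one-time data download

tokens = nltk.word_tokenize("John eats an apple.")
print(nltk.pos_tag(tokens))
# Expected roughly: [('John', 'NNP'), ('eats', 'VBZ'), ('an', 'DT'), ('apple', 'NN'), ('.', '.')]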

Importance of Syntactic Analysis

By analyzing and ensuring proper syntax, NLP systems can better understand and generate
human language. This analysis helps in various applications, such as machine translation,
sentiment analysis, and information retrieval, by providing a clear structure and reducing
ambiguity.

Third Phase of NLP: Semantic Analysis

Semantic Analysis is the third phase of Natural Language Processing (NLP), focusing on
extracting the meaning from text. Unlike syntactic analysis, which deals with grammatical
structure, semantic analysis is concerned with the literal and contextual meaning of words,
phrases, and sentences.

Semantic analysis aims to understand the dictionary definitions of words and their usage in
context. It determines whether the arrangement of words in a sentence makes logical sense.
This phase helps in finding context and logic by ensuring the semantic coherence of sentences.

Key Tasks in Semantic Analysis

1) Named Entity Recognition (NER): NER identifies and classifies entities within the text,
such as names of people, places, and organizations. These entities belong to predefined
categories and are crucial for understanding the text's content.
2) Word Sense Disambiguation (WSD): WSD determines the correct meaning of ambiguous
words based on context. For example, the word "bank" can refer to a financial institution or the
side of a river. WSD uses contextual clues to assign the appropriate meaning.
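A minimal sketch of word sense disambiguation using NLTK's implementation of the classic Lesk algorithm (assuming the wordnet data is installed; Lesk is only a simple baseline and can pick unexpected senses):

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

# nltk.download(["punkt", "wordnet"])  # one-time data download

sentence = "I went to the bank to deposit my money"
sense = lesk(word_tokenize(sentence), "bank")
print(sense, "-", sense.definition() if sense else "no sense found")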

Examples of Semantic Analysis

Consider the following examples:

Syntactically Correct but Semantically Incorrect: "Apple eats a John."

This sentence is grammatically correct but does not make sense semantically. An apple cannot
eat a person, highlighting the importance of semantic analysis in ensuring logical coherence.

Literal Interpretation: "What time is it?"

This phrase is interpreted literally as someone asking for the current time, demonstrating how
semantic analysis helps in understanding the intended meaning.

Importance of Semantic Analysis

Semantic analysis is essential for various NLP applications, including machine translation,
information retrieval, and question answering. By ensuring that sentences are not only
grammatically correct but also meaningful, semantic analysis enhances the accuracy and
relevance of NLP systems.

Fourth Phase of NLP: Discourse Integration

Discourse Integration is the fourth phase of Natural Language Processing (NLP). This phase
deals with comprehending the relationship between the current sentence and earlier sentences
or the larger context. Discourse integration is crucial for contextualizing text and understanding
the overall message conveyed.

Role of Discourse Integration

Discourse integration examines how words, phrases, and sentences relate to each other within
a larger context. It assesses the impact a word or sentence has on the structure of a text and
how the combination of sentences affects the overall meaning. This phase helps in
understanding implicit references and the flow of information across sentences.

Importance of Contextualization

In conversations and texts, words and sentences often depend on preceding or following
sentences for their meaning. Understanding the context behind these words and
sentences is essential to accurately interpret their meaning.

Example of Discourse Integration


Consider the following examples:

Contextual Reference: "This is unfair!"

To understand what "this" refers to, we need to examine the preceding or following sentences.
Without context, the statement's meaning remains unclear.

Anaphora Resolution: "Taylor went to the store to buy some groceries. She realized she forgot
her wallet."

In this example, the pronoun "she" refers back to "Taylor" in the first sentence. Understanding
that "Taylor" is the antecedent of "she" is crucial for grasping the sentence's meaning.

Application of Discourse Integration

Discourse integration is vital for various NLP applications, such as machine translation,
sentiment analysis, and conversational agents. By understanding the relationships and context
within texts, NLP systems can provide more accurate and coherent responses.

Fifth Phase of NLP: Pragmatic Analysis

Pragmatic Analysis is the fifth and final phase of Natural Language Processing (NLP), focusing
on interpreting the inferred meaning of a text beyond its literal content. Human language is often
complex and layered with underlying assumptions, implications, and intentions that go beyond
straightforward interpretation. This phase aims to grasp these deeper meanings in
communication.

Role of Pragmatic Analysis

Pragmatic analysis goes beyond the literal meanings examined in semantic analysis, aiming to
understand what the writer or speaker truly intends to convey. In natural language, words and
phrases can carry different meanings depending on context, tone, and the situation in which
they are used.

Importance of Understanding Intentions

In human communication, people often do not say exactly what they mean. For instance, the
word "Hello" can have various interpretations depending on the tone and context in which it is
spoken. It could be a simple greeting, an expression of surprise, or even a signal of anger.
Thus, understanding the intended meaning behind words and sentences is crucial.

Examples of Pragmatic Analysis

Consider the following examples:

Contextual Greeting: "Hello! What time is it?"


"Hello!" is more than just a greeting; it serves to establish contact.

"What time is it?" might be a straightforward request for the current time, but it could also imply
concern about being late.

Figurative Expression: "I'm falling for you."

The word "falling" literally means collapsing, but in this context, it means the speaker is
expressing love for someone.

Application of Pragmatic Analysis

Pragmatic analysis is essential for applications like sentiment analysis, conversational AI, and
advanced dialogue systems. By interpreting the deeper, inferred meanings of texts, NLP
systems can understand human emotions, intentions, and subtleties in communication, leading
to more accurate and human-like interactions.

10 Major Challenges of Natural Language Processing (NLP)


Natural Language Processing (NLP) faces various challenges due to the complexity and
diversity of human language. Let's discuss 10 major challenges in NLP:

1. Language differences

Human language is rich and intricate, and there are thousands of languages spoken around the world, each with its own grammar, vocabulary, and cultural nuances. No single person understands all of them, and the productivity of human language, its capacity to produce endless new utterances, is very high. Natural language is also ambiguous: the same words and phrases can have different meanings in different contexts, which is one of the major challenges in understanding natural language.

Natural languages have complex syntactic structures and grammatical rules covering word order, verb conjugation, tense, aspect, and agreement. They also carry rich semantic content that allows speakers to convey a wide range of meanings through words and sentences. Language is pragmatic, meaning that how it is used in context serves particular communication goals. Finally, human language evolves over time through processes such as lexical change.

2. Training Data
Training data is a curated collection of input-output pairs, where the input represents the
features or attributes of the data, and the output is the corresponding label or target.
Training data is composed of both the features (inputs) and their corresponding labels
(outputs). For NLP, features might include text data, and labels could be categories,
sentiments, or any other relevant annotations.

It helps the model generalize patterns from the training set to make predictions or
classifications on new, previously unseen data.

3. Development Time and Resource Requirements

The development time and resource requirements for Natural Language Processing (NLP) projects depend on several factors, including task complexity, the size and quality of the data, the availability of existing tools and libraries, and the expertise of the team involved. Here are some key points:

● Complexity of the task: Tasks such as text classification or sentiment analysis may require less time than more complex tasks such as machine translation or question answering.

● Availability and quality of data: NLP models require high-quality annotated data. Collecting, annotating, and preprocessing large text datasets can be time-consuming and resource-intensive, especially for tasks that require specialized domain knowledge or fine-grained annotations.

● Algorithm selection and model development: Choosing the machine learning algorithms best suited to a given NLP task is difficult.

● Training and evaluation: Training requires powerful computational resources (GPUs or TPUs) and time for iterating over algorithms. It is also important to evaluate model performance with suitable metrics and validation techniques to confirm the quality of the results.

4. Navigating Phrasing Ambiguities in NLP

Navigating phrasing ambiguities is a crucial aspect of NLP because of the inherent complexity of human language. Phrasing ambiguity arises when a phrase can be interpreted in multiple ways, leading to uncertainty about its meaning. Here are some key points for navigating phrasing ambiguities in NLP:

● Contextual understanding: Contextual information such as previous sentences, topic focus, or conversational cues can give valuable clues for resolving ambiguities.

● Semantic analysis: The content of the text is analyzed to find meaning based on words, lexical relationships, and semantic roles. Tools such as word sense disambiguation and semantic role labeling can help resolve phrasing ambiguities.

● Syntactic analysis: The syntactic structure of the sentence is analyzed to find the possible interpretations based on grammatical relationships and syntactic patterns.

● Pragmatic analysis: Pragmatic factors, such as the speaker's intentions and implicatures, are used to infer the meaning of a phrase. This analysis involves understanding the pragmatic context.

● Statistical methods: Statistical methods and machine learning models are used to learn patterns from data and predict the intended reading of a phrase.

5. Misspellings and Grammatical Errors


Overcoming misspellings and grammatical errors is a basic challenge in NLP, as these forms of linguistic noise can reduce the accuracy of understanding and analysis. Here are some key points for handling misspellings and grammatical errors in NLP:

● Spell checking: Implement spell-check algorithms and dictionaries to find and correct misspelled words.

● Text normalization: The text is normalized by converting it into a standard format, which may include converting text to lowercase, removing punctuation and special characters, and expanding contractions (see the sketch after this list).

● Tokenization: The text is split into individual tokens with the help of tokenization techniques. This makes it easier to identify and isolate misspelled words and grammatical errors so that they can be corrected.

● Language models: Language models trained on large corpora of data can predict whether a word or phrase is likely to be correct based on its context.
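Here is a minimal sketch of the normalization step described above (pure Python; the contraction map is a tiny hypothetical sample):

import re
import string

# Tiny illustrative contraction map; real systems use much larger lists
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def normalize(text):
    text = text.lower()                                    # lowercase
    for contraction, expanded in CONTRACTIONS.items():     # expand contractions
        text = text.replace(contraction, expanded)
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    return re.sub(r"\s+", " ", text).strip()               # collapse extra whitespace

print(normalize("It's a GREAT phone, don't you think?!"))
# -> "it is a great phone do not you think"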

6. Mitigating Innate Biases in NLP Algorithms

Mitigating innate biases in NLP algorithms is a crucial step toward ensuring fairness, equity, and inclusivity in natural language processing applications. Here are some key points for mitigating biases in NLP algorithms:

● Data collection and annotation: It is very important to confirm that the training data used to develop NLP algorithms is diverse, representative, and free from biases.

● Bias detection and analysis: Apply bias detection and analysis methods to the training data to find biases based on demographic factors such as race, gender, and age.

● Data preprocessing: Preprocessing the training data is an important way to mitigate biases, for example by debiasing word embeddings, balancing class distributions, and augmenting underrepresented samples.

● Fair representation learning: NLP models are trained to learn fair representations that are invariant to protected attributes such as race or gender.

● Model auditing and evaluation: NLP models are evaluated for fairness and bias with the help of metrics and audits. They are tested on diverse datasets, and post-hoc analyses are performed to find and mitigate innate biases.

7. Words with Multiple Meanings

Words with multiple meanings pose a lexical challenge in Natural Language Processing because of their ambiguity. Such words, known as polysemous or homonymous words, have different meanings depending on the context in which they are used. Here are some key points for handling the lexical challenge posed by words with multiple meanings in NLP:

● Semantic analysis: Apply semantic analysis techniques to find the underlying meaning of a word in various contexts. Semantic representations such as word embeddings or semantic networks can capture the semantic similarity and relatedness between different word senses.

● Domain-specific knowledge: Domain knowledge is very valuable in NLP tasks because it provides context and constraints for determining the correct sense of a word.

● Multi-word expressions (MWEs): The meaning of the entire sentence or phrase is analyzed to disambiguate a word with multiple meanings.

● Knowledge graphs and ontologies: Apply knowledge graphs and ontologies to find the semantic relationships between different word senses and contexts.

8. Addressing Multilingualism

It is very important to address language diversity and multilingualism in Natural Language Processing so that NLP systems can handle text data in multiple languages effectively. Here are some key points for addressing language diversity and multilingualism:

● Multilingual corpora: A multilingual corpus contains text data in several languages and serves as a valuable resource for training multilingual NLP models and systems.

● Cross-lingual transfer learning: These techniques transfer knowledge learned from one language to another.

● Language identification: Build language identification models to automatically detect the language of a given text.

● Machine translation: Machine translation enables communication and information access across language barriers and can be used as a preprocessing step for multilingual NLP tasks.

9. Reducing Uncertainty and False Positives in NLP

Reducing uncertainty and false positives is a crucial task in Natural Language Processing (NLP) because it improves the accuracy and reliability of NLP models. Here are some key points for approaching it:

● Probabilistic models: Use probabilistic models to quantify the uncertainty in predictions. Models such as Bayesian networks give probabilistic estimates of outputs, allowing uncertainty quantification and better decision making.

● Confidence scores: Confidence scores or probability estimates are calculated for NLP predictions to assess how certain the model is about its output. Confidence scores help identify cases where the model is uncertain or likely to produce false positives.

● Threshold tuning: For classification tasks, the decision threshold is adjusted to balance sensitivity (recall) against specificity; false positives can be reduced by setting an appropriate threshold, as sketched below.

● Ensemble methods: Apply ensemble learning techniques that combine multiple models to reduce uncertainty.
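Here is a minimal, self-contained sketch of threshold tuning on hypothetical confidence scores (the scores and labels below are made up purely for illustration):

# Hypothetical model confidence scores for the positive class, with the true labels
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    0,    0]

def false_positives(threshold):
    # Count negatives that the model would wrongly flag as positive at this threshold
    return sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)

for threshold in (0.3, 0.5, 0.7):
    print(f"threshold={threshold}: false positives={false_positives(threshold)}")
# Raising the threshold trades some recall for fewer false positives.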

10. Facilitating Continuous Conversations with NLP

Facilitating continuous conversations with NLP involves developing systems that understand and respond to human language in real time, enabling seamless interaction between users and machines. Implementing real-time NLP pipelines gives systems the capability to analyze and interpret user input as it is received; algorithms and systems are optimized for low-latency processing to ensure quick responses to user queries and inputs.

It also requires building NLP models that can maintain context throughout a conversation. Understanding context enables systems to interpret user intent, track conversation history, and generate relevant responses based on the ongoing dialogue. Intent recognition algorithms are applied to find the underlying goals and intentions expressed by users in their messages.

How to overcome NLP Challenges


Overcoming the challenges in NLP requires a combination of innovative technologies, domain experts, and sound methodology. Here are some key points for overcoming the challenges of NLP tasks:

● Quantity and quality of data: High-quality, diverse data is needed to train NLP algorithms effectively. Data augmentation, data synthesis, and crowdsourcing are techniques for addressing data scarcity.

● Ambiguity: NLP algorithms should be trained to disambiguate words and phrases.

● Out-of-vocabulary words: Techniques such as tokenization, character-level modeling, and vocabulary expansion are used to handle out-of-vocabulary words.

● Lack of annotated data: Techniques such as transfer learning and pre-training can be used to transfer knowledge from large datasets to specific tasks with limited labeled data.

Types of Ambiguities

1. Lexical Ambiguity

Lexical means relating to words of a language. During Lexical analysis given paragraphs are broken

down into words or tokens. Each token has got specific meaning. There can be instances where a

single word can be interpreted in multiple ways. The ambiguity that is caused by the word alone

rather than the context is known as Lexical Ambiguity.

Example: “Give me the bat!”

In the above sentence, it is unclear whether bat refers to a nocturnal animal bat or a cricket bat. Just

by looking at the word, we cannot determine its meaning; hence we need to know the context in which it is used.

Lexical Ambiguity can be further categorized into Polysemy and homonymy.

1. a) Polysemy

It refers to a single word having multiple but related meanings.


Example: Light (adjective).

· Thanks to the new windows, this room is now so light and airy = lit by the natural light of

day.

· The light green dress is better on you = pale colours.

In the above example, light has different meanings but they are related to each other.

Courtesy: Oxford Dictionary


1. b) Homonymy

It refers to a single word having multiple but unrelated meanings.

Example: Bear, left, Pole

· A bear (the animal) can bear (tolerate) very cold temperatures.


· The driver turned left (opposite of right) and left (departed from) the main road.

· Pole and Pole — The first Pole refers to a citizen of Poland who could either be referred to

as Polish or a Pole. The second Pole refers to a bamboo pole or any other wooden pole.

2. Syntactic Ambiguity/ Structural ambiguity

Syntactic meaning refers to the grammatical structure and rules that define how words should be

combined to form sentences and phrases. A sentence can be interpreted in more than one way due

to its structure or syntax such ambiguity is referred to as Syntactic Ambiguity.


Example 1: “Old men and women”

The above phrase can have two possible meanings:

· Old men, and women of any age.

· Old men and old women.

Example 2: "John saw the boy with a telescope."

In the above case, two possible meanings are

· John saw the boy through his telescope.

· John saw the boy who was holding the telescope.

3. Semantic Ambiguity

Semantics is nothing but “Meaning”. The semantics of a word or phrase refers to the way it is

typically understood or interpreted by people. Syntax describes the rules by which words can be

combined into sentences, while semantics describes what they mean.

Semantic Ambiguity occurs when a sentence has more than one interpretation or meaning.

Example 1: “Seema loves her mother and Sriya does too.”

The interpretations can be that Sriya loves Seema's mother or that Sriya loves her own mother.

Example 2: “He ate the burnt lasagna and pie.”

The above sentence can be interpreted as either ‘the lasagna was burnt and the pie wasn’t’ or both

were burnt.
4. Anaphoric Ambiguity

A word that gets its meaning from a preceding word or phrase is called an anaphor.

Example: “Susan plays the piano. She likes music.”

In this example, the word she is an anaphor and refers back to a preceding expression i.e., Susan.

The linguistic element or elements to which an anaphor refers is called an antecedent. The

relationship between anaphor and antecedent is termed ‘anaphora’. ‘Anaphora resolution’ or

‘anaphor resolution’ is the process of finding the correct antecedent of an anaphor.

Ambiguity that arises when there is more than one reference to the antecedent is known as

Anaphoric Ambiguity.

Example 1: “The horse ran up the hill. It was very steep. It soon got tired.”

In this example, there are two occurrences of 'it', and it is unclear what each 'it' refers to; this leads to anaphoric ambiguity. The passage is meaningful if the first 'it' refers to the hill and the second 'it' refers to the horse. Anaphors need not be in the immediately previous sentence; they may appear in earlier sentences or even in the same sentence.

An anaphoric reference may not be explicitly present in the previous sentence; it might instead refer to a part or aspect of the antecedent.

Example 2: “I went to the hospital, and they told me to go home and rest.”

In this sentence, ‘they’ does not explicitly refer to the hospital instead it refers to the Dr or staff who

attended the patient in the hospital.


Anaphors are mostly pronouns, or they can even be noun phrases in some instances.

Example 3: “Darshan plays keyboard. He loves music. “

In this case ‘He’ is a pronoun.

Example 4: “A puppy drank the milk. The cute little dog was satisfied.”

Here the anaphor is 'the cute little dog', which is a noun phrase.

5. Pragmatic ambiguity

Pragmatics focuses on the real-time usage of language like what the speaker wants to convey and

how the listener infers it. Situational context, the individuals’ mental states, the preceding dialogue,

and other elements play a major role in understanding what the speaker is trying to say and how

the listeners perceive it.

Example: "Can you pass the salt?" Taken literally, this asks about the listener's ability; pragmatically, it is usually a request to pass the salt. The intended meaning depends on the situation and the speaker's intent.

Text Encoding in NLP

Let's try to understand a few basic rules first:
1. Machines don't understand characters, words, or sentences.

2. Machines can only process numbers.

3. Text data must be encoded as numbers for input or output for any machine.

WHY to perform text encoding?

As mentioned in the above points we cannot pass raw text into machines as input until and unless

we convert them into numbers, hence we need to perform text encoding.

WHAT is text encoding?

Text encoding is a process to convert meaningful text into number / vector representation so as to

preserve the context and relationship between words and sentences, such that a machine can

understand the pattern associated in any text and can make out the context of sentences.

HOW to encode text for any NLP task?

There are a lot of methods to convert Text into numerical vectors, they are:

- Index-Based Encoding

- Bag of Words (BOW)

- TF-IDF Encoding

- Word2Vector Encoding

- BERT Encoding

As this is a basic explanation of NLP text encoding, we will skip the last two methods, i.e. Word2Vec and BERT, as they are quite complex and powerful deep-learning-based text embedding methods for converting text into vector encodings.


You can find in-depth information about Word2Vector in my other blog stated here: NLP — Text

Encoding: Word2Vec

Before we deep dive into each method let’s set some ground examples so as to make it easier to

follow through.

Document Corpus: This is the whole set of text we have, basically our text corpus; it can be anything like news articles, blogs, etc.

Example: We have 5 sentences namely, [“this is a good phone” , “this is a bad mobile” , “she is a good

cat” , “he has a bad temper” , “this mobile phone is not good”]

Data Corpus: It is the collection of unique words in our document corpus, i.e. in our case it looks like

this:

[“a” , “bad” , “cat” , “good” , “has” , “he” , “is” , “mobile” , “not” , “phone” , “she” , “temper” , “this”]

We will stick to these sentences to understand each embedding method.

This will make it easier to understand and grasp the intuition behind these methods.

So let’s try to understand each of them one by one:

1. Index-Based Encoding:
As the name mentions, Index based, we surely need to give all the unique words an index, like we

have separated out our Data Corpus, now we can index them individually, like…

a:1

bad : 2

this : 13

Now that we have assigned a unique index to all the words so that based on the index we can

uniquely identify them, we can convert our sentences using this index-based method.
It is very trivial to understand, that we are just replacing the words in each sentence with their

respective indexes.

Our document corpus becomes:

[13 7 1 4 10] , [13 7 1 2 8] , [11 7 1 4 3] , [6 5 1 2 12] , [13 8 10 7 9 4]

Now we have encoded all the words with index numbers, and this can be used as input to any

machine since machine understands number.

But there is a tiny bit of issue which needs to be addressed first and that is the consistency of the

input. Our input needs to be of the same length as our model, it cannot vary. It might vary in the real

world but needs to be taken care of when we are using it as input to our model.

Now as we can see the first sentence has 5 words, but the last sentence has 6 words, this will cause

an imbalance in our model.

So to take care of that issue what we do is max padding, which means we take the longest sentence

from our document corpus and pad the other sentences to be as long. This means that if most of my sentences are 5 words long and one sentence is 6 words long, I will make all the sentences 6 words long.

Now how do we add that extra word here? In our case how do we add that extra index here?

If you have noticed we didn’t use 0 as an index number, and preferably that will not be used

anywhere even if we have 100000 words long data corpus, hence we use 0 as our padding index.

This also means that we are appending nothing to our actual sentence, as 0 doesn't represent any specific word; hence the integrity of our sentences is intact.

So finally our index based encodings are as follows:

[ 13 7 1 4 10 0 ] ,

[ 13 7 1 2 8 0 ] ,

[ 11 7 1 4 3 0 ] ,
[ 6 5 1 2 12 0 ] ,

[ 13 8 10 7 9 4 ]

And this is how we keep our input’s integrity the same and without disturbing the context of our

sentences as well.

Index-Based Encoding considers the sequence information in text encoding.
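Here is a minimal sketch of index-based encoding with zero padding, using the example corpus above (pure Python):

corpus = ["this is a good phone", "this is a bad mobile", "she is a good cat",
          "he has a bad temper", "this mobile phone is not good"]

# Build the data corpus (unique words) and assign indexes starting at 1; 0 is reserved for padding
vocab = sorted({word for sentence in corpus for word in sentence.split()})
index = {word: i + 1 for i, word in enumerate(vocab)}

encoded = [[index[word] for word in sentence.split()] for sentence in corpus]

# Max padding: pad every sentence with 0s up to the length of the longest sentence
max_len = max(len(seq) for seq in encoded)
padded = [seq + [0] * (max_len - len(seq)) for seq in encoded]

for seq in padded:
    print(seq)  # e.g. [13, 7, 1, 4, 10, 0] for "this is a good phone"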

2. Bag of Words (BOW):


Bag of Words or BoW is another form of encoding where we use the whole data corpus to encode

our sentences. It will make sense once we see actually how to do it.

Data Corpus:

[“a” , “bad” , “cat” , “good” , “has” , “he” , “is” , “mobile” , “not” , “phone” , “she” , “temper” , “this”]

As we know, our data corpus will never change, so if we use it as a baseline to create encodings for our sentences, we have the advantage of not needing to pad any extra words.

Now, 1st sentence we have is this : “this is a good phone”

How do we use the whole corpus to represent this sentence?

So our first sentence is represented by marking, for each word in the data corpus, whether it is present or not:

[1,0,0,1,0,0,1,0,0,1,0,0,1]

This is how our first sentence is represented.

Now there are 2 kinds of BOW:

1. Binary BOW.

2. BOW
The difference between them is, in Binary BOW we encode 1 or 0 for each word appearing or non-

appearing in the sentence. We do not take into consideration the frequency of the word appearing

in that sentence.

In BOW we also take into consideration the frequency of each word occurring in that sentence.

Let’s say our text sentence is “this is a good phone this is a good mobile” (FYI just for reference)

If you see carefully, we considered the number of times the words “this”, “a”, “is” and “good” have

occurred.

So that’s the only difference between Binary BOW and BOW.

BOW totally discards the sequence information of our sentences.
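Here is a minimal sketch of Binary BOW versus count-based BOW over the same data corpus (pure Python):

corpus = ["this is a good phone", "this is a bad mobile", "she is a good cat",
          "he has a bad temper", "this mobile phone is not good"]
vocab = sorted({word for sentence in corpus for word in sentence.split()})

def bow(sentence, binary=False):
    words = sentence.split()
    counts = [words.count(v) for v in vocab]               # frequency of each corpus word
    return [1 if c > 0 else 0 for c in counts] if binary else counts

print(bow("this is a good phone", binary=True))            # Binary BOW: presence/absence only
print(bow("this is a good phone this is a good mobile"))   # BOW: keeps the frequencies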

3. TF-IDF Encoding:
Term Frequency — Inverse Document Frequency

As the name suggests, here we give every word a relative frequency coding w.r.t the current

sentence and the whole document.

Term Frequency: Is the occurrence of the current word in the current sentence w.r.t the total

number of words in the current sentence.

Inverse Data Frequency: Log of Total number of words in the whole data corpus w.r.t the total

number of sentences containing the current word.

TF = (number of times the word occurs in the current sentence) / (total number of words in the current sentence)

IDF = log(total number of words in the whole data corpus / number of sentences containing the word)

One thing to note here is we have to calculate the word frequency of each word for that particular

sentence, because depending on the number of times a word occurs in a sentence the TF value can

change, whereas the IDF value remains constant, until and unless new sentences are getting added.

Let’s try to understand by experimenting it out:

Data Corpus: [“a” , “bad” , “cat” , “good” , “has” , “he” , “is” , “mobile” , “not” , “phone” , “she” , “temper”

, “this”]

TF of "this" in sentence 1 = (number of times "this" appears in sentence 1) / (total number of words in sentence 1)

IDF of "this" = log(total number of words in the whole data corpus / number of sentences containing "this")

TF = 1 / 5 = 0.2

IDF = loge(13 / 3) = 1.4663

TF-IDF = 0.2 * 1.4663 = 0.2933

So we associate "this" with 0.2933; similarly, we can find the TF-IDF for every word in that sentence, and then the rest of the process remains the same as BOW, except that each word is replaced not with the frequency of its occurrence but with its TF-IDF value.

So let’s try to encode our first sentence: “this is a good phone”

As we can see that we have replaced all the words appearing in that sentence with their respective

tf-idf values, one thing to notice here is, we have similar tf-idf values of multiple words. This is a

rare case that has happened with us as we had few documents and almost all words had kind of

similar frequencies.
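Here is a minimal sketch that follows the formulas used above. Note that it mirrors this text's variant of IDF (vocabulary size divided by the number of sentences containing the word); standard TF-IDF libraries use slightly different formulas, so their values will not match exactly.

import math

corpus = ["this is a good phone", "this is a bad mobile", "she is a good cat",
          "he has a bad temper", "this mobile phone is not good"]
vocab = sorted({word for sentence in corpus for word in sentence.split()})

def tf_idf(word, sentence):
    words = sentence.split()
    tf = words.count(word) / len(words)                        # term frequency in this sentence
    containing = sum(1 for s in corpus if word in s.split())   # sentences containing the word
    idf = math.log(len(vocab) / containing)                    # IDF as defined in the text above
    return tf * idf

print(round(tf_idf("this", "this is a good phone"), 4))  # -> 0.2933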

Regular Expressions in Natural Language Processing

Introduction

Regular expressions are very popular among programmers and can be applied in many programming languages like Java, JavaScript, PHP, C++, etc. Regular expressions are useful for

numerous practical day-to-day tasks that a data scientist encounters. It is one of the key

concepts of Natural Language Processing that every NLP expert should be proficient in.

Regular Expressions are used in various tasks such as data pre-processing, rule-based

information mining systems, pattern matching, text feature engineering, web scraping, data

extraction, etc.
What are Regular Expressions?

Regular expressions or RegEx is a sequence of characters mainly used to find or replace

patterns embedded in the text. Let's consider this example: suppose we have a list of friends, say Sunil, Sumit, Ankit, Surjeet, and Surabhi, and we want to select only those names on the list that match a certain pattern, something like this:

names having the first two letters S and U, followed by only three positions that can be taken up by any letter. What do you think, which names fit this criterion? Let's go one by one.

The names Sunil and Sumit fit this criterion, as they have S and U at the beginning and three more letters after that, while the rest of the names do not follow the given criterion: Ankit starts with the letter A, whereas Surjeet and Surabhi have more than three characters after S and U.
What we have done here is that we have a pattern(or criteria) and a list of names and we’re

trying to find the name that matches the given pattern. That’s exactly how regular expressions

work.

In RegEx, we’ve different types of patterns to recognize different strings of characters. Let’s

understand these terms in a bit more detail but first understand the concept of Raw Strings.

The concept of Raw String in Regular Expression

Now let’s start with the concept of Raw String. Python raw string treats backslash(\) as a literal

character. Let’s look at some examples to understand. We have a couple of backslashes here.

But python treats \n as “move to a new line”.

# normal string vs raw string

path = "C:\desktop\nathan" #string


print("string:",path)
As you can see, \n has moved the text after it to a new line. Here "nathan" has become "athan" and the \n disappeared from the path. This is not what we want. Here we use the "r" prefix to create a raw string:

path= r"C:\desktop\nathan" #raw string

print("raw string:",path)

As you can see we have the entire path printed out here by simply using “r” in front of the path.

It is always recommended to use raw string while dealing with Regular expressions.

Python Built-in Module for Regular Expressions

Python has a built-in module to work with regular expressions called “re”. Some common

methods from this module are-

● re.match()

● re.search()

● re.findall()

Let us look at each method with the help of an example-

1. re.match(pattern, string)

The re.match function returns a match object on success and none on failure.
import re

#match a word at the beginning of a string

result = re.match('Analytics',r'Analytics Vidhya is the largest data science community of India')


print(result)

Here Pattern = ‘Analytics’ and String = ‘Analytics Vidhya is the largest data science

community of India’. Since the pattern is present at the beginning of the string we got

the matching Object as an output. And since the output of the re.match is an object, we

will use the group() function of the match object to get the matched expression.

print(result.group()) #returns the matched string

As you can see, we got our required output using the group() function. Now let us have a look at

the other case as well-

result_2 = re.match('largest',r'Analytics Vidhya is the largest data science community of India')


print(result_2)

Here as you can notice, our pattern(largest) is not present at the beginning of the string, hence

we got None as our output.

2. re.search(pattern, string)

Matches the first occurrence of a pattern in the entire string(and not just at the beginning).
# search for the pattern "founded" in a given string

result = re.search('founded',r'Andrew NG founded Coursera. He also founded deeplearning.ai')


print(result.group())

Since our pattern (founded) is present in the string, re.search() has matched its first occurrence.

3. re.findall(pattern, string)

It will return all the occurrences of the pattern from the string. I would recommend you to use

re.findall() always, it can work like both re.search() and re.match().

result = re.findall('founded',r'Andrew NG founded Coursera. He also founded deeplearning.ai')


print(result)

Since we’ve ‘founded’ twice here in the string, re.findall() has printed it out twice in the output.

Special Sequences in Regular Expressions

Now we’re going to look at some special sequences that come up with Regular expressions.

These are used to extract a different kind of information from a given text. Let’s take a look at

them-

1. \b

\b returns a match where the specified pattern is at the beginning or at the end of a word.
str = r'Analytics Vidhya is the largest Analytics community of India'

#Check if there is any word that ends with "est"

x = re.findall(r"est\b", str)
print(x)

As you can see it returned the last three characters of the word “largest”.

2. \d

\d returns a match where the string contains digits (numbers from 0-9).

str = "2 million monthly visits in Jan'19."

#Check if the string contains any digits (numbers from 0-9):

x = re.findall("\d", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

This has extracted all the digits from the string, i.e. 2, 1, and 9, separately. But is this what we want? The 1 and 9 were together in the string, but in our output we got 1 and 9 separated. Let's see how we can get our desired output:

str = "2 million monthly visits in Jan'19."


# Check if the string contains any digits (numbers from 0-9):
# adding '+' after '\d' keeps extracting digits until a non-digit character is encountered

x = re.findall("\d+", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

We can solve this problem by using the '+' sign. Notice how we used '\d+' instead of '\d'. Adding '+' after '\d' continues to extract digits until a non-digit character is encountered. We can infer that \d+ matches one or more consecutive occurrences of \d as a single match, whereas \d does a character-wise comparison.

3. \D

\D returns a match where the string does not contain any digit. It is basically the

opposite of \d.

str = "2 million monthly visits in Jan'19."

#Check if the word character does not contain any digits (numbers from 0-9):

x = re.findall("\D", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

We’ve got all the strings where there are no digits. But again we are getting individual

characters as output and like this, they really don’t make sense. By now I believe you know how

to tackle this problem now-

#Check if the word does not contain any digits (numbers from 0-9):

x = re.findall("\D+", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

Bingo! use \D+ instead of just \D to get characters that make sense.

4. \w

\w helps in extraction of alphanumeric characters only (characters from a to Z, digits from 0-9,

and the underscore _ character)

str = "2 million monthly visits!"

#returns a match at every word character (characters from a to Z, digits from 0-9, and the underscore _ character)

x = re.findall("\w+",str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

We got all the alphanumeric characters.

5. \W

\W returns match at every non-alphanumeric character. Basically opposite of \w.

str = "2 million monthly visits9!"

#returns a match at every NON-word character (characters NOT between a and Z, like "!", "?", whitespace, etc.):

x = re.findall("\W", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

We got every non-alphanumeric character including white spaces.

Metacharacters in Regular Expression

Metacharacters are characters with a special meaning.

1- (.) matches any character (except newline character)


str = "rohan and rohit recently published a research paper!"

#Search for a string that starts with "ro", followed by any number of characters
x = re.findall("ro.", str) #searches one character after ro
x2 = re.findall("ro...", str) #searches three characters after ro

print(x)
print(x2)

We got “roh” and “roh” as our first output since we used only one dot after “ro”. Similarly, “rohan”

and “rohit” as our second output since we used three dots after “ro” in the second statement.

2– (^) starts with

It checks whether the string starts with the given pattern or not.

str = "Data Science"

#Check if the string starts with 'Data':

x = re.findall("^Data", str)

if (x):
    print("Yes, the string starts with 'Data'")
else:
    print("No match")

The caret (^) symbol checks whether the string starts with “Data” or not. Since our string does start with the word “Data”, we got this output. Let’s check the other case as well:
# try with a different string

str2 = "Big Data"

#Check if the string starts with 'Data':

x2 = re.findall("^Data", str2)

if (x2):
    print("Yes, the string starts with 'Data'")
else:
    print("No match")

Here the new string does not start with the word “Data”, hence we got “No match”.

3- ($) ends with

It checks whether the string ends with the given pattern or not.

str = "Data Science"

#Check if the string ends with 'Science':

x = re.findall("Science$", str)

if (x):
    print("Yes, the string ends with 'Science'")
else:
    print("No match")

The dollar($) sign checks whether the string ends with the given pattern or not. Here, our

pattern is Science and since the string ends with Science we got this output.

4- (*) matches zero or more occurrences of the pattern to the left of it
str = "easy easssy eay ey"

#Check if the string contains "ea" followed by 0 or more "s" characters and ending with y

x = re.findall("eas*y", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

The above code block checks whether the string contains the pattern “eas*y”, which means “ea” followed by zero or more occurrences of “s” and ending with “y”. We got these three strings as output, “easy”, “easssy”, and “eay”, because they match the given pattern. The string “ey” does not contain the pattern we’re looking for, since it is missing the “a”.

5- (+) matches one or more occurrences of the pattern to the left of it


#Check if the string contains "ea" followed by 1 or more "s" characters and ends with y

str = "easy easssy eay ey"

x = re.findall("eas+y", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

One major difference between * and + is that + requires one or more occurrences of the pattern to the left of it. In the above example we got “easy” and “easssy” as output but not “eay” or “ey”, because “eay” does not contain any instance of the character “s” and “ey” is missing both the “a” and the “s”.

6- (?) matches zero or one occurrence of the pattern to the left of it.


str = "easy easssy eay ey"
x = re.findall("eas?y",str)

print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

The question mark (?) looks for zero or one occurrence of the pattern to the left of it. That is why we got “easy” and “eay” as our output: only these two strings contain one and zero occurrences of the character “s” respectively, along with the pattern starting with “ea” and ending with “y”.

7- (|) either or
str = "Analytics Vidhya is the largest data science community of India"

#Check if the string contains either "data" or "India":

x = re.findall("data|India", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

The pipe (|) operator checks whether either of the two patterns, to its left or right, is present in the string. In the above example, we check whether the string contains “data” or “India”. Since both are present in the string, we got both as output.
Let’s look at another example:

# try with a different string

str = "Analytics Vidhya is one of the largest data science communities"

#Check if the string contains either "data" or "India":

x = re.findall("data|India", str)
print(x)

if (x):
    print("Yes, there is at least one match!")
else:
    print("No match")

Here the pattern is the same, but the string contains only “data”, and hence we got only [‘data’] as the output.

What is a Lexicon?
The lexicon refers to the collection of words, phrases, or symbols in a specific
language. It encompasses the vocabulary of a language and includes various
linguistic attributes associated with each word, such as part-of-speech tags,
semantic information, pronunciation, and more. It serves as a comprehensive
repository of linguistic knowledge, enabling NLP systems to process and
understand natural language text.

Components of a Lexicon
A lexicon comprises several components that provide rich information about
words and their properties. These components include:
1. Words and their Meanings:
The core component of a lexicon is the listing of words, each associated
with its corresponding meaning(s). This provides the fundamental
building blocks for language understanding.
2. Part-of-Speech (POS) Tags:
POS tags assign grammatical categories to words, such as noun, verb,
adjective, adverb, and more. POS tags play a vital role in syntactic
analysis and help disambiguate word meanings based on their context.
3. Pronunciation:
Lexicons often include information about the pronunciation of words,
helping in tasks such as text-to-speech synthesis and speech
recognition.
4. Semantic Information:
Some lexicons include semantic attributes associated with words, such
as word senses, synonyms, antonyms, and hypernyms. These semantic
relationships enable algorithms to infer deeper meaning from text.
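
To make these components concrete, here is a minimal sketch of how a tiny lexicon could be represented in Python. The words, tags, pronunciations, and semantic links below are illustrative assumptions, not entries from any standard resource.

# A minimal, illustrative lexicon: each entry bundles meanings, POS tags,
# pronunciation, and simple semantic links (all values are made-up examples).
lexicon = {
    "bank": {
        "meanings": ["financial institution", "side of a river"],
        "pos": ["NOUN", "VERB"],
        "pronunciation": "/bæŋk/",
        "synonyms": ["depository", "shore"],
    },
    "happy": {
        "meanings": ["feeling or showing pleasure"],
        "pos": ["ADJECTIVE"],
        "pronunciation": "/ˈhæpi/",
        "synonyms": ["glad", "joyful"],
        "antonyms": ["sad"],
    },
}

# Look up the possible POS tags and senses of a word.
word = "bank"
entry = lexicon.get(word, {})
print(word, "->", entry.get("pos"), "|", entry.get("meanings"))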

Lexical Semantics
Lexical Processing encompasses various techniques and methods used to
handle and analyze words or lexemes in natural language. It involves tasks
such as normalizing word forms, disambiguating word meanings, and
establishing translation equivalences between different languages. Lexical
processing is an essential component in many language-related applications,
including information retrieval, machine translation, natural language
understanding, and text analysis.

Collectively, the three concepts of Lexical Normalization, Lexical Disambiguation, and Bilingual Lexicons are often referred to as Lexical Processing or Lexical Semantics.
1. Lexical Normalization:
Lexical normalization, also known as word normalization or word
standardization, is the process of transforming words or phrases into
their canonical or base form. It helps in handling variations in word
forms to improve text analysis and natural language processing tasks.
Techniques used in lexical normalization include stemming,
lemmatization, and handling abbreviations or acronyms.
2. Lexical Disambiguation:
Lexical disambiguation aims to resolve the ambiguity present in natural
language. It involves determining the correct meaning or sense of a
word in a given context. This is important because many words in
natural language have multiple meanings, and understanding the
intended sense is crucial for accurate language processing.
Techniques such as part-of-speech tagging, semantic role labeling, and
word sense disambiguation algorithms are employed for lexical
disambiguation.
3. Bilingual Lexicons:
Bilingual lexicons are linguistic resources that provide translation
equivalents between words or phrases in different languages. They
facilitate the process of translation and language understanding tasks
by mapping words or phrases from one language to another. Bilingual
lexicons can be manually curated or automatically generated using
various techniques, including statistical alignment models, parallel
corpora, machine learning, and bilingual dictionaries.
In summary, lexical normalization focuses on transforming words into their
standardized forms, lexical disambiguation deals with resolving the ambiguity
of words in context, and bilingual lexicons assist in translating words or
phrases between different languages. These concepts play important roles in
natural language processing, machine translation, and cross-lingual
applications.
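
As a small illustration of lexical normalization, the sketch below contrasts stemming and lemmatization using NLTK’s PorterStemmer and WordNetLemmatizer (assuming NLTK is installed and its WordNet data has been downloaded); the word list is arbitrary.

# Lexical normalization sketch: stemming vs. lemmatization.
# Assumes: pip install nltk, plus nltk.download('wordnet') for the lemmatizer.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["runs", "ran", "running", "studies", "better"]

for w in words:
    stem = stemmer.stem(w)                    # crude suffix stripping
    lemma = lemmatizer.lemmatize(w, pos="v")  # dictionary lookup, verb reading
    print(f"{w:10s} stem: {stem:10s} lemma: {lemma}")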

Applications of Lexicon in NLP


The lexicon finds extensive applications in various NLP tasks, contributing to
the advancement of language processing algorithms. Here are a few key
applications:

1. Sentiment Analysis:
Lexicons play a crucial role in sentiment analysis, where the goal is to
determine the sentiment expressed in a given text. Lexicons contain
sentiment scores or polarity labels associated with words. For example,
the word "happy" might have a positive sentiment score, while "sad"
could have a negative sentiment score. Implementations involve using
lexicons to assign sentiment scores to words in a text and aggregating
them to determine the overall sentiment of the text (a small sketch follows this list).
2. Text Classification:
Lexicons serve as valuable resources for text classification tasks.
Lexicons can provide features for classification algorithms, aiding in
better feature representation and decision-making. For example, a
lexicon might contain words associated with specific topics or domains.
Implementations involve incorporating lexicon-based features into
classification algorithms to improve the accuracy of text classification.
3. Machine Translation:
Lexicons are utilized in machine translation systems to provide
translation equivalents for words or phrases. For example, a lexicon
might contain mappings between English and French words.
Implementations involve leveraging the lexicon to translate words or
phrases during the translation process.
4. Word Sense Disambiguation:
Lexicons with semantic information aid in word sense disambiguation,
where the correct meaning of a word in a specific context needs to be
determined. For example, a lexicon might contain multiple senses of the
word "bank" (financial institution vs. river bank). Implementations involve
using the lexicon to disambiguate the correct sense based on the
context in which the word appears.
5. Named Entity Recognition (NER):
Lexicons are used in NER to identify and classify named entities such
as person names, locations, organizations, etc. For example, a lexicon
might contain a list of known organization names. Implementations
involve matching the words in a text with the entries in the lexicon to
identify and extract named entities.
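
To illustrate the first application above (lexicon-based sentiment analysis), here is a minimal sketch. The tiny sentiment lexicon and its scores are made-up assumptions, not a published resource such as VADER or SentiWordNet.

# Lexicon-based sentiment scoring sketch (toy lexicon, illustrative scores).
sentiment_lexicon = {"happy": 1.0, "great": 0.8, "sad": -1.0, "terrible": -0.9}

def sentiment_score(text):
    # Sum the scores of the words found in the lexicon (unknown words score 0).
    words = text.lower().split()
    return sum(sentiment_lexicon.get(w, 0.0) for w in words)

text = "the movie was great but the ending made me sad"
score = sentiment_score(text)
print(score, "->", "positive" if score > 0 else "negative" if score < 0 else "neutral")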

Parts of Speech Definition

The parts of speech are the “traditional grammatical categories to which words are assigned in accordance with their syntactic functions, such as noun, verb, adjective, adverb, and so on.” In other words, they refer to the different roles that words can play in a sentence and how they relate to one another based on grammar and syntax.

All Parts of Speech with Examples


There are 8 different parts of speech, i.e., Nouns, Pronouns, Adjectives, Verbs, Adverbs, Prepositions, Conjunctions, and Interjections.

Noun –

A noun is a word that names a person, place, thing, state, or quality; it can be singular or plural. A noun stands for either a real thing or an idea, including living beings, locations, actions, characteristics, conditions, and concepts, and it can act as either the subject or the object in a sentence, phrase, or clause.

● Function: Refers to Things or person

● Examples: Pen, Chair, Ram, Honesty

● Sentences: Cars are expensive, This chair is made of wood, and

Ram is a topper, Honesty is the best policy.

Pronoun –

The word used in place of a noun or a noun phrase is known as a pronoun. A pronoun is used in place of a noun to avoid the repetition of the noun.

● Function: Replaces a noun

● Examples: I, you, he, she, it, they

● Sentences: They are expensive, It is of wood, He is a topper, It is the

best policy

Adjective –

A word that modifies a noun or a pronoun is an adjective. Generally, an adjective’s function is to further define and quantify a noun or pronoun.
● Function: Describes a noun

● Examples: Super, Red, Our, Big, Great, class

● Sentences: Supercars are expensive, The red chair is for kids, Ram is

a class topper, and Great things take time.

Verb –

A word or a group of words that describes an action, a state, or an event is called a verb. A verb is a word that says what happens to somebody or what somebody or something does.

● Function: Describes action or state

● Examples: Play, be, work, writing , like

● Sentences: I play football, I will be a doctor, I like to work, I love

writing poems.

Adverb –

An adverb typically modifies a verb, an adjective, another adverb, a determiner, a clause, a preposition, or even a whole sentence. Adverbs often describe how something is done, where it happens, when it occurs, how often it takes place, and to what degree or certainty; they answer questions like “how,” “in what way,” “when,” “where,” and “to what extent.”
● Function: Describes a verb, adjective, or adverb

● Examples: Silently, too, very

● Sentences: I love reading silently, It is too tough to handle, He can

speak very fast.

Preposition –

A preposition is called a connector or linking word which has a very close relationship with the noun, pronoun, or adjective that follows it. Prepositions show position in space, movement, direction, etc.

● Function: Links a noun to another word

● Examples: at, in, of, after, under,

● Sentences: The ball is under the table, I am at a restaurant, she is in

trouble, I am going after her, It is so nice of him

Conjunction –

A conjunction is a word that connects clauses, sentences, or other words. Conjunctions can be used alone or in groups of two.

● Function: Joins clauses and sentences

● Examples: and, but, though, after

● Sentences: First, I will go to college and then I may go to Fest, I

don’t have a car but I know how to drive, She failed the exam

though she worked hard, He will come after he finishes his match.
Interjection –

An interjection is a word or phrase expressing a sudden feeling or emotion, such as surprise, joy, or sadness.

● Function: Shows exclamation

● Examples: oh! wow!, alas! Hurray!

● Sentences: Oh! I failed again, Wow! I got the job, Alas! She is no more, Hurray! We are going to a party.

These are the main parts of speech, but there are additional subcategories
and variations within each. Understanding the different parts of speech can
help construct grammatically correct sentences and express ideas clearly.

Choose the correct Parts of Speech of the BOLD word from the following
questions.

1. Let us play, Shall We?

a. Conjunction b. Pronoun c. Verb

2. It is a good practice to arrange books on shelves.

a. Verb b. Noun c. Adjective

3. Whose books are these?

a. Pronoun b. Preposition c. verb

4. Father, please get me that toy.


a. Pronoun b. Adverb c. Adjective

5. His mentality is rather obnoxious.

a. Adverb b. Adjective c. Noun

6. He is the guy whose money got stolen.

a. Pronoun b. Conjunction c. Adjective

7. I will have finished my semester by the end of this year.

a. Interjection b. Conjunction c. Preposition

8. Bingo! That’s the one I have been looking for

a. Interjection b. Conjunction c. Preposition

Part-of-speech (POS) tagging is an important Natural Language Processing (NLP) concept that categorizes each word in a text corpus with a particular part-of-speech tag (e.g., Noun, Verb, Adjective, etc.).

POS tagging could be the very first task in text processing for further downstream tasks in
NLP, like speech recognition, parsing, machine translation, sentiment analysis, etc.

The particular POS tag of a word can be used as a feature by various Machine Learning
algorithms used in Natural Language Processing.

Introduction
Simply put, in part-of-speech tagging for English, we are given a text of English words and we need to identify the part of speech of each word.

Example Sentence : Learn NLP from Scaler

Learn -> VERB NLP -> NOUN from -> PREPOSITION Scaler -> NOUN
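
In practice, an off-the-shelf tagger produces such labels automatically. Below is a small sketch using NLTK’s pos_tag (assuming NLTK is installed and its tokenizer and tagger data have been downloaded); the tags it returns come from the Penn Treebank tagset and may differ from the coarse labels shown above.

# POS tagging with NLTK's built-in tagger.
# Assumes: pip install nltk, plus nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger').
import nltk

sentence = "Learn NLP from Scaler"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # prints a list of (word, tag) pairs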

Although it seems easy, identifying the part-of-speech tags is much more complicated than simply mapping words to their part-of-speech tags.

Why Difficult ?

Words often have more than one POS tag. Let’s understand this by taking an easy example. Focus on the word “back” in the following phrases: in “the back door”, back is an adjective; in “on my back”, it is a noun; and in “promised to back the bill”, it is a verb. The relationship of “back” with adjacent and related words in a phrase, sentence, or paragraph changes its POS tag.

It is quite possible for a single word to have a different part-of-speech tag in different sentences based on different contexts. That is why it is very difficult to have a generic mapping for POS tags.
If it is difficult, then what approaches do we have?

Before discussing the tagging approaches, let us literate ourselves with the required
knowledge about the words, sentences, and different types of POS tags.

Word Classes
In grammar, a part of speech (POS), also known as a word class or grammatical category, is a category of words that have similar grammatical properties.

The English language has four major word classes: Nouns, Verbs, Adjectives, and Adverbs.

Commonly listed English parts of speech are nouns, verbs, adjectives, adverbs, pronouns,
prepositions, conjunction, interjection, numeral, article, and determiners.

These can be further categorized into open and closed classes.

Closed Class
Closed classes are those with a relatively fixed number of words, and we rarely add new words to these POS, such as prepositions. Closed-class words are generally function words like of, it, and, or you, which tend to be very short, occur frequently, and often have structuring uses in grammar.

Example of closed class-

Determiners: a, an, the Pronouns: she, he, I, others Prepositions: on, under, over, near, by,
at, from, to, with

Open Class
Open classes are mostly content-bearing, i.e., they refer to objects, actions, and features; they are called open classes since new words are added all the time.

By contrast, nouns and verbs, adjectives, and adverbs belong to open classes; new nouns
and verbs like iPhone or to fax are continually being created or borrowed.

Example of open class-


Nouns: computer, board, peace, school Verbs: say, walk, run, belong Adjectives: clean,
quick, rapid, enormous Adverbs: quickly, softly, enormously, cheerfully

Tagset
The problem is that, as discussed above, many words belong to more than one word class.

To do POS tagging, a standard tagset needs to be chosen. We could pick a very simple/coarse tagset such as Noun (NN), Verb (VB), Adjective (JJ), Adverb (RB), etc.

But to make tags less ambiguous, the commonly used set is the finer-grained University of Pennsylvania “Penn Treebank tagset”, which has a total of 45 tags.

Tagging is a disambiguation task; words are ambiguous, i.e. they have more than one possible part of speech, and the goal is to find the correct tag for the situation.

For example, a book can be a verb (book that flight) or a noun (hand me that book).

The goal of POS tagging is to resolve these ambiguities, choosing the proper tag for the
context.
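
For reference, the Penn Treebank tag definitions can be inspected directly from NLTK; here is a quick sketch (assuming NLTK is installed and its 'tagsets' data has been downloaded).

# Look up Penn Treebank tag definitions shipped with NLTK.
# Assumes: pip install nltk, plus nltk.download('tagsets').
import nltk

nltk.help.upenn_tagset("NN")   # noun, singular or mass
nltk.help.upenn_tagset("VB")   # verb, base form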

POS tagging Algorithms Accuracy:

The accuracy of existing State of the Art algorithms of part-of-speech tagging is extremely
high. The accuracy can be as high as ~ 97%, which is also about the human performance on
this task, at least for English.

We’ll discuss algorithms/techniques for this task in the upcoming sections, but first, let’s
explore the task. Exactly how hard is it?
Let's consider one of the popular electronic collections of text samples, Brown Corpus. It is
a general language corpus containing 500 samples of English, totaling roughly one million
words.

In Brown Corpus :

85-86% words are unambiguous - have only 1 POS tag

but,

14-15% words are ambiguous - have 2 or more POS tags

Particularly ambiguous common words include that, back, down, put, and set.

The word back itself can have 6 different parts of speech (JJ, NN, VBP, VB, RP, RB)
depending on the context.

Nonetheless, many words are easy to disambiguate because their different tags aren’t
equally likely. For example, "a" can be a determiner or the letter "a", but the determiner
sense is much more likely.

This idea suggests a useful baseline, i.e., given an ambiguous word, choose the tag which is
most frequent in the corpus.

This is a key concept in the Frequent Class tagging approach.

Let’s explore some common baseline and more sophisticated POS tagging techniques.

Rule-Based Tagging
Rule-based tagging is the oldest tagging approach where we use contextual information to
assign tags to unknown or ambiguous words.

The rule-based approach uses a dictionary to get possible tags for tagging each word. If the
word has more than one possible tag, then rule-based taggers use hand-written rules to
identify the correct tag.
Since rules are usually built manually, they are also called knowledge-driven taggers. We have a limited number of rules, approximately around 1000 for the English language.

One example of such a rule is as follows:

Sample Rule: If an ambiguous word “X” is preceded by a determiner and followed by a noun, tag it as an adjective.

A nice car: “nice” is an ADJECTIVE here.
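
A minimal sketch of this rule in Python is shown below; the tiny dictionary of candidate tags, the tag names, and the fallback guess are illustrative assumptions, not a real rule-based tagger.

# Rule-based tagging sketch: resolve an ambiguous word with one hand-written
# contextual rule. The mini-dictionary and sentence are made-up examples.
candidate_tags = {
    "a": ["DET"],
    "nice": ["NOUN", "ADJ"],   # ambiguous word
    "car": ["NOUN"],
}

def tag_sentence(words):
    # Naive first guess: take the first candidate tag (or NOUN for unknown words).
    tags = [candidate_tags.get(w, ["NOUN"])[0] for w in words]
    for i, w in enumerate(words):
        if len(candidate_tags.get(w, [])) > 1:
            # Rule: determiner before + noun after  =>  tag the word as ADJ.
            if 0 < i < len(words) - 1 and tags[i - 1] == "DET" and tags[i + 1] == "NOUN":
                tags[i] = "ADJ"
    return list(zip(words, tags))

print(tag_sentence(["a", "nice", "car"]))   # [('a', 'DET'), ('nice', 'ADJ'), ('car', 'NOUN')]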

Limitations/Disadvantages of Rule-Based Approach:

● High development cost and high time complexity when applying to a large corpus of
text
● Defining a set of rules manually is an extremely cumbersome process and is not
scalable at all

Stochastic POS Tagging


Stochastic POS Tagger uses probabilistic and statistical information from the corpus of
labeled text (where we know the actual tags of words in the corpus) to assign a POS tag to
each word in a sentence.

This tagger can use techniques like Word frequency measurements and Tag Sequence
Probabilities. It can either use one of these approaches or a combination of both. Let’s
discuss these techniques in detail.

Word Frequency Measurements


The tag encountered most frequently in the corpus is the one assigned to the ambiguous words (words having 2 or more possible POS tags).

Let’s understand this approach using some example sentences :

Ambiguous Word = “play”

Sentence 1 : I play cricket every day. POS tag of play = VERB


Sentence 2 : I want to perform a play. POS tag of play = NOUN

The word frequency method will now check the most frequently used POS tag for “play”.
Let’s say this frequent POS tag happens to be VERB; then we assign the POS tag of "play” =
VERB

The main drawback of this approach is that it can yield invalid sequences of tags.
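
A small sketch of this word-frequency idea, using a made-up tagged corpus:

# Word-frequency baseline: tag each word with the tag it carries most often
# in a labeled corpus. The tiny tagged corpus here is a made-up example.
from collections import Counter, defaultdict

tagged_corpus = [
    ("I", "PRON"), ("play", "VERB"), ("cricket", "NOUN"),
    ("I", "PRON"), ("play", "VERB"), ("football", "NOUN"),
    ("a", "DET"), ("play", "NOUN"), ("was", "VERB"), ("staged", "VERB"),
]

# Count how often each word occurs with each tag.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word.lower()][tag] += 1

def most_frequent_tag(word):
    counts = tag_counts.get(word.lower())
    return counts.most_common(1)[0][0] if counts else "NOUN"   # fallback guess

print(most_frequent_tag("play"))   # VERB: seen twice as VERB, once as NOUN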

Tag Sequence Probabilities


In this method, the best tag for a given word is determined by the probability that it occurs with the “n” previous tags.

For example, suppose we are tagging word w4 and look up, in the labeled corpus, all occurrences of the tag sequence of the preceding words: if 10 of those sequences are followed by a NOUN and 90 are followed by a VERB, then the POS of the word w4 = VERB.

The main drawback of this technique is that sometimes the predicted sequence is not grammatically correct.
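
A small sketch of estimating tag-sequence (bigram) probabilities from a toy set of tag sequences; the sequences and tag names below are made-up:

# Tag-sequence sketch: estimate P(next_tag | previous_tag) from tag bigrams
# observed in a labeled corpus, then pick the most probable next tag.
from collections import Counter, defaultdict

tag_sequences = [
    ["PRON", "VERB", "NOUN"],
    ["PRON", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "NOUN"],
]

transitions = defaultdict(Counter)
for tags in tag_sequences:
    for prev, nxt in zip(tags, tags[1:]):
        transitions[prev][nxt] += 1

def most_likely_next_tag(prev_tag):
    counts = transitions[prev_tag]
    tag, count = counts.most_common(1)[0]
    return tag, count / sum(counts.values())

print(most_likely_next_tag("VERB"))   # NOUN is most likely after VERB here (probability 2/3)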

Now let’s discuss some properties and limitations of the Stochastic tagging approach :

1. This POS tagging is based on the probability of the tag occurring (either solo or in
sequence)
2. It requires labeled corpus, also called training data in the Machine Learning lingo
3. There would be no probability for the words that don’t exist in the training data
4. It uses a different testing corpus (unseen text) other than the training corpus
5. It is the simplest POS tagging because it chooses the most frequent tags associated
with a word in the training corpus

Transformation-Based Learning Tagger: TBL


Transformation-based tagging is a combination of the rule-based and stochastic tagging methodologies.

In layman’s terms: the algorithm keeps searching for a new best set of rules over the labeled corpus (given as input) until its accuracy on that corpus saturates.

The algorithm takes the following input:

● a tagged corpus
● a dictionary of words with the most frequent tags

Output : Sequence of transformation rules

Example of sample rule learned by this algorithm:

Rule : Change Noun(NN) to Verb(VB) when previous tag is To(TO)

E.g.: race has the following probabilities in the Brown corpus -

● Probability that the tag is NOUN given the word is race: P(NN | race) = 0.98

● Probability that the tag is VERB given the word is race: P(VB | race) = 0.02

Given sequence: is expected to race tomorrow

● First tag race with NOUN (since its probability of being NOUN is 98%)
● Then apply the above rule and retag the POS of race with VERB (since just the
previous tag before the “race” word is TO )

The Working of the TBL Algorithm


Step 1: Label every word with the most likely tag via lookup from the input dictionary.

Step 2: Check every possible transformation & select one which most improves tagging
accuracy.

Similar to the above sample rule, other possible (though likely worse) transformation rules could be:

● Change Noun(NN) to Determiner(DT) when previous tag is To(TO)


● Change Noun(NN) to Adverb(RB) when previous tag is To(TO)
● Change Noun(NN) to Adjective(JJ) when previous tag is To(TO)
● etc…..

Step 3: Re-tag corpus by applying all possible transformation rules

Repeat Step 1,2,3 as many times as needed until accuracy saturates or you reach some
predefined accuracy cutoff.
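
A minimal sketch of applying one learned transformation (the NN-to-VB-after-TO rule from the example above) to an initially tagged sequence; the initial tags are illustrative assumptions:

# TBL sketch: start from most-frequent-tag labels, then apply a learned
# transformation rule. Initial tags below are illustrative.
tagged = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
          ("race", "NN"), ("tomorrow", "NN")]

def apply_rule(tagged, from_tag, to_tag, prev_tag):
    # Rule: change `from_tag` to `to_tag` when the previous tag is `prev_tag`.
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        if tag == from_tag and result[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result

print(apply_rule(tagged, from_tag="NN", to_tag="VB", prev_tag="TO"))
# "race" is retagged NN -> VB because the previous tag is TO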

Advantages and Drawbacks of the TBL Algorithm


Advantages

● We can learn a small set of simple rules, and these rules are decent enough for basic
POS tagging
● Development, as well as debugging, is very easy in TBL because the learned rules
are easy to understand
● Complexity in tagging is reduced because, in TBL, there is a cross-connection
between machine-learned and human-generated rules

Drawbacks

Despite being a simple and somewhat effective approach to POS tagging, TBL has major
disadvantages.
● TBL algorithm training/learning time complexity is very high, and time increases
multi-fold when corpus size increases
● TBL does not provide tag probabilities

Hidden Markov Model POS Tagging: HMM


An HMM is a probabilistic sequence model: for POS tagging, given a sequence of words, it computes a probability distribution over possible sequences of POS labels and chooses the best label sequence.

This makes the HMM a good and reliable probabilistic approach to finding POS tags for a sequence of words.
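
As a quick illustration, NLTK ships a supervised HMM tagger that can be trained on a tagged corpus; the sketch below assumes NLTK is installed and the sample Penn Treebank corpus has been downloaded.

# Train a supervised HMM POS tagger on a slice of the Penn Treebank sample.
# Assumes: pip install nltk, plus nltk.download('treebank').
from nltk.corpus import treebank
from nltk.tag import hmm

train_sents = treebank.tagged_sents()[:3000]   # sequences of (word, tag) pairs
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_sents)

print(tagger.tag("I want to book that flight".split()))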

Reference:

1. https://www.peppercontent.io/blog/tracing-the-evolution-of-nlp/
2. https://www.geeksforgeeks.org/phases-of-natural-language-processing-nlp/
3. https://www.geeksforgeeks.org/major-challenges-of-natural-language-processing/
4. https://medium.com/womenintechnology/understanding-ambiguities-in-natural-language-processing-179212a23b55
5. https://bishalbose294.medium.com/nlp-text-encoding-a-beginners-guide-fa332d715854
6. https://www.analyticsvidhya.com/blog/2021/03/beginners-guide-to-regular-expressions-in-natural-language-processing/
7. https://iq.opengenus.org/lexicon-in-nlp/#google_vignette
8. https://www.geeksforgeeks.org/parts-of-speech/
9. https://www.scaler.com/topics/nlp/word-classes-and-part-of-speech-tagging-in-nlp/
10. https://ebooks.inflibnet.ac.in/engp13/chapter/phrase-structure-np/
11. https://www.kdnuggets.com/2018/08/understanding-language-syntax-and-structure-practitioners-guide-nlp-3.html
