
21ML1601 – NLP Unit – 1 III Year / VI Semester AI&DS

UNIT I OVERVIEW AND LANGUAGE MODELLING 9

Overview: Origins and challenges of NLP – Language and Grammar – Processing Indian Languages – NLP Applications – Information Retrieval. Language Modeling: Various Grammar-based Language Models – Statistical Language Model.

1.1: Overview

Definition (NLP):

NLP stands for Natural Language Processing, a field at the intersection of computer science, human language, and artificial intelligence. It is the technology used by machines to understand, analyse, manipulate, and interpret spoken and written human language.

It helps developers to organize knowledge for performing tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.

Natural language processing uses artificial intelligence to process and interpret real-world input, spoken or written, in a form that a computer can comprehend, so that computers can understand natural language as humans do.

NLP is the driving force behind applications such as


 Virtual assistants
 Speech recognition
 Sentiment analysis
 Automatic text summarization
 Machine translation and much more

1.1.1: Components of NLP

There are two components of NLP:


1. Natural Language Understanding (NLU)
2. Natural Language Generation (NLG).

1. Natural Language Understanding (NLU)


 This involves transforming human language into a machine-readable format.
 It helps the machine to understand and analyse human language by extracting information such as entities, keywords, emotions, relations, and semantic roles from large volumes of text.
 NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.
 NLU involves the following tasks:
o Mapping the given input into a useful representation.
o Analysing different aspects of the language.

2. Natural Language Generation (NLG)


 It acts as a translator that converts computerized data into a natural language representation.
 It mainly involves text planning, sentence planning, and text realization.
 NLU is harder than NLG.

Difference between NLU and NLG

NLU: the process of reading and interpreting language; it produces non-linguistic outputs from natural language inputs.

NLG: the process of writing or generating language; it produces natural language outputs from non-linguistic inputs.

1.1.2: NLP Terminology


 Phonology − the study of how sounds are organized systematically.
 Morphology − the study of the formation and internal structure of words.
 Morpheme − the smallest meaningful unit of a language.
 Syntax − the study of the formation and internal structure of sentences.
 Semantics − the study of the meaning of sentences.
 Pragmatics − deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.
 Discourse − deals with how the immediately preceding sentence can affect the interpretation of the next sentence.
 World Knowledge − the general knowledge about the world.

1.1.3: Steps in NLP

There are five general steps:


1. Morphological Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis

1. Morphological Analysis

 The first phase of NLP is morphological (lexical) analysis.
 It is a simple word-level analysis: the study of the different forms of a word.
 The main operations are:
1. Tokenization: the process of converting a sequence of text into smaller parts, known as tokens.
E.g., John ate the pizza!
We extract the tokens, i.e., the words (and punctuation) separated by spaces:
John, ate, the, pizza, !
2. Stop word removal: It is a preprocessing step in NLP that involves
removing common, non-meaningful words like “the” and “and” from
text data.

For the above example, we do not want uninformative tokens such as punctuation, articles, and prepositions among the generated tokens; we keep only the meaningful words.
After removing the stopwords, we get the words:
John, ate, pizza

3. Stemming: the process of reducing words to their base form (root form/stem).
E.g.
John / John
Ate / eat
Pizza / pizza
Car, cars / car
Run, ran, running / run
Stemmer, stemming, stemmed / stem
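
A minimal sketch of these three operations using NLTK (assuming the nltk package and its 'punkt' and 'stopwords' data are installed). Note that a plain stemmer such as Porter leaves "ate" unchanged; mapping "ate" to "eat" needs lemmatization rather than stemming.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')        # tokenizer model (one-time download)
nltk.download('stopwords')    # stop word lists (one-time download)

sentence = "John ate the pizza!"

# 1. Tokenization: split the sentence into word and punctuation tokens.
tokens = word_tokenize(sentence)             # ['John', 'ate', 'the', 'pizza', '!']

# 2. Stop word removal: drop punctuation and common function words.
stop_words = set(stopwords.words('english'))
content = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
# ['John', 'ate', 'pizza']

# 3. Stemming: reduce each remaining word to its stem.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])    # ['john', 'ate', 'pizza']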

4. N-gram Language Model


An N-gram is a sequence of n consecutive words. Using an N-gram model, we can find the word with the highest probability of occurring next in a sequence, based on the previous words in the given text/sentence.
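
A minimal sketch of this idea using bigram (n = 2) counts; the toy corpus below is purely illustrative.

from collections import Counter, defaultdict

# Toy training corpus (illustrative only).
corpus = "john ate the pizza . john ate the apple . mary ate the pizza".split()

# Count, for each word, how often every other word follows it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    # Return the most frequent successor of `word` in the training data.
    return follows[word].most_common(1)[0][0] if follows[word] else None

print(predict_next("ate"))   # 'the'   ('the' follows 'ate' all 3 times)
print(predict_next("the"))   # 'pizza' ('pizza' follows 'the' 2 of 3 times)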

2. Syntactic Analysis (Parsing)

Syntactic analysis, also referred to as syntax analysis or parsing, is the process of analysing natural language against the rules of a formal grammar. It is a sentence-level analysis: it is used to check grammar and word arrangement, and it shows the relationships among the words.

Here we try to find out whether the given sentence is grammatically correct or not.

Example 1: Agra goes to the Poonam

In the real world, "Agra goes to the Poonam" does not make any sense, so this sentence is rejected by the syntactic analyzer.

Example 2:
John ate the apple

Ate the apple john

We follow certain rules to form a sentence; these rules are written as a context-free grammar (CFG). Based on the CFG, we create a parse tree for the sentence. If the parse tree is complete, the given sentence is grammatically correct; if the parse tree is incomplete, it is not.
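
A minimal sketch of this idea with NLTK's chart parser and a toy CFG (the grammar below is illustrative, covering only the example sentence):

import nltk

# A toy context-free grammar for "John ate the apple".
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'John' | Det N
VP -> V NP
Det -> 'the'
N -> 'apple'
V -> 'ate'
""")

parser = nltk.ChartParser(grammar)

for sentence in (["John", "ate", "the", "apple"],      # grammatical
                 ["ate", "the", "apple", "John"]):     # scrambled word order
    trees = list(parser.parse(sentence))
    print(" ".join(sentence), "->", "grammatical" if trees else "rejected")
    for tree in trees:
        print(tree)   # the complete parse tree, when one exists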

3. Semantic Analysis

Semantic analysis checks that sentences are meaningful. It analyses the arrangement of words, phrases, and clauses to determine the relationships between the terms in a specific context. This is a crucial task of natural language processing (NLP) systems.

Example:

She drank some milk

She drank some books.

To check for meaningfulness, morphological analysis is first performed to find the important tokens and their base forms; the meanings of those tokens are then combined and checked for plausibility ("drank some milk" is meaningful; "drank some books" is not).

4. Discourse Integration (Discourse Analysis)

Discourse analysis may be defined as the process of determining contextual information that is useful for performing other tasks.

Discourse integration means that the interpretation of a sentence depends upon the sentences that precede it and can also influence the meaning of the sentences that follow it.

Here we try to resolve references, i.e., to find out which earlier or later word a given word (such as a pronoun) refers to.


Example:

Monkeys eat bananas when they wake up.

What does "they" refer to here? The monkeys.

Monkeys eat bananas when they are ripe.

What does "they" refer to here? The bananas.

5. Pragmatic Analysis

Pragmatic analysis is the fifth and last phase of NLP.

Pragmatic Analysis deals with the overall communicative and social content
and its effect on interpretation. It means abstracting or deriving the
meaningful use of language in situations.

It helps you to discover the intended effect by applying a set of rules that
characterize cooperative dialogues.

For Example: "Open the door" is interpreted as a request instead of an


order.

1.1.4: Applications of NLP

The following are some of the applications of NLP:

1. Question Answering
Question Answering focuses on building systems that automatically
answer the questions asked by humans in a natural language.

2. Spam Detection
Spam detection is used to detect unwanted e-mails getting to a user's
inbox.

3. Sentiment Analysis
Sentiment analysis is also known as opinion mining. It is used on the web to analyse the attitude, behaviour, and emotional state of the sender. This application is implemented through a combination of NLP and statistics: values (positive, negative, or neutral) are assigned to the text in order to identify the mood of the context (happy, sad, angry, etc.).

4. Machine Translation
Machine translation is used to translate text or speech from one
natural language to another natural language.
Example: Google Translator

5. Spelling correction
Word processor software such as Microsoft Word and PowerPoint provides built-in spelling correction.

6. Speech Recognition
Speech recognition is used for converting spoken words into text. It is used in applications such as mobile devices, home automation, video retrieval, dictating to Microsoft Word, voice biometrics, voice user interfaces, and so on.

7. Chatbot
Implementing chatbots is one of the important applications of NLP. They are used by many companies to provide chat-based customer services.

8. Information extraction
Information extraction is one of the most important applications of
NLP. It is used for extracting structured information from unstructured or
semi-structured machine-readable documents.

9. Natural Language Understanding (NLU)


It converts large sets of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate.

1.2: Origins and Challenges of NLP – Language and Grammar

1.2.1: Human Vs Machine with regard to Language processing

For humans, learning in early childhood occurs in a consistent way: children interact with unstructured data and process that data into information. After amassing this information, we begin to analyze it in an attempt to understand its implications in a given situation or the nuance of a given problem. At a certain point, we have a learned understanding of our life and environment. Only after understanding the implications can the information be used to solve a set of problems or life situations. Humans iterate through multiple scenarios to consciously or unconsciously simulate whether a solution will be a success or failure. With practice, humans move along the progression: unstructured data -> information -> knowledge -> wisdom.

Machines learn by a similar method:


 Initially, the machine translates unstructured textual data into
meaningful terms
 Then identifies connections between those terms
 Finally comprehends the context.

Many technologies work together to process natural language; the most popular toolkits are Stanford CoreNLP, spaCy, AllenNLP, and NLTK, amongst others.

Figure 1.1: Working of NLP

1.2.2: CHALLENGES OF NLP:

Although natural language processing (NLP) is a highly advantageous technique, there are still a number of issues and limitations with NLP:
1. Contextual words and phrases and homonyms
2. Synonyms
3. Irony and sarcasm
4. Ambiguity
5. Errors in text or speech
6. Colloquialisms and slang
7. Domain-specific language
8. Low-resource languages
9. Lack of research and development

1. Contextual Words and Phrases and Homonyms

Many words, particularly in English, have the exact same pronunciation but
completely distinct meanings. The same words and phrases can have
different meanings depending on the sentence's context.

For example:
I ran to the store because we ran out of milk.
Can I run something past you real quick?
The house is looking really run down.
These are easy for humans to understand because we read the context of
the sentence and we understand all of the different definitions. And, while
NLP language models may have learned all of the definitions, differentiating
between them in context can present problems.

Homonyms – two or more words that are pronounced the same but have different definitions – can be problematic for question answering and speech-to-text applications, because the spoken form gives no clue as to which word is meant. Usage of their and there, for example, is even a common problem for humans.
2. Synonyms

Synonyms can lead to issues similar to contextual understanding because we use many different words to express the same idea. Furthermore, some of these words may convey exactly the same meaning, while some differ in degree or intensity (small, little, tiny, minute), and different people use synonyms to denote slightly different meanings within their personal vocabulary.

So, for building NLP systems, it’s important to include all of a word’s
possible meanings and all possible synonyms.
3. Irony and sarcasm

Irony and sarcasm present problems for machine learning models because
they generally use words and phrases that may be positive or negative, but
actually indicate the opposite.

Tweet: @Sony and @PlayStation said this would be the most accessible
console of them all. Yeah right.

Models can be trained with certain clues that frequently accompany ironic
or sarcastic phrases, like “yeah right,” “whatever,” etc., and word
embeddings (where words that have the same meaning have a similar
representation), but it’s still a tricky process.

4. Ambiguity

Ambiguity in NLP refers to sentences and phrases that potentially have two
or more possible interpretations.
Lexical ambiguity: a word that could be used as a verb, noun, or adjective.
Semantic ambiguity: the interpretation of a sentence in context. For example: I saw the boy on the beach with my binoculars. This could mean that I saw the boy through my binoculars, or that the boy had my binoculars with him.
Syntactic ambiguity: In the sentence above, this is what creates the
confusion of meaning. The phrase with my binoculars could modify the
verb, “saw,” or the noun, “boy.”
Even for humans this sentence alone is difficult to interpret without the
context of surrounding text. POS (part of speech) tagging is one NLP solution
that can help solve the problem, somewhat.
5. Errors in text and speech

Misspelled or misused words can create problems for text analysis. Autocorrect and grammar correction applications can handle common mistakes, but don't always understand the writer's intention.
With spoken language, mispronunciations, different accents, stutters, etc., can be difficult for a machine to understand. However, as language databases grow and smart assistants are trained by their individual users, these issues can be minimized.
6. Colloquialisms and slang

Informal phrases, expressions, idioms, and culture-specific lingo present a number of problems for NLP, especially for models intended for broad use. Unlike formal language, colloquialisms may have no "dictionary definition" at all, and these expressions may even have different meanings in different geographic areas. Furthermore, cultural slang is constantly morphing and expanding, so new words pop up every day.
7. Domain-specific language

Different businesses and industries often use very different language. An NLP model needed for healthcare, for example, would be very different from one used to process legal documents. These days, however, there are a number of analysis tools trained for specific fields, but extremely niche industries may need to build or train their own models.
8. Low-resource languages

AI and machine learning NLP applications have largely been built for the most common, widely used languages. However, many languages, especially those spoken by people with less access to technology, often go overlooked and under-processed. For example, there are over 3,000 languages in Africa alone, and there isn't very much data on many of these languages.
However, new techniques, like multilingual transformers (using Google’s
BERT “Bidirectional Encoder Representations from Transformers”) and
multilingual sentence embeddings aim to identify and leverage universal
similarities that exist between languages.
9. Lack of research and development

Machine learning requires A LOT of data to function to its outer limits – billions of pieces of training data. The more data NLP models are trained on, the smarter they become. That said, data (and human language!) is only growing by the day, as are new machine learning techniques and custom algorithms. All of the problems above will require more research and new techniques in order to improve on them.

1.3: Processing Indian Languages

A language is a systematic form of communication that can take a variety of forms. Since the Indian subcontinent has a multitude of languages, dialects, and writing styles spoken by more than a billion people, we need tools to work with them. Indian languages pose many challenges for NLP, such as ambiguity, complexity, language grammar, translation problems, and obtaining the correct data for NLP algorithms, and this creates a lot of opportunities for NLP projects in India.

Most Indians are multilingual and study more than one language in school.
The percentage of English speakers in India is…just 10%. That’s 10% of a
one billion-plus population!

Most Indians have Hindi as their first language, followed by Marathi, Telugu, Punjabi, etc. For a lot of people living in rural communities, English is not even a language they understand or speak.

Thus, there is a clear need to boost NLP research for Indian languages so
that such people who don’t know English can get “online” in the true sense
of the word, ask questions, in their mother tongue and get answers.

Figure 1.2: Languages by number of native speakers in India.

Text Processing for Indian Languages using Python

There are a handful of Python libraries we can use to perform text processing and build NLP applications for Indian languages.

Figure 1.3: NLP Libraries for Indian Languages

 Tokenization: - the process of converting a sequence of text into smaller parts.
 Word Embeddings: - converting words into their numerical representation.
 Text Completion: - the process of generating text by predicting and suggesting missing words in a given context.
 Similarity of Sentences: - a measure of how similar two pieces of text are, or to what degree they express the same meaning.
 Normalization: - a pre-processing step that improves the quality of the text by removing noise and unwanted data, making it suitable for machines.
 Transliteration: - the process of transferring a word from the alphabet of one language to another.
 Phonetic Analysis: - a branch of NLP which deals with how sounds are produced and with the analysis of sounds.
 Syllabification: - the process of dividing words into syllables. A syllable is a unit of sound that includes a vowel and any accompanying consonant sounds.

 Lemmatization: - the process of reducing words to their base or root form, known as the lemma. It helps to normalize the different inflected forms of a word to a common base, making it easier to analyse and compare words.
 Part of Speech (PoS) tagging: - assigning a specific part of speech (such as noun, verb, adjective, etc.) to each word in a given text. The goal is to understand the syntactic structure of a sentence and extract information about the roles of words.
 Named Entity Recognition (NER): - identifying and classifying entities such as names of people, organizations, locations, and dates within a given text.
 Dependency Parsing: - the process of analysing the grammatical structure of a sentence by identifying the relationships between words.

 iNLTK (Natural Language Toolkit for Indic Languages)


 iNLTK provides support for various NLP applications in Indic
languages.
 The languages supported are
 Hindi (hi),
 Punjabi (pa),
 Sanskrit (sa),
 Gujarati (gu),
 Kannada (kn),
 Malayalam (ml),
 Nepali (ne),
 Odia (or),
 Marathi (mr),
 Bengali (bn),
 Tamil (ta),
 Urdu (ur),
 English (en).

 iNLTK is similar to the NLTK Python package. It provides features for NLP tasks such as tokenisation and vector embeddings for input text, with an easy API interface.

 First install PyTorch (the CPU build), which iNLTK depends on:
pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

 Then install iNLTK using pip:
pip install inltk
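
A small usage sketch, following the iNLTK documentation (the first setup() call downloads the model for the chosen language, here Hindi):

from inltk.inltk import setup, tokenize

setup('hi')                        # one-time download of the Hindi model
text = "प्राकृतिक भाषा प्रसंस्करण"    # "natural language processing" in Hindi
print(tokenize(text, 'hi'))        # list of (subword) tokens for the input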

 Indic NLP Library:


 Indian languages share a lot of similarity in terms of script, phonology, language syntax, etc., which creates some common processing difficulties, and this library provides a general solution for them.
 The Indic NLP Library provides functionalities like text normalisation, script normalisation, tokenisation, word segmentation, romanisation, indicisation, script conversion, transliteration, and translation.
 Languages supported:
 Indo-Aryan: Assamese (asm), Bengali (ben), Gujarati (guj), Hindi/Urdu (hin/urd), Marathi (mar), Nepali (nep), Odia (ori), Punjabi (pan), Sindhi (snd), Sinhala (sin), Sanskrit (san), Konkani (kok).
 Dravidian: Kannada (kan), Malayalam (mal), Telugu (tel), Tamil (tam).
 Others: English (eng).
 Tasks handled:
 It handles bilingual tasks like script conversion among the languages mentioned above, except Urdu and English.
 This library supports languages like Konkani, Sindhi, Telugu, and some others which aren't supported by the iNLTK library.
 Transliteration amongst the 18 above-mentioned languages.
 Translation amongst ten languages.
 The library needs Python 2.7+, Indic NLP Resources (only for some modules), and the Morfessor 2.0 Python library.
 Installation:
pip install indic-nlp-library

 Next, download the resources folder which contains the models for
different languages.
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
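
A small script-conversion sketch following the library's documented initialization; the resources path below is illustrative and should point to the cloned indic_nlp_resources folder:

from indicnlp import common
common.set_resources_path('/path/to/indic_nlp_resources')   # illustrative path

from indicnlp import loader
loader.load()   # initialise the library's resources

from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

# Convert Hindi (Devanagari script) text into Tamil script.
hindi_text = 'नमस्ते'
print(UnicodeIndicTransliterator.transliterate(hindi_text, 'hi', 'ta'))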

 StanfordNLP:
 StanfordNLP contains tools which can be used to convert a string
containing human language text into lists of words and sentences.
 This library converts human language texts into lists in order to generate the base forms of words, their parts of speech and morphological features, and a syntactic structure dependency parse.
 This syntactic dependency parse is designed to be parallel across more than 70 languages, using the Universal Dependencies formalism.
 The library inherits additional functionality from the CoreNLP Java package, such as constituency parsing, linguistic pattern matching, and coreference resolution.
 The modules are built on top of PyTorch, and the package is a combination of software based on the Stanford entry in the CoNLL 2018 Shared Task on Universal Dependency Parsing and the Java Stanford CoreNLP software.
 StanfordNLP offers features like:
 Easy Native Python Implementation.
 Complete neural network pipeline for better and easy text analytics
which includes multi-word token (MWT) expansion, tokenisation,
parts-of-speech (POS), lemmatisation, morphological features
tagging and dependency parsing.
 Stable Python interface to CoreNLP.
 The neural network model has support for 53 human languages
featured in 73 treebanks.
 Install using pip,
pip install stanfordnlp
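
A minimal usage sketch, following the StanfordNLP documentation (the download() call fetches the English model once):

import stanfordnlp

stanfordnlp.download('en')              # one-time model download
nlp = stanfordnlp.Pipeline(lang='en')   # full neural pipeline

doc = nlp("John ate the apple.")
for sentence in doc.sentences:
    for word in sentence.words:
        # word text, universal POS tag, and index of its dependency head
        print(word.text, word.upos, word.governor)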

1.3.1: Top datasets for NLP (Indian languages)

 Semantic Relations from Wikipedia: contains automatically extracted semantic relations from the multilingual Wikipedia corpus.

 HC Corpora (Old Newspapers): a subset of the HC Corpora newspapers, containing around 16,806,041 sentences and paragraphs in 67 languages, including Hindi.

 Sentiment Lexicons for 81 Languages: contains positive and negative sentiment lexicons for 81 languages, including Hindi.

 IIT Bombay English-Hindi Parallel Corpus: contains a parallel corpus for English-Hindi as well as a monolingual Hindi corpus. This dataset was developed at the Center for Indian Language Technology.

 Indic Languages Multilingual Parallel Corpus: a parallel corpus covering 7 Indic languages (in addition to English): Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese, and Urdu.

 Microsoft Speech Corpus (Indian languages) (audio dataset): contains conversational and phrasal training and test data for Telugu, Gujarati, and Tamil.

 Hindi Speech Recognition Corpus (audio dataset): a corpus collected in India consisting of the voices of 200 different speakers from different regions of the country. It also contains 100 pairs of daily spontaneous conversational speech data.

1.4: NLP Applications: Information Retrieval

When we have large collections of data and want to query them in natural language, we use the concept of Information Retrieval.

Information Retrieval:- Information Retrieval (IR) is a field that deals with the representation, organization, storage, retrieval, and access of information from document repositories, particularly textual information. It involves the use of algorithms and techniques to find relevant information in a large dataset.

 The IR system assists the users in finding the information they require
but it does not explicitly return the answers to the question.
 It notifies regarding the existence and location of documents that
might consist of the required information.
 Information retrieval also extends support to users in browsing or
filtering document collection or processing a set of retrieved
documents.

The figure below shows how several processes are related with information
retrieval.

Figure 1.4: Various processes related with Information Retrieval

Natural Language Processing (NLP): the analysis of natural language, be it spoken or written, plus the applications thereof.

Natural Language Parsing (NL-parsing): refers to the initial stages of the analysis process, where character streams are processed. The parsing process can contain all tasks that are possibly performed at the lexical level, the morphological level, and the syntactic level.
Information Extraction (IE): is the process of extracting information from
texts. Information Extraction typically leads to the introduction of a
semantic interpretation of meaning based on the narrative under
consideration. However, since Information Extraction also includes the
“NL-parsing” stages, both IE and NL-parsing can point to the same
processes.
Information Retrieval (IR): is the usage of IE results for retrieving
information or documents from other sources. IR requires some
measurement or heuristics to estimate similarity between the extracted
information and other natural language (text) sources.
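
A minimal sketch of such a similarity measurement using TF-IDF vectors and cosine similarity with scikit-learn; the document collection and query below are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document collection (illustrative only).
docs = [
    "Information retrieval finds relevant documents in large collections.",
    "Machine translation converts text from one language to another.",
    "Speech recognition converts spoken words into text.",
]
query = "find relevant documents"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # one TF-IDF vector per document
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity to the query.
scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(round(score, 3), doc)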

When we are working with the NLP model, in many cases, we encounter
large collections of data that are unstructured. So, to work with such
unstructured large data collections and to fulfil our information need
(extraction of useful data), we use Information Retrieval.

Example: - Internet Searching


We can take the example of internet searching to understand the need for IR and its applications in NLP. Suppose we were still in the 90s and wanted to search for a particular thing on the internet. As there was only a limited number of pages and websites on the internet, the data could be kept structured, and because it was structured, searching took less space and less time.

Today, by contrast, we have a practically unlimited number of sources on the internet for any topic, and data at this scale cannot all be structured. The search space has therefore grown enormously over time, yet results are still expected quickly, so we use Information Retrieval techniques to deal with searching over such large data sets.

1.4.1: Natural Language Queries

 Natural Language Queries allow users to search for information in a more natural and intuitive way.
 Instead of typing in keywords, users can enter their queries in plain
English.
 However, natural language queries can be difficult to process, as they
often require sophisticated parsing and analysis of the user’s intent.
 Semantic search goes beyond keyword-based and natural language
queries to analyse the meaning of the search query and the
document’s content.
 It uses techniques such as Named Entity Recognition and Entity Linking to extract entities from the text and understand the relationships between them.
 This approach can provide more accurate results and a more
comprehensive understanding of the user’s intent.
 There are many tools and libraries available for building Information
Retrieval systems.
 Some of the most popular ones include,
1. Elasticsearch
2. Apache Solr
3. Lucene
 These tools provide a powerful set of features for indexing, searching
and retrieving documents.
 Other applications of Information Retrieval include Recommendations
Systems, Fraud Detection, and Text Mining.
 Recommendation Systems:- Use information retrieval techniques to
suggest products or content to users based on their interest and
preference.
 Fraud Detection Systems:- use information retrieval to identify
patterns and anomalies in financial transactions.
 Text Mining:- uses information retrieval to extract knowledge and
insights from large collections of unstructured text data.

1.5: Language Modeling: Various Grammar-based Language Models; Statistical Language Model

Language Model:- A language model in NLP is a probabilistic statistical model that determines the probability of a given sequence of words occurring in a sentence based on the previous words. It helps to predict which word is more likely to appear next in the sentence.

 Applications:- Language models are widely used in predictive text input systems, speech recognition, machine translation, spelling correction, etc.
 Input:- The input to a language model is usually a training set of example sentences.
 Output:- The output is a probability distribution over sequences of words. To predict the next word, we can condition on the previous word (bigram model), the previous two words (trigram model), or, in general, the previous n−1 words (n-gram model), as per our requirements.

What Language Models can do?

 Content generation: - This includes generating complete texts or parts of them based on the data and terms provided by humans.
 Part-of-Speech (PoS) tagging: - PoS tagging is the process of marking each word in a text with its corresponding part of speech, such as noun, verb, adjective, etc. The models are trained on large amounts of labelled text data and can learn to predict the POS of a word based on its context and the surrounding words in a sentence.

 Question Answering: - Language models can be trained to understand and answer questions, with or without given context.
 Text Summarization: - Language Models can be used to
automatically shorten documents, papers, podcasts, videos and more
into their important bites. Models can work in two ways: Extract the
most important information from the original text or provide
summaries that don’t repeat the original language.
 Sentiment Analysis: - Language models are a good option for sentiment analysis, as they can capture the tone of voice and the semantic orientation of texts.
 Conversational AI: - Language models are an inevitable part of
speech-enabled applications that require converting speech to text and
vice versa. As a part of conversational AI systems, language models
can provide relevant text responses to inputs.
 Machine translation: - The ability of ML-powered language models to
generalize effectively to long contexts has enabled them to enhance
machine translation. Instead of translating text word by word,
language models can learn the representations of input and output
sequences and provide robust results.
 Code completion: - Recent large-scale language models have
demonstrated an impressive ability to generate code, edit, and explain
code. However, they can complete only simple programming tasks by
translating instructions into code or checking it for errors.

What Language Models cannot do?


They can’t perform tasks that involve
 Common-sense knowledge,
 Understanding abstract concepts, and
 Making inferences based on incomplete information.
 They also lack the ability to understand the world as humans do,
and they can't make decisions or take actions in the physical world.

1.5.1: Various Grammar-based Language Models

Grammar-based language models in natural language processing (NLP) rely on explicit rules defined by linguistic grammars to analyze, understand, or generate natural language text. These models are based on formal grammatical rules that describe the syntax and structure of a language.

There are two primary types of grammar-based language models:

1. Rule-Based Models:

 Context-Free Grammars (CFG): Context-free grammars are a formalism used to describe the syntax of a language by defining a set of production rules. These rules specify how different parts of speech (such as nouns, verbs, adjectives, etc.) can be combined to form valid sentences. CFGs are one level of the Chomsky hierarchy, which categorizes grammars based on their generative power, ranging from regular grammars to context-sensitive grammars.
 Chomsky Normal Form (CNF): A simplified form of context-free
grammars where production rules are restricted to specific
formats.
 Probabilistic Context-Free Grammars (PCFG): Extends CFGs by assigning probabilities to grammar rules, enabling probabilistic parsing and generation of sentences (a small parsing sketch follows this list).
 Phrase Structure Grammar (PSG): PSG describes the structure of
sentences using hierarchical relationships among constituents. It
organizes words into phrases based on rules that define the
relationships between different linguistic elements.
 Transformational Grammar: This type of grammar emphasizes the
transformation of basic underlying structures to produce different
surface structures. Noam Chomsky's work on transformational
grammar was influential in the development of generative grammatical
models.
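
A minimal PCFG parsing sketch with NLTK's Viterbi parser (the toy grammar and its rule probabilities are illustrative):

import nltk

# A toy PCFG: each rule carries a probability; rules sharing a
# left-hand side sum to 1.0.
pcfg = nltk.PCFG.fromstring("""
S -> NP VP      [1.0]
NP -> 'John'    [0.5]
NP -> Det N     [0.5]
VP -> V NP      [1.0]
Det -> 'the'    [1.0]
N -> 'apple'    [1.0]
V -> 'ate'      [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse(["John", "ate", "the", "apple"]):
    print(tree)   # the most probable parse, annotated with its probability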

2. Dependency Grammar Models:

 Dependency Grammar: Unlike phrase structure grammar, dependency grammar focuses on the relationships between words in a sentence. It represents a sentence as a set of directed links between words, where each word depends on another word in the sentence. These dependencies help in understanding the grammatical structure and semantic relationships within the text.
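
A minimal dependency-parsing sketch using spaCy (assuming its small English model has been downloaded with: python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John ate the apple.")

# Print each word, its dependency relation, and the head it depends on,
# e.g. John --nsubj--> ate, apple --dobj--> ate.
for token in doc:
    print(token.text, "--" + token.dep_ + "-->", token.head.text)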

Advantages:

 Explicit Rules: They rely on clearly defined rules and structures, making it easier to interpret and analyze the generated text.
 Linguistic Interpretability: These models often provide a more
interpretable representation of language by capturing hierarchical
relationships and grammatical rules explicitly.
 Precision: Grammar-based models tend to be precise and follow
language rules strictly, which can be beneficial in certain domains
where accuracy and adherence to grammatical structures are crucial.

Limitations:

 Rule Complexity: Defining comprehensive grammatical rules for all possible linguistic nuances can be challenging.
 Limited Flexibility: They might struggle with informal language,
dialects, or instances where language rules are not strictly followed.
 Maintenance Effort: Maintaining and updating grammatical rules can
be labor-intensive.

In modern NLP, while statistical and neural models dominate due to their
ability to capture complex patterns in data, grammar-based approaches still
serve as valuable tools, especially in applications where adherence to
specific rules or linguistic structures is essential.

1.5.2: Statistical Language Model

What is statistical language modeling in NLP?

Statistical Language Modeling, or Language Modeling (LM for short), is the development of probabilistic models that can predict the next word in a sequence given the words that precede it.

Types of statistical language models:

There are various types of statistical language models, but two key categories are widely used:

1. N-gram Models: These models calculate the probability of a word based on the previous N-1 words in a sequence. The basic assumption is that the probability of a word depends only on a fixed number of preceding words (N). For instance:
 Bigram Model (2-gram): Estimates the probability of a word given the
preceding word. P(word | previous word)
 Trigram Model (3-gram): Considers the probability of a word given
the two preceding words. P(word | previous word, previous-previous
word)
 N-gram Model: Extends this concept to consider the probability of a
word given the previous N-1 words.
N-gram models are based on counting occurrences of word sequences in
a training corpus and estimating probabilities based on these counts.
They are relatively simple but effective in various NLP tasks such as
language modeling, speech recognition, and machine translation.
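
A minimal sketch of the counting step for a bigram model: the maximum-likelihood estimate is P(word | previous word) = count(previous word, word) / count(previous word). The toy corpus is illustrative:

from collections import Counter

# Toy training corpus (illustrative only).
tokens = "john ate the pizza . john ate the apple .".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p(word, prev):
    # MLE bigram probability: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]

print(p("the", "ate"))     # 2/2 = 1.0  ('the' always follows 'ate')
print(p("pizza", "the"))   # 1/2 = 0.5  ('the' is followed by 'pizza' or 'apple')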

2. Neural Network-Based Models: These models use neural networks, such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and transformer architectures, to learn distributed representations of words or subword units. Some prominent examples include:
a) Recurrent Neural Networks (RNNs): Process sequences by using
sequential information to predict the probability distribution of the
next word given the previous words in the sequence.
b) Long Short-Term Memory Networks (LSTMs): A type of RNN with
memory cells designed to better capture long-range dependencies in
sequences.

c) Transformer Models: Such as the GPT (Generative Pre-trained Transformer) series, which use self-attention mechanisms to model context and dependencies across sequences more effectively.
Neural network-based language models have shown significant
advancements in NLP tasks like language generation, text
summarization, machine translation, and contextual word embeddings
due to their ability to capture complex relationships within sequences.

Applications of Statistical Models:

 Speech Recognition: Voice assistants such as Siri and Alexa are examples of how language models help machines in processing speech audio.
 Machine Translation: Google Translator and Microsoft Translate are
examples of how NLP models can help in translating one language to
another.
 Sentiment Analysis: This helps in analyzing sentiments behind a
phrase. This use case of NLP models is used in products that allow
businesses to understand a customer’s intent behind opinions or
attitudes expressed in the text. Hubspot’s Service Hub is an example
of how language models can help in sentiment analysis.
 Text Suggestions: Google services such as Gmail or Google Docs use
language models to help users get text suggestions while they compose
an email or create long text documents, respectively.
 Parsing Tools: Parsing involves analyzing sentences or words that
comply with syntax or grammar rules. Spell checking tools are perfect
examples of language modelling and parsing.

Drawbacks of statistical language modeling:

 Zero probabilities
Suppose we have a trigram language model that conditions on two previous words and has a vocabulary of 10,000 words. Then there are 10¹² possible triplets. If our training data has 10¹⁰ words, many triples will never be observed in the training data, and thus basic MLE will assign zero probability to those events. A zero probability translates to infinite perplexity. To overcome this issue, many techniques have been developed under the family of smoothing techniques (a small sketch of add-one smoothing follows this list).
 Exponential Growth
The second challenge is that the number of n-grams grows as the nth power of the vocabulary size. A 10,000-word vocabulary has 10¹² trigrams, and a 100,000-word vocabulary has 10¹⁵ trigrams.
 Generalization
The last issue with MLE techniques is the lack of generalization. If the model sees the term 'white horse' in the training data but does not see 'black horse', MLE will assign zero probability to 'black horse'. (Thankfully, it will assign zero probability to 'purple horse' as well.)
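
A minimal sketch of add-one (Laplace) smoothing for the zero-probability problem described above, reusing bigram counts over a toy corpus (illustrative data):

from collections import Counter

tokens = "john ate the pizza . john ate the apple .".split()
V = len(set(tokens))   # vocabulary size (here 6)

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_laplace(word, prev):
    # Add 1 to every bigram count so unseen pairs get a small,
    # non-zero probability instead of zero.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_laplace("the", "ate"))     # seen bigram:   (2 + 1) / (2 + 6)
print(p_laplace("pizza", "john"))  # unseen bigram: (0 + 1) / (2 + 6) > 0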
