
18CSE359T
NATURAL LANGUAGE PROCESSING

UNIT 1
SYLLABUS

o Introduction to Natural Language Processing

o Steps – Morphology – Syntax – Semantics

o Morphological Analysis (Morphological Parsing)

o Stemming – Lemmatization

o Parts of Speech Tagging

o Approaches on NLP Tasks (Rule-based, Statistical, Machine Learning)

o N-grams

o Multiword Expressions

o Collocations (Association Measures, Coefficients and Context Measures)

o Vector Representation of Words

o Language Modeling

INTRODUCTION TO NATURAL LANGUAGE PROCESSING
Language is a method of communication with the help of which we can speak, read and write.
Natural Language Processing (NLP) is a subfield of computer science and artificial intelligence (AI) that enables computers to understand and process human language. NLP lies at the intersection of computer science, linguistics and machine learning; the field focuses on communication between computers and humans in natural language, and is all about making computers understand and generate human language.

• Phonology – the study of the sound patterns of a language and of speech sounds as physical entities.
• Pragmatics – the study of the different uses of language in context.
• Morphology – the study of the structure of words and the systematic relations between them.
• Syntax – the study of the structure of sentences.
• Semantics – the study of the literal meaning of words, phrases and sentences.

HISTORY OF NLP
The study of natural language processing generally started in the 1950s, although some work can
be found from earlier periods.
In 1950, Alan Turing published an article titled “Computing Machinery and
Intelligence” which proposed what is now called the Turing test as a criterion of intelligence.
Turing test — developed by Alan Turing in 1950 — is a test of a machine’s ability to exhibit
intelligent behaviour equivalent to, or indistinguishable from, that of a human.

Some beneficial and successful natural language systems were developed in the 1960s. SHRDLU, a natural language system working in a restricted “blocks world” with a restricted vocabulary, was written between 1964 and 1966.

➢ Foundational Insights: 1940’s and 1950’s: Two foundational paradigms: the automaton and probabilistic or information-theoretic models.
• The Automaton: Turing’s work led to the McCulloch-Pitts neuron (McCulloch and
Pitts, 1943), a simplified model of the neuron as a kind of computing element that could
be described in terms of propositional logic, and then to the work of Kleene (1951, 1956) on
finite automata and regular expressions. Shannon (1948) also contributed to automata theory,
applying probabilistic models of discrete Markov processes to
automata for language. Drawing the idea of a finite-state Markov process from Shannon’s
work, Chomsky (1956) first considered finite-state machines as a way to characterize a
grammar, and defined a finite-state language as a language generated by a finite-state
grammar. This includes the context-free grammar, first defined by Chomsky (1956) for
natural languages but independently discovered by Backus (1959) and Naur et al. (1960)
in their descriptions of the ALGOL programming language.
• The second foundational insight of this period was the development of Probabilistic
Algorithms for speech and language processing, which dates to Shannon’s other
contribution: the metaphor of the noisy channel and decoding for the transmission of
language through media like communication channels and speech acoustics.

➢ The Two Camps: 1957–1970: Speech and language processing had split very
cleanly into two paradigms: symbolic and stochastic.
• The Symbolic Paradigm took off from two lines of research. The first was the work of
Chomsky and others on formal language theory and generative syntax throughout the late
1950’s and early to mid 1960’s, and the work of many linguists and computer scientists
on parsing algorithms, initially top-down and bottom-up, and then via dynamic
programming.
• The Stochastic Paradigm took hold mainly in departments of statistics and of electrical
engineering. By the late 1950’s the Bayesian method was beginning to be applied to
the problem of optical character recognition. Bledsoe and Browning (1959) built a
Bayesian system for text-recognition that used a large dictionary and computed the
likelihood of each observed letter sequence given each word in the dictionary by
multiplying the likelihoods for each letter.

➢ Four Paradigms: 1970–1983:


• The Stochastic Paradigm played a huge role in the development of speech recognition
algorithms in this period, particularly the use of the Hidden Markov Model and the
metaphors of the noisy channel and decoding, developed independently by Jelinek, Bahl,
Mercer, and colleagues at IBM’s Thomas J. Watson Research Center, and Baker at
Carnegie Mellon University, who was influenced by the work of Baum and colleagues at
the Institute for Defense Analyses in Princeton.
• The logic-based paradigm was begun by the work of Colmerauer and his colleagues on
Q-systems and metamorphosis grammars (Colmerauer, 1970, 1975), the forerunners of
Prolog and Definite Clause Grammars (Pereira and Warren, 1980).
• The Natural Language Understanding field took off during this period, beginning with
Terry Winograd’s SHRDLU system which simulated a robot embedded in a world of toy
blocks (Winograd, 1972a).
• The discourse modeling paradigm focused on four key areas in discourse. Grosz and
her colleagues proposed ideas of discourse structure and discourse focus (Grosz, 1977a;
Sidner, 1983a), a number of researchers began to work on automatic reference resolution
(Hobbs, 1978a), and the BDI (Belief-Desire-Intention) framework for logic-based work
on speech acts was developed (Perrault and Allen, 1980; Cohen and Perrault, 1979).

➢ Empiricism and Finite State Models Redux: 1983-1993:


• The first class was finite-state models, which began to receive attention again after work
on finite-state phonology and morphology by (Kaplan and Kay, 1981) and finite-state
models of syntax by Church (1980).
• The second trend in this period was what has been called the ‘return of empiricism’.

➢ The Field Comes Together: 1994-1999:


• First, probabilistic and data-driven models had become quite standard throughout natural
language processing.
• Second, the increases in the speed and memory of computers had allowed commercial
exploitation of a number of subareas of speech and language processing, in particular
speech recognition and spelling and grammar checking.
• Finally, the rise of the Web emphasized the need for language-based information retrieval
and information extraction.

APPLICATIONS OF NLP
• Search Autocorrect and Autocomplete: Whenever you search something on Google,
after typing 2-3 letters, it shows you the possible search terms. Or, if you search for
something with typos, it corrects them and still finds relevant results for you. Isn’t it
amazing?
• Language Translator: Have you ever used Google Translate to find out what a
particular word or phrase is in a different language? I’m sure it’s a YES!!

• Social Media Monitoring: More and more people these days have started using social
media for posting their thoughts about a particular product, policy, or matter.
• Chatbots: Customer service and experience is the most important thing for any
company.
• Survey Analysis: Surveys are an important way of evaluating a company’s performance.
Companies conduct many surveys to get customer’s feedback on various products.
• Targeted Advertising: One day I was searching for a mobile phone on Amazon, and a
few minutes later, Google started showing me ads related to similar mobile phones on
various webpages. I am sure you have experienced it.
• Hiring and Recruitment: The Human Resource department is an integral part of every
company. They have the most important job of selecting the right employees for a
company.
• Voice Assistants: I am sure you’ve already met them, Google Assistant, Apple Siri,
Amazon Alexa, ring a bell? Yes, all of these are voice assistants.
• Grammar Checkers: This is one of the most widely used applications of natural
language processing.
• Email Filtering: Have you ever used Gmail?
• Sentiment Analysis: Natural language understanding is particularly difficult for
machines when it comes to opinions, given that humans often use sarcasm and irony.
• Text Classification: Text classification, a text analysis task that also includes sentiment
analysis, involves automatically understanding, processing, and categorizing unstructured
text.
• Text Extraction: Text extraction, or information extraction, automatically detects
specific information in a text, such as names, companies, places, and more. This is also
known as named entity recognition.
• Machine Translation: Machine translation (MT) is one of the first applications of
natural language processing.
• Text Summarization: There are two ways of using natural language processing to
summarize data: extraction-based summarization ‒ which extracts keyphrases and creates
a summary, without adding any extra information ‒ and abstraction-based
summarization, which creates new phrases paraphrasing the original source.
• Market Intelligence: Marketers can benefit from natural language processing to learn
more about their customers and use those insights to create more effective strategies.
• Intent Classification: Intent classification consists of identifying the goal or purpose that
underlies a text.

• Urgency Detection: NLP techniques can also help you detect urgency in text. You
can train an urgency detection model using your own criteria, so it can recognize certain
words and expressions that denote gravity or discontent.
• Text-based applications: Text-based applications involve the processing of written text,
such as books, newspapers, reports, manuals, e-mail messages, and so on. These are all
reading-based tasks. Text-based natural language research is ongoing in applications such
as
o finding appropriate documents on certain topics from a data-base of texts (for
example, finding relevant books in a library)
o extracting information from messages or articles on certain topics (for example,
building a database of all stock transactions described in the news on a given day)
o translating documents from one language to another (for example, producing
automobile repair manuals in many different languages)
o summarizing texts for certain purposes (for example, producing a 3-page summary of
a 1000-page government report)
• Dialogue-based applications involve human-machine communication. Most naturally
this involves spoken language, but it also includes interaction using keyboards. Typical
potential applications include
o question-answering systems, where natural language is used to query a database (for
example, a query system to a personnel database)
o automated customer service over the telephone (for example, to perform banking
transactions or order items from a catalogue)
o tutoring systems, where the machine interacts with a student (for example, an automated
mathematics tutoring system)
o spoken language control of a machine (for example, voice control of a VCR or computer)
o general cooperative problem-solving systems (for example, a system that helps a person
plan and schedule freight shipments)

STEPS – MORPHOLOGY – SYNTAX – SEMANTICS


➢ Morphological Processing
It is the first phase of NLP. The purpose of this phase is to break chunks of language
input into sets of tokens corresponding to paragraphs, sentences and words. For example,
a word like “uneasy” can be broken into two sub-word tokens as “un-easy”.
➢ Syntax Analysis
It is the second phase of NLP. The purpose of this phase is twofold: to check whether a sentence is well formed, and to break it up into a structure that shows the syntactic relationships between the different words. For example, a sentence like “The school goes to the boy” would be rejected by the syntax analyzer or parser.

➢ Semantic Analysis
It is the third phase of NLP. The purpose of this phase is to extract the exact, or dictionary, meaning from the text. The text is checked for meaningfulness. For example, the semantic analyzer would reject a sentence like “Hot ice-cream”.
➢ Pragmatic Analysis
It is the fourth phase of NLP. Pragmatic analysis simply fits the actual objects/events,
which exist in a given context with object references obtained during the last phase
(semantic analysis). For example, the sentence “Put the banana in the basket on the
shelf” can have two semantic interpretations and pragmatic analyzer will choose between
these two possibilities.

MORPHOLOGY: MORPHOLOGICAL ANALYSIS (MORPHOLOGICAL PARSING)
Morphology is the study of the internal structure of words. Morphology focuses on how the
components within a word (stems, root words, prefixes, suffixes, etc.) are arranged or modified
to create different meanings. English, for example, often adds "-s" or "-es" to the end of count
nouns to indicate plurality, and a "-d" or "-ed" to a verb to indicate past tense. The suffix “-ly” is
added to adjectives to create adverbs (for example, “happy” [adjective] and “happily” [adverb]).

Morphology is the study of the way words are built up from smaller meaning units, morphemes.
A morpheme is often defined as the minimal meaning-bearing unit in a language. So for example
the word fox consists of a single morpheme (the morpheme fox) while the word cats consists of
two: the morpheme cat and the morpheme -s. As this example suggests, it is often useful to
distinguish two broad classes of morphemes: stems and affixes.
Affixes are further divided into prefixes, suffixes, infixes, and circumfixes. Prefixes precede
the stem, suffixes follow the stem, circumfixes do both, and infixes are inserted inside the stem.
For example, the word eats is composed of a stem eat and the suffix -s. The word unbuckle is
composed of a stem buckle and the prefix un-.
Prefixes and suffixes are often called concatenative morphology since a word is composed of a
number of morphemes concatenated together. A number of languages have extensive non-
concatenative morphology, in which morphemes are combined in more complex ways.
Another kind of non-concatenative morphology is called templatic morphology or root-and-pattern morphology.

➢ METHODS OF MORPHOLOGY
• Morpheme-Based Morphology: words are analyzed as arrangements of morphemes; the morpheme is the basic unit of analysis.
• Lexeme-Based Morphology: usually takes what is called an “item-and-process” approach. Instead of analyzing a word form as a set of morphemes arranged in sequence, a word form is said to be the result of applying rules that alter a word form or stem in order to produce a new one.
• Word-Based Morphology: usually a word-and-paradigm approach. Instead of stating rules to combine morphemes into word forms or to generate word forms from stems, it states generalizations that hold between the forms of inflectional paradigms.

➢ INFLECTIONAL, DERIVATIONAL
Inflectional morphology combines a word stem with a grammatical morpheme, usually yielding a word of the same class as the stem (e.g. cat → cats, walk → walked). Derivational morphology combines a word stem with a morpheme to form a new word, often of a different class and with a different meaning (e.g. happy → happily, compute → computation).

➢ CLITICIZATION
In morphosyntax, cliticization is a process by which a complex word is formed by attaching
a clitic to a fully inflected word.
Clitic: a morpheme that acts like a word but is reduced and attached to another word.
I've, l'opera

➢ NONCONCATENATIVE MORPHOLOGY
• Vowel Harmony: Vowel harmony is a type of assimilation in which the vowels in a
morpheme (e.g. an affix) are assimilated to vowels in another morpheme (e.g. the word
stem). Vowel harmony is a non-concatenative morphological process and it is unclear
whether an NMT model trained with subword segmentation can learn to generate the
correct vowels for rare or unseen words.
• Reduplication: Reduplication is another nonconcatenative morphological process in
which the whole word (full reduplication) or a part of a word (partial reduplication) is
repeated exactly or with a slight change. In some cases, the repetition can also occur
twice (triplication). Reduplication often marks features such as plurality, intensity or size,
depending on the language and raises the same generalisation question as vowel
harmony.

As noted above, prefixes and suffixes form concatenative morphology, since a word is composed of a number of morphemes concatenated together, while a number of languages have extensive non-concatenative morphology, in which morphemes are combined in more complex ways. The Tagalog infixation example is one instance of non-concatenative morphology, since two morphemes (hingi and um) are intermingled. Another kind of non-concatenative morphology is called templatic morphology or root-and-pattern morphology.

➢ MORPHOLOGICAL ANALYSIS
Morphological analysis:
token → lemma + part of speech + grammatical features
Examples:
cats → cat+N+plur
played → play+V+past
katternas → katt+N+plur+def+gen
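A minimal sketch of such an analysis, assuming the spaCy library and its small English model en_core_web_sm are installed (the morphological feature labels are spaCy’s own conventions):

import spacy

nlp = spacy.load("en_core_web_sm")            # small English pipeline
for token in nlp("The cats played happily"):
    # token -> lemma + part of speech + grammatical features
    print(token.text, token.lemma_, token.pos_, token.morph)
# e.g. cats -> cat NOUN Number=Plur ; played -> play VERB Tense=Past|VerbForm=Fin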

SYNTAX
Syntactic analysis, or parsing, checks the text against the rules of formal grammar and recovers its grammatical structure; judging meaningfulness is left to semantic analysis. For example, a sentence like “hot ice-cream” would be accepted by the parser but rejected by the semantic analyzer.
In this sense, syntactic analysis or parsing may be defined as the process of analyzing the strings
of symbols in natural language conforming to the rules of formal grammar. The origin of the
word ‘parsing’ is from Latin word ‘pars’ which means ‘part’.
Example: “The dog (noun phrase) went away (verb phrase).”

➢ CONTEXT FREE GRAMMAR


Context free grammars are also called Phrase-Structure Grammars, and the formalism is
equivalent to Backus-Naur Form, or BNF.
A context-free grammar consists of a set of rules or productions, each of which expresses the
ways that symbols of the language can be grouped and ordered together, and a lexicon of words
and symbols. For example, the following productions express that an NP (or noun phrase) can be
composed of either a ProperNoun or a determiner (Det) followed by a Nominal; a Nominal in
turn can consist of one or more Nouns.
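Written out, such productions look like this (an illustrative rule set in the usual notation):

NP      → Det Nominal
NP      → ProperNoun
Nominal → Noun
Nominal → Nominal Noun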

Context-free rules can be hierarchically embedded, so we can combine the previous rules with
others, like the following, that express facts about the lexicon:
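For instance (illustrative lexical entries, assumed for the example):

Det  → a
Det  → the
Noun → flight
Noun → nightclub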

The symbols that are used in a CFG are divided into two classes. The symbols that correspond to
words in the language (“the”, “nightclub”) are called terminal symbols; the lexicon is the set of
rules that introduce these terminal symbols. The symbols that express abstractions over these
terminals are called non-terminals. In each context-free rule, the item to the right of the arrow
(→) is an ordered list of one or more terminals and non-terminals; to the left of the arrow is a
single non-terminal symbol expressing some cluster or generalization. The non-terminal
associated with each word in the lexicon is its lexical category, or part of speech.
A CFG can be thought of in two ways: as a device for generating sentences and as a device for
assigning a structure to a given sentence. Viewing a CFG as a generator, we can read the →
arrow as “rewrite the symbol on the left with the string of symbols on the right”.
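With the illustrative rules above, one such sequence of rewrites is:

NP ⇒ Det Nominal ⇒ Det Noun ⇒ a flight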

We say the string a flight can be derived from the non-terminal NP. Thus, a CFG can be used to
generate a set of strings. This sequence of rule expansions is called a derivation of the string of
words. It is common to represent a derivation by a parse tree (commonly shown inverted with the
root at the top).
The formal language defined by a CFG is the set of strings that are derivable from the designated
start symbol.

Sentences (strings of words) that can be derived by a grammar are in the formal language defined
by that grammar, and are called grammatical sentences. Sentences that cannot be derived by a
given formal grammar are not in the language defined by that grammar and are referred to as
ungrammatical.

Formal Definition of Context-Free Grammar


A context-free grammar G is defined by four parameters: N, Σ, R, S (technically this is a “4-
tuple”).
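Spelled out, the four parameters are (the standard definition):

N : a set of non-terminal symbols (e.g. NP, Nominal)
Σ : a set of terminal symbols (the words), disjoint from N
R : a set of rules or productions of the form A → β, where A is a non-terminal and β is a string of symbols from (Σ ∪ N)*
S : a designated start symbol, with S ∈ N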

➢ PARSING TECHNIQUE
Just as, for FSAs, recognition can be viewed as a search through the space of all possible paths through the automaton, in syntactic parsing the parser can be viewed as searching through the space of all possible parse trees to find the correct parse tree for the sentence.

The goal of a parsing search is to find all trees whose root is the start symbol S, which cover
exactly the words in the input.

Two search strategies underlying most parsers: top-down or goal-directed search and bottom-up
or data-directed search.

➢ TOP-DOWN PARSING
A top-down parser searches for a parse tree by trying to build from the root node S down to the
leaves. The algorithm starts by assuming the input can be derived by the designated start symbol
S. The next step is to find the tops of all trees which can start with S, by looking for all the
grammar rules with S on the left-hand side.

➢ BOTTOM-UP PARSING
Bottom-up parsing is the earliest known parsing algorithm (it was first suggested by Yngve
(1955)), and is used in the shift-reduce parsers common for computer languages.

SEMANTICS
Semantic Analysis is a subfield of Natural Language Processing (NLP) that attempts to
understand the meaning of Natural Language.
➢ Parts of Semantic Analysis
Semantic Analysis of Natural Language can be classified into two broad parts:
1. Lexical Semantic Analysis: Lexical Semantic Analysis involves understanding the meaning
of each word of the text individually. It basically refers to fetching the dictionary meaning that
a word in the text is deputed to carry.
2. Compositional Semantics Analysis: Although knowing the meaning of each word of the
text is essential, it is not sufficient to completely understand the meaning of the text.
For example, consider the following two sentences:
• Sentence 1: Students love GeeksforGeeks.
• Sentence 2: GeeksforGeeks loves Students.
Although both these sentences 1 and 2 use the same set of root words {student, love,
geeksforgeeks}, they convey entirely different meanings.

➢ Tasks involved in Semantic Analysis


1. Word Sense Disambiguation
2. Relationship Extraction
1. Word Sense Disambiguation:
In Natural Language, the meaning of a word may vary as per its usage in sentences and the
context of the text. Word Sense Disambiguation involves interpreting the meaning of a word
based upon the context of its occurrence in a text.
For example, the word ‘Bark’ may mean ‘the sound made by a dog’ or ‘the outermost layer of
a tree.’
Likewise, the word ‘rock’ may mean ‘a stone‘ or ‘a genre of music‘ – hence, the accurate
meaning of the word is highly dependent upon its context and usage in the text.
2. Relationship Extraction:
It involves firstly identifying various entities present in the sentence and then extracting the
relationships between those entities.
For example, consider the following sentence:
Semantic Analysis is a topic of NLP which is explained on the GeeksforGeeks blog. The
entities involved in this text, along with their relationships, are shown below.

➢ Elements of Semantic Analysis


• Hyponymy: Hyponymy refers to a term that is an instance of a generic term. The relation can be understood using a class-object analogy. For example: ‘Color‘ is a hypernym while ‘grey‘, ‘blue‘, ‘red‘, etc., are its hyponyms.
• Homonymy: Homonymy refers to two or more lexical terms with the same
spellings but completely distinct in meaning. For example: ‘Rose‘ might mean ‘the

past form of rise‘ or ‘a flower‘ – same spelling but different meanings; hence,
‘rose‘ is a homonym.
• Synonymy: When two or more lexical terms that might be spelt distinctly have the
same or similar meaning, they are called Synonymy. For example: (Job,
Occupation), (Large, Big), (Stop, Halt).
• Antonymy: Antonymy refers to a pair of lexical terms that have contrasting
meanings – they are symmetric to a semantic axis. For example: (Day, Night), (Hot,
Cold), (Large, Small).
• Polysemy: Polysemy refers to lexical terms that have the same spelling but multiple
closely related meanings. It differs from homonymy because the meanings of the
terms need not be closely related in the case of homonymy. For example: ‘man‘
may mean ‘the human species‘ or ‘a male human‘ or ‘an adult male human‘ – since
all these different meanings bear a close association, the lexical term ‘man‘ is a
polysemy.
• Meronomy: Meronomy refers to a relationship wherein one lexical term is a
constituent of some larger entity. For example: ‘Wheel‘ is a meronym of
‘Automobile‘

➢ Meaning Representation
While, as humans, it is pretty simple for us to understand the meaning of textual information, it
is not so in the case of machines. Thus, machines tend to represent the text in specific formats
in order to interpret its meaning. This formal structure that is used to understand the meaning
of a text is called meaning representation.
Basic Units of Semantic System:
1. Entity: An entity refers to a particular unit or individual in specific such as a person or
a location. For example GeeksforGeeks, Delhi, etc.
2. Concept: A Concept may be understood as a generalization of entities. It refers to a
broad class of individual units. For example Learning Portals, City, Students.
3. Relations: Relations help establish relationships between various entities and concepts.
For example: ‘GeeksforGeeks is a Learning Portal’, ‘Delhi is a City.’, etc.
4. Predicate: Predicates represent the verb structures of the sentences.

Approaches to Meaning Representations:


1. First-order predicate logic (FOPL)
2. Semantic Nets
3. Frames
4. Conceptual dependency (CD)
5. Rule-based architecture
6. Case Grammar
7. Conceptual Graphs

STEMMING
Stemming is the process of reducing an inflected word to its word stem by stripping affixes such as suffixes and prefixes; unlike a lemma, the resulting stem need not be a valid dictionary word.
➢ The Porter Stemmer

The algorithm applies its rewrite rules in a series of steps (Step 1, Steps 2a and 2b, Step 5, and so on); the example outputs of these steps, and their shortcomings, motivate the move from stemming to lemmatization.

Types of Stemmer
1. Porter Stemmer – PorterStemmer(): Martin Porter invented the Porter Stemmer or Porter
algorithm in 1980. Five steps of word reduction are used in the method, each with its own set of
mapping rules. Porter Stemmer is the original stemmer and is renowned for its ease of use and
rapidity. Frequently, the resultant stem is a shorter word with the same root meaning.
Example:
Connects ---> connect
Connecting ---> connect
Connections ---> connect
Connected ---> connect
Connection ---> connect
Connectings ---> connect
Connect ---> connect
2. Snowball Stemmer – SnowballStemmer(): Martin Porter also created Snowball Stemmer.
The method utilized in this instance is more precise and is referred to as “English Stemmer” or
“Porter2 Stemmer.” It is somewhat faster and more logical than the original Porter Stemmer.
generous ---> generous
generate ---> generat
generously ---> generous
generation ---> generat
3. Lancaster Stemmer – LancasterStemmer(): Lancaster Stemmer is straightforward, although
it often produces results with excessive stemming. Over-stemming renders stems non-linguistic
or meaningless.
eating ---> eat
eats ---> eat
eaten ---> eat


puts ---> put
putting ---> put
4. Regexp Stemmer – RegexpStemmer(): Regex stemmer identifies morphological affixes
using regular expressions. Substrings matching the regular expressions will be discarded.
mass ---> mas
was ---> was
bee ---> bee
computer ---> computer
advisable ---> advis
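A short sketch of the four stemmers above using NLTK (assuming nltk is installed; the outputs shown in comments are what the corresponding NLTK classes produce):

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

print(PorterStemmer().stem("connections"))            # connect
print(SnowballStemmer("english").stem("generously"))  # generous
print(LancasterStemmer().stem("eating"))              # eat
# RegexpStemmer strips substrings matching the given pattern (min word length 4)
print(RegexpStemmer("ing$|s$|e$|able$", min=4).stem("advisable"))  # advis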

LEMMATIZATION
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected
forms of a word so they can be analysed as a single item, identified by the word's lemma, or
dictionary form.
In many languages, words appear in several inflected forms. For example, in English, the verb 'to
walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might
look up in a dictionary, is called the lemma for the word. The association of the base form with a
part of speech is often called a lexeme of the word.
For instance:
• The word "better" has "good" as its lemma. This link is missed by stemming, as it
requires a dictionary look-up.
• The word "walk" is the base form for the word "walking", and hence this is matched in
both stemming and lemmatisation.
• The word "meeting" can be either the base form of a noun or a form of a verb ("to
meet") depending on the context; e.g., "in our last meeting" or "We are meeting again
tomorrow". Unlike stemming, lemmatisation attempts to select the correct lemma
depending on the context.
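A minimal sketch with NLTK’s WordNet lemmatizer (assuming nltk and its WordNet data are available; the pos argument is the WordNet part-of-speech tag):

import nltk
nltk.download("wordnet")                     # one-time download of the WordNet data

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(wnl.lemmatize("better", pos="a"))      # good    (adjective)
print(wnl.lemmatize("walking", pos="v"))     # walk    (verb)
print(wnl.lemmatize("meeting", pos="n"))     # meeting (noun reading)
print(wnl.lemmatize("meeting", pos="v"))     # meet    (verb reading)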

PARTS OF SPEECH TAGGING

POS Tagging (Parts of Speech Tagging) is a process to mark up the words in a text for a
particular part of speech based on their definition and context. It is responsible for text reading in
a language and assigning some specific token (Parts of Speech) to each word. It is also called
grammatical tagging.
Input: Everything to permit us.

Output: [(‘Everything’, NN),(‘to’, TO), (‘permit’, VB), (‘us’, PRP)]
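A quick sketch of the same tagging with NLTK (assuming the tokenizer and tagger data have been downloaded; resource names can vary slightly across NLTK versions):

import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

from nltk import word_tokenize, pos_tag
print(pos_tag(word_tokenize("Everything to permit us.")))
# [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('.', '.')]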

NLTK POS Tags Examples are as below:

Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ adjective (large)


JJR adjective, comparative (larger)
JJS adjective, superlative (largest)
LS list item marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun plural (desks)
NNP proper noun, singular (sarah)
NNPS proper noun, plural (indians or americans)
PDT predeterminer (all, both, half)
POS possessive ending (parent’s)
PRP personal pronoun (hers, herself, him, himself)
PRP$ possessive pronoun (her, his, mine, my, our )
RB adverb (occasionally, swiftly)
RBR adverb, comparative (greater)
RBS adverb, superlative (biggest)
RP particle (about)
TO infinitive marker (to)
UH interjection (goodbye)
VB verb (ask)
VBG verb gerund (judging)
VBD verb past tense (pleaded)
VBN verb past participle (reunified)
VBP verb, present tense not 3rd person singular(wrap)
VBZ verb, present tense with 3rd person singular (bases)
WDT wh-determiner (that, what)
WP wh- pronoun (who)
WRB wh- adverb (how)

APPROACHES ON NLP TASKS (RULE-BASED, STATISTICAL, MACHINE LEARNING)
1. RULE-BASED
The rule-based POS tagging models apply a set of handwritten rules and use contextual
information to assign POS tags to words. These rules are often known as context frame
rules. One such rule might be: “If an ambiguous/unknown word ends with the suffix ‘ing’
and is preceded by a Verb, label it as a Verb”.

Rule-based taggers use dictionary or lexicon for getting possible tags for tagging each word. If
the word has more than one possible tag, then rule-based taggers use hand-written rules to
identify the correct tag. Disambiguation can also be performed in rule-based tagging by
analyzing the linguistic features of a word along with its preceding as well as following words.
For example, suppose the preceding word of a word is an article; then the word must be a noun.
As the name suggests, all such kind of information in rule-based POS tagging is coded in the
form of rules. These rules may be either −
• Context-pattern rules
• Or, as Regular expression compiled into finite-state automata, intersected with lexically
ambiguous sentence representation.
We can also understand Rule-based POS tagging by its two-stage architecture −
• First stage − In the first stage, it uses a dictionary to assign each word a list of potential
parts-of-speech.
• Second stage − In the second stage, it uses large lists of hand-written disambiguation rules to narrow the list down to a single part-of-speech for each word.

Properties of Rule-Based POS Tagging


• These taggers are knowledge-driven taggers.
• The rules in Rule-based POS tagging are built manually.
• The information is coded in the form of rules.
• There is a limited number of rules, approximately around 1000.
• Smoothing and language modeling is defined explicitly in rule-based taggers.

2. STATISTICAL
Statistical learning theory is a framework for machine learning drawing from the fields
of statistics and functional analysis.
Statistical analysis is the process of collecting and analyzing data in order to discern patterns and
trends. It is a method for removing bias from evaluating data by employing numerical analysis.
This technique is useful for collecting the interpretations of research, developing statistical
models, and planning surveys and studies.

Types of Statistical Analysis


• Descriptive Analysis
Descriptive statistical analysis involves collecting, interpreting, analyzing, and summarizing data
to present them in the form of charts, graphs, and tables. Rather than drawing conclusions, it
simply makes the complex data easy to read and understand.
• Inferential Analysis
The inferential statistical analysis focuses on drawing meaningful conclusions on the basis of the
data analyzed. It studies the relationship between different variables or makes predictions for the
whole population.
• Predictive Analysis
Predictive statistical analysis is a type of statistical analysis that analyzes data to derive past
trends and predict future events on the basis of them. It uses machine learning algorithms, data
mining, data modelling, and artificial intelligence to conduct the statistical analysis of data.
• Prescriptive Analysis
The prescriptive analysis conducts the analysis of data and prescribes the best course of action
based on the results. It is a type of statistical analysis that helps you make an informed decision.
• Exploratory Data Analysis
Exploratory analysis is similar to inferential analysis, but the difference is that it involves
exploring the unknown data associations. It analyzes the potential relationships within the data.
• Causal Analysis
The causal statistical analysis focuses on determining the cause and effect relationship between
different variables within the raw data. In simple words, it determines why something happens
and its effect on other variables. This methodology can be used by businesses to determine the
reason for failure.

Zipf's law
Zipf's law states that the frequency of a token in a text is inversely proportional to its rank in the frequency-sorted list. This law describes how tokens are distributed in languages: a few tokens occur very frequently, some occur with intermediate frequency, and many tokens occur rarely.

Hidden Markov Models


A Hidden Markov Model is a set of states (lexical categories in our case) with directed edges
labeled with transition probabilities that indicate the probability of moving to the state at the end
of the directed edge, given that one is now in the state at the start of the edge. The states are also
labeled with a function which indicates the probabilities of outputting different symbols if in that
state (while in a state, one outputs a single symbol before moving to the next state). In our case,
the symbol output from a state/lexical category is a word belonging to that lexical category.
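Put as a worked equation (the standard bigram HMM tagger over words w_1..w_n and tags t_1..t_n):

best tag sequence = argmax over t_1..t_n of  Π_i P(w_i | t_i) · P(t_i | t_(i-1))

where P(t_i | t_(i-1)) are the transition probabilities on the edges and P(w_i | t_i) are the output (emission) probabilities of the states.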

3. MACHINE LEARNING
Low-Level
• Tokenization: ML + Rules
• PoS Tagging: Machine Learning
• Chunking: Rules
• Sentence Boundaries: ML + Rules
• Syntax Analysis: ML + Rules
Mid-Level
• Entities: ML + Rules to determine “Who, What, Where”
• Themes: Rules “What’s the buzz?”
• Topics: ML + Rules “About this?”
• Summaries: Rules “Make it short”
• Intentions: ML + Rules “What are you going to do?”
• Intentions uses the syntax matrix to extract the intender, intendee, and intent
• We use ML to train models for the different types of intent
• We use rules to whitelist or blacklist certain words
• Multilayered approach to get you the best accuracy
High-Level
• Apply Sentiment: ML + Rules “How do you feel about that?”

N-GRAMS
An N-gram means a sequence of N words. So for example, “Medium blog” is a 2-gram (a
bigram), “A Medium blog post” is a 4-gram, and “Write on Medium” is a 3-gram (trigram).

The following sentences serve as the training corpus:


• Thank you so much for your help.
• I really appreciate your help.
• Excuse me, do you know what time it is?
• I’m really sorry for not inviting you.
• I really like your watch.
Suppose we’re calculating the probability of word “w1” occurring after the word “w2,” then the
formula for this is as follows:
count(w2 w1) / count(w2)
which is the number of times the words occur in the required sequence, divided by the number of times the word before the expected word occurs in the corpus.
From our example sentences, let’s calculate the probability of the word “like” occurring after the
word “really”:
count(really like) / count(really)
=1/3
= 0.33

Similarly, for the other two possibilities:


count(really appreciate) / count(really)
=1/3
= 0.33
count(really sorry) / count(really)
=1/3
= 0.33
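These counts can be reproduced with a few lines of Python over the five training sentences (a minimal sketch; tokenization is simple whitespace splitting):

from collections import Counter

corpus = [
    "Thank you so much for your help",
    "I really appreciate your help",
    "Excuse me do you know what time it is",
    "I'm really sorry for not inviting you",
    "I really like your watch",
]
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.lower().split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(w1, w2):
    # P(w1 | w2) = count(w2 w1) / count(w2)
    return bigrams[(w2, w1)] / unigrams[w2]

print(p("like", "really"))        # 1/3 = 0.33
print(p("appreciate", "really"))  # 1/3 = 0.33
print(p("sorry", "really"))       # 1/3 = 0.33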
P(w|h), the probability of a word w given some history h. Suppose the history h is “its water is so
transparent that” and we want to know the probability that the next word is the:

One way to estimate this probability is from relative frequency counts: take a very large corpus,
count the number of times we see its water is so transparent that, and count the number of times
this is followed by the. This would be answering the question “Out of the times we saw the
history h, how many times was it followed by the word w”, as follows:
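Written out as a relative frequency (the classic example from Jurafsky and Martin):

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)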

To estimate the probability of a word w given a history h, or the probability of an entire word
sequence.

The intuition of the n-gram model is that instead of computing the probability of a word given its
entire history, we can approximate the history by just the last few words.

Applications that can be implemented efficiently and effectively using sets of n‐grams
include spelling error detection and correction, query expansion, information retrieval with
serial, inverted and signature files, dictionary look‐up, text compression, and language
identification.

Language Model with N-Gram


N-grams are continuous sequences of words or symbols or tokens in a document. In
technical terms, they can be defined as the neighboring sequences of items in a document.
They come into play when we deal with text data in NLP (Natural Language Processing)
tasks.

N Term

1 Unigram

2 Bigram

3 Trigram

N n-gram

Example: “I reside in Bengaluru”

SL.No. Type of n-gram Generated n-grams

1 Unigram [“I”,”reside”,”in”, “Bengaluru”]

2 Bigram [“I reside”,”reside in”,”in Bengaluru”]

3 Trigram [“I reside in”, “reside in Bengaluru”]
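The same n-grams can be produced with NLTK’s ngrams helper (a small sketch, assuming nltk is installed):

from nltk.util import ngrams

tokens = "I reside in Bengaluru".split()
print(list(ngrams(tokens, 1)))  # unigrams
print(list(ngrams(tokens, 2)))  # bigrams, e.g. ('I', 'reside'), ('reside', 'in'), ...
print(list(ngrams(tokens, 3)))  # trigrams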

MULTIWORD EXPRESSIONS
Multi-word Expressions (MWEs) are word combinations with linguistic properties that cannot be
predicted from the properties of the individual words or the way they have been combined.
MWEs occur frequently and are usually highly domain-dependent. A proper treatment of MWEs
is essential for the success of NLP-systems.
A sequence, continuous or discontinuous, of words or other elements, which is or appears to be prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar.
• A language word - lexical unit in the language that stands for a concept.
e.g. train, water, ability
• However, that may not be true.
e.g. Prime Minister
Due to institutionalized usage, we tend to think of ‘Prime Minister’ as a single concept.
• Here the concept crosses word boundaries.
Simply put, a multiword expression (MWE):
a. crosses word boundaries
b. is lexically, syntactically, semantically, pragmatically and/or statistically idiosyncratic
E.g. traffic signal, Real Madrid, green card, fall asleep, leave a mark, ate up, figured out, kick the
bucket, spill the beans, ad hoc.

Idiosyncrasies
• Statistical idiosyncrasies
Usage of the multiword has been conventionalized, though it is still semantically decomposable
E.g. traffic signal, good morning
• Lexical idiosyncrasies
Lexical items generally not seen in the language, probably borrowed from other languages
E.g. ad hoc, ad hominem
• Syntactic idiosyncrasy
Conventional grammar rules don’t hold, these multiwords exhibit peculiar syntactic behavior

• Semantic Idiosyncrasy
The meaning of the multi word is not completely composable from those of its constituents
This arises from figurative or metaphorical usage
The degree of compositionality varies
E.g. blow hot and cold – keep changing opinions


spill the beans – reveal secret
run for office – contest for an official post.

MWE Characteristics
• Basis for MWE extraction
o Non-Compositionality
Non-decomposable – e.g. blow hot and cold
Partially decomposable – e.g. spill the beans
o Syntactic Flexibility
Can undergo inflections, insertions, passivizations
e.g. promise(d/s) him the moon
The more non-compositional the phrase, the less syntactically flexible it is
o Substitutability
MWEs resist substitution of their constituents by similar words
E.g. ‘many thanks’ cannot be expressed as ‘several thanks’ or ‘many
gratitudes’
o Institutionalization
Results in statistical significance of collocations
o Paraphrasability
Sometimes it is possible to replace the MWE by a single word
E.g. leave out replaced by omit
• Based on syntactic forms and compositionality
o Institutionalized Noun collocations
E.g. traffic signal, George Bush, green card
o Phrasal Verbs (Verb-Particle constructions)
E.g. call up, eat up
o Light verb constructions (V-N collocations)
E.g. fall asleep, give a demo
o Verb Phrase Idioms
E.g. sweep under the rug

COLLOCATIONS (ASSOCIATION MEASURES, COEFFICIENTS AND CONTEXT MEASURES)
Collocations are phrases or expressions containing multiple words, which are highly likely to co-
occur. For example — ‘social media’, ‘school holiday’, ‘machine learning’, ‘Universal Studios
Singapore’, etc.

A collocation is two or more words that often go together. These combinations just sound "right"
to native English speakers, who use them all the time. On the other hand, other combinations
may be unnatural and just sound "wrong". Look at these examples:
natural English... unnatural English...
the fast train the quick train
fast food quick food
a quick shower a fast shower
a quick meal a fast meal
Why learn collocations?
• Your language will be more natural and more easily understood.
• You will have alternative and richer ways of expressing yourself.
• It is easier for our brains to remember and use language in chunks or blocks rather than as
single words.

How to learn collocations


• Be aware of collocations, and try to recognize them when you see or hear them.
• Treat collocations as single blocks of language. Think of them as individual blocks or
chunks, and learn strongly support, not strongly + support.
• When you learn a new word, write down other words that collocate with it (remember
rightly, remember distinctly, remember vaguely, remember vividly).
• Read as much as possible. Reading is an excellent way to learn vocabulary and
collocations in context and naturally.
• Revise what you learn regularly. Practise using new collocations in context as soon as
possible after learning them.
• Learn collocations in groups that work for you. You could learn them by topic (time,
number, weather, money, family) or by a particular word (take action, take a
chance, take an exam).
• You can find information on collocations in any good learner's dictionary. And you can
also find specialized dictionaries of collocations.

Types of collocation
• adverb + adjective: completely satisfied (NOT downright satisfied)
• adjective + noun: excruciating pain (NOT excruciating joy)
• noun + noun: a surge of anger (NOT a rush of anger)
• noun + verb: lions roar (NOT lions shout)
• verb + noun: commit suicide (NOT undertake suicide)
• verb + expression with preposition: burst into tears (NOT blow up in tears)
• verb + adverb: wave frantically (NOT wave feverishly)

ASSOCIATION MEASURES
According to P. Pecina, the term “lexical association” refers to association between words.
Collocational association restricts combination of words into phrases. Based on statistical
interpretation of the data from a corpus one could estimate lexical associations automatically by
means of lexical association measures. These measures determine “the strength of association
between two or more words based on their occurrences and cooccurrences in a corpus”.
S. Evert described in his thesis formal and statistical prerequisites and also presented a
comprehensive repository of association measures. An explicit equation is given for each
measure, using a consistent notation in terms of observed and expected frequencies. New
approaches are suggested also to the study of association measures, with an emphasis on
empirical results and intuitive understanding.
It is only natural to assume that one of the ways to identify the stability of a word combination is
the frequency of their cooccurrence. However, the raw data – the frequency of cooccurrence of
word pairs – are not always meaningful. “Provided that both words are sufficiently frequent,
their co-occurrences might be pure coincidence. Therefore, a statistical interpretation of the
frequency data is necessary, which determines the degree of statistical association between the
words”.
For this purpose, association measures are applied, which assign a score to each word pair based
on the observed frequency data. The higher this score is, the stronger and more certain the
association between the two words. This score depends on a few factors such as co-occurrence
frequency, each word separate frequency, size of a corpus, maximum size of a window for
collocations, etc.

Simple Frequency-based Association Measures


In the simplest case, taking advantage of such property of collocations as recurrency (i.e.,
frequent usage in texts), we can count the number of occurrences of a bigram xy and estimate
their joint probability P(xy):
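A simple maximum-likelihood sketch of this estimate (with f(xy) the bigram frequency and N the total number of bigrams in the corpus; the exact normalization is assumed) is:

P(xy) = f(xy) / N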

If the bigram is used frequently, then it is probable that the two words are used together not by chance but comprise a collocation.
MI (mutual information) was introduced by K. Church and P. Hanks as a measure that compares the observed, context-bound co-occurrence frequency with the frequency that would be expected if the words occurred in the text independently, at random.
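The usual form of this measure (pointwise mutual information, in the notation of Church and Hanks) is:

MI(x, y) = log2 ( P(xy) / ( P(x) · P(y) ) )

i.e. the logarithm of the ratio between the observed joint probability and the probability expected if x and y were independent.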

t-score
The t-score measure “also takes into consideration the frequency of co-occurrence of a key word
and its collocate thus answering a question how random the association force is between the
collocates”
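A commonly used form of the t-score, in frequency notation (with f(x)·f(y)/N the co-occurrence frequency expected under independence; the exact formulation is assumed here), is:

t = ( f(xy) − f(x)·f(y)/N ) / sqrt( f(xy) )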

T-score expresses the certainty with which we can argue that there is an association between the
words, i.e. their co-occurrence is not random. The value is affected by the frequency of the whole
collocation which is why very frequent word combinations tend to reach a T-score high value
despite not being significant as real mental collocations.

log-likelihood ratio
“In statistics, a likelihood function (often simply the likelihood) is a function of the parameters of
a statistical model given data. Likelihood functions play a key role in statistical inference,
especially methods of estimating a parameter from a set of statistics. In informal contexts,
"likelihood" is often used as a synonym for "probability"” (Wikipedia). This measure is based on
“a ratio of two likelihood functions which correspond to two hypotheses — about random and
non-random nature of phrases”.

Multinomial Likelihood
This measure termed multinomial likelihood (ML) estimates the probability of the observed
contingency table point hypothesis assuming the multinomial sampling distribution:

Hypergeometric Likelihood

Binomial Likelihood

Poisson Likelihood
If we replace the binomial distribution with the Poisson distribution, this will increase the
computational efficiency and will provide results with a higher accuracy. In this case, the
corresponding association measure is called Poisson likelihood (PL) and is calculated as follows:

COEFFICIENTS MEASURES
The techniques within this approach measure the degree of association between the words x and
y in a bigram xy by estimating one of the coefficients of association strength from the observed
data.
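For reference, a sketch of the 2×2 contingency table (Table 1) in the standard layout used in the collocation literature, with a, b, c, d the observed frequencies:

                 y occurs       y does not occur
x occurs         a = f(x y)     b = f(x ¬y)
x does not       c = f(¬x y)    d = f(¬x ¬y)

so that f(x) = a + b, f(y) = a + c, and N = a + b + c + d.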

Odds Ratio
Odds ratio is the ratio of two probabilities, that is, the probability that a given event occurs and
the probability that this event does not occur. This ratio is calculated as ad/bc, where a, b, c, d
are taken from the contingency table presented.
The odds ratio is sometimes called the cross-product ratio because the numerator is based on
multiplying the value in cell a times the value in cell d, whereas the denominator is the product
of cell b and cell c. A line from cell a to cell d (for the numerator) and another from cell b to
cell c (for the denominator) creates an X or cross on the two-by-two table (Table 1).

Relative Risk
This measure estimates the strength of association of the words x and y in a bigram xy according
to the formula:

where a, b, c, and d are taken from the contingency table presented earlier(Table 1). The name of
the measure is explained by the fact that this metric is commonly used in medical evaluations,
in particular, epidemiology, to estimate the risk of having a disease related to the risk of
being exposed to this disease. However, it also can be applied to quantifying the association of
words in a word combination.
Relative risk is also called risk ratio because, in medical terms, it is the ratio of the risk of
having a disease if exposed to it divided by the risk of having a disease being unexposed to it.
This ratio of probabilities can also be used in measuring the relation between the probability of

a bigram xy being a collocation versus the probability of this bigram to be a free word
combination.

Liddell’s Difference of Proportions


This measure (LDP) is the maximum likelihood estimation for the difference of proportions
and is calculated according to the formula:

This metric has been applied to text statistics.

Minimum Sensitivity
Minimum sensitivity (MS) is an effective measure of association of the words x and y in a
bigram xy and has been used successfully in the collocation extraction task. This metric is
calculated according to the formula:
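In contingency-table notation, a sketch consistent with the description below is:

MS(x, y) = min( P(y|x), P(x|y) ) = min( a / (a + b), a / (a + c) )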

In fact, what this measure does is comparing two conditional probabilities P(y|x) and P(x|y)
and selecting the lesser value thus taking advantage of the notion of conditional probability.

Geometric Mean Coefficient


This association measure is calculated according to the formula:

Applied to the contingency table (see Table 1), the geometric mean is equal to the square root of
the heuristic 𝑀𝐼2 measure defined by the following formula:

Therefore, the geometric mean increases the influence of the co-occurrence frequency in the
numerator and avoids the overestimation for low frequency bigrams.

Dice Coefficient
This association measure (𝐷) is calculated according to the formula:
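In frequency and contingency-table notation, the standard form is:

D(x, y) = 2·f(xy) / ( f(x) + f(y) ) = 2a / ( 2a + b + c )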

This coefficient is one of the most common association measures used to detect collocations;
moreover, its performance happens to be higher than the performance of other association
measures.

Jaccard Coefficient
The Jaccard coefficient (𝐽) is monotonically related to the Dice coefficient and measures
similarity in asymmetric information on binary and non-binary variables. It is commonly applied
to measure similarity of two sets of data and is calculated as a ratio of the cardinality of the sets’
intersection divided by the cardinality of the sets’ union. It is also frequently used as a measure
of association between two terms in information retrieval. To estimate the relation between the
words 𝑥 and 𝑦 in a bigram 𝑥𝑦, the Jaccard coefficient is defined by the following formula:
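In contingency-table notation, it is:

J(x, y) = a / ( a + b + c )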

where the values of 𝑎, 𝑏, and 𝑐 are as given in the contingency table. (Table 1)

Confidence-Interval Estimate for Mutual Information


Point estimates of association between words in a phrase operate well for words which have
sufficiently high frequency; however, these metrics are not reliable when words or word
combinations have few occurrences in a corpus.
The confidence-interval estimate for mutual information (𝑀𝐼𝑐𝑜𝑛𝑓) is defined as:

CONTEXT MEASURES
Informally, a context is the general situation that explains why something happens.
Generally, a context is defined as a multiset (bag) of word types occurring within a predefined
distance (also called a context window) from any occurrence of a given bigram type or word type
(their tokens, more precisely) in the corpus.
The main idea of using this concept is to model the average context of an occurrence of the
bigram/word type in the corpus, i.e. the word types that typically occur in its neighborhood. Two
approaches can be used to represent the average context: estimating the probability distribution
of word types appearing in such a neighborhood, and the vector space model adopted from the
field of information retrieval.

In order to estimate the probability distribution P(Z|Ce) of word types z appearing in the context
Ce, this multiset is interpreted as a random sample obtained by sampling (with replacement) from
the population of all possible (basic) word types z ∈ U. The random sample consists of M
realizations of a (discrete) random variable Z representing the word type appearing in the context
Ce. The population parameters are the context occurrence probabilities of the word types z ∈ U.

These parameters can be estimated on the basis of the observed frequencies of word types z ∈ U
obtained from the random sample Ce by the following formula:

P(Z = z | Ce) ≈ f(z, Ce) / M

where f(z, Ce) is the number of occurrences of word type z in Ce and M is the total size of the sample.
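As a rough sketch (the corpus, window size, and function name below are invented for illustration), this maximum-likelihood estimate is simply the relative frequency of each word type in the context multiset Ce:

from collections import Counter

def context_distribution(tokens, target, window=3):
    # collect the multiset Ce of word types within +/- window tokens
    # around every occurrence of `target`, then normalise by its size M
    context = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context.update(tokens[lo:i] + tokens[i + 1:hi])
    m = sum(context.values())                       # M, the sample size
    return {z: f / m for z, f in context.items()}   # estimate of P(z | Ce)

tokens = "it is a nice evening and a good evening is rare".split()
print(context_distribution(tokens, "evening", window=2))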

VECTOR REPRESENTATION OF WORDS


Word embedding, or word vectorization, is a methodology in NLP for mapping words or phrases
from a vocabulary to corresponding vectors of real numbers, which are then used for word
prediction and for measuring word similarity/semantics.
The process of converting words into numbers is called vectorization.

Word embeddings help in the following use cases.


• Compute similar words
• Text classifications
• Document clustering/grouping
• Feature extraction for text classifications
• Natural language processing.

After the words are converted into vectors, we use techniques such as Euclidean distance or
cosine similarity to identify similar words.

Why Cosine Similarity

Counting common words or computing the Euclidean distance is the usual approach for matching
similar documents; both are based on the number of words the documents share. This approach is
unreliable because the number of common words can increase (for example, with document size)
even when the documents talk about different topics. To overcome this flaw, the cosine similarity
approach is used to find the similarity between the documents.

Mathematically, cosine similarity measures the cosine of the angle between two vectors (item1,
item2) projected in an N-dimensional vector space. Its advantage is that it can still judge two
documents as similar even when they are far apart by Euclidean distance (for example, because
they differ in length).

"The smaller the angle, the higher the similarity" — cosine similarity.

Let’s see an example.


1. Julie loves John more than Linda loves John
2. Jane loves John more than Julie loves John

The cosine of the angle between the two sentence vectors is 0.822, which is close to 1
(a small angle between the vectors).
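A minimal sketch of this computation with simple word-count vectors is shown below. The exact value depends on how the sentences are vectorized, so this count-based version is not guaranteed to reproduce the 0.822 quoted above, but it will print a value close to 1.

import math
from collections import Counter

def cosine(sentence1, sentence2):
    # build bag-of-words count vectors and compute the cosine of the angle
    v1, v2 = Counter(sentence1.lower().split()), Counter(sentence2.lower().split())
    dot = sum(v1[w] * v2[w] for w in set(v1) | set(v2))
    norm1 = math.sqrt(sum(n * n for n in v1.values()))
    norm2 = math.sqrt(sum(n * n for n in v2.values()))
    return dot / (norm1 * norm2)

s1 = "Julie loves John more than Linda loves John"
s2 = "Jane loves John more than Julie loves John"
print(cosine(s1, s2))   # close to 1, i.e. the two sentences are very similar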

Now let us see the ways to convert words and sentences into vectors.
Word embeddings can be obtained from pre-trained methods such as:
• Word2Vec — From Google
• Fasttext — From Facebook
• GloVe — From Stanford

➢ Word2Vec
Word2Vec (word representations in vector space) was developed in 2013 by Tomas Mikolov and a
research team at Google.


Why the Word2Vec technique was created:
Most earlier NLP systems treat words as atomic units, so they have no notion of similarity
between words. Such systems are also simple models that work well only on relatively small
datasets of a few billion words or less.

To train more complex models on larger datasets, modern techniques use neural network
architectures, which outperform the earlier approaches on huge datasets with billions of words
and vocabularies of millions of words.

Word2Vec also makes it possible to measure the quality of the resulting vector representations:
similar words tend to be close to one another, and words can have multiple degrees of similarity.
Syntactic regularities: regularities in grammatical form, such as singular/plural or verb tense.
Semantic regularities: regularities in the meaning of the vocabulary symbols, such as
country–capital relationships.

Figure 2: Five Syntactic and Semantic word relationship test set.

With the proposed technique, it was found that the similarity of word representations goes beyond
syntactic regularities and that simple algebraic operations on word vectors work surprisingly
well. For example,

Vector("King") − Vector("Man") + Vector("Woman") ≈ Vector("Queen")

where "Queen" is the word whose vector is closest to the resulting vector.
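For illustration only (this assumes the gensim library is installed and that downloading the large pretrained Google News vectors is acceptable; the model name follows gensim's downloader conventions), the analogy can be checked like this:

import gensim.downloader as api

# load pretrained word2vec vectors (a large one-off download)
vectors = api.load("word2vec-google-news-300")

# King - Man + Woman: the nearest word to the resulting vector
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))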

The following model architectures for word representations aim to maximize accuracy while
minimizing computational complexity. The models are:


• FeedForward Neural Net Language Model (NNLM)

• Recurrent Neural Net Language Model (RNNLM)

All the above-mentioned models are trained using stochastic gradient descent and backpropagation.

FeedForward Neural Net Language Model (NNLM)


The NNLM model consists of input, projection, hidden and output layers. The computation between
the projection and the hidden layer is expensive because the values in the projection layer are
dense.

Recurrent Neural Net Language Model (RNNLM)


The RNN model can efficiently represent more complex patterns than a shallow neural network. It
has no projection layer; only input, hidden and output layers.

For huge datasets, models can be trained using a large-scale distributed framework
called DistBelief, which gives better results. Word2Vec proposes two new models,
• Continuous Bag-of-Words Model
• Continuous Skip-gram Model

both of which are designed to minimize computational complexity.

Continuous Bag-of-Words Model


The CBOW architecture is similar to the feedforward NNLM, where the non-linear hidden layer

is removed and the projection layer is shared for all the words; thus all words get projected into

the same position.

Figure 3: CBOW architecture.

CBOW architecture predicts the current word based on the context.

Continuous Skip-gram Model


The skip-gram model is similar to CBOW. The only difference is that, instead of predicting the
current word based on the context, it tries to maximize the classification of a word based on
another word in the same sentence.

Figure 4: Skip-gram architecture.

Skip-gram architecture predicts surrounding words given the current word.
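A toy training sketch is shown below, assuming the gensim library (parameter names follow gensim 4.x); the corpus here is far too small for useful vectors and is only meant to show how the two architectures are selected via the sg flag.

from gensim.models import Word2Vec

sentences = [["it", "is", "a", "nice", "evening"],
             ["good", "evening"],
             ["is", "it", "a", "nice", "evening"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # sg=1 -> skip-gram

print(cbow.wv.most_similar("evening", topn=2))   # neighbours under the CBOW model
print(skipgram.wv["evening"][:5])                # first dimensions of one learned vector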

GloVe:
This is another method for creating word embeddings. In this method, we iterate through the
corpus and record the co-occurrence of each word with the other words, which gives a
co-occurrence matrix. Words that occur next to each other get a value of 1; if they are one word
apart the value is 1/2, if two words apart 1/3, and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:
Corpus:

It is a nice evening.
Good Evening!
Is it a nice evening?
Co-occurrence matrix:
            it        is        a         nice      evening   good
it          0
is          1+1       0
a           1/2+1     1+1/2     0
nice        1/3+1/2   1/2+1/3   1+1       0
evening     1/4+1/3   1/3+1/4   1/2+1/2   1+1       0
good        0         0         0         0         1         0
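The counting step described above can be sketched in a few lines of Python (the function name and window size are mine, not part of GloVe itself; the real GloVe method then fits word vectors to these weighted counts):

from collections import defaultdict

def cooccurrence(sentences, window=4):
    counts = defaultdict(float)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(i + 1, min(i + window + 1, len(sent))):
                pair = tuple(sorted((word, sent[j])))   # symmetric matrix
                counts[pair] += 1.0 / (j - i)           # weight = 1 / distance
    return counts

corpus = [["it", "is", "a", "nice", "evening"],
          ["good", "evening"],
          ["is", "it", "a", "nice", "evening"]]
for pair, weight in sorted(cooccurrence(corpus).items()):
    print(pair, round(weight, 2))   # reproduces the cells of the matrix above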

Pre-trained Word Embedding Models:


People generally use pre-trained models for word embeddings. A few of them are listed below, followed by a short usage sketch:
• SpaCy
• fastText
• Flair etc.
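As one illustrative sketch (assuming spaCy is installed and the medium English model, which ships with word vectors, has been downloaded with "python -m spacy download en_core_web_md"):

import spacy

nlp = spacy.load("en_core_web_md")     # medium model includes word vectors
doc1 = nlp("It is a nice evening.")
doc2 = nlp("Good evening!")
print(doc1.similarity(doc2))           # cosine similarity of the averaged vectors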

Benefits of using Word Embeddings:


• They are much faster to train than hand-built models like WordNet (which uses graph
embeddings)
• Almost all modern NLP applications start with an embedding layer
• They store an approximation of meaning

Drawbacks of Word Embeddings:


• They can be memory intensive
• They are corpus dependent: any underlying bias will have an effect on your model
• They cannot distinguish between homophones, e.g. brake/break, cell/sell, weather/whether

LANGUAGE MODELING
A language model in NLP is a probabilistic statistical model that determines the probability of a
given sequence of words occurring in a sentence based on the previous words. It helps to predict
which word is more likely to appear next in the sentence. Hence it is widely used in predictive
text input systems, speech recognition, machine translation, spelling correction, etc. The input
to a language model is usually a training set of example sentences. The output is a probability
distribution over sequences of words. Depending on the requirements, we can condition on no
previous words (unigram model), the last one word (bigram model), the last two words (trigram
model), or, in general, the last n−1 words (n-gram model) to predict the next word.
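A minimal bigram language model estimated by relative frequency (no smoothing) can be sketched as follows; the toy training sentences and the sentence-boundary markers <s> and </s> are invented for the example.

from collections import Counter

sentences = [["<s>", "it", "is", "a", "nice", "evening", "</s>"],
             ["<s>", "good", "evening", "</s>"]]

bigrams, unigrams = Counter(), Counter()
for sent in sentences:
    unigrams.update(sent[:-1])            # history words only
    bigrams.update(zip(sent, sent[1:]))

def prob(prev, word):
    # P(word | prev) = count(prev word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(prob("nice", "evening"))   # 1.0 -- "evening" always follows "nice" here
print(prob("<s>", "good"))       # 0.5 -- half the training sentences start with "good"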

Types of Language Models


• Statistical Models:
Statistical language models are probabilistic models that predict the next word in a sequence
from the words that precede it. There are multiple kinds of statistical language models used in
practice; for instance, n-gram, unigram, bidirectional and exponential models are all examples.
• Neural Language Models:

Neural language models are language models developed using neural networks. They help mitigate
the challenges that occur in classical language models and support complex tasks like speech
recognition or machine translation.

Examples of Language Models


• Speech Recognition:
Voice assistants like Siri, Alexa and Google Home are the biggest examples of the way language
models support machines in processing speech and audio commands.
• Machine Translation:

Further, Google Translate and Microsoft Translator are examples of language models helping
machines to translate words and text into various languages.
• Sentiment analysis:

Sentiment analysis is the process of identifying sentiments and attitudes on the basis of text.
NLP language models help businesses recognize their customers' intentions and attitudes from
text. For example, HubSpot's Service Hub analyzes sentiments and emotions using NLP language
models.
• Parsing Tools:

Parsing refers to analyzing sentences and words according to syntax and grammar rules. Language
models also enable features like spell-checking.
• Optical Character Recognition (OCR):

OCR is the use of machines to transform images of text into machine-encoded text. Moreover,
the image may be converted from a scanned document or picture. It is also an important function
that helps digitize old paper trails. Hence, it helps analyze and identify handwriting samples.
• Information Retrieval:

It refers to searching documents and files for information. It also includes regular searches for
documents and files and probing for metadata that leads to a document.