NLP Notes Unit 1
18CSE359T
NATURAL LANGUAGE PROCESSING
UNIT 1
SYLLABUS
o Stemming – Lemmatization
o N-grams
o Multiword Expressions
o Language Modeling
• Phonology – This science deals with the patterns present in sounds and speech, treating
sound as a physical entity.
• Pragmatics – This science studies the different uses of language.
• Morphology – This science deals with the structure of the words and the systematic
relations between them.
• Syntax – This science deals with the structure of the sentences.
• Semantics – This science deals with the literal meaning of the words, phrases as well as
sentences.
HISTORY OF NLP
The study of natural language processing generally started in the 1950s, although some work can
be found from earlier periods.
In 1950, Alan Turing published an article titled “Computing Machinery and
Intelligence” which proposed what is now called the Turing test as a criterion of intelligence.
Turing test – developed by Alan Turing in 1950, is a test of a machine’s ability to exhibit
intelligent behaviour equivalent to, or indistinguishable from, that of a human.
In 1950, Alan Turing published an article titled "Machine and Intelligence" which advertised
what is now called the Turing test as a subfield of intelligence.
Some notably successful natural language systems developed in the 1960s were SHRDLU, a
natural language system working in restricted “blocks worlds” with restricted vocabularies, and
ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between
1964 and 1966.
➢ The Two Camps: 1957–1970: Speech and language processing had split very
cleanly into two paradigms: symbolic and stochastic.
• The Symbolic Paradigm took off from two lines of research. The first was the work of
Chomsky and others on formal language theory and generative syntax throughout the late
1950’s and early to mid 1960’s, and the work of many linguists and computer scientists
on parsing algorithms, initially top-down and bottom-up, and then via dynamic
programming.
• The Stochastic Paradigm took hold mainly in departments of statistics and of electrical
engineering. By the late 1950’s the Bayesian method was beginning to be applied to
the problem of optical character recognition. Bledsoe and Browning (1959) built a
Bayesian system for text-recognition that used a large dictionary and computed the
likelihood of each observed letter sequence given each word in the dictionary by
multiplying the likelihoods for each letter.
• The stochastic paradigm also gave rise to the metaphors of the noisy channel and
decoding, developed independently by Jelinek, Bahl, Mercer, and colleagues at IBM’s
Thomas J. Watson Research Center, and by Baker at Carnegie Mellon University, who was
influenced by the work of Baum and colleagues at the Institute for Defense Analyses in
Princeton.
• The logic-based paradigm was begun by the work of Colmerauer and his colleagues on
Q-systems and metamorphosis grammars (Colmerauer, 1970, 1975), the forerunners of
Prolog and Definite Clause Grammars (Pereira and Warren, 1980).
• The Natural Language Understanding field took off during this period, beginning with
Terry Winograd’s SHRDLU system which simulated a robot embedded in a world of toy
blocks (Winograd, 1972a).
• The discourse modeling paradigm focused on four key areas in discourse. Grosz and
her colleagues proposed ideas of discourse structure and discourse focus (Grosz, 1977a;
Sidner, 1983a), a number of researchers began to work on automatic reference resolution
(Hobbs, 1978a), and the BDI (Belief-Desire-Intention) framework for logic-based work
on speech acts was developed (Perrault and Allen, 1980; Cohen and Perrault, 1979).
APPLICATIONS OF NLP
• Search Autocorrect and Autocomplete: Whenever you search something on Google,
after typing 2-3 letters, it shows you the possible search terms. Or, if you search for
something with typos, it corrects them and still finds relevant results for you. Isn’t it
amazing?
• Language Translator: Have you ever used Google Translate to find out what a
particular word or phrase is in a different language? I’m sure it’s a YES!!
• Social Media Monitoring: More and more people these days have started using social
media for posting their thoughts about a particular product, policy, or matter.
• Chatbots: Customer service and experience are among the most important things for any
company.
• Survey Analysis: Surveys are an important way of evaluating a company’s performance.
Companies conduct many surveys to get customers’ feedback on various products.
• Targeted Advertising: One day I was searching for a mobile phone on Amazon, and a
few minutes later, Google started showing me ads related to similar mobile phones on
various webpages. I am sure you have experienced it.
• Hiring and Recruitment: The Human Resource department is an integral part of every
company. They have the most important job of selecting the right employees for a
company.
• Voice Assistants: I am sure you’ve already met them, Google Assistant, Apple Siri,
Amazon Alexa, ring a bell? Yes, all of these are voice assistants.
• Grammar Checkers: This is one of the most widely used applications of natural
language processing.
• Email Filtering: Have you ever used Gmail?
• Sentiment Analysis: Natural language understanding is particularly difficult for
machines when it comes to opinions, given that humans often use sarcasm and irony.
• Text Classification: Text classification, a text analysis task that also includes sentiment
analysis, involves automatically understanding, processing, and categorizing unstructured
text.
• Text Extraction: Text extraction, or information extraction, automatically detects
specific information in a text, such as names, companies, places, and more. This is also
known as named entity recognition.
• Machine Translation: Machine translation (MT) is one of the first applications of
natural language processing.
• Text Summarization: There are two ways of using natural language processing to
summarize data: extraction-based summarization ‒ which extracts keyphrases and creates
a summary, without adding any extra information ‒ and abstraction-based
summarization, which creates new phrases paraphrasing the original source.
• Market Intelligence: Marketers can benefit from natural language processing to learn
more about their customers and use those insights to create more effective strategies.
• Intent Classification: Intent classification consists of identifying the goal or purpose that
underlies a text.
• Urgency Detection: NLP techniques can also help you detect urgency in text. You
can train an urgency detection model using your own criteria, so it can recognize certain
words and expressions that denote gravity or discontent.
• Text-based applications: Text-based applications involve the processing of written text,
such as books, newspapers, reports, manuals, e-mail messages, and so on. These are all
reading-based tasks. Text-based natural language research is ongoing in applications such
as
o finding appropriate documents on certain topics from a data-base of texts (for
example, finding relevant books in a library)
o extracting information from messages or articles on certain topics (for example,
building a database of all stock transactions described in the news on a given day)
o translating documents from one language to another (for example, producing
automobile repair manuals in many different languages)
o summarizing texts for certain purposes (for example, producing a 3-page summary of
a 1000-page government report)
• Dialogue-based applications involve human-machine communication. Most naturally
this involves spoken language, but it also includes interaction using keyboards. Typical
potential applications include
o question-answering systems, where natural language is used to query a database (for
example, a query system to a personnel database)
o automated customer service over the telephone (for example, to perform banking
transactions or order items from a catalogue)
o tutoring systems, where the machine interacts with a student (for example, an automated
mathematics tutoring system)
o spoken language control of a machine (for example, voice control of a VCR or computer)
o general cooperative problem-solving systems (for example, a system that helps a person
plan and schedule freight shipments)
➢ Semantic Analysis
It is the third phase of NLP. The purpose of this phase is to draw exact meaning, or you
can say dictionary meaning from the text. The text is checked for meaningfulness. For
example, semantic analyzer would reject a sentence like “Hot ice-cream”.
➢ Pragmatic Analysis
It is the fourth phase of NLP. Pragmatic analysis simply fits the actual objects/events,
which exist in a given context with object references obtained during the last phase
(semantic analysis). For example, the sentence “Put the banana in the basket on the
shelf” can have two semantic interpretations and pragmatic analyzer will choose between
these two possibilities.
Morphology is the study of the way words are built up from smaller meaning units, morphemes.
A morpheme is often defined as the minimal meaning-bearing unit in a language. So for example
the word fox consists of a single morpheme (the morpheme fox) while the word cats consists of
two: the morpheme cat and the morpheme -s. As this example suggests, it is often useful to
distinguish two broad classes of morphemes: stems and affixes.
Affixes are further divided into prefixes, suffixes, infixes, and circumfixes. Prefixes precede
the stem, suffixes follow the stem, circumfixes do both, and infixes are inserted inside the stem.
For example, the word eats is composed of a stem eat and the suffix -s. The word unbuckle is
composed of a stem buckle and the prefix un-.
Prefixes and suffixes are often called concatenative morphology since a word is composed of a
number of morphemes concatenated together. A number of languages have extensive non-
concatenative morphology, in which morphemes are combined in more complex ways.
Another kind of non-concatenative morphology is called templatic morphology or
root-and-pattern morphology.
➢ METHODS OF MORPHOLOGY
• Morpheme Based Morphology: Words are analyzed as arrangements of morphemes; this is
often called an “item-and-arrangement” approach.
• Lexeme Based Morphology: Lexeme-based morphology usually takes what is called an
“item-and-process” approach. Instead of analyzing a word form as a set of morphemes arranged
in sequence, a word form is said to be the result of applying rules that alter a word form or stem
in order to produce a new one.
• Word Based Morphology: Word-based morphology is usually a word-and-paradigm
approach. Instead of stating rules to combine morphemes into word forms, or to generate word
forms from stems, word-based morphology states generalizations that hold between the forms
of inflectional paradigms.
➢ INFLECTIONAL, DERIVATIONAL
Inflectional morphology combines a word stem with a grammatical morpheme, usually resulting
in a word of the same class as the original stem (for example, cat → cats). Derivational
morphology combines a word stem with a grammatical morpheme, usually resulting in a word of
a different class, often with a meaning that is harder to predict (for example, computerize →
computerization).
➢ CLITICIZATION
In morphosyntax, cliticization is a process by which a complex word is formed by attaching
a clitic to a fully inflected word.
Clitic: a morpheme that acts like a word but is reduced and attached to another word.
I've, l'opera
➢ NONCONCATENATIVE MORPHOLOGY
• Vowel Harmony: Vowel harmony is a type of assimilation in which the vowels in a
morpheme (e.g. an affix) are assimilated to vowels in another morpheme (e.g. the word
stem). Vowel harmony is a non-concatenative morphological process and it is unclear
whether an NMT model trained with subword segmentation can learn to generate the
correct vowels for rare or unseen words.
• Reduplication: Reduplication is another nonconcatenative morphological process in
which the whole word (full reduplication) or a part of a word (partial reduplication) is
repeated exactly or with a slight change. In some cases, the repetition can also occur
twice (triplication). Reduplication often marks features such as plurality, intensity or size,
depending on the language and raises the same generalisation question as vowel
harmony.
Prefixes and suffixes are called concatenative morphology since a word is composed of a
number of morphemes concatenated together. A number of languages have extensive non-
concatenative morphology, in which morphemes are combined in more complex ways. The
Tagalog infixation (for example, the infix um inserted into the stem hingi ‘borrow’ to form
humingi) is one example of non-concatenative morphology, since two morphemes (hingi and
um) are intermingled. Another kind of non-concatenative morphology is
called templatic morphology or root-and-pattern morphology.
➢ MORPHOLOGICAL ANALYSIS
Morphological analysis:
token → lemma + part of speech + grammatical features
Examples:
cats → cat+N+plur
played → play+V+past
katternas → katt+N+plur+def+gen
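As a rough illustration of this token to lemma-plus-features mapping, here is a minimal Python sketch built on a tiny hand-written lexicon; the dictionary entries and the analyze function are hypothetical and only meant to show the shape of the output, not a real morphological analyzer.

```python
# Minimal sketch of morphological analysis: map a token to
# lemma + part of speech + grammatical features using a tiny
# hand-written lexicon (hypothetical entries, for illustration only).

LEXICON = {
    "cats":   ("cat",  "N", ["plur"]),
    "cat":    ("cat",  "N", ["sing"]),
    "played": ("play", "V", ["past"]),
    "plays":  ("play", "V", ["pres", "3sg"]),
}

def analyze(token):
    """Return 'lemma+POS+feat1+feat2...' for a known token."""
    entry = LEXICON.get(token.lower())
    if entry is None:
        return token + "+UNKNOWN"
    lemma, pos, feats = entry
    return "+".join([lemma, pos] + feats)

if __name__ == "__main__":
    for tok in ["cats", "played", "plays"]:
        print(tok, "->", analyze(tok))   # e.g. cats -> cat+N+plur
```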
SYNTAX
Syntactic analysis or parsing or syntax analysis is the second phase of NLP. The purpose of this
phase is to analyze how the words of a sentence are arranged and how they relate to one another.
Syntax analysis checks the text for grammaticality against the rules of formal grammar. For
example, a sentence like “The school goes to boy” would be rejected by a syntactic analyzer.
In this sense, syntactic analysis or parsing may be defined as the process of analyzing the strings
of symbols in natural language conforming to the rules of formal grammar. The origin of the
word ‘parsing’ is from Latin word ‘pars’ which means ‘part’.
Example: “The dog (noun phrase) went away (verb phrase).”
Context-free rules can be hierarchically embedded, so we can combine the previous rules with
others, like the following, that express facts about the lexicon:
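The rules referred to here are not reproduced in the notes; the sketch below shows, with NLTK's CFG class, what such structural rules and lexicon rules typically look like. The particular grammar and words are illustrative assumptions, not the exact rules from the original figure.

```python
import nltk

# A tiny context-free grammar: the upper rules express structural
# facts (S -> NP VP), while the lower rules form the lexicon that
# introduces terminal symbols such as "the" and "dog".
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N
    VP  -> V | V PP
    PP  -> P NP
    Det -> 'the' | 'a'
    N   -> 'dog' | 'park'
    V   -> 'went' | 'barked'
    P   -> 'to'
""")

print(grammar.start())            # the designated start symbol S
for rule in grammar.productions()[:4]:
    print(rule)                   # S -> NP VP, NP -> Det N, ...
```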
The symbols that are used in a CFG are divided into two classes. The symbols that correspond to
words in the language (“the”, “nightclub”) are called terminal symbols; the lexicon is the set of
rules that introduce these terminal symbols. The symbols that express abstractions over these
terminals are called non-terminals. In each context-free rule, the item to the right of the arrow
(→) is an ordered list of one or more terminals and non-terminals; to the left of the arrow is a
single non-terminal symbol expressing some cluster or generalization. The non-terminal
associated with each word in the lexicon is its lexical category, or part of speech.
A CFG can be thought of in two ways: as a device for generating sentences and as a device for
assigning a structure to a given sentence. Viewing a CFG as a generator, we can read the →
arrow as “rewrite the symbol on the left with the string of symbols on the right”.
We say the string a flight can be derived from the non-terminal NP. Thus, a CFG can be used to
generate a set of strings. This sequence of rule expansions is called a derivation of the string of
words. It is common to represent a derivation by a parse tree (commonly shown inverted with the
root at the top).
The formal language defined by a CFG is the set of strings that are derivable from the designated
start symbol.
Sentences (strings of words) that can be derived by a grammar are in the formal language defined
by that grammar, and are called grammatical sentences. Sentences that cannot be derived by a
given formal grammar are not in the language defined by that grammar and are referred to as
ungrammatical.
➢ PARSING TECHNIQUE
For FSAs, for example, the parser is searching through the space of all possible paths through the
automaton. In syntactic parsing, the parser can be viewed as searching through the space of all
possible parse trees to find the correct parse tree for the sentence.
The goal of a parsing search is to find all trees whose root is the start symbol S, which cover
exactly the words in the input.
Two search strategies underlying most parsers: top-down or goal-directed search and bottom-up
or data-directed search.
➢ TOP-DOWN PARSING
A top-down parser searches for a parse tree by trying to build from the root node S down to the
leaves. The algorithm starts by assuming the input can be derived by the designated start symbol
S. The next step is to find the tops of all trees which can start with S, by looking for all the
grammar rules with S on the left-hand side.
➢ BOTTOM-UP PARSING
Bottom-up parsing is the earliest known parsing algorithm (it was first suggested by Yngve
(1955)), and is used in the shift-reduce parsers common for computer languages.
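As a hedged illustration of the two strategies, NLTK ships a recursive-descent (top-down) parser and a shift-reduce (bottom-up) parser; the toy grammar and sentence below are assumptions chosen only to make the sketch runnable.

```python
import nltk

# Toy grammar (assumed for illustration only).
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N  -> 'dog' | 'ball'
    V  -> 'chased'
""")

sentence = "the dog chased the ball".split()

# Top-down (goal-directed) search: start from S and expand downwards.
top_down = nltk.RecursiveDescentParser(grammar)
for tree in top_down.parse(sentence):
    print(tree)

# Bottom-up (data-directed) search: shift words, then reduce them to phrases.
bottom_up = nltk.ShiftReduceParser(grammar)
for tree in bottom_up.parse(sentence):
    print(tree)
```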
SEMANTICS
Semantic Analysis is a subfield of Natural Language Processing (NLP) that attempts to
understand the meaning of Natural Language.
➢ Parts of Semantic Analysis
Semantic Analysis of Natural Language can be classified into two broad parts:
1. Lexical Semantic Analysis: Lexical Semantic Analysis involves understanding the meaning
of each word of the text individually. It basically refers to fetching the dictionary meaning that
a word in the text is deputed to carry.
2. Compositional Semantics Analysis: Although knowing the meaning of each word of the
text is essential, it is not sufficient to completely understand the meaning of the text.
For example, consider the following two sentences:
• Sentence 1: Students love GeeksforGeeks.
• Sentence 2: GeeksforGeeks loves Students.
Although both these sentences 1 and 2 use the same set of root words {student, love,
geeksforgeeks}, they convey entirely different meanings.
Some of the important lexical relations studied in semantic analysis are:
• Homonymy: Homonymy refers to two or more lexical terms with the same spelling but
completely distinct meanings. For example, ‘rose‘ may mean ‘the past form of rise‘ or ‘a
flower‘: same spelling but different meanings; hence, ‘rose‘ is an example of homonymy.
• Synonymy: When two or more lexical terms that might be spelt distinctly have the
same or similar meaning, they are called Synonymy. For example: (Job,
Occupation), (Large, Big), (Stop, Halt).
• Antonymy: Antonymy refers to a pair of lexical terms that have contrasting
meanings – they are symmetric to a semantic axis. For example: (Day, Night), (Hot,
Cold), (Large, Small).
• Polysemy: Polysemy refers to lexical terms that have the same spelling but multiple
closely related meanings. It differs from homonymy because the meanings of the
terms need not be closely related in the case of homonymy. For example: ‘man‘
may mean ‘the human species‘ or ‘a male human‘ or ‘an adult male human‘ – since
all these different meanings bear a close association, the lexical term ‘man‘ is a
polysemy.
• Meronomy: Meronomy refers to a relationship wherein one lexical term is a
constituent of some larger entity. For example: ‘Wheel‘ is a meronym of
‘Automobile‘
➢ Meaning Representation
While, as humans, it is pretty simple for us to understand the meaning of textual information, it
is not so in the case of machines. Thus, machines tend to represent the text in specific formats
in order to interpret its meaning. This formal structure that is used to understand the meaning
of a text is called meaning representation.
Basic Units of Semantic System:
1. Entity: An entity refers to a particular unit or individual in specific such as a person or
a location. For example GeeksforGeeks, Delhi, etc.
2. Concept: A Concept may be understood as a generalization of entities. It refers to a
broad class of individual units. For example Learning Portals, City, Students.
3. Relations: Relations help establish relationships between various entities and concepts.
For example: ‘GeeksforGeeks is a Learning Portal’, ‘Delhi is a City.’, etc.
4. Predicate: Predicates represent the verb structures of the sentences.
STEMMING
Stemming is the process of reducing a word to its word stem (the part of the word to which
suffixes and prefixes attach) or to the root form of the word, known as the lemma.
Types of Stemmer
1. Porter Stemmer – PorterStemmer(): Martin Porter invented the Porter Stemmer or Porter
algorithm in 1980. Five steps of word reduction are used in the method, each with its own set of
mapping rules. Porter Stemmer is the original stemmer and is renowned for its ease of use and
rapidity. Frequently, the resultant stem is a shorter word with the same root meaning.
Example:
Connects ---> connect
Connecting ---> connect
Connections ---> connect
Connected ---> connect
Connection ---> connect
Connectings ---> connect
Connect ---> connect
2. Snowball Stemmer – SnowballStemmer(): Martin Porter also created Snowball Stemmer.
The method utilized in this instance is more precise and is referred to as “English Stemmer” or
“Porter2 Stemmer.” It is somewhat faster and more logical than the original Porter Stemmer.
generous ---> generous
generate ---> generat
generously ---> generous
generation ---> generat
3. Lancaster Stemmer – LancasterStemmer(): Lancaster Stemmer is straightforward, although
it often produces results with excessive stemming. Over-stemming renders stems non-linguistic
or meaningless.
eating ---> eat
eats ---> eat
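A short sketch comparing the three stemmers with NLTK (assuming NLTK is installed); the word list is arbitrary.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")   # the "English" / Porter2 stemmer
lancaster = LancasterStemmer()

words = ["connections", "connecting", "generously", "generation", "eating"]
for w in words:
    print(f"{w:12s} porter={porter.stem(w):10s} "
          f"snowball={snowball.stem(w):10s} lancaster={lancaster.stem(w)}")
```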
LEMMATIZATION
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected
forms of a word so they can be analysed as a single item, identified by the word's lemma, or
dictionary form.
In many languages, words appear in several inflected forms. For example, in English, the verb 'to
walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might
look up in a dictionary, is called the lemma for the word. The association of the base form with a
part of speech is often called a lexeme of the word.
For instance:
• The word "better" has "good" as its lemma. This link is missed by stemming, as it
requires a dictionary look-up.
• The word "walk" is the base form for the word "walking", and hence this is matched in
both stemming and lemmatisation.
• The word "meeting" can be either the base form of a noun or a form of a verb ("to
meet") depending on the context; e.g., "in our last meeting" or "We are meeting again
tomorrow". Unlike stemming, lemmatisation attempts to select the correct lemma
depending on the context.
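A minimal sketch using NLTK's WordNet lemmatizer (the WordNet data must be downloaded once); note that the part-of-speech hint matters, e.g. "better" lemmatizes to "good" only when treated as an adjective.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads (uncomment on first run).
# nltk.download("wordnet")
# nltk.download("omw-1.4")   # may also be required in newer NLTK versions

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("walking", pos="v"))   # walk
print(lemmatizer.lemmatize("meetings", pos="n"))  # meeting
print(lemmatizer.lemmatize("better", pos="a"))    # good
```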
POS Tagging (Parts of Speech Tagging) is the process of marking up the words in a text as
corresponding to a particular part of speech, based on both their definition and their context. It
involves reading text in a language and assigning a specific token (part of speech) to each word.
It is also called grammatical tagging.
Input: Everything to permit us.
Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
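A hedged sketch of tagging the sample input above with NLTK's default tagger (the punkt tokenizer and the averaged perceptron tagger data must be downloaded once); the exact tags can vary with the tagger version.

```python
import nltk

# One-time downloads (uncomment on first run).
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

text = "Everything to permit us."
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))
# Typically something like:
# [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('.', '.')]
```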
1. RULE-BASED
Rule-based taggers use dictionary or lexicon for getting possible tags for tagging each word. If
the word has more than one possible tag, then rule-based taggers use hand-written rules to
identify the correct tag. Disambiguation can also be performed in rule-based tagging by
analyzing the linguistic features of a word along with its preceding as well as following words.
For example, if the preceding word of a word is an article, then the word must be a noun.
As the name suggests, all such kind of information in rule-based POS tagging is coded in the
form of rules. These rules may be either −
• Context-pattern rules
• Or, as Regular expression compiled into finite-state automata, intersected with lexically
ambiguous sentence representation.
We can also understand Rule-based POS tagging by its two-stage architecture −
• First stage − In the first stage, it uses a dictionary to assign each word a list of potential
parts-of-speech.
• Second stage − In the second stage, it uses large lists of hand-written disambiguation
rules to sort down the list to a single part-of-speech for each word.
2. STATISTICAL
Statistical learning theory is a framework for machine learning drawing from the fields
of statistics and functional analysis.
Statistical analysis is the process of collecting and analyzing data in order to discern patterns and
trends. It is a method for removing bias from evaluating data by employing numerical analysis.
This technique is useful for collecting the interpretations of research, developing statistical
models, and planning surveys and studies.
• Inferential Analysis
The inferential statistical analysis focuses on drawing meaningful conclusions on the basis of the
data analyzed. It studies the relationship between different variables or makes predictions for the
whole population.
• Predictive Analysis
Predictive statistical analysis is a type of statistical analysis that analyzes data to derive past
trends and predict future events on the basis of them. It uses machine learning algorithms, data
mining, data modelling, and artificial intelligence to conduct the statistical analysis of data.
• Prescriptive Analysis
The prescriptive analysis conducts the analysis of data and prescribes the best course of action
based on the results. It is a type of statistical analysis that helps you make an informed decision.
• Exploratory Data Analysis
Exploratory analysis is similar to inferential analysis, but the difference is that it involves
exploring the unknown data associations. It analyzes the potential relationships within the data.
• Causal Analysis
The causal statistical analysis focuses on determining the cause and effect relationship between
different variables within the raw data. In simple words, it determines why something happens
and its effect on other variables. This methodology can be used by businesses to determine the
reason for failure.
Zipf's law
Zipf's law states that the frequency of a token in a text is inversely proportional to its rank or
position in the sorted frequency list. This law describes how tokens are distributed in languages:
a few tokens occur very frequently, some occur with intermediate frequency, and many tokens
occur only rarely.
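A small plain-Python sketch (with an arbitrary toy text) that ranks tokens by frequency and prints rank times frequency, which Zipf's law predicts should stay roughly constant.

```python
from collections import Counter

# Any reasonably long text will do; this toy string is only illustrative.
text = ("the cat sat on the mat the dog sat on the log "
        "the cat and the dog saw the cat on the mat")

counts = Counter(text.split())
ranked = counts.most_common()          # tokens sorted by frequency, highest first

for rank, (token, freq) in enumerate(ranked, start=1):
    # Under Zipf's law, rank * freq should be roughly constant across tokens.
    print(f"rank={rank:2d}  token={token:5s}  freq={freq:2d}  rank*freq={rank * freq}")
```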
3. MACHINE LEARNING
Low-Level
• Tokenization: ML + Rules
• PoS Tagging: Machine Learning
• Chunking: Rules
• Sentence Boundaries: ML + Rules
• Syntax Analysis: ML + Rules
Mid-Level
• Entities: ML + Rules to determine “Who, What, Where”
• Themes: Rules “What’s the buzz?”
• Topics: ML + Rules “About this?”
• Summaries: Rules “Make it short”
• Intentions: ML + Rules “What are you going to do?”
• Intentions uses the syntax matrix to extract the intender, intendee, and intent
• We use ML to train models for the different types of intent
• We use rules to whitelist or blacklist certain words
• Multilayered approach to get you the best accuracy
High-Level
• Apply Sentiment: ML + Rules “How do you feel about that?”
N-GRAMS
An N-gram means a sequence of N words. So for example, “Medium blog” is a 2-gram (a
bigram), “A Medium blog post” is a 4-gram, and “Write on Medium” is a 3-gram (trigram).
To estimate the probability of a word w given a history h, say P(the | its water is so transparent
that), one way is to use relative frequency counts: take a very large corpus, count the number of
times we see its water is so transparent that, and count the number of times this is followed by
the. This answers the question “Out of the times we saw the history h, how many times was it
followed by the word w?”, as follows:
P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
The same counting idea can be extended, via the chain rule of probability, to estimate the
probability of an entire word sequence.
The intuition of the n-gram model is that instead of computing the probability of a word given its
entire history, we can approximate the history by just the last few words.
Applications that can be implemented efficiently and effectively using sets of n‐grams
include spelling error detection and correction, query expansion, information retrieval with
serial, inverted and signature files, dictionary look‐up, text compression, and language
identification.
N Term
1 Unigram
2 Bigram
3 Trigram
N n-gram
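A short plain-Python sketch of extracting unigrams, bigrams and trigrams from a sentence (NLTK's nltk.ngrams helper gives the same result); the example sentence is arbitrary.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "write on a Medium blog post".split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```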
MULTIWORD EXPRESSIONS
Multi-word Expressions (MWEs) are word combinations with linguistic properties that cannot be
predicted from the properties of the individual words or the way they have been combined.
MWEs occur frequently and are usually highly domain-dependent. A proper treatment of MWEs
is essential for the success of NLP-systems.
A multiword expression is a sequence, continuous or discontinuous, of words or other elements,
which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the
time of use, rather than being subject to generation or analysis by the language grammar.
• A language word - lexical unit in the language that stands for a concept.
e.g. train, water, ability
• However, that may not be true.
e.g. Prime Minister
Due to institutionalized usage, we tend to think of ‘Prime Minister’ as a single concept.
• Here the concept crosses word boundaries.
Simply put, a multiword expression (MWE):
a. crosses word boundaries
b. is lexically, syntactically, semantically, pragmatically and/or statistically idiosyncratic
E.g. traffic signal, Real Madrid, green card, fall asleep, leave a mark, ate up, figured out, kick the
bucket, spill the beans, ad hoc.
Idiosyncrasies
• Statistical idiosyncrasies
Usage of the multiword has been conventionalized, though it is still semantically decomposable
E.g. traffic signal, good morning
• Lexical idiosyncrasies
Lexical items generally not seen in the language, probably borrowed from other languages
E.g. ad hoc, ad hominem
• Syntactic idiosyncrasy
Conventional grammar rules don’t hold, these multiwords exhibit peculiar syntactic behavior
• Semantic Idiosyncrasy
The meaning of the multi word is not completely composable from those of its constituents
This arises from figurative or metaphorical usage
The degree of compositionality varies
MWE Characteristics
• Basis for MWE extraction
o Non-Compositionality
Non-decomposable – e.g. blow hot and cold
Partially decomposable – e.g. spill the beans
o Syntactic Flexibility
Can undergo inflections, insertions, passivizations
e.g. promise(d/s) him the moon
The more non-compositional the phrase, the less syntactically flexible it is
o Substitutability
MWEs resist substitution of their constituents by similar words
E.g. ‘many thanks’ cannot be expressed as ‘several thanks’ or ‘many
gratitudes’
o Institutionalization
Results in statistical significance of collocations
o Paraphrasability
Sometimes it is possible to replace the MWE by a single word
E.g. leave out replaced by omit
• Based on syntactic forms and compositionality
o Institutionalized Noun collocations
E.g. traffic signal, George Bush, green card
o Phrasal Verbs (Verb-Particle constructions)
E.g. call up, eat up
o Light verb constructions (V-N collocations)
E.g. fall asleep, give a demo
o Verb Phrase Idioms
E.g. sweep under the rug
A collocation is two or more words that often go together. These combinations just sound "right"
to native English speakers, who use them all the time. On the other hand, other combinations
may be unnatural and just sound "wrong". Look at these examples:
natural English          unnatural English
the fast train           the quick train
fast food                quick food
a quick shower           a fast shower
a quick meal             a fast meal
Why learn collocations?
• Your language will be more natural and more easily understood.
• You will have alternative and richer ways of expressing yourself.
• It is easier for our brains to remember and use language in chunks or blocks rather than as
single words.
Types of collocation
• adverb + adjective: completely satisfied (NOT downright satisfied)
• adjective + noun: excruciating pain (NOT excruciating joy)
• noun + noun: a surge of anger (NOT a rush of anger)
• noun + verb: lions roar (NOT lions shout)
• verb + noun: commit suicide (NOT undertake suicide)
• verb + expression with preposition: burst into tears (NOT blow up in tears)
• verb + adverb: wave frantically (NOT wave feverishly)
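As a hedged illustration of finding collocations in practice, NLTK provides collocation finders that score candidate bigrams with several of the association measures discussed in the next section; the toy text below is arbitrary, and real use would require a much larger corpus.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy corpus; in practice a large corpus is needed for stable scores.
text = ("many thanks for the fast food , the fast train was late , "
        "many thanks again for a quick meal")
tokens = text.split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(1)   # keep all bigrams (no real filter in this toy example)

print(finder.nbest(measures.pmi, 5))        # top bigrams by pointwise mutual information
print(finder.nbest(measures.student_t, 5))  # top bigrams by the t-score measure
```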
ASSOCIATION MEASURES
According to P. Pecina, the term “lexical association” refers to association between words.
Collocational association restricts combination of words into phrases. Based on statistical
interpretation of the data from a corpus one could estimate lexical associations automatically by
means of lexical association measures. These measures determine “the strength of association
between two or more words based on their occurrences and cooccurrences in a corpus”.
S. Evert described in his thesis formal and statistical prerequisites and also presented a
comprehensive repository of association measures. An explicit equation is given for each
measure, using a consistent notation in terms of observed and expected frequencies. New
approaches are suggested also to the study of association measures, with an emphasis on
empirical results and intuitive understanding.
It is only natural to assume that one of the ways to identify the stability of a word combination is
the frequency of their cooccurrence. However, the raw data – the frequency of cooccurrence of
word pairs – are not always meaningful. “Provided that both words are sufficiently frequent,
their co-occurrences might be pure coincidence. Therefore, a statistical interpretation of the
frequency data is necessary, which determines the degree of statistical association between the
words”.
For this purpose, association measures are applied, which assign a score to each word pair based
on the observed frequency data. The higher this score is, the stronger and more certain the
association between the two words. This score depends on a few factors such as co-occurrence
frequency, each word’s separate frequency, the size of the corpus, the maximum size of the window for
collocations, etc.
If the bigram is used frequently, then it is probable that the two words are used together not
by chance but comprise a collocation.
MI (mutual information) was introduced by K. Church and P. Hanks as a measure that
compares the observed context-bound frequencies with the frequencies expected if word
occurrences in the text were independent (random).
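The MI formula itself is not reproduced in the notes; the standard form of Church and Hanks' (pointwise) mutual information for a word pair (x, y), with f denoting corpus frequencies and N the corpus size, is the following reconstruction:

```latex
\mathrm{MI}(x,y) \;=\; \log_2 \frac{P(x,y)}{P(x)\,P(y)}
\;\approx\; \log_2 \frac{N \cdot f(x,y)}{f(x)\,f(y)}
```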
t-score
The t-score measure “also takes into consideration the frequency of co-occurrence of a key word
and its collocate thus answering a question how random the association force is between the
collocates”
The t-score expresses the certainty with which we can argue that there is an association between
the words, i.e. that their co-occurrence is not random. The value is affected by the frequency of
the whole collocation, which is why very frequent word combinations tend to reach a high
t-score value despite not being significant as real collocations.
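The t-score formula is also missing from the notes; a commonly used form in collocation extraction (a reconstruction, with f(x,y) the observed co-occurrence frequency, f(x) and f(y) the individual word frequencies, and N the corpus size) is:

```latex
t \;=\; \frac{f(x,y) \;-\; \dfrac{f(x)\,f(y)}{N}}{\sqrt{f(x,y)}}
```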
log-likelihood ratio
“In statistics, a likelihood function (often simply the likelihood) is a function of the parameters of
a statistical model given data. Likelihood functions play a key role in statistical inference,
especially methods of estimating a parameter from a set of statistics. In informal contexts,
"likelihood" is often used as a synonym for "probability"” (Wikipedia). This measure is based on
“a ratio of two likelihood functions which correspond to two hypotheses — about random and
non-random nature of phrases”.
Multinomial Likelihood
This measure termed multinomial likelihood (ML) estimates the probability of the observed
contingency table point hypothesis assuming the multinomial sampling distribution:
Hypergeometric Likelihood
Binomial Likelihood
Poisson Likelihood
If we replace the binomial distribution with the Poisson distribution, this will increase the
computational efficiency and will provide results with a higher accuracy. In this case, the
corresponding association measure is called Poisson likelihood (PL) and is calculated as follows:
COEFFICIENTS MEASURES
The techniques within this approach measure the degree of association between the words x and
y in a bigram xy by estimating one of the coefficients of association strength from the observed
data.
Odds Ratio
Odds ratio is the ratio of two probabilities, that is, the probability that a given event occurs and
the probability that this event does not occur. This ratio is calculated as ad/bc, where a, b, c, d
are taken from the contingency table presented.
The odds ratio is sometimes called the cross-product ratio because the numerator is based on
multiplying the value in cell a times the value in cell d, whereas the denominator is the product
of cell b and cell c. A line from cell a to cell d (for the numerator) and another from cell b to
cell c (for the denominator) creates an X or cross on the two-by-two table (Table 1).
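Table 1 itself is not reproduced in these notes. For a bigram xy it is conventionally the two-by-two contingency table of observed frequencies sketched below (a reconstruction of the standard layout, not the original table); its cells a, b, c and d are the values used by the odds ratio and the other coefficient measures.

```
                      second word = y      second word != y
first word  = x       a = f(x, y)          b = f(x, not y)
first word != x       c = f(not x, y)      d = f(not x, not y)
```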
Relative Risk
This measure estimates the strength of association of the words x and y in a bigram xy according
to the formula:
where a, b, c, and d are taken from the contingency table presented earlier(Table 1). The name of
the measure is explained by the fact that this metric is commonly used in medical evaluations,
in particular, epidemiology, to estimate the risk of having a disease related to the risk of
being exposed to this disease. However, it also can be applied to quantifying the association of
words in a word combination.
Relative risk is also called risk ratio because, in medical terms, it is the ratio of the risk of
having a disease if exposed to it divided by the risk of having a disease being unexposed to it.
This ratio of probabilities can also be used in measuring the relation between the probability of
a bigram xy being a collocation versus the probability of this bigram to be a free word
combination.
Minimum Sensitivity
Minimum sensitivity (MS) is an effective measure of association of the words x and y in a
bigram xy and has been used successfully in the collocation extraction task. This metric is
calculated according to the formula:
In fact, what this measure does is comparing two conditional probabilities P(y|x) and P(x|y)
and selecting the lesser value thus taking advantage of the notion of conditional probability.
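The minimum sensitivity formula is missing from the notes; based on the description above (the lesser of the two conditional probabilities), it can be reconstructed in terms of the cells of Table 1 as:

```latex
\mathrm{MS}(x,y) \;=\; \min\bigl(P(y \mid x),\; P(x \mid y)\bigr)
\;=\; \min\!\left(\frac{a}{a+b},\; \frac{a}{a+c}\right)
```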
Geometric Mean
Applied to the contingency table (see Table 1), the geometric mean is equal to the square root of
the heuristic MI² measure defined by the following formula:
Therefore, the geometric mean increases the influence of the co-occurrence frequency in the
numerator and avoids the overestimation for low frequency bigrams.
Dice Coefficient
This association measure (𝐷) is calculated according to the formula:
This coefficient is one of the most common association measures used to detect collocations;
moreover, its performance happens to be higher than the performance of other association
measures.
Jaccard Coefficient
The Jaccard coefficient (𝐽) is monotonically related to the Dice coefficient and measures
similarity in asymmetric information on binary and non-binary variables. It is commonly applied
to measure similarity of two sets of data and is calculated as a ratio of the cardinality of the sets’
intersection divided by the cardinality of the sets’ union. It is also frequently used as a measure
of association between two terms in information retrieval. To estimate the relation between the
words 𝑥 and 𝑦 in a bigram 𝑥𝑦, the Jaccard coefficient is defined by the following formula:
where the values of 𝑎, 𝑏, and 𝑐 are as given in the contingency table. (Table 1)
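The Dice and Jaccard formulas referenced above are likewise missing; their standard definitions in terms of the contingency table cells (a reconstruction based on the usual definitions, not copied from the original notes) are:

```latex
D \;=\; \frac{2a}{2a + b + c} \;=\; \frac{2\,f(x,y)}{f(x) + f(y)},
\qquad
J \;=\; \frac{a}{a + b + c}
```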
CONTEXT MEASURES
In general terms, a context is the situation within which something happens and that helps to explain it.
Generally, a context is defined as a multiset (bag) of word types occurring within a predefined
distance (also called a context window) from any occurrence of a given bigram type or word type
(their tokens, more precisely) in the corpus.
The main idea of using this concept is to model the average context of an occurrence of the
bigram/word type in the corpus, i.e. word types that typically occur in its neighborhood. In this
work, we employ two approaches representing the average context: by estimating the probability
distribution of word types appearing in such a neighborhood and by the vector space model
adopted from the field of information retrieval.
In order to estimate the probability distribution P(Z|Ce) of word types z appearing in the context
Ce, this multiset is interpreted as a random sample obtained by sampling (with replacement) from
the population of all possible (basic) word types z ∈ U. The random sample consists of M
realizations of a (discrete) random variable Z representing the word type appearing in the context
Ce. The population parameters are the context occurrence probabilities of the word types z ∈ U.
These parameters can be estimated on the basis of the observed frequencies of word types z ∈ U
obtained from the random sample Ce by the following formula:
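The estimation formula referred to above is not reproduced in the notes; given the description (a sample of M context tokens with observed frequencies of word types), it is presumably the maximum-likelihood estimate, which can be reconstructed as:

```latex
\hat{P}(z \mid C_e) \;=\; \frac{f(z, C_e)}{M}, \qquad z \in U
```

where f(z, C_e) denotes the number of occurrences of word type z in the context multiset C_e (this notation is an assumption).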
After the words are converted into vectors, we need techniques such as Euclidean distance or
cosine similarity to identify similar words.
Counting the common words, or Euclidean distance, is the general approach used to match
similar documents, based on counting the number of words common to both documents. This
approach can fail: even if the number of common words increases, the documents may still be
about different topics. To overcome this flaw, the cosine similarity approach is used.
Mathematically, cosine similarity measures the cosine of the angle between two vectors (item1,
item2) projected in an N-dimensional vector space. The advantage of cosine similarity is that it
can predict document similarity even when two similar documents are far apart by Euclidean
distance (for example, because of differences in document size). The smaller the angle, the
closer the cosine is to 1; in the worked example, the cosine similarity between the two vectors is
0.822, which is close to 1.
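A minimal NumPy sketch of cosine similarity between two item vectors; the example vectors are arbitrary and not taken from the worked example above.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

item1 = np.array([1.0, 2.0, 3.0, 0.0])
item2 = np.array([2.0, 3.0, 4.0, 1.0])

# A value close to 1 means the two items point in nearly the same direction.
print(cosine_similarity(item1, item2))
```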
Now let’s see what are all the ways to convert sentences into vectors.
Word embeddings coming from pre-trained methods such as,
• Word2Vec — From Google
• fastText — From Facebook
• GloVe — From Stanford
➢ Word2Vec
Word2Vec (word representations in vector space) was proposed by Tomas Mikolov and a group
of researchers at Google.
Most earlier NLP systems treat words as atomic units, so there is no notion of similarity between
words, and such systems work well only for small, simple datasets. In order to train on larger
datasets with more complex models, modern techniques use neural network architectures, which
outperform simpler models on huge datasets with billions of words.
This technique also provides a way to measure the quality of the resulting vector representations:
similar words tend to be close to each other in the vector space, and words can have multiple
degrees of similarity.
Syntactic Regularities: Refers to grammatical sentence correction.
Semantic Regularities: Refers to the meaning of the vocabulary symbols arranged in that
structure.
It was found that the similarity of word representations goes beyond simple syntactic
regularities and that algebraic operations on word vectors work surprisingly well. For example,
vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to
vector("Queen").
The objective of the proposed model architectures is to maximize accuracy while minimizing
computational complexity. The feedforward neural network language model (NNLM) becomes
complex for computation between the projection and the hidden layer, because the values in the
projection layer are dense. The recurrent neural network (RNN) language model does not have a
projection layer; it has only input, hidden and output layers.
Models should be trained on huge datasets using a large-scale distributed framework
called DistBelief, which gives better results. Word2Vec proposes two new model architectures:
• Continuous Bag-of-Words Model
• Continuous Skip-gram Model
In the Continuous Bag-of-Words (CBOW) model, the non-linear hidden layer is removed and the
projection layer is shared for all the words; thus all words are projected into the same position
and their vectors are averaged. The model predicts the current word based on its surrounding
context. In the Continuous Skip-gram model, instead of predicting the current word based on the
context, the model tries to maximize the classification of a word based on another word in the
same sentence; that is, it uses the current word to predict the surrounding context words.
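A hedged sketch of training a small Word2Vec model with the Gensim library (assuming Gensim 4.x is installed); the tiny corpus and parameter values are illustrative only, and real embeddings need far more data.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (far too small for good vectors).
sentences = [
    ["it", "is", "a", "nice", "evening"],
    ["good", "evening"],
    ["is", "it", "a", "nice", "evening"],
]

# sg=0 selects the CBOW architecture; sg=1 would select Skip-gram.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

print(model.wv["evening"][:5])                 # first few dimensions of a word vector
print(model.wv.most_similar("evening", topn=3))
```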
GloVe:
This is another method for creating word embeddings. In this method, we take the corpus and
iterate through it and get the co-occurrence of each word with other words in the corpus. We
get a co-occurrence matrix through this. The words which occur next to each other get a value
of 1, if they are one word apart then 1/2, if two words apart then 1/3 and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:
Corpus:
It is a nice evening.
Good Evening!
Is it a nice evening?
          it        is        a         nice      evening   good
it        0
is        1+1       0
a         1/2+1     1+1/2     0
nice      1/3+1/2   1/2+1/3   1+1       0
evening   1/4+1/3   1/3+1/4   1/2+1/2   1+1       0
good      0         0         0         0         1         0
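As a rough check of the construction described above, here is a short plain-Python sketch that builds a 1/d-weighted co-occurrence count; the window size is an assumption, and the exact numbers may differ slightly from the hand-worked table depending on how pairs and sentence boundaries are counted.

```python
from collections import defaultdict

corpus = [
    "it is a nice evening".split(),
    "good evening".split(),
    "is it a nice evening".split(),
]

window = 4  # how many positions apart two tokens may be and still co-occur
cooc = defaultdict(float)

for sentence in corpus:
    for i, w1 in enumerate(sentence):
        for d in range(1, window + 1):
            j = i + d
            if j < len(sentence):
                w2 = sentence[j]
                # Weight by 1/d: adjacent words add 1, one word apart 1/2, etc.
                cooc[(w1.lower(), w2.lower())] += 1.0 / d

print(cooc[("it", "is")])       # co-occurrence weight of the ordered pair (it, is)
print(cooc[("a", "evening")])
```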
LANGUAGE MODELING
A language model in NLP is a probabilistic statistical model that determines the probability of a
given sequence of words occurring in a sentence based on the previous words. It helps to predict
which word is more likely to appear next in the sentence. Hence it is widely used in predictive
text input systems, speech recognition, machine translation, spelling correction etc. The input to
a language model is usually a training set of example sentences. The output is a probability
distribution over sequences of words. To predict the next word, we can condition on the previous
word (a bigram model), the previous two words (a trigram model), or, more generally, the
previous n-1 words (an n-gram model), as per our requirements.
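A minimal sketch of such a model at the bigram level (relative-frequency estimates, no smoothing); the toy corpus and sentences are arbitrary.

```python
from collections import Counter, defaultdict

corpus = [
    "it is a nice evening".split(),
    "it is a good evening".split(),
    "it was a nice day".split(),
]

history_counts = Counter()               # how often each word occurs as a history
bigram_counts = defaultdict(Counter)     # counts of (history -> next word)

for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        history_counts[w1] += 1
        bigram_counts[w1][w2] += 1

def next_word_probs(word):
    """P(w | word) estimated by relative frequency (no smoothing)."""
    total = history_counts[word]
    return {w: c / total for w, c in bigram_counts[word].items()}

print(next_word_probs("is"))                     # {'a': 1.0}
print(bigram_counts["a"].most_common(1)[0][0])   # most likely word after 'a' -> 'nice'
```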
Neural Language Models refer to language models that are developed using neural networks.
Moreover, the models help mitigate the challenges that occur in classical language models.
Further, it helps execute complex tasks like speech recognition or machine translation.
Further, Google Translate and Microsoft Translator are examples of language models helping
machines to translate words and text into various languages.
• Sentiment analysis:
Sentiment analysis is the process of identifying sentiments and behaviors on the basis of the text.
Further, NLP models help businesses to recognize their customers’ intentions and attitudes using
text. For example, Hubspot’s Service Hub analyzes sentiments and emotions using NLP
language models.
• Parsing Tools:
Parsing refers to analyzing sentences and words that are complementary according to syntax and
grammar rules. Further, language models enable features like spell-checking.
• Optical Character Recognition (OCR):
OCR is the use of machines to transform images of text into machine-encoded text. Moreover,
the image may be converted from a scanned document or picture. It is also an important function
that helps digitize old paper trails. Hence, it helps analyze and identify handwriting samples.
• Information Retrieval:
It refers to searching documents and files for information. It also includes regular searches for
documents and files and probing for metadata that leads to a document.