|| Jai Sri Gurudev||
Sri Adichunchanagiri Shikshana Trust (R)
SJB INSTITUTE OF TECHNOLOGY
(Affiliated to Visvesvaraya Technological University, Belagavi & Approved by AICTE, New Delhi)
No. 67, BGS Health & Education City, Dr. Vishnuvardhan Road Kengeri, Bengaluru – 560 060
Subject: Natural Language Processing (18CS743)
By
CHETAN R, Assistant Professor
Semester / Section: 7A and B
Department of Information Science & Engineering
Academic Year: Odd Sem, 2021-22
PARSING
Overview
Parsing is the task of using the rewrite rules of a grammar either to generate a
particular sequence of words or to reconstruct its derivation.
The following constraints guide the search process:
1. Input: Words in the input sentence. A valid parse is one that covers
all the words in a sentence. These words must constitute the leaves of
the final parse tree.
2. Grammar: The root of the final parse tree must be the start symbol
of the grammar.
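These two constraints can be made concrete with a toy grammar. As a rough sketch (the grammar below is an illustrative assumption, not one given in the slides), the rewrite rules can be stored as a mapping from each non-terminal to its possible right-hand sides; the input words must appear as terminals at the leaves, and the start symbol S must be the root:

```python
# A sketch of a CFG's rewrite rules stored as plain data: each
# non-terminal maps to its possible right-hand sides.
# (Illustrative toy grammar, not the one from the slides.)
REWRITE_RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "Noun"], ["Pronoun"]],
    "VP": [["Verb", "NP"]],
    "Det":  [["the"], ["an"]],
    "Noun": [["girl"], ["essay"]],
    "Verb": [["wrote"]],
    "Pronoun": [["she"]],
}

START_SYMBOL = "S"   # constraint 2: the root of any valid parse tree

def is_terminal(symbol):
    # Terminals (the input words, constraint 1) have no rewrite rules.
    return symbol not in REWRITE_RULES
```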
Top-down parsing
Top-down search space
Bottom-up parsing
Pros and Cons
The top-down parser starts generating trees from the start symbol of the
grammar, so it never wastes time exploring a tree leading to a different root.
However, it wastes time exploring S trees that eventually result in words that
are inconsistent with the input.
The bottom-up parser never explores a tree that does not match the input,
but it wastes time generating trees that have no chance of leading to an
S-rooted tree.
Basic Top-Down Parser
Derivation using top-down, depth first algorithm
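A top-down, depth-first derivation can be sketched as a small backtracking parser. This is a minimal illustration rather than the exact algorithm from the slides; the grammar and sentence below are assumptions for the example:

```python
# A minimal top-down, depth-first backtracking parser.
# (Illustrative grammar and sentence, not from the slides.)
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb", "NP"]],
    "Det":  [["the"], ["an"]],
    "Noun": [["girl"], ["essay"]],
    "Verb": [["wrote"]],
}

def parse(symbols, words):
    """Try to derive exactly `words` from the sequence `symbols`."""
    if not symbols:
        return not words          # success only if all input is consumed
    first, rest = symbols[0], symbols[1:]
    if first not in GRAMMAR:      # terminal: must match the next word
        return bool(words) and words[0] == first and parse(rest, words[1:])
    # Non-terminal: depth-first, try each expansion in rule order.
    return any(parse(list(rhs) + rest, words) for rhs in GRAMMAR[first])
```

For example, `parse(["S"], "the girl wrote an essay".split())` succeeds, while a sentence the grammar cannot cover fails after the backtracking search is exhausted.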
Disadvantages
1. Left Recursion: the search gets stuck in an infinite loop expanding a left-recursive rule.
2. Structural Ambiguity: the grammar assigns more than one parse to a sentence.
3. Attachment Ambiguity: a constituent fits more than one position in the parse tree.
4. Coordination Ambiguity: it is not clear which phrases are being combined with a
conjunction like 'and'.
5. Local Ambiguity: certain parts of a sentence are ambiguous, even if the sentence as a
whole is not.
6. Repeated Parsing: the parser often builds valid trees for portions of the input that it
then discards during backtracking.
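The left-recursion problem in particular is easy to demonstrate: a naive depth-first expansion of a left-recursive rule such as NP → NP PP never terminates. A small sketch, using an illustrative grammar:

```python
import sys

# Why left recursion breaks a naive top-down parser: expanding the
# left-recursive rule NP -> NP PP rewrites NP back to itself as the
# left-most symbol, so a depth-first search recurses forever.
# (Illustrative grammar, not from the slides.)
LEFT_RECURSIVE = {
    "NP": [["NP", "PP"], ["Noun"]],
    "PP": [["Prep", "NP"]],
}

def expand_leftmost(symbol, depth=0):
    # Naive depth-first expansion that always tries the first rule first.
    if symbol not in LEFT_RECURSIVE:
        return depth
    return expand_leftmost(LEFT_RECURSIVE[symbol][0][0], depth + 1)

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(100)          # keep the inevitable failure small
try:
    expand_leftmost("NP")
    looped = False
except RecursionError:
    looped = True                   # stuck expanding NP -> NP PP forever
finally:
    sys.setrecursionlimit(old_limit)
```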
EARLEY PARSER
It implements an efficient parallel top-down search using
dynamic programming.
It builds a table of sub-trees for each of the constituents in the
input.
The most important component of this algorithm is the Earley
Chart that has n+1 entries, where n is the number of words in
the input.
The algorithm makes a left-to-right scan of the input to fill in the
elements of this chart.
State Information
1. A sub-tree corresponding to a grammar rule.
2. Information about the progress made in completing the sub-tree.
3. The position of the sub-tree with respect to the input.
A state is represented as a dotted rule together with a pair of numbers
giving the starting position and the position of the dot:
A → X1 … • C … Xm, [i, j]
Algorithm
Predictor
Generates new states representing potential expansions of the
non-terminal in the left-most derivation.
It is applied to every state that has a non-terminal to the right of
the dot, when the category of that non-terminal is not a
part-of-speech category.
If A → X1 … • B … Xm, [i, j], then for every rule of the form
B → γ, the operation adds to chart[j] the state:
B → • γ, [j, j]
Example
When the generating state is S → • NP VP, [0, 0], the predictor
adds the following states to chart[0]:
NP → • Det Nominal, [0, 0]
NP → • Noun, [0, 0]
NP → • Pronoun, [0, 0]
NP → • Det Noun PP, [0, 0]
Scanner
Scanner is used when a state has a part of speech category to
the right of the dot.
It examines the input to see if the part-of-speech appearing to
the right of the dot matches one of the parts-of-speech
associated with the current input word.
If yes, it creates a new state using the rule.
If the state is A → … • a …, [i, j] and 'a' is a part-of-speech
associated with wj, then it adds a → wj •, [j, j+1] to chart[j+1].
Completer
Completer is used when the dot reaches the right end of the
rule.
The presence of such a state signifies successful completion of
the parse of some grammatical category.
If A → γ •, [j, k], then the completer adds
B → … A • …, [i, k] to chart[k] for every state
B → … • A …, [i, j] in chart[j].
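Putting the predictor, scanner, and completer together, a compact Earley recognizer can be sketched as follows. The grammar and part-of-speech lexicon are illustrative assumptions; a state is a (lhs, rhs, dot, start) tuple, and the chart has n+1 entries as described above:

```python
# A compact Earley recognizer combining the predictor, scanner, and
# completer operations. (Illustrative grammar and lexicon.)
GRAMMAR_RULES = {        # rules whose right-hand sides are categories
    "S":  [["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb", "NP"]],
}
LEXICON = {              # part-of-speech categories for each word
    "the": ["Det"], "an": ["Det"],
    "girl": ["Noun"], "essay": ["Noun"],
    "wrote": ["Verb"],
}

def earley_recognize(words):
    n = len(words)
    chart = [[] for _ in range(n + 1)]   # n+1 chart entries
    def add(j, state):
        if state not in chart[j]:
            chart[j].append(state)
    add(0, ("GAMMA", ("S",), 0, 0))      # dummy start state
    for j in range(n + 1):
        i = 0
        while i < len(chart[j]):         # chart[j] may grow as we go
            lhs, rhs, dot, start = chart[j][i]
            i += 1
            if dot < len(rhs) and rhs[dot] in GRAMMAR_RULES:
                # Predictor: expand the non-terminal right of the dot.
                for expansion in GRAMMAR_RULES[rhs[dot]]:
                    add(j, (rhs[dot], tuple(expansion), 0, j))
            elif dot < len(rhs):
                # Scanner: rhs[dot] is a POS category; check the word.
                if j < n and rhs[dot] in LEXICON.get(words[j], []):
                    add(j + 1, (rhs[dot], (words[j],), 1, j))
            else:
                # Completer: dot at the right end; advance waiting states.
                for blhs, brhs, bdot, bstart in chart[start]:
                    if bdot < len(brhs) and brhs[bdot] == lhs:
                        add(j, (blhs, brhs, bdot + 1, bstart))
    return ("GAMMA", ("S",), 1, 0) in chart[n]
```

The recognizer accepts "the girl wrote an essay" and rejects word sequences the grammar cannot cover, without ever re-parsing a completed constituent.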
CYK Parser
Cocke-Younger-Kasami is a dynamic programming parsing
algorithm.
It follows a bottom-up approach, building the parse tree
incrementally; each entry in the table is based on previous
entries.
The CYK algorithm assumes the grammar to be in Chomsky normal
form (CNF). A CFG is in CNF if all its rules have one of only two forms:
A → B C
A → w, where w is a word.
CYK Algorithm
Example
Sentence: “The girl wrote an essay”
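A sketch of the CYK recognizer on this sentence, assuming an illustrative CNF grammar that covers it:

```python
# A CYK recognizer for a small CNF grammar (illustrative rules chosen
# to cover the example sentence "the girl wrote an essay").
BINARY = {                # A -> B C rules, keyed by (B, C)
    ("NP", "VP"):    ["S"],
    ("Det", "Noun"): ["NP"],
    ("Verb", "NP"):  ["VP"],
}
UNARY = {                 # A -> w rules, keyed by the word w
    "the": ["Det"], "an": ["Det"],
    "girl": ["Noun"], "essay": ["Noun"],
    "wrote": ["Verb"],
}

def cyk_recognize(words, start="S"):
    n = len(words)
    # table[i][j] = set of non-terminals deriving words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                 # length-1 spans
        table[i][i + 1].update(UNARY.get(w.lower(), []))
    for length in range(2, n + 1):                # longer spans, bottom-up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):             # every split point
                for b in table[i][k]:
                    for c in table[k][j]:
                        table[i][j].update(BINARY.get((b, c), []))
    return start in table[0][n]
```

Each table entry is filled only from shorter, previously computed spans, which is exactly the bottom-up, incremental behaviour described above.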
Probabilistic Parsing
A statistical parser works by assigning probabilities to possible parses
of a sentence and returning the most likely parse as the final one.
More formally, given a grammar G, a sentence s, and the set of possible
parse trees of s, denoted τ(s), a probabilistic parser finds
the most likely parse t̂ of s as follows:
t̂ = argmax_{t ∈ τ(s)} P(t | s)
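With the rule probabilities of a PCFG, this argmax reduces to scoring each candidate tree by the product of the probabilities of the rules in its derivation. A sketch with made-up probabilities:

```python
from math import prod

# Scoring candidate parses with an illustrative PCFG fragment: a
# tree's probability is the product of the probabilities of the
# rules used in its derivation, and the parser returns the argmax.
RULE_PROB = {
    ("S",  ("NP", "VP")): 1.0,
    ("VP", ("Verb", "NP")): 0.7,
    ("VP", ("Verb", "NP", "PP")): 0.3,
    ("NP", ("Det", "Noun")): 0.6,
    ("NP", ("Det", "Noun", "PP")): 0.4,
}

def tree_probability(rules_used):
    """P(t) for a tree given as the list of rules in its derivation."""
    return prod(RULE_PROB[r] for r in rules_used)

def best_parse(candidate_trees):
    # argmax over tau(s): the candidate tree of highest probability
    return max(candidate_trees, key=tree_probability)
```

For a PP-attachment ambiguity, for instance, the parser compares a tree using VP → Verb NP PP against one using NP → Det Noun PP and keeps whichever product is larger.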
Advantages
1. A probabilistic parser offers a way to resolve ambiguity during
parsing.
2. The search becomes more efficient.
Probabilistic Context Free Grammar
Example
Probability Estimation
Estimating Rule Probability
Two Parse Trees
Parsing PCFGs
Indian Languages
The majority of Indian languages are free-word-order languages.
The word order of a sentence can be changed without making
the sentence grammatically incorrect.
Contd..
Extensive and productive use of complex predicates (CPs) is another property that
most Indian languages have in common.
A complex predicate combines a light verb with a verb, noun, or
adjective to produce a new verb.
Parsing Indian Languages
Bharti and Sangal described an approach to parsing Indian
languages based on the Paninian grammar formalism. It has two stages:
1. The first stage identifies word groups.
2. The second stage assigns a parse structure to the input sentence.
Karaka Chart
Constraint Graph
Constraints
Parse of the sentence
References
1. Bharti, Akshar and Rajeev Sangal, 1990, 'A Karaka-based
approach to parsing of Indian languages', Proceedings of the 13th
Conference on Computational Linguistics, Vol. 3, Association for
Computational Linguistics.
2. Chomsky, N., 1957, Syntactic Structures, Mouton, The Hague.