Bansilal Ramnath Agarwal Charitable Trust's
Vishwakarma Institute of Information Technology
Department of
Artificial Intelligence and Data Science
Student Name: Vaishnavi Lawate
Class: TY Division: A Roll No: 371032
Semester: 6th Academic Year: 2024-25
Subject Name & Code: Natural Language Processing
Title of Assignment: 6) Implement CKY algorithm for deep parsing
Aim: Implement the CKY algorithm for deep parsing.
Background Theory:
Introduction
Parsing is a fundamental task in Natural Language Processing (NLP), where the structure of
a sentence is analyzed based on grammar rules. The CKY algorithm (Cocke–Kasami–
Younger) is a well-known parsing algorithm for context-free grammars, especially in
Chomsky Normal Form (CNF). When extended with probabilities, it becomes the
Probabilistic CKY (PCKY) algorithm, which is used to find the most likely parse tree of a
sentence.
CKY Algorithm
What is CKY?
The CKY (Cocke–Kasami–Younger) algorithm is a bottom-up dynamic programming
algorithm used to parse a string and determine whether it can be generated by a context-free
grammar (CFG) in CNF.
Note: Chomsky Normal Form (CNF) is a format where every production rule is either:
• A → BC (where B and C are non-terminal symbols)
• A → a (where a is a terminal symbol)
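Because every CNF rule has one of these two shapes, a grammar can be stored as two simple lookup tables. The sketch below is an illustrative, hand-coded representation (the variable names and dictionary layout are this sketch's own, not from any library):

```python
# Hypothetical Python representation of a CNF grammar: binary rules
# (A -> B C) and lexical rules (A -> 'a') live in separate tables.
binary_rules = {
    ("NP", "VP"): ["S"],   # S -> NP VP
    ("Det", "N"): ["NP"],  # NP -> Det N
    ("V", "NP"): ["VP"],   # VP -> V NP
}
lexical_rules = {
    "the": ["Det"],        # Det -> 'the'
    "dog": ["N"],          # N -> 'dog'
    "chased": ["V"],       # V -> 'chased'
}

# Each lookup answers: "which non-terminals can produce this pair/word?"
print(binary_rules[("Det", "N")])  # non-terminals deriving the pair Det N
print(lexical_rules["dog"])        # non-terminals deriving the word 'dog'
```

Keeping the rules keyed by their right-hand side makes the CKY combination step a constant-time dictionary lookup.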
Working of CKY
Given a string of words w1, w2, ..., wn and a grammar in CNF:
1. Initialize a 2D table (n x n) where each cell [i][j] stores the set of non-terminal
symbols that can generate the substring from wi to wj.
2. Fill the diagonal: For each word wi, add all non-terminals A such that A → wi is a
production rule.
3. Bottom-up filling: For substrings longer than 1, check for all splits k between i and j,
and for all pairs of non-terminals (B in [i][k], C in [k+1][j]), add A to [i][j] if there's a
rule A → BC.
4. Check the top cell: If the start symbol S is in [0][n-1], then the string can be
generated by the grammar.
Example
Given:
• Grammar in CNF:
o S → NP VP
o NP → Det N
o VP → V NP
o Det → 'the'
o N → 'dog' | 'cat'
o V → 'chased'
• Input: "the dog chased the cat"
CKY fills a table where each cell holds possible non-terminals that derive the substring. At
the end, it checks if 'S' (start symbol) is in the topmost cell.
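The four steps above can be sketched as a small recognizer. This is a minimal illustration using the example grammar, with the rule tables hand-coded as dictionaries (the function and variable names are this sketch's own):

```python
# Minimal CKY recognizer: returns True iff the grammar derives the sentence.
def cky_recognize(words, binary, lexical, start="S"):
    n = len(words)
    # Step 1: table[i][j] holds non-terminals deriving words[i..j] (inclusive)
    table = [[set() for _ in range(n)] for _ in range(n)]
    # Step 2: fill the diagonal from lexical rules A -> w_i
    for i, w in enumerate(words):
        table[i][i] = set(lexical.get(w, []))
    # Step 3: bottom-up over span lengths, trying every split point k
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for B in table[i][k]:
                    for C in table[k + 1][j]:
                        for A in binary.get((B, C), []):
                            table[i][j].add(A)
    # Step 4: accept iff the start symbol spans the whole string
    return start in table[0][n - 1]

binary = {("NP", "VP"): ["S"], ("Det", "N"): ["NP"], ("V", "NP"): ["VP"]}
lexical = {"the": ["Det"], "dog": ["N"], "cat": ["N"], "chased": ["V"]}

print(cky_recognize("the dog chased the cat".split(), binary, lexical))  # True
print(cky_recognize("dog the chased".split(), binary, lexical))          # False
```

The triple loop over span length, start position, and split point is exactly what gives the O(n³) factor in the complexity analysis below.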
Time and Space Complexity
• Time Complexity: O(n³ * |G|), where n is the number of words, and |G| is the number
of grammar rules.
• Space Complexity: O(n²)
Probabilistic CKY (PCKY)
What is Probabilistic CKY?
Probabilistic CKY extends CKY using Probabilistic Context-Free Grammar (PCFG), where
each production rule has a probability.
Goal: Instead of just checking if a sentence can be generated, PCKY finds the most probable
parse tree based on the given probabilities.
How PCFG Works
In a PCFG, each rule has a probability:
• S → NP VP [1.0]
• NP → Det N [0.6], NP → 'dog' [0.4]
• Det → 'the' [0.8]
• N → 'cat' [0.5], N → 'dog' [0.5]
The probability of a parse tree is the product of all rule probabilities used in the derivation.
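For instance, using the rules listed above, the probability of the NP subtree covering "the cat" is the product of the three rules used in its derivation (a small arithmetic check, not library code):

```python
# NP over "the cat" uses: NP -> Det N [0.6], Det -> 'the' [0.8], N -> 'cat' [0.5]
p_np = 0.6 * 0.8 * 0.5
print(round(p_np, 2))  # 0.24
```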
Working of Probabilistic CKY
• Similar table-filling as CKY, but each cell also stores the maximum probability for
each non-terminal.
• Instead of just storing a non-terminal, each cell stores a tuple (Non-terminal,
Probability, Backpointer).
• When combining rules (A → B C), compute the probability:
P(A) = P(B in left) * P(C in right) * P(A → B C)
• At the end, choose the parse tree with the highest probability.
Parse Tree Reconstruction
Using the backpointers stored in each cell, we can trace back to construct the most probable
parse tree.
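The PCKY table-filling and backpointer reconstruction described above can be sketched as follows. The grammar, probabilities, and all function names here are illustrative and hand-coded for this example, not a library API:

```python
# Sketch of probabilistic CKY (Viterbi parsing) with backpointers.
# Rule tables map right-hand sides to (left-hand side, probability) pairs.
binary = {("Det", "N"): [("NP", 0.6)], ("V", "NP"): [("VP", 1.0)],
          ("NP", "VP"): [("S", 1.0)]}
lexical = {"the": [("Det", 0.8)], "cat": [("N", 0.5)],
           "dog": [("N", 0.5)], "chased": [("V", 1.0)]}

def pcky(words, start="S"):
    n = len(words)
    # best[i][j][A] = (max probability, backpointer) for A over words[i..j]
    best = [[dict() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        for A, p in lexical.get(w, []):
            best[i][i][A] = (p, w)                      # leaf: word backpointer
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for B, (pb, _) in best[i][k].items():
                    for C, (pc, _) in best[k + 1][j].items():
                        for A, pr in binary.get((B, C), []):
                            p = pb * pc * pr            # P(B) * P(C) * P(A -> B C)
                            if p > best[i][j].get(A, (0.0, None))[0]:
                                best[i][j][A] = (p, (k, B, C))
    return best

def build_tree(best, i, j, A):
    # Follow backpointers to reconstruct the most probable parse tree.
    _, bp = best[i][j][A]
    if isinstance(bp, str):                             # leaf cell
        return (A, bp)
    k, B, C = bp
    return (A, build_tree(best, i, k, B), build_tree(best, k + 1, j, C))

words = "the dog chased the cat".split()
best = pcky(words)
prob, _ = best[0][len(words) - 1]["S"]
print(build_tree(best, 0, len(words) - 1, "S"))
print(round(prob, 4))  # 0.0576
```

Here the two NP subtrees each have probability 0.24, so the full tree scores 0.24 × 0.24 × 1.0 = 0.0576.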
Differences Between CKY and PCKY
Feature       CKY                            Probabilistic CKY (PCKY)
Input         CFG in CNF                     PCFG in CNF
Output        Whether the sentence can be    Most probable parse tree
              generated
Data stored   Set of non-terminals           Non-terminals + probabilities + backpointers
Usage         Parsing/validating structure   Probabilistic parsing, language modeling
Output tree   Any valid parse tree           Most probable parse tree
CODE AND OUTPUT:
from nltk import CFG
from nltk.parse.chart import ChartParser

# Define a simple context-free grammar in CNF
grammar = CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> Det N
Det -> 'the' | 'a'
N -> 'dog' | 'cat' | 'ball'
V -> 'chased' | 'saw'
""")

# Create a parser
parser = ChartParser(grammar)

# Sentences to parse
sentences = [
    "the dog chased a cat",
    "a cat saw the ball",
    "the dog saw a dog",
    "a dog chased the ball",
    "the cat saw a cat"
]

# Parse and print each tree
for sentence in sentences:
    print(f'\nParsing sentence: "{sentence}"')
    tokens = sentence.split()
    trees = list(parser.parse(tokens))
    for tree in trees:
        tree.pretty_print()
Parsing sentence: "the dog chased a cat"
                 S
      ___________|_____
     |                 VP
     |            _____|____
     NP          |          NP
  ___|___        |       ___|___
 Det     N       V      Det     N
  |      |       |       |      |
 the    dog   chased     a     cat

Parsing sentence: "a cat saw the ball"
               S
      _________|____
     |              VP
     |          ____|____
     NP        |         NP
  ___|___      |      ___|____
 Det     N     V     Det      N
  |      |     |      |       |
  a     cat   saw    the     ball

Parsing sentence: "the dog saw a dog"
               S
      _________|____
     |              VP
     |          ____|____
     NP        |         NP
  ___|___      |      ___|___
 Det     N     V     Det     N
  |      |     |      |      |
 the    dog   saw     a     dog

Parsing sentence: "a dog chased the ball"
                S
      __________|_____
     |                VP
     |           _____|_____
     NP         |           NP
  ___|___       |        ___|____
 Det     N      V       Det      N
  |      |      |        |       |
  a     dog  chased     the     ball

Parsing sentence: "the cat saw a cat"
               S
      _________|____
     |              VP
     |          ____|____
     NP        |         NP
  ___|___      |      ___|___
 Det     N     V     Det     N
  |      |     |      |      |
 the    cat   saw     a     cat
Conclusion
The CKY algorithm provides a powerful technique for parsing sentences using context-free
grammars. Its probabilistic extension, the PCKY algorithm, enables more intelligent parsing by
selecting the most likely parse tree. Together, they form a foundational concept in computational
linguistics and natural language understanding.