MODULE 2
WORD LEVEL ANALYSIS
MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging
Gayana M N Module 2 – Word Level Analysis 2
Module 2 – Word Level Analysis
By the end of this chapter you should be able to:
1. Analyze the natural language text at word and phrase/sentence level.
2.Critically appraise existing Natural Language Processing (NLP) tools
such as :
Stemmers (Word level)
POS taggers (Word/Phrase/Sentence)
Gayana M N Module 2 – Word Level Analysis 3
Word Level Analysis
▪ Word level processing
▪ Methods for characterizing word sequences
▪ Identifying morphological variants
▪ Detecting and correcting misspelled words
▪ Identifying correct part-of-speech of a word
▪ Rule-based
▪ Stochastic (data driven)
▪ Hybrid
Gayana M N Module 2 – Word Level Analysis 4
MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging
Gayana M N Module 2 – Word Level Analysis 2
Introduction
▪ Regular Expressions are used for describing text strings.
▪ Example: To find word ‘supernova’ by using search engine and in
information retrieval applications
▪ Implementation of RE using Finite-State Automaton(FSA).
▪ Used in speech recognition and synthesis, spell checking, information
extraction.
▪ Errors in typing and spelling are common in text processing.
▪ An interactive facility to correct errors, Identifying word with different
meanings depending on the context.
Gayana M N Module 2 – Word Level Analysis 6
MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging
Gayana M N Module 2 – Word Level Analysis 2
Regular Expressions
▪ Regular Expressions (regexes) are powerful way to find and replace strings
that take a defined format.
▪ Regular expressions can be used to parse dates, urls and email addresses, log
files, configuration files, command line switches or programming scripts.
▪ RE are the useful tools for the design of language compilers.
▪ RE also used in NLP for tokenization, describing lexicons, morphological
analysis etc.
▪ Perl was the first language that provides integrated support for regular
expressions.
▪ RE was first introduced by Kleene (1956).
▪ RE is an algebraic formula consisting of Pattern, set of strings.
Gayana M N Module 2 – Word Level Analysis 8
Regular Expressions
▪ The simplest kind of regular expression is a sequence of simple characters;
putting characters in sequence is called concatenation.
▪ Regular expressions are case sensitive
Gayana M N Module 2 – Word Level Analysis 9
Regular Expressions
▪ Characters are grouped by putting between [ ], disjunction of characters to
match.
▪ Example: /[0123456789]/ => Any single digit.
/[0-9]/ => Any one digit from 0 to 9
/[m-p]/ => Any one letter m,n,o or p
/[^x]/ =>Single character except x
Gayana M N Module 2 – Word Level Analysis 1
0
Regular Expressions
▪ The brackets can be used with the dash (-) to specify any one character in a
range.
Gayana M N Module 2 – Word Level Analysis 1
1
Regular Expressions
▪ The square braces can also be used to specify what a single character
cannot be, by use of the caret ^ .
▪ /[a]/ matches any single character (including special characters) except a.
▪ This is only true when the caret is the first symbol after the open square
brace.
Gayana M N Module 2 – Word Level Analysis 1
2
Regular Expressions
▪ we use the question mark / ? /, which means “zero or one instances of the
previous character”
▪ * => Specifies zero or more occurrences of a preceding character or RE.
▪ * is called as Kleene *
▪ /b*/ =>Match any string containing zero or more occurrences of b. such as
‘b’, ’bb’ or ‘ bbb’ etc.
▪ /[ab]*/ => zero or more occurrence of “a”s or “b”s. This will match
strings: ‘aa’ ,’bb’ or ‘abab’
Gayana M N Module 2 – Word Level Analysis 1
3
Regular Expressions
▪ + => Specifies one or more occurrences of a preceding character
▪ + is called as Kleene +.
▪ /a+/ => one or more occurrences of ‘a’.
▪ (/./), a wildcard expression that matches any single character (except a
carriage return)
Gayana M N Module 2 – Word Level Analysis 1
4
Regular Expressions
▪ + => Specifies one or more occurrences of a preceding character
▪ + is called as Kleene +.
▪ /a+/ => one or more occurrences of ‘a’.
▪ (/./), a wildcard expression that matches any single character (except a
carriage return)
/aardvark.*aardvark/
Gayana M N Module 2 – Word Level Analysis 1
5
Regular Expressions
▪ Anchors are special characters.
▪ the caret ^ has three uses:
▪ to match the start of a line,
▪ to indicate a negation inside of square brackets, and
▪ just to mean a caret.
/The dog\.$/ matches a line that contains only the phrase The dog.
Gayana M N Module 2 – Word Level Analysis 1
6
Regular Expressions
▪ disjunction operator, also called the pipe symbol |
▪ /cat|dog/ matches either the string cat or the string dog.
▪ /gupp(y|ies)/ ?
▪ Operator Pecedence
Gayana M N Module 2 – Word Level Analysis 1
7
Regular Expressions
▪ disjunction operator, also called the pipe symbol |
▪ /cat|dog/ matches either the string cat or the string dog.
Gayana M N Module 2 – Word Level Analysis 1
8
Regular Expressions
Gayana M N Module 2 – Word Level Analysis 1
9
Regular Expressions
▪ Example: To check if string is email address or not
▪ An email address consist of a non-empty sequence of characters
followed by the ‘@ ’ symbol, followed by another non-empty sequence
of characters ending with pattern like .xx , .xxx , .xxxx etc.
▪ The regular expression for an email address is :
^[A-Za-z0-9_\.-]+ @[A-Za-z0-9_\.-]+ [A-Za-z0-9_ ][A-Za-z0-9_ ]$
• Above example works for most of the cases. But it is may not be accurate
enough to match all correct addresses.
• It may accept non-working email addresses. So fine tuning is required for
accurate characterization.
Gayana M N Module 2 – Word Level Analysis 2
0
MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging
Gayana M N Module 2 – Word Level Analysis 2
Finite State Machines
▪ Consider the game of playing board ; dice are thrown .
▪ All possible positions of the pieces on the board => states.
▪ State in which the game begins =>initial state
▪ State corresponding to the winning position =>final state
▪ Example: Machine with input, processor, memory and output device Here,
machine starts in initial state, Checks input goes to next state, If process
perfectly then reaches final state & terminates.
▪ If machine stucks in between,then non-final state-reject input
Gayana M N Module 2 – Word Level Analysis 2
2
Finite State Machines
▪ In ‘Finite Automaton’
▪ ‘Finite’ refers number of states and the alphabet of input symbols is finite.
▪ ‘Automaton’ refers machine moves automatically. I.e. change of state is
completely governed by the input.=>deterministic
▪ Properties of finite automaton:
▪ A finite set of states, one is initial/ start state , and one more is final
state.
▪ A finite Alphabet set, ⅀ , consisting of input symbols
▪ A finite set of transitions that specify each state and each symbol of
input alphabet
Gayana M N Module 2 – Word Level Analysis 2
3
Finite State Machines
▪ Deterministic Finite state Automaton (DFA): Exactly one transition leading
out of a state is possible for the different input symbol
▪ Example: Suppose ⅀ ={ a,b } , the set of states ={q0,q1,q2,q3,q4} with q0
being the start state and q4 the final state, we have the following rules of
transition:
Gayana M N Module 2 – Word Level Analysis 2
4
Finite State Machines
▪ Deterministic finite automaton (DFA) can be defined as 5 tuple
(Q,⅀, δ, S, F) below:
▪ Where, Q is set of states,
▪ ⅀ is an alphabet,
▪ S is the start state,
▪ F⊆Q is a set of final states ,
▪ δ is a transition function.
▪ ‘Finite Automaton’ used in wide variety of areas: linguistics, electrical
engineering, computer science, mathematics and logic
▪ These are important tool in computational linguistics and used in
mathematical device to implement regular expressions.
Gayana M N Module 2 – Word Level Analysis 2
5
Finite State Machines
▪ Non-Deterministic Finite Automaton (NFA):maps Q X ( ⅀ U {ɛ}) to a
subset of the power set of Q. i.e. for each state,there can be more than one
transition on a given symbol, each leading to a different state.
▪ Q X ( ⅀ U {ɛ}) , means NFA can make use of ɛ, to do state transition,if no
symbols(a,b,..) are defined.
▪ In NFA, More than one transition out of a state is possible for the same
input symbol.
Gayana M N Module 2 – Word Level Analysis 2
6
Finite State Machines
▪ Listing all the state transitions is inconvenient as it quite long so we have to
represent automaton as => State-transition table.
▪ In transition table, Rows => states , Columns=> input (symbol) , ɸ =>
Missing transition
Gayana M N Module 2 – Word Level Analysis 2
7
Finite State Machines
▪ Example: Language consisting of all strings containing a’s and b’s and
ending with baa.
▪ This language can be specified by regular expression /(a|b)*baa$/
▪ NFA and State-transition table can be represented as below:
Gayana M N Module 2 – Word Level Analysis 2
8
Finite State Machines
▪ Two automata that define the same language are said to be equivalent.
▪ NFA can be converted to an equivalent DFA and vice versa
▪ Example: Language consisting of all strings containing as and bs and
ending with baa .Regular expression is : /(a|b)*baa$/
▪ The Equivalent DFA for the NFA shown as below:
Gayana M N Module 2 – Word Level Analysis 2
9
MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging
Gayana M N Module 2 – Word Level Analysis 2
Morphological Parsing
▪ Sub-discipline of linguistics.
▪ Studies word structure and the formation of words from smaller units
(morphemes).
Gayana M N Module 2 – Word Level Analysis 31
Morphological Parsing
Morphemes
▪ Morphemes are the smallest meaningful constituents of words.
▪ Words are composed of morphemes (one or more).
▪ bread – consists of single morpheme
▪ eggs – consist of two: eggand -s
▪ sing-er-s, home-work, un-kind-ly, flipp-ed
▪ de-nation-al-iz-ation
▪ auto-servis-u (voluntary service)
▪ Plural morpheme: cat-s, dog-s, judg-es
▪ Opposite: un-happy, in-comprehensive, im-possible, ir-rational
Gayana M N Module 2 – Word Level Analysis 32
Morphological Parsing
Parsing
▪ Taking a surface input and analyzing its components and underlying
structure.
Morphological parsing
▪ To discover the morphemes that build a given word.
▪ Taking a word as input and identifying the stems and affixes (and
possibly interpreting these).
Example:
goose goose +N +SG or goose + V
geese goose +N +PL
gooses goose +V +3SG
Gayana M N Module 2 – Word Level Analysis 33
Morphological Parsing
Why Parse words ?
▪ To find stems.
▪ Simple key to word similarity
▪ yellow, yellowish, yellows, yellowed, yellowing…
▪ egg, eggs
▪ To find affixes and the information they convey.
▪ ‘ed’signals a verb
▪ ‘ish’an adjective
▪ ‘s’?
▪ Morphological parsing provides information about a word’s
semantics and the syntactic role it plays in a sentence.
Gayana M N Module 2 – Word Level Analysis 34
Morphological Parsing
Some Practical Applications ….
▪ For spell checking
▪ Is muncheable a legal word ?
▪ To identify a word’s part-of-speech (POS)
▪ Sentence Parsing, Machine Translation, …
▪ To identify a word’s stem
▪ Information Retrieval
▪ Why not just list all word forms in a lexicon?
Gayana M N Module 2 – Word Level Analysis 35
Morphological Parsing
Classes of morphemes
▪ Stems
▪ Affixes
Stems
▪ The main morpheme in a word.
▪ Contains the central meaning.
▪ A simple key to word similarity.
Affixes
▪ Modify the meaning given by the stem.
Gayana M N Module 2 – Word Level Analysis 36
Morphological Parsing
Classes of morphemes
Affixes
▪ Modify the meaning given by the stem.
▪ Divided into:
▪ Prefix – morphemes which appear before a stem
▪ Suffix - morphemes applied to the end of the stem
▪ Infix - morphemes that appear inside a stem
▪ Circumfix - morphemes applied to either end of the stem
▪ Prefixes and suffixes are quite common in Urdu, Hindi and English.
Gayana M N Module 2 – Word Level Analysis 37
Morphological Parsing
Classes of morphemes
Prefix - morphemes which appear before a stem
▪ Unhappy
▪ bewaqt (Urdu)
▪ अप मान, सं सार उप वन (Hindi),
▪ ಅತೃಪ್ತಿ (Atripti) (Kannada)
▪ ….
Gayana M N Module 2 – Word Level Analysis 38
Morphological Parsing
Classes of morphemes
Suffix - morphemes applied to the end of the stem.
▪ trees, birds
▪ ಮರಗಳು ,(upayōgisu)
▪ शीतलता, ज़Ǿरतमद
Gayana M N Module 2 – Word Level Analysis 39
Morphological Parsing
Classes of morphemes
Infix - morphemes that appear inside the stem.
▪ Common in Austronesian and Austroasiatic languages (Tagalog-
-philippines, Khmer-cambodia)
▪ Tagalog:
▪ Kayu- ‘wood’=> kayu-in- => ‘kinayu’(gathered wood)
▪ basa - ‘read’ = > b·um·asa - ‘readpast’
▪ sulat - ‘write’ = > s·um·ulat - ‘wrote’
▪ Very rare in English:
abso·bloody·lutely (emphatic or humorous form of absolutely)
Gayana M N Module 2 – Word Level Analysis 40
Morphological Parsing
Classes of morphemes
Circumfix - morphemes applied to either end of the stem.
Example:
in-correct-ly
im-matur-ity
un-bear-able
re-cover-ed
Suffixes are more common than prefixes which are more common than infixes
and circumfixes.
Gayana M N Module 2 – Word Level Analysis 41
Morphological Parsing
Word Formation
3 main ways for word formation:
▪ Inflection
▪ Derivation
▪ Compounding
Gayana M N Module 2 – Word Level Analysis 42
Morphological Parsing
Inflection
▪ Root wordis combined with a grammaticalmorphemeto yield a word of
the sameclass.
▪ bring, brings, brought, bringing
Gayana M N Module 2 – Word Level Analysis 43
Morphological Parsing
Derivation
▪ Word stem is combined with a grammatical morpheme to yield a word
belonging to different class.
▪ compute(verb)=> computation(noun)
▪ Nominalization (noun from verb/adjective)
▪ logic, logical, illogical, illogicality, logician
Gayana M N Module 2 – Word Level Analysis 44
Morphological Parsing
Compounding
▪ Merging two or more words to form a new word
▪ N + N → N: rain-bow
▪ V + N → V: pick-pocket
▪ P + V → V: over-do
▪ N + P → N: desk-top
▪ P + V → V: over-look
▪ Adj + Adj → Adj: bitter-sweet
Gayana M N Module 2 – Word Level Analysis 45
Morphological Parsing
Why Morphological Analysis
▪ New words are continually forming a NL
▪ Morphologically related to known words
▪ Understanding morphology = > understand syntactic and semantic
properties of new words.
Example:
▪ Agreement features of words (In parsing)
▪ Identify the presence of a query word in a document inspite of
morphological variants (IR).
Gayana M N Module 2 – Word Level Analysis 46
Morphological Parsing
Why Morphological Analysis
▪ Takes as input the inflected surface form of each word in text
▪ Output the parsed form consisting of canonical form (lemma) of the
words and a set of tags showing its syntactic category and morphological
characteristics.
Example:
▪ POS/inflectional properties (gender, number, person, tense, …)
Gayana M N Module 2 – Word Level Analysis 47
Morphological Parsing
What do we need to build a morphological parser?
A morphological parser uses the following information sources:
▪ Lexicon: list of stems and affixes with basic information about them
(w/ corresponding p.o.s.)
▪ Morphotactics of the language: model of how and which morphemes
can be affixed to a stem
▪ rest-less-ness vs rest-ness-less
▪ Orthographic rules: spelling modifications that may occur when
affixation occurs
▪ easy-> easier y->ier ( not ‘easyer’)
Gayana M N Module 2 – Word Level Analysis 48
Morphological Parsing
Exhaustive Lexicon with feature of all the roots
Gayana M N Module 2 – Word Level Analysis 49
Morphological Parsing
Exhaustive Lexicon with feature of all the roots
Word form Category Root Gender Number Person
Ghodhaa noun GhoDaa masculine singular 3rd
Ghodhii -do- -do- feminine -do- -do-
Ghodhon -do- -do- masculine plural -do-
Ghodhe -do- -do- -do- -do- -do-
Gayana M N Module 2 – Word Level Analysis 50
Morphological Parsing
Limitations of exhaustive lexicon with features
▪ Puts heavy demand on memory.
▪ List every form of the word=>large number of, redundant entries.
▪ Fails to capture linguistic generalization.
▪ Fails to show the relationship between different roots having similar
word forms.
▪ Essential to develop a system capable of understanding unknown
words.
▪ For morphologically complex languages, number of possible word-forms
may be theoretically infinite.
▪ Not practical to list all possible word-forms for such languages.
Gayana M N Module 2 – Word Level Analysis 51
Morphological Parsing
Morphological Systems - Stemmers
▪ Simplest morphological system.
▪ Collapses morphological variations of a given word (word-forms) to
one lemma or stem.
▪ Do not require a lexicon.
▪ Specifically used in Information Retrieval (IR)
▪ Popular stemmers
▪ Lovins (1968)
▪ Porter (1980)
Gayana M N Module 2 – Word Level Analysis 52
Morphological Parsing
Morphological Systems - Stemmers
▪ Do not require a lexicon.
▪ Use rewrite rules of the form:
▪ ier → y (eg., earlier →early)
▪ ing→ ε (eg., playing→ play)
Stemming algorithms work in 2 steps:
1. Suffix removal: Removes predefined endings from words.
2. Recoding: Adds predefined endings to the output of the step 1.
The two steps can be performed:
1. Sequentially (Lovins’s)
2. Simultaneously (Porter’s)
Gayana M N Module 2 – Word Level Analysis 53
Morphological Parsing
Morphological Systems - Stemmers
Porter Stemmer(1980)
▪ Used for tasks in which you only care about the stem.
▪ IR, topic detection, document similarity,….
▪ Lexicon-free morphological analysis.
▪ Cascades rewrite rules
▪ e.g. misunderstanding → misunderstand → understand …
▪ Easily implemented as an FST with rules
▪ ier → y (eg., earlier →early)
▪ ing → ε (eg., playing → play)
▪ Not perfect ….
▪ does → doe
Gayana M N Module 2 – Word Level Analysis 54
Morphological Parsing
Morphological Systems - Stemmers
Porter Stemmer(1980)
▪ Porters algorithm makes use of following transformational rule for the word
: “rotational” ---> “rotate”
ational ---> ate
▪ But transformation of the word ‘organization’ ---> ‘organ’ and ‘noise’ --->
‘noisy’ are not perfect
▪ It reduces only suffixes; prefixes and compounds are not reduced
Gayana M N Module 2 – Word Level Analysis 55
Morphological Parsing
Two-level Morphological Model
▪ First proposed by Kimmo Koskenniemi (1983)
▪ Applicable to highly inflected languages.
▪ Word is represented as a correspondence between its lexical form and its
surface level form.
▪ Surface form → Actual spelling of the word
▪ Lexical form → Concatenation of its constituent morphemes with
morphological features.
▪ Morphological parsing -> Mapping from the surface level into
morpheme and feature sequences on the lexical level.
Gayana M N Module 2 – Word Level Analysis 56
Morphological Parsing
Two-level Morphological Model
▪ Surface form → Actual spelling of the word
▪ Lexical form → Concatenation of its constituent morphemes with
morphological features.
Surface Level p l a y i n g
Lexical p l a y +V +PP
Surface Level b o o k s
Lexical b o o k +N PL
Gayana M N Module 2 – Word Level Analysis 57
Morphological Parsing
Two-level Morphological Model
▪ First component is the stem, cat.
▪ Second component is the morphological information, that specifies the
surface level form is a plural noun (+N+PL).
Surface Level c a t s
Lexical c a t +N +PL
Gayana M N Module 2 – Word Level Analysis 58
Morphological Parsing
Parsing with FSTs
cats → cat + N + PL
▪ Word can be represented as correspondence between lexical level (its
morphemes) and surface level (its orthography).
▪ Morphological parsing: finds the mapping (transduction) between lexical
and surface levels.
Surface Level c a t s
Lexical c a t +N +PL
Gayana M N Module 2 – Word Level Analysis 59
Morphological Parsing
Finite State Transducers (FSTs)
▪ FSTs map between one set of symbols and another using a FSA, whose
alphabet Σ is composed of pairs of symbols from input and output
alphabets.
▪ FST is a 2-state automaton that:
▪ Recognizes
▪ Generates a pair of strings
Gayana M N Module 2 – Word Level Analysis 60
Morphological Parsing
Finite State Transducers (FSTs)
▪ FST is a 6-tuple {Σ 1, Σ 2, Q, δ, S, F} consisting of
▪ Q: set of states.
▪ Σ: an alphabet of complex symbols, each an i:o pair s.t. i Σ1 (an
input alphabet) and o Σ2.
▪ Σ1: the alphabet of input
▪ Σ2: the alphabet of output
▪ S: a start state
▪ F: a set of final states in Q; F ϵ Q
▪ δ: a transition function mapping; Q x (Σ1 U {ε}) x (Σ2 U {ε}) to
power set of Q.
hot → cot
Gayana M N Module 2 – Word Level Analysis 61
Morphological Parsing
Finite State Transducers (FSTs)
h:c q1 o:o
t:t
q0 q3 q4
c:b q2 a:a
Figure: A simple transducer that accepts two input strings, hotand cat,and
maps them onto cot and bat respectively.
Gayana M N Module 2 – Word Level Analysis 62
Morphological Parsing
Finite State Transducers (FSTs)
▪ FSTs encode regular relations.
▪ Regular relation is the relation between regular languages.
▪ Regular language encoded on the upper side of an FST is called upper
language.
▪ Regular language encoded on the lower side is termed as lower
language.
Gayana M N Module 2 – Word Level Analysis 63
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Step2: Maps
Surface
Step1: Split word
Intermediate morphemes to stem
into possible Lexical form
form form and morphological
morphemes
features
Figure: Two-step morphological parser.
Gayana M N Module 2 – Word Level Analysis 64
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Step2: Maps
Surface
Step1: Split word
Intermediate morphemes to stem
into possible Lexical form
form form and morphological
morphemes
features
bird bird bird + N +SG
birds bird + s bird + N +PL
goose goose goose + N + SG
geese geese geese + N + PL
boxes box + s box + N + PL
Figure: Two-step morphological parser.
Gayana M N Module 2 – Word Level Analysis 65
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
Gayana M N Module 2 – Word Level Analysis 66
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s
Gayana M N Module 2 – Word Level Analysis 67
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s
▪ Output is concatenation of morphemes. i.e., Stems + Affixes.
▪ + indicates morpheme boundaries.
Gayana M N Module 2 – Word Level Analysis 68
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s
▪ Output is concatenation of morphemes. i.e., Stems + Affixes.
▪ + indicates morpheme boundaries.
▪ Also, consider spellingrules.
Gayana M N Module 2 – Word Level Analysis 69
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s
▪ Output is concatenation of morphemes. i.e., Stems + Affixes.
▪ + indicates morpheme boundaries.
▪ Also, consider spellingrules.
foxes → foxe+ s foxeis stem, s the suffix.
Gayana M N Module 2 – Word Level Analysis 70
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than one representation for a given word.
Gayana M N Module 2 – Word Level Analysis 71
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than onerepresentation for a given word.
+ADJ +Comp
l e s s ε
l e s s ε e r
Figure: A simple FST.
Gayana M N Module 2 – Word Level Analysis 72
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than onerepresentation for a given word.
+ADJ +Comp
l e s s ε
l e s s ε e r
Figure: A simple FST.
▪ A transducer that does the mapping (translation) required by the first step for
the surface form ‘lesser’ represents information that:
▪ the comparative form of the adjective ‘less’ is ‘lesser’,
▪ εis the empty string.
Gayana M N Module 2 – Word Level Analysis 73
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than onerepresentation for a given word.
+ADJ +Comp
l e s s ε
l e s s ε e r
Figure: A simple FST.
▪ A transducer that does the mapping (translation) required by the first step for
the surface form ‘lesser’ represents information that:
▪ the comparative form of the adjective ‘less’ is ‘lesser’,
▪ εis the empty string.
▪ The automaton is inherently bi-directional.
▪ Same transducer can be used for analysis (surface input, ‘upward
application’) or for generation (lexical input, ‘downward’ application).
Gayana M N Module 2 – Word Level Analysis 74
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of thestems and meaningof the affixes.
Gayana M N Module 2 – Word Level Analysis 75
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of thestems and meaningof the affixes.
cats → cat + s
Gayana M N Module 2 – Word Level Analysis 76
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of thestems and meaningof the affixes.
cats → cat + s → cat +N +PL
Gayana M N Module 2 – Word Level Analysis 77
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL
Gayana M N Module 2 – Word Level Analysis 78
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL
foxes → fox + s
Gayana M N Module 2 – Word Level Analysis 79
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL
foxes → fox + s → fox +N +PL
Gayana M N Module 2 – Word Level Analysis 80
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL
foxes → fox + s → fox +N +PL
▪ Learn that ‘foxe’is not a legal stem.
▪ foxe + s is incorrect way of splitting; So, discard foxes.
Gayana M N Module 2 – Word Level Analysis 81
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL
foxes → fox + s → fox +N +PL
▪ Learn that ‘foxe’is not a legal stem.
▪ foxe + s is incorrect way of splitting; So, discard foxes.
▪ However, spouses → spouse + s → spouse +N +PL is correct.
Gayana M N Module 2 – Word Level Analysis 82
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL
foxes → fox + s → fox +N +PL
▪ Learn that ‘foxe’is not a legal stem.
▪ foxe + s is incorrect way of splitting; So, discard foxes.
▪ However, spouses → spouse + s → spouse +N +PL is correct.
▪ So also, parses → parse+ s → parse+N +PL
Gayana M N Module 2 – Word Level Analysis 83
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL
Orthographic rules
foxes → fox + s → fox +N +PL are used to handle
spelling variations
▪ Learn that ‘foxe’is not a legal stem.
▪ foxe + s is incorrect way of splitting; So, discard foxes.
▪ However, spouses → spouse + s → spouse +N +PL is correct.
▪ So also, parses → parse+ s → parse+N +PL
Gayana M N Module 2 – Word Level Analysis 84
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL Spelling Rules:
• Add ‘e’after –s, -z, -x, -ch, -sh
foxes → fox + s → fox +N +PL before the ‘s’
• dish → dishes
• box →boxes
▪ Learn that ‘foxe’is not a legal stem.
▪ foxe + s is incorrect way of splitting; So, discard foxes.
▪ However, spouses → spouse + s → spouse +N +PL is correct.
▪ So also, parses → parse+ s → parse+N +PL
Gayana M N Module 2 – Word Level Analysis 85
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Implementation of the two steps with transducers requires building two
transducers:
▪ One that maps the surface form to the intermediate form.
▪ Another that maps the intermediate form to the lexical form.
Gayana M N Module 2 – Word Level Analysis 86
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
FST-based morphological parser for singular and plural nouns in English.
Considerations:
▪ Plural form of regular nouns usually end with –s or –es.
▪ However, a word ending in ‘s’neednot necessarily be the plural form of a word.
▪ There are several singular words ending in ‘s’ (e.g., miss and ass).
▪ One of the required translations is the deletion of the ‘e’,when introducing a
morpheme boundary.
▪ Required for words ending in -xes, -ses, -zes (e.g., boxes and suffixes).
Gayana M N Module 2 – Word Level Analysis 87
Morphological Parsing
Implementing Two-level Morphological Model using FSTs: STEP1
FST-based morphological parser for singular and plural nouns in English
Figure: A simplified FST, mapping English nouns to the intermediate form.
Gayana M N Module 2 – Word Level Analysis 88
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP1
▪ Sequences of states that the transducer undergoes, given the surface
forms birds as input.
s
2 6 Output: bird + s
Surface level s
ε +
b i r d s
1 1 1 1 1 1 Output: birds
b i r d s
Lexical level
Figure: Possible sequences of states, Input: birds
Gayana M N Module 2 – Word Level Analysis 89
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP1
▪ Sequences of states that the transducer undergoes, given the surface form
boxes as input.
ε s
1 + 2 s 6 Output: boxe + s
e s
e s
4 Output: boxes
b o x
1 1 1 3
b o x e
ε ε s
Output: box + s
4 5 6
+ s
Figure: Possible sequences of states, Input: boxes
Gayana M N Module 2 – Word Level Analysis 90
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping from the intermediate level to the
lexical level.
▪ The input to the transducer has one of the following forms:
1. Regular noun stem, e.g., bird, cat
2. Regular noun stem +s, e.g., bird + s
3. Singular irregular noun stem, e.g., goose
4. Plural irregular noun stem, e.g., geese
Gayana M N Module 2 – Word Level Analysis 91
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping from the intermediate level to the
lexical level.
▪ The input to the transducer has one of the following forms:
1. Regular noun stem, e.g., bird, cat → Map all symbols of the stem to themselves and
then output N and sg.
2. Regular noun stem +s, e.g., bird + s
3. Singular irregular noun stem, e.g., goose
4. Plural irregular noun stem, e.g., geese
Gayana M N Module 2 – Word Level Analysis 92
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping from the intermediate level to the
lexical level.
▪ The input to the transducer has one of the following forms:
1. Regular noun stem, e.g., bird, cat → Map all symbols of the stem to themselves and
then output N and sg.
2. Regular noun stem +s, e.g., bird + s → Map all symbols of the stem to themselves;
but then output N and replace PL with s.
3. Singular irregular noun stem, e.g., goose
4. Plural irregular noun stem, e.g., geese
Gayana M N Module 2 – Word Level Analysis 93
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping from the intermediate level to the
lexical level.
▪ The input to the transducer has one of the following forms:
1. Regular noun stem, e.g., bird, cat → Map all symbols of the stem to themselves and
then output N and sg.
2. Regular noun stem +s, e.g., bird + s → Map all symbols of the stem to themselves;
but then output N and replace PL with s.
3. Singular irregular noun stem, e.g., goose→ Same as in first case.
4. Plural irregular noun stem, e.g., geese
Gayana M N Module 2 – Word Level Analysis 94
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping from the intermediate level to the
lexical level.
▪ The input to the transducer has one of the following forms:
1. Regular noun stem, e.g., bird, cat → Map all symbols of the stem to themselves
and then output N and sg.
2. Regular noun stem +s, e.g., bird + s → Map all symbols of the stem to
themselves; but then output N and replace PL with s.
3. Singular irregular noun stem, e.g., goose→ Same as in first case.
4. Plural irregular noun stem, e.g., geese→ Map irregular plural noun stem to the
corresponding singular stem and add N and PL.
Gayana M N Module 2 – Word Level Analysis 95
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
Figure: Stem Mappings
Gayana M N Module 2 – Word Level Analysis 96
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping fromthe intermediatelevelto the
lexical level.
Figure: Transducer for Step 2
Gayana M N Module 2 – Word Level Analysis 97
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
b i r d ε:N + :ε
ε:sg s:PL
g o o s e ε:N ε:sg
ε:sg
g e:o e:o s e ε:N
Figure: A transducer mapping nouns to their stem and morphological features
Gayana M N Module 2 – Word Level Analysis 98
Morphological Parsing
Summary
▪ Simplest morphological systems are stemmers.
▪ Do not use lexicon, instead use re-write rules.
▪ Not perfect.
▪ FSTs provide a useful tool for implementing a standard model
of morphological analysis.
▪ Koskenniemi’s two-level morphological model.
▪ But for many tasks (e.g. IR) much simpler approaches are still widely
used,
e.g. the rule-based Porter Stemmer.
Gayana M N Module 2 – Word Level Analysis 99