0% found this document useful (0 votes)

82 views99 pages

Module2 Ch3 A

Uploaded by

pomono1988

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

82 views99 pages

Module2 Ch3 A

Uploaded by

pomono1988

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 99

MODULE 2

WORD LEVEL ANALYSIS

MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging

Gayana M N Module 2 – Word Level Analysis 2

Module 2 – Word Level Analysis

By the end of this chapter you should be able to:

1. Analyze the natural language text at word and phrase/sentence level.

2.Critically appraise existing Natural Language Processing (NLP) tools
such as :
Stemmers (Word level)
POS taggers (Word/Phrase/Sentence)

Gayana M N Module 2 – Word Level Analysis 3

Word Level Analysis

▪ Word level processing

▪ Methods for characterizing word sequences
▪ Identifying morphological variants
▪ Detecting and correcting misspelled words
▪ Identifying correct part-of-speech of a word
▪ Rule-based
▪ Stochastic (data driven)
▪ Hybrid

Gayana M N Module 2 – Word Level Analysis 4

MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging

Gayana M N Module 2 – Word Level Analysis 2

Introduction

▪ Regular Expressions are used for describing text strings.

▪ Example: To find word ‘supernova’ by using search engine and in
information retrieval applications
▪ Implementation of RE using Finite-State Automaton(FSA).
▪ Used in speech recognition and synthesis, spell checking, information
extraction.
▪ Errors in typing and spelling are common in text processing.
▪ An interactive facility to correct errors, Identifying word with different
meanings depending on the context.

Gayana M N Module 2 – Word Level Analysis 6

MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging

Gayana M N Module 2 – Word Level Analysis 2

Regular Expressions
▪ Regular Expressions (regexes) are powerful way to find and replace strings
that take a defined format.
▪ Regular expressions can be used to parse dates, urls and email addresses, log
files, configuration files, command line switches or programming scripts.
▪ RE are the useful tools for the design of language compilers.
▪ RE also used in NLP for tokenization, describing lexicons, morphological
analysis etc.
▪ Perl was the first language that provides integrated support for regular
expressions.
▪ RE was first introduced by Kleene (1956).
▪ RE is an algebraic formula consisting of Pattern, set of strings.

Gayana M N Module 2 – Word Level Analysis 8

Regular Expressions
▪ The simplest kind of regular expression is a sequence of simple characters;
putting characters in sequence is called concatenation.

▪ Regular expressions are case sensitive

Gayana M N Module 2 – Word Level Analysis 9

Regular Expressions
▪ Characters are grouped by putting between [ ], disjunction of characters to
match.
▪ Example: /[0123456789]/ => Any single digit.
/[0-9]/ => Any one digit from 0 to 9
/[m-p]/ => Any one letter m,n,o or p
/[^x]/ =>Single character except x

Gayana M N Module 2 – Word Level Analysis 1

0
Regular Expressions
▪ The brackets can be used with the dash (-) to specify any one character in a
range.

Gayana M N Module 2 – Word Level Analysis 1

1
Regular Expressions
▪ The square braces can also be used to specify what a single character
cannot be, by use of the caret ^ .
▪ /[a]/ matches any single character (including special characters) except a.
▪ This is only true when the caret is the first symbol after the open square
brace.

Gayana M N Module 2 – Word Level Analysis 1

2
Regular Expressions
▪ we use the question mark / ? /, which means “zero or one instances of the
previous character”

▪ * => Specifies zero or more occurrences of a preceding character or RE.

▪ * is called as Kleene *
▪ /b*/ =>Match any string containing zero or more occurrences of b. such as
‘b’, ’bb’ or ‘ bbb’ etc.
▪ /[ab]*/ => zero or more occurrence of “a”s or “b”s. This will match
strings: ‘aa’ ,’bb’ or ‘abab’
Gayana M N Module 2 – Word Level Analysis 1
3
Regular Expressions
▪ + => Specifies one or more occurrences of a preceding character
▪ + is called as Kleene +.
▪ /a+/ => one or more occurrences of ‘a’.
▪ (/./), a wildcard expression that matches any single character (except a
carriage return)

Gayana M N Module 2 – Word Level Analysis 1

4
Regular Expressions
▪ + => Specifies one or more occurrences of a preceding character
▪ + is called as Kleene +.
▪ /a+/ => one or more occurrences of ‘a’.
▪ (/./), a wildcard expression that matches any single character (except a
carriage return)

/aardvark.*aardvark/

Gayana M N Module 2 – Word Level Analysis 1

5
Regular Expressions
▪ Anchors are special characters.

▪ the caret ^ has three uses:

▪ to match the start of a line,
▪ to indicate a negation inside of square brackets, and
▪ just to mean a caret.
/The dog\.$/ matches a line that contains only the phrase The dog.
Gayana M N Module 2 – Word Level Analysis 1
6
Regular Expressions
▪ disjunction operator, also called the pipe symbol |
▪ /cat|dog/ matches either the string cat or the string dog.
▪ /gupp(y|ies)/ ?
▪ Operator Pecedence

Gayana M N Module 2 – Word Level Analysis 1

7
Regular Expressions
▪ disjunction operator, also called the pipe symbol |
▪ /cat|dog/ matches either the string cat or the string dog.

Gayana M N Module 2 – Word Level Analysis 1

8
Regular Expressions

Gayana M N Module 2 – Word Level Analysis 1

9
Regular Expressions
▪ Example: To check if string is email address or not
▪ An email address consist of a non-empty sequence of characters
followed by the ‘@ ’ symbol, followed by another non-empty sequence
of characters ending with pattern like .xx , .xxx , .xxxx etc.
▪ The regular expression for an email address is :
^[A-Za-z0-9_\.-]+ @[A-Za-z0-9_\.-]+ [A-Za-z0-9_ ][A-Za-z0-9_ ]$
• Above example works for most of the cases. But it is may not be accurate
enough to match all correct addresses.
• It may accept non-working email addresses. So fine tuning is required for
accurate characterization.

Gayana M N Module 2 – Word Level Analysis 2

0
MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging

Gayana M N Module 2 – Word Level Analysis 2

Finite State Machines
▪ Consider the game of playing board ; dice are thrown .
▪ All possible positions of the pieces on the board => states.
▪ State in which the game begins =>initial state
▪ State corresponding to the winning position =>final state
▪ Example: Machine with input, processor, memory and output device Here,
machine starts in initial state, Checks input goes to next state, If process
perfectly then reaches final state & terminates.
▪ If machine stucks in between,then non-final state-reject input

Gayana M N Module 2 – Word Level Analysis 2

2
Finite State Machines
▪ In ‘Finite Automaton’
▪ ‘Finite’ refers number of states and the alphabet of input symbols is finite.
▪ ‘Automaton’ refers machine moves automatically. I.e. change of state is
completely governed by the input.=>deterministic
▪ Properties of finite automaton:
▪ A finite set of states, one is initial/ start state , and one more is final
state.
▪ A finite Alphabet set, ⅀ , consisting of input symbols
▪ A finite set of transitions that specify each state and each symbol of
input alphabet

Gayana M N Module 2 – Word Level Analysis 2

3
Finite State Machines
▪ Deterministic Finite state Automaton (DFA): Exactly one transition leading
out of a state is possible for the different input symbol
▪ Example: Suppose ⅀ ={ a,b } , the set of states ={q0,q1,q2,q3,q4} with q0
being the start state and q4 the final state, we have the following rules of
transition:

Gayana M N Module 2 – Word Level Analysis 2

4
Finite State Machines
▪ Deterministic finite automaton (DFA) can be defined as 5 tuple
(Q,⅀, δ, S, F) below:
▪ Where, Q is set of states,
▪ ⅀ is an alphabet,
▪ S is the start state,
▪ F⊆Q is a set of final states ,
▪ δ is a transition function.
▪ ‘Finite Automaton’ used in wide variety of areas: linguistics, electrical
engineering, computer science, mathematics and logic
▪ These are important tool in computational linguistics and used in
mathematical device to implement regular expressions.
Gayana M N Module 2 – Word Level Analysis 2
5
Finite State Machines
▪ Non-Deterministic Finite Automaton (NFA):maps Q X ( ⅀ U {ɛ}) to a
subset of the power set of Q. i.e. for each state,there can be more than one
transition on a given symbol, each leading to a different state.
▪ Q X ( ⅀ U {ɛ}) , means NFA can make use of ɛ, to do state transition,if no
symbols(a,b,..) are defined.
▪ In NFA, More than one transition out of a state is possible for the same
input symbol.

Gayana M N Module 2 – Word Level Analysis 2

6
Finite State Machines
▪ Listing all the state transitions is inconvenient as it quite long so we have to
represent automaton as => State-transition table.
▪ In transition table, Rows => states , Columns=> input (symbol) , ɸ =>
Missing transition

Gayana M N Module 2 – Word Level Analysis 2

7
Finite State Machines
▪ Example: Language consisting of all strings containing a’s and b’s and
ending with baa.
▪ This language can be specified by regular expression /(a|b)*baa$/
▪ NFA and State-transition table can be represented as below:

Gayana M N Module 2 – Word Level Analysis 2

8
Finite State Machines
▪ Two automata that define the same language are said to be equivalent.
▪ NFA can be converted to an equivalent DFA and vice versa
▪ Example: Language consisting of all strings containing as and bs and
ending with baa .Regular expression is : /(a|b)*baa$/
▪ The Equivalent DFA for the NFA shown as below:

Gayana M N Module 2 – Word Level Analysis 2

9
MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging

Gayana M N Module 2 – Word Level Analysis 2

Morphological Parsing

▪ Sub-discipline of linguistics.
▪ Studies word structure and the formation of words from smaller units

(morphemes).

Gayana M N Module 2 – Word Level Analysis 31

Morphological Parsing

Morphemes
▪ Morphemes are the smallest meaningful constituents of words.
▪ Words are composed of morphemes (one or more).
▪ bread – consists of single morpheme
▪ eggs – consist of two: eggand -s
▪ sing-er-s, home-work, un-kind-ly, flipp-ed
▪ de-nation-al-iz-ation
▪ auto-servis-u (voluntary service)
▪ Plural morpheme: cat-s, dog-s, judg-es
▪ Opposite: un-happy, in-comprehensive, im-possible, ir-rational

Gayana M N Module 2 – Word Level Analysis 32

Morphological Parsing
Parsing
▪ Taking a surface input and analyzing its components and underlying
structure.
Morphological parsing
▪ To discover the morphemes that build a given word.
▪ Taking a word as input and identifying the stems and affixes (and
possibly interpreting these).
Example:
goose goose +N +SG or goose + V
geese goose +N +PL
gooses goose +V +3SG
Gayana M N Module 2 – Word Level Analysis 33
Morphological Parsing
Why Parse words ?
▪ To find stems.
▪ Simple key to word similarity
▪ yellow, yellowish, yellows, yellowed, yellowing…
▪ egg, eggs
▪ To find affixes and the information they convey.
▪ ‘ed’signals a verb
▪ ‘ish’an adjective
▪ ‘s’?
▪ Morphological parsing provides information about a word’s
semantics and the syntactic role it plays in a sentence.
Gayana M N Module 2 – Word Level Analysis 34
Morphological Parsing

Some Practical Applications ….

▪ For spell checking
▪ Is muncheable a legal word ?
▪ To identify a word’s part-of-speech (POS)
▪ Sentence Parsing, Machine Translation, …
▪ To identify a word’s stem
▪ Information Retrieval
▪ Why not just list all word forms in a lexicon?

Gayana M N Module 2 – Word Level Analysis 35

Morphological Parsing
Classes of morphemes
▪ Stems
▪ Affixes
Stems
▪ The main morpheme in a word.
▪ Contains the central meaning.
▪ A simple key to word similarity.

Affixes
▪ Modify the meaning given by the stem.

Gayana M N Module 2 – Word Level Analysis 36

Morphological Parsing
Classes of morphemes
Affixes
▪ Modify the meaning given by the stem.
▪ Divided into:
▪ Prefix – morphemes which appear before a stem
▪ Suffix - morphemes applied to the end of the stem
▪ Infix - morphemes that appear inside a stem
▪ Circumfix - morphemes applied to either end of the stem
▪ Prefixes and suffixes are quite common in Urdu, Hindi and English.

Gayana M N Module 2 – Word Level Analysis 37

Morphological Parsing

Classes of morphemes
Prefix - morphemes which appear before a stem
▪ Unhappy
▪ bewaqt (Urdu)
▪ अप मान, सं सार उप वन (Hindi),
▪ ಅತೃಪ್ತಿ (Atripti) (Kannada)
▪ ….

Gayana M N Module 2 – Word Level Analysis 38

Morphological Parsing
Classes of morphemes
Suffix - morphemes applied to the end of the stem.
▪ trees, birds
▪ ಮರಗಳು ,(upayōgisu)
▪ शीतलता, ज़Ǿरतमद

Gayana M N Module 2 – Word Level Analysis 39

Morphological Parsing
Classes of morphemes
Infix - morphemes that appear inside the stem.

▪ Common in Austronesian and Austroasiatic languages (Tagalog-

-philippines, Khmer-cambodia)
▪ Tagalog:
▪ Kayu- ‘wood’=> kayu-in- => ‘kinayu’(gathered wood)
▪ basa - ‘read’ = > b·um·asa - ‘readpast’
▪ sulat - ‘write’ = > s·um·ulat - ‘wrote’

▪ Very rare in English:

abso·bloody·lutely (emphatic or humorous form of absolutely)

Gayana M N Module 2 – Word Level Analysis 40

Morphological Parsing
Classes of morphemes
Circumfix - morphemes applied to either end of the stem.
Example:
in-correct-ly
im-matur-ity
un-bear-able
re-cover-ed

Suffixes are more common than prefixes which are more common than infixes
and circumfixes.

Gayana M N Module 2 – Word Level Analysis 41

Morphological Parsing

Word Formation
3 main ways for word formation:
▪ Inflection
▪ Derivation
▪ Compounding

Gayana M N Module 2 – Word Level Analysis 42

Morphological Parsing

Inflection
▪ Root wordis combined with a grammaticalmorphemeto yield a word of
the sameclass.
▪ bring, brings, brought, bringing

Gayana M N Module 2 – Word Level Analysis 43

Morphological Parsing

Derivation
▪ Word stem is combined with a grammatical morpheme to yield a word
belonging to different class.
▪ compute(verb)=> computation(noun)
▪ Nominalization (noun from verb/adjective)
▪ logic, logical, illogical, illogicality, logician

Gayana M N Module 2 – Word Level Analysis 44

Morphological Parsing

Compounding
▪ Merging two or more words to form a new word
▪ N + N → N: rain-bow
▪ V + N → V: pick-pocket
▪ P + V → V: over-do
▪ N + P → N: desk-top
▪ P + V → V: over-look
▪ Adj + Adj → Adj: bitter-sweet

Gayana M N Module 2 – Word Level Analysis 45

Morphological Parsing

Why Morphological Analysis

▪ New words are continually forming a NL
▪ Morphologically related to known words
▪ Understanding morphology = > understand syntactic and semantic
properties of new words.
Example:
▪ Agreement features of words (In parsing)
▪ Identify the presence of a query word in a document inspite of
morphological variants (IR).

Gayana M N Module 2 – Word Level Analysis 46

Morphological Parsing

Why Morphological Analysis

▪ Takes as input the inflected surface form of each word in text
▪ Output the parsed form consisting of canonical form (lemma) of the
words and a set of tags showing its syntactic category and morphological
characteristics.

Example:
▪ POS/inflectional properties (gender, number, person, tense, …)

Gayana M N Module 2 – Word Level Analysis 47

Morphological Parsing

What do we need to build a morphological parser?

A morphological parser uses the following information sources:
▪ Lexicon: list of stems and affixes with basic information about them
(w/ corresponding p.o.s.)
▪ Morphotactics of the language: model of how and which morphemes
can be affixed to a stem
▪ rest-less-ness vs rest-ness-less
▪ Orthographic rules: spelling modifications that may occur when
affixation occurs
▪ easy-> easier y->ier ( not ‘easyer’)
Gayana M N Module 2 – Word Level Analysis 48
Morphological Parsing

Exhaustive Lexicon with feature of all the roots

Gayana M N Module 2 – Word Level Analysis 49

Morphological Parsing

Exhaustive Lexicon with feature of all the roots

Word form Category Root Gender Number Person

Ghodhaa noun GhoDaa masculine singular 3rd

Ghodhii -do- -do- feminine -do- -do-

Ghodhon -do- -do- masculine plural -do-

Ghodhe -do- -do- -do- -do- -do-

Gayana M N Module 2 – Word Level Analysis 50

Morphological Parsing
Limitations of exhaustive lexicon with features
▪ Puts heavy demand on memory.
▪ List every form of the word=>large number of, redundant entries.
▪ Fails to capture linguistic generalization.
▪ Fails to show the relationship between different roots having similar
word forms.
▪ Essential to develop a system capable of understanding unknown
words.
▪ For morphologically complex languages, number of possible word-forms
may be theoretically infinite.
▪ Not practical to list all possible word-forms for such languages.

Gayana M N Module 2 – Word Level Analysis 51

Morphological Parsing

Morphological Systems - Stemmers

▪ Simplest morphological system.
▪ Collapses morphological variations of a given word (word-forms) to
one lemma or stem.
▪ Do not require a lexicon.
▪ Specifically used in Information Retrieval (IR)
▪ Popular stemmers
▪ Lovins (1968)
▪ Porter (1980)

Gayana M N Module 2 – Word Level Analysis 52

Morphological Parsing

Morphological Systems - Stemmers

▪ Do not require a lexicon.
▪ Use rewrite rules of the form:
▪ ier → y (eg., earlier →early)
▪ ing→ ε (eg., playing→ play)

Stemming algorithms work in 2 steps:

1. Suffix removal: Removes predefined endings from words.
2. Recoding: Adds predefined endings to the output of the step 1.
The two steps can be performed:
1. Sequentially (Lovins’s)
2. Simultaneously (Porter’s)
Gayana M N Module 2 – Word Level Analysis 53
Morphological Parsing

Morphological Systems - Stemmers

Porter Stemmer(1980)
▪ Used for tasks in which you only care about the stem.
▪ IR, topic detection, document similarity,….
▪ Lexicon-free morphological analysis.
▪ Cascades rewrite rules
▪ e.g. misunderstanding → misunderstand → understand …
▪ Easily implemented as an FST with rules
▪ ier → y (eg., earlier →early)
▪ ing → ε (eg., playing → play)
▪ Not perfect ….
▪ does → doe
Gayana M N Module 2 – Word Level Analysis 54
Morphological Parsing

Morphological Systems - Stemmers

Porter Stemmer(1980)
▪ Porters algorithm makes use of following transformational rule for the word
: “rotational” ---> “rotate”
ational ---> ate
▪ But transformation of the word ‘organization’ ---> ‘organ’ and ‘noise’ --->
‘noisy’ are not perfect
▪ It reduces only suffixes; prefixes and compounds are not reduced

Gayana M N Module 2 – Word Level Analysis 55

Morphological Parsing
Two-level Morphological Model
▪ First proposed by Kimmo Koskenniemi (1983)
▪ Applicable to highly inflected languages.
▪ Word is represented as a correspondence between its lexical form and its
surface level form.
▪ Surface form → Actual spelling of the word
▪ Lexical form → Concatenation of its constituent morphemes with
morphological features.
▪ Morphological parsing -> Mapping from the surface level into
morpheme and feature sequences on the lexical level.

Gayana M N Module 2 – Word Level Analysis 56

Morphological Parsing
Two-level Morphological Model

▪ Surface form → Actual spelling of the word

▪ Lexical form → Concatenation of its constituent morphemes with
morphological features.
Surface Level p l a y i n g
Lexical p l a y +V +PP

Surface Level b o o k s
Lexical b o o k +N PL

Gayana M N Module 2 – Word Level Analysis 57

Morphological Parsing
Two-level Morphological Model

▪ First component is the stem, cat.

▪ Second component is the morphological information, that specifies the
surface level form is a plural noun (+N+PL).

Surface Level c a t s
Lexical c a t +N +PL

Gayana M N Module 2 – Word Level Analysis 58

Morphological Parsing

Parsing with FSTs

cats → cat + N + PL

▪ Word can be represented as correspondence between lexical level (its

morphemes) and surface level (its orthography).
▪ Morphological parsing: finds the mapping (transduction) between lexical
and surface levels.
Surface Level c a t s
Lexical c a t +N +PL

Gayana M N Module 2 – Word Level Analysis 59

Morphological Parsing
Finite State Transducers (FSTs)
▪ FSTs map between one set of symbols and another using a FSA, whose
alphabet Σ is composed of pairs of symbols from input and output
alphabets.

▪ FST is a 2-state automaton that:

▪ Recognizes
▪ Generates a pair of strings

Gayana M N Module 2 – Word Level Analysis 60

Morphological Parsing
Finite State Transducers (FSTs)
▪ FST is a 6-tuple {Σ 1, Σ 2, Q, δ, S, F} consisting of
▪ Q: set of states.
▪ Σ: an alphabet of complex symbols, each an i:o pair s.t. i Σ1 (an
input alphabet) and o Σ2.
▪ Σ1: the alphabet of input
▪ Σ2: the alphabet of output
▪ S: a start state
▪ F: a set of final states in Q; F ϵ Q
▪ δ: a transition function mapping; Q x (Σ1 U {ε}) x (Σ2 U {ε}) to
power set of Q.
hot → cot
Gayana M N Module 2 – Word Level Analysis 61
Morphological Parsing
Finite State Transducers (FSTs)

h:c q1 o:o
t:t
q0 q3 q4

c:b q2 a:a

Figure: A simple transducer that accepts two input strings, hotand cat,and
maps them onto cot and bat respectively.

Gayana M N Module 2 – Word Level Analysis 62

Morphological Parsing

Finite State Transducers (FSTs)

▪ FSTs encode regular relations.

▪ Regular relation is the relation between regular languages.
▪ Regular language encoded on the upper side of an FST is called upper
language.
▪ Regular language encoded on the lower side is termed as lower
language.

Gayana M N Module 2 – Word Level Analysis 63

Figure: Two-step morphological parser.

Gayana M N Module 2 – Word Level Analysis 64

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Step2: Maps
Surface
Step1: Split word
Intermediate morphemes to stem
into possible Lexical form
form form and morphological
morphemes
features
bird bird bird + N +SG
birds bird + s bird + N +PL
goose goose goose + N + SG
geese geese geese + N + PL
boxes box + s box + N + PL

Figure: Two-step morphological parser.

Gayana M N Module 2 – Word Level Analysis 65

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.

Gayana M N Module 2 – Word Level Analysis 66

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s

Gayana M N Module 2 – Word Level Analysis 67

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s

▪ Output is concatenation of morphemes. i.e., Stems + Affixes.

▪ + indicates morpheme boundaries.

Gayana M N Module 2 – Word Level Analysis 68

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s

▪ Output is concatenation of morphemes. i.e., Stems + Affixes.

▪ + indicates morpheme boundaries.

▪ Also, consider spellingrules.

Gayana M N Module 2 – Word Level Analysis 69

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s

▪ Output is concatenation of morphemes. i.e., Stems + Affixes.

▪ + indicates morpheme boundaries.

▪ Also, consider spellingrules.

foxes → foxe+ s foxeis stem, s the suffix.

Gayana M N Module 2 – Word Level Analysis 70

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than one representation for a given word.

Gayana M N Module 2 – Word Level Analysis 71

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than onerepresentation for a given word.
+ADJ +Comp
l e s s ε
l e s s ε e r
Figure: A simple FST.

Gayana M N Module 2 – Word Level Analysis 72

Gayana M N Module 2 – Word Level Analysis 73

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than onerepresentation for a given word.
+ADJ +Comp
l e s s ε
l e s s ε e r
Figure: A simple FST.
▪ A transducer that does the mapping (translation) required by the first step for
the surface form ‘lesser’ represents information that:
▪ the comparative form of the adjective ‘less’ is ‘lesser’,
▪ εis the empty string.
▪ The automaton is inherently bi-directional.
▪ Same transducer can be used for analysis (surface input, ‘upward
application’) or for generation (lexical input, ‘downward’ application).
Gayana M N Module 2 – Word Level Analysis 74
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of thestems and meaningof the affixes.

Gayana M N Module 2 – Word Level Analysis 75

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of thestems and meaningof the affixes.
cats → cat + s

Gayana M N Module 2 – Word Level Analysis 76

Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of thestems and meaningof the affixes.
cats → cat + s → cat +N +PL