[go: up one dir, main page]

0% found this document useful (0 votes)
82 views99 pages

Module2 Ch3 A

Uploaded by

pomono1988
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
82 views99 pages

Module2 Ch3 A

Uploaded by

pomono1988
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 99

MODULE 2

WORD LEVEL ANALYSIS


MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging

Gayana M N Module 2 – Word Level Analysis 2


Module 2 – Word Level Analysis

By the end of this chapter you should be able to:

1. Analyze the natural language text at word and phrase/sentence level.


2.Critically appraise existing Natural Language Processing (NLP) tools
such as :
Stemmers (Word level)
POS taggers (Word/Phrase/Sentence)

Gayana M N Module 2 – Word Level Analysis 3


Word Level Analysis

▪ Word level processing


▪ Methods for characterizing word sequences
▪ Identifying morphological variants
▪ Detecting and correcting misspelled words
▪ Identifying correct part-of-speech of a word
▪ Rule-based
▪ Stochastic (data driven)
▪ Hybrid

Gayana M N Module 2 – Word Level Analysis 4


MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging

Gayana M N Module 2 – Word Level Analysis 2


Introduction

▪ Regular Expressions are used for describing text strings.


▪ Example: To find word ‘supernova’ by using search engine and in
information retrieval applications
▪ Implementation of RE using Finite-State Automaton(FSA).
▪ Used in speech recognition and synthesis, spell checking, information
extraction.
▪ Errors in typing and spelling are common in text processing.
▪ An interactive facility to correct errors, Identifying word with different
meanings depending on the context.

Gayana M N Module 2 – Word Level Analysis 6


MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging

Gayana M N Module 2 – Word Level Analysis 2


Regular Expressions
▪ Regular Expressions (regexes) are powerful way to find and replace strings
that take a defined format.
▪ Regular expressions can be used to parse dates, urls and email addresses, log
files, configuration files, command line switches or programming scripts.
▪ RE are the useful tools for the design of language compilers.
▪ RE also used in NLP for tokenization, describing lexicons, morphological
analysis etc.
▪ Perl was the first language that provides integrated support for regular
expressions.
▪ RE was first introduced by Kleene (1956).
▪ RE is an algebraic formula consisting of Pattern, set of strings.

Gayana M N Module 2 – Word Level Analysis 8


Regular Expressions
▪ The simplest kind of regular expression is a sequence of simple characters;
putting characters in sequence is called concatenation.

▪ Regular expressions are case sensitive

Gayana M N Module 2 – Word Level Analysis 9


Regular Expressions
▪ Characters are grouped by putting between [ ], disjunction of characters to
match.
▪ Example: /[0123456789]/ => Any single digit.
/[0-9]/ => Any one digit from 0 to 9
/[m-p]/ => Any one letter m,n,o or p
/[^x]/ =>Single character except x

Gayana M N Module 2 – Word Level Analysis 1


0
Regular Expressions
▪ The brackets can be used with the dash (-) to specify any one character in a
range.

Gayana M N Module 2 – Word Level Analysis 1


1
Regular Expressions
▪ The square braces can also be used to specify what a single character
cannot be, by use of the caret ^ .
▪ /[a]/ matches any single character (including special characters) except a.
▪ This is only true when the caret is the first symbol after the open square
brace.

Gayana M N Module 2 – Word Level Analysis 1


2
Regular Expressions
▪ we use the question mark / ? /, which means “zero or one instances of the
previous character”

▪ * => Specifies zero or more occurrences of a preceding character or RE.


▪ * is called as Kleene *
▪ /b*/ =>Match any string containing zero or more occurrences of b. such as
‘b’, ’bb’ or ‘ bbb’ etc.
▪ /[ab]*/ => zero or more occurrence of “a”s or “b”s. This will match
strings: ‘aa’ ,’bb’ or ‘abab’
Gayana M N Module 2 – Word Level Analysis 1
3
Regular Expressions
▪ + => Specifies one or more occurrences of a preceding character
▪ + is called as Kleene +.
▪ /a+/ => one or more occurrences of ‘a’.
▪ (/./), a wildcard expression that matches any single character (except a
carriage return)

Gayana M N Module 2 – Word Level Analysis 1


4
Regular Expressions
▪ + => Specifies one or more occurrences of a preceding character
▪ + is called as Kleene +.
▪ /a+/ => one or more occurrences of ‘a’.
▪ (/./), a wildcard expression that matches any single character (except a
carriage return)

/aardvark.*aardvark/

Gayana M N Module 2 – Word Level Analysis 1


5
Regular Expressions
▪ Anchors are special characters.

▪ the caret ^ has three uses:


▪ to match the start of a line,
▪ to indicate a negation inside of square brackets, and
▪ just to mean a caret.
/The dog\.$/ matches a line that contains only the phrase The dog.
Gayana M N Module 2 – Word Level Analysis 1
6
Regular Expressions
▪ disjunction operator, also called the pipe symbol |
▪ /cat|dog/ matches either the string cat or the string dog.
▪ /gupp(y|ies)/ ?
▪ Operator Pecedence

Gayana M N Module 2 – Word Level Analysis 1


7
Regular Expressions
▪ disjunction operator, also called the pipe symbol |
▪ /cat|dog/ matches either the string cat or the string dog.

Gayana M N Module 2 – Word Level Analysis 1


8
Regular Expressions

Gayana M N Module 2 – Word Level Analysis 1


9
Regular Expressions
▪ Example: To check if string is email address or not
▪ An email address consist of a non-empty sequence of characters
followed by the ‘@ ’ symbol, followed by another non-empty sequence
of characters ending with pattern like .xx , .xxx , .xxxx etc.
▪ The regular expression for an email address is :
^[A-Za-z0-9_\.-]+ @[A-Za-z0-9_\.-]+ [A-Za-z0-9_ ][A-Za-z0-9_ ]$
• Above example works for most of the cases. But it is may not be accurate
enough to match all correct addresses.
• It may accept non-working email addresses. So fine tuning is required for
accurate characterization.

Gayana M N Module 2 – Word Level Analysis 2


0
MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging

Gayana M N Module 2 – Word Level Analysis 2


Finite State Machines
▪ Consider the game of playing board ; dice are thrown .
▪ All possible positions of the pieces on the board => states.
▪ State in which the game begins =>initial state
▪ State corresponding to the winning position =>final state
▪ Example: Machine with input, processor, memory and output device Here,
machine starts in initial state, Checks input goes to next state, If process
perfectly then reaches final state & terminates.
▪ If machine stucks in between,then non-final state-reject input

Gayana M N Module 2 – Word Level Analysis 2


2
Finite State Machines
▪ In ‘Finite Automaton’
▪ ‘Finite’ refers number of states and the alphabet of input symbols is finite.
▪ ‘Automaton’ refers machine moves automatically. I.e. change of state is
completely governed by the input.=>deterministic
▪ Properties of finite automaton:
▪ A finite set of states, one is initial/ start state , and one more is final
state.
▪ A finite Alphabet set, ⅀ , consisting of input symbols
▪ A finite set of transitions that specify each state and each symbol of
input alphabet

Gayana M N Module 2 – Word Level Analysis 2


3
Finite State Machines
▪ Deterministic Finite state Automaton (DFA): Exactly one transition leading
out of a state is possible for the different input symbol
▪ Example: Suppose ⅀ ={ a,b } , the set of states ={q0,q1,q2,q3,q4} with q0
being the start state and q4 the final state, we have the following rules of
transition:

Gayana M N Module 2 – Word Level Analysis 2


4
Finite State Machines
▪ Deterministic finite automaton (DFA) can be defined as 5 tuple
(Q,⅀, δ, S, F) below:
▪ Where, Q is set of states,
▪ ⅀ is an alphabet,
▪ S is the start state,
▪ F⊆Q is a set of final states ,
▪ δ is a transition function.
▪ ‘Finite Automaton’ used in wide variety of areas: linguistics, electrical
engineering, computer science, mathematics and logic
▪ These are important tool in computational linguistics and used in
mathematical device to implement regular expressions.
Gayana M N Module 2 – Word Level Analysis 2
5
Finite State Machines
▪ Non-Deterministic Finite Automaton (NFA):maps Q X ( ⅀ U {ɛ}) to a
subset of the power set of Q. i.e. for each state,there can be more than one
transition on a given symbol, each leading to a different state.
▪ Q X ( ⅀ U {ɛ}) , means NFA can make use of ɛ, to do state transition,if no
symbols(a,b,..) are defined.
▪ In NFA, More than one transition out of a state is possible for the same
input symbol.

Gayana M N Module 2 – Word Level Analysis 2


6
Finite State Machines
▪ Listing all the state transitions is inconvenient as it quite long so we have to
represent automaton as => State-transition table.
▪ In transition table, Rows => states , Columns=> input (symbol) , ɸ =>
Missing transition

Gayana M N Module 2 – Word Level Analysis 2


7
Finite State Machines
▪ Example: Language consisting of all strings containing a’s and b’s and
ending with baa.
▪ This language can be specified by regular expression /(a|b)*baa$/
▪ NFA and State-transition table can be represented as below:

Gayana M N Module 2 – Word Level Analysis 2


8
Finite State Machines
▪ Two automata that define the same language are said to be equivalent.
▪ NFA can be converted to an equivalent DFA and vice versa
▪ Example: Language consisting of all strings containing as and bs and
ending with baa .Regular expression is : /(a|b)*baa$/
▪ The Equivalent DFA for the NFA shown as below:

Gayana M N Module 2 – Word Level Analysis 2


9
MODULE 2
• Introduction
• Regular Expressions
• Finite State Automata
• Morphological Parsing
• Spelling Error Detection and Correction
• Words and Word Classes
• Part of-Speech Tagging

Gayana M N Module 2 – Word Level Analysis 2


Morphological Parsing

▪ Sub-discipline of linguistics.
▪ Studies word structure and the formation of words from smaller units

(morphemes).

Gayana M N Module 2 – Word Level Analysis 31


Morphological Parsing

Morphemes
▪ Morphemes are the smallest meaningful constituents of words.
▪ Words are composed of morphemes (one or more).
▪ bread – consists of single morpheme
▪ eggs – consist of two: eggand -s
▪ sing-er-s, home-work, un-kind-ly, flipp-ed
▪ de-nation-al-iz-ation
▪ auto-servis-u (voluntary service)
▪ Plural morpheme: cat-s, dog-s, judg-es
▪ Opposite: un-happy, in-comprehensive, im-possible, ir-rational

Gayana M N Module 2 – Word Level Analysis 32


Morphological Parsing
Parsing
▪ Taking a surface input and analyzing its components and underlying
structure.
Morphological parsing
▪ To discover the morphemes that build a given word.
▪ Taking a word as input and identifying the stems and affixes (and
possibly interpreting these).
Example:
goose goose +N +SG or goose + V
geese goose +N +PL
gooses goose +V +3SG
Gayana M N Module 2 – Word Level Analysis 33
Morphological Parsing
Why Parse words ?
▪ To find stems.
▪ Simple key to word similarity
▪ yellow, yellowish, yellows, yellowed, yellowing…
▪ egg, eggs
▪ To find affixes and the information they convey.
▪ ‘ed’signals a verb
▪ ‘ish’an adjective
▪ ‘s’?
▪ Morphological parsing provides information about a word’s
semantics and the syntactic role it plays in a sentence.
Gayana M N Module 2 – Word Level Analysis 34
Morphological Parsing

Some Practical Applications ….


▪ For spell checking
▪ Is muncheable a legal word ?
▪ To identify a word’s part-of-speech (POS)
▪ Sentence Parsing, Machine Translation, …
▪ To identify a word’s stem
▪ Information Retrieval
▪ Why not just list all word forms in a lexicon?

Gayana M N Module 2 – Word Level Analysis 35


Morphological Parsing
Classes of morphemes
▪ Stems
▪ Affixes
Stems
▪ The main morpheme in a word.
▪ Contains the central meaning.
▪ A simple key to word similarity.

Affixes
▪ Modify the meaning given by the stem.

Gayana M N Module 2 – Word Level Analysis 36


Morphological Parsing
Classes of morphemes
Affixes
▪ Modify the meaning given by the stem.
▪ Divided into:
▪ Prefix – morphemes which appear before a stem
▪ Suffix - morphemes applied to the end of the stem
▪ Infix - morphemes that appear inside a stem
▪ Circumfix - morphemes applied to either end of the stem
▪ Prefixes and suffixes are quite common in Urdu, Hindi and English.

Gayana M N Module 2 – Word Level Analysis 37


Morphological Parsing

Classes of morphemes
Prefix - morphemes which appear before a stem
▪ Unhappy
▪ bewaqt (Urdu)
▪ अप मान, सं सार उप वन (Hindi),
▪ ಅತೃಪ್ತಿ (Atripti) (Kannada)
▪ ….

Gayana M N Module 2 – Word Level Analysis 38


Morphological Parsing
Classes of morphemes
Suffix - morphemes applied to the end of the stem.
▪ trees, birds
▪ ಮರಗಳು ,(upayōgisu)
▪ शीतलता, ज़Ǿरतमद

Gayana M N Module 2 – Word Level Analysis 39


Morphological Parsing
Classes of morphemes
Infix - morphemes that appear inside the stem.

▪ Common in Austronesian and Austroasiatic languages (Tagalog-


-philippines, Khmer-cambodia)
▪ Tagalog:
▪ Kayu- ‘wood’=> kayu-in- => ‘kinayu’(gathered wood)
▪ basa - ‘read’ = > b·um·asa - ‘readpast’
▪ sulat - ‘write’ = > s·um·ulat - ‘wrote’

▪ Very rare in English:


abso·bloody·lutely (emphatic or humorous form of absolutely)

Gayana M N Module 2 – Word Level Analysis 40


Morphological Parsing
Classes of morphemes
Circumfix - morphemes applied to either end of the stem.
Example:
in-correct-ly
im-matur-ity
un-bear-able
re-cover-ed

Suffixes are more common than prefixes which are more common than infixes
and circumfixes.

Gayana M N Module 2 – Word Level Analysis 41


Morphological Parsing

Word Formation
3 main ways for word formation:
▪ Inflection
▪ Derivation
▪ Compounding

Gayana M N Module 2 – Word Level Analysis 42


Morphological Parsing

Inflection
▪ Root wordis combined with a grammaticalmorphemeto yield a word of
the sameclass.
▪ bring, brings, brought, bringing

Gayana M N Module 2 – Word Level Analysis 43


Morphological Parsing

Derivation
▪ Word stem is combined with a grammatical morpheme to yield a word
belonging to different class.
▪ compute(verb)=> computation(noun)
▪ Nominalization (noun from verb/adjective)
▪ logic, logical, illogical, illogicality, logician

Gayana M N Module 2 – Word Level Analysis 44


Morphological Parsing

Compounding
▪ Merging two or more words to form a new word
▪ N + N → N: rain-bow
▪ V + N → V: pick-pocket
▪ P + V → V: over-do
▪ N + P → N: desk-top
▪ P + V → V: over-look
▪ Adj + Adj → Adj: bitter-sweet

Gayana M N Module 2 – Word Level Analysis 45


Morphological Parsing

Why Morphological Analysis


▪ New words are continually forming a NL
▪ Morphologically related to known words
▪ Understanding morphology = > understand syntactic and semantic
properties of new words.
Example:
▪ Agreement features of words (In parsing)
▪ Identify the presence of a query word in a document inspite of
morphological variants (IR).

Gayana M N Module 2 – Word Level Analysis 46


Morphological Parsing

Why Morphological Analysis


▪ Takes as input the inflected surface form of each word in text
▪ Output the parsed form consisting of canonical form (lemma) of the
words and a set of tags showing its syntactic category and morphological
characteristics.

Example:
▪ POS/inflectional properties (gender, number, person, tense, …)

Gayana M N Module 2 – Word Level Analysis 47


Morphological Parsing

What do we need to build a morphological parser?


A morphological parser uses the following information sources:
▪ Lexicon: list of stems and affixes with basic information about them
(w/ corresponding p.o.s.)
▪ Morphotactics of the language: model of how and which morphemes
can be affixed to a stem
▪ rest-less-ness vs rest-ness-less
▪ Orthographic rules: spelling modifications that may occur when
affixation occurs
▪ easy-> easier y->ier ( not ‘easyer’)
Gayana M N Module 2 – Word Level Analysis 48
Morphological Parsing

Exhaustive Lexicon with feature of all the roots

Gayana M N Module 2 – Word Level Analysis 49


Morphological Parsing

Exhaustive Lexicon with feature of all the roots

Word form Category Root Gender Number Person

Ghodhaa noun GhoDaa masculine singular 3rd

Ghodhii -do- -do- feminine -do- -do-

Ghodhon -do- -do- masculine plural -do-

Ghodhe -do- -do- -do- -do- -do-

Gayana M N Module 2 – Word Level Analysis 50


Morphological Parsing
Limitations of exhaustive lexicon with features
▪ Puts heavy demand on memory.
▪ List every form of the word=>large number of, redundant entries.
▪ Fails to capture linguistic generalization.
▪ Fails to show the relationship between different roots having similar
word forms.
▪ Essential to develop a system capable of understanding unknown
words.
▪ For morphologically complex languages, number of possible word-forms
may be theoretically infinite.
▪ Not practical to list all possible word-forms for such languages.

Gayana M N Module 2 – Word Level Analysis 51


Morphological Parsing

Morphological Systems - Stemmers


▪ Simplest morphological system.
▪ Collapses morphological variations of a given word (word-forms) to
one lemma or stem.
▪ Do not require a lexicon.
▪ Specifically used in Information Retrieval (IR)
▪ Popular stemmers
▪ Lovins (1968)
▪ Porter (1980)

Gayana M N Module 2 – Word Level Analysis 52


Morphological Parsing

Morphological Systems - Stemmers


▪ Do not require a lexicon.
▪ Use rewrite rules of the form:
▪ ier → y (eg., earlier →early)
▪ ing→ ε (eg., playing→ play)

Stemming algorithms work in 2 steps:


1. Suffix removal: Removes predefined endings from words.
2. Recoding: Adds predefined endings to the output of the step 1.
The two steps can be performed:
1. Sequentially (Lovins’s)
2. Simultaneously (Porter’s)
Gayana M N Module 2 – Word Level Analysis 53
Morphological Parsing

Morphological Systems - Stemmers


Porter Stemmer(1980)
▪ Used for tasks in which you only care about the stem.
▪ IR, topic detection, document similarity,….
▪ Lexicon-free morphological analysis.
▪ Cascades rewrite rules
▪ e.g. misunderstanding → misunderstand → understand …
▪ Easily implemented as an FST with rules
▪ ier → y (eg., earlier →early)
▪ ing → ε (eg., playing → play)
▪ Not perfect ….
▪ does → doe
Gayana M N Module 2 – Word Level Analysis 54
Morphological Parsing

Morphological Systems - Stemmers


Porter Stemmer(1980)
▪ Porters algorithm makes use of following transformational rule for the word
: “rotational” ---> “rotate”
ational ---> ate
▪ But transformation of the word ‘organization’ ---> ‘organ’ and ‘noise’ --->
‘noisy’ are not perfect
▪ It reduces only suffixes; prefixes and compounds are not reduced

Gayana M N Module 2 – Word Level Analysis 55


Morphological Parsing
Two-level Morphological Model
▪ First proposed by Kimmo Koskenniemi (1983)
▪ Applicable to highly inflected languages.
▪ Word is represented as a correspondence between its lexical form and its
surface level form.
▪ Surface form → Actual spelling of the word
▪ Lexical form → Concatenation of its constituent morphemes with
morphological features.
▪ Morphological parsing -> Mapping from the surface level into
morpheme and feature sequences on the lexical level.

Gayana M N Module 2 – Word Level Analysis 56


Morphological Parsing
Two-level Morphological Model

▪ Surface form → Actual spelling of the word


▪ Lexical form → Concatenation of its constituent morphemes with
morphological features.
Surface Level p l a y i n g
Lexical p l a y +V +PP

Surface Level b o o k s
Lexical b o o k +N PL

Gayana M N Module 2 – Word Level Analysis 57


Morphological Parsing
Two-level Morphological Model

▪ First component is the stem, cat.


▪ Second component is the morphological information, that specifies the
surface level form is a plural noun (+N+PL).

Surface Level c a t s
Lexical c a t +N +PL

Gayana M N Module 2 – Word Level Analysis 58


Morphological Parsing

Parsing with FSTs

cats → cat + N + PL

▪ Word can be represented as correspondence between lexical level (its


morphemes) and surface level (its orthography).
▪ Morphological parsing: finds the mapping (transduction) between lexical
and surface levels.
Surface Level c a t s
Lexical c a t +N +PL

Gayana M N Module 2 – Word Level Analysis 59


Morphological Parsing
Finite State Transducers (FSTs)
▪ FSTs map between one set of symbols and another using a FSA, whose
alphabet Σ is composed of pairs of symbols from input and output
alphabets.

▪ FST is a 2-state automaton that:


▪ Recognizes
▪ Generates a pair of strings

Gayana M N Module 2 – Word Level Analysis 60


Morphological Parsing
Finite State Transducers (FSTs)
▪ FST is a 6-tuple {Σ 1, Σ 2, Q, δ, S, F} consisting of
▪ Q: set of states.
▪ Σ: an alphabet of complex symbols, each an i:o pair s.t. i Σ1 (an
input alphabet) and o Σ2.
▪ Σ1: the alphabet of input
▪ Σ2: the alphabet of output
▪ S: a start state
▪ F: a set of final states in Q; F ϵ Q
▪ δ: a transition function mapping; Q x (Σ1 U {ε}) x (Σ2 U {ε}) to
power set of Q.
hot → cot
Gayana M N Module 2 – Word Level Analysis 61
Morphological Parsing
Finite State Transducers (FSTs)

h:c q1 o:o
t:t
q0 q3 q4

c:b q2 a:a

Figure: A simple transducer that accepts two input strings, hotand cat,and
maps them onto cot and bat respectively.

Gayana M N Module 2 – Word Level Analysis 62


Morphological Parsing

Finite State Transducers (FSTs)

▪ FSTs encode regular relations.


▪ Regular relation is the relation between regular languages.
▪ Regular language encoded on the upper side of an FST is called upper
language.
▪ Regular language encoded on the lower side is termed as lower
language.

Gayana M N Module 2 – Word Level Analysis 63


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Step2: Maps
Surface
Step1: Split word
Intermediate morphemes to stem
into possible Lexical form
form form and morphological
morphemes
features

Figure: Two-step morphological parser.

Gayana M N Module 2 – Word Level Analysis 64


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Step2: Maps
Surface
Step1: Split word
Intermediate morphemes to stem
into possible Lexical form
form form and morphological
morphemes
features
bird bird bird + N +SG
birds bird + s bird + N +PL
goose goose goose + N + SG
geese geese geese + N + PL
boxes box + s box + N + PL

Figure: Two-step morphological parser.

Gayana M N Module 2 – Word Level Analysis 65


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.

Gayana M N Module 2 – Word Level Analysis 66


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s

Gayana M N Module 2 – Word Level Analysis 67


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s

▪ Output is concatenation of morphemes. i.e., Stems + Affixes.


▪ + indicates morpheme boundaries.

Gayana M N Module 2 – Word Level Analysis 68


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s

▪ Output is concatenation of morphemes. i.e., Stems + Affixes.


▪ + indicates morpheme boundaries.

▪ Also, consider spellingrules.

Gayana M N Module 2 – Word Level Analysis 69


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
1. Split the words up into its possible components.
cats → cat + s

▪ Output is concatenation of morphemes. i.e., Stems + Affixes.


▪ + indicates morpheme boundaries.

▪ Also, consider spellingrules.


foxes → foxe+ s foxeis stem, s the suffix.

Gayana M N Module 2 – Word Level Analysis 70


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than one representation for a given word.

Gayana M N Module 2 – Word Level Analysis 71


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than onerepresentation for a given word.
+ADJ +Comp
l e s s ε
l e s s ε e r
Figure: A simple FST.

Gayana M N Module 2 – Word Level Analysis 72


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than onerepresentation for a given word.
+ADJ +Comp
l e s s ε
l e s s ε e r
Figure: A simple FST.
▪ A transducer that does the mapping (translation) required by the first step for
the surface form ‘lesser’ represents information that:
▪ the comparative form of the adjective ‘less’ is ‘lesser’,
▪ εis the empty string.

Gayana M N Module 2 – Word Level Analysis 73


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
▪ There can be more than onerepresentation for a given word.
+ADJ +Comp
l e s s ε
l e s s ε e r
Figure: A simple FST.
▪ A transducer that does the mapping (translation) required by the first step for
the surface form ‘lesser’ represents information that:
▪ the comparative form of the adjective ‘less’ is ‘lesser’,
▪ εis the empty string.
▪ The automaton is inherently bi-directional.
▪ Same transducer can be used for analysis (surface input, ‘upward
application’) or for generation (lexical input, ‘downward’ application).
Gayana M N Module 2 – Word Level Analysis 74
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of thestems and meaningof the affixes.

Gayana M N Module 2 – Word Level Analysis 75


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of thestems and meaningof the affixes.
cats → cat + s

Gayana M N Module 2 – Word Level Analysis 76


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of thestems and meaningof the affixes.
cats → cat + s → cat +N +PL

Gayana M N Module 2 – Word Level Analysis 77


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL

Gayana M N Module 2 – Word Level Analysis 78


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL

foxes → fox + s

Gayana M N Module 2 – Word Level Analysis 79


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL

foxes → fox + s → fox +N +PL

Gayana M N Module 2 – Word Level Analysis 80


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL

foxes → fox + s → fox +N +PL

▪ Learn that ‘foxe’is not a legal stem.


▪ foxe + s is incorrect way of splitting; So, discard foxes.

Gayana M N Module 2 – Word Level Analysis 81


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL

foxes → fox + s → fox +N +PL

▪ Learn that ‘foxe’is not a legal stem.


▪ foxe + s is incorrect way of splitting; So, discard foxes.

▪ However, spouses → spouse + s → spouse +N +PL is correct.

Gayana M N Module 2 – Word Level Analysis 82


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL

foxes → fox + s → fox +N +PL

▪ Learn that ‘foxe’is not a legal stem.


▪ foxe + s is incorrect way of splitting; So, discard foxes.

▪ However, spouses → spouse + s → spouse +N +PL is correct.


▪ So also, parses → parse+ s → parse+N +PL
Gayana M N Module 2 – Word Level Analysis 83
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL
Orthographic rules
foxes → fox + s → fox +N +PL are used to handle
spelling variations
▪ Learn that ‘foxe’is not a legal stem.
▪ foxe + s is incorrect way of splitting; So, discard foxes.

▪ However, spouses → spouse + s → spouse +N +PL is correct.


▪ So also, parses → parse+ s → parse+N +PL
Gayana M N Module 2 – Word Level Analysis 84
Morphological Parsing
Implementing Two-level Morphological Model using FSTs
Involves two steps:
2. Use a lexicon to look up categories of the stems and meaningof the affixes.
cats → cat + s → cat +N +PL Spelling Rules:
• Add ‘e’after –s, -z, -x, -ch, -sh
foxes → fox + s → fox +N +PL before the ‘s’
• dish → dishes
• box →boxes
▪ Learn that ‘foxe’is not a legal stem.
▪ foxe + s is incorrect way of splitting; So, discard foxes.

▪ However, spouses → spouse + s → spouse +N +PL is correct.


▪ So also, parses → parse+ s → parse+N +PL
Gayana M N Module 2 – Word Level Analysis 85
Morphological Parsing

Implementing Two-level Morphological Model using FSTs

Implementation of the two steps with transducers requires building two


transducers:
▪ One that maps the surface form to the intermediate form.
▪ Another that maps the intermediate form to the lexical form.

Gayana M N Module 2 – Word Level Analysis 86


Morphological Parsing
Implementing Two-level Morphological Model using FSTs
FST-based morphological parser for singular and plural nouns in English.
Considerations:
▪ Plural form of regular nouns usually end with –s or –es.
▪ However, a word ending in ‘s’neednot necessarily be the plural form of a word.
▪ There are several singular words ending in ‘s’ (e.g., miss and ass).
▪ One of the required translations is the deletion of the ‘e’,when introducing a
morpheme boundary.
▪ Required for words ending in -xes, -ses, -zes (e.g., boxes and suffixes).

Gayana M N Module 2 – Word Level Analysis 87


Morphological Parsing
Implementing Two-level Morphological Model using FSTs: STEP1
FST-based morphological parser for singular and plural nouns in English

Figure: A simplified FST, mapping English nouns to the intermediate form.


Gayana M N Module 2 – Word Level Analysis 88
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP1
▪ Sequences of states that the transducer undergoes, given the surface
forms birds as input.

s
2 6 Output: bird + s
Surface level s
ε +
b i r d s
1 1 1 1 1 1 Output: birds
b i r d s

Lexical level

Figure: Possible sequences of states, Input: birds

Gayana M N Module 2 – Word Level Analysis 89


Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP1
▪ Sequences of states that the transducer undergoes, given the surface form
boxes as input.
ε s
1 + 2 s 6 Output: boxe + s
e s
e s
4 Output: boxes
b o x
1 1 1 3
b o x e
ε ε s
Output: box + s
4 5 6
+ s

Figure: Possible sequences of states, Input: boxes

Gayana M N Module 2 – Word Level Analysis 90


Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping from the intermediate level to the
lexical level.
▪ The input to the transducer has one of the following forms:
1. Regular noun stem, e.g., bird, cat

2. Regular noun stem +s, e.g., bird + s

3. Singular irregular noun stem, e.g., goose

4. Plural irregular noun stem, e.g., geese

Gayana M N Module 2 – Word Level Analysis 91


Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping from the intermediate level to the
lexical level.
▪ The input to the transducer has one of the following forms:
1. Regular noun stem, e.g., bird, cat → Map all symbols of the stem to themselves and
then output N and sg.
2. Regular noun stem +s, e.g., bird + s

3. Singular irregular noun stem, e.g., goose

4. Plural irregular noun stem, e.g., geese

Gayana M N Module 2 – Word Level Analysis 92


Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping from the intermediate level to the
lexical level.
▪ The input to the transducer has one of the following forms:
1. Regular noun stem, e.g., bird, cat → Map all symbols of the stem to themselves and
then output N and sg.
2. Regular noun stem +s, e.g., bird + s → Map all symbols of the stem to themselves;
but then output N and replace PL with s.

3. Singular irregular noun stem, e.g., goose

4. Plural irregular noun stem, e.g., geese

Gayana M N Module 2 – Word Level Analysis 93


Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping from the intermediate level to the
lexical level.
▪ The input to the transducer has one of the following forms:
1. Regular noun stem, e.g., bird, cat → Map all symbols of the stem to themselves and
then output N and sg.
2. Regular noun stem +s, e.g., bird + s → Map all symbols of the stem to themselves;
but then output N and replace PL with s.

3. Singular irregular noun stem, e.g., goose→ Same as in first case.

4. Plural irregular noun stem, e.g., geese

Gayana M N Module 2 – Word Level Analysis 94


Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping from the intermediate level to the
lexical level.
▪ The input to the transducer has one of the following forms:
1. Regular noun stem, e.g., bird, cat → Map all symbols of the stem to themselves
and then output N and sg.
2. Regular noun stem +s, e.g., bird + s → Map all symbols of the stem to
themselves; but then output N and replace PL with s.

3. Singular irregular noun stem, e.g., goose→ Same as in first case.

4. Plural irregular noun stem, e.g., geese→ Map irregular plural noun stem to the
corresponding singular stem and add N and PL.
Gayana M N Module 2 – Word Level Analysis 95
Morphological Parsing

Implementing Two-level Morphological Model using FSTs : STEP2

Figure: Stem Mappings


Gayana M N Module 2 – Word Level Analysis 96
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2
▪ Develop a transducer that does the mapping fromthe intermediatelevelto the
lexical level.

Figure: Transducer for Step 2


Gayana M N Module 2 – Word Level Analysis 97
Morphological Parsing
Implementing Two-level Morphological Model using FSTs : STEP2

b i r d ε:N + :ε

ε:sg s:PL
g o o s e ε:N ε:sg

ε:sg
g e:o e:o s e ε:N

Figure: A transducer mapping nouns to their stem and morphological features

Gayana M N Module 2 – Word Level Analysis 98


Morphological Parsing
Summary
▪ Simplest morphological systems are stemmers.
▪ Do not use lexicon, instead use re-write rules.
▪ Not perfect.
▪ FSTs provide a useful tool for implementing a standard model
of morphological analysis.
▪ Koskenniemi’s two-level morphological model.
▪ But for many tasks (e.g. IR) much simpler approaches are still widely
used,
e.g. the rule-based Porter Stemmer.
Gayana M N Module 2 – Word Level Analysis 99

You might also like