1.5) Describe English morphology.
English morphology is the study of the structure of words and how they are formed. It deals with the
smallest units of meaning in language, called morphemes. Morphemes can be divided into two main
types:
1. Free morphemes: These are morphemes that can stand alone as words. They do not need to
be attached to other morphemes to have meaning. For example:
o "book"
o "run"
o "cat"
2. Bound morphemes: These are morphemes that cannot stand alone and must be attached to
a free morpheme to convey meaning. Bound morphemes include:
o Prefixes: morphemes added to the beginning of a word (e.g., un- in "undo").
o Suffixes: morphemes added to the end of a word (e.g., -ed in "walked").
o Infixes and circumfixes: rare in English; the clearest case is expletive infixation in
informal speech (e.g., -bloody- in "abso-bloody-lutely").
Key Concepts in English Morphology:
1. Derivational morphemes: These morphemes are used to create new words or to change the
grammatical category of a word. For example:
o "happy" (adjective) + "-ness" = "happiness" (noun).
o "run" (verb) + "-er" = "runner" (noun).
2. Inflectional morphemes: These morphemes do not change the grammatical category of a
word but provide additional grammatical information, such as tense, number, or possession.
English has a limited set of inflectional morphemes:
o Plural: "cat" → "cats" (adding -s).
o Past tense: "walk" → "walked" (adding -ed).
o Possession: "dog" → "dog’s" (adding -s with an apostrophe).
3. Allomorphs: These are variations of a morpheme that occur in different contexts. For
instance:
o The plural morpheme -s has three regular allomorphs (see the sketch after this list):
/-ɪz/ after sibilant sounds (e.g., "boxes").
/-s/ after other voiceless sounds (e.g., "cats").
/-z/ after other voiced sounds, including vowels (e.g., "dogs").
4. Compounding: English allows the combination of two or more free morphemes to form a
new word. For example:
o "tooth" + "brush" = "toothbrush."
o "sun" + "flower" = "sunflower."
1.6) Transducers for Lexicon and Rules
Transducers, in the context of lexicons and rules, operate by reading an input string (e.g., a word or
morpheme sequence) and producing an output based on both the word's lexical information and the
applied transformation rules.
Lexicon
A lexicon is a collection of words and their properties. In computational linguistics, the lexicon stores:
o The base forms (lemmas) of words.
o Information about morphological properties such as tense, number, or case.
o Syntactic and semantic properties of words (such as part of speech and meaning).
o Allomorphs, the different realizations of a morpheme (for example, the plural
morpheme is realized differently in "cats" and "dogs").
A lexical transducer maps a word or morpheme from the lexicon to its various forms or
representations based on certain grammatical rules. For instance, if the input is "run," the transducer
may generate forms like "running," "runs," "ran," etc., depending on the rules of tense or aspect.
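As a rough illustration of what such a lexicon entry could look like, the Python sketch below stores the lemma "run" with a few of its properties and inflected forms. The field names (lemma, pos, features, forms) are assumptions for illustration, not a standard from any particular toolkit.

```python
# Minimal sketch of a computational lexicon entry for "run".
lexicon = {
    "run": {
        "lemma": "run",
        "pos": "verb",
        "features": {"transitivity": "both"},
        "forms": {
            "3sg.pres": "runs",
            "pres.part": "running",
            "past": "ran",          # irregular past stored directly
            "past.part": "run",
        },
    }
}

def generate(lemma: str, feature: str) -> str:
    """Look up a stored form; a lexical transducer would compute this instead."""
    return lexicon[lemma]["forms"][feature]

print(generate("run", "pres.part"))  # running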
Rules
In morphology and syntax, rules are used to transform one form of a word into another. For example,
rules specify how to pluralize a noun, conjugate a verb, or change the case of a noun. These rules can
be:
o Inflectional rules: changing forms based on tense, number, gender, etc.
o Derivational rules: creating new words of a different part of speech (e.g.,
"happy" to "happiness").
The rules govern how morphemes interact, how words combine, and how different forms are
generated from the lexicon.
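The sketch below illustrates the distinction with two toy Python rules, one inflectional (the regular -ed past tense) and one derivational (-ness nominalization). The spelling adjustments are simplified assumptions, not complete English orthographic rules.

```python
# Minimal sketch contrasting an inflectional rule and a derivational rule.

def past_tense(verb: str) -> str:
    """Inflectional rule: regular past tense in -ed (category stays verb)."""
    if verb.endswith("e"):
        return verb + "d"        # "love" -> "loved"
    return verb + "ed"           # "walk" -> "walked"

def nominalize(adjective: str) -> str:
    """Derivational rule: adjective -> noun with -ness ("happy" -> "happiness")."""
    if adjective.endswith("y"):
        return adjective[:-1] + "iness"
    return adjective + "ness"

print(past_tense("walk"))     # walked
print(nominalize("happy"))    # happiness
```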
Types of Transducers for Lexicon and Rules
1. Finite State Transducers (FSTs)
o A common formal model used in computational linguistics to represent lexicons and
morphological rules is the Finite State Transducer (FST). An FST is a type of finite-
state automaton that processes strings (sequences of symbols) and produces output
based on its state transitions.
o FSTs are especially useful for:
Morphological analysis: Mapping between surface forms (actual words) and
base forms (lemmas).
Morphological generation: Creating different word forms from a base form.
o An FST is a single state machine whose transitions are labeled with input/output
symbol pairs (it reads one tape and writes another). It processes the input string
(such as a word) through a series of state transitions while applying the
transformation rules, ultimately producing the output (such as a conjugated or
pluralized form).
Example: For the word "cats," an FST could behave as follows (a toy transducer of this kind is sketched after this list):
o Analysis: input "cats"; the plural rule strips the -s, and the output is the base form
"cat" together with a plural feature.
o Generation: input "cat" plus a plural feature; the same transducer, run in the other
direction, outputs "cats" according to the morphological rule.
2. Transducers for Lexicon Look-Up
o Transducers can also act as lexicon look-up tools where input strings (words) are
matched against the lexicon to retrieve morphological properties, such as the base
form, part of speech, and other grammatical features.
o For example, when the word "running" is processed by a transducer, it may look up
"run" as the lemma and identify "running" as the present participle, applying the
appropriate transformation rule.
3. Two-Level Morphology (Koskenniemi's Model)
o In two-level morphology, every word has a lexical (underlying) representation, such
as "fox+N+PL", and a surface representation, such as "foxes".
o Rules are stated as constraints between these two levels (for example, e-insertion
before the plural -s after a sibilant) and are applied in parallel, each compiled into a
finite-state transducer.
o The composed morphological transducers convert abstract, underlying forms into
the actual surface forms seen in speech or writing, and back again.
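The toy Python sketch below illustrates the idea behind these transducers: a single network of states whose arcs pair surface symbols with lexical output, covering a two-word lexicon. The state names and lexicon are assumptions for illustration; real systems compile far larger networks and, in two-level morphology, add parallel rules such as e-insertion for forms like "foxes".

```python
# Toy finite-state transducer, written out as a dictionary of transitions.
# It analyzes "cat", "cats", "dog", "dogs" into lexical forms such as "cat+N+PL".

# (state, input_symbol) -> (next_state, output_string)
TRANSITIONS = {
    ("start", "c"): ("c", "c"),
    ("c", "a"): ("ca", "a"),
    ("ca", "t"): ("noun", "t"),
    ("start", "d"): ("d", "d"),
    ("d", "o"): ("do", "o"),
    ("do", "g"): ("noun", "g"),
    ("noun", "s"): ("plural", "+N+PL"),   # the -s arc emits the plural tag
}
# Final states and what they emit when the input ends there.
FINALS = {"noun": "+N+SG", "plural": ""}

def analyze(surface: str) -> str:
    """Run the transducer over a surface form and return the lexical form."""
    state, output = "start", []
    for ch in surface:
        if (state, ch) not in TRANSITIONS:
            raise ValueError(f"no transition from {state!r} on {ch!r}")
        state, out = TRANSITIONS[(state, ch)]
        output.append(out)
    if state not in FINALS:
        raise ValueError(f"{surface!r} rejected: stopped in state {state!r}")
    output.append(FINALS[state])
    return "".join(output)

print(analyze("cats"))  # cat+N+PL
print(analyze("dog"))   # dog+N+SG
```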
How Transducers Work in Morphological Processing
Lexicon and Rules in Action:
1. Input word: A word is input to the system (e.g., "walked").
2. Lexicon lookup: The lexicon identifies the base form of the word (e.g., "walk") and
supplies its stored properties (e.g., verb, regular conjugation).
3. Rule application: The morphological rules (e.g., adding -ed for past tense) are
applied, either to generate or analyze the word.
4. Output: The final output is the transformed word, in this case, "walked."
This process allows computational systems to recognize, generate, and analyze words based on both
the lexicon (list of known words) and transformation rules that govern their forms.
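A minimal Python sketch of this lexicon-plus-rules pipeline for the regular verb "walked" is given below; the data structures and the single -ed rule are illustrative assumptions that ignore irregular verbs and spelling changes.

```python
# Minimal sketch of analysis and generation with a lexicon plus one rule.
LEXICON = {"walk": {"pos": "verb"}, "talk": {"pos": "verb"}}

def analyze(word: str):
    """Strip a regular -ed ending and look the stem up in the lexicon."""
    if word.endswith("ed") and word[:-2] in LEXICON:
        return {"lemma": word[:-2], "pos": "verb", "tense": "past"}
    if word in LEXICON:
        return {"lemma": word, "pos": LEXICON[word]["pos"], "tense": "present"}
    return None

def generate(lemma: str, tense: str) -> str:
    """Apply the regular -ed rule to generate a past-tense form."""
    return lemma + "ed" if tense == "past" else lemma

print(analyze("walked"))          # {'lemma': 'walk', 'pos': 'verb', 'tense': 'past'}
print(generate("walk", "past"))   # walked
```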
Tokenization
Tokenization is the process of splitting a stream of text into smaller, meaningful units called tokens.
These tokens can be words, subwords, characters, or even sentences. It is one of the first and most
important steps in many natural language processing (NLP) tasks, as it transforms raw, unstructured
text into structured data that can be processed by machines.
Key Points of Tokenization:
1. Basic Definition: Tokenization breaks down a sequence of text into smaller pieces or units
(tokens) that can be used in further linguistic analysis. These tokens often represent words,
but they could also be punctuation marks, numbers, or other significant symbols, depending
on the application.
Example:
o Raw text: "I love programming!"
o Tokenized form: ["I", "love", "programming", "!"]
2. Types of Tokens:
o Word-level tokens: The most common form of tokenization, where the text is split
into individual words. For example, "I love programming" would be tokenized into
["I", "love", "programming"].
o Subword-level tokens: This is particularly useful for handling rare or out-of-
vocabulary words. Subword tokenization splits words into smaller meaningful units
(e.g., morphemes). For example, "unhappiness" might be tokenized as ["un",
"happiness"].
o Character-level tokens: Text is broken down into individual characters. For example,
"hello" becomes ["h", "e", "l", "l", "o"]. This is often used in tasks like character-level
language modeling or languages with no clear word boundaries (e.g., Chinese).
o Sentence-level tokens: The text is tokenized into individual sentences. For example,
"I love programming. It's fun." becomes ["I love programming.", "It's fun."].
3. Importance of Tokenization in NLP:
o Preprocessing for text analysis: Tokenization is the first step before performing tasks
such as part-of-speech tagging, named entity recognition, or sentiment analysis.
o Enables machine learning models: Tokenized text can be used in machine learning
models like word embeddings (e.g., Word2Vec, GloVe) or neural networks for various
NLP tasks.
o Handles ambiguity: Tokenization helps in resolving ambiguities such as distinguishing
between words and punctuation marks.
o Multilingual support: Tokenization helps in processing different languages by
considering language-specific rules and structures.
4. Challenges in Tokenization:
o Punctuation: Deciding whether to treat punctuation marks as separate tokens or
part of adjacent words.
o Ambiguity in word boundaries: Some languages (e.g., Chinese, Japanese) do not
have spaces between words, making tokenization more complex.
o Handling compound words: In some languages, words are often formed by
combining smaller words (e.g., in German, "Straßenbahn" means "streetcar"), and
these compound words can present challenges for tokenization.
5. Types of Tokenizers:
o Rule-based tokenizers: These use predefined rules to break text into tokens. For
example, they might define that spaces separate words and that punctuation marks
should be treated as individual tokens (a small regex sketch follows this list).
o Statistical or machine learning-based tokenizers: These use models trained on large
corpora of text to learn how to split words or sentences based on context and
statistical patterns.
o Pre-built tokenizers: Libraries like NLTK, spaCy, and Hugging Face's Tokenizers offer
ready-made tokenizers for various tasks, supporting multiple languages and handling
complex cases like contractions and special characters.
6. Tokenization in Different Languages:
o English: Tokenization is relatively simple due to the presence of spaces between
words and punctuation marks. However, issues arise with contractions (e.g., "I'm" →
["I", "'m"]).
o Chinese and Japanese: These languages do not have spaces between words, so
tokenization requires more sophisticated methods, such as word segmentation or
using machine learning models for text segmentation.
o Languages with complex morphology: In languages like Turkish, where words can
have many suffixes attached, subword or morpheme-level tokenization is often used
to break down complex words into smaller meaningful units.
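As referenced in the rule-based tokenizers item above, here is a minimal Python sketch of rule-based word, sentence, and character tokenization using the standard-library re module. The patterns are illustrative assumptions and far less robust than the library tokenizers mentioned above.

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Split into word tokens, contraction suffixes, or single punctuation marks."""
    return re.findall(r"\w+|'\w+|[^\w\s]", text)

def sentence_tokenize(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

text = "I love programming. It's fun!"
print(word_tokenize(text))
# ['I', 'love', 'programming', '.', 'It', "'s", 'fun', '!']
print(sentence_tokenize(text))
# ['I love programming.', "It's fun!"]
print(list("hello"))   # character-level: ['h', 'e', 'l', 'l', 'o']
```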
Example of Tokenization:
Consider the sentence: "Natural language processing is fun!"
1. Word-level tokenization:
o Tokens: ["Natural", "language", "processing", "is", "fun", "!"]
2. Subword-level tokenization (one possible segmentation; actual splits depend on the
tokenizer's vocabulary):
o Tokens: ["Natural", "language", "pro", "cessing", "is", "fun", "!"]
3. Character-level tokenization:
o Tokens: ["N", "a", "t", "u", "r", "a", "l", " ", "l", "a", "n", "g", "u", "a", "g", "e", " ", "p",
"r", "o", "c", "e", "s", "s", "i", "n", "g", " ", "i", "s", " ", "f", "u", "n", "!"]