1.5) Describe English morphology.
English morphology is the study of the structure of words and how they are formed. It deals with the
smallest units of meaning in language, called morphemes. Morphemes can be divided into two main
types:
1. Free morphemes: These are morphemes that can stand alone as words. They do not need to
be attached to other morphemes to have meaning. For example:
o "book"
o "run"
o "cat"
2. Bound morphemes: These are morphemes that cannot stand alone and must be attached to
a free morpheme to convey meaning. Bound morphemes include:
o Prefixes: morphemes added to the beginning of a word (e.g., un- in "undo").
o Suffixes: morphemes added to the end of a word (e.g., -ed in "walked").
o Infixes and circumfixes: rare in English; the clearest case is expletive infixation in
informal speech (e.g., -bloody- in "abso-bloody-lutely").
Key Concepts in English Morphology:
1. Derivational morphemes: These morphemes are used to create new words or to change the
grammatical category of a word. For example:
o "happy" (adjective) + "-ness" = "happiness" (noun).
o "run" (verb) + "-er" = "runner" (noun).
2. Inflectional morphemes: These morphemes do not change the grammatical category of a
word but provide additional grammatical information, such as tense, number, or possession.
English has a limited set of inflectional morphemes:
o Plural: "cat" → "cats" (adding -s).
o Past tense: "walk" → "walked" (adding -ed).
o Possession: "dog" → "dog’s" (adding -s with an apostrophe).
3. Allomorphs: These are variations of a morpheme that occur in different contexts. For
instance:
o The plural morpheme -s has three regular allomorphs (see the sketch after this list):
/-ɪz/ after sibilant sounds (e.g., "boxes").
/-s/ after other voiceless sounds (e.g., "cats").
/-z/ after other voiced sounds, including vowels (e.g., "dogs").
4. Compounding: English allows the combination of two or more free morphemes to form a
new word. For example:
o "tooth" + "brush" = "toothbrush."
o "sun" + "flower" = "sunflower."
1.6) Transducers for Lexicon and Rules
Transducers, in the context of lexicons and rules, operate by reading an input string (e.g., a word or
morpheme sequence) and producing an output based on both the word's lexical information and the
applied transformation rules.
Lexicon
A lexicon is a collection of words and their properties. In computational linguistics, the lexicon stores:
o The base forms (lemmas) of words.
o Information about morphological properties such as tense, number, or case.
o Syntactic and semantic properties of words (such as part of speech and meaning).
o Allomorphs, the different realizations of a morpheme (for example, the plural
morpheme is realized differently in "cats" and "dogs").
A lexical transducer maps a word or morpheme from the lexicon to its various forms or
representations based on certain grammatical rules. For instance, if the input is "run," the transducer
may generate forms like "running," "runs," "ran," etc., depending on the rules of tense or aspect.
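As a rough illustration of what such a lexicon entry could look like, the Python sketch below stores the lemma "run" with a few of its properties and inflected forms. The field names (lemma, pos, features, forms) are assumptions for illustration, not a standard from any particular toolkit.

```python
# Minimal sketch of a computational lexicon entry for "run".
lexicon = {
    "run": {
        "lemma": "run",
        "pos": "verb",
        "features": {"transitivity": "both"},
        "forms": {
            "3sg.pres": "runs",
            "pres.part": "running",
            "past": "ran",          # irregular past stored directly
            "past.part": "run",
        },
    }
}

def generate(lemma: str, feature: str) -> str:
    """Look up a stored form; a lexical transducer would compute this instead."""
    return lexicon[lemma]["forms"][feature]

print(generate("run", "pres.part"))  # running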
Rules
In morphology and syntax, rules are used to transform one form of a word into another. For example,
rules specify how to pluralize a noun, conjugate a verb, or change the case of a noun. These rules can
be:
o Inflectional rules: changing forms based on tense, number, gender, etc.
o Derivational rules: creating new words of a different part of speech (e.g.,
"happy" to "happiness").
The rules govern how morphemes interact, how words combine, and how different forms are
generated from the lexicon.
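The sketch below illustrates the distinction with two toy Python rules, one inflectional (the regular -ed past tense) and one derivational (-ness nominalization). The spelling adjustments are simplified assumptions, not complete English orthographic rules.

```python
# Minimal sketch contrasting an inflectional rule and a derivational rule.

def past_tense(verb: str) -> str:
    """Inflectional rule: regular past tense in -ed (category stays verb)."""
    if verb.endswith("e"):
        return verb + "d"        # "love" -> "loved"
    return verb + "ed"           # "walk" -> "walked"

def nominalize(adjective: str) -> str:
    """Derivational rule: adjective -> noun with -ness ("happy" -> "happiness")."""
    if adjective.endswith("y"):
        return adjective[:-1] + "iness"
    return adjective + "ness"

print(past_tense("walk"))     # walked
print(nominalize("happy"))    # happiness
```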
Types of Transducers for Lexicon and Rules
1. Finite State Transducers (FSTs)
o A common formal model used in computational linguistics to represent lexicons and
morphological rules is the Finite State Transducer (FST). An FST is a type of finite-
state automaton that processes strings (sequences of symbols) and produces output
based on its state transitions.
o FSTs are especially useful for:
Morphological analysis: Mapping between surface forms (actual words) and
base forms (lemmas).
Morphological generation: Creating different word forms from a base form.
o An FST is a single state machine whose transitions are labeled with input/output
symbol pairs (it reads one tape and writes another). It processes the input string
(such as a word) through a series of state transitions while applying the
transformation rules, ultimately producing the output (such as a conjugated or
pluralized form).
Example: For the word "cats," an FST could behave as follows (a toy transducer of this kind is sketched after this list):
o Analysis: input "cats"; the plural rule strips the -s, and the output is the base form
"cat" together with a plural feature.
o Generation: input "cat" plus a plural feature; the same transducer, run in the other
direction, outputs "cats" according to the morphological rule.
2. Transducers for Lexicon Look-Up
o Transducers can also act as lexicon look-up tools where input strings (words) are
matched against the lexicon to retrieve morphological properties, such as the base
form, part of speech, and other grammatical features.
o For example, when the word "running" is processed by a transducer, it may look up
"run" as the lemma and identify "running" as the present participle, applying the
appropriate transformation rule.
3. Two-Level Morphology (Koskenniemi's Model)
o In two-level morphology, every word has a lexical (underlying) representation, such
as "fox+N+PL", and a surface representation, such as "foxes".
o Rules are stated as constraints between these two levels (for example, e-insertion
before the plural -s after a sibilant) and are applied in parallel, each compiled into a
finite-state transducer.
o The composed morphological transducers convert abstract, underlying forms into
the actual surface forms seen in speech or writing, and back again.
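The toy Python sketch below illustrates the idea behind these transducers: a single network of states whose arcs pair surface symbols with lexical output, covering a two-word lexicon. The state names and lexicon are assumptions for illustration; real systems compile far larger networks and, in two-level morphology, add parallel rules such as e-insertion for forms like "foxes".

```python
# Toy finite-state transducer, written out as a dictionary of transitions.
# It analyzes "cat", "cats", "dog", "dogs" into lexical forms such as "cat+N+PL".

# (state, input_symbol) -> (next_state, output_string)
TRANSITIONS = {
    ("start", "c"): ("c", "c"),
    ("c", "a"): ("ca", "a"),
    ("ca", "t"): ("noun", "t"),
    ("start", "d"): ("d", "d"),
    ("d", "o"): ("do", "o"),
    ("do", "g"): ("noun", "g"),
    ("noun", "s"): ("plural", "+N+PL"),   # the -s arc emits the plural tag
}
# Final states and what they emit when the input ends there.
FINALS = {"noun": "+N+SG", "plural": ""}

def analyze(surface: str) -> str:
    """Run the transducer over a surface form and return the lexical form."""
    state, output = "start", []
    for ch in surface:
        if (state, ch) not in TRANSITIONS:
            raise ValueError(f"no transition from {state!r} on {ch!r}")
        state, out = TRANSITIONS[(state, ch)]
        output.append(out)
    if state not in FINALS:
        raise ValueError(f"{surface!r} rejected: stopped in state {state!r}")
    output.append(FINALS[state])
    return "".join(output)

print(analyze("cats"))  # cat+N+PL
print(analyze("dog"))   # dog+N+SG
```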
How Transducers Work in Morphological Processing
Lexicon and Rules in Action:
1. Input word: A word is input to the system (e.g., "walked").
2. Lexicon lookup: The lexicon identifies the base form of the word (e.g., "walk") and
supplies its stored properties (e.g., verb, regular conjugation).
3. Rule application: The morphological rules (e.g., adding -ed for past tense) are
applied, either to generate or analyze the word.
4. Output: The final output is the transformed word, in this case, "walked."
This process allows computational systems to recognize, generate, and analyze words based on both
the lexicon (list of known words) and transformation rules that govern their forms.
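A minimal Python sketch of this lexicon-plus-rules pipeline for the regular verb "walked" is given below; the data structures and the single -ed rule are illustrative assumptions that ignore irregular verbs and spelling changes.

```python
# Minimal sketch of analysis and generation with a lexicon plus one rule.
LEXICON = {"walk": {"pos": "verb"}, "talk": {"pos": "verb"}}

def analyze(word: str):
    """Strip a regular -ed ending and look the stem up in the lexicon."""
    if word.endswith("ed") and word[:-2] in LEXICON:
        return {"lemma": word[:-2], "pos": "verb", "tense": "past"}
    if word in LEXICON:
        return {"lemma": word, "pos": LEXICON[word]["pos"], "tense": "present"}
    return None

def generate(lemma: str, tense: str) -> str:
    """Apply the regular -ed rule to generate a past-tense form."""
    return lemma + "ed" if tense == "past" else lemma

print(analyze("walked"))          # {'lemma': 'walk', 'pos': 'verb', 'tense': 'past'}
print(generate("walk", "past"))   # walked
```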
Tokenization
Tokenization is the process of splitting a stream of text into smaller, meaningful units called tokens.
These tokens can be words, subwords, characters, or even sentences. It is one of the first and most
important steps in many natural language processing (NLP) tasks, as it transforms raw, unstructured
text into structured data that can be processed by machines.
Key Points of Tokenization:
1. Basic Definition: Tokenization breaks down a sequence of text into smaller pieces or units
(tokens) that can be used in further linguistic analysis. These tokens often represent words,
but they could also be punctuation marks, numbers, or other significant symbols, depending
on the application.
Example:
o Raw text: "I love programming!"
o Tokenized form: ["I", "love", "programming", "!"]
2. Types of Tokens:
o Word-level tokens: The most common form of tokenization, where the text is split
into individual words. For example, "I love programming" would be tokenized into
["I", "love", "programming"].
o Subword-level tokens: This is particularly useful for handling rare or out-of-
vocabulary words. Subword tokenization splits words into smaller meaningful units
(e.g., morphemes). For example, "unhappiness" might be tokenized as ["un",
"happiness"].
o Character-level tokens: Text is broken down into individual characters. For example,
"hello" becomes ["h", "e", "l", "l", "o"]. This is often used in tasks like character-level
language modeling or languages with no clear word boundaries (e.g., Chinese).
o Sentence-level tokens: The text is tokenized into individual sentences. For example,
"I love programming. It's fun." becomes ["I love programming.", "It's fun."].
3. Importance of Tokenization in NLP:
o Preprocessing for text analysis: Tokenization is the first step before performing tasks
such as part-of-speech tagging, named entity recognition, or sentiment analysis.
o Enables machine learning models: Tokenized text can be used in machine learning
models like word embeddings (e.g., Word2Vec, GloVe) or neural networks for various
NLP tasks.
o Handles ambiguity: Tokenization helps in resolving ambiguities such as distinguishing
between words and punctuation marks.
o Multilingual support: Tokenization helps in processing different languages by
considering language-specific rules and structures.
4. Challenges in Tokenization:
o Punctuation: Deciding whether to treat punctuation marks as separate tokens or
part of adjacent words.
o Ambiguity in word boundaries: Some languages (e.g., Chinese, Japanese) do not
have spaces between words, making tokenization more complex.
o Handling compound words: In some languages, words are often formed by
combining smaller words (e.g., in German, "Straßenbahn" means "streetcar"), and
these compound words can present challenges for tokenization.
5. Types of Tokenizers:
o Rule-based tokenizers: These use predefined rules to break text into tokens. For
example, they might define that spaces separate words and that punctuation marks
should be treated as individual tokens (a small regex sketch follows this list).
o Statistical or machine learning-based tokenizers: These use models trained on large
corpora of text to learn how to split words or sentences based on context and
statistical patterns.
o Pre-built tokenizers: Libraries like NLTK, spaCy, and Hugging Face's Tokenizers offer
ready-made tokenizers for various tasks, supporting multiple languages and handling
complex cases like contractions and special characters.
6. Tokenization in Different Languages:
o English: Tokenization is relatively simple due to the presence of spaces between
words and punctuation marks. However, issues arise with contractions (e.g., "I'm" →
["I", "'m"]).
o Chinese and Japanese: These languages do not have spaces between words, so
tokenization requires more sophisticated methods, such as word segmentation or
using machine learning models for text segmentation.
o Languages with complex morphology: In languages like Turkish, where words can
have many suffixes attached, subword or morpheme-level tokenization is often used
to break down complex words into smaller meaningful units.
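As referenced in the rule-based tokenizers item above, here is a minimal Python sketch of rule-based word, sentence, and character tokenization using the standard-library re module. The patterns are illustrative assumptions and far less robust than the library tokenizers mentioned above.

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Split into word tokens, contraction suffixes, or single punctuation marks."""
    return re.findall(r"\w+|'\w+|[^\w\s]", text)

def sentence_tokenize(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

text = "I love programming. It's fun!"
print(word_tokenize(text))
# ['I', 'love', 'programming', '.', 'It', "'s", 'fun', '!']
print(sentence_tokenize(text))
# ['I love programming.', "It's fun!"]
print(list("hello"))   # character-level: ['h', 'e', 'l', 'l', 'o']
```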
Example of Tokenization:
Consider the sentence: "Natural language processing is fun!"
1. Word-level tokenization:
o Tokens: ["Natural", "language", "processing", "is", "fun", "!"]
2. Subword-level tokenization (one possible segmentation; actual splits depend on the
tokenizer's vocabulary):
o Tokens: ["Natural", "language", "pro", "cessing", "is", "fun", "!"]
3. Character-level tokenization:
o Tokens: ["N", "a", "t", "u", "r", "a", "l", " ", "l", "a", "n", "g", "u", "a", "g", "e", " ", "p",
"r", "o", "c", "e", "s", "s", "i", "n", "g", " ", "i", "s", " ", "f", "u", "n", "!"]