NLP Unit 1
PREPARED BY GAJJELA.KUSHUBU@GMAIL.COM 1
Unit I
Finding the Structure of Words: Words and Their
Components, Issues and Challenges, Morphological
Models
Finding the Structure of Documents: Introduction,
Methods, Complexity of the Approaches, Performance
of the Approaches, Features.
How humans communicate with each other:
They use natural language, such as Telugu, English, or Hindi.
The listener hears the sentences, interprets them, understands them, and then replies.
This is called two-way communication.
It is possible only when the participants understand each other's language.
Using natural language in this way is part of what makes human behaviour intelligent.
We expect the same from a computer: it should replicate this ability, and for that it uses NLP.
Introduction to Natural Language Processing:
Natural Language Processing (NLP) is a subfield of computer science and
artificial intelligence that focuses on enabling computers to understand,
interpret, and generate human language.
The goal of NLP is to create intelligent systems that can understand and
communicate with humans in natural language.
APPLICATIONS OF NLP:
Speech recognition
Example: voice assistants such as Siri (Apple)
Sentiment analysis
Example: social media (Twitter), e.g., movie feedback, political opinion
Machine translation: translating text from one language to another
Example: Google Translate
Text summarization
Information retrieval
Chatbots
When you ask a question, the machine replies using its underlying database.
Others, such as spell checking
Core Tasks in NLP
Some fundamental tasks include:
Tokenization: Splitting text into words, phrases, or other units.
Part-of-Speech (POS) Tagging: Identifying words as nouns, verbs,
adjectives, etc.
Named Entity Recognition (NER): Detecting proper nouns like names of
people, organizations, or places.
Parsing: Analyzing the grammatical structure of a sentence.
Language Modelling: Predicting the next word or sentence in a sequence.
Text Classification: Assigning categories to text (e.g., spam vs. non-spam).
Machine Translation: Translating text from one language to another.
Challenges in NLP
Ambiguity: Words and sentences can have multiple meanings.
Context understanding: Words change meaning based on context.
Sarcasm and irony: Hard to detect from text alone.
Low-resource languages: Lack of data for many world languages.
Code-switching: Mixing of languages in speech/text.
Components of Natural Language Processing
Two Components of Natural Language Processing (NLP) are:
1.Natural Language Understanding (NLU):
NLU is the process of enabling computers to understand and interpret
human language.
While NLP involves processing and analyzing text, NLU is more
concerned with understanding the intent and context of the language,
rather than just recognizing patterns or structure.
It involves extracting meaning from the text and transforming it into a
structured format that machines can use to take action.
This involves analysing the structure, syntax, and semantics of natural
language data to derive meaning from it.
Some of the tasks involved in NLU include part-of-speech tagging,
parsing, named entity recognition, and sentiment analysis.
In simple terms, NLU is the technology that helps computers
understand human language, rather than just processing it.
2.Natural Language Generation (NLG):
NLG is the process of enabling computers to generate human-like language.
This involves using algorithms and models to produce coherent and
contextually appropriate text, speech, or other forms of natural language
output.
Some of the tasks involved in NLG include summarization, paraphrasing,
text generation, and dialogue generation.
How NLP, NLU, and NLG Relate:
NLP is the overarching domain that includes both NLU and NLG.
While NLU deals with understanding the user's input, NLG focuses on
producing the system's output.
These two components work together to facilitate a seamless interaction
between humans and machines.
NLP brings together understanding (NLU) and generation (NLG) to process
and produce human language in a coherent and effective way.
Phases of Natural Language Processing
Lexical or Morphological Analysis:
Task: Breaking the input text into individual tokens (words, punctuation marks, etc.). It also
identifies word structures such as roots, prefixes, and suffixes.
Example:
For the sentence "The quick brown fox", the tokens are ["The", "quick", "brown", "fox"].
Lexical analysis may include:
Tokenization (splitting text into words or sentences)
Removing punctuation
Lowercasing
Example (in programming):
Input: int age = 25;
Lexical Analyzer Output:
int → keyword
age → identifier
= → operator
25 → number literal
; → punctuation
Morphological Analysis:
Definition:
Morphological analysis is the process of analyzing a word's structure to identify its root
(lemma) and affixes (prefixes, suffixes, infixes), and understanding grammatical features (e.g.,
tense, number, gender, case).
Example:
Word: "running"
Morphological Analysis:
Root: run
Suffix: -ing
Part of speech: verb (present participle/gerund)
Lexical analysis = "Split and label the words."
Morphological analysis = "Understand how each word is built."
Another example:
Word: "unhappiness"
Prefix: un-
Root: happy
Suffix: -ness
Word type: noun
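The affix-stripping idea above can be sketched in a few lines of Python. This is a toy illustration, not a real analyzer: the prefix/suffix lists and the `analyze` function are invented for this example, and no spelling rules are applied (so "unhappiness" yields the root "happi", not "happy").

```python
# Toy morphological analyzer sketch: strip one known prefix and one
# known suffix to recover an approximate root (illustrative lists only).
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def analyze(word):
    parts = {"root": word, "prefix": None, "suffix": None}
    for p in PREFIXES:
        if parts["root"].startswith(p):
            parts["prefix"], parts["root"] = p + "-", parts["root"][len(p):]
            break
    for s in SUFFIXES:
        if parts["root"].endswith(s):
            parts["suffix"], parts["root"] = "-" + s, parts["root"][:-len(s)]
            break
    return parts

print(analyze("unhappiness"))
# {'root': 'happi', 'prefix': 'un-', 'suffix': '-ness'}
```

Note the result "happi" rather than "happy": handling such spelling changes is exactly why real analyzers need more than plain affix stripping.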
Syntax Analysis or Parsing:
Task: Checking the grammatical structure of the sentences. It ensures that the sentence
conforms to the grammatical rules of the language and identifies relationships between words.
Example: "The quick brown fox jumps over the lazy dog"
noun phrase (NP) followed by a verb phrase (VP)
e.g., S → NP + VP, NP → Det + N, VP → V + NP
Word Morphemes Type
The the Determiner
quick quick Adjective
brown brown Adjective
fox fox Noun (root)
jumps jump + -s Verb (3rd person singular present)
over over Preposition
the the Determiner
lazy lazy Adjective
dog dog Noun (root)
Example Sentence: "The dog chased the cat."
Phrase Structure Rules:
S → NP + VP
NP → Det + N
VP → V + NP
Constituency Parse Tree (Text Format):
S
├── NP
│   ├── Det: The
│   └── N: dog
└── VP
    ├── V: chased
    └── NP
        ├── Det: the
        └── N: cat
Dependency Parse (Simplified):
        chased
       /      \
    dog        cat
   /          /
The        the
"chased" is the main verb (head)
"dog" is the subject of "chased"
"cat" is the object of "chased"
Sentence: "She gave him a gift."
Subject: She
Verb: gave
Indirect object: him
Direct object: a gift
S
├── NP: She
└── VP
├── V: gave
├── NP: him
└── NP
├── Det: a
└── N: gift
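The phrase-structure rules above (S → NP + VP, NP → Det + N, VP → V + NP) can be turned directly into a tiny parser. The sketch below is a minimal recursive-descent implementation over a hypothetical hand-built lexicon; real parsers handle much richer grammars and ambiguity.

```python
# Minimal recursive-descent parser for the toy grammar:
#   S -> NP VP,  NP -> Det N,  VP -> V NP
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "cat": "N",
           "gift": "N", "chased": "V", "gave": "V"}

def parse_np(tokens, i):
    # NP -> Det N
    if i + 1 < len(tokens) and LEXICON.get(tokens[i]) == "Det" \
            and LEXICON.get(tokens[i + 1]) == "N":
        return ("NP", ("Det", tokens[i]), ("N", tokens[i + 1])), i + 2
    return None, i

def parse_vp(tokens, i):
    # VP -> V NP
    if i < len(tokens) and LEXICON.get(tokens[i]) == "V":
        np, j = parse_np(tokens, i + 1)
        if np:
            return ("VP", ("V", tokens[i]), np), j
    return None, i

def parse_s(sentence):
    tokens = sentence.lower().rstrip(".").split()
    np, i = parse_np(tokens, 0)
    if np:
        vp, j = parse_vp(tokens, i)
        if vp and j == len(tokens):
            return ("S", np, vp)
    return None  # sentence does not match the grammar

print(parse_s("The dog chased the cat."))
```

The returned nested tuples mirror the constituency tree shown above.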
Semantic Analysis :
Task: Extracting the meaning of a sentence. It maps syntactic structures to
their meaning, identifying the roles played by different words (subject, object,
etc.).
Word-Level Meaning (Lexical Semantics)
Word Meaning
The A definite article; refers to a specific entity
Quick Describes speed; fast
Brown Describes color
Fox A small, agile mammal; the subject
Jumps Action verb meaning to leap upward or forward
Over Preposition indicating movement across something
The Definite article
Lazy Describes lack of energy or activity
Dog A domesticated mammal; the object (target of the jump)
Discourse Integration :
Task: Understanding the context across multiple sentences. This involves
resolving references like pronouns and understanding how the meaning of
one sentence affects the next.
It is the process of connecting the meaning of a sentence with the sentences
before and after it, creating a coherent flow of ideas across the entire text or
conversation.
Pragmatic Analysis:
Task: Understanding the intended meaning of the text in a specific context. This
includes interpreting idiomatic expressions, implied meanings, and speech acts
(e.g., commands, requests).
Example: In the sentence "Can you open the door?", pragmatic analysis
understands that this is a request, not a question about ability.
Example:
Person A: "It’s cold in here."
Person B: "I’ll close the window."
Person A didn’t literally ask to close the window, but implied it.
Natural Language Processing (NLP) System Architecture:
An NLP system typically consists of multiple layers or components that transform human language (spoken or
written) into a format that machines can process, understand, and respond to.
Words and Their Components
Words and Their Components:
Understanding how words are built helps machines analyze, process, and generate
language more effectively.
This involves breaking words into their components, which is central to tasks like text
classification, machine translation, sentiment analysis, and more.
Words and their components are the basic building blocks of NLP.
Corpus: Corpora are large and diverse collections of text, speech, and other forms of human
language that are used as input data for natural language processing (NLP) applications.
To be used effectively, corpora must be pre-processed and analyzed using various
operations and techniques.
The key concepts are:
1. Morphemes
2. Tokens
3. Lexemes
4. Typology
1.TOKENS:
A token is a sequence of characters or a single unit of text, such as a word,
character, or symbol.
Tokens are created by dividing the text into smaller units.
This process is called tokenization.
Word Tokenization: split text by spaces/punctuation. Example: "I love NLP" → ["I", "love", "NLP"]
Sentence Tokenization: split text into sentences. Example: "Hello. How are you?" → ["Hello.", "How are you?"]
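Both kinds of tokenization can be sketched with Python's standard `re` module alone. The function names here are invented for illustration, and the sentence splitter is deliberately naive (it fails on abbreviations like "Dr."); real systems use libraries such as NLTK or spaCy.

```python
# Minimal word and sentence tokenization sketch using only the
# standard-library re module.
import re

def word_tokenize(text):
    # Keep runs of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)

def sentence_tokenize(text):
    # Split after ., !, or ? followed by whitespace (naive).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("I love NLP"))               # ['I', 'love', 'NLP']
print(sentence_tokenize("Hello. How are you?"))  # ['Hello.', 'How are you?']
```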
2.LEXEME :
A lexeme is a basic unit of meaning in language.
It refers to a group of word forms that are variations of the same word (same
dictionary entry) but differ in tense, number, case, etc.
Lexemes are used to represent the meaning and context of a word in a sentence.
The process of mapping word forms to their lexeme is called lemmatization.
Definition: A lexeme is an abstract unit of meaning that can have different
word forms depending on grammatical usage.
Examples of Lexemes
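As a rough sketch of how word forms map to lexemes, the lookup table below groups a few irregular forms under their dictionary entry. Both the table and the `lemmatize` function are illustrative assumptions; production systems rely on resources such as WordNet.

```python
# Tiny lemmatization sketch: map inflected word forms to their lexeme
# (dictionary form) via a hand-built lookup table.
LEMMA_TABLE = {
    "ran": "run", "runs": "run", "running": "run",
    "better": "good", "mice": "mouse", "children": "child",
}

def lemmatize(word):
    word = word.lower()
    return LEMMA_TABLE.get(word, word)  # fall back to the word itself

print(lemmatize("Running"))  # run
print(lemmatize("mice"))     # mouse
```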
3.MORPHEMES:A morpheme is the smallest meaningful unit of language. It
cannot be divided further without losing or changing its meaning.
Definition: A morpheme is the smallest grammatical or meaningful unit in a
language.
For example: "unhappiness" has 3 morphemes: un- (prefix) + happy (root) + -ness
(suffix)
Derivational: changes the word's meaning or part of speech. Example: "happy" → "unhappy", "teach" → "teacher". Effect: creates a new word.
Inflectional: expresses grammatical changes. Example: "dog" → "dogs", "walk" → "walked". Effect: changes form, not core meaning.
4.TYPOLOGY:
is the study of systematic classification of languages based on their common
structural features and patterns.
It helps understand how languages are similar or different in terms of syntax,
morphology, phonology, etc.
Types of Typology in Linguistics
Morphological Typology: how words are formed and structured. Examples: Isolating (Chinese), Agglutinative (Turkish), Fusional (Latin)
Syntactic Typology: word order and sentence structure. Examples: SVO (English), SOV (Japanese), VSO (Arabic)
Issues and challenges in analyzing word structure include:
1. Irregularity
2. Productivity
3. Ambiguity
1.Irregularity:
Irregularity is when a word, phrase, or sentence does not follow standard
grammatical, morphological, or syntactic rules.
Humans easily understand irregularities through experience, but machines
often struggle.
For example:
in English, the past tense of "go" is "went," which does not follow the
regular pattern of adding "-ed" to form the past tense, as in "walked" from
"walk."
Irregular forms are particularly challenging because they often have to be
memorized individually, as they do not adhere to predictable patterns.
Examples:
• Verb Conjugation:
In English, the past tense of "run" is "ran," not "runned," and the past participle
of "eat" is "eaten," not "eated."
These irregular forms deviate from the regular pattern of adding "-ed" to form
the past tense.
• Plural Formation:
The plural of "mouse" is "mice," and the plural of "child" is "children," not
following the regular pattern of adding "-s" or "-es" to form plurals.
Ambiguity:
Ambiguity can occur when a single form has multiple possible
interpretations.
This issue can manifest at the level of individual morphemes (e.g., a suffix or
prefix that serves multiple grammatical functions) or in the structure of entire
words (e.g., when the same word form can represent different parts of speech
or grammatical roles depending on context).
ambiguity complicates linguistic analysis, natural language processing, and
language learning, as it requires additional contextual information to resolve
the intended meaning or grammatical function.
A word form may be understood in multiple ways out of context.
Word forms that look the same may have distinct functions or meanings.
Examples:
1. "Leaves": This word can be the plural form of "leaf" (noun) or the third
person singular present tense of the verb "to leave."
Without context, its grammatical role and meaning are ambiguous.
Productivity:
Productivity is the degree to which a morphological process (like adding
suffixes or prefixes) can be applied to new words to create valid and
meaningful expressions.
A productive morphological process is one that is actively used to create new
words and is readily understood by speakers of the language.
For instance, the suffix "-ness" in English can be added to a wide array of
adjectives to form nouns denoting a state or quality (e.g., "happiness" from
"happy").
Examples:
• Suffix "-ize": This suffix can be added to nouns and adjectives to form verbs,
indicating the action of making or becoming. For example, "modern" becomes
"modernize," and "legal" becomes "legalize." This process is highly productive in
English.
• Prefix "un-": This prefix can be added to adjectives and some verbs to create
their opposites, such as "happy" to "unhappy" or "do" to "undo." It is a
productive means of negation in English.
Morphological Models
Morphological Models:
Morphological models are used to analyse and understand the structure of
words.
These models help linguists understand how words are built from smaller
units (morphemes), how words are related to each other, and how they change
to express different grammatical categories such as tense, case, number, and
gender.
They are classified into the following models:
Dictionary Lookup
Finite State Morphology
Unification Based Morphology
Functional Morphology
DICTIONARY LOOKUP:
It is a technique used in morphological models to analyze and understand the
internal structure of words.
It is used to retrieve information about words from a predefined lexical
resource or dictionary.
In the context of NLP, a dictionary (or lexicon) is a collection of words with
associated information such as their meanings and parts of speech.
A dictionary is understood as a data structure that directly enables some
precomputed results, in our case word analyses.
The data structure can be optimized for efficient lookup.
Lookup operations are simple and quick.
Dictionaries can be implemented as lists, binary search trees, or hash tables.
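A Python dict (a hash table) makes the idea concrete: precomputed word analyses are stored once and retrieved in constant average time. The entries and the `analyze` function are invented for illustration.

```python
# Dictionary-lookup sketch: precomputed morphological analyses stored
# in a hash table for fast retrieval.
LEXICON = {
    "cats":    {"lemma": "cat",   "pos": "NOUN", "number": "plural"},
    "ran":     {"lemma": "run",   "pos": "VERB", "tense": "past"},
    "happier": {"lemma": "happy", "pos": "ADJ",  "degree": "comparative"},
}

def analyze(word):
    return LEXICON.get(word.lower())  # None if the word is not listed

print(analyze("cats"))
# {'lemma': 'cat', 'pos': 'NOUN', 'number': 'plural'}
```

The obvious limitation is coverage: any form not precomputed into the lexicon returns nothing, which is why rule-based models such as finite-state morphology are used alongside lookup.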
Finite State Morphology:
Finite State Morphology (FSM) is a computational approach to the analysis
and generation of word forms in natural language processing (NLP).
It uses finite state automata (FSA) or finite state transducers (FST) to model
the morphological rules of a language.
It models how words are built (generation) or broken down (analysis) using
finite-state machines, specifically finite-state transducers (FSTs).
Finite State Automata (FSA) :
FSA are computational models used to represent the states and transitions
between those states within a system.
In the context of morphology, an FSA can model the legal sequences of
morphemes and their modifications based on the grammatical rules of a
language.
Example of Finite State Morphology: English Plurals
Consider a simplified model for generating the plural forms of English nouns
using an FST. English plural formation generally follows a few rules, such as
adding "-s" to the end of a word, but with exceptions (e.g., "child" to "children").
1. Regular Plurals: The majority of English nouns form their plural by adding "-
s" or "-es" (if the noun ends in s, sh, ch, x, or z). An FST can encode these
rules with states and transitions that append the correct suffix based on the
final letters of the noun.
For example, the word "cat" transitions through states that recognize the word and
then apply the rule to add "s," producing "cats."
2. Irregular Plurals: For irregular forms, the FST would include specific
transitions for these exceptions. For "child," the FST would have a
transition that maps "child" directly to "children" without following the
regular rule.
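The plural rules above can be sketched as a simple rule cascade in Python. This is only an approximation of what an FST encodes: a real toolkit (such as foma or HFST) would compile equivalent rules into an actual transducer, and the irregular table here lists just a few assumed examples.

```python
# Rule-cascade sketch of English plural formation, mirroring the FST
# description: check irregulars first, then the "-es" contexts, then "-s".
IRREGULAR = {"child": "children", "mouse": "mice", "foot": "feet"}

def pluralize(noun):
    if noun in IRREGULAR:                       # direct irregular mapping
        return IRREGULAR[noun]
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"                      # nouns ending in s/sh/ch/x/z
    return noun + "s"                           # default regular rule

print(pluralize("cat"))    # cats
print(pluralize("box"))    # boxes
print(pluralize("child"))  # children
```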
Unification Based Morphology:
Unification-based morphology is a computational approach to the analysis
and generation of word forms that relies on the concept of unification.
This is a process of merging two or more feature structures, aligning them by
their shared attributes.
Unification is the process of combining (merging) feature structures, where
features (such as tense, number, and gender) associated with morphemes are
unified to form a correct morphological form.
Example: "ran"
Feature Structure 1: [Number: Singular]
Feature Structure 2: [Tense: Past]
Unified Structure: [Number: Singular, Tense: Past]
Here the word "ran" is singular and past tense (e.g., "He ran"). Similarly,
"run" + "-ing" results in a unified feature structure for "running".
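Unification itself is a small operation: two feature structures unify only if they agree on every shared attribute, and the result merges both. The sketch below models feature structures as plain Python dicts; real unification grammars also handle nested and underspecified features.

```python
# Minimal feature-structure unification over flat dicts: fail on any
# conflicting shared attribute, otherwise merge the two structures.
def unify(fs1, fs2):
    for key in fs1.keys() & fs2.keys():
        if fs1[key] != fs2[key]:
            return None  # conflicting values: unification fails
    return {**fs1, **fs2}

print(unify({"Number": "Singular"}, {"Tense": "Past"}))
# {'Number': 'Singular', 'Tense': 'Past'}
print(unify({"Number": "Singular"}, {"Number": "Plural"}))
# None
```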
Functional Morphology:
It focuses on the relationship between a word's form and its function in conveying
grammatical and semantic information, such as marking tense, number, case, and gender.
Tense (English Verbs)
Base Form Tense Word Form
play Present play
play Past played
play Continuous playing
play 3rd Person Singular plays
Number (Nouns)
Base Form Number Word Form
cat Singular cat
cat Plural cats
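The functional view in the tables above amounts to a mapping from (base form, grammatical feature) to a surface word form. The sketch below covers only the regular examples shown; the `inflect` function and its rule table are invented for illustration and ignore irregular forms.

```python
# Functional-morphology sketch: a function from a base form and a
# grammatical feature to a surface word form (regular patterns only).
def inflect(base, feature):
    rules = {
        "Past": base + "ed",
        "Continuous": base + "ing",
        "3rd Person Singular": base + "s",
        "Plural": base + "s",
    }
    return rules.get(feature, base)  # Present / Singular: the base form

print(inflect("play", "Past"))    # played
print(inflect("cat", "Plural"))   # cats
```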
Finding the Structure of Documents
Finding the Structure of Documents –
Finding the structure of documents means identifying and understanding how a
document is organized, including sections, headings, paragraphs, titles, bullet
points, and more.
Improves Search and Retrieval: Understanding the structure of documents enables
better search and retrieval, as search algorithms can target specific sections (e.g.,
headings, paragraphs, references) to produce more relevant results.
• Facilitates Summarization: Knowing how a document is structured allows
automated systems to summarize key sections like introductions, conclusions, or
results in research papers.
• Enhances Information Extraction: Systems can extract specific information like
dates, addresses, or citations by recognizing different parts of a document's
structure.
Challenges in Finding the Structure of Documents:
• Varied Formats: Documents can come in various formats such as PDFs, HTML,
XML, Word documents, and scanned images. Each format requires different pre-
processing techniques.
• Complex Layouts: Documents with tables, multi-column layouts, images, and
footnotes pose a challenge for document structure analysis.
• Noisy Data: Scanned documents and low-quality images often introduce noise,
making it difficult to accurately extract text or identify structure.
Steps Involved in Document Structure Analysis:
Preprocessing
The first step involves converting the document into a format suitable for
analysis, such as plain text, and performing some cleaning:
• Text Conversion: Converting documents from formats like PDF or
images into machine-readable formats (e.g., plain text, HTML) using
tools such as Tesseract OCR (for images) or PDFBox (for PDFs).
• Noise Removal: Cleaning irrelevant information like metadata,
watermarks, headers, and footers.
Segmentation: Breaking down the document into sentences, paragraphs, and
sections for easier analysis.
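The segmentation step above can be sketched with the standard `re` module: paragraphs split on blank lines, sentences split after terminal punctuation. The `segment` function is an invented, naive illustration; real pipelines use trained segmenters after OCR/text conversion.

```python
# Preprocessing/segmentation sketch: plain text -> paragraphs -> sentences.
import re

def segment(text):
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive sentence split after ., !, or ? followed by whitespace.
    return [re.split(r"(?<=[.!?])\s+", p) for p in paragraphs]

doc = "First para. Two sentences.\n\nSecond para here."
print(segment(doc))
# [['First para.', 'Two sentences.'], ['Second para here.']]
```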
Layout Analysis
Layout analysis is the task of identifying and interpreting the physical layout
of the document:
• Detecting Headers and Footers: These elements may be irrelevant for
document content analysis but important for metadata extraction.
• Identifying Columns: In multi-column layouts (common in research
papers), each column should be analyzed separately.
• Recognizing Font Styles: Features like bold or italic text, larger font
sizes, and indentation help identify titles, headings, or emphasized text
Feature Extraction
Extracting key features from the text to distinguish between various elements
like headings, lists, and paragraphs:
• Formatting Cues: Information such as font size, style (bold, italic), and
alignment (center, left) is helpful in detecting section headers, lists, or quotes.
• Keyword Detection: Keywords like “Introduction”, “Conclusion”, or
“References” often signal important structural elements.
• Punctuation Patterns: Identifying bullet points, numbering schemes, or
figure labels.
Classification
Once the features are extracted, machine learning models or rule-based
systems classify each section or block of text into its corresponding category:
Hierarchy Construction
After classifying each element, the next step is to build a logical hierarchy:
• Heading Levels: Determining the hierarchy of headings and
subheadings (e.g., H1, H2, H3) based on their relative importance or indentation.
• Content Organization: Grouping related paragraphs under their respective
headings and associating figures, tables, or footnotes with the relevant
sections.
Semantic Analysis
Semantic analysis further enhances document structure understanding by
interpreting the meaning behind different sections:
• Contextual Understanding: Using Natural Language Processing (NLP)
techniques, semantic analysis can determine the meaning of headings, detect
topics, or summarize sections.
• Named Entity Recognition (NER): Detecting entities such as dates, names,
places, or references within sections.
Validation and Correction
The extracted structure may be validated either manually or through
automated methods:
• Manual Validation: This may involve a human reviewing the output to correct
errors in document structure recognition.
• Automated Validation: Rule-based systems can check for
inconsistencies, such as missing headings or misplaced tables, to
ensure that the hierarchy is properly constructed.
Identifying The Boundaries:
Sentence Boundary Detection
Sentence Boundary Detection (SBD), also known as sentence segmentation, is
the process of identifying where one sentence ends and the next begins.
Steps Involved in Sentence Boundary Detection:
2.Handling Ambiguities:
3.Supervised Machine Learning (ML) Models:
ML models can be trained to detect sentence boundaries by learning from labeled
data. These models use features like:
▪ Punctuation patterns.
▪ Part of Speech (POS) tags (e.g., period followed by a capital letter could indicate a
new sentence).
Example:
“He said, ‘Let’s go.’ Then they left.”
Desired SBD Output:
“He said, ‘Let’s go.’”
“Then they left.”
So, the boundary is after the closing quote, not just after "go."
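A rule-based baseline for SBD uses exactly the features listed above: a period (optionally followed by a closing quote) and a following capital letter, with a small abbreviation list to suppress false boundaries. The function and abbreviation list below are an invented sketch; trained ML segmenters outperform this kind of rule set.

```python
# Naive rule-based sentence boundary detector: split on ., !, or ?
# (optionally followed by a quote) before whitespace and a capital letter,
# skipping boundaries that would end on a known abbreviation.
import re

ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e."}

def split_sentences(text):
    parts, start = [], 0
    for m in re.finditer(r'[.!?]["\'’”]?\s+(?=[A-Z])', text):
        candidate = text[start:m.end()].strip()
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # the period ends an abbreviation, not a sentence
        parts.append(candidate)
        start = m.end()
    rest = text[start:].strip()
    if rest:
        parts.append(rest)
    return parts

print(split_sentences("Dr. Smith left. He said hi."))
# ['Dr. Smith left.', 'He said hi.']
```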
Topic Boundary Detection
Topic Boundary Detection involves identifying where one topic ends and another
begins within a document. This is important for understanding the logical
structure and flow of content in long documents such as research papers, books,
or articles.
1.Identifying Topic Shifts:
Topic boundary detection aims to recognize changes in the subject matter. This
could be based on:
▪ Keyword Clustering: When the text shifts from one group of keywords to
another (e.g., from discussing "dogs" to discussing "cats"), it may indicate a
topic shift.
▪ Semantic Change: A significant change in the context of words in
consecutive sentences.
Example:
▪ Input: "In the first chapter, we discuss the properties of water. In the next
chapter, we analyze the effects of temperature on metals."
▪ Topic boundary detected between "properties of water" and "effects of
temperature on metals."
3.TextTiling Algorithm:
A classic unsupervised method that segments text into topical blocks by examining
word frequency distributions and lexical similarities between sections. It detects
topic boundaries by measuring the dissimilarity between adjacent blocks of text.
▪ Example:
▪ Input: "Chapter 1: Introduction... Chapter 2: Methods..."
▪ The algorithm would detect a boundary between these chapters based on word
frequency and semantic similarity.
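The core of the TextTiling idea can be sketched by scoring lexical similarity between adjacent blocks of text: a low cosine similarity between word-frequency vectors suggests a topic boundary. The helper names below are invented, and this omits TextTiling's smoothing and depth-scoring steps.

```python
# Lexical-cohesion sketch behind TextTiling: compare word-frequency
# vectors of adjacent text blocks; low similarity suggests a boundary.
import re
from collections import Counter
from math import sqrt

def block_vector(text):
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a.keys() & b.keys())
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

left = "water boils water freezes water flows"
right = "metals expand metals conduct heat"
print(cosine(block_vector(left), block_vector(right)))  # 0.0 -> likely boundary
```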
4.Human-Assisted Approaches:
Manual Topic Annotation: For important documents like legal papers or reports,
manual topic annotation can be used to identify topic boundaries. This is often
done in combination with automated methods to improve accuracy.
Example: Lawyers might mark where different sections of a contract begin and
end, like "Definitions", "Terms", and "Signatures".