NLP Unit 1
PREPARED BY GAJJELA.KUSHUBU@GMAIL.COM 1
Unit I
Finding the Structure of Words: Words and Their
Components, Issues and Challenges, Morphological
Models
Finding the Structure of Documents: Introduction,
Methods, Complexity of the Approaches, Performance
of the Approaches, Features.
How humans communicate with each other:
They use natural language, such as Telugu, English, or Hindi.
The listener hears the sentences, interprets them, understands them, and then replies.
This is called two-way communication.
It is possible only when the participants understand each other's language.
Using natural language in this way is part of what makes human behaviour intelligent.
We expect the same from a computer: it should replicate this ability, and for that it uses NLP.
Introduction to Natural Language Processing:
Natural Language Processing (NLP) is a subfield of computer science and
artificial intelligence that focuses on enabling computers to understand,
interpret, and generate human language.
The goal of NLP is to create intelligent systems that can understand and
communicate with humans in natural language.
APPLICATIONS OF NLP:
Speech recognition
Example: voice assistants such as Siri (Apple)
Sentiment analysis
Example: social media (Twitter), e.g., movie feedback, political opinion
Machine translation: translating text from one language to another
Example: Google Translate
Text summarization
Information retrieval
Chatbots
When you ask a question, the machine replies using its underlying database.
Others, such as spell checking
Core Tasks in NLP
Some fundamental tasks include:
Tokenization: Splitting text into words, phrases, or other units.
Part-of-Speech (POS) Tagging: Identifying words as nouns, verbs,
adjectives, etc.
Named Entity Recognition (NER): Detecting proper nouns like names of
people, organizations, or places.
Parsing: Analyzing the grammatical structure of a sentence.
Language Modelling: Predicting the next word or sentence in a sequence.
Text Classification: Assigning categories to text (e.g., spam vs. non-spam).
Machine Translation: Translating text from one language to another.
Challenges in NLP
Ambiguity: Words and sentences can have multiple meanings.
Context understanding: Words change meaning based on context.
Sarcasm and irony: Hard to detect from text alone.
Low-resource languages: Lack of data for many world languages.
Code-switching: Mixing of languages in speech/text.
Components of Natural Language Processing
Two Components of Natural Language Processing (NLP) are:
1.Natural Language Understanding (NLU):
NLU is the process of enabling computers to understand and interpret
human language.
While NLP involves processing and analyzing text, NLU is more
concerned with understanding the intent and context of the language,
rather than just recognizing patterns or structure.
It involves extracting meaning from the text and transforming it into a
structured format that machines can use to take action.
This involves analysing the structure, syntax, and semantics of natural
language data to derive meaning from it.
Some of the tasks involved in NLU include part-of-speech tagging,
parsing, named entity recognition, and sentiment analysis.
In simple terms, NLU is the technology that helps computers
understand human language, rather than just processing it.
2.Natural Language Generation (NLG):
NLG is the process of enabling computers to generate human-like language.
This involves using algorithms and models to produce coherent and
contextually appropriate text, speech, or other forms of natural language
output.
Some of the tasks involved in NLG include summarization, paraphrasing,
text generation, and dialogue generation.
How NLP, NLU, and NLG Relate:
NLP is the overarching domain that includes both NLU and NLG.
While NLU deals with understanding the user's input, NLG focuses on
producing the system's output.
These two components work together to facilitate a seamless interaction
between humans and machines.
NLP brings together understanding (NLU) and generation (NLG) to process
and produce human language in a coherent and effective way.
Phases of Natural Language Processing
Lexical or Morphological Analysis:
Task: Breaking the input text into individual tokens (words, punctuation marks, etc.). It also
identifies word structures such as roots, prefixes, and suffixes.
Example:
For the sentence "The quick brown fox", the tokens are ["The", "quick", "brown", "fox"].
Lexical analysis may include:
Tokenization (splitting text into words or sentences)
Removing punctuation
Lowercasing
Example (in programming):
Input: int age = 25;
Lexical Analyzer Output:
int → keyword
age → identifier
= → operator
25 → number literal
; → punctuation
Morphological Analysis:
Definition:
Morphological analysis is the process of analyzing a word's structure to identify its root
(lemma) and affixes (prefixes, suffixes, infixes), and understanding grammatical features (e.g.,
tense, number, gender, case).
Example:
Word: "running"
Morphological Analysis:
Root: run
Suffix: -ing
Part of speech: verb (present participle/gerund)
Lexical analysis = "Split and label the words."
Morphological analysis = "Understand how each word is built."
Another example:
Word: "unhappiness"
Prefix: un-
Root: happy
Suffix: -ness
Word type: noun
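The affix-stripping idea above can be sketched in a few lines of Python. This is a toy illustration, not a real analyzer: the prefix/suffix lists and the `analyze` function are invented for this example, and no spelling rules are applied (so "unhappiness" yields the root "happi", not "happy").

```python
# Toy morphological analyzer sketch: strip one known prefix and one
# known suffix to recover an approximate root (illustrative lists only).
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def analyze(word):
    parts = {"root": word, "prefix": None, "suffix": None}
    for p in PREFIXES:
        if parts["root"].startswith(p):
            parts["prefix"], parts["root"] = p + "-", parts["root"][len(p):]
            break
    for s in SUFFIXES:
        if parts["root"].endswith(s):
            parts["suffix"], parts["root"] = "-" + s, parts["root"][:-len(s)]
            break
    return parts

print(analyze("unhappiness"))
# {'root': 'happi', 'prefix': 'un-', 'suffix': '-ness'}
```

Note the result "happi" rather than "happy": handling such spelling changes is exactly why real analyzers need more than plain affix stripping.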
Syntax Analysis or Parsing:
Task: Checking the grammatical structure of the sentences. It ensures that the sentence
conforms to the grammatical rules of the language and identifies relationships between words.
Example: "The quick brown fox jumps over the lazy dog"
noun phrase (NP) followed by a verb phrase (VP)
e.g., S → NP + VP, NP → Det + N, VP → V + NP
Word Morphemes Type
The the Determiner
quick quick Adjective
brown brown Adjective
fox fox Noun (root)
jumps jump + -s Verb (3rd person singular present)
over over Preposition
the the Determiner
lazy lazy Adjective
dog dog Noun (root)
Example Sentence: "The dog chased the cat."
Phrase Structure Rules:
S → NP + VP
NP → Det + N
VP → V + NP
Constituency Parse Tree (Text Format):
S
├── NP
│   ├── Det: The
│   └── N: dog
└── VP
    ├── V: chased
    └── NP
        ├── Det: the
        └── N: cat
Dependency Parse (Simplified):
        chased
       /      \
    dog        cat
   /          /
The        the
"chased" is the main verb (head)
"dog" is the subject of "chased"
"cat" is the object of "chased"
Sentence: "She gave him a gift."
Subject: She
Verb: gave
Indirect object: him
Direct object: a gift
S
├── NP: She
└── VP
├── V: gave
├── NP: him
└── NP
├── Det: a
└── N: gift
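The phrase-structure rules above (S → NP + VP, NP → Det + N, VP → V + NP) can be turned directly into a tiny parser. The sketch below is a minimal recursive-descent implementation over a hypothetical hand-built lexicon; real parsers handle much richer grammars and ambiguity.

```python
# Minimal recursive-descent parser for the toy grammar:
#   S -> NP VP,  NP -> Det N,  VP -> V NP
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "cat": "N",
           "gift": "N", "chased": "V", "gave": "V"}

def parse_np(tokens, i):
    # NP -> Det N
    if i + 1 < len(tokens) and LEXICON.get(tokens[i]) == "Det" \
            and LEXICON.get(tokens[i + 1]) == "N":
        return ("NP", ("Det", tokens[i]), ("N", tokens[i + 1])), i + 2
    return None, i

def parse_vp(tokens, i):
    # VP -> V NP
    if i < len(tokens) and LEXICON.get(tokens[i]) == "V":
        np, j = parse_np(tokens, i + 1)
        if np:
            return ("VP", ("V", tokens[i]), np), j
    return None, i

def parse_s(sentence):
    tokens = sentence.lower().rstrip(".").split()
    np, i = parse_np(tokens, 0)
    if np:
        vp, j = parse_vp(tokens, i)
        if vp and j == len(tokens):
            return ("S", np, vp)
    return None  # sentence does not match the grammar

print(parse_s("The dog chased the cat."))
```

The returned nested tuples mirror the constituency tree shown above.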
Semantic Analysis :
Task: Extracting the meaning of a sentence. It maps syntactic structures to
their meaning, identifying the roles played by different words (subject, object,
etc.).
Word-Level Meaning (Lexical Semantics)
Word Meaning
The A definite article; refers to a specific entity
Quick Describes speed; fast
Brown Describes color
Fox A small, agile mammal; the subject
Jumps Action verb meaning to leap upward or forward
Over Preposition indicating movement across something
The Definite article
Lazy Describes lack of energy or activity
Dog A domesticated mammal; the object (target of the jump)
Discourse Integration :
Task: Understanding the context across multiple sentences. This involves
resolving references like pronouns and understanding how the meaning of
one sentence affects the next.
It is the process of connecting the meaning of a sentence with the sentences
before and after it, creating a coherent flow of ideas across the entire text or
conversation.
Pragmatic Analysis:
Task: Understanding the intended meaning of the text in a specific context. This
includes interpreting idiomatic expressions, implied meanings, and speech acts
(e.g., commands, requests).
Example: In the sentence "Can you open the door?", pragmatic analysis
understands that this is a request, not a question about ability.
Example:
Person A: "It’s cold in here."
Person B: "I’ll close the window."
Person A didn’t literally ask to close the window, but implied it.
Natural Language Processing (NLP) System Architecture:
An NLP system typically consists of multiple layers or components that transform human language (spoken or
written) into a format that machines can process, understand, and respond to.
Words and Their Components
Words and Their Components:
Understanding how words are built helps machines analyze, process, and generate
language more effectively.
This involves breaking words into their components, which is central to tasks like text
classification, machine translation, sentiment analysis, and more.
Words and their components are the basic building blocks of NLP.
Corpus: Corpora are large and diverse collections of text, speech, and other forms of human
language that are used as input data for natural language processing (NLP) applications.
To be used effectively, corpora must be pre-processed and analyzed using various
operations and techniques.
The key concepts are:
1. Morphemes
2. Tokens
3. Lexemes
4. Typology
1.TOKENS:
A token is a sequence of characters or a single unit of text, such as a word,
character, or symbol.
Tokens are created by dividing the text into smaller units.
This process is called tokenization.
Word Tokenization: split text by spaces/punctuation. Example: "I love NLP" → ["I", "love", "NLP"]
Sentence Tokenization: split text into sentences. Example: "Hello. How are you?" → ["Hello.", "How are you?"]
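Both kinds of tokenization can be sketched with Python's standard `re` module alone. The function names here are invented for illustration, and the sentence splitter is deliberately naive (it fails on abbreviations like "Dr."); real systems use libraries such as NLTK or spaCy.

```python
# Minimal word and sentence tokenization sketch using only the
# standard-library re module.
import re

def word_tokenize(text):
    # Keep runs of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)

def sentence_tokenize(text):
    # Split after ., !, or ? followed by whitespace (naive).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("I love NLP"))               # ['I', 'love', 'NLP']
print(sentence_tokenize("Hello. How are you?"))  # ['Hello.', 'How are you?']
```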
2.LEXEME :
A lexeme is a basic unit of meaning in language.
It refers to a group of word forms that are variations of the same word (same
dictionary entry) but differ in tense, number, case, etc.
Lexemes are used to represent the meaning and context of a word in a sentence.
The process of mapping word forms to their lexeme is called lemmatization.
Definition: A lexeme is an abstract unit of meaning that can have different
word forms depending on grammatical usage.
Examples of Lexemes
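As a rough sketch of how word forms map to lexemes, the lookup table below groups a few irregular forms under their dictionary entry. Both the table and the `lemmatize` function are illustrative assumptions; production systems rely on resources such as WordNet.

```python
# Tiny lemmatization sketch: map inflected word forms to their lexeme
# (dictionary form) via a hand-built lookup table.
LEMMA_TABLE = {
    "ran": "run", "runs": "run", "running": "run",
    "better": "good", "mice": "mouse", "children": "child",
}

def lemmatize(word):
    word = word.lower()
    return LEMMA_TABLE.get(word, word)  # fall back to the word itself

print(lemmatize("Running"))  # run
print(lemmatize("mice"))     # mouse
```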
3.MORPHEMES:A morpheme is the smallest meaningful unit of language. It
cannot be divided further without losing or changing its meaning.
Definition: A morpheme is the smallest grammatical or meaningful unit in a
language.
For example: "unhappiness" has 3 morphemes: un- (prefix) + happy (root) + -ness
(suffix)
Derivational: changes the word's meaning or part of speech. Example: "happy" → "unhappy", "teach" → "teacher". Effect: creates a new word.
Inflectional: expresses grammatical changes. Example: "dog" → "dogs", "walk" → "walked". Effect: changes form, not core meaning.
4.TYPOLOGY:
is the study of systematic classification of languages based on their common
structural features and patterns.
It helps understand how languages are similar or different in terms of syntax,
morphology, phonology, etc.
Types of Typology in Linguistics
Morphological Typology: how words are formed and structured. Examples: Isolating (Chinese), Agglutinative (Turkish), Fusional (Latin)
Syntactic Typology: word order and sentence structure. Examples: SVO (English), SOV (Japanese), VSO (Arabic)
Issues and challenges in analyzing word structure include:
1. Irregularity
2. Productivity
3. Ambiguity
1.Irregularity:
Irregularity is when a word, phrase, or sentence does not follow standard
grammatical, morphological, or syntactic rules.
Humans easily understand irregularities through experience, but machines
often struggle.
For example:
in English, the past tense of "go" is "went," which does not follow the
regular pattern of adding "-ed" to form the past tense, as in "walked" from
"walk."
Irregular forms are particularly challenging because they often have to be
memorized individually, as they do not adhere to predictable patterns.
Examples:
• Verb Conjugation:
In English, the past tense of "run" is "ran," not "runned," and the past participle
of "eat" is "eaten," not "eated."
These irregular forms deviate from the regular pattern of adding "-ed" to form
the past tense.
• Plural Formation:
The plural of "mouse" is "mice," and the plural of "child" is "children," not
following the regular pattern of adding "-s" or "-es" to form plurals.
Ambiguity:
Ambiguity can occur when a single form has multiple possible
interpretations.
This issue can manifest at the level of individual morphemes (e.g., a suffix or
prefix that serves multiple grammatical functions) or in the structure of entire
words (e.g., when the same word form can represent different parts of speech
or grammatical roles depending on context).
ambiguity complicates linguistic analysis, natural language processing, and
language learning, as it requires additional contextual information to resolve
the intended meaning or grammatical function.
A word form may be understood in multiple ways out of context.
Word forms that look the same may have distinct functions or meanings.
Examples:
1. "Leaves": This word can be the plural form of "leaf" (noun) or the third
person singular present tense of the verb "to leave."
Without context, its grammatical role and meaning are ambiguous.
Productivity:
Productivity is the degree to which a morphological process (like adding
suffixes or prefixes) can be applied to new words to create valid and
meaningful expressions.
A productive morphological process is one that is actively used to create new
words and is readily understood by speakers of the language.
For instance, the suffix "-ness" in English can be added to a wide array of
adjectives to form nouns denoting a state or quality (e.g., "happiness" from
"happy").
Examples:
• Suffix "-ize": This suffix can be added to nouns and adjectives to form verbs,
indicating the action of making or becoming. For example, "modern" becomes
"modernize," and "legal" becomes "legalize." This process is highly productive in
English.
• Prefix "un-": This prefix can be added to adjectives and some verbs to create
their opposites, such as "happy" to "unhappy" or "do" to "undo." It is a
productive means of negation in English.
Morphological Models
Morphological Models:
Morphological models are used to analyse and understand the structure of
words.
These models help linguists understand how words are built from smaller
units (morphemes), how words are related to each other, and how they change
to express different grammatical categories such as tense, case, number, and
gender.
They are classified into the following models:
Dictionary Lookup
Finite State Morphology
Unification Based Morphology
Functional Morphology
DICTIONARY LOOKUP:
It is a technique used in morphological models to analyze and understand the
internal structure of words.
It is used to retrieve information about words from a predefined lexical
resource or dictionary.
In the context of NLP, a dictionary (or lexicon) is a collection of words with
associated information such as their meanings and parts of speech.
A dictionary is understood as a data structure that directly enables some
precomputed results, in our case word analyses.
The data structure can be optimized for efficient lookup.
Lookup operations are simple and quick.
Dictionaries can be implemented as lists, binary search trees, or hash tables.
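A Python dict (a hash table) makes the idea concrete: precomputed word analyses are stored once and retrieved in constant average time. The entries and the `analyze` function are invented for illustration.

```python
# Dictionary-lookup sketch: precomputed morphological analyses stored
# in a hash table for fast retrieval.
LEXICON = {
    "cats":    {"lemma": "cat",   "pos": "NOUN", "number": "plural"},
    "ran":     {"lemma": "run",   "pos": "VERB", "tense": "past"},
    "happier": {"lemma": "happy", "pos": "ADJ",  "degree": "comparative"},
}

def analyze(word):
    return LEXICON.get(word.lower())  # None if the word is not listed

print(analyze("cats"))
# {'lemma': 'cat', 'pos': 'NOUN', 'number': 'plural'}
```

The obvious limitation is coverage: any form not precomputed into the lexicon returns nothing, which is why rule-based models such as finite-state morphology are used alongside lookup.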
Finite State Morphology:
Finite State Morphology (FSM) is a computational approach to the analysis
and generation of word forms in natural language processing (NLP).
It uses finite state automata (FSA) or finite state transducers (FST) to model
the morphological rules of a language.
It models how words are built (generation) or broken down (analysis) using
finite-state machines, specifically finite-state transducers (FSTs).
Finite State Automata (FSA) :
FSA are computational models used to represent the states and transitions
between those states within a system.
In the context of morphology, an FSA can model the legal sequences of
morphemes and their modifications based on the grammatical rules of a
language.
Example of Finite State Morphology: English Plurals
Consider a simplified model for generating the plural forms of English nouns
using an FST. English plural formation generally follows a few rules, such as
adding "-s" to the end of a word, but with exceptions (e.g., "child" to "children").
1. Regular Plurals: The majority of English nouns form their plural by adding "-
s" or "-es" (if the noun ends in s, sh, ch, x, or z). An FST can encode these
rules with states and transitions that append the correct suffix based on the
final letters of the noun.
For example, the word "cat" transitions through states that recognize the word and
then apply the rule to add "s," producing "cats."
2. Irregular Plurals: For irregular forms, the FST would include specific
transitions for these exceptions. For "child," the FST would have a
transition that maps "child" directly to "children" without following the
regular rule.
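The plural rules above can be sketched as a simple rule cascade in Python. This is only an approximation of what an FST encodes: a real toolkit (such as foma or HFST) would compile equivalent rules into an actual transducer, and the irregular table here lists just a few assumed examples.

```python
# Rule-cascade sketch of English plural formation, mirroring the FST
# description: check irregulars first, then the "-es" contexts, then "-s".
IRREGULAR = {"child": "children", "mouse": "mice", "foot": "feet"}

def pluralize(noun):
    if noun in IRREGULAR:                       # direct irregular mapping
        return IRREGULAR[noun]
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"                      # nouns ending in s/sh/ch/x/z
    return noun + "s"                           # default regular rule

print(pluralize("cat"))    # cats
print(pluralize("box"))    # boxes
print(pluralize("child"))  # children
```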
Unification Based Morphology:
Unification-based morphology is a computational approach to the analysis
and generation of word forms that relies on the concept of unification.
This is a process of merging two or more feature structures, aligning them by
their shared attributes.
Unification is the process of combining (merging) feature structures, where
features (such as tense, number, and gender) associated with morphemes are
unified to form a correct morphological form.
Example: "ran"
Feature Structure 1: [Number: Singular]
Feature Structure 2: [Tense: Past]
Unified Structure: [Number: Singular, Tense: Past]
Here the word "ran" is singular and past tense (e.g., "He ran"). Similarly,
"run" + "-ing" results in a unified feature structure for "running".
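Unification itself is a small operation: two feature structures unify only if they agree on every shared attribute, and the result merges both. The sketch below models feature structures as plain Python dicts; real unification grammars also handle nested and underspecified features.

```python
# Minimal feature-structure unification over flat dicts: fail on any
# conflicting shared attribute, otherwise merge the two structures.
def unify(fs1, fs2):
    for key in fs1.keys() & fs2.keys():
        if fs1[key] != fs2[key]:
            return None  # conflicting values: unification fails
    return {**fs1, **fs2}

print(unify({"Number": "Singular"}, {"Tense": "Past"}))
# {'Number': 'Singular', 'Tense': 'Past'}
print(unify({"Number": "Singular"}, {"Number": "Plural"}))
# None
```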
Functional Morphology:
It focuses on the relationship between a word's form and its function in conveying
grammatical and semantic information, such as marking tense, number, case, and gender.
Tense (English Verbs)
Base Form Tense Word Form
play Present play
play Past played
play Continuous playing
play 3rd Person Singular plays
Number (Nouns)
Base Form Number Word Form
cat Singular cat
cat Plural cats
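The functional view in the tables above amounts to a mapping from (base form, grammatical feature) to a surface word form. The sketch below covers only the regular examples shown; the `inflect` function and its rule table are invented for illustration and ignore irregular forms.

```python
# Functional-morphology sketch: a function from a base form and a
# grammatical feature to a surface word form (regular patterns only).
def inflect(base, feature):
    rules = {
        "Past": base + "ed",
        "Continuous": base + "ing",
        "3rd Person Singular": base + "s",
        "Plural": base + "s",
    }
    return rules.get(feature, base)  # Present / Singular: the base form

print(inflect("play", "Past"))    # played
print(inflect("cat", "Plural"))   # cats
```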
Finding the Structure of Documents
Finding the Structure of Documents –
Finding the structure of documents means identifying and understanding how a
document is organized, including sections, headings, paragraphs, titles, bullet
points, and more.
Improves Search and Retrieval: Understanding the structure of documents enables
better search and retrieval, as search algorithms can target specific sections (e.g.,
headings, paragraphs, references) to produce more relevant results.
• Facilitates Summarization: Knowing how a document is structured allows
automated systems to summarize key sections like introductions, conclusions, or
results in research papers.
• Enhances Information Extraction: Systems can extract specific information like
dates, addresses, or citations by recognizing different parts of a document's
structure.
Challenges in Finding the Structure of Documents:
• Varied Formats: Documents can come in various formats such as PDFs, HTML,
XML, Word documents, and scanned images. Each format requires different pre-
processing techniques.
• Complex Layouts: Documents with tables, multi-column layouts, images, and
footnotes pose a challenge for document structure analysis.
• Noisy Data: Scanned documents and low-quality images often introduce noise,
making it difficult to accurately extract text or identify structure.
Steps Involved in Document Structure Analysis:
Preprocessing
The first step involves converting the document into a format suitable for
analysis, such as plain text, and performing some cleaning:
• Text Conversion: Converting documents from formats like PDF or
images into machine-readable formats (e.g., plain text, HTML) using
tools such as Tesseract OCR (for images) or PDFBox (for PDFs).
• Noise Removal: Cleaning irrelevant information like metadata,
watermarks, headers, and footers.
Segmentation: Breaking down the document into sentences, paragraphs, and
sections for easier analysis.
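The segmentation step above can be sketched with the standard `re` module: paragraphs split on blank lines, sentences split after terminal punctuation. The `segment` function is an invented, naive illustration; real pipelines use trained segmenters after OCR/text conversion.

```python
# Preprocessing/segmentation sketch: plain text -> paragraphs -> sentences.
import re

def segment(text):
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive sentence split after ., !, or ? followed by whitespace.
    return [re.split(r"(?<=[.!?])\s+", p) for p in paragraphs]

doc = "First para. Two sentences.\n\nSecond para here."
print(segment(doc))
# [['First para.', 'Two sentences.'], ['Second para here.']]
```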
Layout Analysis
Layout analysis is the task of identifying and interpreting the physical layout
of the document:
• Detecting Headers and Footers: These elements may be irrelevant for
document content analysis but important for metadata extraction.
• Identifying Columns: In multi-column layouts (common in research
papers), each column should be analyzed separately.
• Recognizing Font Styles: Features like bold or italic text, larger font
sizes, and indentation help identify titles, headings, or emphasized text
Feature Extraction
Extracting key features from the text to distinguish between various elements
like headings, lists, and paragraphs:
• Formatting Cues: Information such as font size, style (bold, italic), and
alignment (center, left) is helpful in detecting section headers, lists, or quotes.
• Keyword Detection: Keywords like “Introduction”, “Conclusion”, or
“References” often signal important structural elements.
• Punctuation Patterns: Identifying bullet points, numbering schemes, or
figure labels.
Classification
Once the features are extracted, machine learning models or rule-based
systems classify each section or block of text into its corresponding category:
Hierarchy Construction
After classifying each element, the next step is to build a logical hierarchy:
• Heading Levels: Determining the hierarchy of headings and
subheadings (e.g., H1, H2, H3) based on their relative importance or indentation.
• Content Organization: Grouping related paragraphs under their respective
headings and associating figures, tables, or footnotes with the relevant
sections.
Semantic Analysis
Semantic analysis further enhances document structure understanding by
interpreting the meaning behind different sections:
• Contextual Understanding: Using Natural Language Processing (NLP)
techniques, semantic analysis can determine the meaning of headings, detect
topics, or summarize sections.
• Named Entity Recognition (NER): Detecting entities such as dates, names,
places, or references within sections.
Validation and Correction
The extracted structure may be validated either manually or through
automated methods:
• Manual Validation: This may involve a human reviewing the output to correct
errors in document structure recognition.
• Automated Validation: Rule-based systems can check for
inconsistencies, such as missing headings or misplaced tables, to
ensure that the hierarchy is properly constructed.
Identifying The Boundaries:
Sentence Boundary Detection
Sentence Boundary Detection (SBD), also known as sentence segmentation, is
the process of identifying where one sentence ends and the next begins.
Steps Involved in Sentence Boundary Detection:
2.Handling Ambiguities:
3.Supervised Machine Learning (ML) Models:
ML models can be trained to detect sentence boundaries by learning from labeled
data. These models use features like:
▪ Punctuation patterns.
▪ Part of Speech (POS) tags (e.g., period followed by a capital letter could indicate a
new sentence).
Example:
“He said, ‘Let’s go.’ Then they left.”
Desired SBD Output:
“He said, ‘Let’s go.’”
“Then they left.”
So, the boundary is after the closing quote, not just after "go."
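A rule-based baseline for SBD uses exactly the features listed above: a period (optionally followed by a closing quote) and a following capital letter, with a small abbreviation list to suppress false boundaries. The function and abbreviation list below are an invented sketch; trained ML segmenters outperform this kind of rule set.

```python
# Naive rule-based sentence boundary detector: split on ., !, or ?
# (optionally followed by a quote) before whitespace and a capital letter,
# skipping boundaries that would end on a known abbreviation.
import re

ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e."}

def split_sentences(text):
    parts, start = [], 0
    for m in re.finditer(r'[.!?]["\'’”]?\s+(?=[A-Z])', text):
        candidate = text[start:m.end()].strip()
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # the period ends an abbreviation, not a sentence
        parts.append(candidate)
        start = m.end()
    rest = text[start:].strip()
    if rest:
        parts.append(rest)
    return parts

print(split_sentences("Dr. Smith left. He said hi."))
# ['Dr. Smith left.', 'He said hi.']
```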
Topic Boundary Detection
Topic Boundary Detection involves identifying where one topic ends and another
begins within a document. This is important for understanding the logical
structure and flow of content in long documents such as research papers, books,
or articles.
1.Identifying Topic Shifts:
Topic boundary detection aims to recognize changes in the subject matter. This
could be based on:
▪ Keyword Clustering: When the text shifts from one group of keywords to
another (e.g., from discussing "dogs" to discussing "cats"), it may indicate a
topic shift.
▪ Semantic Change: A significant change in the context of words in
consecutive sentences.
Example:
▪ Input: "In the first chapter, we discuss the properties of water. In the next
chapter, we analyze the effects of temperature on metals."
▪ Topic boundary detected between "properties of water" and "effects of
temperature on metals."
3.TextTiling Algorithm:
A classic unsupervised method that segments text into topical blocks by examining
word frequency distributions and lexical similarities between sections. It detects
topic boundaries by measuring the dissimilarity between adjacent blocks of text.
▪ Example:
▪ Input: "Chapter 1: Introduction... Chapter 2: Methods..."
▪ The algorithm would detect a boundary between these chapters based on word
frequency and semantic similarity.
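The core of the TextTiling idea can be sketched by scoring lexical similarity between adjacent blocks of text: a low cosine similarity between word-frequency vectors suggests a topic boundary. The helper names below are invented, and this omits TextTiling's smoothing and depth-scoring steps.

```python
# Lexical-cohesion sketch behind TextTiling: compare word-frequency
# vectors of adjacent text blocks; low similarity suggests a boundary.
import re
from collections import Counter
from math import sqrt

def block_vector(text):
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a.keys() & b.keys())
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

left = "water boils water freezes water flows"
right = "metals expand metals conduct heat"
print(cosine(block_vector(left), block_vector(right)))  # 0.0 -> likely boundary
```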
4.Human-Assisted Approaches:
Manual Topic Annotation: For important documents like legal papers or reports,
manual topic annotation can be used to identify topic boundaries. This is often
done in combination with automated methods to improve accuracy.
Example: Lawyers might mark where different sections of a contract begin and
end, like "Definitions", "Terms", and "Signatures".