
Normalization and Tokenization in NLP

Normalization and tokenization are essential pre-processing steps for natural language processing tasks, preparing text data for analysis and understanding.

By: Aditi Singh
Reg. No.: 219301596
Introduction to Natural Language Processing

1 Human Language Understanding
NLP aims to bridge the gap between humans and computers by enabling machines to understand and interpret human language.

2 Diverse Applications
NLP powers applications like machine translation, sentiment analysis, chatbot interactions, and text summarization.

3 Data Preparation
Preprocessing text data through normalization and tokenization is crucial for effective NLP model training.
What is Normalization?

Consistent Representation
Normalization aims to transform text into a standardized form, reducing variations and inconsistencies that can hinder NLP tasks.

Reduced Ambiguity
By addressing inconsistencies in word forms and spellings, normalization helps NLP models interpret text more accurately.

Improved Analysis
Normalization ensures that words with similar meanings are treated as the same, enhancing the accuracy and reliability of NLP analysis.
Techniques for Normalization

Case Folding
Converting text to lowercase to eliminate case sensitivity. For example, "The" and "the" become the same.

Stemming
Reducing words to their root form by removing suffixes. For example, "running" becomes "run".

Lemmatization
Converting words to their dictionary form, considering grammatical context. For example, "better" becomes "good".
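The three techniques above can be sketched in plain Python. This is a simplified illustration, not the Porter stemmer or a WordNet lemmatizer: the suffix list and the lemma lookup table are toy assumptions chosen for demonstration.

```python
def case_fold(text: str) -> str:
    """Case folding: lowercase everything so "The" and "the" match."""
    return text.lower()

def naive_stem(word: str) -> str:
    """Toy suffix-stripping stemmer (simplified; NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant left behind, e.g. "runn" -> "run".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

# A real lemmatizer consults a dictionary plus grammatical context;
# this lookup table is a toy stand-in for illustration only.
LEMMA_TABLE = {"better": "good", "ran": "run", "mice": "mouse"}

def naive_lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)
```

For instance, `case_fold("The")` gives `"the"`, `naive_stem("running")` gives `"run"`, and `naive_lemmatize("better")` gives `"good"`, matching the slide's examples.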
Stemming vs. Lemmatization

Technique       Input Word    Output
Stemming        running       run
Lemmatization   running       run
Stemming        better        better
Lemmatization   better        good

What is Tokenization?

Breaking Down Text
Tokenization involves splitting text into individual units called tokens, which can be words, punctuation marks, or other symbols.

Meaningful Units
Tokens represent the basic units of meaning in text, allowing NLP models to analyze and understand the structure of language.

Further Processing
Tokens are often used as input for other NLP tasks, such as part-of-speech tagging and named entity recognition.
Methods of Tokenization

Word-Based
Splits text into individual words, treating spaces as delimiters.

Character-Based
Breaks text into individual characters, useful for languages with complex
writing systems.

Subword-Based
Divides text into smaller units, such as morphemes or frequent character
sequences (as in byte-pair encoding), balancing vocabulary size against
coverage of rare words.
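Each method above can be sketched with the standard library. The regex, the toy vocabulary, and the greedy longest-match loop are illustrative assumptions, not any particular tokenizer's implementation.

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Word-based: split on whitespace, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text: str) -> list[str]:
    """Character-based: every character becomes a token."""
    return list(text)

def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Subword-based: greedy longest-match against a (toy) subword vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: fall back to a single char
            i += 1
    return tokens
```

For example, `word_tokenize("Hello, world!")` yields `["Hello", ",", "world", "!"]`, and `subword_tokenize("unbreakable", {"un", "break", "able"})` yields `["un", "break", "able"]`.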
Handling Special Characters and Punctuation

1 Punctuation Removal
Removing punctuation marks to simplify tokenization, while considering
their potential meaning in some contexts.

2 Special Character Handling
Treating special characters as individual tokens or removing them,
depending on the NLP task and context.

3 Context-Specific Rules
Tokenization strategies can vary based on the specific language, domain,
and NLP task at hand.
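The rules above can be combined into one small filter. The `keep` parameter is a hypothetical illustration of a context-specific exception, such as preserving "?" when question detection matters.

```python
import string

PUNCT = set(string.punctuation)  # !"#$%&'()*+,-./ etc.

def remove_punctuation(tokens: list[str],
                       keep: frozenset[str] = frozenset()) -> list[str]:
    """Drop punctuation tokens, except any listed in `keep`
    (a context-specific rule supplied by the caller)."""
    return [t for t in tokens if t not in PUNCT - keep]
```

So `remove_punctuation(["Hi", ",", "there", "?"])` gives `["Hi", "there"]`, while passing `keep=frozenset({"?"})` retains the question mark.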
Applications of Normalization and Tokenization

Machine Translation
Normalization and tokenization prepare text for accurate translation across languages.

Sentiment Analysis
Normalization helps analyze the emotional tone of text, enabling businesses to understand customer feedback.

Chatbot Interactions
Tokenization enables chatbots to process user input and generate relevant responses in natural language.
Conclusion and Key Takeaways

1 Essential Preprocessing
Normalization and tokenization are crucial preprocessing steps for NLP tasks, preparing text for analysis and understanding.

2 Improved Accuracy
These techniques enhance the accuracy and reliability of NLP models by reducing ambiguity and inconsistencies in text.

3 Diverse Applications
Normalization and tokenization are fundamental to a wide range of NLP applications, enabling machines to interpret and interact with human language.
