Normalization and Tokenization in NLP
Normalization and tokenization are essential pre-processing
steps for natural language processing tasks, preparing text
data for analysis and understanding.
By: Aditi Singh
Reg. No. : 219301596
Introduction to Natural Language Processing

1. Human Language Understanding: NLP aims to bridge the gap between humans and computers by enabling machines to understand and interpret human language.

2. Diverse Applications: NLP powers applications like machine translation, sentiment analysis, chatbot interactions, and text summarization.

3. Data Preparation: Preprocessing text data through normalization and tokenization is crucial for effective NLP model training.
What is Normalization?

Consistent Representation: Normalization aims to transform text into a standardized form, reducing variations and inconsistencies that can hinder NLP tasks.

Reduced Ambiguity: By addressing inconsistencies in word forms and spellings, normalization helps NLP models interpret text more accurately.
Improved Analysis: Normalization ensures that variant forms of the same word are treated alike, enhancing the accuracy and reliability of NLP analysis.
Techniques for Normalization

Case Folding: Converting text to lowercase to eliminate case sensitivity. For example, "The" and "the" become the same token.

Stemming: Reducing words to their root form by removing suffixes. For example, "running" becomes "run".

Lemmatization: Converting words to their dictionary form, considering grammatical context. For example, "better" becomes "good".
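Case folding and stemming can be sketched in plain Python. The `naive_stem` function below is a toy suffix stripper written for this illustration, not a real algorithm such as the Porter stemmer; lemmatization is not shown because it additionally needs a dictionary (e.g. WordNet) to map "better" to "good".

```python
# A minimal sketch of case folding and stemming, standard library only.
# naive_stem is a toy illustration; production systems use algorithms
# such as the Porter stemmer.

def case_fold(text):
    """Case folding: map every character to lowercase."""
    return text.lower()

def naive_stem(word):
    """Toy stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]  # "runn" -> "run"
            return stem
    return word

words = ["The", "Running", "jumped", "cats"]
normalized = [naive_stem(case_fold(w)) for w in words]
print(normalized)  # ['the', 'run', 'jump', 'cat']
```

Note how case folding alone already merges "The" and "the", and the suffix rules collapse inflected forms onto a shared stem.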
Stemming vs. Lemmatization

Technique       Input Word   Output
Stemming        running      run
Lemmatization   running      run
Stemming        better       better
Lemmatization   better       good
What is Tokenization?

Breaking Down Text: Tokenization involves splitting text into individual units called tokens, which can be words, punctuation marks, or other symbols.

Meaningful Units: Tokens represent the basic units of meaning in text, allowing NLP models to analyze and understand the structure of language.

Further Processing: Tokens are often used as input for other NLP tasks, such as part-of-speech tagging and named entity recognition.
Methods of Tokenization
Word-Based
Splits text into individual words, treating spaces as delimiters.
Character-Based
Breaks text into individual characters, useful for languages with complex
writing systems.
Subword-Based
Divides text into smaller units, such as morphemes or learned subword pieces (as in byte-pair encoding), balancing vocabulary size against the ability to handle rare and unseen words.
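The first two methods can be sketched with Python's standard library. The regex pattern below is one common heuristic for word-level splitting, not a standard; subword methods such as BPE learn their vocabulary from a training corpus, so a faithful example is omitted here.

```python
import re

text = "Tokenization isn't hard, is it?"

# Word-based: runs of word characters, with each punctuation
# mark kept as its own token.
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(word_tokens)
# ['Tokenization', 'isn', "'", 't', 'hard', ',', 'is', 'it', '?']

# Character-based: every non-space character is its own token.
char_tokens = [ch for ch in text if not ch.isspace()]
print(char_tokens[:5])  # ['T', 'o', 'k', 'e', 'n']
```

Notice that the word-based regex splits the contraction "isn't" into three tokens; whether that is desirable is exactly the kind of context-specific decision discussed below.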
Handling Special Characters and Punctuation

1. Punctuation Removal: Removing punctuation marks simplifies tokenization, but punctuation can carry meaning in some contexts and should not always be discarded.

2. Special Character Handling: Treating special characters as individual tokens or removing them, depending on the NLP task and context.

3. Context-Specific Rules: Tokenization strategies can vary based on the specific language, domain, and NLP task at hand.
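A minimal sketch of the two punctuation strategies above, using only the standard library. The URL in the sample text is a deliberately chosen example: stripping punctuation mangles "example.com", which is why context-specific rules matter.

```python
import re
import string

text = "Great product! Visit example.com for more."

# Strategy 1: drop punctuation entirely, then split on whitespace.
# Simple, but note that "example.com" collapses to "examplecom".
no_punct = text.translate(str.maketrans("", "", string.punctuation))
removed = no_punct.split()
print(removed)

# Strategy 2: keep each punctuation mark as its own token, so
# downstream rules can decide what to do with it.
kept = re.findall(r"\w+|[^\w\s]", text)
print(kept)
```

A domain-aware tokenizer would add a rule to recognize URLs before applying either strategy; that refinement is left out of this sketch.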
Applications of Normalization and Tokenization

Machine Translation: Normalization and tokenization prepare text for accurate translation across languages.

Sentiment Analysis: Normalization helps analyze the emotional tone of text, enabling businesses to understand customer feedback.

Chatbot Interactions: Tokenization enables chatbots to process user input and generate relevant responses in natural language.
Conclusion and Key Takeaways

1. Essential Preprocessing: Normalization and tokenization are crucial preprocessing steps for NLP tasks, preparing text for analysis and understanding.

2. Improved Accuracy: These techniques enhance the accuracy and reliability of NLP models by reducing ambiguity and inconsistencies in text.

3. Diverse Applications: Normalization and tokenization are fundamental to a wide range of NLP applications, enabling machines to interpret and interact with human language.