Normalization and Tokenization in NLP
Normalization and tokenization are essential pre-processing
steps for natural language processing tasks, preparing text
data for analysis and understanding.
By: Aditi Singh
Reg. No. : 219301596
Introduction to Natural Language Processing

1. Human Language Understanding: NLP aims to bridge the gap between humans and computers by enabling machines to understand and interpret human language.

2. Diverse Applications: NLP powers applications like machine translation, sentiment analysis, chatbot interactions, and text summarization.

3. Data Preparation: Preprocessing text data through normalization and tokenization is crucial for effective NLP model training.
What is Normalization?

Consistent Representation: Normalization aims to transform text into a standardized form, reducing variations and inconsistencies that can hinder NLP tasks.

Reduced Ambiguity: By addressing inconsistencies in word forms and spellings, normalization helps NLP models interpret text more accurately.
Improved Analysis: Normalization ensures that variant forms of the same word are treated alike, enhancing the accuracy and reliability of NLP analysis.
Techniques for Normalization

Case Folding: Converting text to lowercase to eliminate case sensitivity. For example, "The" and "the" become the same token.

Stemming: Reducing words to their root form by removing suffixes. For example, "running" becomes "run".

Lemmatization: Converting words to their dictionary form, considering grammatical context. For example, "better" becomes "good".
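Case folding and stemming can be sketched in plain Python. The `naive_stem` function below is a toy suffix stripper written for this illustration, not a real algorithm such as the Porter stemmer; lemmatization is not shown because it additionally needs a dictionary (e.g. WordNet) to map "better" to "good".

```python
# A minimal sketch of case folding and stemming, standard library only.
# naive_stem is a toy illustration; production systems use algorithms
# such as the Porter stemmer.

def case_fold(text):
    """Case folding: map every character to lowercase."""
    return text.lower()

def naive_stem(word):
    """Toy stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]  # "runn" -> "run"
            return stem
    return word

words = ["The", "Running", "jumped", "cats"]
normalized = [naive_stem(case_fold(w)) for w in words]
print(normalized)  # ['the', 'run', 'jump', 'cat']
```

Note how case folding alone already merges "The" and "the", and the suffix rules collapse inflected forms onto a shared stem.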
Stemming vs. Lemmatization

Technique       Input Word   Output
Stemming        running      run
Lemmatization   running      run
Stemming        better       better
Lemmatization   better       good
What is Tokenization?

Breaking Down Text: Tokenization involves splitting text into individual units called tokens, which can be words, punctuation marks, or other symbols.

Meaningful Units: Tokens represent the basic units of meaning in text, allowing NLP models to analyze and understand the structure of language.

Further Processing: Tokens are often used as input for other NLP tasks, such as part-of-speech tagging and named entity recognition.
Methods of Tokenization
Word-Based
Splits text into individual words, treating spaces as delimiters.
Character-Based
Breaks text into individual characters, useful for languages with complex
writing systems.
Subword-Based
Divides text into smaller units, such as morphemes or learned subword pieces (as in byte-pair encoding), balancing vocabulary size against the ability to handle rare and unseen words.
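The first two methods can be sketched with Python's standard library. The regex pattern below is one common heuristic for word-level splitting, not a standard; subword methods such as BPE learn their vocabulary from a training corpus, so a faithful example is omitted here.

```python
import re

text = "Tokenization isn't hard, is it?"

# Word-based: runs of word characters, with each punctuation
# mark kept as its own token.
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(word_tokens)
# ['Tokenization', 'isn', "'", 't', 'hard', ',', 'is', 'it', '?']

# Character-based: every non-space character is its own token.
char_tokens = [ch for ch in text if not ch.isspace()]
print(char_tokens[:5])  # ['T', 'o', 'k', 'e', 'n']
```

Notice that the word-based regex splits the contraction "isn't" into three tokens; whether that is desirable is exactly the kind of context-specific decision discussed below.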
Handling Special Characters and Punctuation

1. Punctuation Removal: Removing punctuation marks simplifies tokenization, but punctuation can carry meaning in some contexts and should not always be discarded.

2. Special Character Handling: Treating special characters as individual tokens or removing them, depending on the NLP task and context.

3. Context-Specific Rules: Tokenization strategies can vary based on the specific language, domain, and NLP task at hand.
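A minimal sketch of the two punctuation strategies above, using only the standard library. The URL in the sample text is a deliberately chosen example: stripping punctuation mangles "example.com", which is why context-specific rules matter.

```python
import re
import string

text = "Great product! Visit example.com for more."

# Strategy 1: drop punctuation entirely, then split on whitespace.
# Simple, but note that "example.com" collapses to "examplecom".
no_punct = text.translate(str.maketrans("", "", string.punctuation))
removed = no_punct.split()
print(removed)

# Strategy 2: keep each punctuation mark as its own token, so
# downstream rules can decide what to do with it.
kept = re.findall(r"\w+|[^\w\s]", text)
print(kept)
```

A domain-aware tokenizer would add a rule to recognize URLs before applying either strategy; that refinement is left out of this sketch.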
Applications of Normalization and Tokenization

Machine Translation: Normalization and tokenization prepare text for accurate translation across languages.

Sentiment Analysis: Normalization helps analyze the emotional tone of text, enabling businesses to understand customer feedback.

Chatbot Interactions: Tokenization enables chatbots to process user input and generate relevant responses in natural language.
Conclusion and Key Takeaways

1. Essential Preprocessing: Normalization and tokenization are crucial preprocessing steps for NLP tasks, preparing text for analysis and understanding.

2. Improved Accuracy: These techniques enhance the accuracy and reliability of NLP models by reducing ambiguity and inconsistencies in text.

3. Diverse Applications: Normalization and tokenization are fundamental to a wide range of NLP applications, enabling machines to interpret and interact with human language.