Text Pre-processing and TF-IDF: Foundations of Text Analysis
Text Pre-processing: Preparing Text for Analysis
In the field of AI and data analytics, we often encounter data in the form of
unstructured text. To effectively analyze this text using computational
methods, we need to transform it into a structured format that machines
can understand. This process is called text pre-processing.
Why is Text Pre-processing Necessary?
Unstructured Data: Raw text is often messy and lacks a defined
structure. It may contain various inconsistencies, irrelevant
information, and formatting that can hinder analysis.
Numerical Input for AI: Most AI and machine learning models
require numerical input. Text data, being symbolic, needs to be
converted into a numerical representation.
Common Text Pre-processing Steps
1. Tokenization: Breaking Down Text
o Tokenization is the process of splitting text into smaller units
called tokens.
o Tokens can be words, subwords, or characters.
o This step converts a continuous string of text into discrete
elements.
o For example, the sentence "Welcome to the world of AI!" can
be tokenized into the following list of tokens: ["Welcome", "to",
"the", "world", "of", "AI", "!"]
o Python libraries like NLTK provide tools for tokenization.
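A minimal sketch of tokenization with NLTK's word_tokenize (it
assumes the tokenizer models have been downloaded, e.g. the "punkt"
resource):

import nltk
nltk.download("punkt", quiet=True)  # tokenizer models (resource name may vary by NLTK version)
from nltk.tokenize import word_tokenize

sentence = "Welcome to the world of AI!"
tokens = word_tokenize(sentence)
print(tokens)  # ['Welcome', 'to', 'the', 'world', 'of', 'AI', '!']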
2. Cleaning: Making Text Consistent
o Cleaning involves removing or standardizing irrelevant
information to reduce noise and improve data consistency.
o Common cleaning operations include:
Removing punctuation (!, ?, ., etc.)
Removing special characters (#, @, *, etc.)
Converting text to lowercase (to treat "The" and "the"
the same)
Removing numbers (if not relevant to the analysis)
Expanding abbreviations and contractions (e.g., "Dr." to
"Doctor", "it's" to "it is")
Removing extra whitespace
o For example, the input "Welcome to the world of AI!!! It's
amazing, isn't it?" can be cleaned to: "welcome to the world of
ai it is amazing is not it".
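A minimal cleaning sketch using Python's built-in re module; the
contraction map here is a small illustrative assumption, not a
complete list:

import re

CONTRACTIONS = {"it's": "it is", "isn't": "is not"}  # illustrative only

def clean(text):
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^a-z\s]", " ", text)     # drop punctuation, digits, symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace

print(clean("Welcome to the world of AI!!! It's amazing, isn't it?"))
# welcome to the world of ai it is amazing is not it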
3. Stop Word Removal: Filtering Out Commonplace Words
o Stop words are common words that appear frequently in a
language but carry little meaningful information for many text
analysis tasks.
o Examples of stop words in English include "the", "is", "a",
"and", "in", "to", "I", and "you".
o Removing stop words can help focus on the more important
terms in a text.
o For example, the sentence "The quick brown fox jumps over
the lazy dog" becomes "quick brown fox jumps lazy dog" after
stop word removal.
o NLTK provides lists of stop words for various languages.
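For example, a short sketch using NLTK's English stop word list
(assuming the "stopwords" and tokenizer data have been downloaded):

import nltk
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
sentence = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(sentence.lower())
print([t for t in tokens if t not in stop_words])
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']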
4. Stemming: Reducing Words to Their Roots
o Stemming reduces words to their root or base form by
removing suffixes.
o It is a simpler and faster approach than lemmatization.
o For example, with the Porter stemmer:
"running", "runs" -> "run"
"easily" -> "easili", "easy" -> "easi"
o Note that stemming does not always produce a valid word, and
it misses irregular forms: "ran" is left unchanged (mapping it
to "run" requires lemmatization), while both "university" and
"universe" are stemmed to "univers".
5. Lemmatization: Finding the Dictionary Form
o Lemmatization reduces words to their base or dictionary form,
called the lemma.
o It is more sophisticated than stemming because it uses a
vocabulary (such as WordNet) and the word's part of speech
rather than simply stripping suffixes.
o Lemmatization ensures that the resulting word is a valid word.
o For example:
"better", "best" -> "good"
"went" -> "go"
"are", "is", "was" -> "be"
o Lemmatization is generally more computationally expensive
than stemming.
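A minimal sketch with NLTK's WordNetLemmatizer (it assumes the
WordNet data has been downloaded; the pos argument supplies the part
of speech, e.g. "a" for adjective, "v" for verb):

import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # good
print(lemmatizer.lemmatize("went", pos="v"))    # go
print(lemmatizer.lemmatize("are", pos="v"))     # be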
Text Analysis: Weighing Word Importance with TF-IDF
Once the text has been pre-processed, we can begin to analyze its
content. A common technique for this is TF-IDF, which helps us
understand the importance of words within a document relative to a
collection of documents.
What is TF-IDF?
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is
a statistical measure that assigns a score to each word in a document
based on its importance.
Term Frequency (TF): Measures how often a word appears in a
specific document. The more often a word appears in a document,
the more relevant it is assumed to be to that document's content.
Inverse Document Frequency (IDF): Measures how rare a word
is across a collection of documents (corpus). Words that appear in
many documents are less informative than words that appear in
only a few.
The TF-IDF score is calculated by multiplying the TF and IDF scores:
TF-IDF(t, d) = TF(t, d) * IDF(t)
where IDF is commonly computed as IDF(t) = log(N / df(t)), with N
the total number of documents in the corpus and df(t) the number of
documents containing the term t.
A high TF-IDF score indicates that a word is frequent in a given document
but rare across the corpus, suggesting that it is an important word for
understanding the document's content.
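A small self-contained sketch of this textbook formulation (the
corpus below is invented for illustration; libraries such as
scikit-learn use smoothed variants, so their scores differ slightly):

import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats chased the dogs",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)           # one common TF variant: relative frequency

def idf(term):
    df = sum(1 for doc in docs if term in doc)  # documents containing the term
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("cat", docs[0]))  # high: frequent here, rare in the corpus
print(tf_idf("the", docs[0]))  # 0.0: appears in every document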
Why is TF-IDF Useful?
Identifies Important Words: TF-IDF helps to highlight the words
that are most characteristic of a document.
Filters Out Common Words: It downweights the importance of
common words (like "the", "is", "and") that appear frequently in all
documents and thus provide little discriminatory power.
Applications: TF-IDF is widely used in various applications,
including:
o Information Retrieval: Ranking search results based on
their relevance to a query.
o Text Classification: Categorizing documents into different
groups or topics.
o Keyword Extraction: Identifying the most important words
or phrases in a document.
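As one illustration of keyword extraction, a sketch using
scikit-learn's TfidfVectorizer, which combines tokenization, stop
word removal, and TF-IDF weighting (with a smoothed IDF) in one step:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats chased the dogs",
]
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

# Rank each document's terms by TF-IDF score, highest first.
for i, doc in enumerate(corpus):
    scores = matrix[i].toarray()[0]
    ranked = sorted(zip(terms, scores), key=lambda p: p[1], reverse=True)
    print(doc, "->", [term for term, score in ranked if score > 0])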