Text Preprocessing with NLTK and spaCy
Objective: Understand and apply essential text preprocessing techniques using Python's NLTK and spaCy
libraries to prepare raw text data for machine learning models and analysis.
In this notebook, we will explore how to clean, normalize, and structure raw text data from various sources.
Effective text preprocessing is a fundamental step in Natural Language Processing (NLP) and is crucial for
tasks like text classification, sentiment analysis, information retrieval, and topic modeling. We will use a
case study approach, imagining we are preparing a collection of documents (like resumes, speeches, or
long articles) for analysis.
NLTK (Natural Language Toolkit): A comprehensive library often used for teaching and research,
providing access to many corpora and lexical resources.
spaCy: An industrial-strength library designed for efficiency and speed, particularly good for production
use cases and offering pre-trained models for various languages.
Comparison Table

| Feature | NLTK (Natural Language Toolkit) | spaCy |
| --- | --- | --- |
| Speed | Generally slower for core tasks like tokenization, POS, NER. | Generally faster for core tasks; built for performance. |
| Functionality | Very broad range of algorithms, corpora, and lexical resources. | Focused on providing efficient, accurate core NLP functionalities. |
| Architecture | Modular; user often chains together individual components. | Pipeline-based; core functionalities are integrated. |
| Ease of Use (Core) | Requires more manual steps (e.g., explicit POS tagging for lemmatization). | More opinionated and streamlined API for common tasks. |
In [ ]:
# Install necessary libraries
!pip install -q nltk spacy docx2txt pandas matplotlib wordcloud
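The log below also shows NLTK resource downloads and the spaCy model download, whose commands were cut from this export; a minimal sketch of what likely produced it (the resource list is inferred from the log):

import nltk

# Download the NLTK resources used in this notebook
for pkg in ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger', 'omw-1.4', 'punkt_tab']:
    nltk.download(pkg)

# Download spaCy's small English model
!python -m spacy download en_core_web_sm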
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
Collecting en-core-web-sm==3.8.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm
-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 81.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
3. Data Loading
In a real-world scenario, text data can come from various sources like text files (.txt), PDFs (.pdf), CSVs
(.csv), Word documents (.docx), databases, or web scraping.
For this notebook, we will primarily work with a sample text string to demonstrate the preprocessing steps
clearly. However, let's briefly touch upon loading from different formats.
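For example, plain-text and CSV sources can be read in a couple of lines (a minimal sketch; the file names and the "text" column here are placeholders):

import pandas as pd

# Plain text file
with open("document.txt", "r", encoding="utf-8") as f:
    text = f.read()

# CSV with a text column
df = pd.read_csv("documents.csv")
texts = df["text"].tolist()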
Let's create a sample text for our case study. This text is loosely based on the Wikipedia article on machine learning.
In [ ]:
# Sample text representing a part of a document
sample_document = """
Machine learning (ML) is a field of study in artificial intelligence concerned with the
Machine learning algorithms are used in a wide variety of applications, such as email fi
In its application across business problems, machine learning is also referred to as pre
"""
In [ ]:
# Let's also add some text with punctuation, numbers, and mixed casing
noisy_text = """
Machine learning rocks! It's revolutionizing the world in 2023 (and beyond!).
Visit our site: http://example.com for more info.
This is awesome!!! We collected 1,234 data points.
Softbank and Google are major players.
"""
In [ ]:
print("--- Sample Document ---")
print(sample_document)
print("\n--- Noisy Text ---")
print(noisy_text)
# Note: to load a .docx file, you would typically do:
# import docx2txt
# try:
#     text = docx2txt.process("your_document.docx")
#     print(text)
# except Exception as e:
#     print(f"Error loading docx: {e}")
--- Sample Document ---

Machine learning (ML) is a field of study in artificial intelligence concerned with the development of computer algorithms that can learn from and make predictions on data. Algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks. A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers. Mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.
In its application across business problems, machine learning is also referred to as predictive analytics.

--- Noisy Text ---

Machine learning rocks! It's revolutionizing the world in 2023 (and beyond!).
Visit our site: http://example.com for more info.
This is awesome!!! We collected 1,234 data points.
Softbank and Google are major players.
In [ ]:
import re
import string
# Combine texts
raw_text = sample_document + "\n" + noisy_text
print(raw_text)
Machine learning (ML) is a field of study in artificial intelligence concerned with the development of computer algorithms that can learn from and make predictions on data. Algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks. A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers. Mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.
In its application across business problems, machine learning is also referred to as predictive analytics.

Machine learning rocks! It's revolutionizing the world in 2023 (and beyond!).
Visit our site: http://example.com for more info.
This is awesome!!! We collected 1,234 data points.
Softbank and Google are major players.
1. Convert to Lowercase

Words like "Apple" and "apple" are treated as different tokens by most models, even though they usually mean the same thing. Lowercasing collapses them into a single token:
In [ ]:
cleaned_text = raw_text.lower()
print(cleaned_text)
machine learning (ml) is a field of study in artificial intelligence concerned with the development of computer algorithms that can learn from and make predictions on data. algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks. a subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers. mathematical optimization delivers methods, theory and application domains to the field of machine learning. data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.
in its application across business problems, machine learning is also referred to as predictive analytics.

machine learning rocks! it's revolutionizing the world in 2023 (and beyond!).
visit our site: http://example.com for more info.
this is awesome!!! we collected 1,234 data points.
softbank and google are major players.
2. Remove URLs

A model may falsely associate sentiment with random URLs, and such inconsistent patterns reduce accuracy, so URLs are usually stripped.

Exercise: Write a Python regular expression that matches URLs starting with "http://" or "https://", followed by one or more characters that can include letters, digits, special characters like $-_@.&+ and !*(), or percent-encoded sequences like %20. Use re.compile() to compile the pattern.
url_pattern = re.compile(
    r'http[s]?://'                     # Match "http://" or "https://"
    r'(?:'
    r'[a-zA-Z]'                        # Match any letter (uppercase or lowercase)
    r'|'
    r'[0-9]'                           # Match any digit
    r'|'
    r'[$-_@.&+]'                       # Match any of the listed special characters
    r'|'
    r'[!*\(\),]'                       # Match other allowed characters (escaped)
    r'|'
    r'(?:%[0-9a-fA-F][0-9a-fA-F])'     # Match percent-encoded characters
    r')+'
)
In [ ]:
def remove_urls(text):
url_pattern = r'https?://\S+|www\.\S+'
return re.sub(url_pattern, '', text)
In [ ]:
print(remove_urls(cleaned_text))
machine learning (ml) is a field of study in artificial intelligence concerned with the development of computer algorithms that can learn from and make predictions on data. algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks. a subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers. mathematical optimization delivers methods, theory and application domains to the field of machine learning. data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.
in its application across business problems, machine learning is also referred to as predictive analytics.

machine learning rocks! it's revolutionizing the world in 2023 (and beyond!).
visit our site: for more info.
this is awesome!!! we collected 1,234 data points.
softbank and google are major players.
In [ ]:
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
# This regex can also be generated with an LLM or found in standard regex references
print(url_pattern.sub('', cleaned_text))
machine learning (ml) is a field of study in artificial intelligence concerned with the development of computer algorithms that can learn from and make predictions on data. algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks. a subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers. mathematical optimization delivers methods, theory and application domains to the field of machine learning. data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.
in its application across business problems, machine learning is also referred to as predictive analytics.

machine learning rocks! it's revolutionizing the world in 2023 (and beyond!).
visit our site: for more info.
this is awesome!!! we collected 1,234 data points.
softbank and google are major players.
3. Remove Punctuation

In [ ]:
# str.maketrans builds a translation table that tells Python how to map
# (or delete) characters. Here each vowel is mapped to a digit:
# "a" → "1", "e" → "2", "i" → "3", "o" → "4", "u" → "5"
trans = str.maketrans("aeiou", "12345")
text = "education"
translated = text.translate(trans)
print(translated)
2d5c1t34n
In [ ]:
cleaned_text = cleaned_text.translate(translator)
print("\n--- After Removing Punctuation ---")
print(cleaned_text)
Explore our developer-friendly HTML to PDF API Printed using PDFCrowd HTML to PDF
machine learning ml is a field of study in artificial intelligence concerned with the development of computer algorithms that can learn from and make predictions on data algorithms build a mathematical model based on sample data known as training data in order to make predictions or decisions without being explicitly programmed to perform the task
machine learning algorithms are used in a wide variety of applications such as email filtering and computer vision where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks a subset of machine learning is closely related to computational statistics which focuses on making predictions using computers mathematical optimization delivers methods theory and application domains to the field of machine learning data mining is a related field of study focusing on exploratory data analysis through unsupervised learning
in its application across business problems machine learning is also referred to as predictive analytics

machine learning rocks its revolutionizing the world in 2023 and beyond
visit our site httpexamplecom for more info
this is awesome we collected 1234 data points
softbank and google are major players
4. Remove Numbers

Digits rarely carry meaning for these tasks; strip them with a regex (note how "1234" and "2023" disappear in the output below):

In [ ]:
cleaned_text = re.sub(r'\d+', '', cleaned_text)
print(cleaned_text)

machine learning ml is a field of study in artificial intelligence concerned with the development of computer algorithms that can learn from and make predictions on data algorithms build a mathematical model based on sample data known as training data in order to make predictions or decisions without being explicitly programmed to perform the task
machine learning algorithms are used in a wide variety of applications such as email filtering and computer vision where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks a subset of machine learning is closely related to computational statistics which focuses on making predictions using computers mathematical optimization delivers methods theory and application domains to the field of machine learning data mining is a related field of study focusing on exploratory data analysis through unsupervised learning
in its application across business problems machine learning is also referred to as predictive analytics

machine learning rocks its revolutionizing the world in and beyond
visit our site httpexamplecom for more info
this is awesome we collected data points
softbank and google are major players
In [ ]:
print("\nCleaned Text Length:", len(cleaned_text))
# Note: Removing non-English words is more complex and often involves language detection
5. Tokenization
Tokenization is the process of breaking down a text into individual units called tokens. These tokens can
be words, sentences, or even sub-word units.
5.1 Sentence Tokenization

Splitting text into sentences. Useful for tasks that require sentence-level analysis.
In [ ]:
!python -m spacy download en_core_web_sm
Collecting en-core-web-sm==3.8.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm
-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 44.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
In [ ]:
from nltk.tokenize import sent_tokenize
import spacy

nlp = spacy.load("en_core_web_sm")  # load the model downloaded above
In [ ]:
# Using NLTK
nltk_sentences = sent_tokenize(raw_text) # Using raw_text to show sentence boundary handling
print("--- Sentence Tokenization (NLTK) ---")
for i, sentence in enumerate(nltk_sentences):
print(f"Sentence {i+1}: {sentence}")
In [ ]:
# Using spaCy
# spaCy processes text as a Doc object, which has sentences built-in
"""
nlp is the SpaCy language model pipeline you’ve loaded (e.g., nlp = spacy.load("en_core_
When you call nlp(raw_text), SpaCy processes the text through its pipeline:
Tokenization
Part-of-speech tagging
Lemmatization
Explore our developer-friendly HTML to PDF API Printed using PDFCrowd HTML to PDF
Dependency parsing, etc.
The result is a Doc object (spacy_doc_raw) which is a container of tokens and their ling
"""
spacy_doc_raw = nlp(raw_text) # Using raw_text
spacy_sentences = [sent.text for sent in spacy_doc_raw.sents]
print("\n--- Sentence Tokenization (spaCy) ---")
for i, sentence in enumerate(spacy_sentences):
print(f"Sentence {i+1}: {sentence}")
Sentence 3: Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
Sentence 4: A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers.
Sentence 5: Mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Sentence 6: Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.
Sentence 7: In its application across business problems, machine learning is also referred to as predictive analytics.
spaCy's sentence tokenization is part of its processing pipeline and is generally robust.
5.2 Word Tokenization

In [ ]:
from nltk.tokenize import word_tokenize
In [ ]:
# Using spaCy (on the cleaned text)
# spaCy's Doc object already contains tokens
spacy_doc_cleaned = nlp(cleaned_text)
spacy_word_tokens = [token.text for token in spacy_doc_cleaned]
print("\n--- Word Tokenization (spaCy) ---")
print(spacy_word_tokens[:20]) # Print first 20 tokens
1. spaCy's tokenizer handles contractions and punctuation attached to words more gracefully than NLTK's basic word_tokenize, which may split them apart; spaCy also keeps URLs together as single tokens (see the quick check below).
2. spaCy tokens also carry rich linguistic information (POS tags, lemmas, etc.) directly.
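A quick, hedged check of that difference, reusing the word_tokenize and nlp objects loaded above (exact outputs can vary by library version):

s = "Visit http://example.com, it's awesome!"
print(word_tokenize(s))             # NLTK typically splits the URL into several pieces
print([t.text for t in nlp(s)])     # spaCy keeps the URL together as a single token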
6. Stop Word Removal

Stop words are very common words (like "the", "is", "a") that usually add little meaning and are removed before analysis.

In [ ]:
from nltk.corpus import stopwords
In [ ]:
# Build the NLTK stop-word set and word tokens, then remove stop words
nltk_stop_words = set(stopwords.words('english'))
nltk_word_tokens = word_tokenize(cleaned_text)
nltk_tokens_no_stopwords = [word for word in nltk_word_tokens if word not in nltk_stop_words]
print("\n--- NLTK Tokens after Stop Word Removal ---")
print(nltk_tokens_no_stopwords[:20])
['machine', 'learning', 'ml', 'field', 'study', 'artificial', 'intelligence', 'concerned', 'development', 'computer', 'algorithms', 'learn', 'make', 'predictions', 'data', 'algorithms', 'build', 'mathematical', 'model', 'based']
In [ ]:
# spaCy tokens have an `is_stop` attribute:
# token.is_stop is a built-in boolean that is True if the token is a stop word
spacy_tokens_no_stopwords = [token.text for token in spacy_doc_cleaned if not token.is_stop]
print("\n--- spaCy Tokens after Stop Word Removal ---")
print(spacy_tokens_no_stopwords[:20])
In [ ]:
from collections import Counter

# Count word frequencies
word_counts = Counter(spacy_tokens_no_stopwords)
print(word_counts.most_common(10))
7. Stemming and Lemmatization

Stemming: A heuristic process that chops off the ends of words. It's fast but can produce non-dictionary words (e.g., "automat" from "automatically").

Lemmatization: A more sophisticated process that uses a vocabulary and morphological analysis to return the base or dictionary form of a word (known as the lemma). It's slower but produces actual words (e.g., "better" -> "good"). Lemmatization often requires the word's Part-of-Speech (POS) tag to be accurate.
In [ ]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
# nltk, spacy are already imported
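To make the contrast concrete, here is a small sketch using the classes just imported (the expected outputs in the comments are typical for these tools):

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "automatically"]
print([stemmer.stem(w) for w in words])                   # ['run', 'studi', 'automat'], not all real words
print([lemmatizer.lemmatize(w, pos='v') for w in words])  # ['run', 'study', 'automatically']
print(lemmatizer.lemmatize("better", pos='a'))            # 'good', needs the adjective POS hint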
In [ ]:
# --- Lemmatization with spaCy ---
# Extract lemmas from the spaCy Doc object (on cleaned text, filtering stop words)
spacy_lemmatized_tokens = [token.lemma_ for token in spacy_doc_cleaned if not token.is_stop]
print(spacy_lemmatized_tokens[:20])
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
In [ ]:
# Sample noisy texts
texts = [
"Machine learning rocks! It's revolutionizing the world in 2023 (and beyond!). Visit
"This is awesome!!! We collected 1,234 data points.",
"Softbank and Google are major players.",
"Predictive analytics uses machine learning to solve business problems.",
]
In [ ]:
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Lemmatize, remove stop words, and keep only tokens longer than 1 character
    # (to skip leftover punctuation or stray single letters)
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if token.text not in stop_words and len(token.text) > 1]
    return ' '.join(tokens)

cleaned_texts = [clean_text(t) for t in texts]

print("Cleaned Texts:")
for i, txt in enumerate(cleaned_texts, 1):
    print(f"{i}: {txt}")
Cleaned Texts:
1: machine learning rock revolutionize world beyond visit site info
2: awesome collect data point
3: softbank google major player
4: predictive analytic use machine learn solve business problem
1. Bag of Words (BoW)

What it is: BoW converts text into a vector of word counts. It ignores grammar and word order and focuses only on how often each word appears.
How it works:
Represent each document by counting how many times each word occurs.
Industry Example: In spam detection, BoW helps count words like "free", "win", "offer". Spam emails tend to
use these more often — the model learns this pattern.
In [ ]:
# Using sklearn CountVectorizer for BoW
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(cleaned_texts)
In [ ]:
print("BoW Feature Names:", vectorizer.get_feature_names_out())
BoW Feature Names: ['analytic' 'awesome' 'beyond' 'business' 'collect' 'data' 'google' 'info'
 'learn' 'learning' 'machine' 'major' 'player' 'point' 'predictive'
 'problem' 'revolutionize' 'rock' 'site' 'softbank' 'solve' 'use' 'visit'
 'world']
In [ ]:
print("\nBoW Matrix (Count of words per document):")
bow_matrix.toarray()
In [ ]:
# Get feature names and BoW matrix values
feature_names = vectorizer.get_feature_names_out()
bow_values = bow_matrix.toarray()

# Create a DataFrame with one row per document
bow_df = pd.DataFrame(bow_values, columns=feature_names,
                      index=[f"Doc_{i+1}" for i in range(len(bow_values))])
print("BoW DataFrame:")
bow_df

BoW DataFrame:
Out[ ]:
       analytic  awesome  beyond  business  collect  data  google  info  learn  learning  ...
Doc_1         0        0       1         0        0     0       0     1      0         1  ...
Doc_2         0        1       0         0        1     1       0     0      0         0  ...
Doc_3         0        0       0         0        0     0       1     0      0         0  ...
Doc_4         1        0       0         1        0     0       0     0      1         0  ...

4 rows × 24 columns
📊 2. TF-IDF (Term Frequency – Inverse Document Frequency)

What it is: TF-IDF weighs words by how important they are to a document relative to all other documents.

How it works:

TF (Term Frequency): How often a word appears within a single document.

IDF (Inverse Document Frequency): Gives less weight to common words across all documents (like "the", "is"), and more to rare but important ones.
Industry Example: In resume ranking, TF-IDF gives more weight to unique skills (e.g., "TensorFlow", "NLP")
that appear in the job description and not in every resume — helping identify better-matching candidates.
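As a rough sketch of the arithmetic (using scikit-learn's default TfidfVectorizer settings): each entry is tf(t, d) × idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n the number of documents and df(t) the number of documents containing t, and each document row is then L2-normalized. That normalization is why, in the output below, a document made of four equally rare words gets exactly 1/√4 = 0.5 for each of them.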
In [ ]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned_texts)
print("TF-IDF Feature Names:", tfidf_vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())
TF-IDF Feature Names: ['analytic' 'awesome' 'beyond' 'business' 'collect' 'data' 'google' 'info'
 'learn' 'learning' 'machine' 'major' 'player' 'point' 'predictive'
 'problem' 'revolutionize' 'rock' 'site' 'softbank' 'solve' 'use' 'visit'
 'world']
0. 0. 0. 0. 0. 0.
0. 0.5 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 0.
0.5 0. 0. 0. 0. 0.5
0.5 0. 0. 0. 0. 0.
0. 0.5 0. 0. 0. 0. ]
[0.36222393 0. 0. 0.36222393 0. 0.
0. 0. 0.36222393 0. 0.2855815 0.
0. 0. 0.36222393 0.36222393 0. 0.
0. 0. 0.36222393 0.36222393 0. 0. ]]
In [ ]:
# Get feature names and TF-IDF matrix values
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_values = tfidf_matrix.toarray()

# Create a DataFrame with one row per document
tfidf_df = pd.DataFrame(tfidf_values, columns=feature_names,
                        index=[f"Doc_{i+1}" for i in range(len(tfidf_values))])
print("TF-IDF DataFrame:")
tfidf_df
TF-IDF DataFrame:
Out[ ]:
       analytic  awesome  beyond  business  collect  data  google   info  learn  learning  ...
Doc_1     0.000      0.0   0.341     0.000      0.0   0.0     0.0  0.341  0.000     0.341  ...
Doc_2     0.000      0.5   0.000     0.000      0.5   0.5     0.0  0.000  0.000     0.000  ...
Doc_3     0.000      0.0   0.000     0.000      0.0   0.0     0.5  0.000  0.000     0.000  ...
Doc_4     0.362      0.0   0.000     0.362      0.0   0.0     0.0  0.000  0.362     0.000  ...
4 rows × 24 columns
Summary
Session 2
8. Part-of-Speech (POS) Tagging
POS tagging is the process of assigning a grammatical category (like noun, verb, adjective) to each word in
a sentence. This is useful for understanding the structure and meaning of text and is often a precursor to
other tasks like chunking or dependency parsing.
In [ ]:
# nltk, spacy are already
# spacy_doc_cleaned is already created in 5.2
# SpaCy's POS tagging is part of its standard pipeline and is generally very accurate.
# It provides both coarse-grained (`.pos_`) and fine-grained (`.tag_`) POS tags.
: SPACE (_SP)
machine: NOUN (NN)
learning: NOUN (NN)
ml: PROPN (NNP)
is: AUX (VBZ)
a: DET (DT)
field: NOUN (NN)
of: ADP (IN)
study: NOUN (NN)
in: ADP (IN)
artificial: ADJ (JJ)
intelligence: NOUN (NN)
concerned: VERB (VBN)
with: ADP (IN)
the: DET (DT)
development: NOUN (NN)
of: ADP (IN)
computer: NOUN (NN)
algorithms: NOUN (NNS)
that: PRON (WDT)
can: AUX (MD)
In [ ]:
# Using spaCy to find noun chunks
print("--- Noun Chunking with spaCy ---")
for chunk in spacy_doc_cleaned.noun_chunks:
print(chunk.text)
machine learning ml
a field
study
artificial intelligence
the development
computer algorithms
that
predictions
data algorithms
a mathematical model
sample data
data
order
predictions
decisions
the task
machine learning
the world
our site
more info
this
we
data points
softbank
google
major players
In [ ]:
# spacy is already imported
# spacy_doc_raw is already created
Vectorization: Converting text tokens into numerical vectors. Common methods include Bag of Words and TF-IDF (demonstrated above), as well as word embeddings.

Sentiment Analysis: Determining the emotional tone of text (positive, negative, neutral). Libraries and lexicons for this include:
VADER (Valence Aware Dictionary and sEntiment Reasoner): Rule-based sentiment analysis
specifically tuned for social media text. Available in NLTK.
NRC Lexicons: Lexicons like NRC-Emotion Lexicon (mapping words to emotions) and NRC-VAD
(Valence, Arousal, Dominance) lexicon.
AFINN: A lexicon-based approach mapping words to sentiment scores.
In [ ]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()
In [ ]:
# Example sentiment analysis on a sentence from our original text
sentence_for_sentiment = "Machine learning is a field of study in artificial intelligenc
vs = analyzer.polarity_scores(sentence_for_sentiment)
print("\n--- Sentiment Analysis with NLTK's VADER ---")
print(f"Sentence: {sentence_for_sentiment}")
print(f"VADER Polarity Scores: {vs}")
In [ ]:
sentence_for_sentiment_2 = "Machine learning rocks! This is awesome!!!"
vs2 = analyzer.polarity_scores(sentence_for_sentiment_2)
print(f"Sentence: {sentence_for_sentiment_2}")
print(f"VADER Polarity Scores: {vs2}")
What are Valence Scores? Valence scores measure the emotional intensity or positivity/negativity of a
word.
In sentiment analysis, each word has a valence score (positive, negative, or neutral) based on how people
emotionally perceive that word.
These scores are predefined in the VADER lexicon based on human ratings.
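You can inspect these per-word ratings directly; a quick check using the analyzer created above (the scores printed depend on the lexicon version):

# The VADER lexicon is exposed as a plain dict mapping word -> valence score
for w in ["awesome", "rocks", "terrible"]:
    print(w, analyzer.lexicon.get(w))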
When VADER analyzes a sentence, it looks up each word's valence score, adjusts for rules like negation, capitalization, and punctuation emphasis, and then aggregates the scores.

What is compound? The compound score ranges from -1 (most negative) to +1 (most positive). It is calculated by summing the valence scores of each word in the text and then normalizing the result to between -1 and +1 using a standard formula.

🧠 Example: For the output {'neg': 0.0, 'neu': 0.721, 'pos': 0.279, 'compound': 0.4767}:
neg: 0.0 → No strong negative sentiment detected.
neu: 0.721 → Most of the text is neutral wording.
pos: 0.279 → A modest share of the words carry positive sentiment.
compound: 0.4767 → The overall tone is moderately positive.
In [ ]:
# Use the lemmatized tokens after stop word removal for the word cloud
# Join the list of tokens back into a single string
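# (The rest of this cell did not survive the export; a minimal sketch, assuming
# the wordcloud and matplotlib packages installed at the top of the notebook.)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text_for_cloud = ' '.join(spacy_lemmatized_tokens)
wc = WordCloud(width=800, height=400, background_color='white').generate(text_for_cloud)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()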
Conclusion
Text preprocessing is an essential step in any NLP pipeline. It transforms raw, unstructured text into a clean
and structured format suitable for analysis and training machine learning models.
NLTK is a powerful and flexible library, great for learning and experimenting with various NLP
techniques. It provides modular access to different algorithms and resources.
spaCy is designed for efficiency and production use. Its integrated pipeline provides fast and accurate
results for core tasks like tokenization, POS tagging, lemmatization, and NER.
Choosing between NLTK and spaCy often depends on the specific task, performance requirements, and
ease of use needed for your project. Often, data scientists use a combination of tools based on their
strengths.
By mastering these preprocessing techniques, you are well on your way to tackling more complex text
mining and NLP challenges!