NLP Lab
AIM: Installation and exploring features of NLTK and spaCy tools. Download WordCloud and a few corpora.
Description: NLTK (Natural Language Toolkit) is a widely used Python library for NLP tasks such as tokenization,
stopword removal, lemmatization, sentiment analysis, and corpus exploration. spaCy is an advanced NLP library
designed for speed and efficiency, providing features like POS tagging, dependency parsing, and NER (Named Entity
Recognition). WordCloud is a visualization tool that displays the most frequent words in a given text corpus.
Example: We install these libraries and download the required models/corpora.
Source Code :
pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
pip install spacy
python -m spacy download en_core_web_sm
Output: No direct output, only installation and download messages.
2(a) Tokenization
Description: Tokenization is the process of splitting text into smaller units like words (word tokenization) or
sentences (sentence tokenization).
Example:
Input: "Natural language processing is interesting." → Tokens: ['Natural', 'language', 'processing', 'is', 'interesting',
'.'].
Source Code :
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Natural language processing is an interesting field of study."
words = word_tokenize(text)
sentences = sent_tokenize(text)
print("Words:", words)
print("Sentences:", sentences)
Output:
Words: ['Natural', 'language', 'processing', 'is', 'an', 'interesting', 'field', 'of', 'study', '.']
Sentences: ['Natural language processing is an interesting field of study.']
2(b) Stopword Removal
Description: Stopwords are common words (like "is", "the", "and") that are usually removed in text processing since
they don't add meaningful information.
Example: From the sentence above, stopwords are removed → ['Natural', 'language', 'processing', 'interesting', 'field',
'study', '.'] (punctuation is not a stopword, so it remains in the list).
Source Code :
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)
Output:
Filtered Words: ['Natural', 'language', 'processing', 'interesting', 'field', 'study', '.']
2(c) Lemmatization
Description: Lemmatization reduces words to their root/base form (lemma), considering grammar.
Example: "studies" → "study", "running" → "run".
Source Code :
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("Lemmatized Words:", lemmatized_words)
Output:
Lemmatized Words: ['Natural', 'language', 'processing', 'is', 'an', 'interesting', 'field', 'of', 'study', '.']
Description: Corpora are large collections of texts available in NLTK for research and experimentation. Names
corpus contains male and female names → useful for gender classification tasks.
Source Code :
from nltk.corpus import names
print(names.words()[:20])
Output:
['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie', 'Abby', 'Abbye', 'Abigael', 'Abigail', 'Abigale', 'Abra', 'Ada',
'Adah', 'Adaline', 'Adan', 'Adara', 'Adda', 'Addi', 'Addia']
Description: spaCy provides pre-trained language models for efficient NLP. Loading the model is the first step to use
its features.
Example: en_core_web_sm → a small English model with vocabulary, syntax, and entities.
Source Code :
import spacy
nlp = spacy.load('en_core_web_sm')
Output: Loads the spaCy model (no printed output).
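The token and lemma listing shown below was produced by a tokenization and lemmatization snippet that is not included above; a minimal sketch that reproduces it (the sample sentence is read off the output itself):
doc = nlp("Natural language processing with spaCy is efficient.")
print("Tokens:", [token.text for token in doc])
print("Lemmas:", [token.lemma_ for token in doc])
Output: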
Tokens: ['Natural', 'language', 'processing', 'with', 'spaCy', 'is', 'efficient', '.']
Lemmas: ['Natural', 'language', 'processing', 'with', 'spaCy', 'be', 'efficient', '.']
Description: NER identifies real-world entities such as names, organizations, and dates.
Example: Text: "Barack Obama was the president of USA." → Entities: Barack Obama (PERSON), USA (GPE).
Source Code:
for entity in doc.ents:
    print(entity.text, entity.label_)
Output:
(No entities found in this sentence)
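The sample sentence above contains no named entities, so the loop prints nothing. A small sketch (doc2 is a hypothetical extra variable) that applies the same loop to the example sentence from the description:
doc2 = nlp("Barack Obama was the president of USA.")
for entity in doc2.ents:
    print(entity.text, entity.label_)
With en_core_web_sm this typically yields Barack Obama (PERSON) and USA (GPE).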
Description: POS tagging assigns grammatical categories (noun, verb, adjective) to words.
Example: "efficient" → ADJ, "language" → NOUN.
Source Code :
for token in doc:
    print(token.text, token.pos_)
Output:
Natural ADJ
language NOUN
processing NOUN
with ADP
spaCy PROPN
is AUX
efficient ADJ
. PUNCT
Description: Word Clouds visually represent the frequency of words; bigger words mean higher frequency.
Example: From filtered words, "Natural", "language", "study" appear prominently.
Source Code :
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(width=800, height=400).generate(" ".join(filtered_words))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Output: Graphical word cloud (words like Natural, language, processing, field, study).
Description: A collection of classic literary texts from Project Gutenberg, useful for linguistic analysis.
Example: austen-emma.txt → “Emma” by Jane Austen.
Source Code :
from nltk.corpus import gutenberg
print(gutenberg.fileids())
text = gutenberg.raw('austen-emma.txt')
print(text[:500])
Output:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
Emma by Jane Austen 1816
VOLUME I
CHAPTER I
Emma Woodhouse, handsome, clever, and rich, with a comfortable home...
Description: A corpus containing movie reviews labeled as positive or negative → used for sentiment analysis.
Example: First review snippet shows a negative review about a teen couple’s story.
Source Code :
from nltk.corpus import movie_reviews
print(movie_reviews.fileids())
file_id = movie_reviews.fileids()[0]
print(movie_reviews.raw(file_id)[:300])
Output:
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', ... , 'pos/cv999_13106.txt']
plot : two teen couples go to a church party , drink and then drive .
they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life ...
EXERCISE -2
2. (1) Write a program to implement word Tokenizer, Sentence and Paragraph Tokenizers.
Aim: Write a program that implements word tokenizer, sentence tokenizer, and paragraph tokenizer using NLTK.
Description:
Tokenization splits raw text into smaller units. Word tokenization breaks the text into individual words, sentence tokenization breaks it into sentences, and a simple paragraph tokenizer can be obtained by splitting the text on newline characters. NLTK provides word_tokenize() and sent_tokenize() for the first two.
SOURCE CODE:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
# Sample text (reconstructed from the output below)
text = ("Natural language processing (NLP) is a field of AI concerned with the interaction between computers and human language. "
        "It involves tasks like language translation, sentiment analysis, and text summarization. NLP uses various techniques such as tokenization, parsing, and machine learning.")
words_nltk = word_tokenize(text)
sentences_nltk = sent_tokenize(text)
paragraphs_nltk = text.split('\n')
print("Word Tokenizer Output:", words_nltk)
print("Sentence Tokenizer Output:", sentences_nltk)
print("Paragraph Tokenizer Output:", paragraphs_nltk)
Output:
Word Tokenizer Output:
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'AI', 'concerned', 'with', 'the', 'interaction', ...]
Sentence Tokenizer Output:
['Natural language processing (NLP) is a field of AI concerned with the interaction between computers and human language.',
'It involves tasks like language translation, sentiment analysis, and text summarization.',
'NLP uses various techniques such as tokenization, parsing, and machine learning.']
Paragraph Tokenizer Output:
['Natural language processing (NLP) is a field of AI concerned with the interaction between computers and human language. It involves tasks like language translation, sentiment analysis, and text summarization. NLP uses various techniques such as tokenization, parsing, and machine learning.']
2. (2) Write a program to count the total number of words and distinct words in a given text.
Aim: Write a program that counts the total number of words and distinct words in a given text corpus using spaCy.
Description:
The text is processed with spaCy's language model, stop words and punctuation are filtered out, and the total number of words and distinct words are then calculated.
import spacy
nlp = spacy.load('en_core_web_sm')
# Sample text (assumed; only the tail of the original sample was preserved)
text = ("Natural language processing (NLP) is a field of AI concerned with the interaction between computers and human language. "
        "It involves tasks like language translation, sentiment analysis, and text summarization.")
doc = nlp(text)
words_spacy = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct]
# Number of words and distinct words
num_words_spacy = len(words_spacy)
distinct_words_spacy = len(set(words_spacy))
print("Total number of words (spaCy):", num_words_spacy)
print("Number of distinct words (spaCy):", distinct_words_spacy)
Output:
EXERCISE -3
(i) Program to Implement User-Defined and Pre-Defined Functions to Generate N-Grams
Aim
Implement user-defined and pre-defined functions using NLTK and spaCy libraries in Python to generate n-grams
(Unigrams, Bigrams, Trigrams, and N-Grams) for a given text.
Objective
To generate unigrams, bigrams, trigrams, and general n-grams for a given text using both a user-defined function and NLTK's built-in ngrams() utility.
Theory
In Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items (usually words) from a given
text or speech. They are widely used in language modelling, machine translation, and predictive text systems.
NLTK provides built-in functions such as nltk.util.ngrams() to generate n-grams directly. spaCy does not provide a
direct method, but we can easily implement n-gram generation using loops and tokenization.
SOURCE CODE:
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
# Sample text
text = "Natural language processing is fun with NLTK and spaCy."
# Word Tokenization
words = word_tokenize(text)
# User-defined n-gram generator
def generate_ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
unigrams = generate_ngrams(words, 1)
bigrams = generate_ngrams(words, 2)
trigrams = generate_ngrams(words, 3)
print("Unigrams (User-Defined):", unigrams)
print("Bigrams (User-Defined):", bigrams)
print("Trigrams (User-Defined):", trigrams)
# Pre-defined NLTK function
print("Bigrams (nltk.util.ngrams):", list(ngrams(words, 2)))
import spacy
nlp = spacy.load('en_core_web_sm')
# Sample text
text = "Natural language processing is fun with NLTK and spaCy."
doc = nlp(text)
# User-defined n-gram generator over spaCy tokens
def generate_ngrams_spacy(doc, n):
    tokens = [token.text for token in doc]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
unigrams_spacy = generate_ngrams_spacy(doc, 1)
bigrams_spacy = generate_ngrams_spacy(doc, 2)
trigrams_spacy = generate_ngrams_spacy(doc, 3)
print("Unigrams (spaCy):", unigrams_spacy)
print("Bigrams (spaCy):", bigrams_spacy)
print("Trigrams (spaCy):", trigrams_spacy)
Output:
Unigrams (User-Defined): [('Natural',), ('language',), ('processing',), ...]
Bigrams (User-Defined): [('Natural', 'language'), ('language', 'processing'), ...]
Trigrams (User-Defined): [('Natural', 'language', 'processing'), ...]
3 (ii) Program to Calculate the Highest Probability of a Word (w2) Occurring After Another Word (w1)
Aim
Calculate the conditional probability of a word w2 occurring immediately after another word w1 using Bigram
Probabilities.
Objective
To compute the conditional probability P(w2 | w1) from bigram and unigram frequency counts for a chosen word pair.
Theory
The Bigram Model is a type of n-gram model where n=2. It considers pairs of consecutive words. The probability of a
word w2 following another word w1 is given by the ratio of the frequency of the bigram (w1, w2) to the frequency of
w1. This is mostly used in predictive text systems, spell checkers, and other NLP applications.
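As a quick worked example with illustrative counts: if w1 = "natural" occurs 200 times in the corpus and the bigram ("natural", "language") occurs 50 times, then P(language | natural) = 50 / 200 = 0.25.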
SOURCE CODE:
Using NLTK:
import nltk
from nltk.corpus import gutenberg
from nltk import bigrams
from nltk.probability import FreqDist
nltk.download('gutenberg')
nltk.download('punkt')
text = gutenberg.raw('austen-emma.txt')
words = nltk.word_tokenize(text)
bi_grams = list(bigrams(words))
bigram_fd = FreqDist(bi_grams)
word_fd = FreqDist(words)
# Conditional probability P(w2 | w1) = count(w1, w2) / count(w1)
def bigram_probability(w1, w2):
    if word_fd[w1] == 0:
        return 0.0
    return bigram_fd[(w1, w2)] / word_fd[w1]
w1 = "natural"
w2 = "is"
print(f"Probability of '{w2}' occurring after '{w1}':", bigram_probability(w1, w2))
Using spaCy:
import spacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")
# Sample text (assumed: the same sentence as in part (i))
doc = nlp("Natural language processing is fun with NLTK and spaCy.")
bigrams_spacy = [(doc[i].text, doc[i + 1].text) for i in range(len(doc) - 1)]
bigram_counts = Counter(bigrams_spacy)
word_counts = Counter([token.text for token in doc])
def bigram_probability_spacy(w1, w2):
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return bigram_counts[(w1, w2)] / word_counts[w1] if word_counts[w1] else 0.0
w1 = "Natural"
w2 = "language"
print(f"Probability of '{w2}' occurring after '{w1}':", bigram_probability_spacy(w1, w2))
Output:
EXERCISE -4
Aim: To identify collocations (frequent word pairs) from a text corpus using NLTK and spaCy.
Description:
Collocations are pairs of words that often appear together in natural language (e.g., “machine learning” or “artificial
intelligence”). Using NLTK, we can extract collocations with statistical measures like likelihood ratio. With spaCy,
we extract bigrams (two-word combinations).
Source code:
import nltk
from nltk.corpus import reuters
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
# Download resources
nltk.download('reuters')
nltk.download('punkt')
# Using NLTK
words = reuters.words()
bigram_finder = BigramCollocationFinder.from_words(words)
collocations = bigram_finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)
print("Top 10 Collocations (NLTK):", collocations)
# Using spaCy
import spacy
from collections import Counter
nlp = spacy.load('en_core_web_sm')
text = "Natural language processing and deep learning are crucial for AI."
doc = nlp(text)
bigrams_spacy = [(doc[i].text, doc[i+1].text) for i in range(len(doc)-1)]
bigram_counts = Counter(bigrams_spacy)
print("Top Bigrams (spaCy):", bigram_counts.most_common(10))
Output:
4(ii) Program to Print All Words Beginning with a Given Sequence of Letters
Aim: To extract all words starting with a given prefix from a corpus using NLTK and spaCy.
Description:
This program checks if words in a dictionary or text start with a specified prefix (e.g., “pre”). Useful in prefix-based
search applications like autocomplete.
Source code:
import nltk
nltk.download('words')
words = nltk.corpus.words.words()
prefix = "pre"
words_with_prefix = [word for word in words if word.lower().startswith(prefix)]
print(f"Words starting with '{prefix}' (NLTK):", words_with_prefix[:10]) # printing first 10 for clarity
# Using spaCy
import spacy
nlp = spacy.load('en_core_web_sm')
# Sample text (assumed)
text = "The president prepared a preview of the presentation prior to the press preview."
doc = nlp(text)
prefix = "pre"
words_with_prefix_spacy = [token.text for token in doc if token.text.lower().startswith(prefix)]
print(f"Words starting with '{prefix}' (spaCy):", words_with_prefix_spacy)
Output:
4(iii) Program to print all words longer than four characters
Aim: To filter and display all words from a given corpus or text that have a length greater than four characters using
NLTK and spaCy.
Description:
In Natural Language Processing (NLP), it is often useful to ignore short or less meaningful words such as articles (a,
an, the), prepositions (in, on, at), and conjunctions (and, or, but). These words are usually less informative in tasks
like keyword extraction, information retrieval, and text summarization.
By filtering words based on their length (e.g., keeping only words longer than four characters), we can focus on
content-rich words like nouns (language, machine), verbs (process, learn), and adjectives (natural, crucial).
The program demonstrates two approaches:
1. NLTK words corpus – extracts dictionary words and filters them based on length.
2. spaCy text processing – tokenizes a custom sentence and extracts words longer than four characters.
This demonstrates both lexical resource filtering (NLTK) and context-based filtering (spaCy).
Source code:
import nltk
nltk.download('words')
# Using NLTK
words = nltk.corpus.words.words()
long_words_nltk = [word for word in words if len(word) > 4]
print("Words longer than four characters (NLTK):", long_words_nltk[:10])
# Using spaCy
import spacy
nlp = spacy.load('en_core_web_sm')
# Sample text (assumed)
text = "Natural language processing and machine learning are crucial for modern AI applications."
doc = nlp(text)
long_words_spacy = [token.text for token in doc if token.is_alpha and len(token.text) > 4]
print("Words longer than four characters (spaCy):", long_words_spacy)
Output:
EXERCISE -5
5 (I). Aim:
To write a program to find all the synonyms and antonyms of a given word using the WordNet lexical database in NLTK.
Description:
This program helps in finding all synonyms (words with similar meanings) and antonyms (words with opposite
meanings) of a given word. It uses the WordNet lexical database from the NLTK (Natural Language Toolkit) library in
python.
It works as follows:
1. WordNet Lookup:
WordNet groups English words into synsets (sets of synonyms).
Each synset contains multiple lemmas (word forms).
2. Synonyms Extraction:
For each synset of the given word, the program collects all lemma names and stores them as synonyms.
3. Antonyms Extraction:
Some lemmas have antonym links. If found, these are collected separately.
4. User Input:
The program asks the user to enter a word.
Then it prints all synonyms and antonyms of that word.
Source code:
#using NLTK
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

def get_synonyms_antonyms(word):
    synsets = wordnet.synsets(word)
    synonyms = set()
    antonyms = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            # Add synonyms
            synonyms.add(lemma.name())
            # Add antonyms
            if lemma.antonyms():
                antonyms.add(lemma.antonyms()[0].name())
    return synonyms, antonyms

word = "happy"
synonyms, antonyms = get_synonyms_antonyms(word)
print("Synonyms of", word, ":", synonyms)
print("Antonyms of", word, ":", antonyms)
Output:
#using NLTK+spaCy
import spacy
import nltk
from nltk.corpus import wordnet
nlp = spacy.load('en_core_web_sm')
nltk.download('wordnet')
nltk.download('omw-1.4')

def get_synonyms_antonyms_spacy(word):
    # (assumption) use spaCy to reduce the word to its base form before the WordNet lookup
    lemma_word = nlp(word)[0].lemma_
    synsets = wordnet.synsets(lemma_word)
    synonyms = set()
    antonyms = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            if lemma.antonyms():
                antonyms.add(lemma.antonyms()[0].name())
    return synonyms, antonyms

# from a document
word = "happy"
synonyms, antonyms = get_synonyms_antonyms_spacy(word)
print("Synonyms of", word, ":", synonyms)
print("Antonyms of", word, ":", antonyms)
Output:
5 (II). Aim: To Write a program to find hyponymy, homonymy, polysemy for a given word.
Description:
This program is designed to analyse a given word and extract its semantic relationships using the WordNet lexical
database available in the NLTK (Natural Language Toolkit) library. Specifically, it identifies hyponyms, homonyms,
and polysemy of the word. It works as follows:
1. WordNet Synsets:
o The program retrieves all synsets (groups of synonyms that share the same meaning) of the given
word from WordNet.
2. Hyponymy:
o Hyponyms are words whose meaning is a more specific instance of the given word (a "type of" relationship).
o Example: For the word "bank", hyponyms include "savings_bank" and "commercial_bank".
3. Homonymy:
o Homonyms are words that share the same spelling but have different, unrelated meanings.
o The program lists all distinct definitions of the word’s synsets to show different meanings.
4. Polysemy:
o The program counts the number of synsets a word has, which represents its polysemy degree.
Source code:
#using NLTK
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet

def get_hyponyms_homonyms_polysemy(word):
    synsets = wordnet.synsets(word)
    hyponyms = set()
    homonyms = set()
    for synset in synsets:
        for hyponym in synset.hyponyms():
            hyponyms.add(hyponym.name())
        if len(synset.lemmas()) > 1:
            homonyms.add(synset.name())
    polysemy = len(synsets)
    return hyponyms, homonyms, polysemy

# Test with a word
word = "bank"
hyponyms, homonyms, polysemy = get_hyponyms_homonyms_polysemy(word)
print("Hyponyms:", hyponyms)
print("Homonyms (senses with shared spelling):", homonyms)
print("Polysemy (number of senses):", polysemy)
Output:
#using NLTK+spaCy
import spacy
import nltk
from nltk.corpus import wordnet
nlp = spacy.load('en_core_web_sm')
nltk.download('wordnet')
nltk.download('omw-1.4')

def get_hyponyms_homonyms_polysemy_spacy(word):
    # (assumption) lemmatize the input word with spaCy before the WordNet lookup
    lemma_word = nlp(word)[0].lemma_
    synsets = wordnet.synsets(lemma_word)
    hyponyms = set()
    homonyms = set()
    for synset in synsets:
        for hyponym in synset.hyponyms():
            hyponyms.add(hyponym.name())
        if len(synset.lemmas()) > 1:
            homonyms.add(synset.name())
    polysemy = len(synsets)
    return hyponyms, homonyms, polysemy

word = "bank"
hyponyms, homonyms, polysemy = get_hyponyms_homonyms_polysemy_spacy(word)
print("Hyponyms:", hyponyms)
print("Homonyms (senses with shared spelling):", homonyms)
print("Polysemy (number of senses):", polysemy)
OUTPUT:
EXERCISE 6
(i) Write a program to find all the stop words in any given text.
Aim:
To Find all the stop words in any given text.
Description:
This program demonstrates how to identify and extract stop words from any given text using two popular NLP
libraries, NLTK and spaCy. Stop words are common words in a language, such as “the”, “is”, “and”, which usually
do not carry significant meaning and are often removed during text preprocessing in Natural Language Processing
tasks. Using NLTK, the text is first tokenized into individual words, and then compared with NLTK’s predefined list
of English stop words to extract all matching words. On the other hand, spaCy provides an attribute is_stop for each
token, allowing stop words to be identified directly while processing the text with spaCy’s language model. Both
methods return the set of stop words present in the input text, which can be useful in applications like text analysis,
sentiment analysis, and machine learning models where filtering out such words helps improve efficiency and focus
on meaningful content.
Source Code:
Using NLTK to Find Stop Words in a Text
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
# Sample text (assumed; chosen to reproduce the output below)
text = "This is a sample sentence, just to demonstrate finding stop words in NLP."
words = word_tokenize(text)
stop_words = set(stopwords.words('english'))
found_stop_words = [word for word in words if word.lower() in stop_words]
print("Stop Words in the Text (NLTK):", found_stop_words)
Output:
Stop Words in the Text (NLTK): ['This', 'is', 'a', 'just', 'to', 'in']
SOURCE CODE:
import spacy
nlp = spacy.load("en_core_web_sm")
# Sample text (assumed; same sentence as above)
text = "This is a sample sentence, just to demonstrate finding stop words in NLP."
doc = nlp(text)
stop_words_spacy = [token.text for token in doc if token.is_stop]
print("Stop Words in the Text (spaCy):", stop_words_spacy)
Output:
Stop Words in the Text (spaCy): ['This', 'is', 'a', 'just', 'to', 'in']
(ii) Write a function that finds the 50 most frequently occurring words of a text that are not stop words.
Aim:
Find the 50 most frequently occurring words of a text that are not stop words.
Description:
This program is designed to find the 50 most frequently occurring words in a given text after removing stop words. Since
stop words are common words like "the", "is", and "and" that don't add much meaning, they are filtered out to focus
on more meaningful terms. Using NLTK, the text is tokenized, stop words and non-alphabetic tokens are removed,
and a frequency distribution is calculated with FreqDist to list the most frequent words. Similarly, in spaCy, the text is
processed into tokens, and words that are stop words or punctuation are excluded. The remaining words are counted
using Python’s Counter to obtain the top 50 frequent terms. This approach helps highlight the most important words in
a text, which is useful in tasks like keyword extraction, text summarization, and content analysis.
Source Code:
Using NLTK to Find the Most Frequent Non-Stop Words
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
nltk.download('stopwords')
nltk.download('punkt')
# Sample text
text = "This is a simple sentence, just to test the functionality of finding the most frequent non-stop words. " \
       "Stop words should be excluded from the analysis so we can focus on meaningful words."
words = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]
fdist = FreqDist(filtered_words)
print("50 Most Frequent Non-Stop Words (NLTK):", fdist.most_common(50))
Output:
50 Most Frequent Non-Stop Words (NLTK): [('words', 3), ('simple', 1), ('sentence', 1), ('test', 1), ('functionality',
1), ('finding', 1), ('frequent', 1), ('stop', 1), ('excluded', 1), ('analysis', 1), ('focus', 1), ('meaningful', 1)]
import spacy
from collections import Counter
nlp = spacy.load('en_core_web_sm')
# Sample text
text = "This is a simple sentence, just to test the functionality of finding the most frequent non-stop words. " \
       "Stop words should be excluded from the analysis so we can focus on meaningful words."
doc = nlp(text)
# Filter out stop words and punctuation, and create a list of remaining words
filtered_words_spacy = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct and token.is_alpha]
word_freq = Counter(filtered_words_spacy)
print("50 Most Frequent Non-Stop Words (spaCy):", word_freq.most_common(50))
Output:
50 Most Frequent Non-Stop Words (spaCy): [('words', 3), ('stop', 2), ('simple', 1), ('sentence', 1), ('test', 1),
('functionality', 1), ('finding', 1), ('frequent', 1), ('non', 1), ('excluded', 1), ('analysis', 1), ('focus', 1),
('meaningful', 1)]
EXERCISE 7
7) Write a program to implement various stemming techniques and prepare a chart with the
performance of each method.
Aim:
To implement various stemming techniques (such as Porter Stemmer, Snowball Stemmer, Lancaster Stemmer and
Regexp Stemmer) and analyze their performance by preparing a comparative chart showing the efficiency and
effectiveness of each method.
Description:
Stemming is a text preprocessing technique used in Natural Language Processing (NLP) to reduce words to their base
or root form by removing suffixes or prefixes. Different stemming algorithms follow different approaches, each with
unique strengths and limitations:
Porter Stemmer: Developed by Martin Porter in 1980, this is one of the most widely used stemming
algorithms. It applies a series of rule-based suffix stripping steps to reduce words. While it produces stems
that may not always be valid words, it strikes a balance between simplicity, speed, and accuracy. For example,
“connection” → “connect”, “caresses” → “caress”.
Lancaster Stemmer: Also known as the Paice/Husk stemmer, it is a more aggressive rule-based approach. It
applies a set of rules iteratively until no more stemming is possible. While it is faster, it often over-stems
words, producing very short root forms. For instance, “connection” → “connect”, but “university” →
“univers”.
Snowball Stemmer: An improvement over the Porter Stemmer, also developed by Martin Porter. It supports
multiple languages and provides better accuracy with fewer errors compared to Porter. It is considered more
efficient and consistent. Example: “running” → “run”, “generalization” → “general”.
Regexp Stemmer: A flexible stemmer that uses user-defined regular expressions to remove specific prefixes
or suffixes from words. Its performance depends on the quality of regex patterns provided, making it less
standardized compared to other stemmers. For example, with the rule to remove “ing”, “running” → “run”,
“playing” → “play”.
NLTK provides several stemmers that can be directly applied to text. In this work, we use four stemmers: Porter
Stemmer, Lancaster Stemmer, Snowball Stemmer, and Regexp Stemmer. Each of these reduces words to their
root form using different approaches, allowing us to compare their effectiveness.
Source code:
import nltk
import time
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, RegexpStemmer
nltk.download('punkt')
# Sample text (assumed; reconstructed from the stemmer outputs below)
text = "running runs runner better best tried tries runningness"
words = word_tokenize(text)
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")
regexp = RegexpStemmer('ing$|ed$|es$', min=4)
stemmers = {"Porter": porter, "Lancaster": lancaster, "Snowball": snowball, "Regexp": regexp}
def apply_stemmers(words):
    return {name: [stemmer.stem(w) for w in words] for name, stemmer in stemmers.items()}
stemmed_words = apply_stemmers(words)
for name, stems in stemmed_words.items():
    print(f"{name} Stemmed Words:", stems)
# Collecting results: time each stemmer separately
times = {}
for name, stemmer in stemmers.items():
    start_time = time.time()
    [stemmer.stem(w) for w in words]
    times[name] = time.time() - start_time
stemmer_names = list(times.keys())
stemmer_times = list(times.values())
# Plot the performance of each stemmer
plt.bar(stemmer_names, stemmer_times)
plt.xlabel("Stemming Method")
plt.ylabel("Time (seconds)")
plt.show()
Output:
Porter Stemmed Words: ['run', 'run', 'runner', 'better', 'best', 'tri', 'tri', 'running']
Lancaster Stemmed Words: ['run', 'run', 'run', 'bet', 'best', 'tri', 'tri', 'run']
Snowball Stemmed Words: ['run', 'run', 'runner', 'better', 'best', 'tri', 'tri', 'running']
Regexp Stemmed Words: ['runn', 'runs', 'runner', 'better', 'best', 'tri', 'tri', 'runningness']
Using spaCy for Tokenization and Stemming
spaCy does not directly support stemming, but we can use spaCy's lemmatizer as an alternative to stemming.
Lemmatization reduces words to their base or dictionary form, which is similar to stemming but more accurate. We
can still use spaCy for preprocessing (tokenization) and then compare the performance.
Source code:
import spacy
import time
import matplotlib.pyplot as plt
nlp = spacy.load("en_core_web_sm")
# Sample text (assumed; same as above)
text = "running runs runner better best tried tries runningness"
start_time = time.time()
doc = nlp(text)
lemmatized_words = [token.lemma_ for token in doc]
spacy_time = time.time() - start_time
print("spaCy Lemmatized Words:", lemmatized_words)
# Compare the NLTK stemmers (timed above) with spaCy lemmatization
method_names = stemmer_names + ["spaCy Lemmatizer"]
method_times = stemmer_times + [spacy_time]
plt.bar(method_names, method_times)
plt.xlabel("Method")
plt.ylabel("Time (seconds)")
plt.show()
Output:
spaCy Lemmatized Words: ['run', 'run', 'runner', 'well', 'well', 'try', 'try', 'runningness']
This script performs lemmatization on a sample text using NLTK and SpaCy, compares the results, and measures the
time taken for each method. A bar chart is generated to visualize and compare the performance of the two
lemmatization techniques.
Source code:
import time
import nltk
import spacy
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nlp = spacy.load("en_core_web_sm")
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')  # Download punkt_tab as suggested by the error message
nltk_lemmatizer = WordNetLemmatizer()
def nltk_lemmatization(text):
    words = word_tokenize(text)
    return [nltk_lemmatizer.lemmatize(w) for w in words]
def spacy_lemmatization(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc]
text = "The striped bats are hanging on their feet and eating best fishes."
start_time = time.time()
nltk_result = nltk_lemmatization(text)
nltk_time = time.time() - start_time
start_time = time.time()
spacy_result = spacy_lemmatization(text)
spacy_time = time.time() - start_time
# Print Results
print("NLTK Lemmatization:", nltk_result)
print("SpaCy Lemmatization:", spacy_result)
# Plot the timing comparison
plt.bar(['NLTK', 'SpaCy'], [nltk_time, spacy_time])
plt.xlabel('Lemmatization Method')
plt.ylabel('Time (seconds)')
plt.title('Performance of NLTK vs SpaCy Lemmatization')
plt.show()
Output:
[nltk_data] Downloading package punkt to /root/nltk_data...
NLTK Lemmatization: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'and', 'eating', 'best', 'fish', '.']
SpaCy Lemmatization: ['the', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'and', 'eat', 'good', 'fish', '.']
EXERCISE 8
Aim:
Write a program to implement various lemmatization techniques and prepare a chart with the performance of each
method.
Theory:
Lemmatization is a text normalization technique in Natural Language Processing (NLP) that reduces a word to its
base or dictionary form, called a lemma. Unlike stemming, which simply truncates words, lemmatization considers the
word’s context and part of speech to produce meaningful root forms.
NLTK Lemmatization
o NLTK (Natural Language Toolkit) provides the WordNetLemmatizer, which uses the WordNet
lexical database to find the base form of words.
o By default, it assumes the word is a noun unless a part-of-speech (POS) tag is specified.
o Example: “running” → “running” (without POS), but “running (verb)” → “run”.
SpaCy Lemmatization
o spaCy performs lemmatization as part of its processing pipeline, using each token's part of speech to return its dictionary form.
o Example: "bats" → "bat", "are" → "be".
Performance Considerations:
NLTK is lightweight but may require explicit POS tagging for better accuracy.
SpaCy is generally faster for large texts as it is optimized in Cython, but loading the model initially can take
extra time.
Objectives:
1. To understand the concept of lemmatization and its importance in text preprocessing.
2. To implement lemmatization using two different libraries: NLTK (WordNetLemmatizer) and SpaCy.
3. To compare the output quality of lemmatization between the two methods.
4. To measure and compare the execution time of NLTK vs SpaCy for performance evaluation.
5. To visualize the comparison results using a bar chart.
Source Code:
import time
import nltk
import spacy
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nlp = spacy.load("en_core_web_sm")
nltk.download('punkt')
nltk.download('wordnet')
nltk_lemmatizer = WordNetLemmatizer()
def nltk_lemmatization(text):
    words = word_tokenize(text)
    return [nltk_lemmatizer.lemmatize(w) for w in words]
def spacy_lemmatization(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc]
text = "The striped bats are hanging on their feet and eating best fishes."
start_time = time.time()
nltk_result = nltk_lemmatization(text)
nltk_time = time.time() - start_time
start_time = time.time()
spacy_result = spacy_lemmatization(text)
spacy_time = time.time() - start_time
# Print Results
print("Original Text:", text)
print("NLTK Lemmatization:", nltk_result)
print("SpaCy Lemmatization:", spacy_result)
# Plot the timing comparison
plt.bar(['NLTK', 'SpaCy'], [nltk_time, spacy_time])
plt.xlabel('Lemmatization Method')
plt.ylabel('Time (seconds)')
plt.show()
Output:
Original Text: The striped bats are hanging on their feet and eating best fishes.
NLTK Lemmatization: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'and', 'eating', 'best', 'fish', '.']
SpaCy Lemmatization: ['the', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'and', 'eat', 'good', 'fish', '.']
EXERCISE 9
i)Aim:
Write a program that implements Conditional Frequency Distributions (CFD) for a given corpus and displays the most
frequent words for each category.
Description:
A Conditional Frequency Distribution (CFD) is a very important concept in Natural Language Processing (NLP). It is
essentially a collection of frequency distributions, but instead of having a single frequency distribution for all the data,
it maintains separate frequency distributions for different conditions. Each condition has its own frequency
distribution of words. In simpler terms, a CFD allows us to group words by some condition and then analyze the
frequency of words inside each group. In the case of the Reuters corpus, every document belongs to one or more
categories such as trade, crude, money-fx, grain, etc. A CFD can help us find the most frequent words in each
category. For example, the "trade" category may frequently use words like import, export, agreement, deficit, while
the "crude" category may use oil, barrel, OPEC, price, etc. This is extremely useful because language usage changes
depending on context or topic. By using a CFD, we can see how different fields of news reporting emphasize different
sets of words. This type of analysis has multiple applications such as domain-specific vocabulary analysis, topic
modeling, text classification, and linguistic research. Another practical example could be in spam filtering: one
condition could be "spam emails" and the other could be "non-spam emails," and the CFD would then show us which
words occur more frequently in spam vs. non-spam, helping in automatic detection. Thus, CFDs give a structured way
to analyze word usage differences based on categories, genres, or other conditions within a corpus.
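As a minimal illustration of the data structure itself (a toy example with made-up labels, separate from the Reuters program below), a ConditionalFreqDist can be built directly from (condition, word) pairs:
from nltk import ConditionalFreqDist
pairs = [('spam', 'free'), ('spam', 'win'), ('spam', 'free'), ('ham', 'meeting'), ('ham', 'project')]
cfd = ConditionalFreqDist(pairs)
print(cfd['spam'].most_common())  # [('free', 2), ('win', 1)]
print(cfd['ham'].most_common())   # [('meeting', 1), ('project', 1)]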
Source code:
import nltk
from nltk.corpus import reuters
from nltk import ConditionalFreqDist
# Load corpus
nltk.download('reuters')
# Condition = category, events = words occurring in that category
cfd = ConditionalFreqDist((cat, word.lower()) for cat in reuters.categories() for word in reuters.words(categories=cat))
# Display the most frequent words for a few categories
for category in ['trade', 'crude', 'grain']:
    print(category, ":", cfd[category].most_common(10))
Output:
ii)
Aim:
To define a conditional frequency distribution over the names corpus that shows which initial letters are more frequent
for male names versus female names.
Description:
The names corpus in NLTK is a standard dataset that contains a large collection of common male and female names.
By applying a Conditional Frequency Distribution (CFD), we can analyze the distribution of the first letters of names
across genders. This means that the condition will be the gender (male or female), and the frequency distribution will
count how many names start with each letter. This allows us to find patterns such as which letters are more popular as
the starting letters of male names compared to female names. For example, male names frequently begin with J such
as John, James, Jack, Jason, while female names may more often start with M such as Mary, Monica, Michelle,
Megan. This analysis is useful in several ways. First, it provides sociolinguistic insights into naming conventions.
Second, it is used in machine learning for gender prediction models, where the first letter of a name is often
considered a feature for classification. Third, it can reveal cultural or linguistic trends in naming. For example, names
starting with certain letters may be more common in one gender across many cultures. If we run this program, we may
find that letters like J, M, and A dominate as starting letters, while some letters such as Q or X are rare. By using a
CFD, we can not only compare male vs female distributions but also visualize the overlap and uniqueness of naming
patterns. Thus, conditional frequency analysis of the names corpus provides both statistical and practical insights into
human naming systems.
Source code:
import nltk
from nltk import ConditionalFreqDist
from nltk.corpus import names
nltk.download('names')
male_names = names.words('male.txt')
female_names = names.words('female.txt')
cfd = ConditionalFreqDist()
for name in male_names:
    cfd['male'][name[0].lower()] += 1
for name in female_names:
    cfd['female'][name[0].lower()] += 1
print("Most frequent initial letters of male names:")
for letter, frequency in cfd['male'].most_common(10):
    print(f"{letter}: {frequency}")
print("Most frequent initial letters of female names:")
for letter, frequency in cfd['female'].most_common(10):
    print(f"{letter}: {frequency}")
Output:
iii)
Aim:
To find all the four-letter words in a corpus and, using frequency distribution, display them in decreasing order of
frequency.
Description:
A Frequency Distribution (FreqDist) is a tool that counts how many times each word appears in a corpus. If we apply
this only to four-letter words, we can identify the most frequently used short words in the dataset. In this case, we are
analyzing the Reuters corpus. Short words, especially four-letter words, can be very interesting because they often
include both content words and function words. Content words may include domain-specific terms like bank, deal,
fund, oil, while function words may include with, from, this, that. By filtering out only four-letter words, we restrict
our analysis to a manageable set and can see which of them dominate in the corpus. For example, in financial news,
words like rate, bank, fund might appear repeatedly. In contrast, in political news, words like vote, plan, deal may
appear more often. The frequency distribution not only lists these words but also orders them from most frequent to
least frequent, allowing us to immediately identify which short words are most common. This has applications in
building text models, creating dictionaries of frequent terms, or studying domain-specific language usage.
Additionally, word length analysis can help in stylometric studies where the length of words is linked to the style of
writing. Thus, by applying FreqDist to four-letter words, we gain a focused yet powerful insight into the most
common short words in the corpus.
Source code:
import nltk
from nltk.corpus import reuters
from nltk.probability import FreqDist
# Load corpus
nltk.download('reuters')
words = reuters.words()
# Keep only the four-letter words
four_letter_words = [word.lower() for word in words if len(word) == 4 and word.isalpha()]
fdist = FreqDist(four_letter_words)
# Display four-letter words in decreasing order of frequency
for word, frequency in fdist.most_common(20):
    print(f"{word}: {frequency}")
EXERCISE 10
Aim:
Implement a program that assigns grammatical tags (PoS) to words in a text corpus.
Description:
This program demonstrates how to perform PoS tagging on real-world text using NLTK. The Reuters corpus (a
collection of news articles) is chosen as the input data. First, the raw text is extracted from the corpus and split into
individual tokens (words) using word_tokenize(). Then, the pos_tag() function assigns a Part-of-Speech tag to each
word. PoS tagging is a fundamental step in NLP since it helps machines understand the grammatical function of
words, e.g., whether a word is a noun, verb, adjective, or determiner. This information is crucial for tasks such as text
classification, question answering, sentiment analysis, machine translation, and speech recognition.
Example Output: [('The', 'DT'), ('stock', 'NN'), ('market', 'NN'), ('rose', 'VBD'), ('today', 'NN')]
Source code:
import nltk
from nltk.corpus import reuters
from nltk.tokenize import word_tokenize
nltk.download('reuters')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sentence = reuters.raw(reuters.fileids()[0]) # Get the first document from the Reuters corpus
words = word_tokenize(sentence)
pos_tags = nltk.pos_tag(words)
print(pos_tags)
Output:
(ii) Program to identify the word with the greatest number of distinct tags
Aim:
Find the word in the corpus that occurs with the largest variety of PoS tags.
Description:
Words in English are often ambiguous, meaning the same word can have different grammatical roles depending on
the context. This program builds a dictionary where each word is stored along with all the PoS tags it has been
assigned in the corpus. By comparing all words, it identifies the one that has been tagged with the highest number of
distinct tags. This gives insight into how flexible a word is in the English language. For instance, the word “book”
can function as a noun (NN) → “I read a book.”, or as a verb (VB) → “I will book a flight.” Similarly, words like
“time” or “run” often appear with different tags. Understanding such words is important in NLP because they require
context-based disambiguation.
Example: Word: “book” → Tagged as NN (Noun) → “I read a book.”, Tagged as VB (Verb) → “I will book a
ticket.”
Output: Word with maximum distinct tags: book | Distinct tags: {'NN', 'VB'}
Source code:
from collections import defaultdict
word_tags = defaultdict(set)
for word, tag in pos_tags:
    word_tags[word].add(tag)
max_word = max(word_tags, key=lambda w: len(word_tags[w]))
tags = word_tags[max_word]
print("Word with maximum distinct tags:", max_word, "| Distinct tags:", tags)
Output:
(iii) Program to display PoS tags in decreasing order of frequency
Aim:
To analyse and display PoS tags in the corpus according to their frequency of occurrence.
Description:
This program counts the frequency of each PoS tag across the corpus and sorts them from most frequent to least
frequent. This statistical analysis provides insight into which parts of speech dominate natural language text. For
example, in news articles, nouns (NN, NNS) appear frequently because they describe people, places, and things.
Prepositions (IN) such as “in”, “on”, “of” are also frequent because they link nouns to other parts of the sentence.
Determiners (DT) such as “the” or “a” are common as they specify nouns. By listing the tags in decreasing order, we
can identify the 20 most frequent tags and interpret what they represent. This type of analysis is useful in corpus
linguistics, text analytics, and building NLP models.
Example output: NN: 1200 (Noun, singular), IN: 900 (Preposition/subordinating conjunction – "in", "on", "of"), DT: 850 (Determiner – "the", "a", "an"), NNS: 700 (Noun, plural), JJ: 650 (Adjective).
Source code:
from collections import Counter
# Count how often each PoS tag occurs
tag_counts = Counter(tag for word, tag in pos_tags)
sorted_tags = tag_counts.most_common()
for tag, count in sorted_tags[:20]:
    print(f"{tag}: {count}")
Output:
(iv) Program to identify which tags are most commonly found after nouns
Aim:
To study the syntactic behavior of nouns by finding which tags most frequently follow them.
Description:
Nouns are central to sentences, and the words that come immediately after them help form phrases and clauses. This
program investigates sentence structure by checking which tags (prepositions, verbs, adjectives, or other nouns) most
often appear immediately after nouns in the corpus. Since nouns are core elements of sentences, the words that follow
them reveal important grammar patterns. For example, in English:
A noun may be followed by a verb (VBZ/VBD): “dog barks”, “car broke down”.
A noun may be followed by another noun (NN): “car engine”, “city center”.
By counting and ranking the tags that follow nouns, the program shows the most common structures in English. This
analysis is important for parsing, grammar learning, and sentence generation in NLP.
Example:
Sentences: “The dog (NN) barks (VBZ) loudly.”, “A book (NN) on (IN) the table.”, “Car (NN) engine (NN).”
Output:
IN: 500 (Prepositions – “in”, “on”, “of”), VBZ: 400 (Verb, 3rd person singular – “is”, “runs”, “barks”), NN: 350
(Another noun – “car engine”), JJ: 200 (Adjective – “time immemorial”)
Source code:
from collections import Counter
# Count the tag that immediately follows each noun (NN, NNS, NNP, NNPS)
tag_counts = Counter(pos_tags[i + 1][1] for i in range(len(pos_tags) - 1) if pos_tags[i][1].startswith('NN'))
sorted_tags = tag_counts.most_common()
for tag, count in sorted_tags:
    print(f"{tag}: {count}")
Output:
EXERCISE 11
Aim:
To preprocess text data using spaCy and NLTK, and then compute TF-IDF (Term Frequency–Inverse Document
Frequency) values using scikit-learn in order to evaluate the importance of words in a given corpus of documents.
Description:
This program demonstrates a basic Natural Language Processing (NLP) pipeline where a set of documents (corpus)
is transformed into numerical features using the TF-IDF technique.
1. Libraries Used:
o NLTK and spaCy are used for text pre-processing, and scikit-learn's TfidfVectorizer is used to compute the TF-IDF matrix.
2. Steps:
o Preprocess each document by removing stop words, punctuation, and digits.
o Fit and transform the preprocessed corpus to generate the TF-IDF matrix.
o Extract feature names (words) and their corresponding TF-IDF scores for each document.
o Print the results, showing the importance of each word in the corpus.
3. Purpose of TF-IDF:
o Term Frequency (TF): Measures how often a word occurs in a document.
o Inverse Document Frequency (IDF): Reduces the weight of common words across documents.
o TF-IDF: Highlights important words in a document that are not too common across the whole corpus (a note on the exact formula follows this list).
4. Applications:
o Keyword extraction and information retrieval.
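For reference, the textbook weighting is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalisation by default, so the scores it reports differ slightly from this raw formula.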
Source Code:
!pip install nltk spacy scikit-learn
import nltk
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
nlp = spacy.load("en_core_web_sm")
# Sample corpus (assumed; the original documents are not shown)
corpus = [
    "Natural language processing makes computers understand human language.",
    "This document is about machine learning and text mining.",
    "Text mining extracts useful information from a document collection.",
    "spaCy and NLTK are popular NLP libraries.",
]
def preprocess(text):
    doc = nlp(text)
    return " ".join([token.text for token in doc if not token.is_stop and not token.is_punct and not token.is_digit])
preprocessed_corpus = [preprocess(text) for text in corpus]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_corpus)
# Get feature names (words) corresponding to the columns in the TF-IDF matrix
feature_names = vectorizer.get_feature_names_out()
# Display the TF-IDF scores for each document
for i in range(tfidf_matrix.shape[0]):
    print(f"\nDocument {i + 1}:")
    for j in range(tfidf_matrix.shape[1]):
        score = tfidf_matrix[i, j]
        if score > 0:
            print(f"Word: {feature_names[j]}, TF-IDF Score: {score:.4f}")
Output:
Document 1:
Document 2:
Document 3:
Document 4:
Word: document, TF-IDF Score: 0.4002
EXERCISE 12
12) Write a program to implement chunking and chinking for any corpus
Aim:
To implement chunking and chinking on a given text corpus using NLTK.
Description:
Chunking refers to the process of segmenting and labeling a sentence into "chunks" that correspond to
various syntactic components like noun phrases (NP), verb phrases (VP), etc.
Chinking refers to the process of removing or excluding parts from a chunk based on patterns or conditions.
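A quick note on the grammar notation used in the code below (the specific tags here are only an illustration): curly braces {...} state what to pull into a chunk, while inverted braces }...{ state what to chink, i.e. remove, from an existing chunk.
illustration_grammar = r"""
NP: {<.*>+}        # first chunk every sequence of tags
    }<VB.*|IN>{    # then chink verbs and prepositions back out of the chunks
"""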
Source Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
try:
    nltk.data.find('tokenizers/punkt_tab/english')
except LookupError:
    nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
# (the third sentence below is reconstructed from the output further down)
corpus = """
John and Mary are going to the market. They will buy some vegetables and fruits.
Yesterday, I saw them at the market enjoying the weather.
"""
tokens = word_tokenize(corpus)
tagged_tokens = pos_tag(tokens)
# Chunking grammar (assumed): noun phrases and verb phrases
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>}
VP: {<VB.*>}
"""
# Chinking grammar (assumed): chunk everything, then chink out function words and verbs
chink_grammar = r"""
NP: {<.*>+}
    }<VB.*|MD|TO|IN|CC|DT|PRP|NNP|NNS|,|\.>{
"""
chunk_parser = RegexpParser(chunk_grammar)
chink_parser = RegexpParser(chink_grammar)
# Apply chunking
chunked = chunk_parser.parse(tagged_tokens)
print("Chunked Sentence:")
print(chunked)
# Apply chinking
chinked = chink_parser.parse(tagged_tokens)
print("\nChinked Sentence:")
print(chinked)
Output:
Chunked Sentence:
(S
John/NNP
and/CC
Mary/NNP
(VP are/VBP)
(VP going/VBG)
to/TO
./.
They/PRP
will/MD
(VP buy/VB)
some/DT
vegetables/NNS
and/CC
fruits/NNS
./.
(NP Yesterday/NN)
,/,
I/PRP
(VP saw/VBD)
them/PRP
./.)
Chinked Sentence:
(S
John/NNP
and/CC
Mary/NNP
are/VBP
going/VBG
to/TO
./.
They/PRP
will/MD
buy/VB
some/DT
vegetables/NNS
and/CC
fruits/NNS
./.
(NP Yesterday/NN)
,/,
I/PRP
saw/VBD
them/PRP
at/IN
enjoying/VBG
(NP weather/NN)
./.)
EXERCISE - 13
(i) Write a program to identify the mis-spelled words in a given paragraph.
SOURCE CODE:
from spellchecker import SpellChecker
def find_misspelled(paragraph):
    spell = SpellChecker()
    # Tokenize into words
    words = paragraph.split()
    # Find misspelled words
    misspelled = spell.unknown(words)
    return misspelled
# Example usage
paragraph = """Natural langauge processing is a feld of Artificial Intelligense
that deals with the interacton between computers and humans."""
misspelled_words = find_misspelled(paragraph)
print("Misspelled words:", misspelled_words)
OUTPUT:
Misspelled words: {'humans.', 'feld', 'langauge', 'intelligense', 'interacton'}
(ii) Write a program to prepare a table with frequency of mis-spelled tags for any given text.
AIM
Write a Program to prepare a table with frequency of mis-spelled tags for any given text
OBJECTIVES:
1. Identify Mis-spelled Words:
To analyze a given text and detect all words that are potentially mis-spelled by comparing them against a standard
dictionary or word list.
2. Count Frequency:
To calculate how many times each mis-spelled word occurs in the given text.
3. Generate Frequency Table:
To prepare and present a table or list showing each mis-spelled word alongside its frequency count, helping in easy
analysis of common spelling mistakes.
4. Enhance Text Quality:
To assist users in identifying frequently occurring spelling errors so that they can improve the quality and correctness
of the text.
5. Facilitate Proofreading:
To provide a tool that supports efficient proofreading and editing by highlighting the most common spelling errors for
focused correction.
THEORY:
In natural language processing and text analysis, identifying spelling mistakes is a fundamental task that helps
improve the readability and correctness of written content. Spelling errors can arise from typographical mistakes,
unfamiliarity with correct word forms, or phonetic spelling. Detecting these errors requires a comparison of words in
the text against a reference dictionary or vocabulary.
Step 1: Tokenization
The input text is first split into individual words, often called tokens. This process is known as tokenization and helps
analyze the text at the word level.
Step 2: Spell Checking
Each tokenized word is checked against a dictionary of correctly spelled words. This dictionary could be built-in or
imported from external libraries like pyspellchecker. Words that do not match any entry in the dictionary are
considered potential spelling mistakes (mis-spelled words).
Step 3: Frequency Calculation
After identifying the mis-spelled words, the program counts how many times each of these words appears in the text.
This frequency calculation is essential for understanding which spelling mistakes occur most frequently and might
need priority correction.
Step 4: Display of Results
The program generates a frequency table that lists each unique mis-spelled word along with the number of
occurrences in the text. This table serves as a concise summary for users or editors to focus on specific errors and
improve the overall text quality.
SOURCE CODE:
from collections import Counter
from spellchecker import SpellChecker
def misspelled_frequency_table(paragraph):
    spell = SpellChecker()
    words = paragraph.split()
    misspelled = spell.unknown(words)
    # Count how often each mis-spelled word occurs (case-insensitive)
    return Counter(word.lower() for word in words if word.lower() in misspelled)
# Example usage
paragraph = """Natural langauge processing is a feld of Artificial Intelligense.
Langauge models are used in NLP and Intelligense systems."""
freq_table = misspelled_frequency_table(paragraph)
# Print as table
print("Mis-spelled Word | Frequency")
print("-----------------|-----------")
for word, count in freq_table.items():
    print(f"{word:<16} | {count}")
OUTPUT:
EXERCISE - 14
Aim
To write a Python program that implements various NLP pre-processing techniques required to prepare text data for
further Natural Language Processing (NLP) tasks.
Objective
To understand the importance of pre-processing in NLP.
To apply the following pre-processing techniques: Lowercasing, Sentence Tokenization, Word Tokenization, Stopword Removal, Punctuation Removal, Lemmatization, and Stemming.
To make the text ready for advanced NLP applications like Sentiment Analysis, Text Classification, or Machine
Translation.
Theory
Natural Language Processing (NLP) involves interaction between computers and human language. Since raw text
contains noise, inconsistencies, and redundancies, pre-processing is a crucial step before applying machine learning or
deep learning models.
The main pre-processing steps applied in this program are:
Lowercasing – Converts all text to lower case so that the same word is not counted in different forms.
Sentence and Word Tokenization – Splits the text into sentences and words.
Punctuation Removal – Strips symbols that carry no lexical meaning.
Stopword Removal – Removes common words like is, the, an, which do not add meaning.
Lemmatization – Reduces words to their dictionary base form (e.g., languages → language).
Stemming – Cuts words down to their root form (e.g., processing → process).
These steps reduce complexity, remove redundancy, and improve model accuracy.
SOURCE CODE :
import nltk
import re
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Sample text (casing assumed; reconstructed from the tokenized output below)
text = "Natural Language Processing (NLP) is a sub-field of Artificial Intelligence. It deals with analyzing, understanding, and generating human languages."
print("Original Text:\n", text)
# 1. Lowercasing
text = text.lower()
print("Lowercased Text:\n", text)
# 2. Sentence Tokenization
sent_tokens = sent_tokenize(text)
print("Sentence Tokenization:\n", sent_tokens)
# 3. Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokenization:\n", word_tokens)
# 4. Punctuation Removal
clean_tokens = [re.sub(r'[^a-z0-9]', '', w) for w in word_tokens]
clean_tokens = [w for w in clean_tokens if w]
print("After Punctuation Removal:\n", clean_tokens)
# 5. Removing Stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in clean_tokens if w not in stop_words]
print("After Stopword Removal:\n", filtered_words)
# 6. Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(w) for w in filtered_words]
print("After Lemmatization:\n", lemmatized_words)
# 7. Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in lemmatized_words]
print("After Stemming:\n", stemmed_words)
Output:
Original Text:
Lowercased Text:
Sentence Tokenization:
Word Tokenization:
['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'sub-field', 'of', 'artificial', 'intelligence', '.', 'it', 'deals', 'with',
'analyzing', ',', 'understanding', ',', 'and', 'generating', 'human', 'languages', '.']
After Punctuation Removal:
['natural', 'language', 'processing', 'nlp', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', 'it', 'deals', 'with', 'analyzing',
'understanding', 'and', 'generating', 'human', 'languages']
After Stopword Removal:
['natural', 'language', 'processing', 'nlp', 'subfield', 'artificial', 'intelligence', 'deals', 'analyzing', 'understanding',
'generating', 'human', 'languages']
After Lemmatization:
['natural', 'language', 'processing', 'nlp', 'subfield', 'artificial', 'intelligence', 'deal', 'analyzing', 'understanding',
'generating', 'human', 'language']
After Stemming:
['natur', 'languag', 'process', 'nlp', 'subfield', 'artifici', 'intellig', 'deal', 'analyz', 'understand', 'gener', 'human', 'languag']
Case Study – 2
Auto-Correction of Spellings in Text using NLP.
1. Introduction
In Natural Language Processing (NLP), spelling errors are common when users type text. Automatic spelling
correction improves the quality of text data for applications like search engines, chatbots, and document editing tools.
This case study demonstrates how to implement an auto-correction system using Python.
2. Problem Statement
Users often make spelling mistakes while typing. These errors affect the readability and performance of NLP-based
applications. The goal is to develop a Python program that automatically detects and corrects spelling errors in a given
text.
3. Methodology
In this case study, we use the TextBlob library, which provides a simple and efficient way to correct spelling
mistakes using probabilistic models.
Input Collection
The user supplies a piece of text that may contain spelling mistakes. TextBlob treats the input as a text document and allows NLP operations on it.
Spelling Correction
TextBlob compares each word in the text with a large dictionary of valid words.
Using probability and word frequency models, it selects the most likely correct spelling for each word.
Example:
o “lik” → “like”
o “pythn” → “python”
Output Generation
Both original text and corrected text are shown for comparison.
4. Example
5. Implementation
from textblob import TextBlob
def autocorrect_text(text):
    blob = TextBlob(text)
    corrected_text = blob.correct()
    return str(corrected_text)
if __name__ == "__main__":
    text = input("Enter text to auto-correct: ")
    print("Original Text:", text)
    print("Corrected Text:", autocorrect_text(text))
7. Conclusion
This case study demonstrates how spell correction can be achieved using NLP tools. By using the TextBlob library, we
can automatically correct spelling mistakes and improve the overall quality of textual data. This is useful in real-world
applications like chatbots, search engines, and text editors.