NLP Lab

The document outlines exercises for installing and exploring NLTK and spaCy, two popular NLP libraries in Python. It covers various features such as tokenization, stopword removal, lemmatization, and named entity recognition, along with practical coding examples. Additionally, it includes exercises for generating n-grams and analyzing text corpora, providing source code and expected outputs for each task.
EXERCISE -1

AIM: Installation and exploring features of NLTK and spaCy tools. Download Word Cloud and few corpora.

1. Installing NLTK and spaCy

Description: NLTK (Natural Language Toolkit) is a widely used Python library for NLP tasks such as tokenization,
stopword removal, lemmatization, sentiment analysis, and corpus exploration. spaCy is an advanced NLP library
designed for speed and efficiency, providing features like POS tagging, dependency parsing, and NER (Named Entity
Recognition). WordCloud is a visualization tool that displays the most frequent words in a given text corpus.
Example: We install these libraries and download the required models/corpora.
Source Code :
pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
pip install spacy
python -m spacy download en_core_web_sm
Output: No direct output, only installation and download messages.

2. Exploring NLTK Features

2(a) Tokenization

Description: Tokenization is the process of splitting text into smaller units like words (word tokenization) or
sentences (sentence tokenization).
Example:

Input: "Natural language processing is interesting." → Tokens: ['Natural', 'language', 'processing', 'is', 'interesting',
'.'].
Source Code :
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Natural language processing is an interesting field of study."
words = word_tokenize(text)
sentences = sent_tokenize(text)
print("Words:", words)
print("Sentences:", sentences)
Output:
Words: ['Natural', 'language', 'processing', 'is', 'an', 'interesting', 'field', 'of', 'study', '.']
Sentences: ['Natural language processing is an interesting field of study.']

2(b) Stopwords Removal

Description: Stopwords are common words (like “is”, “the”, “and”) that are usually removed in text processing since
they don’t add meaningful information.
Example: From the sentence above, stopwords are removed → ['Natural', 'language', 'processing', 'interesting', 'field',
'study'].
Source Code :

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)
Output:
Filtered Words: ['Natural', 'language', 'processing', 'interesting', 'field', 'study', '.']

2(c) Lemmatization

Description: Lemmatization reduces words to their root/base form (lemma), considering grammar.
Example: "studies" → "study", "running" → "run".
Source Code :
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("Lemmatized Words:", lemmatized_words)
Output:
Lemmatized Words: ['Natural', 'language', 'processing', 'is', 'an', 'interesting', 'field', 'of', 'study', '.']

2(d) Corpora (Names Corpus)

Description: Corpora are large collections of texts available in NLTK for research and experimentation. Names
corpus contains male and female names → useful for gender classification tasks.
Source Code :
from nltk.corpus import names
print(names.words()[:20])
Output:
['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie', 'Abby', 'Abbye', 'Abigael', 'Abigail', 'Abigale', 'Abra', 'Ada',
'Adah', 'Adaline', 'Adan', 'Adara', 'Adda', 'Addi', 'Addia']

3. Exploring spaCy Features

3(a) Loading the Model

Description: spaCy provides pre-trained language models for efficient NLP. Loading the model is the first step to use
its features.
Example: en_core_web_sm → a small English model with vocabulary, syntax, and entities.
Source Code :
import spacy
nlp = spacy.load('en_core_web_sm')
Output: Loads the spaCy model (no printed output).

3(b) Tokenization and Lemmatization

Description: spaCy allows tokenization and lemmatization simultaneously.


Example: "processing" → "processing" (token), "be" → "be" (lemma).
Source Code :
doc = nlp("Natural language processing with spaCy is efficient.")
tokens = [token.text for token in doc]
lemmas = [token.lemma_ for token in doc]
print("Tokens:", tokens)
print("Lemmas:", lemmas)
Output:

Tokens: ['Natural', 'language', 'processing', 'with', 'spaCy', 'is', 'efficient', '.']
Lemmas: ['Natural', 'language', 'processing', 'with', 'spaCy', 'be', 'efficient', '.']

3(c) Named Entity Recognition (NER)

Description: NER identifies real-world entities such as names, organizations, and dates.
Example: Text: "Barack Obama was the president of USA." → Entities: Barack Obama (PERSON), USA (GPE).
Source Code:
for entity in doc.ents:
    print(entity.text, entity.label_)
Output:
(No entities found, since the sentence from 3(b) contains no named entities.)
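To see entities detected, the example sentence from the description can be processed directly. A minimal sketch (the variable name doc_ner is illustrative, and the exact labels depend on the en_core_web_sm model version):

doc_ner = nlp("Barack Obama was the president of USA.")
for entity in doc_ner.ents:
    print(entity.text, entity.label_)
# Typically prints something like:
# Barack Obama PERSON
# USA GPE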

3(d) Part-of-Speech Tagging

Description: POS tagging assigns grammatical categories (noun, verb, adjective) to words.
Example: "efficient" → ADJ, "language" → NOUN.
Source Code :
for token in doc:
    print(token.text, token.pos_)
Output:
Natural ADJ
language NOUN
processing NOUN
with ADP
spaCy PROPN
is AUX
efficient ADJ
. PUNCT

4. Word Cloud Generation

Description: Word Clouds visually represent the frequency of words; bigger words mean higher frequency.
Example: From filtered words, "Natural", "language", "study" appear prominently.
Source Code :
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(width=800, height=400).generate(" ".join(filtered_words))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Output: Graphical word cloud (words like Natural, language, processing, field, study).
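Optionally, the generated cloud can also be written to an image file (a small sketch; the file name is illustrative):

wordcloud.to_file("wordcloud.png")  # saves the rendered cloud as a PNG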

5. Downloading and Exploring NLTK Corpora

5(a) Gutenberg Corpus

Description: A collection of classic literary texts from Project Gutenberg, useful for linguistic analysis.
Example: austen-emma.txt → “Emma” by Jane Austen.
Source Code :
from nltk.corpus import gutenberg
print(gutenberg.fileids())
text = gutenberg.raw('austen-emma.txt')

print(text[:500])
Output:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
Emma by Jane Austen 1816
VOLUME I
CHAPTER I
Emma Woodhouse, handsome, clever, and rich, with a comfortable home...

5(b) Movie Reviews Corpus

Description: A corpus containing movie reviews labeled as positive or negative → used for sentiment analysis.
Example: First review snippet shows a negative review about a teen couple’s story.
Source Code :
from nltk.corpus import movie_reviews
print(movie_reviews.fileids())
file_id = movie_reviews.fileids()[0]
print(movie_reviews.raw(file_id)[:300])
Output:
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', ... , 'pos/cv999_13106.txt']
plot : two teen couples go to a church party , drink and then drive .
they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life ...

EXERCISE -2
2(i) Write a program to implement Word, Sentence, and Paragraph Tokenizers.

Aim: Write a program that implements word tokenizer, sentence tokenizer, and paragraph tokenizer using NLTK.

Description:

Word Tokenizer : Splits text into individual words.

Sentence Tokenizer : Splits text into individual sentences.

Paragraph Tokenizer: Splits text into paragraphs (here, by splitting on newline characters; see the note after the output below).

SOURCE CODE:

Program (Using NLTK):

import nltk

from nltk.tokenize import word_tokenize, sent_tokenize

# Ensure you have downloaded required resources

nltk.download('punkt')

# Sample text

text = """Natural language processing (NLP) is a field of AI

concerned with the interaction between computers and human language.

It involves tasks like language translation, sentiment analysis, and

text summarization. NLP uses various techniques such as

tokenization, parsing, and machine learning."""

# Word Tokenizer (NLTK)

words_nltk = word_tokenize(text)

print("Words Tokenized (NLTK):", words_nltk)

# Sentence Tokenizer (NLTK)

sentences_nltk = sent_tokenize(text)

print("Sentences Tokenized (NLTK):", sentences_nltk

# Paragraph Tokenizer (NLTK)

paragraphs_nltk = text.split('\n')

print("Paragraphs Tokenized (NLTK):", paragraphs_nltk)

Output:
Word Tokenizer Output:

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a',

'field', 'of', 'AI', 'concerned', 'with', 'the', 'interaction',

'between', 'computers', 'and', 'human', 'language', '.',

'It', 'involves', 'tasks', 'like', 'language', 'translation', ',',

'sentiment', 'analysis', ',', 'and', 'text', 'summarization', '.',

'NLP', 'uses', 'various', 'techniques', 'such', 'as',

'tokenization', ',', 'parsing', ',', 'and', 'machine', 'learning', '.']

Sentence Tokenizer Output:

['Natural language processing (NLP) is a field of AI concerned with the interaction between computers and human
language.',

'It involves tasks like language translation, sentiment analysis, and text summarization.',

'NLP uses various techniques such as tokenization, parsing, and machine learning.']

Paragraph Tokenizer Output:

['Natural language processing (NLP) is a field of AI',

'concerned with the interaction between computers and human',

'language.',

'It involves tasks like language translation, sentiment analysis, and',

'text summarization. NLP uses various techniques such as',

'tokenization, parsing, and machine learning.']
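Note that splitting on a single newline treats every wrapped line as a paragraph. If paragraphs are separated by blank lines, a stricter split can be used; a minimal sketch (assuming blank-line-delimited paragraphs):

import re
# Split on one or more blank lines and drop empty fragments
paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
print("Paragraphs (blank-line separated):", paragraphs)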

2(ii). Program to Check Number of Words and Distinct Words in Corpus

Aim: Write a program that counts the total number of words and distinct words in a given text corpus using spaCy.

Description:

The program uses spaCy’s NLP pipeline to tokenize words.

Stop words and punctuations are excluded.

The total number of words and distinct words are then calculated.

Program (Using spaCy):

import spacy

# Load spaCy English model

nlp = spacy.load('en_core_web_sm')

# Sample text

text = """Natural language processing (NLP) is a field of AI

concerned with the interaction between computers and human

language.It involves tasks like language translation, sentiment analysis, and

text summarization."""

# Create a spaCy document

doc = nlp(text)

# Tokenizing words (excluding stop words and punctuations)

words_spacy = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct]

# Number of words

num_words_spacy = len(words_spacy)

print("Total Words (spaCy):", num_words_spacy)

# Number of distinct words

distinct_words_spacy = len(set(words_spacy))

print("Distinct Words (spaCy):", distinct_words_spacy)

Output:

Total Words (spaCy): 15

Distinct Words (spaCy): 15

EXERCISE -3
(i) Program to Implement User-Defined and Pre-Defined Functions to Generate N-Grams

Aim

Implement user-defined and pre-defined functions using NLTK and spaCy libraries in Python to generate n-grams
(Unigrams, Bigrams, Trigrams, and N-Grams) for a given text.

Objective

1. Understand the concept of n-grams in Natural Language Processing (NLP).


2. Implement user-defined functions to generate n-grams.
3. Use pre-defined library functions from NLTK and spaCy for generating n-grams.
4. Compare both approaches.

Theory

In Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items (usually words) from a given
text or speech. They are widely used in language modelling, machine translation, and predictive text systems.

• Unigrams: Single words (n=1). Example: 'I', 'am', 'learning'.


• Bigrams: Pairs of consecutive words (n=2). Example: ('I', 'am'), ('am', 'learning').
• Trigrams: Triplets of consecutive words (n=3). Example: ('I', 'am', 'learning').
• N-grams: Generalization for any n.

NLTK provides built-in functions such as nltk.util.ngrams() to generate n-grams directly. spaCy does not provide a
direct method, but we can easily implement n-gram generation using loops and tokenization.

SOURCE CODE:

Using NLTK (User-Defined and Pre-Defined Functions):

import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# Sample text
text = "Natural language processing is fun with NLTK and spaCy."

# Word Tokenization
words = word_tokenize(text)

# -------- User-Defined Function --------

def generate_ngrams(words, n):
    # Build n-grams manually by zipping shifted copies of the token list
    # (so this does not rely on nltk.util.ngrams)
    return list(zip(*[words[i:] for i in range(n)]))

# Generate Unigrams, Bigrams, Trigrams


unigrams = generate_ngrams(words, 1)
bigrams = generate_ngrams(words, 2)
trigrams = generate_ngrams(words, 3)

print("Unigrams (User-Defined):", unigrams)

print("Bigrams (User-Defined):", bigrams)
print("Trigrams (User-Defined):", trigrams)

# -------- Pre-Defined Function --------


unigrams_pre = list(ngrams(words, 1))
bigrams_pre = list(ngrams(words, 2))
trigrams_pre = list(ngrams(words, 3))
ngrams_4 = list(ngrams(words, 4))

print("Unigrams (Pre-defined):", unigrams_pre)


print("Bigrams (Pre-defined):", bigrams_pre)
print("Trigrams (Pre-defined):", trigrams_pre)
print("4-grams (Pre-defined):", ngrams_4)

Using spaCy (User-Defined and Pre-Defined Functions):

import spacy

# Load English model


nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Natural language processing is fun with NLTK and spaCy."
doc = nlp(text)

# -------- User-Defined Function --------


def generate_ngrams_spacy(doc, n):
    return [tuple(token.text for token in doc[i:i + n]) for i in range(len(doc) - n + 1)]

unigrams_spacy = generate_ngrams_spacy(doc, 1)
bigrams_spacy = generate_ngrams_spacy(doc, 2)
trigrams_spacy = generate_ngrams_spacy(doc, 3)

print("Unigrams (spaCy User-Defined):", unigrams_spacy)


print("Bigrams (spaCy User-Defined):", bigrams_spacy)
print("Trigrams (spaCy User-Defined):", trigrams_spacy)

# -------- Pre-Defined (Manual Looping) --------


def generate_ngrams_loop(doc, n):
    ngrams_list = []
    for i in range(len(doc) - n + 1):
        ngrams_list.append([token.text for token in doc[i:i + n]])
    return ngrams_list

print("Unigrams (spaCy Pre-defined):", generate_ngrams_loop(doc, 1))


print("Bigrams (spaCy Pre-defined):", generate_ngrams_loop(doc, 2))
print("Trigrams (spaCy Pre-defined):", generate_ngrams_loop(doc, 3))

Output:

Unigrams (User-Defined): [('Natural',), ('language',), ('processing',), ...]
Bigrams (User-Defined): [('Natural', 'language'), ('language', 'processing'), ...]
Trigrams (User-Defined): [('Natural', 'language', 'processing'), ...]

Unigrams (Pre-defined): [('Natural',), ('language',), ...]


Bigrams (Pre-defined): [('Natural', 'language'), ('language', 'processing'), ...]
Trigrams (Pre-defined): [('Natural', 'language', 'processing'), ...]
4-grams (Pre-defined): [('Natural', 'language', 'processing', 'is'), ...]

Unigrams (spaCy User-Defined): [('Natural',), ('language',), ('processing',), ...]


Bigrams (spaCy User-Defined): [('Natural', 'language'), ('language', 'processing'), ...]
Trigrams (spaCy User-Defined): [('Natural', 'language', 'processing'), ...]

Unigrams (spaCy Pre-defined): [['Natural'], ['language'], ['processing'], ...]


Bigrams (spaCy Pre-defined): [['Natural', 'language'], ['language', 'processing'], ...]
Trigrams (spaCy Pre-defined): [['Natural', 'language', 'processing'], ...]

3 (ii) Program to Calculate the Highest Probability of a Word (w2) Occurring After Another Word (w1)

Aim

Calculate the conditional probability of a word w2 occurring immediately after another word w1 using Bigram
Probabilities.

Objective

1. Understand Bigram Probability Model in NLP.


2. Compute conditional probability: P (w2 | w1) = Count (w1, w2) / Count(w1).
3. Implement using NLTK and spaCy.

Theory

The Bigram Model is a type of n-gram model where n=2. It considers pairs of consecutive words. The probability of a
word w2 following another word w1 is given by the ratio of the frequency of the bigram (w1, w2) to the frequency of
w1. This is mostly used in predictive text systems, spell checkers, and other NLP applications.

Formula: P(w2|w1) = Count (w1, w2) / Count(w1)
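For example (hypothetical counts): if the bigram ('natural', 'language') appears 3 times in a corpus and the word 'natural' appears 5 times in total, then P(language | natural) = 3 / 5 = 0.6.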

SOURCE CODE:

Using NLTK:

import nltk
from nltk.corpus import gutenberg
from nltk import bigrams
from nltk.probability import FreqDist

nltk.download('gutenberg')
nltk.download('punkt')

text = gutenberg.raw('austen-emma.txt')
words = nltk.word_tokenize(text)

bi_grams = list(bigrams(words))

bigram_fd = FreqDist(bi_grams)
word_fd = FreqDist(words)

def bigram_probability(w1, w2):
    bigram_count = bigram_fd[(w1, w2)]
    word_count = word_fd[w1]
    if word_count == 0:
        return 0
    return bigram_count / word_count

w1 = "natural"
w2 = "is"
print(f"Probability of '{w2}' occurring after '{w1}':", bigram_probability(w1, w2))

Using spaCy:

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

text = "Natural language processing is fun with NLTK and spaCy."


doc = nlp(text)

bigrams_spacy = [(doc[i].text, doc[i+1].text) for i in range(len(doc)-1)]

bigram_counts = Counter(bigrams_spacy)
word_counts = Counter([token.text for token in doc])

def bigram_probability_spacy(w1, w2):
    bigram_count = bigram_counts[(w1, w2)]
    word_count = word_counts[w1]
    if word_count == 0:
        return 0
    return bigram_count / word_count

w1 = "Natural"
w2 = "language"
print(f"Probability of '{w2}' occurring after '{w1}':", bigram_probability_spacy(w1, w2))

Output:

Probability of 'is' occurring after 'natural': 0.0


Probability of 'language' occurring after 'Natural': 1.0

EXERCISE -4

4. (i) Write a program to identify the word collocations

Aim: To identify collocations (frequent word pairs) from a text corpus using NLTK and spaCy.

Description:

Collocations are pairs of words that often appear together in natural language (e.g., “machine learning” or “artificial
intelligence”). Using NLTK, we can extract collocations with statistical measures like likelihood ratio. With spaCy,
we extract bigrams (two-word combinations).

Source code:

import nltk
from nltk.corpus import reuters
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Download resources
nltk.download('reuters')
nltk.download('punkt')

# Using NLTK
words = reuters.words()
bigram_finder = BigramCollocationFinder.from_words(words)
collocations = bigram_finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)
print("Top 10 Collocations (NLTK):", collocations)

# Using spaCy
import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')
text = "Natural language processing and deep learning are crucial for AI."
doc = nlp(text)
bigrams_spacy = [(doc[i].text, doc[i+1].text) for i in range(len(doc)-1)]
bigram_counts = Counter(bigrams_spacy)
print("Top Bigrams (spaCy):", bigram_counts.most_common(10))

Output:

4(ii) Program to Print All Words Beginning with a Given Sequence of Letters

Aim: To extract all words starting with a given prefix from a corpus using NLTK and spaCy.

Description:
This program checks if words in a dictionary or text start with a specified prefix (e.g., “pre”). Useful in prefix-based
search applications like autocomplete.

Source code:

import nltk

nltk.download('words')

# Using NLTK words corpus

words = nltk.corpus.words.words()

prefix = "pre"

words_with_prefix = [word for word in words if word.lower().startswith(prefix)]

print(f"Words starting with '{prefix}' (NLTK):", words_with_prefix[:10]) # printing first 10 for clarity

# Using spaCy

import spacy

nlp = spacy.load('en_core_web_sm')

text = "Preprocessing is important for machine learning and AI development."

doc = nlp(text)

prefix = "pre"

words_with_prefix_spacy = [token.text for token in doc if token.text.lower().startswith(prefix)]

print(f"Words starting with '{prefix}' (spaCy):", words_with_prefix_spacy)

Output:

4(iii) Program to Print All Words Longer Than Four Characters

Aim: To filter and display all words from a given corpus or text that have a length greater than four characters using
NLTK and spaCy.

Description:
In Natural Language Processing (NLP), it is often useful to ignore short or less meaningful words such as articles (a,
an, the), prepositions (in, on, at), and conjunctions (and, or, but). These words are usually less informative in tasks
like keyword extraction, information retrieval, and text summarization.

By filtering words based on their length (e.g., words longer than four characters), we can:

• Focus on content-rich words like nouns (language, machine), verbs (process, learn), and adjectives (natural, crucial).

• Reduce noise in text analysis.

• Improve efficiency of NLP tasks like indexing, clustering, or topic modeling.

This program uses two approaches:

1. NLTK words corpus – extracts dictionary words and filters them based on length.

2. spaCy text processing – tokenizes a custom sentence and extracts words longer than four characters.

This demonstrates both lexical resource filtering (NLTK) and context-based filtering (spaCy).

Source code:
import nltk

nltk.download('words')

# Using NLTK

words = nltk.corpus.words.words()

long_words = [word for word in words if len(word) > 4]

print("Words longer than four characters (NLTK):", long_words[:10]) # show first 10

# Using spaCy

import spacy

nlp = spacy.load('en_core_web_sm')

text = "Natural language processing is a fascinating field of study."

doc = nlp(text)

long_words_spacy = [token.text for token in doc if len(token.text) > 4]

print("Words longer than four characters (spaCy):", long_words_spacy)

Output:

EXERCISE -5
Aim:

Write a program to identify all antonyms and synonyms of a word.

Description:

This program helps in finding all synonyms (words with similar meanings) and antonyms (words with opposite
meanings) of a given word. It uses the WordNet lexical database from the NLTK (Natural Language Toolkit) library in
Python.

It works as follows:
1. WordNet Lookup:
WordNet groups English words into synsets (sets of synonyms).
Each synset contains multiple lemmas (word forms).

2. Synonyms Extraction:
For each synset of the given word, the program collects all lemma names and stores them as synonyms.

3. Antonyms Extraction:
Some lemmas have antonym links. If found, these are collected separately.

4. Word Selection:
The program takes a target word (hard-coded as "happy" in the example below).
Then it prints all synonyms and antonyms of that word.
Source code:

#using NLTK

import nltk

from nltk.corpus import wordnet

# Download necessary resources

nltk.download('wordnet')

nltk.download('omw-1.4') # For multilingual WordNet

# Function to get synonyms and antonyms of a word

def get_synonyms_antonyms(word):
    # Get synsets (sets of synonyms)
    synsets = wordnet.synsets(word)
    synonyms = set()
    antonyms = set()
    # Extract synonyms and antonyms
    for synset in synsets:
        for lemma in synset.lemmas():
            # Add synonyms
            synonyms.add(lemma.name())
            # Add antonyms
            if lemma.antonyms():
                antonyms.add(lemma.antonyms()[0].name())
    return list(synonyms), list(antonyms)

# Test the function

word = "happy"

synonyms, antonyms = get_synonyms_antonyms(word)

print(f"Synonyms of '{word}':", synonyms)

print(f"Antonyms of '{word}':", antonyms)

Output:

[nltk_data] Downloading package wordnet to /root/nltk_data...

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...

Synonyms of 'happy': ['happy', 'felicitous', 'well-chosen', 'glad']

Antonyms of 'happy': ['unhappy']

#using NLTK+spaCy

import spacy

import nltk

from nltk.corpus import wordnet

# Load spaCy model

nlp = spacy.load('en_core_web_sm')

# Download NLTK resources

nltk.download('wordnet')

nltk.download('omw-1.4')

# Function to get synonyms and antonyms using NLTK

def get_synonyms_antonyms_spacy(word):
    # Get synsets from WordNet
    synsets = wordnet.synsets(word)
    synonyms = set()
    antonyms = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            if lemma.antonyms():
                antonyms.add(lemma.antonyms()[0].name())
    return list(synonyms), list(antonyms)

# Preprocess text with spaCy (useful if you want to extract words
# from a document)

doc = nlp("I feel really happy and excited today.")

# Extract the word to check synonyms/antonyms for

word = "happy"

synonyms, antonyms = get_synonyms_antonyms_spacy(word)

print(f"Synonyms of '{word}':", synonyms)

print(f"Antonyms of '{word}':", antonyms)

Output:

Synonyms of 'happy': ['happy', 'felicitous', 'well-chosen', 'glad']

Antonyms of 'happy': ['unhappy']

[nltk_data] Downloading package wordnet to /root/nltk_data...

[nltk_data] Package wordnet is already up-to-date!

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...

[nltk_data] Package omw-1.4 is already up-to-date!

5(ii) Aim: To write a program to find hyponymy, homonymy, and polysemy for a given word.

Description:

This program is designed to analyse a given word and extract its semantic relationships using the WordNet lexical
database available in the NLTK (Natural Language Toolkit) library. Specifically, it identifies hyponyms, homonyms,
and polysemy of the word. It works as follows:

1. WordNet Synsets:

o The program retrieves all synsets (groups of synonyms that share the same meaning) of the given
word from WordNet.

2. Hyponymy:

o Hyponyms represent more specific concepts of the given word.

o Example: For the word “bank”, hyponyms include “savings_bank” and “commercial_bank”.

o The program collects hyponyms from each synset.

3. Homonymy:

o Homonyms are words that share the same spelling but have different, unrelated meanings.

o In WordNet, a word with multiple synsets often reflects homonymy.

o The program lists all distinct definitions of the word’s synsets to show different meanings.

4. Polysemy:

o Polysemy means that a word has multiple related senses/meanings.

o The program counts the number of synsets a word has, which represents its polysemy degree.

Source code:
#using NLTK

import nltk

from nltk.corpus import wordnet

# Download NLTK resources

nltk.download('wordnet')

nltk.download('omw-1.4')

# Function to find hyponymy, homonymy, and polysemy for a given word

def get_hyponyms_homonyms_polysemy(word):
    # Get synsets for the word
    synsets = wordnet.synsets(word)
    # Find Hyponymy (all hyponyms of each synset)
    hyponyms = set()
    for synset in synsets:
        for hyponym in synset.hyponyms():
            hyponyms.add(hyponym.name())
    # Find Polysemy (if the word has multiple meanings)
    polysemy = len(synsets) > 1
    # Homonymy: approximated here by collecting the word's synsets
    # that contain multiple lemmas
    homonyms = set()
    for synset in synsets:
        if len(synset.lemmas()) > 1:
            homonyms.add(synset.name())
    return list(hyponyms), polysemy, list(homonyms)

# Test with a word

word = "bank"

hyponyms, polysemy, homonyms = get_hyponyms_homonyms_polysemy(word)

print(f"Hyponyms of '{word}':", hyponyms)

print(f"Polysemy of '{word}':", polysemy)

print(f"Homonyms of '{word}':", homonyms)

Output:

Hyponyms of 'bank': ['lean.v.04', 'acquirer.n.02', 'soil_bank.n.01', 'state_bank.n.01', 'redeposit.v.01', 'lead_bank.n.01',


'eye_bank.n.01', 'riverbank.n.01', 'food_bank.n.01', 'member_bank.n.01', 'home_loan_bank.n.01', 'sandbank.n.01',
'piggy_bank.n.01', 'agent_bank.n.02', 'waterside.n.01', 'federal_reserve_bank.n.01', 'vertical_bank.n.01', 'bluff.n.01',
'commercial_bank.n.01', 'blood_bank.n.01', 'credit_union.n.01', 'credit.v.04', 'merchant_bank.n.01', 'count.v.08',
'thrift_institution.n.01']

Polysemy of 'bank': True

Homonyms of 'bank': ['trust.v.01', 'savings_bank.n.02', 'bank.n.09', 'depository_financial_institution.n.01',


'deposit.v.02', 'bank.n.07']

[nltk_data] Downloading package wordnet to /root/nltk_data...

[nltk_data] Package wordnet is already up-to-date!

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...

[nltk_data] Package omw-1.4 is already up-to-date!

#using NLTK+spaCy

import spacy

import nltk

from nltk.corpus import wordnet

# Load spaCy model

nlp = spacy.load('en_core_web_sm')

# Download NLTK resources

nltk.download('wordnet')

nltk.download('omw-1.4')

# Function to find hyponymy, homonymy, and polysemy for a given word

def get_hyponyms_homonyms_polysemy_spacy(word):
    # Get synsets for the word
    synsets = wordnet.synsets(word)
    # Find Hyponymy (all hyponyms of each synset)
    hyponyms = set()
    for synset in synsets:
        for hyponym in synset.hyponyms():
            hyponyms.add(hyponym.name())
    # Find Polysemy (if the word has multiple meanings)
    polysemy = len(synsets) > 1
    # Homonymy: approximated here by collecting the word's synsets
    # that contain multiple lemmas
    homonyms = set()
    for synset in synsets:
        if len(synset.lemmas()) > 1:
            homonyms.add(synset.name())
    return list(hyponyms), polysemy, list(homonyms)

# Example word: 'bank'

word = "bank"

hyponyms, polysemy, homonyms = get_hyponyms_homonyms_polysemy_spacy(word)

print(f"Hyponyms of '{word}':", hyponyms)

print(f"Polysemy of '{word}':", polysemy)

print(f"Homonyms of '{word}':", homonyms)

OUTPUT:

Hyponyms of 'bank': ['lean.v.04', 'acquirer.n.02', 'soil_bank.n.01', 'state_bank.n.01', 'redeposit.v.01', 'lead_bank.n.01',


'eye_bank.n.01', 'riverbank.n.01', 'food_bank.n.01', 'member_bank.n.01', 'home_loan_bank.n.01', 'sandbank.n.01',
'piggy_bank.n.01', 'agent_bank.n.02', 'waterside.n.01', 'federal_reserve_bank.n.01', 'vertical_bank.n.01', 'bluff.n.01',
'commercial_bank.n.01', 'blood_bank.n.01', 'credit_union.n.01', 'credit.v.04', 'merchant_bank.n.01', 'count.v.08',
'thrift_institution.n.01']

Polysemy of 'bank': True

Homonyms of 'bank': ['trust.v.01', 'savings_bank.n.02', 'bank.n.09', 'depository_financial_institution.n.01',


'deposit.v.02', 'bank.n.07']

[nltk_data] Downloading package wordnet to /root/nltk_data...

[nltk_data] Package wordnet is already up-to-date!

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...

[nltk_data] Package omw-1.4 is already up-to-date!

EXERCISE 6
(i) Write a program to find all the stop words in any given text.

Aim:
To Find all the stop words in any given text.

Description:
This program demonstrates how to identify and extract stop words from any given text using two popular NLP
libraries, NLTK and spaCy. Stop words are common words in a language, such as “the”, “is”, “and”, which usually
do not carry significant meaning and are often removed during text preprocessing in Natural Language Processing
tasks. Using NLTK, the text is first tokenized into individual words, and then compared with NLTK’s predefined list
of English stop words to extract all matching words. On the other hand, spaCy provides an attribute is_stop for each
token, allowing stop words to be identified directly while processing the text with spaCy’s language model. Both
methods return the set of stop words present in the input text, which can be useful in applications like text analysis,
sentiment analysis, and machine learning models where filtering out such words helps improve efficiency and focus
on meaningful content.

Source Code:
Using NLTK to Find Stop Words in a Text

import nltk

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

# Download necessary resources

nltk.download('stopwords')

nltk.download('punkt')

# Sample text

text = "This is a simple sentence, just to test stop words in NLTK."

# Tokenize the text

words = word_tokenize(text)

# Get the stop words in English

stop_words = set(stopwords.words('english'))

# Filter out stop words from the tokenized text

stop_words_in_text = [word for word in words if word.lower() in stop_words]

print("Stop Words in the Text (NLTK):", stop_words_in_text)

Output:
Stop Words in the Text (NLTK): ['This', 'is', 'a', 'just', 'to', 'in']

Using spaCy to Find Stop Words in a Text

SOURCE CODE:

import spacy

# Load spaCy's English model

nlp = spacy.load("en_core_web_sm")

# Sample text

text = "This is a simple sentence, just to test stop words in spaCy."

# Create a spaCy document

doc = nlp(text)

# Filter out stop words from the document

stop_words_in_text_spacy = [token.text for token in doc if token.is_stop]

print("Stop Words in the Text (spaCy):", stop_words_in_text_spacy)

Output:
Stop Words in the Text (spaCy): ['This', 'is', 'a', 'just', 'to', 'in']

(ii) Write a function that finds the 50 most frequently occurring words of a text that are not stop words.

Aim:

Find the 50 most frequently occurring words of a text that are not stop words.

Description:

This program is designed to find the 50 most frequently occurring words in a given text after removing stop words. Since
stop words are common words like “the”, “is”, and “and” that don’t add much meaning, they are filtered out to focus
on more meaningful terms. Using NLTK, the text is tokenized, stop words and non-alphabetic tokens are removed,
and a frequency distribution is calculated with FreqDist to list the most frequent words. Similarly, in spaCy, the text is
processed into tokens, and words that are stop words or punctuation are excluded. The remaining words are counted
using Python’s Counter to obtain the top 50 frequent terms. This approach helps highlight the most important words in
a text, which is useful in tasks like keyword extraction, text summarization, and content analysis.

Source Code:
Using NLTK to Find the Most Frequent Non-Stop Words

import nltk

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

from nltk.probability import FreqDist

# Download necessary resources

nltk.download('stopwords')

nltk.download('punkt')

# Sample text

text = "This is a simple sentence, just to test the functionality of finding the most frequent non-stop words. " \

"Stop words should be excluded from the analysis so we can focus on meaningful words."

# Tokenize the text

words = word_tokenize(text)

# Get the stop words in English

stop_words = set(stopwords.words('english'))

# Filter out stop words

filtered_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]

# Calculate word frequencies

fdist = FreqDist(filtered_words)

# Print the 50 most frequent words

print("50 Most Frequent Non-Stop Words (NLTK):", fdist.most_common(50))

Output:

50 Most Frequent Non-Stop Words (NLTK): [('words', 3), ('simple', 1), ('sentence', 1), ('test', 1), ('functionality',
1), ('finding', 1), ('frequent', 1), ('stop', 1), ('excluded', 1), ('analysis', 1), ('focus', 1), ('meaningful', 1)]

Using spaCy to Find the Most Frequent Non-Stop Words

import spacy

from collections import Counter

# Load spaCy's English model

nlp = spacy.load('en_core_web_sm')

# Sample text

text = "This is a simple sentence, just to test the functionality of finding the most frequent non-stop words. " \

"Stop words should be excluded from the analysis so we can focus on meaningful words."

# Create a spaCy document

doc = nlp(text)

# Filter out stop words and punctuation, and create a list of remaining words

filtered_words_spacy = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct and
token.is_alpha]

# Calculate word frequencies

word_freq = Counter(filtered_words_spacy)

# Print the 50 most frequent words

print("50 Most Frequent Non-Stop Words (spaCy):", word_freq.most_common(50))

Output:

50 Most Frequent Non-Stop Words (spaCy): [('words', 3), ('stop', 2), ('simple', 1), ('sentence', 1), ('test', 1),
('functionality', 1), ('finding', 1), ('frequent', 1), ('non', 1), ('excluded', 1), ('analysis', 1), ('focus', 1),
('meaningful', 1)]

EXERCISE 7

7) Write a program to implement various stemming techniques and prepare a chart with the
performance of each method.
Aim:
To implement various stemming techniques (such as Porter Stemmer, Snowball Stemmer, Lancaster Stemmer and
Regexp Stemmer) and analyze their performance by preparing a comparative chart showing the efficiency and
effectiveness of each method.

Description:
Stemming is a text preprocessing technique used in Natural Language Processing (NLP) to reduce words to their base
or root form by removing suffixes or prefixes. Different stemming algorithms follow different approaches, each with
unique strengths and limitations:

• Porter Stemmer: Developed by Martin Porter in 1980, this is one of the most widely used stemming
algorithms. It applies a series of rule-based suffix stripping steps to reduce words. While it produces stems
that may not always be valid words, it strikes a balance between simplicity, speed, and accuracy. For example,
“connection” → “connect”, “caresses” → “caress”.

• Lancaster Stemmer: Also known as the Paice/Husk stemmer, it is a more aggressive rule-based approach. It
applies a set of rules iteratively until no more stemming is possible. While it is faster, it often over-stems
words, producing very short root forms. For instance, “connection” → “connect”, but “university” →
“univers”.

• Snowball Stemmer: An improvement over the Porter Stemmer, also developed by Martin Porter. It supports
multiple languages and provides better accuracy with fewer errors compared to Porter. It is considered more
efficient and consistent. Example: “running” → “run”, “generalization” → “general”.

• Regexp Stemmer: A flexible stemmer that uses user-defined regular expressions to remove specific prefixes
or suffixes from words. Its performance depends on the quality of regex patterns provided, making it less
standardized compared to other stemmers. For example, with the rule to remove “ing”, “running” → “run”,
“playing” → “play”.

Using NLTK for Stemming Techniques

NLTK provides several stemmers that can be directly applied to text. In this work, we use four stemmers: Porter
Stemmer, Lancaster Stemmer, Snowball Stemmer, and Regexp Stemmer. Each of these reduces words to their
root form using different approaches, allowing us to compare their effectiveness.

Source code:

import nltk

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, RegexpStemmer

from nltk.tokenize import word_tokenize

import time

import matplotlib.pyplot as plt

# Download necessary resources

nltk.download('punkt')

nltk.download('punkt_tab') # Download punkt_tab as suggested by the error message

# Sample text

text = "running runs runner better best tried tries runningness"

# Tokenize the text

words = word_tokenize(text)

# Initialize the stemmers

porter = PorterStemmer()

lancaster = LancasterStemmer()

snowball = SnowballStemmer("english")

regexp = RegexpStemmer('ing$|es$|ed$', min=4)

# Define a function to apply a single stemmer to all words

def apply_stemmer(stemmer, words):
    return [stemmer.stem(word) for word in words]

# Apply the different stemmers and measure the time taken by each one

stemmers = {"Porter": porter, "Lancaster": lancaster, "Snowball": snowball, "Regexp": regexp}

stemmed_words = {}
times = {}
for name, stemmer in stemmers.items():
    start_time = time.time()
    stemmed_words[name] = apply_stemmer(stemmer, words)
    times[name] = time.time() - start_time

# Collect results for the bar chart

stemmer_names = list(times.keys())
stemmer_times = list(times.values())

# Plot the performance of each stemmer

plt.bar(stemmer_names, stemmer_times)

plt.xlabel("Stemming Method")

plt.ylabel("Time (seconds)")

plt.title("Stemming Method Performance (NLTK)")

plt.show()

# Now print out the stemmed results to compare

for method, stemmed in stemmed_words.items():
    print(f"{method} Stemmed Words: {stemmed}")

Output:

[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data] Package punkt is already up-to-date!

[nltk_data] Downloading package punkt_tab to /root/nltk_data...

[nltk_data] Package punkt_tab is already up-to-date!

Porter Stemmed Words: ['run', 'run', 'runner', 'better', 'best', 'tri', 'tri', 'running']

Lancaster Stemmed Words: ['run', 'run', 'run', 'bet', 'best', 'tri', 'tri', 'run']

Snowball Stemmed Words: ['run', 'run', 'runner', 'better', 'best', 'tri', 'tri', 'running']

Regexp Stemmed Words: ['runn', 'runs', 'runner', 'better', 'best', 'tri', 'tri', 'runningness']

Using spaCy for Tokenization and Stemming

spaCy does not directly support stemming, but we can use spaCy's lemmatizer as an alternative to stemming.
Lemmatization reduces words to their base or dictionary form, which is similar to stemming but more accurate. We
can still use spaCy for preprocessing (tokenization) and then compare the performance.

Source code:

import spacy

import time

import matplotlib.pyplot as plt

# Load spaCy's English model

nlp = spacy.load("en_core_web_sm")

# Sample text

text = "running runs runner better best tried tries runningness"

# Tokenize and lemmatize with spaCy

doc = nlp(text)

# Measure time for spaCy lemmatization

start_time = time.time()

lemmatized_spacy = [token.lemma_ for token in doc if token.is_alpha]

spacy_time = time.time() - start_time

# Visualize the time performance for spaCy lemmatization vs the NLTK stemmers
# (assumes the NLTK stemming script above was run first in the same session, so `times` is available)

stemmer_names = ["Porter", "Lancaster", "Snowball", "Regexp", "spaCy Lemmatization"]

stemmer_times = [times["Porter"], times["Lancaster"], times["Snowball"], times["Regexp"], spacy_time]

# Plot the performance of each method

plt.bar(stemmer_names, stemmer_times)

plt.xlabel("Method")

plt.ylabel("Time (seconds)")

plt.title("Comparison of Stemming and Lemmatization Methods")

plt.show()

# Print the results of lemmatization

print("spaCy Lemmatized Words:", lemmatized_spacy)

Output:

spaCy Lemmatized Words: ['run', 'run', 'runner', 'well', 'well', 'try', 'try', 'runningness']

Comparison of NLTK and SpaCy Lemmatization Techniques

This script performs lemmatization on a sample text using NLTK and SpaCy, compares the results, and measures the
time taken for each method. A bar chart is generated to visualize and compare the performance of the two
lemmatization techniques.

Source code:

import time

import nltk

import spacy

import matplotlib.pyplot as plt

from nltk.stem import WordNetLemmatizer

from nltk.tokenize import word_tokenize

# Load Spacy English model

nlp = spacy.load("en_core_web_sm")

# Download NLTK data

nltk.download('punkt')

nltk.download('wordnet')

nltk.download('punkt_tab') # Download punkt_tab as suggested by the error message

# Initialize NLTK lemmatizer

nltk_lemmatizer = WordNetLemmatizer()

# Function to perform NLTK Lemmatization

def nltk_lemmatization(text):
    words = word_tokenize(text)
    return [nltk_lemmatizer.lemmatize(word) for word in words]

# Function to perform SpaCy Lemmatization

def spacy_lemmatization(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc]

# Test text for lemmatization

text = "The striped bats are hanging on their feet and eating best fishes."

# Measure performance of NLTK lemmatization

start_time = time.time()

nltk_result = nltk_lemmatization(text)

nltk_time = time.time() - start_time

# Measure performance of SpaCy lemmatization

start_time = time.time()

spacy_result = spacy_lemmatization(text)

spacy_time = time.time() - start_time

# Print Results

print(f"NLTK Lemmatization: {nltk_result}")

print(f"SpaCy Lemmatization: {spacy_result}")

print(f"Time taken by NLTK: {nltk_time:.5f} seconds")

print(f"Time taken by SpaCy: {spacy_time:.5f} seconds")

# Prepare performance chart

methods = ['NLTK', 'SpaCy']

times = [nltk_time, spacy_time]

plt.bar(methods, times, color=['blue', 'green'])

plt.xlabel('Lemmatization Method')

plt.ylabel('Time (seconds)')

plt.title('Performance of NLTK vs SpaCy Lemmatization')

plt.show()

Output:
[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data] Package punkt is already up-to-date!

[nltk_data] Downloading package wordnet to /root/nltk_data...

[nltk_data] Package wordnet is already up-to-date!

[nltk_data] Downloading package punkt_tab to /root/nltk_data...

[nltk_data] Package punkt_tab is already up-to-date!

NLTK Lemmatization: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'and', 'eating', 'best', 'fish', '.']

SpaCy Lemmatization: ['the', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'and', 'eat', 'good', 'fish', '.']

Time taken by NLTK: 3.15120 seconds

Time taken by SpaCy: 0.01093 seconds

EXERCISE 8

Aim:

Write a program to implement various lemmatization techniques and prepare a chart with the performance of each
method.

Theory:

Lemmatization is a text normalization technique in Natural Language Processing (NLP) that reduces a word to its
base or dictionary form, called a lemma. Unlike stemming, which simply truncates words, lemmatization considers the
word’s context and part of speech to produce meaningful root forms.

• NLTK Lemmatization
o NLTK (Natural Language Toolkit) provides the WordNetLemmatizer, which uses the WordNet
lexical database to find the base form of words.
o By default, it assumes the word is a noun unless a part-of-speech (POS) tag is specified.
o Example: “running” → “running” (without a POS tag), but with pos='v', “running” → “run” (see the short sketch after this list).

• SpaCy Lemmatization

o SpaCy uses a rule-based + statistical approach with built-in morphological analysis.


o It can handle different POS tags automatically and often produces more accurate lemmatizations.
o Example: “better” → “good”, “ate” → “eat”.
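As a minimal illustration of the POS effect described above (a sketch, assuming the WordNet data has been downloaded as in the source code below):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))            # 'running' (treated as a noun by default)
print(lemmatizer.lemmatize("running", pos="v"))   # 'run'
print(lemmatizer.lemmatize("better", pos="a"))    # 'good'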

Performance Considerations:
• NLTK is lightweight but may require explicit POS tagging for better accuracy.
• SpaCy is generally faster for large texts as it is optimized in Cython, but loading the model initially can take
extra time.

Objectives:
1. To understand the concept of lemmatization and its importance in text preprocessing.
2. To implement lemmatization using two different libraries: NLTK (WordNetLemmatizer) and SpaCy.
3. To compare the output quality of lemmatization between the two methods.
4. To measure and compare the execution time of NLTK vs SpaCy for performance evaluation.
5. To visualize the comparison results using a bar chart.

Source Code:

import time

import nltk

import spacy

import matplotlib.pyplot as plt

from nltk.stem import WordNetLemmatizer

from nltk.tokenize import word_tokenize

# Load SpaCy English model

nlp = spacy.load("en_core_web_sm")

# Download NLTK resources

nltk.download('punkt')

nltk.download('wordnet')

nltk.download('omw-1.4') # For better WordNet support

# Initialize NLTK lemmatizer

nltk_lemmatizer = WordNetLemmatizer()

# Function to perform NLTK Lemmatization

def nltk_lemmatization(text):
    words = word_tokenize(text)
    return [nltk_lemmatizer.lemmatize(word) for word in words]

# Function to perform SpaCy Lemmatization

def spacy_lemmatization(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc]

# Test text for lemmatization

text = "The striped bats are hanging on their feet and eating best fishes."

# Measure performance of NLTK lemmatization

start_time = time.time()

nltk_result = nltk_lemmatization(text)

nltk_time = time.time() - start_time

# Measure performance of SpaCy lemmatization

start_time = time.time()

spacy_result = spacy_lemmatization(text)

spacy_time = time.time() - start_time

# Print Results

print("Original Text:", text)

print(f"NLTK Lemmatization: {nltk_result}")

print(f"SpaCy Lemmatization: {spacy_result}")

print(f"Time taken by NLTK: {nltk_time:.5f} seconds")

print(f"Time taken by SpaCy: {spacy_time:.5f} seconds")

# Prepare performance chart

methods = ['NLTK', 'SpaCy']

times = [nltk_time, spacy_time]

plt.bar(methods, times, color=['blue', 'green'])

plt.xlabel('Lemmatization Method')

plt.ylabel('Time (seconds)')

plt.title('Performance of NLTK vs SpaCy Lemmatization')

plt.show()

Output:
Original Text: The striped bats are hanging on their feet and eating best fishes.

NLTK Lemmatization: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'and', 'eating', 'best', 'fish', '.']

SpaCy Lemmatization: ['the', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'and', 'eat', 'good', 'fish', '.']

Time taken by NLTK: 0.00038 seconds

Time taken by SpaCy: 0.01065 seconds

EXERCISE 9
i) Aim:
Write a program that implements Conditional Frequency Distributions (CFD) for a given corpus and displays the most
frequent words for each category.

Description:
A Conditional Frequency Distribution (CFD) is a very important concept in Natural Language Processing (NLP). It is
essentially a collection of frequency distributions, but instead of having a single frequency distribution for all the data,
it maintains separate frequency distributions for different conditions. Each condition has its own frequency
distribution of words. In simpler terms, a CFD allows us to group words by some condition and then analyze the
frequency of words inside each group. In the case of the Reuters corpus, every document belongs to one or more
categories such as trade, crude, money-fx, grain, etc. A CFD can help us find the most frequent words in each
category. For example, the "trade" category may frequently use words like import, export, agreement, deficit, while
the "crude" category may use oil, barrel, OPEC, price, etc. This is extremely useful because language usage changes
depending on context or topic. By using a CFD, we can see how different fields of news reporting emphasize different
sets of words. This type of analysis has multiple applications such as domain-specific vocabulary analysis, topic
modeling, text classification, and linguistic research. Another practical example could be in spam filtering: one
condition could be "spam emails" and the other could be "non-spam emails," and the CFD would then show us which
words occur more frequently in spam vs. non-spam, helping in automatic detection. Thus, CFDs give a structured way
to analyze word usage differences based on categories, genres, or other conditions within a corpus.

Source code:
import nltk
from nltk.corpus import reuters
from nltk import ConditionalFreqDist

# Load corpus
nltk.download('reuters')

# Build a Conditional Frequency Distribution:
# condition = category, event = (lower-cased, alphabetic) word
cfd = ConditionalFreqDist(
    (category, word.lower())
    for category in reuters.categories()
    for word in reuters.words(categories=category)
    if word.isalpha()
)

# Display the most frequent words for a few sample categories
for category in ['trade', 'crude', 'money-fx', 'grain']:
    print(category, "->", cfd[category].most_common(10))

Output:

ii)
Aim:
To define a conditional frequency distribution over the names corpus that shows which initial letters are more frequent
for male names versus female names.

Description:
The names corpus in NLTK is a standard dataset that contains a large collection of common male and female names.
By applying a Conditional Frequency Distribution (CFD), we can analyze the distribution of the first letters of names
across genders. This means that the condition will be the gender (male or female), and the frequency distribution will
count how many names start with each letter. This allows us to find patterns such as which letters are more popular as
the starting letters of male names compared to female names. For example, male names frequently begin with J such
as John, James, Jack, Jason, while female names may more often start with M such as Mary, Monica, Michelle,
Megan. This analysis is useful in several ways. First, it provides sociolinguistic insights into naming conventions.
Second, it is used in machine learning for gender prediction models, where the first letter of a name is often
considered a feature for classification. Third, it can reveal cultural or linguistic trends in naming. For example, names
starting with certain letters may be more common in one gender across many cultures. If we run this program, we may
find that letters like J, M, and A dominate as starting letters, while some letters such as Q or X are rare. By using a
CFD, we can not only compare male vs female distributions but also visualize the overlap and uniqueness of naming
patterns. Thus, conditional frequency analysis of the names corpus provides both statistical and practical insights into
human naming systems.

Source code:
import nltk

from nltk.corpus import names

from nltk import ConditionalFreqDist

# Load names corpus

nltk.download('names')

# Extract male and female names

male_names = names.words('male.txt')

female_names = names.words('female.txt')

# Create a Conditional Frequency Distribution for names by their initial letter

cfd = ConditionalFreqDist()

# Add male and female names to the CFD

for name in male_names:
    cfd['male'][name[0].lower()] += 1

for name in female_names:
    cfd['female'][name[0].lower()] += 1

# Display the conditional frequency distribution

print("Initial letters most frequent for males:")

for letter, frequency in cfd['male'].most_common():
    print(f"{letter}: {frequency}")

print("\nInitial letters most frequent for females:")

for letter, frequency in cfd['female'].most_common():
    print(f"{letter}: {frequency}")

Output:

iii)

Aim:

To find all the four-letter words in a corpus and, using frequency distribution, display them in decreasing order of
frequency.

Description:

A Frequency Distribution (FreqDist) is a tool that counts how many times each word appears in a corpus. If we apply
this only to four-letter words, we can identify the most frequently used short words in the dataset. In this case, we are
analyzing the Reuters corpus. Short words, especially four-letter words, can be very interesting because they often
include both content words and function words. Content words may include domain-specific terms like bank, deal,
fund, oil, while function words may include with, from, this, that. By filtering out only four-letter words, we restrict
our analysis to a manageable set and can see which of them dominate in the corpus. For example, in financial news,
words like rate, bank, fund might appear repeatedly. In contrast, in political news, words like vote, plan, deal may
appear more often. The frequency distribution not only lists these words but also orders them from most frequent to
least frequent, allowing us to immediately identify which short words are most common. This has applications in
building text models, creating dictionaries of frequent terms, or studying domain-specific language usage.
Additionally, word length analysis can help in stylometric studies where the length of words is linked to the style of
writing. Thus, by applying FreqDist to four-letter words, we gain a focused yet powerful insight into the most
common short words in the corpus.

Source code:

import nltk

from nltk.corpus import reuters

from nltk import FreqDist

# Load corpus

nltk.download('reuters')

# Extract words from the Reuters corpus

words = reuters.words()

# Filter for four-letter words

four_letter_words = [word.lower() for word in words if len(word) == 4]

# Create a frequency distribution for four-letter words

fdist = FreqDist(four_letter_words)

# Show the four-letter words in decreasing order of frequency

for word, frequency in fdist.most_common():
    print(f"{word}: {frequency}")

EXERCISE 10
Aim:

Implement a program that assigns grammatical tags (PoS) to words in a text corpus.

Description:

This program demonstrates how to perform PoS tagging on real-world text using NLTK. The Reuters corpus (a
collection of news articles) is chosen as the input data. First, the raw text is extracted from the corpus and split into
individual tokens (words) using word_tokenize(). Then, the pos_tag() function assigns a Part-of-Speech tag to each
word. PoS tagging is a fundamental step in NLP since it helps machines understand the grammatical function of
words, e.g., whether a word is a noun, verb, adjective, or determiner. This information is crucial for tasks such as text
classification, question answering, sentiment analysis, machine translation, and speech recognition.

Example Output: [('The', 'DT'), ('stock', 'NN'), ('market', 'NN'), ('rose', 'VBD'), ('today', 'NN')]

where DT → Determiner, NN → Noun (singular), VBD → Verb (past tense).

Source code:
import nltk

from nltk.corpus import reuters

from nltk.tokenize import word_tokenize

# Download necessary NLTK resources

nltk.download('reuters')

nltk.download('punkt')

nltk.download('averaged_perceptron_tagger')

nltk.download('punkt_tab') # Additional tokenizer tables required by newer NLTK versions

nltk.download('averaged_perceptron_tagger_eng') # English-specific tagger model required by newer NLTK versions

# Example: Use a sentence from the Reuters corpus

sentence = reuters.raw(reuters.fileids()[0]) # Get the first document from the Reuters corpus

words = word_tokenize(sentence) # Tokenize the words

# Perform PoS tagging

pos_tags = nltk.pos_tag(words)

# Display the tagged words

print(pos_tags)

Output:

(ii) Program to identify the word with the greatest number of distinct tags

Aim:

Find the word in the corpus that occurs with the largest variety of PoS tags.
Description:

Words in English are often ambiguous, meaning the same word can have different grammatical roles depending on
the context. This program builds a dictionary where each word is stored along with all the PoS tags it has been
assigned in the corpus. By comparing all words, it identifies the one that has been tagged with the highest number of
distinct tags. This gives insight into how flexible a word is in the English language. For instance, the word “book”
can function as a noun (NN) → “I read a book.”, or as a verb (VB) → “I will book a flight.” Similarly, words like
“time” or “run” often appear with different tags. Understanding such words is important in NLP because they require
context-based disambiguation.

Example: Word: “book” → Tagged as NN (Noun) → “I read a book.”, Tagged as VB (Verb) → “I will book a
ticket.”
Output: Word with maximum distinct tags: book | Distinct tags: {'NN', 'VB'}
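A quick way to see this ambiguity directly (the two sentences below are illustrative assumptions, not taken from the corpus) is to tag them with the same tagger used above; "book" typically receives a noun tag in the first sentence and a verb tag in the second:

import nltk

# Tag two hand-made example sentences to show context-dependent tagging
print(nltk.pos_tag(nltk.word_tokenize("I read a book")))         # 'book' is usually tagged NN here
print(nltk.pos_tag(nltk.word_tokenize("I will book a ticket")))  # 'book' is usually tagged VB here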

Source code:

from collections import defaultdict

# Create a defaultdict to store distinct tags for each word

word_tags = defaultdict(set)

# Populate word_tags with words and their corresponding tags

for word, tag in pos_tags:
    word_tags[word].add(tag)

# Find the word with the maximum number of distinct tags

max_word = max(word_tags, key=lambda x: len(word_tags[x]))

tags = word_tags[max_word]

print(f"Word with maximum distinct tags: {max_word}")

print(f"Distinct tags: {tags}")

Output:

(iii) Program to list tags in order of decreasing frequency

Aim:

To analyse and display PoS tags in the corpus according to their frequency of occurrence.

Description:

This program counts the frequency of each PoS tag across the corpus and sorts them from most frequent to least
frequent. This statistical analysis provides insight into which parts of speech dominate natural language text. For
example, in news articles, nouns (NN, NNS) appear frequently because they describe people, places, and things.
Prepositions (IN) such as “in”, “on”, “of” are also frequent because they link nouns to other parts of the sentence.
Determiners (DT) such as “the” or “a” are common as they specify nouns. By listing the tags in decreasing order, we
can identify the 20 most frequent tags and interpret what they represent. This type of analysis is useful in corpus
linguistics, text analytics, and building NLP models.

Example Output (top few):

NN: 1200 (Noun, singular), IN: 900 (Preposition/subordinating conjunction – “in”, “on”, “of”), DT: 850 (Determiner
– “the”, “a”, “an”), NNS: 700 (Noun, plural), JJ: 650 (Adjective)

Source code:
from collections import Counter

# Get the frequency of each tag in the tagged words

tag_counts = Counter(tag for _, tag in pos_tags)

# Sort tags by frequency

sorted_tags = tag_counts.most_common()

# Display the sorted tags with their frequencies

for tag, count in sorted_tags:
    print(f"{tag}: {count}")

Output:

(iv) Program to identify which tags are most commonly found after nouns

Aim:

To study the syntactic behavior of nouns by finding which tags most frequently follow them.

Description:

Nouns are central to sentences, and the words that come immediately after them help form phrases and clauses. This program investigates sentence structure by checking which tags (prepositions, verbs, adjectives, or other nouns) most often appear immediately after nouns in the corpus. Since nouns are core elements of sentences, the words that follow them reveal important grammar patterns. For example, in English:

A noun is often followed by a preposition (IN): “book on the table”.

A noun may be followed by a verb (VBZ/VBD): “dog barks”, “car broke down”.

A noun may be followed by another noun (NN): “car engine”, “city center”.

A noun may also be followed by an adjective (JJ) in some expressions.

By counting and ranking the tags that follow nouns, the program shows the most common structures in English. This
analysis is important for parsing, grammar learning, and sentence generation in NLP.

Example:

Sentences: “The dog (NN) barks (VBZ) loudly.”, “A book (NN) on (IN) the table.”, “Car (NN) engine (NN).”
Output:

IN: 500 (Prepositions – “in”, “on”, “of”), VBZ: 400 (Verb, 3rd person singular – “is”, “runs”, “barks”), NN: 350
(Another noun – “car engine”), JJ: 200 (Adjective – “time immemorial”)

Source code:
from collections import Counter

import nltk

# Collect the tag that immediately follows each noun (NN, NNS, NNP, NNPS)

noun_followers = Counter(
    tag2 for (word1, tag1), (word2, tag2) in nltk.bigrams(pos_tags)
    if tag1.startswith('NN')
)

# Display the tags that follow nouns in decreasing order of frequency

for tag, count in noun_followers.most_common():
    print(f"{tag}: {count}")

Output:

EXERCISE 11

11) Write a program to implement TF-IDF for any corpus.

Aim:
To preprocess text data using spaCy and NLTK, and then compute TF-IDF (Term Frequency–Inverse Document
Frequency) values using scikit-learn in order to evaluate the importance of words in a given corpus of documents.

Description:
This program demonstrates a basic Natural Language Processing (NLP) pipeline where a set of documents (corpus)
is transformed into numerical features using the TF-IDF technique.

1. Libraries Used:

o spaCy: For text preprocessing (tokenization, lemmatization, stopword removal, punctuation handling).

o NLTK: For accessing English stopwords.

o Scikit-learn: For computing the TF-IDF matrix.

2. Steps in the Program:

o Import and set up the required libraries (nltk, spacy, sklearn).

o Define a small corpus of four text documents.

o Preprocess each document:

• Convert text to lowercase.

• Remove stopwords, punctuation, and numerical tokens.

o Create a TF-IDF Vectorizer using scikit-learn.

o Fit and transform the preprocessed corpus to generate the TF-IDF matrix.

o Extract feature names (words) and their corresponding TF-IDF scores for each document.

o Print the results, showing the importance of each word in the corpus.

3. Purpose of TF-IDF:

o Term Frequency (TF): Measures how often a word appears in a document.

o Inverse Document Frequency (IDF): Reduces the weight of common words across documents.

o TF-IDF: Highlights important words in a document that are not too common across the whole corpus (a worked sketch of this weighting is given after this list).

4. Applications:

o Feature extraction for text mining and machine learning models.

o Document similarity comparison.

o Keyword extraction and information retrieval.
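
As a worked sketch of this weighting (the two-document toy corpus below is a made-up illustration, not the exercise corpus), the following snippet recomputes by hand what TfidfVectorizer does with its default settings: raw term counts, the smoothed IDF ln((1+n)/(1+df)) + 1, and L2 normalisation of each document vector.

import math

# Toy corpus, already tokenized (illustrative assumption)
docs = [["deep", "learning", "deep"], ["learning", "models"]]
n = len(docs)

def idf(term):
    df = sum(1 for d in docs if term in d)       # number of documents containing the term
    return math.log((1 + n) / (1 + df)) + 1      # smoothed IDF (scikit-learn default)

def tfidf(doc):
    raw = {t: doc.count(t) * idf(t) for t in set(doc)}     # raw count * idf
    norm = math.sqrt(sum(v * v for v in raw.values()))     # L2 norm of the document vector
    return {t: round(v / norm, 4) for t, v in raw.items()}

for i, d in enumerate(docs, 1):
    print(f"Document {i}: {tfidf(d)}")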

Source Code:
!pip install nltk spacy scikit-learn

import nltk

!python -m spacy download en_core_web_sm

import spacy

from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.corpus import stopwords

# Download NLTK stopwords if not already downloaded

nltk.download('stopwords')

# Load spaCy's English language model

nlp = spacy.load("en_core_web_sm")

# Sample Corpus (list of documents)

corpus = [

"Natural language processing with deep learning is fascinating.",

"TF-IDF is a statistical measure used to evaluate the importance of a word.",

"SpaCy and NLTK are popular libraries for NLP tasks.",

"TF-IDF can be used to weigh terms in a document relative to a corpus."

]

# Preprocess the corpus using spaCy and remove stopwords

def preprocess(text):
    doc = nlp(text.lower())  # Convert to lowercase
    # Keep only non-stopword, non-punctuation, and non-numerical tokens
    return " ".join([token.text for token in doc if not token.is_stop and not token.is_punct and not token.is_digit])

# Preprocess the entire corpus

preprocessed_corpus = [preprocess(doc) for doc in corpus]

# Initialize the TF-IDF Vectorizer from scikit-learn

vectorizer = TfidfVectorizer()

# Fit and transform the corpus to compute the TF-IDF matrix

tfidf_matrix = vectorizer.fit_transform(preprocessed_corpus)

# Get feature names (words) corresponding to the columns in the TF-IDF matrix

feature_names = vectorizer.get_feature_names_out()

# Display the TF-IDF scores for each document

for i, doc in enumerate(preprocessed_corpus):
    print(f"\nDocument {i + 1}:")
    for j, word in enumerate(feature_names):
        score = tfidf_matrix[i, j]
        if score > 0:  # Only display words with a non-zero TF-IDF score
            print(f"Word: {word}, TF-IDF Score: {score:.4f}")

Output:
Document 1:

Word: deep, TF-IDF Score: 0.4082

Word: fascinating, TF-IDF Score: 0.4082

Word: language, TF-IDF Score: 0.4082

Word: learning, TF-IDF Score: 0.4082

Word: natural, TF-IDF Score: 0.4082

Word: processing, TF-IDF Score: 0.4082

Document 2:

Word: evaluate, TF-IDF Score: 0.4002

Word: idf, TF-IDF Score: 0.3155

Word: importance, TF-IDF Score: 0.4002

Word: measure, TF-IDF Score: 0.4002

Word: statistical, TF-IDF Score: 0.4002

Word: tf, TF-IDF Score: 0.3155

Word: word, TF-IDF Score: 0.4002

Document 3:

Word: libraries, TF-IDF Score: 0.4082

Word: nlp, TF-IDF Score: 0.4082

Word: nltk, TF-IDF Score: 0.4082

Word: popular, TF-IDF Score: 0.4082

Word: spacy, TF-IDF Score: 0.4082

Word: tasks, TF-IDF Score: 0.4082

Document 4:

Word: corpus, TF-IDF Score: 0.4002

Word: document, TF-IDF Score: 0.4002

Word: idf, TF-IDF Score: 0.3155

Word: relative, TF-IDF Score: 0.4002

Word: terms, TF-IDF Score: 0.4002

Word: tf, TF-IDF Score: 0.3155

Word: weigh, TF-IDF Score: 0.4002

EXERCISE 12
12) Write a program to implement chunking and chinking for any corpus

Aim:

To write a program to implement chunking and chinking for any corpus.

Description:

Chunking and Chinking in NLP

• Chunking refers to the process of segmenting and labeling a sentence into "chunks" that correspond to various syntactic components like noun phrases (NP), verb phrases (VP), etc.

• Chinking refers to the process of removing or excluding parts from a chunk based on patterns or conditions.
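
The grammar notation used in the program below can be illustrated on a tiny hand-tagged sentence (the sentence and tags here are an assumed example): curly braces {...} pull matching tag sequences into a chunk, while reversed braces }...{ chink matching tokens back out of an existing chunk.

from nltk.chunk import RegexpParser

# Hand-tagged mini-sentence (illustrative assumption)
tagged = [("the", "DT"), ("quick", "JJ"), ("dog", "NN"), ("barks", "VBZ")]

grammar = r"""
NP: {<DT>?<JJ>*<NN>} # chunk determiner + adjectives + noun into an NP
    }<JJ>{           # chink adjectives back out of the NP
"""

# "the quick dog" is first chunked as one NP; chinking then removes "quick",
# splitting the chunk so that only the determiner and the noun remain inside NPs
print(RegexpParser(grammar).parse(tagged))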

Source Code:

import nltk

try:
    nltk.data.find('tokenizers/punkt_tab/english')
except LookupError:
    nltk.download('punkt_tab')

import nltk

from nltk import word_tokenize, pos_tag

from nltk.chunk import RegexpParser

# Download necessary resources

nltk.download('punkt')

nltk.download('averaged_perceptron_tagger')

nltk.download('averaged_perceptron_tagger_eng')

# Sample text corpus

corpus = """

John and Mary are going to the market. They will buy some vegetables and fruits.

Yesterday, I saw them at the park enjoying the sunny weather.

"""

# Tokenize and POS tagging

tokens = word_tokenize(corpus)

tagged_tokens = pos_tag(tokens)

# Define a chunking grammar

chunk_grammar = r"""

NP: {<DT>?<JJ>*<NN>} # Noun Phrase (optional Determiner, Adjective, Noun)

VP: {<VB.*><NP|PP>*} # Verb Phrase (Verb followed by NP or PP)

PP: {<IN><NP>} # Prepositional Phrase (Preposition + NP)

"""

# Define a chinking grammar (remove verbs from NP)

chink_grammar = r"""

NP: {<DT>?<JJ>*<NN>} # Noun Phrase

}<VB.*>{ # Chink out Verbs inside NP

"""

# Create chunk and chink parsers

chunk_parser = RegexpParser(chunk_grammar)

chink_parser = RegexpParser(chink_grammar)

# Apply chunking

chunked = chunk_parser.parse(tagged_tokens)

print("Chunked Sentence:")

print(chunked)

# Apply chinking

chinked = chink_parser.parse(tagged_tokens)

print("\nChinked Sentence:")

print(chinked)

Output:
Chunked Sentence:

(S

John/NNP

and/CC

Mary/NNP

(VP are/VBP)

(VP going/VBG)

to/TO

(NP the/DT market/NN)

./.

They/PRP

will/MD

(VP buy/VB)

some/DT

vegetables/NNS

and/CC

fruits/NNS

./.

(NP Yesterday/NN)

,/,

I/PRP

(VP saw/VBD)

them/PRP

(PP at/IN (NP the/DT park/NN))

(VP enjoying/VBG (NP the/DT sunny/NN) (NP weather/NN))

./.)

Chinked Sentence:

(S

John/NNP

and/CC

Mary/NNP

are/VBP

going/VBG

to/TO

(NP the/DT market/NN)

./.

They/PRP

will/MD

buy/VB

some/DT

vegetables/NNS

and/CC

fruits/NNS

./.

(NP Yesterday/NN)

,/,

I/PRP

saw/VBD

them/PRP

at/IN

(NP the/DT park/NN)

enjoying/VBG

(NP the/DT sunny/NN)

(NP weather/NN)

./.)

EXERCISE - 13

(i) Write a program to find all the mis-spelled words in a paragraph.


AIM:
To write a Python program that detects and displays all the mis-spelled words in a paragraph.
OBJECTIVES:
1. To understand the concept of spell checking in Natural Language Processing (NLP).
2. To implement a user-defined function for detecting mis-spelled words using a custom dictionary.
3. To use the pre-defined pyspellchecker library for identifying spelling mistakes and suggesting corrections.
4. To compare both approaches and highlight the advantages of using pre-defined libraries.
THEORY:
In Natural Language Processing (NLP), spell checking is the process of identifying words that are not spelled
correctly in a given text.
• User-Defined Spell Checker: Works with a small custom dictionary of valid words. Words not found are marked as mis-spelled.
• Pre-Defined Spell Checker (pyspellchecker): Uses a large inbuilt dictionary and Levenshtein distance algorithm to detect and correct spelling mistakes.
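
Before the library-based version below, a minimal sketch of the user-defined approach can be written as follows; the small word list here is an assumed toy dictionary, and any token outside it is reported as mis-spelled.

# User-defined spell checker with a tiny custom dictionary (toy word list is an assumption)
valid_words = {"natural", "language", "processing", "is", "a", "field",
               "of", "artificial", "intelligence", "that", "deals", "with"}

def simple_spell_check(text):
    # Strip basic punctuation, lowercase, and flag anything not in the dictionary
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in tokens if w and w not in valid_words]

print(simple_spell_check("Natural langauge processing is a feld of Artificial Intelligense"))
# Expected: ['langauge', 'feld', 'intelligense']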
CODE:
# Install library if not already installed
# pip install pyspellchecker

from spellchecker import SpellChecker

def find_misspelled(paragraph):
    spell = SpellChecker()
    # Tokenize into words
    words = paragraph.split()
    # Find misspelled words
    misspelled = spell.unknown(words)
    return misspelled

# Example usage
paragraph = """Natural langauge processing is a feld of Artificial Intelligense
that deals with the interacton between computers and humans."""

misspelled_words = find_misspelled(paragraph)
print("Misspelled words:", misspelled_words)

OUTPUT:
Misspelled words: {'humans.', 'feld', 'langauge', 'intelligense', 'interacton'}

(ii) Write a program to prepare a table with frequency of mis-spelled tags for any given text.

AIM
Write a Program to prepare a table with frequency of mis-spelled tags for any given text
OBJECTIVES:
1. Identify Mis-spelled Words:
To analyze a given text and detect all words that are potentially mis-spelled by comparing them against a standard
dictionary or word list.

2. Count Frequency:
To calculate how many times each mis-spelled word occurs in the given text.
3. Generate Frequency Table:
To prepare and present a table or list showing each mis-spelled word alongside its frequency count, helping in easy
analysis of common spelling mistakes.
4. Enhance Text Quality:
To assist users in identifying frequently occurring spelling errors so that they can improve the quality and correctness
of the text.
5. Facilitate Proofreading:
To provide a tool that supports efficient proofreading and editing by highlighting the most common spelling errors for
focused correction.

THEORY:
In natural language processing and text analysis, identifying spelling mistakes is a fundamental task that helps
improve the readability and correctness of written content. Spelling errors can arise from typographical mistakes,
unfamiliarity with correct word forms, or phonetic spelling. Detecting these errors requires a comparison of words in
the text against a reference dictionary or vocabulary.
Step 1: Tokenization
The input text is first split into individual words, often called tokens. This process is known as tokenization and helps
analyze the text at the word level.
Step 2: Spell Checking
Each tokenized word is checked against a dictionary of correctly spelled words. This dictionary could be built-in or
imported from external libraries like pyspellchecker. Words that do not match any entry in the dictionary are
considered potential spelling mistakes (mis-spelled words).
Step 3: Frequency Calculation
After identifying the mis-spelled words, the program counts how many times each of these words appears in the text.
This frequency calculation is essential for understanding which spelling mistakes occur most frequently and might
need priority correction.
Step 4: Display of Results
The program generates a frequency table that lists each unique mis-spelled word along with the number of
occurrences in the text. This table serves as a concise summary for users or editors to focus on specific errors and
improve the overall text quality.

SOURCE CODE:
from collections import Counter
from spellchecker import SpellChecker

def misspelled_frequency_table(paragraph):
    spell = SpellChecker()
    words = paragraph.split()
    misspelled = spell.unknown(words)

    # Count frequencies of only misspelled words
    freq = Counter([word for word in words if word in misspelled])
    return freq

# Example usage
paragraph = """Natural langauge processing is a feld of Artificial Intelligense.
Langauge models are used in NLP and Intelligense systems."""

freq_table = misspelled_frequency_table(paragraph)

# Print as table
print("Mis-spelled Word | Frequency")
print("-----------------|-----------")
for word, count in freq_table.items():
    print(f"{word:<16} | {count}")

OUTPUT:

Mis-spelled Word | Frequency
-----------------|-----------
langauge         | 1
feld             | 1
systems.         | 1

EXERCISE - 14
Aim
To write a Python program that implements various NLP pre-processing techniques required to prepare text data for
further Natural Language Processing (NLP) tasks.

Objective
To understand the importance of pre-processing in NLP.

To perform text normalization steps such as:

Lowercasing, Sentence Tokenization, Word Tokenization, Stopword Removal, Punctuation Removal, Lemmatization,
Stemming.

To make the text ready for advanced NLP applications like Sentiment Analysis, Text Classification, or Machine
Translation.

Theory
Natural Language Processing (NLP) involves interaction between computers and human language. Since raw text
contains noise, inconsistencies, and redundancies, pre-processing is a crucial step before applying machine learning or
deep learning models.

Common NLP Pre-processing Techniques:

Lowercasing – Converts all characters to lowercase for uniformity.

Sentence Tokenization – Splits text into individual sentences.

Word Tokenization – Splits sentences into words/tokens.

Stopword Removal – Removes common words like is, the, an, which do not add meaning.

Punctuation Removal – Eliminates symbols and punctuation.

Lemmatization – Converts words to their base/dictionary form (e.g., running → run).

Stemming – Reduces words to their root form (e.g., studies → studi).

These steps reduce complexity, remove redundancy, and improve model accuracy.

SOURCE CODE :
import nltk

import re

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize, sent_tokenize

from nltk.stem import WordNetLemmatizer, PorterStemmer

# Download necessary resources

nltk.download('punkt')

nltk.download('stopwords')

nltk.download('wordnet')

# Sample text

text = """Natural Language Processing (NLP) is a sub-field of Artificial Intelligence.

It deals with analyzing, understanding, and generating human languages."""

print("Original Text:\n", text)

# 1. Lowercasing

text = text.lower()

print("\nLowercased Text:\n", text)

# 2. Sentence Tokenization

sent_tokens = sent_tokenize(text)

print("\nSentence Tokenization:\n", sent_tokens)

# 3. Word Tokenization

word_tokens = word_tokenize(text)

print("\nWord Tokenization:\n", word_tokens)

# 4. Removing Punctuation & Non-Alphabetic words

words = [re.sub(r'[^a-z]', '', w) for w in word_tokens]

words = [w for w in words if w != '']

print("\nAfter Removing Punctuation:\n", words)

# 5. Removing Stopwords

stop_words = set(stopwords.words('english'))

filtered_words = [w for w in words if w not in stop_words]

print("\nAfter Stopword Removal:\n", filtered_words)

# 6. Lemmatization

lemmatizer = WordNetLemmatizer()

lemmatized_words = [lemmatizer.lemmatize(w) for w in filtered_words]

print("\nAfter Lemmatization:\n", lemmatized_words)

# 7. Stemming

stemmer = PorterStemmer()

stemmed_words = [stemmer.stem(w) for w in filtered_words]

print("\nAfter Stemming:\n", stemmed_words)

Output:
Original Text:

Natural Language Processing (NLP) is a sub-field of Artificial Intelligence.

It deals with analyzing, understanding, and generating human languages.

Lowercased Text:

natural language processing (nlp) is a sub-field of artificial intelligence.

it deals with analyzing, understanding, and generating human languages.

Sentence Tokenization:

['natural language processing (nlp) is a sub-field of artificial intelligence.',

'it deals with analyzing, understanding, and generating human languages.']

Word Tokenization:

['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'sub-field', 'of', 'artificial', 'intelligence', '.', 'it', 'deals', 'with',
'analyzing', ',', 'understanding', ',', 'and', 'generating', 'human', 'languages', '.']

After Removing Punctuation:

['natural', 'language', 'processing', 'nlp', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', 'it', 'deals', 'with', 'analyzing',
'understanding', 'and', 'generating', 'human', 'languages']

After Stopword Removal:

['natural', 'language', 'processing', 'nlp', 'subfield', 'artificial', 'intelligence', 'deals', 'analyzing', 'understanding',
'generating', 'human', 'languages']

After Lemmatization:

['natural', 'language', 'processing', 'nlp', 'subfield', 'artificial', 'intelligence', 'deal', 'analyzing', 'understanding',
'generating', 'human', 'language']

After Stemming:

['natur', 'languag', 'process', 'nlp', 'subfield', 'artifici', 'intellig', 'deal', 'analyz', 'understand', 'gener', 'human', 'languag']

Case Study – 2
Auto-Correction of Spellings in Text using NLP.

1. Introduction

In Natural Language Processing (NLP), spelling errors are common when users type text. Automatic spelling
correction improves the quality of text data for applications like search engines, chatbots, and document editing tools.
This case study demonstrates how to implement an auto-correction system using Python.

2. Problem Statement

Users often make spelling mistakes while typing. These errors affect the readability and performance of NLP-based
applications. The goal is to develop a Python program that automatically detects and corrects spelling errors in a given
text.

3. Methodology

Two approaches are commonly used:

1. Dictionary-Based Correction: Compares words against a dictionary (e.g., pyspellchecker).

2. Context-Aware Correction: Predicts correct words based on context (e.g., TextBlob).

In this case study, we use the TextBlob library, which provides a simple and efficient way to correct spelling
mistakes using probabilistic models.
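
For comparison with the TextBlob pipeline implemented below, a minimal sketch of the dictionary-based approach (approach 1) using pyspellchecker could look like this; it corrects word by word without using context, which is its main limitation.

from spellchecker import SpellChecker

# Dictionary-based, word-by-word correction (no context taken into account)
def dictionary_correct(text):
    spell = SpellChecker()
    corrected = []
    for word in text.split():
        # correction() may return None when no candidate is found, so keep the original word then
        corrected.append(spell.correction(word) or word)
    return " ".join(corrected)

print(dictionary_correct("I lik to wriite progarms in pythn"))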

The steps are:

Input Collection

• The user enters a sentence that may contain spelling mistakes.

Conversion to TextBlob Object

• The given sentence is passed to the TextBlob class.

• TextBlob treats the input as a text document and allows NLP operations on it.

Spelling Correction using Probabilistic Models

• The .correct() function is applied to the TextBlob object.

• TextBlob compares each word in the text with a large dictionary of valid words.

• Using probability and word frequency models, it selects the most likely correct spelling for each word.

• Example:

o “lik” → “like”

o “pythn” → “python”

Output Generation

• The corrected text is returned and displayed to the user.

• Both original text and corrected text are shown for comparison.

4. Example

• Input: I lik to wriite progarms in pythn

• Process: TextBlob corrects misspelled words using probabilistic models.

• Output: I like to write programs in python

5. Implementation

from textblob import TextBlob

def autocorrect_text(text):
    blob = TextBlob(text)
    corrected_text = blob.correct()
    return str(corrected_text)

if __name__ == "__main__":
    print("----- Auto-Correction of Spellings -----")
    user_input = input("Enter a sentence with spelling mistakes: ")
    print("\nOriginal Text : ", user_input)
    print("Corrected Text: ", autocorrect_text(user_input))

6. Sample Input & Output

Input: I lik to wriite progarms in pythn

Output: Original Text : I lik to wriite progarms in pythn

Corrected Text: I like to write programs in python

7. Conclusion

This case study demonstrates how spell correction can be achieved using NLP tools. By using the TextBlob library, we
can automatically correct spelling mistakes and improve the overall quality of textual data. This is useful in real-world
applications like chatbots, search engines, and text editors.

