NLP Lab Work
ASSIGNMENT - 1
Qn.: Write a python program to perform sentence tokenization.
CODE :-
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize
def tokenize_sentences(text):
    sentences = sent_tokenize(text)
    return sentences
text = ("Engineering offers diverse career paths, from industrial to research. "
        "Consider internships and networking to explore different specializations. "
        "Focus on developing both technical and soft skills for a successful future in the field.")
sentences = tokenize_sentences(text)
for i, sentence in enumerate(sentences):
    print(f"Sentence {i+1}: {sentence}")
OUTPUT :-
Sentence 1: Engineering offers diverse career paths, from industrial to research.
Sentence 2: Consider internships and networking to explore different specializations.
Sentence 3: Focus on developing both technical and soft skills for a successful future in the field.
ASSIGNMENT - 2
Qn.: Write a python program to perform word tokenization.
Answer: In Natural Language Processing, tokenization refers to the process of breaking
down a large piece of text into smaller units called tokens, such as words, phrases, or
sentences. These tokens are the basic building blocks for further text analysis.
For example, the sentence "I love NLP." can be tokenized into ["I", "love", "NLP", "."].
Tokenization is the first and essential step in many NLP tasks like text preprocessing, part-of-speech tagging, sentiment analysis, and machine translation, as it helps convert raw text into a structured format that machines can understand.
CODE :-
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
def tokenize_words(text):
    words = word_tokenize(text)
    return words
text = "NLTK is a leading platform for building Python programs to work with human language data."
words = tokenize_words(text)
print(words)
OUTPUT :-
['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with',
'human', 'language', 'data', '.']
ASSIGNMENT - 3
Qn.: Write a python program to eliminate stopwords using nltk.
Answer: Stopwords are frequently used, common words that carry little meaningful semantic information. They are usually removed from text during preprocessing so that the NLP algorithm can focus on the words that carry important meaning, which improves the quality of the analysis. Examples include words like "the," "is," "in," "and," and "a."
These words appear frequently in all types of text but do not contribute significantly to the
overall meaning or context. Removing stopwords helps reduce noise in the data, making
NLP tasks like text classification, search, and sentiment analysis more efficient and focused
on meaningful content.
CODE :-
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
sentence = "NLTK is a leading platform for building Python programs to work with human language data."  # example sentence (assumed; the original input is not shown)
words = word_tokenize(sentence)
stopword = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stopword]
print("Original Sentence :-", sentence)
print("Filtered Sentence :-", ' '.join(filtered_words))
OUTPUT :-
Original Sentence :-
Filtered Sentence :-
CODE :-
def remove_stopwords(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Get English stopwords
    english_stopwords = set(stopwords.words('english'))
    # Keep only the words that are not stopwords
    return [word for word in words if word.lower() not in english_stopwords]
english_stopwords = set(stopwords.words('english'))
print(english_stopwords)
OUTPUT:
{'didn', 'yours', 'am', 'this', 'for', 'ourselves', 'were', 'won', 'down', 'you', 'there', 'here', 'to', "it'd",
"wouldn't", 'on', 'i', "doesn't", 'myself', 'only', "that'll", 'further', 'own', "they're", 'after',
"should've", 'who', 'yourself', 'yourselves', 's', 'then', 'ma', 'theirs', "you'd", "hasn't", "couldn't",
'where', 'against', "we'll", 'the', 'or', 'aren', 'herself', 't', "it'll", 'too', 'all', 'couldn', "mustn't", 'don',
'doing', 'y', "isn't", 'we', 'whom', 'hasn', 'no', 'isn', 'itself', 'not', 'have', 'of', 'in', 'as', 'ours', 'with',
'my', 'her', "we'd", "didn't", 'same', 'during', 'these', 'doesn', 'been', 'while', "i'm", 'him',
'between', 'having', 'why', 'will', "wasn't", 'and', 'they', 'when', 'ain', 'how', 'those', "haven't",
"they'd", 'once', 'both', 'above', 'out', 'o', "hadn't", 've', 'has', 'weren', 'from', 'at', 'just', 'any',
'which', "we're", 'by', "it's", 'than', 'do', 'nor', 'off', 'being', 'below', 'she', "he'll", 'should', "won't",
'be', 'wasn', "you'll", 'because', "i'll", "weren't", 'some', 'their', 'd', 'hers', 'up', 'each', "he's",
"needn't", 'are', 'had', 'his', "mightn't", 'is', 'a', "she's", 'about', 'what', 'over', 'shan', 'until',
'more', 'll', 'such', "he'd", 'most', 'now', 'through', 'themselves', 'does', 'he', 'its', "she'd", 'under',
'an', "shouldn't", "we've", "you're", 'but', 'your', 'other', 'before', 'into', 'our', 'it', 'haven', 'did',
"they'll", "don't", 'so', "aren't", 'wouldn', 'shouldn', 'can', 'hadn', 'very', 'me', 'mightn', 'that',
"shan't", "i've", 'again', "she'll", 'm', 'them', "they've", 'mustn', "i'd", "you've", 'was', 're', 'few',
'himself', 'if', 'needn'}
CODE :-
import pandas as pd
# Convert the stopwords set to a DataFrame (as a list, since a set has no order)
df = pd.DataFrame(list(english_stopwords), columns=['English Stopwords'])
# Display the first 30 stopwords as a sample
print(df.head(30))
OUTPUT :-
English Stopwords
0 didn
1 yours
2 am
3 this
4 for
5 ourselves
6 were
7 won
8 down
9 you
10 there
11 here
12 to
13 it'd
14 wouldn't
15 on
16 i
17 doesn't
18 myself
19 only
20 that'll
21 further
22 own
23 they're
24 after
25 should've
26 who
27 yourself
28 yourselves
29 s
ASSIGNMENT - 4
Qn.: Write a python program to perform stemming using nltk.
Answer: In Natural Language Processing, stemming is a text processing technique that removes prefixes and suffixes from a word to obtain its root or base form. It is a rule-based approach used to reduce the dimensionality of text data, simplifying words and improving performance in some NLP tasks. The resulting stem may not always be a valid word, but it helps group similar words with the same meaning.
For example, the words "playing," "played," and "plays" can all be reduced to the stem "play".
Stemming is useful in tasks like information retrieval, search engines, and text mining, as it
helps match different forms of the same word and improves processing efficiency.
CODE :-
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
def stem_text(text):
    porter_stemmer = PorterStemmer()
    words = word_tokenize(text)
    stemmed_words = [porter_stemmer.stem(word) for word in words]
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text
text = "NLTK is a leading platform for building Python programs to work with human language data."
stemmed_text = stem_text(text)
print(stemmed_text)
OUTPUT :-
nltk is a lead platform for build python program to work with human languag data.
ASSIGNMENT - 5
Qn.: Write a python program to perform tokenization by word
and sentence using Stanza.
Answer : Tokenization refers to the process of breaking down a text into smaller units called
tokens. Stanza is a collection of accurate, efficient tools for the linguistic analysis of many
human languages.
CODE :-
print("\nWord Tokenization:\n")
for sentence in enumerate(doc.sentences):
print(f"Words in sentence {i+1}:")
for word in sentence.words:
print(f"- {word.text}")
OUTPUT :-
Sentence Tokenization:
Sentence 1: Stanza is developed by NLP group.
Sentence 2: It's easy to use and powerful for NLP tasks.
Word Tokenization :-
Words in Sentence 1 :-
- Stanza
- is
- developed
- by
- NLP
- group
- .
Words in Sentence 2 :-
- It
- 's
- easy
- to
- use
- and
- powerful
- for
- NLP
- tasks
- .
ASSIGNMENT - 6
Qn.: Write a python program for word tokenization and sentence
segmentation using spaCy.
Answer : spaCy is an open-source software library for advanced natural language processing, written in Python and Cython. The library is published under the MIT license. Using pre-trained language models, it offers capabilities such as tokenization, POS tagging, named entity recognition (NER), dependency parsing, and more.
CODE :-
import spacy
nlp = spacy.load("en_core_web_sm")
text = "spaCy is an open-source library for Advanced Natural Language Processing in
Python. It's fast and easy to use."
doc = nlp(text)
print("Sentence Segmentation:\n")
OUTPUT :-
Sentence Segmentation:
Sentence 1: spaCy is an open-source library for Advanced Natural Language Processing in Python.
Sentence 2: It's fast and easy to use.
Word Tokenization:
Token 1: spaCy
Token 2: is
Token 3: an
Token 4: open
Token 5: -
Token 6: source
Token 7: library
Token 8: for
Token 9: Advanced
Token 10: Natural
Token 11: Language
Token 12: Processing
Token 13: in
Token 14: Python
Token 15: .
Token 16: It
Token 17: 's
Token 18: fast
Token 19: and
Token 20: easy
Token 21: to
Token 22: use
Token 23: .
ASSIGNMENT - 7
Qn.: Write a python program to find all the stopwords in the
given corpus using spaCy.
Answer : spaCy is an open-source software library for advanced natural language processing tasks, written in Python and Cython. Leveraging pre-trained language models, it offers capabilities like tokenization, POS tagging, named entity recognition (NER), dependency parsing, and many more.
CODE :-
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Natural Language Processing is a field of artificial intelligence that focuses on the
interaction between humans and computers using natural language. The goal is to enable
computers to understand, interpret and generate human language in a valuable way."
doc = nlp(text)
print("Stopwords found in the corpus:\n")
stp = [token.text.lower() for token in doc if token.is_stop]
for word in sorted(stp):
    print(word)
OUTPUT :-
Stopwords found in the corpus:
- a
- and
- between
- in
- is
- of
- on
- that
- the
- using
ASSIGNMENT - 8
Qn.: Write a python program to find vocabulary, punctuation,
POS tags and perform root word stemming using nltk.
Answer : In Natural Language Processing (NLP), the hierarchy of text refers to the
structured levels at which language is processed and analyzed. This hierarchy begins with
the document level, which may consist of multiple paragraphs. Each paragraph contains
several sentences, and each sentence is made up of phrases. Phrases are composed of
words, which in turn are built from individual characters. Understanding this layered
structure allows NLP systems to break down and interpret language effectively, enabling
tasks such as tokenization, parsing, and semantic analysis.
CODE :-
#".................Hierarchy of Text..................."
OUTPUT :-
['The', 'dogs', 'are', 'barking', 'loudly', 'outside', 'the', 'house', '.']
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
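As a supplementary illustration of the hierarchy described above, the sketch below (an assumed example using nltk's sent_tokenize and word_tokenize) walks a short two-sentence document down through the sentence, word, and character levels:
CODE :-
from nltk.tokenize import sent_tokenize, word_tokenize
document = "The dogs are barking loudly outside the house. The children are watching them."  # assumed example document
for s_idx, sentence in enumerate(sent_tokenize(document), start=1):
    print(f"Sentence {s_idx}: {sentence}")
    for word in word_tokenize(sentence):
        # Each word is itself a sequence of characters
        print(f"  Word: {word} -> Characters: {list(word)}")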
DEFINITION :-
In Natural Language Processing (NLP), vocabulary refers to the set of unique words or
tokens present in a given text or corpus. It represents all the distinct words that a model can
recognize or process. The size of the vocabulary depends on the dataset and affects tasks
like tokenization, language modeling, and text classification. A well-defined vocabulary helps
machines understand and work with human language more effectively.
CODE :-
#".................Vocabulary..................."
tokens = nltk.word_tokenize(sent)
vocab = sorted(set(tokens))
print(vocab)  # Print the sorted vocabulary (unique tokens)
OUTPUT :-
['.', 'The', 'are', 'barking', 'dogs', 'house', 'loudly', 'outside', 'the']
DEFINITION :-
In Natural Language Processing (NLP), punctuation refers to the symbols used in text (such
as commas, periods, question marks, etc.) that help structure and clarify meaning.
Punctuation marks are important for understanding sentence boundaries, pauses, emphasis,
and sentence types. In NLP tasks like sentence segmentation, sentiment analysis, or
machine translation, punctuation helps in accurately interpreting and generating human-like
language.
CODE :-
#".................Punctuation..................."
OUTPUT :-
['The', 'are', 'barking', 'dogs', 'house', 'loudly', 'outside', 'the']
DEFINITION :-
In Natural Language Processing (NLP), Part-of-Speech (POS) tagging is the process of labelling each word in a text with its grammatical category, such as noun, verb, adjective or adverb, based on its definition and its context within the sentence.
CODE :-
#".................Part of Speech or POS with the tags..................."
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger_eng')
pos_list = pos_tag(vocab_no_punct)
print(pos_list)
def pos_tagging(text):
    words = word_tokenize(text)
    tagged_words = nltk.pos_tag(words)
    return tagged_words
text = "NLTK is a leading platform for building Python programs to work with human language data."
tagged_text = pos_tagging(text)
print(tagged_text)
OUTPUT :-
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'),
('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'),
('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]
DEFINITION :-
In Natural Language Processing, stemming is the process of reducing a word to its root or
base form by removing prefixes or suffixes. The root of a word obtained through stemming
may not always be a valid word but is a common form used for grouping related words. For
example, the words "running," "runs," and "runner" may all be reduced to the root "run."
Stemming helps in text normalization, improving the performance of search engines, text
classification, and information retrieval.
CODE :-
# Stemming is a technique used to find the root form of a word, which is devoid of any affixes (suffixes and prefixes)
from nltk.stem import PorterStemmer
porter = PorterStemmer()
print(porter.stem("studies"))  # example word (assumed); stems to "studi"
stemmed_vocab = [porter.stem(word) for word in vocab_no_punct]  # stem the punctuation-free vocabulary
print(stemmed_vocab)
OUTPUT :-
studi
['the', 'are', 'bark', 'dog', 'hous', 'loud', 'outsid', 'the']
ASSIGNMENT - 9
Qn.: Write a python program to perform lemmatization using nltk.
Answer : In Natural Language Processing, lemmatization is the process of reducing a
word to its base or dictionary form, known as the lemma, while considering the word’s
meaning and part of speech. Unlike stemming, lemmatization produces real words.
For example, “running,” “ran,” and “runs” are all reduced to the lemma “run.” It uses
vocabulary and morphological analysis to ensure the correct root is found.
Lemmatization is used in tasks like information retrieval, text mining, and machine
translation for better language understanding.
CODE :-
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in tokens])
    return lemmatized_text
text = "The cats are chasing mice and playing in the garden"
lemmatized_text = lemmatize_text(text)
print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)
OUTPUT :-
Original Text: The cats are chasing mice and playing in the garden
Lemmatized Text: The cat are chasing mouse and playing in the garden
CODE :-
# Lemmatization removes inflection and reduces the word to its base form
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("going", pos='v'))  # example word (assumed); lemmatizes to "go"
print(','.join(lemmatizer.lemmatize(w, pos='v') for w in stemmed_vocab) + ',')   # lemmatize the stemmed words as verbs
print(','.join(lemmatizer.lemmatize(w, pos='v') for w in vocab_no_punct) + ',')  # lemmatize the original words as verbs
OUTPUT :-
go
the,be,bark,dog,hous,loud,outsid,the,
The,be,bark,dog,house,loudly,outside,the,
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
ASSIGNMENT - 10
Qn.: Write a python program to perform Parts of Speech tagging
using nltk.
Answer : In Natural Language Processing (NLP), POS tagging (Part-of-Speech tagging) is
the process of assigning each word in a sentence its correct grammatical category, such as
noun, verb, adjective, adverb, etc. This helps the machine understand the structure and
meaning of a sentence. For example, in the sentence "The cat sleeps," "The" is tagged as a
determiner, "cat" as a noun, and "sleeps" as a verb. POS tagging is essential for tasks like
parsing, machine translation, and information extraction.
CODE :-
import nltk
from nltk.tokenize import word_tokenize
# Download NLTK tokenizer and POS tagging models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Define the POS tagging function
def pos_tagging(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Perform POS tagging
    tagged_words = nltk.pos_tag(words)
    return tagged_words
# Example text
text = "NLTK is a leading platform for building Python programs to work with human language data."
# Perform POS tagging
tagged_text = pos_tagging(text)
# Print POS tagged text
print(tagged_text)
OUTPUT :-
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'),
('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'),
('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
ASSIGNMENT - 11
CODE :-
! pip install -q spacy stanza
! python -m spacy download en_core_web_sm
import spacy
import stanza
stanza.download("en")
stanza_nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")  # Stanza lemmatization pipeline
spacy_nlp = spacy.load("en_core_web_sm")  # spaCy pipeline (includes a lemmatizer)
def lemmatize_stanza(text):
    doc = stanza_nlp(text)
    return [word.lemma for sentence in doc.sentences for word in sentence.words]
def lemmatize_spacy(text):
    return [token.lemma_ for token in spacy_nlp(text)]
text = "The children are playing in the gardens and eating sandwiches."
print("Original Text:", text)
print("spaCy Lemmatization:", lemmatize_spacy(text))
print("Stanza Lemmatization:", lemmatize_stanza(text))
OUTPUT :-
Original Text: The children are playing in the gardens and eating sandwiches.
spaCy Lemmatization: ['the', 'child', 'be', 'play', 'in', 'the', 'garden', 'and', 'eat', 'sandwich', '.']
Stanza Lemmatization: ['the', 'child', 'be', 'play', 'in', 'the', 'garden', 'and', 'eat', 'sandwich', '.']
ASSIGNMENT - 12
CODE :-
# Install and import required libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text
text = "The quick brown fox jumps over the lazy dog."
# Step 1: Tokenize and POS tag
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
# Step 2: Define a chunk grammar (noun phrase: NP)
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN.*>} # NP: optional determiner + adjectives + noun
"""
# Step 3: Create a chunk parser
chunk_parser = RegexpParser(chunk_grammar)
# Step 4: Parse the tagged sentence
chunked_output = chunk_parser.parse(tagged)
# Step 5: Display the chunk tree
# chunked_output.draw()  # This will open a tree viewer (works in local Python, not in Colab)
# Alternative text output (for Colab)
print(chunked_output)
OUTPUT :-
(S
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
jumps/VBZ
over/IN
(NP the/DT lazy/JJ dog/NN)
./.)
ASSIGNMENT - 13
CODE :-
# Step 1: Install and import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
nltk.download('punkt') # Download required resources
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('maxent_ne_chunker_tab')
def ner(text):  # Define the NER function
    tokens = word_tokenize(text)  # Tokenize the sentence
    tagged_words = pos_tag(tokens)  # POS tagging
    named_entities = ne_chunk(tagged_words)  # Named Entity Recognition
    return named_entities
text = "Apple is a company based in California, United States. Steve Jobs was one of its founders."
# Perform NER
named_entities = ner(text)
print(named_entities) # Print the result
OUTPUT :-
(S
(GPE Apple/NNP)
is/VBZ
a/DT
company/NN
based/VBN
in/IN
(GPE California/NNP)
,/,
(GPE United/NNP States/NNPS)
./.
(PERSON Steve/NNP Jobs/NNP)
was/VBD
one/CD
of/IN
its/PRP$
founders/NNS
./.)
ASSIGNMENT - 14
CODE :-
#{<.*>+}: Chunk everything.
#}<VB.*>{: Chink (remove) verbs from the chunk.
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
# Download required NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample sentence
text = "The quick brown fox jumps over the lazy dog."
# Tokenize and POS tag the sentence
tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)
grammar = r"""
NP: {<.*>+} # Chunk everything
}<VB.*>{ # Chink (remove) any verb from chunks
"""
# Create a chunk parser
chunk_parser = RegexpParser(grammar)
# Parse the sentence
chunked = chunk_parser.parse(tagged_tokens)
# Display the output
print(chunked)
# Optional: draw the chunk tree (only works in local Python, not Colab)
# chunked.draw()
OUTPUT :-
(S
(NP The/DT quick/JJ brown/NN fox/NN)
jumps/VBZ
(NP over/IN the/DT lazy/JJ dog/NN ./.))
ASSIGNMENT - 15
Qn.: Write a python program to find Term Frequency and
Inverse Document Frequency (TF-IDF).
Answer : In Natural Language Processing, Term Frequency (TF) and Inverse Document Frequency
(IDF) are numerical measures used to evaluate how important a word is in a document and across a
collection of documents.
Term Frequency (TF) measures how often a term appears in a document. It is calculated as the
number of times a word appears divided by the total number of words in that document. It shows
the local importance of a word.
Inverse Document Frequency (IDF) measures how unique or rare a term is across all documents. It is
calculated using the total number of documents divided by the number of documents containing the
word, and then taking the logarithm of that value. Words that appear in many documents have a
lower IDF, meaning they are less important.
The combination TF-IDF helps identify words that are frequent in a specific document but rare
across the collection, making them useful for tasks like document classification, search engines, and
keyword extraction.
CODE :-
# Using TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
documents = [
"Data science is an interdisciplinary field.",
"Machine learning is a part of data science.",
"Data science involves statistics and machine learning."
]
# Create the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)
# Get the feature names (terms)
terms = vectorizer.get_feature_names_out()
# Display TF-IDF matrix
for i, doc in enumerate(tfidf_matrix.toarray()):
    print(f"\nDocument {i+1} TF-IDF:")
    for term, score in zip(terms, doc):
        if score > 0:
            print(f" {term}: {score:.4f}")
OUTPUT :-
Document 1 TF-IDF:
an: 0.4836
data: 0.2856
field: 0.4836
interdisciplinary: 0.4836
is: 0.3678
science: 0.2856
Document 2 TF-IDF:
data: 0.2805
is: 0.3612
learning: 0.3612
machine: 0.3612
of: 0.4750
part: 0.4750
science: 0.2805
Document 3 TF-IDF:
and: 0.4539
data: 0.2681
involves: 0.4539
learning: 0.3452
machine: 0.3452
science: 0.2681
statistics: 0.4539
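To make the TF and IDF formulas above concrete, the sketch below computes them by hand for one term over the same three documents, using the plain textbook definitions (TF = term count / total words, IDF = log(N / document frequency)). Note that scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization, so its scores above differ from these raw values.
CODE :-
import math
documents = [
    "Data science is an interdisciplinary field.",
    "Machine learning is a part of data science.",
    "Data science involves statistics and machine learning."
]
term = "science"
# Very simple tokenization (lowercase, strip the final period) just for this illustration
tokenized = [doc.lower().replace('.', '').split() for doc in documents]
# Term Frequency of the term in document 1
tf = tokenized[0].count(term) / len(tokenized[0])
# Inverse Document Frequency across the whole collection
df = sum(1 for doc in tokenized if term in doc)
idf = math.log(len(tokenized) / df)
print(f"TF('{term}', document 1) = {tf:.4f}")   # 1/6 = 0.1667
print(f"IDF('{term}') = {idf:.4f}")             # log(3/3) = 0.0
print(f"TF-IDF = {tf * idf:.4f}")               # 0.0: 'science' occurs in every document, so it carries no weight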
ASSIGNMENT - 16
CODE :-
import nltk
nltk.download('punkt') # Download the Punkt tokenizer models
from nltk.util import ngrams
# Sample text
samplText = 'this is a very good book to study'
# Loop over ngram sizes from 1 to 3
for i in range(1, 4):
    # Generate ngrams
    NGRAMS = ngrams(sequence=nltk.word_tokenize(samplText), n=i)
    # Print each ngram
    for grams in NGRAMS:
        print(grams)
OUTPUT :-
('this',)
('is',)
('a',)
('very',)
('good',)
('book',)
('to',)
('study',)
('this', 'is')
('is', 'a')
('a', 'very')
('very', 'good')
('good', 'book')
('book', 'to')
('to', 'study')
('this', 'is', 'a')
('is', 'a', 'very')
('a', 'very', 'good')
('very', 'good', 'book')
('good', 'book', 'to')
('book', 'to', 'study')
ASSIGNMENT - 17
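The output below comes from NLTK's VADER sentiment analyzer (SentimentIntensityAnalyzer), which assigns negative, neutral, positive and compound scores to a piece of text. The sketch below is a minimal example that produces output in this form; the input sentence is an assumed example, so the exact scores depend on the text analyzed.
CODE :-
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
# Assumed example sentence (the text that produced the recorded output is not shown)
text = "I really enjoyed this NLP lab; the exercises were fun and very helpful."
analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores(text)
print("Sentiment Scores:", scores)
# Classify using the conventional compound-score thresholds
if scores['compound'] >= 0.05:
    sentiment = "Positive"
elif scores['compound'] <= -0.05:
    sentiment = "Negative"
else:
    sentiment = "Neutral"
print("The sentiment of the text is:", sentiment)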
OUTPUT :-
Sentiment Scores: {'neg': 0.0, 'neu': 0.458, 'pos': 0.542, 'compound': 0.8516}
The sentiment of the text is: Positive