NLP Lab Work
ASSIGNMENT - 1
Qn.: Write a python program to perform sentence tokenization.
CODE :-
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize
def tokenize_sentences(text):
    sentences = sent_tokenize(text)
    return sentences
text = ("Engineering offers diverse career paths, from industrial to research. "
        "Consider internships and networking to explore different specializations. "
        "Focus on developing both technical and soft skills for a successful future in the field.")
sentences = tokenize_sentences(text)
for i, sentence in enumerate(sentences):
    print(f"Sentence {i+1}: {sentence}")
OUTPUT :-
Sentence 1: Engineering offers diverse career paths, from industrial to research.
Sentence 2: Consider internships and networking to explore different specializations.
Sentence 3: Focus on developing both technical and soft skills for a successful future in the field.
ASSIGNMENT - 2
Qn.: Write a python program to perform word tokenization.
Answer: In Natural Language Processing, tokenization refers to the process of breaking
down a large piece of text into smaller units called tokens, such as words, phrases, or
sentences. These tokens are the basic building blocks for further text analysis.
For example, the sentence "I love NLP." can be tokenized into ["I", "love", "NLP", "."].
Tokenization is the first and essential step in many NLP tasks like text preprocessing, part-of-speech tagging, sentiment analysis, and machine translation, as it helps convert raw text into a structured format that machines can understand.
CODE :-
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
def tokenize_words(text):
    words = word_tokenize(text)
    return words
text = "NLTK is a leading platform for building Python programs to work with human language data."
words = tokenize_words(text)
print(words)
OUTPUT :-
['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with',
'human', 'language', 'data', '.']
ASSIGNMENT - 3
Qn.: Write a python program to eliminate stopwords using nltk.
Answer: Stopwords are frequently used, common words that carry little meaningful semantic information. They are usually removed from text during preprocessing so that the NLP algorithm can focus on the words that carry important meaning, which improves the quality of the analysis. Examples include words like "the," "is," "in," "and," and "a."
These words appear frequently in all types of text but do not contribute significantly to the
overall meaning or context. Removing stopwords helps reduce noise in the data, making
NLP tasks like text classification, search, and sentiment analysis more efficient and focused
on meaningful content.
CODE :-
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
sentence = "NLTK is a leading platform for building Python programs to work with human language data."  # example sentence (assumed; the original input is not shown)
words = word_tokenize(sentence)
stopword = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stopword]
print("Original Sentence :-", sentence)
print("Filtered Sentence :-", ' '.join(filtered_words))
OUTPUT :-
Original Sentence :-
Filtered Sentence :-
CODE :-
def remove_stopwords(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Get English stopwords
    english_stopwords = set(stopwords.words('english'))
    # Keep only the words that are not stopwords
    return [word for word in words if word.lower() not in english_stopwords]
english_stopwords = set(stopwords.words('english'))
print(english_stopwords)
OUTPUT:
{'didn', 'yours', 'am', 'this', 'for', 'ourselves', 'were', 'won', 'down', 'you', 'there', 'here', 'to', "it'd",
"wouldn't", 'on', 'i', "doesn't", 'myself', 'only', "that'll", 'further', 'own', "they're", 'after',
"should've", 'who', 'yourself', 'yourselves', 's', 'then', 'ma', 'theirs', "you'd", "hasn't", "couldn't",
'where', 'against', "we'll", 'the', 'or', 'aren', 'herself', 't', "it'll", 'too', 'all', 'couldn', "mustn't", 'don',
'doing', 'y', "isn't", 'we', 'whom', 'hasn', 'no', 'isn', 'itself', 'not', 'have', 'of', 'in', 'as', 'ours', 'with',
'my', 'her', "we'd", "didn't", 'same', 'during', 'these', 'doesn', 'been', 'while', "i'm", 'him',
'between', 'having', 'why', 'will', "wasn't", 'and', 'they', 'when', 'ain', 'how', 'those', "haven't",
"they'd", 'once', 'both', 'above', 'out', 'o', "hadn't", 've', 'has', 'weren', 'from', 'at', 'just', 'any',
'which', "we're", 'by', "it's", 'than', 'do', 'nor', 'off', 'being', 'below', 'she', "he'll", 'should', "won't",
'be', 'wasn', "you'll", 'because', "i'll", "weren't", 'some', 'their', 'd', 'hers', 'up', 'each', "he's",
"needn't", 'are', 'had', 'his', "mightn't", 'is', 'a', "she's", 'about', 'what', 'over', 'shan', 'until',
'more', 'll', 'such', "he'd", 'most', 'now', 'through', 'themselves', 'does', 'he', 'its', "she'd", 'under',
'an', "shouldn't", "we've", "you're", 'but', 'your', 'other', 'before', 'into', 'our', 'it', 'haven', 'did',
"they'll", "don't", 'so', "aren't", 'wouldn', 'shouldn', 'can', 'hadn', 'very', 'me', 'mightn', 'that',
"shan't", "i've", 'again', "she'll", 'm', 'them', "they've", 'mustn', "i'd", "you've", 'was', 're', 'few',
'himself', 'if', 'needn'}
CODE :-
import pandas as pd
# Convert the stopwords set to a DataFrame (as a list, since a set has no order)
df = pd.DataFrame(list(english_stopwords), columns=['English Stopwords'])
# Display the first 30 stopwords as a sample
print(df.head(30))
OUTPUT :-
English Stopwords
0 didn
1 yours
2 am
3 this
4 for
5 ourselves
6 were
7 won
8 down
9 you
10 there
11 here
12 to
13 it'd
14 wouldn't
15 on
16 i
17 doesn't
18 myself
19 only
20 that'll
21 further
22 own
23 they're
24 after
25 should've
26 who
27 yourself
28 yourselves
29 s
ASSIGNMENT - 4
Qn.: Write a python program to perform stemming using nltk.
Answer: In Natural Language Processing, stemming is a text processing technique that removes prefixes and suffixes from a word to obtain its root or base form. It is a rule-based approach used to reduce the dimensionality of text data, simplifying words and improving performance in some NLP tasks. The resulting stem may not always be a valid word, but it helps group similar words with the same meaning.
For example, the words "playing," "played," and "plays" can all be reduced to the stem "play".
Stemming is useful in tasks like information retrieval, search engines, and text mining, as it
helps match different forms of the same word and improves processing efficiency.
CODE :-
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
def stem_text(text):
    porter_stemmer = PorterStemmer()
    words = word_tokenize(text)
    stemmed_words = [porter_stemmer.stem(word) for word in words]
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text
text = "NLTK is a leading platform for building Python programs to work with human language data."
stemmed_text = stem_text(text)
print(stemmed_text)
OUTPUT :-
nltk is a lead platform for build python program to work with human languag data.
ASSIGNMENT - 5
Qn.: Write a python program to perform tokenization by word
and sentence using Stanza.
Answer : Tokenization refers to the process of breaking down a text into smaller units called
tokens. Stanza is a collection of accurate, efficient tools for the linguistic analysis of many
human languages.
CODE :-
print("\nWord Tokenization:\n")
for sentence in enumerate(doc.sentences):
print(f"Words in sentence {i+1}:")
for word in sentence.words:
print(f"- {word.text}")
OUTPUT :-
Sentence Tokenization:
Sentence 1: Stanza is developed by NLP group.
Sentence 2: It's easy to use and powerful for NLP tasks.
Word Tokenization :-
Words in Sentence 1 :-
- Stanza
- is
- developed
- by
- NLP
- group
- .
Words in Sentence 2 :-
- It
- 's
- easy
- to
- use
- and
- powerful
- for
- NLP
- tasks
- .
ASSIGNMENT - 6
Qn.: Write a python program for word tokenization and sentence
segmentation using spaCy.
Answer : spaCy is an open-source software library for advanced natural language processing, written in Python and Cython. The library is published under the MIT license. Using pre-trained language models, it offers capabilities such as tokenization, POS tagging, named entity recognition (NER), dependency parsing, and more.
CODE :-
import spacy
nlp = spacy.load("en_core_web_sm")
text = "spaCy is an open-source library for Advanced Natural Language Processing in
Python. It's fast and easy to use."
doc = nlp(text)
print("Sentence Segmentation:\n")
OUTPUT :-
Sentence Segmentation:
Sentence 1: spaCy is an open-source library for Advanced Natural Language Processing in Python.
Sentence 2: It's fast and easy to use.
Word Tokenization:
Token 1: spaCy
Token 2: is
Token 3: an
Token 4: open
Token 5: -
Token 6: source
Token 7: library
Token 8: for
Token 9: Advanced
Token 10: Natural
Token 11: Language
Token 12: Processing
Token 13: in
Token 14: Python
Token 15: .
Token 16: It
Token 17: 's
Token 18: fast
Token 19: and
Token 20: easy
Token 21: to
Token 22: use
Token 23: .
ASSIGNMENT - 7
Qn.: Write a python program to find all the stopwords in the
given corpus using spaCy.
Answer : spaCy is an open-source software library for advanced natural language processing tasks, written in Python and Cython. Leveraging pre-trained language models, it offers capabilities like tokenization, POS tagging, named entity recognition (NER), dependency parsing, and many more.
CODE :-
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Natural Language Processing is a field of artificial intelligence that focuses on the
interaction between humans and computers using natural language. The goal is to enable
computers to understand, interpret and generate human language in a valuable way."
doc = nlp(text)
print("Stopwords found in the corpus:\n")
stp = [token.text.lower() for token in doc if token.is_stop]
for word in sorted(stp):
    print(word)
OUTPUT :-
Stopwords found in the corpus:
- a
- and
- between
- in
- is
- of
- on
- that
- the
- using
ASSIGNMENT - 8
Qn.: Write a python program to find vocabulary, punctuation,
POS tags and perform root word stemming using nltk.
Answer : In Natural Language Processing (NLP), the hierarchy of text refers to the
structured levels at which language is processed and analyzed. This hierarchy begins with
the document level, which may consist of multiple paragraphs. Each paragraph contains
several sentences, and each sentence is made up of phrases. Phrases are composed of
words, which in turn are built from individual characters. Understanding this layered
structure allows NLP systems to break down and interpret language effectively, enabling
tasks such as tokenization, parsing, and semantic analysis.
CODE :-
#".................Hierarchy of Text..................."
OUTPUT :-
['The', 'dogs', 'are', 'barking', 'loudly', 'outside', 'the', 'house', '.']
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
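As a supplementary illustration of the hierarchy described above, the sketch below (an assumed example using nltk's sent_tokenize and word_tokenize) walks a short two-sentence document down through the sentence, word, and character levels:
CODE :-
from nltk.tokenize import sent_tokenize, word_tokenize
document = "The dogs are barking loudly outside the house. The children are watching them."  # assumed example document
for s_idx, sentence in enumerate(sent_tokenize(document), start=1):
    print(f"Sentence {s_idx}: {sentence}")
    for word in word_tokenize(sentence):
        # Each word is itself a sequence of characters
        print(f"  Word: {word} -> Characters: {list(word)}")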
DEFINITION :-
In Natural Language Processing (NLP), vocabulary refers to the set of unique words or
tokens present in a given text or corpus. It represents all the distinct words that a model can
recognize or process. The size of the vocabulary depends on the dataset and affects tasks
like tokenization, language modeling, and text classification. A well-defined vocabulary helps
machines understand and work with human language more effectively.
CODE :-
#".................Vocabulary..................."
tokens = nltk.word_tokenize(sent)
vocab = sorted(set(tokens))
print(vocab)  # Print the sorted vocabulary (unique tokens)
OUTPUT :-
['.', 'The', 'are', 'barking', 'dogs', 'house', 'loudly', 'outside', 'the']
DEFINITION :-
In Natural Language Processing (NLP), punctuation refers to the symbols used in text (such
as commas, periods, question marks, etc.) that help structure and clarify meaning.
Punctuation marks are important for understanding sentence boundaries, pauses, emphasis,
and sentence types. In NLP tasks like sentence segmentation, sentiment analysis, or
machine translation, punctuation helps in accurately interpreting and generating human-like
language.
CODE :-
#".................Punctuation..................."
OUTPUT :-
['The', 'are', 'barking', 'dogs', 'house', 'loudly', 'outside', 'the']
DEFINITION :-
In Natural Language Processing (NLP), Part-of-Speech (POS) tagging is the process of labelling each word in a text with its grammatical category, such as noun, verb, adjective or adverb, based on its definition and its context within the sentence.
CODE :-
#".................Part of Speech or POS with the tags..................."
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger_eng')
pos_list = pos_tag(vocab_no_punct)
print(pos_list)
def pos_tagging(text):
    words = word_tokenize(text)
    tagged_words = nltk.pos_tag(words)
    return tagged_words
text = "NLTK is a leading platform for building Python programs to work with human language data."
tagged_text = pos_tagging(text)
print(tagged_text)
OUTPUT :-
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'),
('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'),
('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]
DEFINITION :-
In Natural Language Processing, stemming is the process of reducing a word to its root or
base form by removing prefixes or suffixes. The root of a word obtained through stemming
may not always be a valid word but is a common form used for grouping related words. For
example, the words "running," "runs," and "runner" may all be reduced to the root "run."
Stemming helps in text normalization, improving the performance of search engines, text
classification, and information retrieval.
CODE :-
# Stemming is a technique used to find the root form of a word, which is devoid of any affixes (suffixes and prefixes)
from nltk.stem import PorterStemmer
porter = PorterStemmer()
print(porter.stem("studies"))  # example word (assumed); stems to "studi"
stemmed_vocab = [porter.stem(word) for word in vocab_no_punct]  # stem the punctuation-free vocabulary
print(stemmed_vocab)
OUTPUT :-
studi
['the', 'are', 'bark', 'dog', 'hous', 'loud', 'outsid', 'the']
ASSIGNMENT - 9
Qn.: Write a python program to perform lemmatization using nltk.
Answer : In Natural Language Processing, lemmatization is the process of reducing a
word to its base or dictionary form, known as the lemma, while considering the word’s
meaning and part of speech. Unlike stemming, lemmatization produces real words.
For example, “running,” “ran,” and “runs” are all reduced to the lemma “run.” It uses
vocabulary and morphological analysis to ensure the correct root is found.
Lemmatization is used in tasks like information retrieval, text mining, and machine
translation for better language understanding.
CODE :-
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in tokens])
    return lemmatized_text
text = "The cats are chasing mice and playing in the garden"
lemmatized_text = lemmatize_text(text)
print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)
OUTPUT :-
Original Text: The cats are chasing mice and playing in the garden
Lemmatized Text: The cat are chasing mouse and playing in the garden
CODE :-
# Lemmatization removes inflection and reduces the word to its base form
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("going", pos='v'))  # example word (assumed); lemmatizes to "go"
print(','.join(lemmatizer.lemmatize(w, pos='v') for w in stemmed_vocab) + ',')   # lemmatize the stemmed words as verbs
print(','.join(lemmatizer.lemmatize(w, pos='v') for w in vocab_no_punct) + ',')  # lemmatize the original words as verbs
OUTPUT :-
go
the,be,bark,dog,hous,loud,outsid,the,
The,be,bark,dog,house,loudly,outside,the,
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
ASSIGNMENT - 10
Qn.: Write a python program to perform Parts of Speech tagging
using nltk.
Answer : In Natural Language Processing (NLP), POS tagging (Part-of-Speech tagging) is
the process of assigning each word in a sentence its correct grammatical category, such as
noun, verb, adjective, adverb, etc. This helps the machine understand the structure and
meaning of a sentence. For example, in the sentence "The cat sleeps," "The" is tagged as a
determiner, "cat" as a noun, and "sleeps" as a verb. POS tagging is essential for tasks like
parsing, machine translation, and information extraction.
CODE :-
import nltk
from nltk.tokenize import word_tokenize
# Download NLTK tokenizer and POS tagging models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Define the POS tagging function
def pos_tagging(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Perform POS tagging
    tagged_words = nltk.pos_tag(words)
    return tagged_words
# Example text
text = "NLTK is a leading platform for building Python programs to work with human language data."
# Perform POS tagging
tagged_text = pos_tagging(text)
# Print POS tagged text
print(tagged_text)
OUTPUT :-
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'),
('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'),
('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
ASSIGNMENT - 11
CODE :-
! pip install -q spacy stanza
! python -m spacy download en_core_web_sm
import spacy
import stanza
stanza.download("en")
stanza_nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")  # Stanza lemmatization pipeline
spacy_nlp = spacy.load("en_core_web_sm")  # spaCy pipeline (includes a lemmatizer)
def lemmatize_stanza(text):
    doc = stanza_nlp(text)
    return [word.lemma for sentence in doc.sentences for word in sentence.words]
def lemmatize_spacy(text):
    return [token.lemma_ for token in spacy_nlp(text)]
text = "The children are playing in the gardens and eating sandwiches."
print("Original Text:", text)
print("spaCy Lemmatization:", lemmatize_spacy(text))
print("Stanza Lemmatization:", lemmatize_stanza(text))
OUTPUT :-
Original Text: The children are playing in the gardens and eating sandwiches.
spaCy Lemmatization: ['the', 'child', 'be', 'play', 'in', 'the', 'garden', 'and', 'eat', 'sandwich', '.']
Stanza Lemmatization: ['the', 'child', 'be', 'play', 'in', 'the', 'garden', 'and', 'eat', 'sandwich', '.']
ASSIGNMENT - 12
CODE :-
# Install and import required libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text
text = "The quick brown fox jumps over the lazy dog."
# Step 1: Tokenize and POS tag
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
# Step 2: Define a chunk grammar (noun phrase: NP)
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN.*>} # NP: optional determiner + adjectives + noun
"""
# Step 3: Create a chunk parser
chunk_parser = RegexpParser(chunk_grammar)
# Step 4: Parse the tagged sentence
chunked_output = chunk_parser.parse(tagged)
# Step 5: Display the chunk tree
# chunked_output.draw()  # This will open a tree viewer (works in local Python, not in Colab)
# Alternative text output (for Colab)
print(chunked_output)
OUTPUT :-
(S
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
jumps/VBZ
over/IN
(NP the/DT lazy/JJ dog/NN)
./.)
ASSIGNMENT - 13
CODE :-
# Step 1: Install and import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
nltk.download('punkt') # Download required resources
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('maxent_ne_chunker_tab')
def ner(text):  # Define the NER function
    tokens = word_tokenize(text)  # Tokenize the sentence
    tagged_words = pos_tag(tokens)  # POS tagging
    named_entities = ne_chunk(tagged_words)  # Named Entity Recognition
    return named_entities
text = "Apple is a company based in California, United States. Steve Jobs was one of its founders."
# Perform NER
named_entities = ner(text)
print(named_entities) # Print the result
OUTPUT :-
(S
(GPE Apple/NNP)
is/VBZ
a/DT
company/NN
based/VBN
in/IN
(GPE California/NNP)
,/,
(GPE United/NNP States/NNPS)
./.
(PERSON Steve/NNP Jobs/NNP)
was/VBD
one/CD
of/IN
its/PRP$
founders/NNS
./.)
ASSIGNMENT - 14
CODE :-
#{<.*>+}: Chunk everything.
#}<VB.*>{: Chink (remove) verbs from the chunk.
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
# Download required NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample sentence
text = "The quick brown fox jumps over the lazy dog."
# Tokenize and POS tag the sentence
tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)
grammar = r"""
NP: {<.*>+} # Chunk everything
}<VB.*>{ # Chink (remove) any verb from chunks
"""
# Create a chunk parser
chunk_parser = RegexpParser(grammar)
# Parse the sentence
chunked = chunk_parser.parse(tagged_tokens)
# Display the output
print(chunked)
# Optional: draw the chunk tree (only works in local Python, not Colab)
# chunked.draw()
OUTPUT :-
(S
(NP The/DT quick/JJ brown/NN fox/NN)
jumps/VBZ
(NP over/IN the/DT lazy/JJ dog/NN ./.))
ASSIGNMENT - 15
Qn.: Write a python program to find Term Frequency and
Inverse Document Frequency (TF-IDF).
Answer : In Natural Language Processing, Term Frequency (TF) and Inverse Document Frequency
(IDF) are numerical measures used to evaluate how important a word is in a document and across a
collection of documents.
Term Frequency (TF) measures how often a term appears in a document. It is calculated as the
number of times a word appears divided by the total number of words in that document. It shows
the local importance of a word.
Inverse Document Frequency (IDF) measures how unique or rare a term is across all documents. It is
calculated using the total number of documents divided by the number of documents containing the
word, and then taking the logarithm of that value. Words that appear in many documents have a
lower IDF, meaning they are less important.
The combination TF-IDF helps identify words that are frequent in a specific document but rare
across the collection, making them useful for tasks like document classification, search engines, and
keyword extraction.
CODE :-
# Using TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
documents = [
"Data science is an interdisciplinary field.",
"Machine learning is a part of data science.",
"Data science involves statistics and machine learning."
]
# Create the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)
# Get the feature names (terms)
terms = vectorizer.get_feature_names_out()
# Display TF-IDF matrix
for i, doc in enumerate(tfidf_matrix.toarray()):
    print(f"\nDocument {i+1} TF-IDF:")
    for term, score in zip(terms, doc):
        if score > 0:
            print(f" {term}: {score:.4f}")
OUTPUT :-
Document 1 TF-IDF:
an: 0.4836
data: 0.2856
field: 0.4836
interdisciplinary: 0.4836
is: 0.3678
science: 0.2856
Document 2 TF-IDF:
data: 0.2805
is: 0.3612
learning: 0.3612
machine: 0.3612
of: 0.4750
part: 0.4750
science: 0.2805
Document 3 TF-IDF:
and: 0.4539
data: 0.2681
involves: 0.4539
learning: 0.3452
machine: 0.3452
science: 0.2681
statistics: 0.4539
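To make the TF and IDF formulas above concrete, the sketch below computes them by hand for one term over the same three documents, using the plain textbook definitions (TF = term count / total words, IDF = log(N / document frequency)). Note that scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization, so its scores above differ from these raw values.
CODE :-
import math
documents = [
    "Data science is an interdisciplinary field.",
    "Machine learning is a part of data science.",
    "Data science involves statistics and machine learning."
]
term = "science"
# Very simple tokenization (lowercase, strip the final period) just for this illustration
tokenized = [doc.lower().replace('.', '').split() for doc in documents]
# Term Frequency of the term in document 1
tf = tokenized[0].count(term) / len(tokenized[0])
# Inverse Document Frequency across the whole collection
df = sum(1 for doc in tokenized if term in doc)
idf = math.log(len(tokenized) / df)
print(f"TF('{term}', document 1) = {tf:.4f}")   # 1/6 = 0.1667
print(f"IDF('{term}') = {idf:.4f}")             # log(3/3) = 0.0
print(f"TF-IDF = {tf * idf:.4f}")               # 0.0: 'science' occurs in every document, so it carries no weight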
ASSIGNMENT - 16
CODE :-
import nltk
nltk.download('punkt') # Download the Punkt tokenizer models
from nltk.util import ngrams
# Sample text
samplText = 'this is a very good book to study'
# Loop over ngram sizes from 1 to 3
for i in range(1, 4):
    # Generate ngrams
    NGRAMS = ngrams(sequence=nltk.word_tokenize(samplText), n=i)
    # Print each ngram
    for grams in NGRAMS:
        print(grams)
OUTPUT :-
('this',)
('is',)
('a',)
('very',)
('good',)
('book',)
('to',)
('study',)
('this', 'is')
('is', 'a')
('a', 'very')
('very', 'good')
('good', 'book')
('book', 'to')
('to', 'study')
('this', 'is', 'a')
('is', 'a', 'very')
('a', 'very', 'good')
('very', 'good', 'book')
('good', 'book', 'to')
('book', 'to', 'study')
ASSIGNMENT - 17
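The output below comes from NLTK's VADER sentiment analyzer (SentimentIntensityAnalyzer), which assigns negative, neutral, positive and compound scores to a piece of text. The sketch below is a minimal example that produces output in this form; the input sentence is an assumed example, so the exact scores depend on the text analyzed.
CODE :-
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
# Assumed example sentence (the text that produced the recorded output is not shown)
text = "I really enjoyed this NLP lab; the exercises were fun and very helpful."
analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores(text)
print("Sentiment Scores:", scores)
# Classify using the conventional compound-score thresholds
if scores['compound'] >= 0.05:
    sentiment = "Positive"
elif scores['compound'] <= -0.05:
    sentiment = "Negative"
else:
    sentiment = "Neutral"
print("The sentiment of the text is:", sentiment)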
OUTPUT :-
Sentiment Scores: {'neg': 0.0, 'neu': 0.458, 'pos': 0.542, 'compound': 0.8516}
The sentiment of the text is: Positive