NLP Lab Work

The document contains multiple assignments related to Natural Language Processing (NLP) using Python libraries such as NLTK, Stanza, and spaCy. Each assignment includes a question, a brief explanation, Python code for implementation, and output examples. Topics covered include tokenization, stopword removal, stemming, and part-of-speech tagging.


ASSIGNMENT - 1

Qn.: Write a python program to perform tokenization by word and sentence using nltk.
Answer : Tokenization is the process of breaking down a text into smaller units called
tokens.

CODE :-
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

def tokenize_sentences(text):
    sentences = sent_tokenize(text)
    return sentences

text = ("Engineering offers diverse career paths, from industrial to research. "
        "Consider internships and networking to explore different specializations. "
        "Focus on developing both technical and soft skills for a successful future in the field.")

sentences = tokenize_sentences(text)
for i, sentence in enumerate(sentences):
    print(f"Sentence {i+1}: {sentence}")

OUTPUT :-
Sentence 1: Engineering offers diverse career paths, from industrial to research.

Sentence 2: Consider internships and networking to explore different specializations.

Sentence 3: Focus on developing both technical and soft skills for a successful future in the
field.
ASSIGNMENT - 2
Qn.: Write a python program to perform word tokenization.
Answer: In Natural Language Processing, tokenization refers to the process of breaking
down a large piece of text into smaller units called tokens, such as words, phrases, or
sentences. These tokens are the basic building blocks for further text analysis.

For example, the sentence "I love NLP." can be tokenized into ["I", "love", "NLP", "."].

Tokenization is the first and essential step in many NLP tasks like text preprocessing, part-
of-speech tagging, sentiment analysis, and machine translation, as it helps convert raw text
into a structured format that machines can understand.
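As a quick check of the example above, the following minimal one-liner (assuming the NLTK punkt
data has been downloaded, as in the code below) reproduces that token list:

from nltk.tokenize import word_tokenize

# The example sentence from above splits into four tokens
print(word_tokenize("I love NLP."))  # ['I', 'love', 'NLP', '.']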

CODE :-
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

def tokenize_words(text):
    words = word_tokenize(text)
    return words

text = "NLTK is a leading platform for building Python programs to work with human language data."

words = tokenize_words(text)
print(words)

OUTPUT :-

['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with',
'human', 'language', 'data', '.']
ASSIGNMENT - 3
Qn.: Write a python program to eliminate stopwords using nltk.
Answer: Stopwords are frequently used, common words that carry little semantic information and are
usually removed from text before processing, so that the NLP algorithm can focus on the words that
carry important meaning, which improves the quality of the analysis. Examples include words like
"the," "is," "in," "and," and "a."

These words appear frequently in all types of text but do not contribute significantly to the
overall meaning or context. Removing stopwords helps reduce noise in the data, making
NLP tasks like text classification, search, and sentiment analysis more efficient and focused
on meaningful content.

CODE :-

import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "This is an example sentence demonstrating the removal of stopwords using NLTK."

words = word_tokenize(sentence)
stopword = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stopword]
filtered_sentence = ' '.join(filtered_words)

print("Original Sentence:\n", sentence)
print("\nFiltered Sentence:\n", filtered_sentence)


OUTPUT :-

Original Sentence :-

"This is an example sentence demonstrating the removal of stopwords using NLTK"

Filtered Sentence :-

example sentence demonstrating removal stopwords using NLTK.

CODE :-

def remove_stopwords(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Get English stopwords
    english_stopwords = set(stopwords.words('english'))
    # Keep only the words that are not stopwords
    return [word for word in words if word.lower() not in english_stopwords]

# Inspect the full set of English stopwords
english_stopwords = set(stopwords.words('english'))
print(english_stopwords)

OUTPUT:
{'didn', 'yours', 'am', 'this', 'for', 'ourselves', 'were', 'won', 'down', 'you', 'there', 'here', 'to', "it'd",
"wouldn't", 'on', 'i', "doesn't", 'myself', 'only', "that'll", 'further', 'own', "they're", 'after',
"should've", 'who', 'yourself', 'yourselves', 's', 'then', 'ma', 'theirs', "you'd", "hasn't", "couldn't",
'where', 'against', "we'll", 'the', 'or', 'aren', 'herself', 't', "it'll", 'too', 'all', 'couldn', "mustn't", 'don',
'doing', 'y', "isn't", 'we', 'whom', 'hasn', 'no', 'isn', 'itself', 'not', 'have', 'of', 'in', 'as', 'ours', 'with',
'my', 'her', "we'd", "didn't", 'same', 'during', 'these', 'doesn', 'been', 'while', "i'm", 'him',
'between', 'having', 'why', 'will', "wasn't", 'and', 'they', 'when', 'ain', 'how', 'those', "haven't",
"they'd", 'once', 'both', 'above', 'out', 'o', "hadn't", 've', 'has', 'weren', 'from', 'at', 'just', 'any',
'which', "we're", 'by', "it's", 'than', 'do', 'nor', 'off', 'being', 'below', 'she', "he'll", 'should', "won't",
'be', 'wasn', "you'll", 'because', "i'll", "weren't", 'some', 'their', 'd', 'hers', 'up', 'each', "he's",
"needn't", 'are', 'had', 'his', "mightn't", 'is', 'a', "she's", 'about', 'what', 'over', 'shan', 'until',
'more', 'll', 'such', "he'd", 'most', 'now', 'through', 'themselves', 'does', 'he', 'its', "she'd", 'under',
'an', "shouldn't", "we've", "you're", 'but', 'your', 'other', 'before', 'into', 'our', 'it', 'haven', 'did',
"they'll", "don't", 'so', "aren't", 'wouldn', 'shouldn', 'can', 'hadn', 'very', 'me', 'mightn', 'that',
"shan't", "i've", 'again', "she'll", 'm', 'them', "they've", 'mustn', "i'd", "you've", 'was', 're', 'few',
'himself', 'if', 'needn'}
CODE :-
import pandas as pd

# Convert the stopword set to a DataFrame (convert to a list first, since a set cannot be
# passed to the DataFrame constructor directly)
df = pd.DataFrame(list(english_stopwords), columns=['English Stopwords'])

# Display the first 30 stopwords as a sample
print(df.head(30))

OUTPUT :-
English Stopwords
0 didn
1 yours
2 am
3 this
4 for
5 ourselves
6 were
7 won
8 down
9 you
10 there
11 here
12 to
13 it'd
14 wouldn't
15 on
16 i
17 doesn't
18 myself
19 only
20 that'll
21 further
22 own
23 they're
24 after
25 should've
26 who
27 yourself
28 yourselves
29 s
ASSIGNMENT - 4
Qn.: Write a python program to perform stemming using nltk.
Answer: In Natural Language Processing, stemming is a text processing technique that removes
prefixes and suffixes from a word to obtain its root or base form. It is a rule-based approach used
to reduce the dimensionality of text data, simplifying words and improving performance in some NLP
tasks. The resulting stem may not always be a valid word, but it helps group similar words with the
same meaning.
For example, the words "playing," "played," and "plays" can all be reduced to the stem "play".
Stemming is useful in tasks like information retrieval, search engines, and text mining, as it
helps match different forms of the same word and improves processing efficiency.
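As a quick check of that example, the following minimal sketch (assuming NLTK is installed, as in
the code below) applies the Porter stemmer to the three forms directly:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Each inflected form reduces to the same stem, "play"
for word in ["playing", "played", "plays"]:
    print(word, "->", stemmer.stem(word))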
CODE :-
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
def stem_text(text):
    porter_stemmer = PorterStemmer()
    words = word_tokenize(text)
    stemmed_words = [porter_stemmer.stem(word) for word in words]
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text

text = "NLTK is a leading platform for building Python programs to work with human language data."
stemmed_text = stem_text(text)
print(stemmed_text)
OUTPUT :-
nltk is a lead platform for build python program to work with human languag data.
ASSIGNMENT - 5
Qn.: Write a python program to perform tokenization by word
and sentence using Stanza.
Answer : Tokenization refers to the process of breaking down a text into smaller units called
tokens. Stanza is a collection of accurate, efficient tools for the linguistic analysis of many
human languages.

CODE :-

!pip install stanza


import stanza
stanza.download('en')
nlp = stanza.Pipeline('en')

text = "Stanza is developed by Stanford NLP group. It's easy to use and powerful for NLP tasks!"
doc = nlp(text)

print("Sentence Tokenization:\n")
for i, sentence in enumerate(doc.sentences):
    print(f"Sentence {i+1}: {' '.join([word.text for word in sentence.words])}")

print("\nWord Tokenization:\n")
for i, sentence in enumerate(doc.sentences):
    print(f"Words in sentence {i+1}:")
    for word in sentence.words:
        print(f"- {word.text}")
OUTPUT :-

Sentence Tokenization:

Sentence 1: Stanza is developed by Stanford NLP group.

Sentence 2: It's easy to use and powerful for NLP tasks!

Word Tokenization:

Words in sentence 1:
- Stanza
- is
- developed
- by
- Stanford
- NLP
- group
- .
Words in sentence 2:
- It
- 's
- easy
- to
- use
- and
- powerful
- for
- NLP
- tasks
- !
ASSIGNMENT - 6
Qn.: Write a python program for word tokenization and sentence
segmentation using spaCy.
Answer : spaCy is an open-source software library for advanced Natural Language Processing,
written in the programming languages Python and Cython. The library is published under the MIT
license. It offers various capabilities like tokenization, POS tagging, named entity recognition
(NER), dependency parsing and more, using pre-trained language models.
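Beyond the tokenization and sentence segmentation shown below, the same pipeline object exposes
those other capabilities. Here is a minimal sketch on a made-up sentence (assuming the
en_core_web_sm model is installed, as in the code that follows):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in California.")

for token in doc:
    # Coarse part-of-speech tag and dependency relation for each token
    print(token.text, token.pos_, token.dep_)

for ent in doc.ents:
    # Named entities detected by the pre-trained model
    print(ent.text, ent.label_)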

CODE :-

!pip install spacy
!python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")
text = "spaCy is an open-source library for Advanced Natural Language Processing in Python. It's fast and easy to use."

doc = nlp(text)

print("Sentence Segmentation:\n")
for i, sent in enumerate(doc.sents):
    print(f"Sentence {i+1}: {sent.text}\n")

print("Word Tokenization:\n")
for i, token in enumerate(doc):
    print(f"Token {i+1}: {token.text}")

OUTPUT :-
Sentence Segmentation:

Sentence 1: spaCy is an open-source library for Advanced Natural Language Processing in Python.

Sentence 2: It's fast and easy to use.


Word Tokenization :-

Token 1: spaCy
Token 2: is
Token 3: an
Token 4: open
Token 5: -
Token 6: source
Token 7: library
Token 8: for
Token 9: Advanced
Token 10: Natural
Token 11: Language
Token 12: Processing
Token 13: in
Token 14: Python
Token 15: .
Token 16: It
Token 17: 's
Token 18: fast
Token 19: and
Token 20: easy
Token 21: to
Token 22: use
Token 23: .
ASSIGNMENT - 7
Qn.: Write a python program to find all the stopwords in the
given corpus using spaCy.
Answer : spaCy is an open-source software library for advanced natural language processing tasks,
written in the programming languages Python and Cython. Leveraging pre-trained language models, it
offers capabilities like tokenization, POS tagging, named entity recognition (NER), dependency
parsing and more.

CODE :-
!pip install spacy
!python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Natural Language Processing is a field of artificial intelligence that focuses on the "
        "interaction between humans and computers using natural language. The goal is to enable "
        "computers to understand, interpret and generate human language in a valuable way.")

doc = nlp(text)
print("Stopwords found in the corpus:\n")
stp = [token.text.lower() for token in doc if token.is_stop]
for word in sorted(stp):
    print(word)
OUTPUT :-
Stopwords found in the corpus:
a
and
between
in
is
of
on
that
the
using
ASSIGNMENT - 8
Qn.: Write a python program to find vocabulary, punctuation,
POS tags and perform root word stemming using nltk.
Answer : In Natural Language Processing (NLP), the hierarchy of text refers to the
structured levels at which language is processed and analyzed. This hierarchy begins with
the document level, which may consist of multiple paragraphs. Each paragraph contains
several sentences, and each sentence is made up of phrases. Phrases are composed of
words, which in turn are built from individual characters. Understanding this layered
structure allows NLP systems to break down and interpret language effectively, enabling
tasks such as tokenization, parsing, and semantic analysis.
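To make this hierarchy concrete, here is a minimal sketch (the two-sentence document is made up for
illustration, and it assumes the NLTK punkt data has been downloaded, as in the code below) that
walks from the document level down to sentences, words, and characters:

from nltk.tokenize import sent_tokenize, word_tokenize

document = "The dogs are barking. The house is quiet."          # document level
for sentence in sent_tokenize(document):                        # sentence level
    words = word_tokenize(sentence)                             # word level
    print(sentence, "->", words)
    print("Characters of the first word:", list(words[0]))      # character level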

CODE :-
#".................Hierarchy of Text..................."

from nltk import *


sent = "The dogs are barking loudly outside the house."
import nltk
nltk.download('punkt_tab')
print(nltk.word_tokenize(sent))

OUTPUT :-
['The', 'dogs', 'are', 'barking', 'loudly', 'outside', 'the', 'house', '.']
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!

DEFINITION :-
In Natural Language Processing (NLP), vocabulary refers to the set of unique words or
tokens present in a given text or corpus. It represents all the distinct words that a model can
recognize or process. The size of the vocabulary depends on the dataset and affects tasks
like tokenization, language modeling, and text classification. A well-defined vocabulary helps
machines understand and work with human language more effectively.
CODE :-
#".................Vocabulary..................."

tokens = nltk.word_tokenize(sent)
vocab = sorted(set(tokens))
print(vocab)  # Print the sorted vocabulary

OUTPUT :-
['.', 'The', 'are', 'barking', 'dogs', 'house', 'loudly', 'outside', 'the']

DEFINITION :-
In Natural Language Processing (NLP), punctuation refers to the symbols used in text (such
as commas, periods, question marks, etc.) that help structure and clarify meaning.
Punctuation marks are important for understanding sentence boundaries, pauses, emphasis,
and sentence types. In NLP tasks like sentence segmentation, sentiment analysis, or
machine translation, punctuation helps in accurately interpreting and generating human-like
language.

CODE :-
#".................Punctuation..................."

from string import punctuation

vocab_no_punct = []
for i in vocab:
    if i not in punctuation:
        vocab_no_punct.append(i)
print(vocab_no_punct)

OUTPUT :-
['The', 'are', 'barking', 'dogs', 'house', 'loudly', 'outside', 'the']
DEFINITION :-

In Natural Language Processing (NLP), POS tagging (Part-of-Speech tagging) is the
process of assigning each word in a sentence its correct grammatical category, such as
noun, verb, adjective, adverb, etc. This helps the machine understand the structure and
meaning of a sentence. For example, in the sentence "The cat sleeps," "The" is tagged as a
determiner, "cat" as a noun, and "sleeps" as a verb. POS tagging is essential for tasks like
parsing, machine translation, and information extraction.

CODE :-
#".................Part of Speech or POS with the tags..................."

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

pos_list = pos_tag(vocab_no_punct)
print(pos_list)

def pos_tagging(text):
    words = word_tokenize(text)
    tagged_words = nltk.pos_tag(words)
    return tagged_words

text = "NLTK is a leading platform for building Python programs to work with human language data."
tagged_text = pos_tagging(text)
print(tagged_text)

OUTPUT :-
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'),
('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'),
('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]
DEFINITION :-

In Natural Language Processing, stemming is the process of reducing a word to its root or
base form by removing prefixes or suffixes. The root of a word obtained through stemming
may not always be a valid word but is a common form used for grouping related words. For
example, the words "running," "runs," and "runner" may all be reduced to the root "run."
Stemming helps in text normalization, improving the performance of search engines, text
classification, and information retrieval.

CODE :-

#".................Root of a word stemming..................."

#Stemming is a technique used to find the root form of a word. In the root form, a word is
devoid of any affixes (suffixes and prefixes)

from nltk.stem.snowball import SnowballStemmer


stemObj = SnowballStemmer("english")
print(stemObj.stem("Studying")) #Prints 'studi'
stemmed_vocab=[]
stemObj = SnowballStemmer("english")
for i in vocab_no_punct:
stemmed_vocab.append(stemObj.stem(i))
print(stemmed_vocab)

OUTPUT :-
studi
['the', 'are', 'bark', 'dog', 'hous', 'loud', 'outsid', 'the']
ASSIGNMENT - 9
Qn.: Write a python program to perform lemmatization using nltk.
Answer : In Natural Language Processing, lemmatization is the process of reducing a
word to its base or dictionary form, known as the lemma, while considering the word’s
meaning and part of speech. Unlike stemming, lemmatization produces real words.
For example, “running,” “ran,” and “runs” are all reduced to the lemma “run.” It uses
vocabulary and morphological analysis to ensure the correct root is found.
Lemmatization is used in tasks like information retrieval, text mining, and machine
translation for better language understanding.
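To mirror that example, here is a minimal sketch (assuming the NLTK wordnet data has been
downloaded, as in the code below) that lemmatizes the three verb forms with the part of speech set
to verb:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# With pos='v' each verb form maps to the lemma "run"
for word in ["running", "ran", "runs"]:
    print(word, "->", lemmatizer.lemmatize(word, pos='v'))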

CODE :-
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in tokens])
    return lemmatized_text

text = "The cats are chasing mice and playing in the garden"
lemmatized_text = lemmatize_text(text)
print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)
OUTPUT :-

Original Text: The cats are chasing mice and playing in the garden
Lemmatized Text: The cat are chasing mouse and playing in the garden

CODE :-
# Lemmatization removes inflection and reduces the word to its base form

from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')
lemmaObj = WordNetLemmatizer()
print(lemmaObj.lemmatize("went", pos='v'))

for i in stemmed_vocab:
    print(lemmaObj.lemmatize(i, pos='v'), end=',')
print()
for i in vocab_no_punct:
    print(lemmaObj.lemmatize(i, pos='v'), end=',')

OUTPUT :-

go
the,be,bark,dog,hous,loud,outsid,the,
The,be,bark,dog,house,loudly,outside,the,
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
ASSIGNMENT - 10
Qn.: Write a python program to perform Parts of Speech tagging
using nltk.
Answer : In Natural Language Processing (NLP), POS tagging (Part-of-Speech tagging) is
the process of assigning each word in a sentence its correct grammatical category, such as
noun, verb, adjective, adverb, etc. This helps the machine understand the structure and
meaning of a sentence. For example, in the sentence "The cat sleeps," "The" is tagged as a
determiner, "cat" as a noun, and "sleeps" as a verb. POS tagging is essential for tasks like
parsing, machine translation, and information extraction.

CODE :-
import nltk
from nltk.tokenize import word_tokenize
# Download NLTK tokenizer and POS tagging models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Define the POS tagging function
def pos_tagging(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Perform POS tagging
    tagged_words = nltk.pos_tag(words)
    return tagged_words

# Example text
text = "NLTK is a leading platform for building Python programs to work with human language data."

# Perform POS tagging
tagged_text = pos_tagging(text)

# Print POS tagged text
print(tagged_text)
OUTPUT :-
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'),
('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'),
('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
ASSIGNMENT - 11

Qn.: Write a python program to perform lemmatization using spaCy and Stanza.
Answer : spaCy and Stanza are both popular Natural Language Processing (NLP) libraries used for
processing and analyzing human-language text. They serve similar purposes but differ in their
strengths, design, and performance.
In Natural Language Processing, lemmatization is the process of reducing a word to its base
or dictionary form, known as the lemma, while considering the word’s meaning and part of
speech. Unlike stemming, lemmatization produces real words. For example, “running,” “ran,”
and “runs” are all reduced to the lemma “run.” It uses vocabulary and morphological analysis
to ensure the correct root is found. Lemmatization is used in tasks like information retrieval,
text mining, and machine translation for better language understanding.

CODE :-
!pip install -q spacy stanza
!python -m spacy download en_core_web_sm

import spacy
import stanza

# spaCy lemmatization
spacy_nlp = spacy.load("en_core_web_sm")

def lemmatize_spacy(text):
    doc = spacy_nlp(text)
    return [token.lemma_ for token in doc]

# Stanza lemmatization
stanza.download("en")
stanza_nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")

def lemmatize_stanza(text):
    doc = stanza_nlp(text)
    return [word.lemma for sentence in doc.sentences for word in sentence.words]

text = "He bettered his performance in the final"

print("Original text:", text)
print("spaCy lemmatization:", lemmatize_spacy(text))
print("Stanza lemmatization:", lemmatize_stanza(text))

OUTPUT :-

Original text: He bettered his performance in the final
spaCy lemmatization: ['he', 'better', 'his', 'performance', 'in', 'the', 'final']
Stanza lemmatization: ['he', 'better', 'his', 'performance', 'in', 'the', 'final']

Original Text: The children are playing in the gardens and eating sandwiches.
spaCy Lemmatization: ['the', 'child', 'be', 'play', 'in', 'the', 'garden', 'and', 'eat', 'sandwich', '.']
Stanza Lemmatization: ['the', 'child', 'be', 'play', 'in', 'the', 'garden', 'and', 'eat', 'sandwich', '.']
ASSIGNMENT - 12

Qn.: Write a python program for chunking using nltk.


Answer : Chunking is the process of grouping words together based on their part-of-speech
tags (like nouns, verbs, adjectives) to form meaningful phrases.
In Natural Language Processing, chunking is the process of grouping related words into
meaningful phrases or "chunks" based on their part-of-speech tags. It helps identify
structures like noun phrases (e.g., "the black cat") or verb phrases (e.g., "is running fast").
Chunking does not analyze full sentence structure but focuses on partial parsing to extract
useful segments of information. It is commonly used in information extraction, question
answering, and shallow parsing.

CODE :-
# Install and import required libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text
text = "The quick brown fox jumps over the lazy dog."
# Step 1: Tokenize and POS tag
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
# Step 2: Define a chunk grammar (noun phrase: NP)
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN.*>} # NP: optional determiner + adjectives + noun
"""
# Step 3: Create a chunk parser
chunk_parser = RegexpParser(chunk_grammar)
# Step 4: Parse the tagged sentence
chunked_output = chunk_parser.parse(tagged)
# Step 5: Display the chunk tree
# chunked_output.draw()  # This will open a tree viewer (works in local Python, not in Colab)

# Alternative text output (for Colab)
print(chunked_output)

OUTPUT :-

(S
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
jumps/VBZ
over/IN
(NP the/DT lazy/JJ dog/NN)
./.)
ASSIGNMENT - 13

Qn.: Write a python program to perform Named Entity Recognition using nltk.
Answer : In Natural Language Processing, Named Entity Recognition (NER) is the task of
identifying and classifying named entities in text into predefined categories such as person
names, organizations, locations, dates, and more. For example, in the sentence "Apple Inc.
was founded by Steve Jobs in California," NER would label "Apple Inc." as an organization,
"Steve Jobs" as a person, and "California" as a location. NER is widely used in applications
like information extraction, question answering, and summarization.

CODE :-
# Step 1: Install and import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
nltk.download('punkt') # Download required resources
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('maxent_ne_chunker_tab')
def ner(text):  # Define the NER function
    tokens = word_tokenize(text)  # Tokenize the sentence
    tagged_words = pos_tag(tokens)  # POS tagging
    named_entities = ne_chunk(tagged_words)  # Named Entity Recognition
    return named_entities

text = "Apple is a company based in California, United States. Steve Jobs was one of its founders."

# Perform NER
named_entities = ner(text)
print(named_entities)  # Print the result

OUTPUT :-

(S
(GPE Apple/NNP)
is/VBZ
a/DT
company/NN
based/VBN
in/IN
(GPE California/NNP)
,/,
(GPE United/NNP States/NNPS)
./.
(PERSON Steve/NNP Jobs/NNP)
was/VBD
one/CD
of/IN
its/PRP$
founders/NNS
./.)
ASSIGNMENT - 14

Qn.: Write a python program for chinking using nltk.


Answer : In Natural Language Processing, chinking is the process of removing specific
words or sequences from previously formed chunks. It is the opposite of chunking. While
chunking groups words into meaningful phrases, chinking excludes certain patterns (like
verbs or prepositions) from these chunks based on part-of-speech tags. For example, in a
noun phrase chunk, chinking can be used to remove verbs if they appear inside the chunk.
Chinking is useful in refining chunks for more accurate phrase detection during shallow
parsing or information extraction.

CODE :-
#{<.*>+}: Chunk everything.
#}<VB.*>{: Chink (remove) verbs from the chunk.

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
# Download required NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample sentence
text = "The quick brown fox jumps over the lazy dog."
# Tokenize and POS tag the sentence
tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)

# Define chunk grammar with chinking:
# First, chunk everything as NP (noun phrase)
# Then, remove verbs (VB*) from those chunks using chinking

grammar = r"""
NP: {<.*>+} # Chunk everything
}<VB.*>{ # Chink (remove) any verb from chunks
"""
# Create a chunk parser
chunk_parser = RegexpParser(grammar)
# Parse the sentence
chunked = chunk_parser.parse(tagged_tokens)
# Display the output
print(chunked)
# Optional: draw the chunk tree (only works in local Python, not Colab)
# chunked.draw()

OUTPUT :-

(S
(NP The/DT quick/JJ brown/NN fox/NN)
jumps/VBZ
(NP over/IN the/DT lazy/JJ dog/NN ./.))
ASSIGNMENT - 15
Qn.: Write a python program to find Term Frequency and
Inverse Document Frequency (TF-IDF).
Answer : In Natural Language Processing, Term Frequency (TF) and Inverse Document Frequency
(IDF) are numerical measures used to evaluate how important a word is in a document and across a
collection of documents.
Term Frequency (TF) measures how often a term appears in a document. It is calculated as the
number of times a word appears divided by the total number of words in that document. It shows
the local importance of a word.
Inverse Document Frequency (IDF) measures how unique or rare a term is across all documents. It is
calculated using the total number of documents divided by the number of documents containing the
word, and then taking the logarithm of that value. Words that appear in many documents have a
lower IDF, meaning they are less important.
The combination TF-IDF helps identify words that are frequent in a specific document but rare
across the collection, making them useful for tasks like document classification, search engines, and
keyword extraction.
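Before handing this to scikit-learn below, the raw definitions can be computed by hand. The
following minimal sketch (plain Python; the two-document toy corpus is made up for illustration)
applies TF = term count / total words and IDF = log(total documents / documents containing the term):

import math

docs = [
    "data science is fun".split(),
    "machine learning is part of data science".split(),
]

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log(number of documents / documents containing the term)
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / containing)

for term in ["data", "fun"]:
    for i, d in enumerate(docs):
        score = tf(term, d) * idf(term, docs)
        print(f"{term!r} in document {i+1}: TF={tf(term, d):.3f}, TF-IDF={score:.3f}")

Because "data" occurs in both documents its IDF is log(1) = 0, so its TF-IDF is 0, while "fun"
scores higher in the one document where it appears. Note that scikit-learn's TfidfVectorizer, used
below, applies a smoothed IDF and L2-normalizes each document vector, so its numbers differ from
this raw formula.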

CODE :-
# Using TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
documents = [
    "Data science is an interdisciplinary field.",
    "Machine learning is a part of data science.",
    "Data science involves statistics and machine learning."
]
# Create the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)
# Get the feature names (terms)
terms = vectorizer.get_feature_names_out()
# Display TF-IDF matrix
for i, doc in enumerate(tfidf_matrix.toarray()):
    print(f"\nDocument {i+1} TF-IDF:")
    for term, score in zip(terms, doc):
        if score > 0:
            print(f" {term}: {score:.4f}")

OUTPUT :-

Document 1 TF-IDF:
an: 0.4836
data: 0.2856
field: 0.4836
interdisciplinary: 0.4836
is: 0.3678
science: 0.2856
Document 2 TF-IDF:
data: 0.2805
is: 0.3612
learning: 0.3612
machine: 0.3612
of: 0.4750
part: 0.4750
science: 0.2805
Document 3 TF-IDF:
and: 0.4539
data: 0.2681
involves: 0.4539
learning: 0.3452
machine: 0.3452
science: 0.2681
statistics: 0.4539
ASSIGNMENT - 16

Qn.: Write a python program to generate unigrams, bigrams, and trigrams using nltk.
Answer : In Natural Language Processing, unigrams, bigrams, and trigrams are types of
n-grams, which are continuous sequences of n items (usually words) from a given text.

● Unigram: A single word.
  Example: “The cat sleeps” → [“The”, “cat”, “sleeps”]
  Used in basic text analysis and word frequency models.

● Bigram: A sequence of two consecutive words.
  Example: “The cat sleeps” → [“The cat”, “cat sleeps”]
  Useful for capturing simple word relationships like "New York".

● Trigram: A sequence of three consecutive words.
  Example: “The cat sleeps” → [“The cat sleeps”]
  Captures more context and word dependency.

These models are used in language modeling, text prediction, speech recognition, and
machine translation to understand patterns and structure in language.
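The sliding-window idea behind these n-grams can also be seen without any library. This minimal
sketch (using the same toy sentence as above) builds the bigrams and trigrams with zip before the
NLTK version below does the same with ngrams():

words = "The cat sleeps".split()

# Bigrams: pair each word with the word that follows it
bigrams = list(zip(words, words[1:]))
# Trigrams: window of three consecutive words
trigrams = list(zip(words, words[1:], words[2:]))

print(bigrams)   # [('The', 'cat'), ('cat', 'sleeps')]
print(trigrams)  # [('The', 'cat', 'sleeps')]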

CODE :-
import nltk
nltk.download('punkt') # Download the Punkt tokenizer models
from nltk.util import ngrams
# Sample text
samplText = 'this is a very good book to study'
# Loop over ngram sizes from 1 to 3
for i in range(1, 4):
    # Generate ngrams
    NGRAMS = ngrams(sequence=nltk.word_tokenize(samplText), n=i)
    # Print each ngram
    for grams in NGRAMS:
        print(grams)

OUTPUT :-
('this',)
('is',)
('a',)
('very',)
('good',)
('book',)
('to',)
('study',)
('this', 'is')
('is', 'a')
('a', 'very')
('very', 'good')
('good', 'book')
('book', 'to')
('to', 'study')
('this', 'is', 'a')
('is', 'a', 'very')
('a', 'very', 'good')
('very', 'good', 'book')
('good', 'book', 'to')
('book', 'to', 'study')
ASSIGNMENT - 17

Qn.: Write the python code to perform sentiment analysis using NLP.
Answer : In Natural Language Processing, sentiment analysis is the task of identifying and
classifying the emotional tone or opinion expressed in a piece of text. It determines whether
the sentiment is positive, negative, or neutral.
For example, in the sentence “I love this movie,” the sentiment is positive, while “The service
was terrible” expresses a negative sentiment.
Sentiment analysis is widely used in social media monitoring, product reviews, customer
feedback analysis, and brand reputation management to understand public opinion and
customer satisfaction.
Download the VADER lexicon: nltk.download('vader_lexicon') downloads the VADER lexicon, which is
specifically designed for sentiment analysis. This lexicon contains a set of words with associated
sentiment scores (positive, negative, neutral).
Initialize SentimentIntensityAnalyzer: the SentimentIntensityAnalyzer is initialized to perform
sentiment analysis on the input text.
Analyze Sentiment: sia.polarity_scores(text) returns a dictionary with four scores:
'pos': proportion of the text that is positive.
'neu': proportion of the text that is neutral.
'neg': proportion of the text that is negative.
'compound': a combined score that sums up the overall sentiment, ranging from -1 (most negative)
to +1 (most positive).
Sentiment Interpretation: based on the compound score, the sentiment is categorized as positive,
negative, or neutral:
Positive: the compound score is greater than or equal to 0.05.
Negative: the compound score is less than or equal to -0.05.
Neutral: the compound score is between -0.05 and 0.05.
CODE :-

!pip install nltk

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon if not already installed
nltk.download('vader_lexicon')

# Initialize SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Sample text to analyze
sample_text = "I love this product! It's amazing and works as expected."

# Perform sentiment analysis
sentiment_scores = sia.polarity_scores(sample_text)

# Print sentiment scores
print("Sentiment Scores:", sentiment_scores)

# Determine sentiment
if sentiment_scores['compound'] >= 0.05:
    sentiment = "Positive"
elif sentiment_scores['compound'] <= -0.05:
    sentiment = "Negative"
else:
    sentiment = "Neutral"

print(f"The sentiment of the text is: {sentiment}")

# Sentiment Scores: {'neg': 0.0, 'neu': 0.297, 'pos': 0.703, 'compound': 0.8669}
# The sentiment of the text is: Positive

OUTPUT :-
Sentiment Scores: {'neg': 0.0, 'neu': 0.458, 'pos': 0.542, 'compound': 0.8516}
The sentiment of the text is: Positive
