1.) Write a program to tokenize the given text into sentences and words: "Hello everyone. Welcome to NITTE (Deemed to be University) NMAMIT. I AM studying the NLP Elective." Use at least three different methods to perform the same.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize

nltk.download('punkt_tab')

# Given text to tokenize
text = ("Hello everyone. Welcome to NITTE (Deemed to be University) "
        "NMAMIT. I AM studying the NLP Elective.")

# Method 1: Splitting the paragraph into sentences using sent_tokenize
print("\nMethod 1: Splitting sentences in the paragraph")
print(text)
print(sent_tokenize(text))

# Method 2: Splitting the text into words using word_tokenize
print("\nMethod 2: Splitting words in the sentence")
print(word_tokenize(text))

# Method 3: Tokenizing words using a regular expression with regexp_tokenize
print("\nMethod 3: Tokenizing words using regular expression")
print(regexp_tokenize(text, r"[\w]+"))
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
Method 1: Splitting sentences in the paragraph
Hello everyone. Welcome to NITTE (Deemed to be University) NMAMIT. I
AM studying the NLP Elective.
['Hello everyone.', 'Welcome to NITTE (Deemed to be University)
NMAMIT.', 'I AM studying the NLP Elective.']
Method 2: Splitting words in the sentence
['Hello', 'everyone', '.', 'Welcome', 'to', 'NITTE', '(', 'Deemed',
'to', 'be', 'University', ')', 'NMAMIT', '.', 'I', 'AM', 'studying',
'the', 'NLP', 'Elective', '.']
Method 3: Tokenizing words using regular expression
['Hello', 'everyone', 'Welcome', 'to', 'NITTE', 'Deemed', 'to', 'be',
'University', 'NMAMIT', 'I', 'AM', 'studying', 'the', 'NLP',
'Elective']
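A fourth word-level option, beyond the three methods asked for, is NLTK's TreebankWordTokenizer class. The short sketch below is an optional addition (not part of the original program) and assumes the text variable defined above.

from nltk.tokenize import TreebankWordTokenizer

# Optional Method 4 (sketch): class-based, rule-driven word tokenization
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(text))  # token list similar to the word_tokenize output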
2.) How does the recursive generate function use a PCFG
defined in a Python dictionary to select weighted
production rules and expand the starting symbol 'S' into
a complete sentence?
import random

# Define a simple PCFG grammar.
# Each key is a non-terminal symbol with a list of tuples.
# Each tuple contains a production rule (as a list of symbols) and its probability.
grammar = {
    "S": [(["NP", "VP"], 1.0)],   # Sentence -> Noun Phrase + Verb Phrase
    "NP": [
        (["Det", "N"], 0.8),      # Noun Phrase -> Determiner + Noun
        (["Name"], 0.2)           # Noun Phrase -> Proper Name
    ],
    "VP": [
        (["V", "NP"], 0.5),       # Verb Phrase -> Verb + Noun Phrase
        (["V"], 0.5)              # Verb Phrase -> Verb
    ],
    "Det": [
        (["the"], 0.5),
        (["a"], 0.5)
    ],
    "N": [
        (["cat"], 0.5),
        (["dog"], 0.5)
    ],
    "Name": [
        (["Alice"], 1.0)
    ],
    "V": [
        (["sees"], 1.0)
    ]
}
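# (Added sketch, not part of the original program.) Sanity-check that the rule
# probabilities attached to each non-terminal sum to 1, so that random.choices
# receives well-formed weights.
for lhs, productions in grammar.items():
    total = sum(prob for _, prob in productions)
    assert abs(total - 1.0) < 1e-9, f"Probabilities for '{lhs}' sum to {total}"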
def generate(symbol):
    """
    Recursively generates a sentence fragment from the given symbol
    using the PCFG grammar.

    Parameters:
        symbol (str): The non-terminal or terminal symbol to expand.

    Returns:
        str: The generated string from the grammar.
    """
    # If the symbol is not in the grammar, it's assumed to be a terminal.
    if symbol not in grammar:
        return symbol
    productions = grammar[symbol]
    # Unzip the production rules and their corresponding weights.
    rules, weights = zip(*productions)
    # Choose one production rule based on the probabilities.
    chosen_rule = random.choices(rules, weights=weights, k=1)[0]
    # Debug log: show the chosen production rule for the current non-terminal.
    print(f"Expanding '{symbol}' using rule: {chosen_rule}")
    # Recursively generate the string for each symbol in the chosen rule.
    result = [generate(sym) for sym in chosen_rule]
    return " ".join(result)

# Generate a sentence starting from the initial symbol 'S'
sentence = generate("S")
print("\nGenerated Sentence:", sentence)
Expanding 'S' using rule: ['NP', 'VP']
Expanding 'NP' using rule: ['Name']
Expanding 'Name' using rule: ['Alice']
Expanding 'VP' using rule: ['V', 'NP']
Expanding 'V' using rule: ['sees']
Expanding 'NP' using rule: ['Det', 'N']
Expanding 'Det' using rule: ['a']
Expanding 'N' using rule: ['dog']
Generated Sentence: Alice sees a dog
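In short, generate looks the current symbol up in the grammar dictionary, uses random.choices with the rule probabilities as weights to pick one production, and recursively expands every symbol in that production until only terminals remain, joining them with spaces. A minimal usage sketch (an addition, reusing the grammar and generate defined above) that seeds the random generator and draws several sentences:

import random

random.seed(42)  # assumed seed, only to make repeated runs reproducible
for i in range(3):
    print(f"Sample {i + 1}: {generate('S')}")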
3.) Build a trigram model using the Reuters corpus to predict the next word based on the two preceding words.
# Import necessary libraries
import nltk
from nltk import bigrams, trigrams
from nltk.corpus import reuters
from collections import defaultdict

# Download necessary NLTK resources
nltk.download('reuters')
nltk.download('punkt')
nltk.download('punkt_tab')

# Tokenize the text
words = nltk.word_tokenize(' '.join(reuters.words()))

# Create trigrams
tri_grams = list(trigrams(words))

# Build a trigram model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# Transform the counts into probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count
# Function to predict the next word
def predict_next_word(w1, w2):
    """
    Predicts the next word based on the previous two words using the
    trained trigram model.

    Args:
        w1 (str): The first word.
        w2 (str): The second word.

    Returns:
        str: The predicted next word.
    """
    next_word = model[w1, w2]
    if next_word:
        # Choose the most likely next word
        predicted_word = max(next_word, key=next_word.get)
        return predicted_word
    else:
        return "No prediction available"

# Example usage
print("Next Word:", predict_next_word('the', 'stock'))
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
Next Word: of
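As a small follow-up (an addition, assuming the model dictionary built above), the same counts can rank several candidate continuations instead of returning only the single most likely word:

def top_next_words(w1, w2, k=3):
    """Return up to k of the most probable next words after the bigram (w1, w2)."""
    candidates = model[w1, w2]
    # Sort candidate words by their conditional probability, highest first
    return sorted(candidates.items(), key=lambda item: item[1], reverse=True)[:k]

print(top_next_words('the', 'stock'))  # list of (word, probability) pairs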
4.) Using Python's re module, extract US phone numbers, USNs (format LLLNNLLDDD), and email addresses from the given string:
"Reach us at 800-555-1212 or help@company.com. Student ID: NNM21EC099."
import re  # Import the regular expression module

# 1. The text we want to search within
text_to_search = ("Reach us at 800-555-1212 or help@company.com. "
                  "Student ID: NNM21EC099.")

# 2. Define the regular expression patterns
# Phone number pattern
phone_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
# USN pattern (LLLNNLLDDD)
usn_pattern = r"[A-Z]{3}\d{2}[A-Z]{2}\d{3}"
# Email address pattern
email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# 3. Find all matches for each pattern
found_phones = re.findall(phone_pattern, text_to_search)
found_usns = re.findall(usn_pattern, text_to_search)
found_emails = re.findall(email_pattern, text_to_search)

# 4. Print the results
print("--- Original Text ---")
print(text_to_search)
print("-" * 20)  # Separator

print(f"\n--- Found Phone Numbers (Pattern: {phone_pattern}) ---")
if found_phones:
    for phone in found_phones:
        print(f"- {phone}")
else:
    print("No phone numbers found matching the pattern.")

print(f"\n--- Found USN Numbers (Pattern: {usn_pattern}) ---")
if found_usns:
    for usn in found_usns:
        print(f"- {usn}")
else:
    print("No USN numbers found matching the pattern.")

print(f"\n--- Found Email Addresses (Pattern: {email_pattern}) ---")
if found_emails:
    for email in found_emails:
        print(f"- {email}")
else:
    print("No email addresses found matching the pattern.")
--- Original Text ---
Reach us at 800-555-1212 or help@company.com. Student ID: NNM21EC099.
--------------------
--- Found Phone Numbers (Pattern: \(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})
---
- 800-555-1212
--- Found USN Numbers (Pattern: [A-Z]{3}\d{2}[A-Z]{2}\d{3}) ---
- NNM21EC099
--- Found Email Addresses (Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.
[a-zA-Z]{2,}) ---
- help@company.com
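An optional extension (not part of the original program): re.finditer reports where each match occurs, which is useful when the positions matter. The sketch below reuses the patterns and text_to_search defined above:

# Show the (start, end) span of every match for each pattern
patterns = {"phone": phone_pattern, "usn": usn_pattern, "email": email_pattern}
for label, pattern in patterns.items():
    for match in re.finditer(pattern, text_to_search):
        print(f"{label}: {match.group()} at span {match.span()}")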
5.) What do the cost parameters (ins_cost, del_cost,
sub_cost) control in this edit distance function?
def weighted_edit_distance_no_numpy(s1, s2, ins_cost=1, del_cost=1, sub_cost=1):
    m = len(s1)
    n = len(s2)
    # Initialize DP table with nested lists
    dp = [[0.0 for _ in range(n + 1)] for _ in range(m + 1)]
    # --- Initialization ---
    for j in range(n + 1):
        dp[0][j] = j * ins_cost
    for i in range(m + 1):
        dp[i][0] = i * del_cost
    # --- Fill DP table ---
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            current_sub_cost = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            deletion = dp[i - 1][j] + del_cost
            insertion = dp[i][j - 1] + ins_cost
            substitution = dp[i - 1][j - 1] + current_sub_cost
            dp[i][j] = min(deletion, insertion, substitution)
    return dp[m][n]

# Example usage
string1 = "intention"
string2 = "execution"
distance1 = weighted_edit_distance_no_numpy(string1, string2, ins_cost=1, del_cost=1, sub_cost=1)
print(f"Weighted edit distance between '{string1}' and '{string2}': {distance1}")
Weighted edit distance between 'intention' and 'execution': 5
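To answer the question: ins_cost, del_cost, and sub_cost are the penalties charged for inserting a character, deleting a character, and substituting one character for another (matching characters cost nothing), and the DP table keeps the cheapest combination of these operations, so changing the weights changes the distance. A short sketch reusing the function above; with substitutions charged 2 (the common Levenshtein variant), the 'intention'/'execution' distance rises from 5 to 8 because the algorithm falls back on insert+delete pairs:

# Same strings, but substitutions now cost 2 instead of 1
distance2 = weighted_edit_distance_no_numpy("intention", "execution",
                                            ins_cost=1, del_cost=1, sub_cost=2)
print(f"Weighted edit distance with sub_cost=2: {distance2}")  # expected: 8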