
T&S.ipynb - Colaboratory

Ex 1
import re

def detect_words(text):
    words = re.findall(r'\w+', text)
    print("Words:", words)

def detect_numbers(text):
    numbers = re.findall(r'\d+', text)
    print("Numbers:", numbers)

def detect_emails(text):
    emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
    print("Emails:", emails)

def tokenize_sentences(text):
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    print("Sentences:", sentences)

def tokenize_words_punctuation(text):
    tokens = re.findall(r'\w+|[^\w\s]', text)
    print("Tokens:", tokens)

if __name__ == "__main__":
    user_text = input("Enter your text: ")

    detect_words(user_text)
    detect_numbers(user_text)
    detect_emails(user_text)
    tokenize_sentences(user_text)
    tokenize_words_punctuation(user_text)

Enter your text: Please give me 5 cards. 2 are blue


Words: ['Please', 'give', 'me', '5', 'cards', '2', 'are', 'blue']
Numbers: ['5', '2']
Emails: []
Sentences: ['Please give me 5 cards.', '2 are blue']
Tokens: ['Please', 'give', 'me', '5', 'cards', '.', '2', 'are', 'blue']
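
Note that this run finds no emails because the input contains none. A minimal follow-up sketch (the sample string below is made up for illustration) shows the same detect_emails pattern matching addresses when they are present:

# Hypothetical input containing email addresses, to exercise the email regex.
sample = "Contact alice.smith@example.com or bob_99@mail.example.org for details."

detect_emails(sample)   # expected: ['alice.smith@example.com', 'bob_99@mail.example.org']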

Ex 2
Import
import nltk
nltk.download('all')

[nltk_data] | Downloading package floresta to /root/nltk_data...
[nltk_data] | Unzipping corpora/floresta.zip.
[nltk_data] | Downloading package framenet_v15 to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/framenet_v15.zip.
[nltk_data] | Downloading package framenet_v17 to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/framenet_v17.zip.
[nltk_data] | Downloading package gazetteers to /root/nltk_data...
[nltk_data] | Unzipping corpora/gazetteers.zip.
[nltk_data] | Downloading package genesis to /root/nltk_data...
[nltk_data] | Package genesis is already up-to-date!
[nltk_data] | Downloading package gutenberg to /root/nltk_data...
[nltk_data] | Package gutenberg is already up-to-date!
[nltk_data] | Downloading package ieer to /root/nltk_data...
[nltk_data] | Unzipping corpora/ieer.zip.
[nltk_data] | Downloading package inaugural to /root/nltk_data...
[nltk_data] | Package inaugural is already up-to-date!
[nltk_data] | Downloading package indian to /root/nltk_data...
[nltk_data] | Unzipping corpora/indian.zip.
[nltk_data] | Downloading package jeita to /root/nltk_data...
[nltk_data] | Downloading package kimmo to /root/nltk_data...
[nltk_data] | Unzipping corpora/kimmo.zip.
[nltk_data] | Downloading package knbc to /root/nltk_data...
[nltk_data] | Downloading package large_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/large_grammars.zip.
[nltk_data] | Downloading package lin_thesaurus to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/lin_thesaurus.zip.
[nltk_data] | Downloading package mac_morpho to /root/nltk_data...
[nltk_data] | Unzipping corpora/mac_morpho.zip.
[nltk_data] | Downloading package machado to /root/nltk_data...
[nltk_data] | Downloading package masc_tagged to /root/nltk_data...
[nltk_data] | Downloading package maxent_ne_chunker to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] | Downloading package maxent_treebank_pos_tagger to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping taggers/maxent_treebank_pos_tagger.zip.
[nltk_data] | Downloading package moses_sample to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping models/moses_sample.zip.
[nltk_data] | Downloading package movie_reviews to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/movie_reviews.zip.
[nltk_data] | Downloading package mte_teip5 to /root/nltk_data...
[nltk_data] | Unzipping corpora/mte_teip5.zip.
[nltk_data] | Downloading package mwa_ppdb to /root/nltk_data...
[nltk_data] | Unzipping misc/mwa_ppdb.zip.
[nltk_data] | Downloading package names to /root/nltk_data...
[nltk_data] | Unzipping corpora/names.zip.
[nltk_data] | Downloading package nombank.1.0 to /root/nltk_data...
[nltk_data] | Downloading package nonbreaking_prefixes to

Code
import nltk
from nltk.book import *

def search_text(text, word):
    print("Concordance for '{}' in the text:".format(word))
    text.concordance(word)
    print()

def count_vocabulary(text):
    vocab_count = len(set(text))
    lexical_diversity = len(set(text)) / len(text)
    print("Vocabulary count:", vocab_count)
    print("Lexical diversity:", lexical_diversity)
    print()

def calculate_frequency_distribution(text):
    fdist = FreqDist(text)
    print("Top 10 most common words:")
    print(fdist.most_common(10))
    print()

def find_collocations(text):
    print("Collocations in the text:")
    text.collocations()
    print()

def extract_bigrams(text):
    print("First 10 bigrams:")
    bigrams = list(nltk.bigrams(text))
    print(bigrams[:10])
    print()

if __name__ == "__main__":
    print("Text analysis using NLTK\n")

    # Text: Moby Dick by Herman Melville
    print("Text: Moby Dick by Herman Melville")

    # Searching text
    search_text(text1, "monstrous")

    # Counting vocabulary
    count_vocabulary(text1)

    # Frequency distribution
    calculate_frequency_distribution(text1)

    # Collocations
    find_collocations(text1)

    # Bigrams
    extract_bigrams(text1)

[Output truncated in the capture: concordance lines for 'monstrous', the vocabulary count and lexical diversity, the tail of the 10 most common words (..., ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982)), the collocations list (truncated), and the first 10 bigrams, including ('Herman', 'Melville'), ('Melville', '1851'), ('1851', ']'), (']', 'ETYMOLOGY'), ...]
Ex 3
import nltk

# Download NLTK data (you can also use nltk.download() for interactive download)
nltk.download('gutenberg')
nltk.download('brown')

# Import corpora
from nltk.corpus import gutenberg, brown

# Access Gutenberg Corpus
print("Books in the Gutenberg Corpus:")
print(gutenberg.fileids())

# Access specific text from Gutenberg Corpus
emma_words = gutenberg.words('austen-emma.txt')
print("\nNumber of words in Emma by Jane Austen:", len(emma_words))

# Access categories in the Brown Corpus
print("\nCategories in the Brown Corpus:")
print(brown.categories())

# Access text from a specific category in the Brown Corpus
news_text = brown.words(categories='news')
print("\nNumber of words in the 'news' category of the Brown Corpus:", len(news_text))


[nltk_data] Downloading package gutenberg to /root/nltk_data...


[nltk_data] Package gutenberg is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data] Package brown is already up-to-date!
Books in the Gutenberg Corpus:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.

Number of words in Emma by Jane Austen: 192427

Categories in the Brown Corpus:


['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'lear

Number of words in the 'news' category of the Brown Corpus: 100554
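
Both corpus readers also expose sentence- and tag-level views. A small sketch (assuming the same gutenberg and brown imports as above) of two such accessors:

# Sentences from Emma: each sentence is a list of word tokens.
emma_sents = gutenberg.sents('austen-emma.txt')
print("Number of sentences in Emma:", len(emma_sents))
print("First sentence:", emma_sents[0])

# Part-of-speech tagged words from the Brown 'news' category, as (word, tag) pairs.
news_tagged = brown.tagged_words(categories='news')
print("First 5 tagged words:", news_tagged[:5])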

Ex 4
import nltk
from nltk.corpus import stopwords
from collections import Counter

def most_frequent_words(text):
    # Tokenize the text
    words = nltk.word_tokenize(text.lower())

    # Get English stopwords
    english_stopwords = set(stopwords.words('english'))

    # Filter out stopwords and non-alphabetic words
    filtered_words = [word for word in words if word.isalpha() and word not in english_stopwords]

    # Count the frequency of each word
    word_freq = Counter(filtered_words)

    # Get the 50 most common words
    most_common_words = word_freq.most_common(50)

    return most_common_words

# Example usage:
text = "This is a sample text. It contains words, some of which may be common and others rare."
common_words = most_frequent_words(text)
print("50 most frequent words excluding stopwords:")
for word, freq in common_words:
    print(f"{word}: {freq}")

50 most frequent words excluding stopwords:


sample: 1
text: 1
contains: 1
words: 1
may: 1
common: 1
others: 1
rare: 1
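
With a one-sentence sample, the function returns far fewer than 50 words. A hedged sketch of a larger run (reusing the Gutenberg corpus from Ex 3; the variable names here are illustrative) where the 50-word cap actually matters:

from nltk.corpus import gutenberg

# Raw text of Emma, passed through the same most_frequent_words pipeline.
emma_raw = gutenberg.raw('austen-emma.txt')
top_words = most_frequent_words(emma_raw)

# Print only the first 10 of the 50 most common content words.
for word, freq in top_words[:10]:
    print(f"{word}: {freq}")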


Ex 5
!pip install gensim

Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (4.3.2)


Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.10/dist-packages (from g
Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from ge
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages (fr

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Sample text data
corpus = [
    "Word embeddings are dense vector representations of words.",
    "They capture semantic information about words.",
    "Word2Vec is a popular technique for generating word embeddings."
]

# Tokenize and preprocess the text data
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]
stop_words = set(stopwords.words('english'))
preprocessed_corpus = [[word for word in sentence if word.isalpha() and word not in stop_words]
                       for sentence in tokenized_corpus]

# Train the Word2Vec model
model = Word2Vec(sentences=preprocessed_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Save the trained model
model.save("word2vec_model.bin")

# Load the saved model
loaded_model = Word2Vec.load("word2vec_model.bin")

# Get the vector representation of a word
vector = loaded_model.wv['word']
print("Vector representation of 'word':", vector)

# Find similar words
similar_words = loaded_model.wv.most_similar('word', topn=3)
print("Words similar to 'word':", similar_words)

Vector representation of 'word': [ 9.4563962e-05 3.0773198e-03 -6.8126451e-03 -1.3754654e-03


7.6685809e-03 7.3464094e-03 -3.6732971e-03 2.6427018e-03
-8.3171297e-03 6.2054861e-03 -4.6373224e-03 -3.1641065e-03
9.3113566e-03 8.7338570e-04 7.4907029e-03 -6.0740625e-03
5.1605068e-03 9.9228229e-03 -8.4573915e-03 -5.1356913e-03
-7.0648370e-03 -4.8626517e-03 -3.7785638e-03 -8.5361991e-03
7.9556061e-03 -4.8439382e-03 8.4236134e-03 5.2625705e-03
-6.5500261e-03 3.9578713e-03 5.4701497e-03 -7.4265362e-03
-7.4057197e-03 -2.4752307e-03 -8.6257253e-03 -1.5815723e-03
-4.0343284e-04 3.2996845e-03 1.4418805e-03 -8.8142155e-04
-5.5940580e-03 1.7303658e-03 -8.9737179e-04 6.7936908e-03
3.9735902e-03 4.5294715e-03 1.4343059e-03 -2.6998555e-03

-4.3668128e-03 -1.0320747e-03 1.4370275e-03 -2.6460087e-03
-7.0737829e-03 -7.8053069e-03 -9.1217868e-03 -5.9351693e-03
-1.8474245e-03 -4.3238713e-03 -6.4606704e-03 -3.7173224e-03
4.2891586e-03 -3.7390434e-03 8.3781751e-03 1.5339935e-03
-7.2423196e-03 9.4337985e-03 7.6312125e-03 5.4932819e-03
-6.8488456e-03 5.8226790e-03 4.0090932e-03 5.1853694e-03
4.2559016e-03 1.9397545e-03 -3.1701624e-03 8.3538452e-03
9.6121803e-03 3.7926030e-03 -2.8369951e-03 7.1275235e-06
1.2188185e-03 -8.4583247e-03 -8.2239453e-03 -2.3101569e-04
1.2372875e-03 -5.7433806e-03 -4.7252737e-03 -7.3460746e-03
8.3286157e-03 1.2129784e-04 -4.5093987e-03 5.7017053e-03
9.1800150e-03 -4.0998720e-03 7.9646818e-03 5.3754342e-03
5.8791232e-03 5.1259040e-04 8.2130842e-03 -7.0190406e-03]
Words similar to 'word': [('capture', 0.1991206258535385), ('dense', 0.17272451519966125), ('te
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
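
Beyond most_similar, the loaded model supports pairwise similarity and vocabulary inspection. A minimal follow-up sketch (on this three-sentence corpus the scores are essentially noise):

# Cosine similarity between two words present in the tiny training corpus.
score = loaded_model.wv.similarity('word', 'embeddings')
print("Similarity between 'word' and 'embeddings':", score)

# Words the model actually learned vectors for.
print("Vocabulary:", list(loaded_model.wv.key_to_index.keys()))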

Ex 6
!pip install datasets

Collecting datasets
Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.7/536.7 kB 3.8 MB/s eta 0:00:00
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datase
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from dat
Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from
Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from
Collecting dill<0.3.9,>=0.3.0 (from datasets)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 10.6 MB/s eta 0:00:00
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets
Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (fro
Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from da
Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets
Collecting multiprocess (from datasets)
Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 12.8 MB/s eta 0:00:00
Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from dataset
Requirement already satisfied: huggingface-hub>=0.19.4 in /usr/local/lib/python3.10/dist-packag
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datas
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from dat
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (fro
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from a
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (fr
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from
Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packag
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-pac
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packa
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from re
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (f
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (f
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-package
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pa
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.17.1 dill-0.3.8 multiprocess-0.70.16


!pip install transformers


from transformers import pipeline

# Load zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Take user input for the sequence
print("\033[1mEnter the text to classify:\033[0m")
sequence = input()

# Take user input for candidate labels
print("\n\033[1mEnter candidate labels separated by commas:\033[0m")
user_labels_input = input()
candidate_labels = [label.strip() for label in user_labels_input.split(",")]

print("\n")

# Perform zero-shot classification
result = classifier(sequence, candidate_labels)
print("Classification Result:")
print("Sequence:", sequence)
print("Candidate Labels:", candidate_labels)
print("Predicted Label:", result['labels'][0])
print("Scores:", result['scores'][0])


Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.37.2)


Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transfo
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /usr/local/lib/python3.10/dist-p
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from tra
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from tra
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (fro
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transfo
Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-package
Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (f
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from tran
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-pac
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packag
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from req
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (f
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (f
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingf
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or dat
warnings.warn(
config.json: 100% 1.15k/1.15k [00:00<00:00, 65.6kB/s]

model.safetensors: 100% 1.63G/1.63G [00:15<00:00, 140MB/s]

tokenizer_config.json: 100% 26.0/26.0 [00:00<00:00, 1.30kB/s]

vocab.json: 100% 899k/899k [00:00<00:00, 11.1MB/s]
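
The cell above prints only the top-ranked label and its score. A small hedged extension (not part of the original notebook) that prints the full ranking the pipeline returns:

# result['labels'] and result['scores'] are aligned and sorted best-first.
for label, score in zip(result['labels'], result['scores']):
    print(f"{label}: {score:.4f}")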

