Practical 09: Text Preprocessing
Aim:- To clean text data and make it ready to be fed to a model
Theory:- Tokenization
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens
can be either words, characters, or subwords. Hence, tokenization can be broadly classified into
3 types – word, character, and subword (n-gram characters) tokenization.
For example, consider the sentence: “Never give up”.
The most common way of forming tokens is based on space. Assuming space as a delimiter,
the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word,
it becomes an example of Word tokenization.
Similarly, tokens can be either characters or subwords. For example, let us consider “smarter”:
1. Character tokens: s-m-a-r-t-e-r
2. Subword tokens: smart-er
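A minimal sketch of these three tokenization schemes in plain Python (the subword split below is hard-coded to mirror the example; in practice subword tokens are learned by algorithms such as Byte Pair Encoding):
sentence = "Never give up"
word = "smarter"

# Word tokenization: split on whitespace
print(sentence.split(" "))   # ['Never', 'give', 'up']

# Character tokenization: every character becomes a token
print(list(word))            # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Subword tokenization: hard-coded here purely for illustration
print(["smart", "er"])       # ['smart', 'er']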
The True Reasons behind Tokenization
As tokens are the building blocks of Natural Language, the most common way of processing
the raw text happens at the token level.
For example, Transformer based models – the State of The Art (SOTA) Deep Learning
architectures in NLP – process the raw text at the token level. Similarly, the most popular deep
learning architectures for NLP like RNN, GRU, and LSTM also process the raw text at the
token level.
Hence, Tokenization is the foremost step when modeling text data. Tokenization is
performed on the corpus to obtain tokens. These tokens are then used to prepare a
vocabulary. Vocabulary refers to the set of unique tokens in the corpus. Remember that the
vocabulary can be constructed either by considering every unique token in the corpus or by
considering only the top K most frequently occurring words.
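A minimal sketch of both ways of building a vocabulary (the toy corpus and the value of K below are illustrative, not part of the practical):
from collections import Counter

corpus = ["never give up", "never stop learning", "give it up"]

# Tokenize every document on whitespace
tokens = [tok for doc in corpus for tok in doc.split()]

# Option 1: every unique token in the corpus
vocab_all = sorted(set(tokens))
print(vocab_all)

# Option 2: only the top K most frequently occurring tokens
K = 3
vocab_top_k = [tok for tok, _ in Counter(tokens).most_common(K)]
print(vocab_top_k)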
Word Tokenization
Word Tokenization is the most commonly used tokenization algorithm. It splits a piece of text
into individual words based on a certain delimiter. Depending upon the delimiter, different
word-level tokens are formed. Pretrained word embeddings such as Word2Vec and GloVe are
built on top of word tokenization.
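For example, the choice of delimiter changes the word tokens that are produced; a small sketch with an illustrative sentence:
import re

sentence = "New-York is a well-known city"

# Space as the only delimiter keeps hyphenated words intact
print(sentence.split(" "))            # ['New-York', 'is', 'a', 'well-known', 'city']

# Spaces and hyphens as delimiters split them further
print(re.split(r"[\s-]+", sentence))  # ['New', 'York', 'is', 'a', 'well', 'known', 'city']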
Character Tokenization
Character Tokenization splits a piece of text into a set of characters. It overcomes drawbacks
of Word Tokenization, such as the handling of out-of-vocabulary (OOV) words and very large
vocabularies:
• Character tokenizers handle OOV words coherently while preserving the information of
the word. An OOV word is broken down into characters and represented in terms of
those characters (see the sketch after this list).
• Character tokenization also limits the size of the vocabulary. Want to take a guess at the size
of the vocabulary? 26, since the vocabulary only needs to contain the unique set of characters
(here, the English letters).
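A minimal sketch of how a character-level vocabulary covers an out-of-vocabulary word (the word-level vocabulary here is illustrative):
# Word-level vocabulary built from some training corpus (illustrative)
word_vocab = {"never", "give", "up", "smart"}

# Character-level vocabulary: the 26 lowercase English letters
char_vocab = {chr(c) for c in range(ord("a"), ord("z") + 1)}
print(len(char_vocab))                           # 26

oov_word = "smarter"
print(oov_word in word_vocab)                    # False: OOV at the word level
print(all(ch in char_vocab for ch in oov_word))  # True: fully covered by characters
print(list(oov_word))                            # ['s', 'm', 'a', 'r', 't', 'e', 'r']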
Text classification with TensorFlow Hub: Movie reviews
This notebook classifies movie reviews as positive or negative using the text of the review.
This is an example of binary—or two-class—classification, an important and widely
applicable kind of machine learning problem.
The tutorial demonstrates the basic application of transfer learning with TensorFlow Hub and
Keras.
Loss function and optimizer
A model needs a loss function and an optimizer for training. Since this is a binary
classification problem and the model outputs logits (a single-unit layer with a linear
activation), you'll use the binary cross-entropy loss function with from_logits=True.
This isn't the only choice for a loss function; you could, for instance, choose
mean_squared_error. But, generally, binary_crossentropy is better for dealing with
probabilities—it measures the "distance" between probability distributions, or in our case,
between the ground-truth distribution and the predictions.
Later, when you are exploring regression problems (say, to predict the price of a house),
you'll see how to use another loss function called mean squared error.
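A small sketch comparing the two losses on a single made-up example (the logit and label values are illustrative); from_logits=True tells the loss to apply the sigmoid itself:
import tensorflow as tf

y_true = tf.constant([[1.0], [0.0]])   # ground-truth labels
logits = tf.constant([[2.0], [-1.0]])  # raw model outputs, no sigmoid applied

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
mse = tf.keras.losses.MeanSquaredError()

print(bce(y_true, logits).numpy())              # cross-entropy computed from the logits
print(mse(y_true, tf.sigmoid(logits)).numpy())  # MSE on the predicted probabilities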
Code :-
from keras.preprocessing.text import Tokenizer
# from keras.preprocessing.sequence import pad_sequences  # import path in older Keras versions
from keras.utils import pad_sequences
import numpy as np

max_words = 10  # keep only the (max_words - 1) most frequent words

# Build a word-level tokenizer and fit it on the small training corpus
Model_tokenizer = Tokenizer(num_words=max_words)
texts = ["This is a girl.", "Girl is tall", "A tall boy is here"]
Model_tokenizer.fit_on_texts(texts)

# Use the fitted tokenizer to convert the training texts to sequences of word indices
sequences = Model_tokenizer.texts_to_sequences(texts)
print(sequences)

# Convert unseen texts; words not in the vocabulary are dropped
text_new = ["this is my house", "the girl and boy", "The house is small"]
sequences_new = Model_tokenizer.texts_to_sequences(text_new)
print(sequences_new)

# Map index sequences back to text; indices not in the vocabulary (e.g. 23) are skipped
seq_try = [[5, 1, 2, 23], [3, 6]]
text_try = Model_tokenizer.sequences_to_texts(seq_try)
print(text_try)

# Inspect the fitted tokenizer
print(Model_tokenizer.document_count)
print(Model_tokenizer.get_config())

# Pad (or truncate) the sequences to a fixed length so they can be fed to a model
data = pad_sequences(sequences, maxlen=6)
print(data)
data_new = pad_sequences(sequences_new, maxlen=4)
print(data_new)
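Note that, by default, the Tokenizer silently drops words it never saw during fitting (e.g. "my" and "house" above). A short sketch, reusing the same toy corpus, of the optional oov_token argument, which maps unseen words to a reserved index instead:
# Unseen words now map to the reserved "<OOV>" index instead of being dropped
oov_tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
oov_tokenizer.fit_on_texts(texts)
print(oov_tokenizer.texts_to_sequences(["this is my house"]))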
Text classification with TensorFlow Hub: Movie reviews
import os
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
Download the IMDB dataset
# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
Explore the data
train_examples_batch, train_labels_batch = next(iter(train_data.batch(4)))
train_examples_batch
Build the model
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:4])
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))
model.summary()
Loss function and optimizer
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
Train the model
len(train_data)
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=2,
                    validation_data=validation_data.batch(512),
                    verbose=1)
Evaluate the model
results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))
Conclusion
Text preprocessing involves transforming text into a clean and consistent format that can then
be fed into a model for further analysis and learning. Text preprocessing techniques may be
general so that they are applicable to many types of applications, or they can be specialized
for a specific task.
Experiment Number | Date of Performance | Grade | Teacher's Sign