Practical 09: Text Preprocessing
Aim:- To clean text data and make it ready to be fed to a model
Theory:- Tokenization
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens
can be either words, characters, or subwords. Hence, tokenization can be broadly classified into
3 types – word, character, and subword (n-gram characters) tokenization.
For example, consider the sentence: “Never give up”.
The most common way of forming tokens is based on space. Assuming space as a delimiter,
the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word,
it becomes an example of Word tokenization.
Similarly, tokens can be either characters or subwords. For example, let us consider “smarter”:
1. Character tokens: s-m-a-r-t-e-r
2. Subword tokens: smart-er
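A minimal sketch of these three tokenization schemes in plain Python (the subword split below is hard-coded to mirror the example; in practice subword tokens are learned by algorithms such as Byte Pair Encoding):
sentence = "Never give up"
word = "smarter"

# Word tokenization: split on whitespace
print(sentence.split(" "))   # ['Never', 'give', 'up']

# Character tokenization: every character becomes a token
print(list(word))            # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Subword tokenization: hard-coded here purely for illustration
print(["smart", "er"])       # ['smart', 'er']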
The True Reasons behind Tokenization
As tokens are the building blocks of Natural Language, the most common way of processing
the raw text happens at the token level.
For example, Transformer based models – the State of The Art (SOTA) Deep Learning
architectures in NLP – process the raw text at the token level. Similarly, the most popular deep
learning architectures for NLP like RNN, GRU, and LSTM also process the raw text at the
token level.
Hence, Tokenization is the foremost step when modeling text data. Tokenization is
performed on the corpus to obtain tokens. These tokens are then used to prepare a
vocabulary. Vocabulary refers to the set of unique tokens in the corpus. Remember that the
vocabulary can be constructed either by considering every unique token in the corpus or by
considering only the top K most frequently occurring words.
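A minimal sketch of both ways of building a vocabulary (the toy corpus and the value of K below are illustrative, not part of the practical):
from collections import Counter

corpus = ["never give up", "never stop learning", "give it up"]

# Tokenize every document on whitespace
tokens = [tok for doc in corpus for tok in doc.split()]

# Option 1: every unique token in the corpus
vocab_all = sorted(set(tokens))
print(vocab_all)

# Option 2: only the top K most frequently occurring tokens
K = 3
vocab_top_k = [tok for tok, _ in Counter(tokens).most_common(K)]
print(vocab_top_k)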
Word Tokenization
Word Tokenization is the most commonly used tokenization algorithm. It splits a piece of text
into individual words based on a certain delimiter. Depending upon the delimiter, different
word-level tokens are formed. Pretrained word embeddings such as Word2Vec and GloVe are
built on top of word tokenization.
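For example, the choice of delimiter changes the word tokens that are produced; a small sketch with an illustrative sentence:
import re

sentence = "New-York is a well-known city"

# Space as the only delimiter keeps hyphenated words intact
print(sentence.split(" "))            # ['New-York', 'is', 'a', 'well-known', 'city']

# Spaces and hyphens as delimiters split them further
print(re.split(r"[\s-]+", sentence))  # ['New', 'York', 'is', 'a', 'well', 'known', 'city']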
Character Tokenization
Character Tokenization splits a piece of text into a set of characters. It overcomes drawbacks
of Word Tokenization, such as the handling of out-of-vocabulary (OOV) words and very large
vocabularies:
• Character tokenizers handle OOV words coherently while preserving the information of
the word. An OOV word is broken down into characters and represented in terms of
those characters (see the sketch after this list).
• Character tokenization also limits the size of the vocabulary. Want to take a guess at the size
of the vocabulary? 26, since the vocabulary only needs to contain the unique set of characters
(here, the English letters).
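A minimal sketch of how a character-level vocabulary covers an out-of-vocabulary word (the word-level vocabulary here is illustrative):
# Word-level vocabulary built from some training corpus (illustrative)
word_vocab = {"never", "give", "up", "smart"}

# Character-level vocabulary: the 26 lowercase English letters
char_vocab = {chr(c) for c in range(ord("a"), ord("z") + 1)}
print(len(char_vocab))                           # 26

oov_word = "smarter"
print(oov_word in word_vocab)                    # False: OOV at the word level
print(all(ch in char_vocab for ch in oov_word))  # True: fully covered by characters
print(list(oov_word))                            # ['s', 'm', 'a', 'r', 't', 'e', 'r']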
Text classification with TensorFlow Hub: Movie reviews
This notebook classifies movie reviews as positive or negative using the text of the review.
This is an example of binary—or two-class—classification, an important and widely
applicable kind of machine learning problem.
The tutorial demonstrates the basic application of transfer learning with TensorFlow Hub and
Keras.
Loss function and optimizer
A model needs a loss function and an optimizer for training. Since this is a binary
classification problem and the model outputs logits (a single-unit layer with a linear
activation), you'll use the binary cross-entropy loss function with from_logits=True.
This isn't the only choice for a loss function; you could, for instance, choose
mean_squared_error. But, generally, binary_crossentropy is better for dealing with
probabilities—it measures the "distance" between probability distributions, or in our case,
between the ground-truth distribution and the predictions.
Later, when you are exploring regression problems (say, to predict the price of a house),
you'll see how to use another loss function called mean squared error.
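A small sketch comparing the two losses on a single made-up example (the logit and label values are illustrative); from_logits=True tells the loss to apply the sigmoid itself:
import tensorflow as tf

y_true = tf.constant([[1.0], [0.0]])   # ground-truth labels
logits = tf.constant([[2.0], [-1.0]])  # raw model outputs, no sigmoid applied

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
mse = tf.keras.losses.MeanSquaredError()

print(bce(y_true, logits).numpy())              # cross-entropy computed from the logits
print(mse(y_true, tf.sigmoid(logits)).numpy())  # MSE on the predicted probabilities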
Code :-
from keras.preprocessing.text import Tokenizer
# from keras.preprocessing.sequence import pad_sequences  # import path in older Keras versions
from keras.utils import pad_sequences
import numpy as np

max_words = 10  # keep only the (max_words - 1) most frequent words

# Build a word-level tokenizer and fit it on the small training corpus
Model_tokenizer = Tokenizer(num_words=max_words)
texts = ["This is a girl.", "Girl is tall", "A tall boy is here"]
Model_tokenizer.fit_on_texts(texts)

# Use the fitted tokenizer to convert the training texts to sequences of word indices
sequences = Model_tokenizer.texts_to_sequences(texts)
print(sequences)

# Convert unseen texts; words not in the vocabulary are dropped
text_new = ["this is my house", "the girl and boy", "The house is small"]
sequences_new = Model_tokenizer.texts_to_sequences(text_new)
print(sequences_new)

# Map index sequences back to text; indices not in the vocabulary (e.g. 23) are skipped
seq_try = [[5, 1, 2, 23], [3, 6]]
text_try = Model_tokenizer.sequences_to_texts(seq_try)
print(text_try)

# Inspect the fitted tokenizer
print(Model_tokenizer.document_count)
print(Model_tokenizer.get_config())

# Pad (or truncate) the sequences to a fixed length so they can be fed to a model
data = pad_sequences(sequences, maxlen=6)
print(data)
data_new = pad_sequences(sequences_new, maxlen=4)
print(data_new)
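Note that, by default, the Tokenizer silently drops words it never saw during fitting (e.g. "my" and "house" above). A short sketch, reusing the same toy corpus, of the optional oov_token argument, which maps unseen words to a reserved index instead:
# Unseen words now map to the reserved "<OOV>" index instead of being dropped
oov_tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
oov_tokenizer.fit_on_texts(texts)
print(oov_tokenizer.texts_to_sequences(["this is my house"]))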
Text classification with TensorFlow Hub: Movie reviews
import os
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
Download the IMDB dataset
# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
Explore the data
train_examples_batch, train_labels_batch = next(iter(train_data.batch(4)))
train_examples_batch
Build the model
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:4])
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))
model.summary()
Loss function and optimizer
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
Train the model
len(train_data)
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=2,
                    validation_data=validation_data.batch(512),
                    verbose=1)
Evaluate the model
results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))
Conclusion
Text preprocessing involves transforming text into a clean and consistent format that can then
be fed into a model for further analysis and learning. Text preprocessing techniques may be
general so that they are applicable to many types of applications, or they can be specialized
for a specific task.
Experiment Number | Date of Performance | Grade | Teacher's Sign