Lesson 2: Feature Engineering on Text Data
Explain N-gram
Techniques covered: N-grams, TF-IDF, Levenshtein distance.
[Figure: word-embedding plot placing the words Girl, Boy, Male, and Female in vector space.]
N-grams are combinations of adjacent words or letters of length n in the source text.
An n-gram is a group (contiguous sequence) of n words or characters. For the sentence "This is a sentence":
• n = 1 (Unigram): "This", "is", "a", "sentence"
• n = 2 (Bigram): "This is", "is a", "a sentence"
• n = 3 (Trigram): "This is a", "is a sentence"
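A minimal sketch of n-gram extraction in Python (the helper name ngrams and the sample sentence are illustrative):

def ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is a sentence".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams: ('This', 'is'), ('is', 'a'), ('a', 'sentence')
print(ngrams(tokens, 3))  # trigrams: ('This', 'is', 'a'), ('is', 'a', 'sentence')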
Probability Calculation: Bigram Model
In a bigram model, the probability of a word given the previous word is estimated as P(w_i | w_(i-1)) = count(w_(i-1) w_i) / count(w_(i-1)).
Applications:
• Text comparison
• Information retrieval
• Automatic text categorization
• Autocomplete
Bag-of-Words
Example usage: processed data such as documents, tweets, and review comments is treated as an unordered collection of words.
Bag-of-Words
The bag-of-words model is a way of extracting features from text and representing text data when modeling the text with a machine learning algorithm.
01 Tokenization: while creating the bag of words, the tokenized words of each observation are used.
02 Process:
• Collect data
• Create a vocabulary by listing all unique words
• Create document vectors after scoring
03 Scoring mechanism:
• Word hashing
• TF-IDF
• Boolean value
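A minimal sketch of this process using scikit-learn's CountVectorizer with count scoring (the sample documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# Collect data (illustrative documents)
docs = [
    "I have a little daughter",
    "Mary had a little lamb",
    "Twinkle twinkle little star",
]

# Create a vocabulary of unique words and score each document by word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary (scikit-learn >= 1.0)
print(X.toarray())                         # document vectors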
Bag-of-Words: Example
Stemmed tokens: "Twinkle twinkle little star" → "twinkl", "littl", "star"; "The silence of lambs" → "silenc", "lamb".
Limitations: documents are difficult to compare, and multiple occurrences of a word are difficult to handle.

Term or Word                  daughter  lamb  littl  mari  star  silenc  twinkl
I have a little daughter         1       0     1     0     0      0       0
Mary had a little lamb           0       1     1     1     0      0       0
Twinkle twinkle little star      0       0     1     0     1      0       2
The silence of lambs             0       1     0     0     0      1       0
Document-Term Matrix
• Represents documents in rows and terms in columns
• Follows the bag-of-words approach
Document-Term Matrix Calculation
01 Create a matrix of n x m (n >= 1, m >= 1), where n is the number of documents and m is the number of unique terms.
02 Assign the count of each term against the respective document.
Document-Term Matrix: Applications
• Improving search results
• Finding topics
Document-Term Matrix: Example
Example:
Doc1  1 1 1 1 1 1 1 0 0 0 0 0
Doc2  0 0 1 0 1 1 1 1 1 0 0 0
Doc3  0 0 1 1 0 1 0 1 0 1 1 1

Compare documents with the dot product: the element-wise product of Doc1 and Doc2 is
      0 0 1 0 1 1 1 0 0 0 0 0
and summing these values gives the dot product.
Document-Term Matrix: Analyze Dot Product—Example
• The higher the dot product, the more similar the documents are.
• Flaw: a document pair with very different words may end up with the same dot product as a document pair that is very similar.
Dot product of Doc1 and Doc2 = 4. Normalizing by the vector magnitudes gives the cosine similarity:
4 / (sqrt(7) * sqrt(6)) ≈ 0.62
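A minimal sketch of both computations on the Doc1 and Doc2 rows above:

import math

doc1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
doc2 = [0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0]

# Dot product: sum of the element-wise products
dot = sum(a * b for a, b in zip(doc1, doc2))
print(dot)  # 4

# Normalizing by the vector magnitudes gives the cosine similarity
cosine = dot / (math.sqrt(sum(a * a for a in doc1)) * math.sqrt(sum(b * b for b in doc2)))
print(round(cosine, 3))  # 4 / (sqrt(7) * sqrt(6)) ≈ 0.617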
Example:
• "Cost" occurs more frequently in an economy-related document. To overcome this limitation, TF-IDF is used, which assigns weights to words based on their relevance in the document.
TF-IDF
Doc1  1 1 1 1 1 1 1 0 0 0 0 0
Doc2  0 0 1 0 1 1 1 1 1 0 0 0
Doc3  0 0 1 1 0 1 0 1 0 1 1 1

Document frequency (number of documents containing each term):
      1 1 3 2 2 3 2 2 1 1 1 1

Dividing each term count by that term's document frequency:
Doc1  1/1 1/1 1/3 1/2 1/2 1/3 1/2 0/2 0/1 0/1 0/1 0/1
Doc2  0/1 0/1 1/3 0/2 1/2 1/3 1/2 1/2 1/1 0/1 0/1 0/1
Doc3  0/1 0/1 1/3 1/2 0/2 1/3 0/2 1/2 0/1 1/1 1/1 1/1

• This highlights the words or terms that are unique to a document.
• These words are better for characterizing the document.
TF-IDF
TF = Term Frequency
IDF = Inverse Document Frequency
TF(t, d) = count(t, d) / |d|
where count(t, d) is the count of term t in document d, and |d| is the total number of terms in document d.
A common form of IDF is IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t; the final weight is TF x IDF.
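A minimal sketch using the TF definition above together with the common IDF variant log(N / df(t)); the toy corpus is illustrative, and real libraries apply additional smoothing:

import math
from collections import Counter

docs = [
    "i have a little daughter".split(),
    "mary had a little lamb".split(),
    "twinkle twinkle little star".split(),
]
N = len(docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)   # TF = count(t, d) / |d|
    idf = math.log(N / df[term])      # IDF = log(N / df(t))
    return tf * idf

print(tf_idf("little", docs[0]))    # 0.0 -> appears in every document
print(tf_idf("daughter", docs[0]))  # higher weight -> unique to this document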
Levenshtein Distance
Applications:
• String matching
• RNA/DNA sequencing
• Spell-checking
• Plagiarism detection
Levenshtein Distance: Example
Strings to compare: HYUNDAI (word 1) and HONDA (word 2).
01 Create a matrix of n x m, where n is the character length of word 1 and m is the character length of word 2.
Initialize the first row with 0..n and the first column with 0..m:

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
  H  1
  O  2
  N  3
  D  4
  A  5
Steps 3 to 5: fill the matrix column by column. After processing i = 1, 2, 3 (columns H, Y, U), the matrix contains:

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
  H  1  0  1  2
  O  2  1  1  2
  N  3  2  2  2
  D  4  3  3  3
  A  5  4  4  4
Levenshtein Distance Calculation: Example
Strings to compare: s = HYUNDAI, t = HONDA
Step 1:
• Set n to be the length of s
• Set m to be the length of t
• If n = 0, return m and exit
• If m = 0, return n and exit
• Construct a matrix containing 0..m rows and 0..n columns

Filling continues column by column for i = 4 to 7; after the last column (i = 7) the matrix is:

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
  H  1  0  1  2  3  4  5  6
  O  2  1  1  2  3  4  5  6
  N  3  2  2  2  2  3  4  5
  D  4  3  3  3  3  2  3  4
  A  5  4  4  4  4  3  2  3
Levenshtein Distance Calculation: Example
Step 2: The matrix is filled from the upper-left to the lower-right corner.
Step 3: Set cell d[i,j] of the matrix equal to the minimum of:
a. The cell immediately above plus 1: d[i-1,j] + 1
b. The cell immediately to the left plus 1: d[i,j-1] + 1
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost
Step 4: The number in the lower-right corner is the Levenshtein distance between the two words.

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
  H  1  0  1  2  3  4  5  6
  O  2  1  1  2  3  4  5  6
  N  3  2  2  2  2  3  4  5
  D  4  3  3  3  3  2  3  4
  A  5  4  4  4  4  3  2  3

Levenshtein distance between HYUNDAI and HONDA = 3
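A minimal Python sketch of the same dynamic-programming procedure (the function name is illustrative):

def levenshtein(s, t):
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    # Matrix with 0..m rows and 0..n columns, first row/column initialized to 0..n and 0..m
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        d[0][j] = j
    for i in range(m + 1):
        d[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[j - 1] == t[i - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # cell above + 1
                          d[i][j - 1] + 1,          # cell to the left + 1
                          d[i - 1][j - 1] + cost)   # diagonal + cost
    return d[m][n]  # number in the lower-right corner

print(levenshtein("HYUNDAI", "HONDA"))  # 3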
One-Hot Encoding
Each term is represented by a vector with a 1 in its own position and 0 in every other position:
littl   0 0 1 0 0 0 0
silenc  0 0 0 0 0 1 0
twinkl  0 0 0 0 0 0 1
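A minimal sketch over the stemmed vocabulary from the earlier example (the vocabulary order is assumed):

vocab = ["daughter", "lamb", "littl", "mari", "star", "silenc", "twinkl"]

def one_hot(term):
    # 1 in the term's own position, 0 everywhere else
    return [1 if term == v else 0 for v in vocab]

print(one_hot("littl"))   # [0, 0, 1, 0, 0, 0, 0]
print(one_hot("silenc"))  # [0, 0, 0, 0, 0, 1, 0]
print(one_hot("twinkl"))  # [0, 0, 0, 0, 0, 0, 1]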
Biological Neuron vs. Artificial Neuron
Biological Neurons
[Figure: a biological neuron showing dendrites (input signals), cell nucleus, axon, myelin sheath, and output signals.]
▪ Neurons are interconnected nerve cells that build the nervous system and transmit information throughout the body.
▪ Dendrites are extensions of a nerve cell that receive impulses from other neurons.
▪ The cell nucleus stores the cell's hereditary material and coordinates the cell's activities.
▪ The axon is a nerve fiber used by neurons to transmit impulses.
▪ A synapse is the connection between two nerve cells.
Rise of Artificial Neurons
[Figure: a biological neuron mapped to an artificial neuron.]
▪ Researchers Warren McCulloch and Walter Pitts published their first concept of a simplified brain cell in 1943.
▪ The nerve cell was considered similar to a simple logic gate with binary outputs.
▪ Dendrites can be assumed to process the input signal against a certain threshold, such that if the signal exceeds the threshold, an output signal is generated.
Definition of Artificial Neuron
[Figure: an artificial neuron with inputs x1, x2, ..., xm weighted by w1, w2, ..., wn, a constant input 1 weighted by w0 (bias), a summation unit, and an output.]
Perceptron: The Main Processing Unit
[Figure: the perceptron. Inputs X1 ... Xn with weights W1 ... Wn feed a summation function S, whose result passes through an activation function f(S) to produce the output.]
While the weights determine the slope of the equation, the bias shifts the output line towards the left or right.
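A minimal sketch of one perceptron forward pass; the weights, bias, and step activation are illustrative choices:

import numpy as np

def perceptron(x, w, b):
    s = np.dot(w, x) + b        # summation function: weighted sum plus bias
    return 1 if s > 0 else 0    # activation function: step on the sum

x = np.array([1.0, 0.0, 1.0])   # inputs x1..xn
w = np.array([0.5, -0.6, 0.4])  # weights w1..wn (determine the slope)
b = -0.3                        # bias (shifts the decision boundary left or right)

print(perceptron(x, w, b))      # 1, because 0.5 + 0.4 - 0.3 = 0.6 > 0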
The XOR Problem
A perceptron can learn anything that it can represent, i.e., anything separable with a hyperplane. However, it cannot represent exclusive OR (XOR), since XOR is not linearly separable.
[Figure: the four XOR points plotted on x1 and x2 axes at -1 and 1; no single straight line separates the two classes.]
Multilayer Perceptrons
[Figure: the most common output function, the sigmoid Ψ(a) = 1 / (1 + e^(-a)), an S-shaped curve rising from 0 to 1.]
[Figure: the visual pathway from the LGN through cortical areas V1, V2, V4, and IT, detecting edges and lines, then shapes, then faces and objects.]
▪ The idea of CNNs was neurobiologically motivated by the findings of locally-sensitive and orientation-selective
nerve cells in the visual cortex.
▪ Inventors of CNN designed a network structure that implicitly extracts relevant features.
▪ Convolutional Neural Networks are a special kind of multilayer neural networks.
History of CNN
Success stories: In 1995, Yann LeCun, professor of computer science at New York University, introduced the concept of convolutional neural networks.
The Core Idea Behind CNN
• Local connections
• Layering
• Spatial invariance: represents the capability of CNNs to learn abstractions invariant to size, contrast, rotation, and variation
Few Popular CNNs
▪ LeNet, 1998
▪ AlexNet, 2012
▪ VGGNet, 2014
▪ ResNet, 2015
CNN Architectures
VGGNet
▪ 16 layers
▪ Only 3x3 convolutions
▪ 138 million parameters
ResNet
▪ 152 layers
▪ ResNet50
CNN Applications
[Figure: a network diagram with Input A mapped to Task A and Input B mapped to Task B, sharing layers up to layer n and trained with back-propagation.]
Word Embedding
Word embedding represents the words of a large vocabulary as vectors.
Word embedding techniques:
• Word2vec
• GloVe
Applications of word embedding:
• Music or video recommendation systems
• Analyzing survey responses
Word Embedding: Overview
o Synonym detection (e.g., child and kid)
o Classification of the word: positive, negative, or neutral
[Figure: a word-embedding plot placing related words such as Child, Kid, Man, and Woman near one another.]
Word2vec
• Word2vec is one of the most popular techniques of word embedding.
• Word2vec is a two-layer neural network.
• Input is a text corpus and output is a set of vectors.
• Two flavors of the algorithm: Continuous Bag-of-Words (CBOW) and Skip-Gram.
[Figure: a word-embedding plot with MALE and FEMALE word clusters.]
Word2vec
The core concept of the Word2vec approach is to predict a word from the given neighboring words, or to predict a neighboring word from the given word, which is likely to capture the contextual meaning of the word.
[Figure: a focus word with context words on either side.]
Word2vec Algorithms
• Continuous Bag-of-Words (CBOW)
• Skip-Gram
[Figure: both architectures relate the focus word w(t) to its context words w(t-2), w(t-1), w(t+1), and w(t+2).]
Skip-Gram Model: Example
[Figure: Skip-Gram example. The one-hot encoded focus word w(t) from the sentence fragment "brown fox jumps over the" is fed to a neural network (or any other probabilistic model), which predicts the context words w(t-2), w(t-1), w(t+1), and w(t+2).]
CBOW Model
The Continuous Bag-of-Words (CBOW) algorithm is used to predict the target word from the given context.
[Figure: CBOW architecture. The context words w(t-2), w(t-1), w(t+1), and w(t+2) are summed to predict the target word w(t).]
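A minimal gensim sketch of both flavors; the sg flag selects Skip-Gram (sg=1) or CBOW (sg=0). Parameter names follow gensim 4.x, and the toy corpus is illustrative:

from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "brown", "dog", "jumps", "over", "the", "fox"],
]

# CBOW: predict the target word from the surrounding context (sg=0)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Skip-Gram: predict the context words from the focus word (sg=1)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv.most_similar("fox", topn=3))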
Word2vec: Advantages
Problem Statement: In the vector space model, entities are transformed into a vector representation. Based on the coordinate points, we can apply techniques to find the most similar points in the vector space. Create a word-to-vector model which gives you a similar word for "happy".
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Doc2vec Model
[Figure: the paragraph-vector architecture. A paragraph matrix and word vectors W are averaged or concatenated and fed to a classifier that predicts the next word.]
• This algorithm may not be the ideal choice for corpora with many misspellings, such as tweets.
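A minimal gensim Doc2Vec sketch (gensim 4.x parameter names; the two tagged documents are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["mary", "had", "a", "little", "lamb"], tags=["d1"]),
    TaggedDocument(words=["twinkle", "twinkle", "little", "star"], tags=["d2"]),
]

# Learn a paragraph vector for each tagged document alongside the word vectors
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen document
print(model.infer_vector(["little", "star"])[:5])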
Topic Modeling
Applications: HR, search engines, document sorting.
Principal Component Analysis (PCA)
01 Standardization: standardize the range of continuous variables.
02 Covariance matrix computation: understand how the variables vary from the mean.
03 Eigenvector and eigenvalue computation: determine the principal components of the data.
04 Feature vector: find the principal components in the order of significance.
05 Recasting: recast the data along the principal component axes.
Principal Component Analysis: Steps
After standardization is done, all the variables will be on the same scale.
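A minimal scikit-learn sketch of the same pipeline: standardize, then let PCA handle the covariance, eigenvector, and recasting steps (the toy data is illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Step 01: standardization puts all variables on the same scale
X_std = StandardScaler().fit_transform(X)

# Steps 02-04: covariance matrix, eigenvectors/eigenvalues, and component ranking
pca = PCA(n_components=2)

# Step 05: recast the data along the principal component axes
X_pca = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)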
Topic modeling is an unsupervised approach that involves techniques such as:
• TF-IDF
• Non-negative matrix factorization
• Latent Dirichlet Allocation (LDA)
• Latent Semantic Analysis (LSA)
Applications include:
• Document clustering
• Information retrieval
[Figure: documents grouped under Topic 1, Topic 2, Topic 3, and Topic 4.]
Latent Dirichlet Allocation (LDA)
LDA is a matrix factorization technique.
[Figure: a document-term matrix, e.g., D2 = "Mary had a little lamb" → 0 1 1 1 0 0 0 and D3 = "Twinkle twinkle little star" → 0 0 1 0 1 0 2, with columns t1, t2, t3, ... for the 5000 terms/words (t).]
Problem: there are too many parameters to extract information from, so the task is to reduce the number of parameters without losing information.
Latent Dirichlet Allocation: Example
[Figure: 1000 documents (d) are linked to topics/latent variables (z): z1, z2, z3, ... via P(z|d), the probability of topic z given document d; the topics are linked to 5000 terms/words (t): t1, t2, t3, ... via P(t|z), the probability of term t given topic z.]
The LDA model factorizes the document-term matrix into two smaller matrices:
• M1: 1000 documents (d1, d2, ..., dn) by topics (z1, z2, ..., zn), whose entries are P(z|d), the probability of topic z given document d.
• M2: topics (z1, z2, ..., zn) by 5000 terms/words (t1, t2, ..., tn), whose entries are P(t|z), the probability of term t given topic z.
Latent Dirichlet Allocation: Example
Comparing the LDA model with the bag-of-words model:
• Bag-of-Words model: P(t|d) over 1000 documents (d) and 5000 terms/words (t), i.e., about 50 lakh (5 million) parameters.
• LDA model: P(z|d) and P(t|z) over 1000 documents, a small set of topics (z1, z2, z3, ...), and 5000 terms, i.e., about 60 thousand parameters.
Word Analogies
An analogy question is one that finds the relationship between words.
Subtract the first vector from the second in the first word pair; adding the difference to a word from a second pair retrieves its analogue (for example, vector("woman") - vector("man") + vector("king") ≈ vector("queen")).
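A minimal gensim sketch of an analogy query with pretrained vectors; the model name "glove-wiki-gigaword-50" is one option available through gensim's downloader and is downloaded on first use:

import gensim.downloader as api

# Load pretrained word vectors
wv = api.load("glove-wiki-gigaword-50")

# vector("king") - vector("man") + vector("woman") ~= vector("queen")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))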
Gensim
Gensim is open-source.
System requirements:
• Operating system: macOS / OS X, Linux, Windows
• Python version: >= 2.7
Dependencies:
• NumPy >= 1.11.3
• SciPy >= 0.18.1
• Six >= 1.5.0
• smart_open >= 1.2.1
>>> import gensim
Gensim: Vectorization
#Gensim library
#Load Gensim
from gensim import corpora

#sample corpus (illustrative; the slide assumes a `documents` list is already defined)
documents = [
    "e-learning platform for machine learning courses",
    "feature engineering on text data",
    "courses for natural language processing",
]

#text processing: lowercase and split each document into tokens
texts = [[word for word in document.lower().split()]
         for document in documents]

#convert into dictionary (token -> id mapping)
dictionary = corpora.Dictionary(texts)

#document to convert into a vector
new_doc = "ed-tech company for e-learning courses"

#document to bag-of-words conversion: list of (token_id, count) pairs
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)
Gensim: Topic Modeling
#Gensim library
#Loading gensim
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
#create a corpus from a list of text
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
#Train the model
lda = LdaModel(common_corpus, num_topics=10)
#new corpus of unseen documents
other_texts = [
['data', 'unstructured', 'time'],
['bigdata', 'intelligence', 'natural'],
['language', 'machine', 'computer']
]
other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
unseen_doc = other_corpus[0]
#get topic probability distribution for a document
vector = lda[unseen_doc]
print(vector)
Gensim: Topic Modeling
Output:
[(0, 0.050000038), (1, 0.5499996), (2, 0.050000038), (3, 0.05000004), (4, 0.050000038),
(5, 0.050000038), (6, 0.05000004), (7, 0.05000004), (8, 0.05000004), (9, 0.050000038)]
Gensim: Text Summarization
#Gensim library (gensim.summarization was removed in gensim 4.0, so this needs gensim < 4.0)
from gensim.summarization import summarize
#illustrative input; the slide assumes text_to_summarize is already defined
text_to_summarize = ("AI is transforming the job market. "
                     "2.3 million jobs will be created in the AI field by 2020 (Source: Gartner). "
                     "Many of these roles will require new machine learning and data skills.")
summary = summarize(text_to_summarize)
print("Summarized text:\n", summary)
Gensim: Text Summarization
Output:
Summarized text:
2.3 million jobs will be created in the AI field by 2020 (Source: Gartner)
Identify Topics from News Items
Problem Statement: Identifying the documents relevant to a domain or keyword is a tough task. Write a script that extracts the important topics from the news data.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Working of Word Analogies
Problem Statement: Apply the word analogies technique using Word2vec to identify a new, related word.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Build Your Own News Search Engine
Objective: Use text feature engineering (TF-IDF) and some rules to make
our first search engine for news articles. For any input query, we’ll
present the five most relevant news articles.
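A minimal sketch of the idea: represent the articles and the query as TF-IDF vectors and rank articles by cosine similarity (the sample articles are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "AI will create 2.3 million jobs by 2020",
    "New electric car models announced this year",
    "Machine learning improves search relevance",
]

vectorizer = TfidfVectorizer(stop_words="english")
article_vecs = vectorizer.fit_transform(articles)

query_vec = vectorizer.transform(["machine learning jobs"])

# Rank articles by cosine similarity to the query and keep the top five
scores = cosine_similarity(query_vec, article_vecs).ravel()
for i in scores.argsort()[::-1][:5]:
    print(round(scores[i], 3), articles[i])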
Knowledge Check 1
How many bigrams can be generated from the given sentence?
"Simplilearn is a great source to learn machine learning"
a. 7
b. 8
c. 9
d. 10
Answer: b. 8 (the sentence has 9 words, which yields 8 adjacent word pairs)

Knowledge Check 4
What is the purpose of topic modeling?
a. Feature engineering
d. Vectorization