Lesson 2 Feature Engineering On Text Data
Topics: N-grams, TF-IDF

N-Gram
N-grams are combinations of adjacent words or letters of length n in the source text.
An n-gram is a group (contiguous sequence) of n words or characters:
• n = 1: Unigram
• n = 2: Bigram
• n = 3: Trigram
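As an illustration (not from the original slides), a minimal sketch of generating word-level n-grams in Python:

# Minimal word-level n-gram generator (illustrative sketch)
def ngrams(text, n):
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "Simplilearn is a great source to learn machine learning"
print(ngrams(sentence, 2))  # bigrams: 8 pairs for this 9-word sentence
print(ngrams(sentence, 3))  # trigrams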
Applications:
• Text comparison
• Information retrieval
• Automatic text categorization
• Autocomplete
Bag-of-Words
Example usage: processed data such as documents, tweets, and review comments are represented as an unordered collection of words.
Bag-of-Words
The Bag-of-Words model is a way of extracting features from text and representing text data when modeling the text with a machine learning algorithm.
01 Tokenization: while creating the bag of words, the tokenized words of each observation are used.

02 Process:
• Collect the data
• Create a vocabulary by listing all unique words
• Create document vectors after scoring

03 Scoring mechanism:
• Word hashing
• TF-IDF
• Boolean value
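As a sketch of this process, assuming scikit-learn is available (the slides do not prescribe a library), a bag-of-words matrix can be built with CountVectorizer:

# Bag-of-Words with scikit-learn's CountVectorizer (illustrative sketch)
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I have a little daughter",
    "Mary had a little lamb",
    "Twinkle twinkle little star",
    "The silence of lambs",
]

# Tokenizes the documents, builds the vocabulary of unique words,
# and scores each document by raw word counts
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary
print(bow.toarray())                       # document vectors (term counts)

Note that CountVectorizer lowercases, drops one-character tokens by default, and does not stem, so its vocabulary differs slightly from the stemmed terms shown in the example that follows.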
Bag-of-Words: Example
After stemming, "Twinkle twinkle little star" becomes "twinkl", "littl", "star", and "The silence of lambs" becomes "silenc", "lamb". The raw token lists are difficult to compare, and multiple occurrences of a word are difficult to handle, so the documents are scored in a term matrix.

Term Matrix

Sentence                        daughter  lamb  littl  mari  star  silenc  twinkl
I have a little daughter            1      0      1     0     0      0       0
Mary had a little lamb              0      1      1     1     0      0       0
Twinkle twinkle little star         0      0      1     0     1      0       2
The silence of lambs                0      1      0     0     0      1       0
Example:
• "Cost" occurs more frequently in an economy-related document. To overcome this limitation of raw counts, TF-IDF is used, which assigns weights to words based on their relevance in the document.
TF-IDF
Term presence in each document (12 terms, t1 to t12):

        t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  t11  t12
Doc1     1    1    1    1    1    1    1    0    0    0    0    0
Doc2     0    0    1    0    1    1    1    1    1    0    0    0
Doc3     0    0    1    1    0    1    0    1    0    1    1    1
DF       1    1    3    2    2    3    2    2    1    1    1    1

DF is the document frequency: the number of documents that contain the term. Dividing each entry by the term's document frequency down-weights terms that appear in many documents:

Doc1   1/1  1/1  1/3  1/2  1/2  1/3  1/2  0/2  0/1  0/1  0/1  0/1
Doc2   0/1  0/1  1/3  0/2  1/2  1/3  1/2  1/2  1/1  0/1  0/1  0/1
Doc3   0/1  0/1  1/3  1/2  0/2  1/3  0/2  1/2  0/1  1/1  1/1  1/1
Term Frequency
• Highlights the words or terms that are unique to a document
• These words are better for characterizing the document
TF-IDF

TF = Term Frequency
IDF = Inverse Document Frequency

TF(t, d) = count(t, d) / |d|

where count(t, d) is the count of term 't' in document 'd' and |d| is the total number of terms in document 'd'. IDF(t) = log(N / DF(t)), where N is the total number of documents and DF(t) is the number of documents containing 't'. The TF-IDF weight of a term in a document is TF(t, d) × IDF(t).
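A minimal sketch of TF-IDF weighting with scikit-learn (an assumed library choice; note that scikit-learn uses a smoothed variant of the IDF formula above):

# TF-IDF weighting with scikit-learn's TfidfVectorizer (illustrative sketch)
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "I have a little daughter",
    "Mary had a little lamb",
    "Twinkle twinkle little star",
    "The silence of lambs",
]

# Each term is weighted by TF * IDF, so document-specific words stand out
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))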
[Illustration: one-hot vectors over the vocabulary (daughter, lamb, littl, mari, star, silenc, twinkl), e.g., littl = (0 0 1 0 0 0 0), silenc = (0 0 0 0 0 1 0), twinkl = (0 0 0 0 0 0 1).]
Word2vec
Word2vec is one of the most popular techniques of word embedding. It is a two-layer neural network: the input is a text corpus and the output is a set of vectors.

Two flavors of the algorithm:
• Continuous Bag-of-Words (CBOW)
• Skip-Gram
Word2vec
The core concept of the Word2vec approach is to predict a word from its neighboring words, or to predict the neighboring words from a given word, which is likely to capture the contextual meaning of the word.
[Diagram: a focus word surrounded by its context words.]
Word2vec Algorithms
Continuous Bag-of-Words (CBOW) and Skip-Gram

[Diagram: the focus word w(t) and its context window w(t-2), w(t-1), w(t+1), w(t+2).]
Skip-Gram Model: Example

[Diagram: the one-hot encoded focus word w(t), "jumps", is fed to a neural network (or any other probabilistic model) that predicts the one-hot encoded context words w(t-2) "brown", w(t-1) "fox", w(t+1) "over", and w(t+2) "the".]
CBOW Model

The Continuous Bag-of-Words (CBOW) algorithm is used to predict the target word from the given context.

[Diagram: the context words w(t-2), w(t-1), w(t+1), w(t+2) are summed and used to predict the focus word w(t).]
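A minimal gensim sketch of both flavors (parameter names assume gensim 4.x; the toy corpus is illustrative):

# Training Word2vec with gensim (illustrative sketch, gensim 4.x API)
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["mary", "had", "a", "little", "lamb"],
    ["twinkle", "twinkle", "little", "star"],
]

# sg=1 trains a Skip-Gram model; sg=0 (the default) trains CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["fox"])                  # the learned vector for "fox"
print(model.wv.most_similar("little"))  # words with the most similar vectors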
Word2vec: Advantages
Problem Statement: In the vector space model, entities are transformed into vector representations. Based on the coordinate points, we can apply techniques to find the most similar points in the vector space. Create a word-to-vector model that gives you similar words for "happy".
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Doc2vec Model
[Diagram: a classifier is built on top of the paragraph matrix and the word matrices (W), which are averaged or concatenated.]

• This algorithm may not be the ideal choice for a corpus with lots of misspellings, such as tweets.
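A minimal gensim Doc2vec sketch (gensim 4.x API assumed; the tagged documents are illustrative):

# Doc2vec with gensim (illustrative sketch, gensim 4.x API)
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document is tagged with an id so the model learns a vector per document
docs = [
    TaggedDocument(words=["mary", "had", "a", "little", "lamb"], tags=["d1"]),
    TaggedDocument(words=["twinkle", "twinkle", "little", "star"], tags=["d2"]),
    TaggedDocument(words=["the", "silence", "of", "the", "lambs"], tags=["d3"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen document and find the most similar training document
vec = model.infer_vector(["little", "lamb"])
print(model.dv.most_similar([vec]))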
Principal Component Analysis (PCA)
01 Standardization: standardize the range of the continuous variables.
02 Covariance matrix computation: understand how the variables vary from the mean.
03 Eigenvector and eigenvalue computation: determine the principal components of the data.
04 Feature vector: select the principal components in the order of significance.
05 Recasting: recast the data along the principal component axes.
Principal Component Analysis: Steps
After standardization is done, all the variables will be on the same scale
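A minimal sketch of these steps with scikit-learn (an assumed library choice; the small feature matrix is made up for illustration):

# PCA with standardization using scikit-learn (illustrative sketch)
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy feature matrix: rows are observations, columns are continuous variables
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

# Step 1: standardization puts all variables on the same scale
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: PCA computes the covariance matrix, its eigenvectors and eigenvalues,
# keeps the components in order of significance, and recasts the data onto them
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)  # significance of each principal component
print(X_pca)                          # data recast along the principal component axes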
Topic Modeling

Topic modeling is an unsupervised approach that maps documents to topics (e.g., Topic 1, Topic 2, Topic 3, Topic 4). It involves techniques such as:
• TF-IDF
• Non-negative matrix factorization
• Latent Dirichlet Allocation (LDA)
• Latent Semantic Analysis (LSA)

Applications include:
• Document clustering
• Information retrieval
Latent Dirichlet Allocation (LDA)
LDA is a matrix factorization technique. It converts the document-term matrix into two lower-dimensional matrices, M1 and M2:
• M1 is a document-topic matrix.
• M2 is a topic-term matrix.
Latent Dirichlet Allocation: Example
[Illustration: the document-term matrix for the corpus, e.g., row D2 = "Mary had a little lamb" and row D3 = "Twinkle twinkle little star", with counts over the terms t1, t2, t3, t4, ..., for a vocabulary of 5000 terms/words (t).]

Problem: there are too many parameters to extract information from, so the task is to reduce the number of parameters without losing information.
Latent Dirichlet Allocation: Example
[Illustration: the 1000 documents (d) are linked to topics/latent variables (z), e.g., z1, z2, z3, through P(z|d), the probability of topic z given document d; the topics are linked to the 5000 terms/words (t) through P(t|z), the probability of term t given topic z.]
LDA Model

[Illustration: M1 is the document-topic matrix, with the 1000 documents d1 ... dn as rows and the topics z1 ... zn as columns; each cell holds P(z|d), the probability of topic z given document d. M2 is the topic-term matrix, with the topics z1 ... zn as rows and the 5000 terms/words t1 ... tn as columns; each cell holds P(t|z), the probability of term t given topic z.]
Latent Dirichlet Allocation: Example
[Illustration: comparison of the Bag-of-Words model and the LDA model for 1000 documents (d) and 5000 terms/words (t). The Bag-of-Words model estimates P(t|d) directly, which requires 1000 × 5000 = 50 lakh (5,000,000) parameters. The LDA model goes through the topics (z), which requires only about 60 thousand parameters (e.g., with 10 topics: 1000 × 10 + 10 × 5000 = 60,000).]
Topic Modeling
Applications: HR, search engines, and document sorting.
Gensim
Gensim: Introduction
Gensim is open-source.

System requirements:
• Operating system: macOS / OS X, Linux, Windows
• Python version: Python >= 2.7
• Dependencies:
  • NumPy >= 1.11.3
  • SciPy >= 0.18.1
  • Six >= 1.5.0
  • smart_open >= 1.2.1

>>> import gensim
Gensim: Vectorization
#Gensim library
#Load gensim
from gensim import corpora

#Example corpus (assumed here; any list of raw text documents works)
documents = [
    "ed-tech company provides e-learning courses",
    "machine learning courses for data science",
    "online company offers courses for e-learning",
]

#Text processing: lowercase and tokenize each document
texts = [
    [word for word in document.lower().split()]
    for document in documents
]

#Convert into a dictionary mapping each unique token to an integer id
dictionary = corpora.Dictionary(texts)

#Document to convert into a vector
new_doc = "ed-tech company for e-learning courses"

#Document to bag-of-words conversion: a list of (token_id, count) pairs
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)
Gensim: Topic Modeling
#Gensim library
#Loading gensim
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
#create a corpus from a list of text
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
#Train the model
lda = LdaModel(common_corpus, num_topics=10)
#new corpus of unseen documents
other_texts = [
['data', 'unstructured', 'time'],
['bigdata', 'intelligence', 'natural'],
['language', 'machine', 'computer']
]
other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
unseen_doc = other_corpus[0]
#get topic probability distribution for a document
vector = lda[unseen_doc]
print(vector)
Gensim: Topic Modeling
Output:
[(0, 0.050000038), (1, 0.5499996), (2, 0.050000038), (3, 0.05000004), (4, 0.050000038),
(5, 0.050000038), (6, 0.05000004), (7, 0.05000004), (8, 0.05000004), (9, 0.050000038)]
Word Embedding

[Illustration: word embedding maps a large vocabulary of words to dense vectors.]
Word embedding techniques:
• Word2vec
• GloVe

Applications of word embedding:
• Music or video recommendation systems
• Analyzing survey responses
Word Embedding: Overview

[Illustration: similar words such as "child" and "kid" lie close together in the embedding space.]
Problem Statement: Identifying documents for a domain or keyword is a tough task. Write a script that extracts the important topics from the news data.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Working of Word Analogies

Problem Statement: Apply the word analogies technique using word2vec to identify the next word.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
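A minimal sketch of the word analogies idea with gensim's pretrained vectors (the model name below is an assumption; any pretrained word-vector set works):

# Word analogies with pretrained word vectors (illustrative sketch)
import gensim.downloader as api

# Downloads a small pretrained vector set on first use (assumed model name)
wv = api.load("glove-wiki-gigaword-50")

# Analogy: vector("king") - vector("man") + vector("woman") is closest to "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))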
Build Your Own News Search Engine
Objective: Use text feature engineering (TF-IDF) and some rules to make
our first search engine for news articles. For any input query, we’ll
present the five most relevant news articles.
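A minimal sketch of such a search engine, assuming scikit-learn and a made-up mini-corpus of news articles:

# TF-IDF news search engine (illustrative sketch)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus of news articles
articles = [
    "stock markets rally as tech shares surge",
    "local team wins the national football championship",
    "new vaccine shows promise in clinical trials",
    "central bank raises interest rates to curb inflation",
    "scientists discover water ice on the moon",
    "film festival opens with record attendance",
]

vectorizer = TfidfVectorizer(stop_words="english")
article_vectors = vectorizer.fit_transform(articles)

def search(query, top_k=5):
    # Score every article against the query by cosine similarity of TF-IDF vectors
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, article_vectors).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(articles[i], round(float(scores[i]), 3)) for i in ranked]

print(search("interest rates and the stock market"))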
Knowledge Check

1. How many bigrams can be generated from the given sentence?
"Simplilearn is a great source to learn machine learning"

a. 7
b. 8
c. 9
d. 10

Answer: b. 8 (a 9-word sentence yields 9 - 1 = 8 bigrams).
Knowledge Check

2. What is the purpose of topic modeling?

a. Feature engineering
d. Vectorization