
Natural Language Processing

Feature Engineering on Text Data


Learning Objectives

By the end of this lesson, you will be able to:

Explain N-gram

Demonstrate the different word embedding models

Perform operations on word analogies

Demonstrate the working of Bag-of-Words

Demonstrate the working of the topic modeling technique


Feature Extraction
What Is Feature Extraction?

Clean Data → Feature Extraction → Model

Computers do not have any standard representation of words.

Once the text is cleaned and normalized, it needs to be transformed into features
which can be used for modeling.
Feature Extraction Techniques

The feature extraction technique depends on the kind of model that is intended to be used.

• Bag-of-Words
• N-Gram
• Document-Term Matrix
• TF-IDF
• Levenshtein Distance


N-Gram
N-Gram: Introduction

N-grams are combinations of adjacent words or letters of length n in the source text.

• An N-gram is a group (contiguous sequence) of n words or characters, with n >= 1.
• An N-gram model is a probabilistic model of word sequences: it assigns probabilities
  to sequences of words.
• P(w | h) is the probability of word w given the history h.

n = 1  Unigram
n = 2  Bigram
n = 3  Trigram
. . .
n = n  N-Gram
N-Gram: Example

Example: "This is a sentence"

n = 1 (Unigram): This | is | a | sentence

n = 2 (Bigram): This is | is a | a sentence

n = 3 (Trigram): This is a | is a sentence
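
The same grouping can be reproduced in a few lines of Python. The helper below is a
minimal sketch (the function name ngrams is illustrative, not from a specific library):

# Minimal sketch: generate n-grams from a list of tokens (plain Python).
def ngrams(tokens, n):
    # slide a window of length n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is a sentence".split()
print(ngrams(tokens, 1))  # [('This',), ('is',), ('a',), ('sentence',)]
print(ngrams(tokens, 2))  # [('This', 'is'), ('is', 'a'), ('a', 'sentence')]
print(ngrams(tokens, 3))  # [('This', 'is', 'a'), ('is', 'a', 'sentence')]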
Probability Calculation: Bigram Model

A bigram model approximates the probability of a word by conditioning only on the
immediately preceding word:

P(wn | w1, w2, w3, ...., wn-1) ≈ P(wn | wn-1)

Example: P(of | This is a sentence) ≈ P(of | sentence)
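
The bigram probabilities themselves can be estimated from raw counts. The snippet below
is a minimal sketch on a tiny made-up corpus (the corpus and the helper name bigram_prob
are illustrative only):

# Minimal sketch: estimate P(w_n | w_n-1) as count(w_n-1, w_n) / count(w_n-1).
from collections import Counter

corpus = ["this is a sentence", "this is another sentence"]
unigram_counts = Counter()
bigram_counts = Counter()
for line in corpus:
    words = line.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))  # adjacent word pairs

def bigram_prob(prev_word, word):
    # conditional probability of `word` following `prev_word`
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("is", "a"))  # 1/2 = 0.5 in this toy corpus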


N-Gram: Applications

• Text Comparison
• Spelling Error Detection
• Spelling Error Correction
• Information Retrieval
• Automatic Text Categorization
• Autocomplete
Bag-of-Words
Bag-of-Words

1. Used to perform document-level tasks

2. Is a vectorization technique to represent text data

3. Ignores grammar and the order of words in a sentence

Example usage: Sentiment Analysis, Spam Detection


Bag-of-Words

Processed data (a document, a tweet, or review comments) is represented as an
unordered collection of words.
Bag-of-Words

Bag-of-Words model is the way of extracting features from text and representing the
text data, while modeling the text with a machine learning algorithm.

01 Tokenization:
   While creating the bag of words, the tokenized words of each observation are used.

02 Process:
   • Collect data
   • Create a vocabulary by listing all unique words
   • Create document vectors after scoring

03 Scoring mechanism:
   • Word hashing
   • TF-IDF
   • Boolean value
Bag-of-Words: Example

Apply text processing to the corpus (D), a set of documents:

I have a little daughter     →  "littl", "daughter"
Mary had a little lamb       →  "mari", "littl", "lamb"
Twinkle twinkle little star  →  "twinkl", "littl", "star"
The silence of lambs         →  "silenc", "lamb"

Comparing raw documents directly is inefficient and difficult; multiple occurrences
of a word are also difficult to handle.
Bag-of-Words: Example

From the corpus (D), a set of documents:

I have a little daughter
Mary had a little lamb
Twinkle twinkle little star
The silence of lambs

Collect the unique words to form the vocabulary (V):

"littl", "daughter", "mari", "lamb", "twinkl", "star", "silenc"
Bag-of-Words: Vector Representation Example

Term frequency (the frequency of a term or word occurrence in a document) is recorded
for each document of the corpus (D), giving the document-term matrix:

                              daughter  lamb  littl  mari  star  silenc  twinkl
I have a little daughter          1       0     1     0     0      0       0
Mary had a little lamb            0       1     1     1     0      0       0
Twinkle twinkle little star       0       0     1     0     1      0       2
The silence of lambs              0       1     0     0     0      1       0
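
The same kind of document-term matrix can be produced automatically. The snippet below
is a minimal sketch using scikit-learn's CountVectorizer (this assumes scikit-learn is
installed; the already-stemmed tokens from the slide are reused as the corpus, and the
columns come out in alphabetical order):

# Minimal sketch: bag-of-words counts with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "littl daughter",
    "mari littl lamb",
    "twinkl twinkl littl star",
    "silenc lamb",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # vocabulary (matrix columns); use get_feature_names() on older versions
print(X.toarray())                         # one row of term counts per document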
Bag-of-Words: Recap of Terms Used

Term: Each processed word is called a term.

Term Frequency: Frequency of a term's occurrence in a document.

Document-Term Matrix: Matrix showing the frequency of each term's occurrence in the documents.
Document-Term Matrix
Document-Term Matrix

• Represents the frequency of words in a set of documents
• Creates a numerical representation of documents
• Represents documents in rows and terms in columns
• Follows the bag-of-words approach

Document-Term Matrix Calculation

01 Create a matrix of n x m, where n >= 1 is the number of documents and
   m >= 1 is the number of unique terms.

02 Assign the count of each term against the respective document.
Document-Term Matrix: Applications

• Improving Search Results
• Text Analysis
• Finding Topics
Document-Term Matrix: Example

Example:

Doc 1: Random forest is an ensemble learning method
Doc 2: Ensemble method is a machine learning technique
Doc 3: Machine learning is an application of AI

       random  forest  is  an  ensemble  learning  method  machine  technique  application  of  ai
Doc1      1       1     1   1      1         1        1       0         0           0        0   0
Doc2      0       0     1   0      1         1        1       1         1           0        0   0
Doc3      0       0     1   1      0         1        0       1         0           1        1   1
Document-Term Matrix: Example

       random  forest  is  an  ensemble  learning  method  machine  technique  application  of  ai
Doc1      1       1     1   1      1         1        1       0         0           0        0   0
Doc2      0       0     1   0      1         1        1       1         1           0        0   0

Compare documents:

1. Based on how many words are in common:

          0       0     1   0      1         1        1       0         0           0        0   0

Dot product: Doc1·Doc2 = Doc1(1)·Doc2(1) + Doc1(2)·Doc2(2) + ... + Doc1(n)·Doc2(n) = 4
Document-Term Matrix: Analyze Dot Product—Example

• The higher the dot product, the more similar the documents.

• Issue with the dot product:
  It captures only the overlap between a document pair and does not take into account
  the terms that are not in common.

• Because of this flaw, a pair of documents with very different words may end up with
  the same dot product as a pair of documents that are very similar.
Document-Term Matrix: Analyze Dot Product—Example

To overcome this, the dot product is normalized into the cosine similarity:

cos(θ) = (Doc1 · Doc2) / (||Doc1|| · ||Doc2||)

       = 4 / (sqrt(7) · sqrt(6)) ≈ 0.617

Completely identical vectors have a cosine similarity of 1.

Completely opposite vectors have a cosine similarity of -1.
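
The same calculation is straightforward with NumPy. The snippet below is a minimal
sketch using the Doc1 and Doc2 count vectors from the document-term matrix above:

# Minimal sketch: dot product and cosine similarity of two count vectors.
import numpy as np

doc1 = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # "Random forest is an ensemble learning method"
doc2 = np.array([0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0])  # "Ensemble method is a machine learning technique"

dot = np.dot(doc1, doc2)                                 # 4 terms in common
cosine = dot / (np.linalg.norm(doc1) * np.linalg.norm(doc2))

print(dot, round(cosine, 3))  # 4 0.617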


TF-IDF
TF-IDF

TF-IDF is the abbreviation of Term Frequency-Inverse Document Frequency.

• Bag-of-Words assumes that each word is equally important.

• In a real-world scenario, each word has its own weight based on the context.

Example:
• "Cost" occurs more frequently in an economy-related document. To overcome this
  limitation, TF-IDF is used, which assigns weights to words based on their relevance
  in the document.
TF-IDF

TF-IDF represents a numerical statistic with two parts:

• Term Frequency (TF)
• Inverse Document Frequency (IDF)

Applications of TF-IDF:

• Text Mining
• User Modeling
TF-IDF: Example

                      random  forest  is  an  ensemble  learning  method  machine  technique  application  of  ai
Doc1                     1       1     1   1      1         1        1       0         0           0        0   0
Doc2                     0       0     1   0      1         1        1       1         1           0        0   0
Doc3                     0       0     1   1      0         1        0       1         0           1        1   1

Document Frequency       1       1     3   2      2         3        2       2         1           1        1   1

Document frequency: the number of documents in which a word occurs.
TF-IDF: Example

Dividing each term frequency by the corresponding document frequency:

                      random  forest  is   an   ensemble  learning  method  machine  technique  application  of   ai
Doc1                    1/1     1/1   1/3  1/2    1/2       1/3      1/2      0/2       0/1         0/1      0/1  0/1
Doc2                    0/1     0/1   1/3  0/2    1/2       1/3      1/2      1/2       1/1         0/1      0/1  0/1
Doc3                    0/1     0/1   1/3  1/2    0/2       1/3      0/2      1/2       0/1         1/1      1/1  1/1

Document Frequency       1       1     3    2      2         3        2        2         1           1        1    1
TF-IDF: Example

                      random  forest  is   an   ensemble  learning  method  machine  technique  application  of   ai
Doc1                     1       1    1/3  1/2    1/2       1/3      1/2       0         0           0        0    0
Doc2                     0       0    1/3   0     1/2       1/3      1/2      1/2        1           0        0    0
Doc3                     0       0    1/3  1/2     0        1/3       0       1/2        0           1        1    1

The resulting weight:
• Is proportional to the frequency of occurrence of a word or term in a document
• Is inversely proportional to the number of documents in which the word or term occurs
TF-IDF: Example

                      random  forest  is   an   ensemble  learning  method  machine  technique  application  of   ai
Doc1                     1       1    1/3  1/2    1/2       1/3      1/2       0         0           0        0    0
Doc2                     0       0    1/3   0     1/2       1/3      1/2      1/2        1           0        0    0
Doc3                     0       0    1/3  1/2     0        1/3       0       1/2        0           1        1    1

The resulting weight:
• Highlights the words or terms that are unique to a document
• These words are better for characterizing the document
TF-IDF

TF-IDF = TF(t, d) * IDF(t, D)

where t is a term, d is a document, and D is the collection of documents.
TF  = Term Frequency
IDF = Inverse Document Frequency

TF(t, d) = count(t, d) / |d|
  count(t, d): count of term t in document d
  |d|: total number of terms in document d

IDF(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
  |D|: total number of documents in the collection D
  |{d ∈ D : t ∈ d}|: number of documents in which t is present
TF-IDF

Term Frequency (TF)

TF measures how frequently a term occurs in a document.
TF(t, d) = Number of times t appears in document d / Total number of terms in document d

Inverse Document Frequency (IDF)

IDF measures how important a term is.
IDF(t) = log_e( Total number of documents / Number of documents with term t in it )

TF-IDF = TF(t, d) * IDF(t)

where t is a term and d is a document.
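
The two formulas above translate directly into code. The snippet below is a minimal
sketch using only the standard library; it follows the plain TF and IDF definitions
given here, so the numbers will differ slightly from scikit-learn's smoothed
TfidfVectorizer:

# Minimal sketch: TF-IDF computed by hand on the three example documents.
import math

docs = [
    "random forest is an ensemble learning method",
    "ensemble method is a machine learning technique",
    "machine learning is an application of ai",
]
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    # frequency of the term within one document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # log of (number of documents / number of documents containing the term)
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

print(tf_idf("random", tokenized[0], tokenized))  # rare term: non-zero weight (~0.157)
print(tf_idf("is", tokenized[0], tokenized))      # appears in every document: 0.0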
Levenshtein Distance
Levenshtein Distance

Characteristics of Levenshtein distance:

• It is a string metric for measuring the difference between two sequences.
• It is the minimum number of single-character edits needed to change one string into the other.
• It is also referred to as edit distance.
• The greater the Levenshtein distance, the more different the strings.

Three operations are performed during the calculation:
• Insertion
• Deletion
• Substitution

Applications of Levenshtein distance:
• String matching
• Spell checking
• Plagiarism detection
Levenshtein Distance: Applications

• String matching
• RNA/DNA sequencing
• Spell checking
• Remote location update
• Plagiarism detection
Levenshtein Distance: Example

Distance between Singing and Singing = 0
  Both strings are exactly the same.

Distance between Singing and Ringing = 1
  Singing -> Ringing [Replace 'S' with 'R']

Distance between Sleep and Slip = 2
  Sleep -> Slep [Remove one 'e']
  Slep  -> Slip [Replace 'e' with 'i']

Distance between Kitten and Sitting = 3
  Kitten -> Sitten  [Replace 'K' with 'S']
  Sitten -> Sittin  [Replace 'e' with 'i']
  Sittin -> Sitting [Add 'g' at the end]
Levenshtein Distance Calculation

01 Create a matrix of n x m, where n is the character length of word1 and
   m is the character length of word2.

02 Assign the distance between the two strings based on the rules listed below.
Levenshtein Distance Calculation: Example

Strings to compare: s = HYUNDAI, t = HONDA

Step 1 and 2: Initialize the matrix — the first row is 0..n (under H Y U N D A I) and
the first column is 0..m (beside H O N D A).

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
H    1
O    2
N    3
D    4
A    5

Steps 3 to 5: The matrix is then filled one column at a time (i = 1, 2, 3, ...), applying
the rules below to each cell. After the first three columns:

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
H    1  0  1  2
O    2  1  1  2
N    3  2  2  2
D    4  3  3  3
A    5  4  4  4
Levenshtein Distance Calculation: Example

Strings to compare: s = HYUNDAI, t = HONDA

Set n to be the length of s
Set m to be the length of t
If n = 0, return m and exit
If m = 0, return n and exit
Construct a matrix containing 0..m rows and 0..n columns

Initialize the first row to 0..n
Initialize the first column to 0..m

Examine each character of s (i from 1 to n)
Examine each character of t (j from 1 to m)

If s[i] equals t[j], the cost is 0
If s[i] doesn't equal t[j], the cost is 1

Set cell d[i,j] of the matrix equal to the minimum of:
  a. The cell immediately above plus 1: d[i-1,j] + 1
  b. The cell immediately to the left plus 1: d[i,j-1] + 1
  c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost
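
The rules above translate directly into a short Python function. This is a minimal
sketch of the matrix-based calculation (the function name levenshtein is illustrative):

# Minimal sketch: Levenshtein distance via the dynamic-programming matrix.
def levenshtein(s, t):
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    # (m+1) x (n+1) matrix; first row 0..n, first column 0..m
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        d[0][j] = j
    for i in range(m + 1):
        d[i][0] = i
    for i in range(1, m + 1):          # characters of t (rows)
        for j in range(1, n + 1):      # characters of s (columns)
            cost = 0 if s[j - 1] == t[i - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # cell above + 1
                          d[i][j - 1] + 1,         # cell to the left + 1
                          d[i - 1][j - 1] + cost)  # diagonal cell + cost
    return d[m][n]

print(levenshtein("HYUNDAI", "HONDA"))   # 3
print(levenshtein("Kitten", "Sitting"))  # 3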
Levenshtein Distance Calculation: Example

The remaining columns are filled in the same way (i = 4, 5, 6, 7). After the last column
is processed, the matrix is complete, as shown on the next slide.
Levenshtein Distance Calculation: Example

Distance calculation between HYUNDAI and HONDA

1. The matrix is initialized, measuring in the (m, n) cell.

2. The matrix is filled from the upper-left to the lower-right corner.

3. Each cell d[i,j] of the matrix is set to the minimum of:
   a. The cell immediately above plus 1: d[i-1,j] + 1
   b. The cell immediately to the left plus 1: d[i,j-1] + 1
   c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost

4. The number in the lower-right corner is the Levenshtein distance between the two words.

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
H    1  0  1  2  3  4  5  6
O    2  1  1  2  3  4  5  6
N    3  2  2  2  2  3  4  5
D    4  3  3  3  3  2  3  4
A    5  4  4  4  4  3  2  3

Levenshtein distance (HYUNDAI, HONDA) = 3
One-Hot Encoding
One-Hot Encoding

1. Used for deeper analysis of text

2. Produces a numerical representation of each word

3. Used for categorical data

4. The higher the number of distinct categorical values, the higher the sparsity
One-Hot Encoding

How does it work?

• Treats each word as a class
• Assigns the vector value 1 at the position where the particular word is present
  and 0 at all other positions
One-Hot Encoding: Example

         daughter  lamb  littl  mari  star  silenc  twinkl
lamb        0        1     0     0     0      0       0
littl       0        0     1     0     0      0       0
silenc      0        0     0     0     0      1       0
twinkl      0        0     0     0     0      0       1
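
A one-hot vector is simply a zero vector with a single 1 at the word's position in the
vocabulary. The snippet below is a minimal sketch using NumPy and the vocabulary from
the example above (the helper name one_hot is illustrative):

# Minimal sketch: one-hot encoding words against a fixed vocabulary.
import numpy as np

vocab = ["daughter", "lamb", "littl", "mari", "star", "silenc", "twinkl"]

def one_hot(word, vocabulary):
    vec = np.zeros(len(vocabulary), dtype=int)  # all zeros ...
    vec[vocabulary.index(word)] = 1             # ... except at the word's index
    return vec

print(one_hot("lamb", vocab))    # [0 1 0 0 0 0 0]
print(one_hot("twinkl", vocab))  # [0 0 0 0 0 0 1]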
Biological Neuron vs. Artificial Neuron
Biological Neurons

(Diagram: dendrites, cell nucleus, axon, myelin sheath, and axon terminals; input signals
arrive at the dendrites and output signals leave through the axon terminals.)

▪ Neurons are interconnected nerve cells that build the nervous system and transmit information throughout
  the body.
▪ Dendrites are extensions of a nerve cell that receive impulses from other neurons.
▪ The cell nucleus stores the cell's hereditary material and coordinates the cell's activities.
▪ The axon is a nerve fiber that is used by neurons to transmit impulses.
▪ A synapse is the connection between two nerve cells.
Rise of Artificial Neurons

▪ Researchers Warren McCulloch and Walter Pitts published their first concept of a simplified
  brain cell in 1943.
▪ The nerve cell was considered similar to a simple logic gate with binary outputs.
▪ Dendrites can be assumed to process the input signal against a certain threshold, such that if the
  signal exceeds the threshold, the output signal is generated.
Definition of Artificial Neuron

An artificial neuron is analogous to a biological neuron: each neuron takes inputs,
weights them separately, sums them up, and passes this sum through a transfer function
to produce a nonlinear output.
Biological Neurons and Artificial Neurons: A Comparison

Biological Neurons          Artificial Neurons
Cell nucleus                Node
Dendrites                   Inputs
Synapse                     Weights or interconnections
Axon                        Output
Neural Networks
Perceptron

▪ Single-layer neural network

▪ Consists of weights, the summation processor, and an activation function

(Diagram: inputs X1..Xm with weights W1..Wn and a bias input of 1 with weight W0 feed the
net input (summation) function, which is passed through the activation function to
produce the output.)
Perceptron: The Main Processing Unit

(Diagram: the weighted inputs are combined by the summation function into S, and the
activation function f(S) produces the output.)

Note: Inputs X and weights W are real values.
Weights and Biases in a Perceptron

While the weights determine the slope of the equation, the bias shifts the output line
towards the left or right.

(Diagram: the same summation function S and activation function f(S) as above, with a
bias term added to the weighted sum.)
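
The summation and activation steps can be written out in a few lines. The snippet below
is a minimal sketch of a single perceptron's forward pass (the input values, weights, and
bias are made-up illustrative numbers, not learned parameters):

# Minimal sketch: forward pass of one perceptron with a step activation.
import numpy as np

def step(s):
    # threshold activation: fire (1) if the weighted sum is non-negative
    return 1 if s >= 0 else 0

x = np.array([1.0, 0.5, -0.3])   # inputs X
w = np.array([0.4, -0.2, 0.7])   # weights W
b = 0.1                          # bias shifts the decision boundary

s = np.dot(w, x) + b             # summation function
print(step(s))                   # activation function applied to the sum -> 1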
The XOR Problem

A perceptron can learn anything that it can represent, i.e., anything separable with a
hyperplane. However, it cannot represent Exclusive OR, since XOR is not linearly separable.

(Diagram: the four XOR points plotted on the x1/x2 plane at ±1 cannot be split by a
single straight line.)
Multilayer Perceptrons

(Diagram: input nodes feed one or more layers of hidden units (hidden layers), which feed
the output neurons; the most common output function is the sigmoid Ψ(a).)
Convolutional Neural Net (CNN)
Human Visual System and CNN

(Diagram: in the visual pathway, signals travel from the LGN through V1, V2, V4, and IT,
detecting edges and lines first, then shapes, and finally faces and objects.)

▪ The idea of CNNs was neurobiologically motivated by the findings of locally-sensitive and orientation-selective
  nerve cells in the visual cortex.
▪ The inventors of CNNs designed a network structure that implicitly extracts relevant features.
▪ Convolutional Neural Networks are a special kind of multilayer neural network.
History of CNN

In 1995, Yann LeCun, professor of computer science at New York University, introduced
the concept of convolutional neural networks.
The Core Idea Behind CNN

Local Connections: Represent how sets of neurons in a cluster are connected to each
other, which in turn represents a set of features.

Layering: Represents the hierarchy in the features that are learned.

Spatial Invariance: Represents the capability of CNNs to learn abstractions invariant of
size, contrast, rotation, and variation.
Few Popular CNNs

▪ LeNet, 1998
▪ AlexNet, 2012
▪ VGGNet, 2014
▪ ResNet, 2015
CNN Architectures

VGGNet

▪ 16 layers
▪ Only 3*3 convolutions
▪ 138 million parameters

ResNet

▪ 152 layers
▪ ResNet50
CNN Applications

(Diagram: a network trained on Task A with Input A can be reused for Task B with Input B
by freezing the weights of the first n layers (AnB: frozen weights) and back-propagating
only through the remaining layers.)

Transfer Learning and Fine-Tuning          Feature Extraction
Word Embedding
Word Embedding

Word embeddings are used while working with individual words or phrases, for example in:

• Text Generation
• Machine Translation
• Large vocabularies

Word Embedding

• It represents text in an N-dimensional space, in the form of vectors.
• These vectors are called embeddings.
• It is a distributed representation: each word is mapped to one real-valued vector.

Word embedding techniques:
• Word2vec
• GloVe

Applications of word embedding:
• Music or video recommendation systems
• Analyzing survey responses
Word Embedding: Overview

• Word embedding represents each word in vector form.

• Some properties must be exhibited while representing a word in vector form:
  o Words with similar meanings should be closer to each other than words which do not
    have similar meanings.
  o Word pairs that differ in the same way (for example, man/woman and king/queen)
    should be separated by similar distances.

• This kind of representation helps in finding:
  o Word analogies
  o Synonyms
  o Classification of a word as positive, negative, or neutral

(Diagram: in the embedding space, "child" lies near "kid", "king" near "queen", and "man"
near "woman", while unrelated words such as "lion" and "school" lie farther away.)
Word2vec
Word2vec

• Word2vec is one of the most popular techniques of word embedding.
• Word2vec is a two-layer neural network.
• Its input is a text corpus and its output is a set of vectors.
• Two flavors of the algorithm:
  • Continuous Bag-of-Words (CBOW)
  • Skip-Gram
Word2vec

The core concept of the Word2vec approach is to predict a word from its neighboring
words, or to predict the neighboring words from a given word, which is likely to capture
the contextual meaning of the word.

Example: "The quick brown fox jumps over the lazy dog"
Focus word: jumps      Context: brown, fox, over, the
Word2vec Algorithms

Word2vec Algorithms

Continuous Bag-of-Words (CBOW): predicts the given (target) word from its neighboring
(context) words.

Skip-Gram: predicts the neighboring (context) words from the given (target) word.
Skip-Gram Model

It is used to predict the source context words given a target word.

(Diagram: the input word w(t) is projected to the output context words w(t-2), w(t-1),
w(t+1), and w(t+2).)
Skip-Gram Model: Example

(Diagram: the focus word "jumps", w(t), is fed in as a one-hot encoded vector to a neural
network (or any other probabilistic model), which predicts the surrounding context words
"brown" w(t-2), "fox" w(t-1), "over" w(t+1), and "the" w(t+2), each as a one-hot encoded
vector.)
CBOW Model

The Continuous Bag-of-Words (CBOW) algorithm is used to predict the target word from the
given context.

(Diagram: the context words w(t-2), w(t-1), w(t+1), and w(t+2) are summed and projected
to predict the target word w(t).)
Word2vec: Advantages

• Ready to be used in deep-learning architectures
• The meaning of a word is distributed across the vector
• Trained vectors can be reused
• Vector size does not grow with the vocabulary
Word2vec Model Creation

Problem Statement: In the vector space model, entities are transformed into vector
representations. Based on the coordinate points, we can apply techniques to find the most
similar points in the vector space. Create a word-to-vector model which gives you words
similar to "happy".

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
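
For reference, training a small Word2vec model in gensim takes only a few lines. The
snippet below is a minimal sketch (it uses gensim's bundled common_texts corpus and the
query word "computer" only so that it runs out of the box; in the lab, your own corpus
and the word "happy" would take their place):

# Minimal sketch: train Word2vec and query similar words (gensim >= 4.0; use size= instead of vector_size= on 3.x).
from gensim.models import Word2Vec
from gensim.test.utils import common_texts

model = Word2Vec(sentences=common_texts, vector_size=50, window=5,
                 min_count=1, sg=1)   # sg=1 selects Skip-Gram, sg=0 selects CBOW

# words most similar to a query word from the training vocabulary
print(model.wv.most_similar("computer", topn=3))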
Doc2vec Model
Doc2vec Model

The following are the uses of the Doc2vec model:

• Creates a numeric representation of a document
• Uses an unsupervised algorithm
• Finds similarity between sentences, paragraphs, and documents

(Diagram: a paragraph id, looked up in the paragraph matrix, is averaged or concatenated
with the word vectors W of "the", "cat", "sat" to feed a classifier that predicts the
next word.)
Doc2vec Model

• It is an extension of the CBOW model.

• It is called the distributed memory version of the paragraph vector.

• This algorithm may not be the ideal choice for a corpus with lots of misspellings,
  such as tweets.
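
A minimal Doc2vec sketch with gensim is shown below (the three tagged sentences and the
query sentence are illustrative; gensim >= 4.0 is assumed, where the document vectors
are exposed as model.dv rather than model.docvecs):

# Minimal sketch: train Doc2vec and find the most similar training document.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["i have a little daughter",
        "mary had a little lamb",
        "twinkle twinkle little star"]
tagged = [TaggedDocument(words=d.split(), tags=[str(i)]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=50)

vec = model.infer_vector("a little lamb".split())   # vector for an unseen sentence
print(model.dv.most_similar([vec], topn=1))         # closest training document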
Topic Modeling
Topic Modeling

Topic modeling is a type of statistical model with the following advantages:

• Discovering the abstract topics in a collection of documents
• Document clustering
• Information retrieval from unstructured text and feature selection
• Organizing large blocks of textual data
Topic Modeling: Industry Use Cases

• HR
• Search Engines
• News Companies
• E-Commerce
• Document Sorting
Principal Component Analysis (PCA)
Principal Component Analysis (PCA)

PCA is a dimensionality reduction method that reduces the number of variables.

01 Standardization:
   Standardize the range of the continuous variables

02 Covariance matrix computation:
   Understand how the variables vary from the mean

03 Eigenvector and eigenvalue computation:
   Determine the principal components of the data

04 Feature vector:
   Find the principal components in the order of significance

05 Recasting:
   Recast the data along the principal component axes
Principal Component Analysis: Steps

Standardization → Covariance Matrix Computation → Eigenvectors and Eigenvalues Computation → Feature Vector
Step 1: Standardization

1. Standardize the range of the continuous variables so that each contributes equally.

2. Otherwise, variables with a higher range will dominate, which will create a bias.

3. After standardization, all the variables are on the same scale.

4. It can be achieved by z = (value - mean) / standard deviation
Step 2: Covariance Matrix Computation

1. The covariance matrix is used to identify the relationships between the variables.

2. Variables that are highly correlated contain redundant information.

3. A covariance matrix of size n x n is calculated, where n is the number of dimensions.
Step 3: Eigenvectors and Eigenvalues Computation

1. This step is used to determine the principal components.

2. New variables are constructed as linear combinations of the initial variables
   and are called principal components.

3. The new variables are less correlated with each other.
Step 4: Feature Vector

1. A decision is taken to keep all the components or to remove the less significant ones.

2. The remaining components form the matrix of vectors, called the feature vector.
Principal Component Analysis

Two-dimensional data transformation after applying PCA:
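
The whole pipeline (standardization followed by PCA) is available in scikit-learn. The
snippet below is a minimal sketch on a small made-up two-dimensional dataset (assumes
scikit-learn is installed):

# Minimal sketch: standardize the data, then project it onto its principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])  # correlated 2-D data

X_std = StandardScaler().fit_transform(X)   # step 1: standardization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)            # covariance, eigenvectors, and recasting handled internally

print(pca.explained_variance_ratio_)        # share of variance captured by each component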


Topic Modeling

Topic modeling is the process of automatically identifying the topics present in a text object.

It is an unsupervised approach that involves techniques such as:
• TF-IDF
• Non-negative matrix factorization
• Latent Dirichlet Allocation (LDA)
• Latent Semantic Analysis (LSA)

Applications include:
• Document clustering
• Information retrieval
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA)

• LDA is a matrix factorization technique.
• Documents are represented as a document-term matrix.
• LDA converts the document-term matrix into two lower-dimensional matrices, M1 and M2.
• M1 is a document-topic matrix.
• M2 is a topic-term matrix.
• For each word w of each document d, the word's topic assignment is updated until the
  convergence point is reached.
Latent Dirichlet Allocation: Example

Bag-of-Words model — document-term matrix for the corpus (D), a set of documents:

                                 daughter  lamb  littl  mari  star  silenc  twinkl
D1  I have a little daughter        1        0     1     0     0      0       0
D2  Mary had a little lamb          0        1     1     1     0      0       0
D3  Twinkle twinkle little star     0        0     1     0     1      0       2
Dn  The silence of lambs            0        1     0     0     0      1       0

For D3, the probabilities of the words occurring in the document are
P(t|d) = 1/4 (littl), 1/4 (star), 2/4 (twinkl).

Number of parameters: for 1 document and 3 words, the number of parameters is 1 * 3 = 3.
Latent Dirichlet Allocation: Example

1000 documents (d)
5000 terms/words (t): t1, t2, t3, t4, ..., t5000

Parameters P(t|d): for 1000 documents and 5000 words, the number of parameters
is 1000 * 5000 = 5,000,000 (50 lakhs).

Problem:
There are too many parameters to extract information from, so the task is to reduce the
number of parameters without losing information.
Latent Dirichlet Allocation: Example

Solution:
Introduce a layer of topics, called the latent variable. A topic is a mix of terms that
is likely to generate the term. Examples: Finance, Science, Sport, etc.

LDA Model

1000 documents (d)
  ↓ P(z|d): probability of topic z given document d
Topics / latent variable (z): z1, z2, z3, ...
  ↓ P(t|z): probability of term t given topic z
5000 terms/words (t): t1, t2, t3, t4, ..., t5000

For 1000 documents, 5000 words, and 10 topics, the number of parameters
is 1000 * 10 + 10 * 5000 = 60,000.
Latent Dirichlet Allocation: Example

LDA Model:

1000 documents (d)
  ↓ P(z|d): probability of topic z given document d
Topics / latent variable (z): z1, z2, z3, ..., zn
  ↓ P(t|z): probability of term t given topic z
5000 terms/words (t): t1, t2, t3, t4, ..., tn

M1 (document-topic matrix): rows d1, d2, d3, d4, ..., dn against columns z1, z2, ..., zn,
holding P(z|d).

M2 (topic-term matrix): rows z1, z2, ..., zn against columns t1, t2, t3, t4, t5, ..., tn,
holding P(t|z).
Latent Dirichlet Allocation: Example

Bag-of-Words model: 1000 documents (d) mapped directly to 5000 terms/words (t),
t1, t2, t3, t4, ..., t5000.
Parameters P(t|d): 50 lakhs (5,000,000)

LDA model: 1000 documents (d) mapped to topics z1, z2, z3, which are mapped to the
5000 terms/words (t).
Parameters: 60 thousand (60,000)
Word Analogies
Word Analogies

An analogy question is one that finds the relationship between words.

Example: man is to woman as king is to ___.

Answer: "queen"

Below is the process of working with word analogies (see the sketch after this list):

Convert each word into a high-dimensional vector

Subtract the first vector from the second in the first word-pair

Add that to the first word in the second word-pair

The word closest to the resultant vector is the solution
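
With pretrained vectors, the analogy arithmetic is a one-liner. The snippet below is a
minimal sketch using gensim's downloader; the model name "glove-wiki-gigaword-50" is an
assumption (any downloadable gensim word-vector model would work), and the first call
downloads the vectors over the internet:

# Minimal sketch: king - man + woman ~= queen with pretrained vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))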


Gensim
Gensim: Introduction

1. Gensim is a free Python library which is platform-independent.

2. It is open-source.

3. It is robust and scalable.

4. It analyzes plain-text documents for semantic structure.

5. It is used to retrieve semantically similar documents.
Gensim: Syntax and Library

>>> import gensim

System requirements:

Operating system: macOS / OS X, Linux, Windows

Python version: Python >= 2.7

Dependencies:
• NumPy >= 1.11.3
• SciPy >= 0.18.1
• Six >= 1.5.0
• smart_open >= 1.2.1
Gensim: Vectorization

# Gensim library
# Load Gensim
from gensim import corpora

# documents for building the vocabulary
documents = ["Simplilearn is an ed-tech company",
             "We provide multiple e-learning courses"]

# text processing
texts = [[word for word in document.lower().split()]
         for document in documents]

# convert into a dictionary
dictionary = corpora.Dictionary(texts)

# document to convert into a vector
new_doc = "ed-tech company for e-learning courses"

# document to bag-of-words conversion
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)
Gensim: Vectorization

Output: [(1, 1), (2, 1), (5, 1), (6, 1)]


Gensim: Topic Modeling

# Gensim library
# Loading gensim
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel

# create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# train the model
lda = LdaModel(common_corpus, num_topics=10)

# new corpus of unseen documents
other_texts = [
    ['data', 'unstructured', 'time'],
    ['bigdata', 'intelligence', 'natural'],
    ['language', 'machine', 'computer']
]
other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
unseen_doc = other_corpus[0]

# get the topic probability distribution for a document
vector = lda[unseen_doc]
print(vector)
Gensim: Topic Modeling

Output:
[(0, 0.050000038), (1, 0.5499996), (2, 0.050000038), (3, 0.05000004), (4, 0.050000038),
(5, 0.050000038), (6, 0.05000004), (7, 0.05000004), (8, 0.05000004), (9, 0.050000038)]
Gensim: Text Summarization

# Gensim library
# load gensim
# Note: the gensim.summarization module is available in gensim 3.x; it was removed in gensim 4.0.
from gensim.summarization import summarize

text_to_summarize = """Artificial intelligence has become a
powerful driving force in a wide range of industries,
helping people and businesses create exciting, innovative
products and services,
enable more informed business decisions, and achieve key
performance goals.
The median salary of an AI engineer in the US is $171,715 (Source: Datamation).
By 2022, the AI market will grow at a CAGR of 53.25 per cent, and an estimated
2.3 million jobs will be created in the AI field by 2020 (Source: Gartner)."""

summary = summarize(text_to_summarize)
print("Summarized text:\n", summary)
Gensim: Text Summarization

Output:
Summarized text:
2.3 million jobs will be created in the AI field by 2020 (Source: Gartner)
Identify Topics from News Items

Problem Statement: Identifying documents for a given domain or keyword is a tough task. Write a
script which will provide the important topics from the news data.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Working of Word Analogies

Problem Statement: Apply the word analogies technique using word2vec to identify the analogous word.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Build Your Own News Search Engine

Objective: Use text feature engineering (TF-IDF) and some rules to make
our first search engine for news articles. For any input query, we’ll
present the five most relevant news articles.

Problem Statement: Reuters Ltd. is an international news agency


headquartered in London and is a division of Thomson Reuters. The
data was originally collected and labeled by Carnegie Group Inc. and
Reuters Ltd. in the course of developing the CONSTRUE text
categorization system. An important step before assessing similarity
between documents, or between documents and a search query, is the
right representation, i.e., correct feature engineering. We will build a
process that provides the most similar news articles to a given text
string (search query).
Key Takeaways

You are now able to:

Explain N-gram

Demonstrate the different word embedding models

Perform operations on word analogies

Demonstrate the working of Bag-of-Words

Demonstrate the working of the topic modeling technique


Knowledge Check
Knowledge Check 1

How many bigrams can be generated from the given sentence?
“Simplilearn is a great source to learn machine learning”

a. 7

b. 8

c. 9

d. 10
Knowledge Check 1

How many bigrams can be generated from the given sentence?
“Simplilearn is a great source to learn machine learning”

a. 7

b. 8

c. 9

d. 10

The correct answer is b


Bigrams: Simplilearn is, is a, a great, great source, source to, to learn, learn machine, machine learning
Knowledge Check 2

The main advantages of a document-term matrix are:

a. Feature engineering

b. Understanding the frequency of word

c. Converting text into vectors

d. All of the above


Knowledge Check 2

The main advantages of a document-term matrix are:

a. Feature engineering

b. Understanding the frequency of word

c. Converting text into vectors

d. All of the above

The correct answer is d


The document-term matrix converts sentences into vectors; this is achieved by creating a matrix
of the unique words in the sentences.
Knowledge Check 3

The highest distance in the Levenshtein approach depicts:

a. More similar words

b. More dissimilar words

c. Cannot decide the distance

d. Depends on the length of words


Knowledge Check 3

The highest distance in the Levenshtein approach depicts:

a. More similar words

b. More dissimilar words

c. Cannot decide the distance

d. Depends on the length of words

The correct answer is b


The highest distance in the Levenshtein approach depicts more dissimilar words.
Knowledge Check 4

What is the purpose of topic modeling?

a. Clustering the documents

b. Converting text into vectors

c. Understanding the frequency of word

d. Vectorization
Knowledge Check 4

What is the purpose of topic modeling?

a. Clustering the documents

b. Converting text into vectors

c. Understanding the frequency of word

d. Vectorization

The correct answer is a


Topic modeling provides the topics that are used to map the documents.
Knowledge Check 5

Which techniques are used to find the similarity between texts?

a. Cosine, Levenshtein, Document-Term Matrix

b. Cosine, Word2vec, Document-Term Matrix

c. POS, Document-Term Matrix, Levenshtein

d. Cosine, Levenshtein, Word2vec, POS


Knowledge Check 5

Which techniques are used to find the similarity between texts?

a. Cosine, Levenshtein, Document-Term Matrix

b. Cosine, Word2vec, Document-Term Matrix

c. POS, Document-Term Matrix, Levenshtein

d. Cosine, Levenshtein, Word2vec, POS

The correct answer is d


Cosine, Levenshtein, Word2vec, and POS are the techniques used to find the similarity between texts.
Thank You
