Lesson 2: Feature Engineering on Text Data
Explain N-gram
Techniques covered: N-grams, TF-IDF, Levenshtein distance.
[Figure: word-embedding plot placing the words Girl, Boy, Male, and Female in vector space.]
N-grams are combinations of adjacent words or letters of length n in the source text.
An n-gram is a group (contiguous sequence) of n words or characters. For the sentence "This is a sentence":
• n = 1 (Unigram): "This", "is", "a", "sentence"
• n = 2 (Bigram): "This is", "is a", "a sentence"
• n = 3 (Trigram): "This is a", "is a sentence"
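A minimal sketch of n-gram extraction in Python (the helper name ngrams and the sample sentence are illustrative):

def ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is a sentence".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams: ('This', 'is'), ('is', 'a'), ('a', 'sentence')
print(ngrams(tokens, 3))  # trigrams: ('This', 'is', 'a'), ('is', 'a', 'sentence')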
Probability Calculation: Bigram Model
In a bigram model, the probability of a word given the previous word is estimated as P(w_i | w_(i-1)) = count(w_(i-1) w_i) / count(w_(i-1)).
Applications:
• Text comparison
• Information retrieval
• Automatic text categorization
• Autocomplete
Bag-of-Words
Example usage: processed data such as documents, tweets, and review comments is treated as an unordered collection of words.
Bag-of-Words
The bag-of-words model is a way of extracting features from text and representing text data when modeling the text with a machine learning algorithm.
01 Tokenization: while creating the bag of words, the tokenized words of each observation are used.
02 Process:
• Collect data
• Create a vocabulary by listing all unique words
• Create document vectors after scoring
03 Scoring mechanism:
• Word hashing
• TF-IDF
• Boolean value
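A minimal sketch of this process using scikit-learn's CountVectorizer with count scoring (the sample documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# Collect data (illustrative documents)
docs = [
    "I have a little daughter",
    "Mary had a little lamb",
    "Twinkle twinkle little star",
]

# Create a vocabulary of unique words and score each document by word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary (scikit-learn >= 1.0)
print(X.toarray())                         # document vectors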
Bag-of-Words: Example
Stemmed tokens: "Twinkle twinkle little star" → "twinkl", "littl", "star"; "The silence of lambs" → "silenc", "lamb".
Limitations: documents are difficult to compare, and multiple occurrences of a word are difficult to handle.

Term or Word                  daughter  lamb  littl  mari  star  silenc  twinkl
I have a little daughter         1       0     1     0     0      0       0
Mary had a little lamb           0       1     1     1     0      0       0
Twinkle twinkle little star      0       0     1     0     1      0       2
The silence of lambs             0       1     0     0     0      1       0
Document-Term Matrix
• Represents documents in rows and terms in columns
• Follows the bag-of-words approach
Document-Term Matrix Calculation
01 Create a matrix of n x m (n >= 1, m >= 1), where n is the number of documents and m is the number of unique terms.
02 Assign the count of each term against the respective document.
Document-Term Matrix: Applications
• Improving search results
• Finding topics
Document-Term Matrix: Example
Example:
Doc1  1 1 1 1 1 1 1 0 0 0 0 0
Doc2  0 0 1 0 1 1 1 1 1 0 0 0
Doc3  0 0 1 1 0 1 0 1 0 1 1 1

Compare documents with the dot product: the element-wise product of Doc1 and Doc2 is
      0 0 1 0 1 1 1 0 0 0 0 0
and summing these values gives the dot product.
Document-Term Matrix: Analyze Dot Product—Example
• The higher the dot product, the more similar the documents are.
• Flaw: a document pair with very different words may end up with the same dot product as a document pair that is very similar.
Dot product of Doc1 and Doc2 = 4. Normalizing by the vector magnitudes gives the cosine similarity:
4 / (sqrt(7) * sqrt(6)) ≈ 0.62
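A minimal sketch of both computations on the Doc1 and Doc2 rows above:

import math

doc1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
doc2 = [0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0]

# Dot product: sum of the element-wise products
dot = sum(a * b for a, b in zip(doc1, doc2))
print(dot)  # 4

# Normalizing by the vector magnitudes gives the cosine similarity
cosine = dot / (math.sqrt(sum(a * a for a in doc1)) * math.sqrt(sum(b * b for b in doc2)))
print(round(cosine, 3))  # 4 / (sqrt(7) * sqrt(6)) ≈ 0.617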
Example:
• "Cost" occurs more frequently in an economy-related document. To overcome this limitation, TF-IDF is used, which assigns weights to words based on their relevance in the document.
TF-IDF
Doc1  1 1 1 1 1 1 1 0 0 0 0 0
Doc2  0 0 1 0 1 1 1 1 1 0 0 0
Doc3  0 0 1 1 0 1 0 1 0 1 1 1

Document frequency (number of documents containing each term):
      1 1 3 2 2 3 2 2 1 1 1 1

Dividing each term count by that term's document frequency:
Doc1  1/1 1/1 1/3 1/2 1/2 1/3 1/2 0/2 0/1 0/1 0/1 0/1
Doc2  0/1 0/1 1/3 0/2 1/2 1/3 1/2 1/2 1/1 0/1 0/1 0/1
Doc3  0/1 0/1 1/3 1/2 0/2 1/3 0/2 1/2 0/1 1/1 1/1 1/1

• This highlights the words or terms that are unique to a document.
• These words are better for characterizing the document.
TF-IDF
TF = Term Frequency
IDF = Inverse Document Frequency
TF(t, d) = count(t, d) / |d|
where count(t, d) is the count of term t in document d, and |d| is the total number of terms in document d.
A common form of IDF is IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t; the final weight is TF x IDF.
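A minimal sketch using the TF definition above together with the common IDF variant log(N / df(t)); the toy corpus is illustrative, and real libraries apply additional smoothing:

import math
from collections import Counter

docs = [
    "i have a little daughter".split(),
    "mary had a little lamb".split(),
    "twinkle twinkle little star".split(),
]
N = len(docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)   # TF = count(t, d) / |d|
    idf = math.log(N / df[term])      # IDF = log(N / df(t))
    return tf * idf

print(tf_idf("little", docs[0]))    # 0.0 -> appears in every document
print(tf_idf("daughter", docs[0]))  # higher weight -> unique to this document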
Levenshtein Distance
Applications:
• String matching
• RNA/DNA sequencing
• Spell-checking
• Plagiarism detection
Levenshtein Distance: Example
Strings to compare: HYUNDAI (word 1) and HONDA (word 2).
01 Create a matrix of n x m, where n is the character length of word 1 and m is the character length of word 2.
Initialize the first row with 0..n and the first column with 0..m:

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
  H  1
  O  2
  N  3
  D  4
  A  5
Steps 3 to 5: fill the matrix column by column. After processing i = 1, 2, 3 (columns H, Y, U), the matrix contains:

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
  H  1  0  1  2
  O  2  1  1  2
  N  3  2  2  2
  D  4  3  3  3
  A  5  4  4  4
Levenshtein Distance Calculation: Example
Strings to compare: s = HYUNDAI, t = HONDA
Step 1:
• Set n to be the length of s
• Set m to be the length of t
• If n = 0, return m and exit
• If m = 0, return n and exit
• Construct a matrix containing 0..m rows and 0..n columns

Filling continues column by column for i = 4 to 7; after the last column (i = 7) the matrix is:

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
  H  1  0  1  2  3  4  5  6
  O  2  1  1  2  3  4  5  6
  N  3  2  2  2  2  3  4  5
  D  4  3  3  3  3  2  3  4
  A  5  4  4  4  4  3  2  3
Levenshtein Distance Calculation: Example
Step 2: The matrix is filled from the upper-left to the lower-right corner.
Step 3: Set cell d[i,j] of the matrix equal to the minimum of:
a. The cell immediately above plus 1: d[i-1,j] + 1
b. The cell immediately to the left plus 1: d[i,j-1] + 1
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost
Step 4: The number in the lower-right corner is the Levenshtein distance between the two words.

        H  Y  U  N  D  A  I
     0  1  2  3  4  5  6  7
  H  1  0  1  2  3  4  5  6
  O  2  1  1  2  3  4  5  6
  N  3  2  2  2  2  3  4  5
  D  4  3  3  3  3  2  3  4
  A  5  4  4  4  4  3  2  3

Levenshtein distance between HYUNDAI and HONDA = 3
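A minimal Python sketch of the same dynamic-programming procedure (the function name is illustrative):

def levenshtein(s, t):
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    # Matrix with 0..m rows and 0..n columns, first row/column initialized to 0..n and 0..m
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        d[0][j] = j
    for i in range(m + 1):
        d[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[j - 1] == t[i - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # cell above + 1
                          d[i][j - 1] + 1,          # cell to the left + 1
                          d[i - 1][j - 1] + cost)   # diagonal + cost
    return d[m][n]  # number in the lower-right corner

print(levenshtein("HYUNDAI", "HONDA"))  # 3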
One-Hot Encoding
Each term is represented by a vector with a 1 in its own position and 0 in every other position:
littl   0 0 1 0 0 0 0
silenc  0 0 0 0 0 1 0
twinkl  0 0 0 0 0 0 1
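A minimal sketch over the stemmed vocabulary from the earlier example (the vocabulary order is assumed):

vocab = ["daughter", "lamb", "littl", "mari", "star", "silenc", "twinkl"]

def one_hot(term):
    # 1 in the term's own position, 0 everywhere else
    return [1 if term == v else 0 for v in vocab]

print(one_hot("littl"))   # [0, 0, 1, 0, 0, 0, 0]
print(one_hot("silenc"))  # [0, 0, 0, 0, 0, 1, 0]
print(one_hot("twinkl"))  # [0, 0, 0, 0, 0, 0, 1]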
Biological Neuron vs. Artificial Neuron
Biological Neurons
[Figure: a biological neuron showing dendrites (input signals), cell nucleus, axon, myelin sheath, and output signals.]
▪ Neurons are interconnected nerve cells that build the nervous system and transmit information throughout the body.
▪ Dendrites are extensions of a nerve cell that receive impulses from other neurons.
▪ The cell nucleus stores the cell's hereditary material and coordinates the cell's activities.
▪ The axon is a nerve fiber used by neurons to transmit impulses.
▪ A synapse is the connection between two nerve cells.
Rise of Artificial Neurons
[Figure: a biological neuron mapped to an artificial neuron.]
▪ Researchers Warren McCulloch and Walter Pitts published their first concept of a simplified brain cell in 1943.
▪ The nerve cell was considered similar to a simple logic gate with binary outputs.
▪ Dendrites can be assumed to process the input signal against a certain threshold, such that if the signal exceeds the threshold, an output signal is generated.
Definition of Artificial Neuron
[Figure: an artificial neuron with inputs x1, x2, ..., xm weighted by w1, w2, ..., wn, a constant input 1 weighted by w0 (bias), a summation unit, and an output.]
Perceptron: The Main Processing Unit
[Figure: the perceptron. Inputs X1 ... Xn with weights W1 ... Wn feed a summation function S, whose result passes through an activation function f(S) to produce the output.]
While the weights determine the slope of the equation, the bias shifts the output line towards the left or right.
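A minimal sketch of one perceptron forward pass; the weights, bias, and step activation are illustrative choices:

import numpy as np

def perceptron(x, w, b):
    s = np.dot(w, x) + b        # summation function: weighted sum plus bias
    return 1 if s > 0 else 0    # activation function: step on the sum

x = np.array([1.0, 0.0, 1.0])   # inputs x1..xn
w = np.array([0.5, -0.6, 0.4])  # weights w1..wn (determine the slope)
b = -0.3                        # bias (shifts the decision boundary left or right)

print(perceptron(x, w, b))      # 1, because 0.5 + 0.4 - 0.3 = 0.6 > 0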
The XOR Problem
A perceptron can learn anything that it can represent, i.e., anything separable with a hyperplane. However, it cannot represent exclusive OR (XOR), since XOR is not linearly separable.
[Figure: the four XOR points plotted on x1 and x2 axes at -1 and 1; no single straight line separates the two classes.]
Multilayer Perceptrons
[Figure: the most common output function, the sigmoid Ψ(a) = 1 / (1 + e^(-a)), an S-shaped curve rising from 0 to 1.]
[Figure: the visual pathway from the LGN through cortical areas V1, V2, V4, and IT, detecting edges and lines, then shapes, then faces and objects.]
▪ The idea of CNNs was neurobiologically motivated by the findings of locally-sensitive and orientation-selective
nerve cells in the visual cortex.
▪ Inventors of CNN designed a network structure that implicitly extracts relevant features.
▪ Convolutional Neural Networks are a special kind of multilayer neural networks.
History of CNN
Success stories: In 1995, Yann LeCun, professor of computer science at New York University, introduced the concept of convolutional neural networks.
The Core Idea Behind CNN
• Local connections
• Layering
• Spatial invariance: represents the capability of CNNs to learn abstractions invariant to size, contrast, rotation, and variation
Few Popular CNNs
▪ LeNet, 1998
▪ AlexNet, 2012
▪ VGGNet, 2014
▪ ResNet, 2015
CNN Architectures
VGGNet
▪ 16 layers
▪ Only 3x3 convolutions
▪ 138 million parameters
ResNet
▪ 152 layers
▪ ResNet50
CNN Applications
[Figure: a network diagram with Input A mapped to Task A and Input B mapped to Task B, sharing layers up to layer n and trained with back-propagation.]
Word Embedding
Word embedding represents the words of a large vocabulary as vectors.
Word embedding techniques:
• Word2vec
• GloVe
Applications of word embedding:
• Music or video recommendation systems
• Analyzing survey responses
Word Embedding: Overview
o Synonym detection (e.g., child and kid)
o Classification of the word: positive, negative, or neutral
[Figure: a word-embedding plot placing related words such as Child, Kid, Man, and Woman near one another.]
Word2vec
• Word2vec is one of the most popular techniques of word embedding.
• Word2vec is a two-layer neural network.
• Input is a text corpus and output is a set of vectors.
• Two flavors of the algorithm: Continuous Bag-of-Words (CBOW) and Skip-Gram.
[Figure: a word-embedding plot with MALE and FEMALE word clusters.]
Word2vec
The core concept of the Word2vec approach is to predict a word from the given neighboring words, or to predict a neighboring word from the given word, which is likely to capture the contextual meaning of the word.
[Figure: a focus word with context words on either side.]
Word2vec Algorithms
• Continuous Bag-of-Words (CBOW)
• Skip-Gram
[Figure: both architectures relate the focus word w(t) to its context words w(t-2), w(t-1), w(t+1), and w(t+2).]
Skip-Gram Model: Example
[Figure: Skip-Gram example. The one-hot encoded focus word w(t) from the sentence fragment "brown fox jumps over the" is fed to a neural network (or any other probabilistic model), which predicts the context words w(t-2), w(t-1), w(t+1), and w(t+2).]
CBOW Model
The Continuous Bag-of-Words (CBOW) algorithm is used to predict the target word from the given context.
[Figure: CBOW architecture. The context words w(t-2), w(t-1), w(t+1), and w(t+2) are summed to predict the target word w(t).]
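A minimal gensim sketch of both flavors; the sg flag selects Skip-Gram (sg=1) or CBOW (sg=0). Parameter names follow gensim 4.x, and the toy corpus is illustrative:

from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "brown", "dog", "jumps", "over", "the", "fox"],
]

# CBOW: predict the target word from the surrounding context (sg=0)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Skip-Gram: predict the context words from the focus word (sg=1)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv.most_similar("fox", topn=3))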
Word2vec: Advantages
Problem Statement: In the vector space model, entities are transformed into a vector representation. Based on the coordinate points, we can apply techniques to find the most similar points in the vector space. Create a word-to-vector model which gives you a similar word for "happy".
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Doc2vec Model
[Figure: the paragraph-vector architecture. A paragraph matrix and word vectors W are averaged or concatenated and fed to a classifier that predicts the next word.]
• This algorithm may not be the ideal choice for corpora with many misspellings, such as tweets.
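A minimal gensim Doc2Vec sketch (gensim 4.x parameter names; the two tagged documents are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["mary", "had", "a", "little", "lamb"], tags=["d1"]),
    TaggedDocument(words=["twinkle", "twinkle", "little", "star"], tags=["d2"]),
]

# Learn a paragraph vector for each tagged document alongside the word vectors
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen document
print(model.infer_vector(["little", "star"])[:5])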
Topic Modeling
Applications: HR, search engines, document sorting.
Principal Component Analysis (PCA)
01 Standardization: standardize the range of continuous variables.
02 Covariance matrix computation: understand how the variables vary from the mean.
03 Eigenvector and eigenvalue computation: determine the principal components of the data.
04 Feature vector: find the principal components in the order of significance.
05 Recasting: recast the data along the principal component axes.
Principal Component Analysis: Steps
After standardization is done, all the variables will be on the same scale.
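A minimal scikit-learn sketch of the same pipeline: standardize, then let PCA handle the covariance, eigenvector, and recasting steps (the toy data is illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Step 01: standardization puts all variables on the same scale
X_std = StandardScaler().fit_transform(X)

# Steps 02-04: covariance matrix, eigenvectors/eigenvalues, and component ranking
pca = PCA(n_components=2)

# Step 05: recast the data along the principal component axes
X_pca = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)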
Topic modeling is an unsupervised approach that involves techniques such as:
• TF-IDF
• Non-negative matrix factorization
• Latent Dirichlet Allocation (LDA)
• Latent Semantic Analysis (LSA)
Applications include:
• Document clustering
• Information retrieval
[Figure: documents grouped under Topic 1, Topic 2, Topic 3, and Topic 4.]
Latent Dirichlet Allocation (LDA)
LDA is a matrix factorization technique.
[Figure: a document-term matrix, e.g., D2 = "Mary had a little lamb" → 0 1 1 1 0 0 0 and D3 = "Twinkle twinkle little star" → 0 0 1 0 1 0 2, with columns t1, t2, t3, ... for the 5000 terms/words (t).]
Problem: there are too many parameters to extract information from, so the task is to reduce the number of parameters without losing information.
Latent Dirichlet Allocation: Example
[Figure: 1000 documents (d) are linked to topics/latent variables (z): z1, z2, z3, ... via P(z|d), the probability of topic z given document d; the topics are linked to 5000 terms/words (t): t1, t2, t3, ... via P(t|z), the probability of term t given topic z.]
The LDA model factorizes the document-term matrix into two smaller matrices:
• M1: 1000 documents (d1, d2, ..., dn) by topics (z1, z2, ..., zn), whose entries are P(z|d), the probability of topic z given document d.
• M2: topics (z1, z2, ..., zn) by 5000 terms/words (t1, t2, ..., tn), whose entries are P(t|z), the probability of term t given topic z.
Latent Dirichlet Allocation: Example
Comparing the LDA model with the bag-of-words model:
• Bag-of-Words model: P(t|d) over 1000 documents (d) and 5000 terms/words (t), i.e., about 50 lakh (5 million) parameters.
• LDA model: P(z|d) and P(t|z) over 1000 documents, a small set of topics (z1, z2, z3, ...), and 5000 terms, i.e., about 60 thousand parameters.
Word Analogies
An analogy question is one that finds the relationship between words.
Subtract the first vector from the second in the first word pair; adding the difference to a word from a second pair retrieves its analogue (for example, vector("woman") - vector("man") + vector("king") ≈ vector("queen")).
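A minimal gensim sketch of an analogy query with pretrained vectors; the model name "glove-wiki-gigaword-50" is one option available through gensim's downloader and is downloaded on first use:

import gensim.downloader as api

# Load pretrained word vectors
wv = api.load("glove-wiki-gigaword-50")

# vector("king") - vector("man") + vector("woman") ~= vector("queen")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))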
Gensim
Gensim is open-source.
System requirements:
• Operating system: macOS / OS X, Linux, Windows
• Python version: >= 2.7
Dependencies:
• NumPy >= 1.11.3
• SciPy >= 0.18.1
• Six >= 1.5.0
• smart_open >= 1.2.1
>>> import gensim
Gensim: Vectorization
#Gensim library
#Load Gensim
from gensim import corpora

#sample corpus (illustrative; the slide assumes a `documents` list is already defined)
documents = [
    "e-learning platform for machine learning courses",
    "feature engineering on text data",
    "courses for natural language processing",
]

#text processing: lowercase and split each document into tokens
texts = [[word for word in document.lower().split()]
         for document in documents]

#convert into dictionary (token -> id mapping)
dictionary = corpora.Dictionary(texts)

#document to convert into a vector
new_doc = "ed-tech company for e-learning courses"

#document to bag-of-words conversion: list of (token_id, count) pairs
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)
Gensim: Topic Modeling
#Gensim library
#Loading gensim
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
#create a corpus from a list of text
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
#Train the model
lda = LdaModel(common_corpus, num_topics=10)
#new corpus of unseen documents
other_texts = [
['data', 'unstructured', 'time'],
['bigdata', 'intelligence', 'natural'],
['language', 'machine', 'computer']
]
other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
unseen_doc = other_corpus[0]
#get topic probability distribution for a document
vector = lda[unseen_doc]
print(vector)
Gensim: Topic Modeling
Output:
[(0, 0.050000038), (1, 0.5499996), (2, 0.050000038), (3, 0.05000004), (4, 0.050000038),
(5, 0.050000038), (6, 0.05000004), (7, 0.05000004), (8, 0.05000004), (9, 0.050000038)]
Gensim: Text Summarization
#Gensim library (gensim.summarization was removed in gensim 4.0, so this needs gensim < 4.0)
from gensim.summarization import summarize
#illustrative input; the slide assumes text_to_summarize is already defined
text_to_summarize = ("AI is transforming the job market. "
                     "2.3 million jobs will be created in the AI field by 2020 (Source: Gartner). "
                     "Many of these roles will require new machine learning and data skills.")
summary = summarize(text_to_summarize)
print("Summarized text:\n", summary)
Gensim: Text Summarization
Output:
Summarized text:
2.3 million jobs will be created in the AI field by 2020 (Source: Gartner)
Identify Topics from News Items
Problem Statement: Identifying the documents relevant to a domain or keyword is a tough task. Write a script that extracts the important topics from the news data.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Working of Word Analogies
Problem Statement: Apply the word analogies technique using Word2vec to identify a new, related word.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Build Your Own News Search Engine
Objective: Use text feature engineering (TF-IDF) and some rules to make
our first search engine for news articles. For any input query, we’ll
present the five most relevant news articles.
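A minimal sketch of the idea: represent the articles and the query as TF-IDF vectors and rank articles by cosine similarity (the sample articles are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "AI will create 2.3 million jobs by 2020",
    "New electric car models announced this year",
    "Machine learning improves search relevance",
]

vectorizer = TfidfVectorizer(stop_words="english")
article_vecs = vectorizer.fit_transform(articles)

query_vec = vectorizer.transform(["machine learning jobs"])

# Rank articles by cosine similarity to the query and keep the top five
scores = cosine_similarity(query_vec, article_vecs).ravel()
for i in scores.argsort()[::-1][:5]:
    print(round(scores[i], 3), articles[i])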
Knowledge Check 1
How many bigrams can be generated from the given sentence?
"Simplilearn is a great source to learn machine learning"
a. 7
b. 8
c. 9
d. 10
Answer: b. 8 (the sentence has 9 words, which yields 8 adjacent word pairs)

Knowledge Check 4
What is the purpose of topic modeling?
a. Feature engineering
d. Vectorization