Text Similarity Cosine BOW TF-IDF Lecture
1. Introduction
Computing the similarity between two text documents is a common task in NLP, with
several practical applications.
Although the two sentences convey a very similar meaning, they are written in completely different ways. In fact, they have just one word in common (“the”), and not a particularly significant one at that. Yet we would want a similarity algorithm to return a high score for this pair.
2. Lexical Similarity Methods
Traditional text similarity methods only work on a lexical level, that is, they only use the words that appear in the sentences. They were mostly developed before the rise of deep learning but can still be used today. They are faster to implement and run, and, depending on the use case, can offer a better trade-off between quality and cost.
3. Document Vectors
The simplest way to build a vector from text is to use word counts. We’ll do this with
three example sentences and then compute their similarity. After that, we’ll go over actual
methods that can be used to compute the document vectors.
D1. We went to the pizza place and you ate no pizza at all.
D2. I ate pizza with you yesterday at home.
D3. There’s no place like home.
To build our vectors, we count the occurrences of each word in each sentence.
Once we have the vectors, we can compare them with a similarity measure: cosine similarity measures the cosine of the angle between two vectors and returns a real value between -1 and 1.
If the vectors only have positive values, as is the case here, the output will lie between 0 and 1. It returns 0 when the two vectors are orthogonal, that is, the documents have no words in common, and 1 when the two vectors are parallel, that is, the documents use the same words in the same proportions:
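As a minimal sketch of this step (plain Python, no external libraries), we can build the raw count vectors for D1–D3 and compute their pairwise cosine similarities. Tokenization here is a simple whitespace split on lowercased text, which is an assumption of the sketch; the exact counts depend on how punctuation and case are handled.

```python
import math
from collections import Counter

docs = [
    "we went to the pizza place and you ate no pizza at all",  # D1
    "i ate pizza with you yesterday at home",                  # D2
    "there's no place like home",                              # D3
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))

def count_vector(doc):
    # Bag-of-words vector: raw count of each vocabulary word in the document.
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v1, v2, v3 = (count_vector(doc) for doc in tokenized)
print(cosine(v1, v2), cosine(v1, v3), cosine(v2, v3))
```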
As we can see, the first two documents have the highest similarity, since they share four words (“pizza”, “ate”, “you”, and “at”). Note that since the word “pizza” appears twice in D1, it contributes more to the similarity than “at”, “ate”, or “you”.
4. TF-IDF Vectors
The idea behind TF-IDF is that we first compute, for each word, the number of documents in which it appears (its document frequency). If a word appears in many documents, it will be less relevant in the computation of the similarity, and vice versa. The inverse document frequency, or IDF, is the base-10 logarithm of the total number of documents divided by the word’s document frequency. Note that if a word were to appear in all three documents, its IDF would be log10(3/3) = 0.
We can compute the IDF just once, as a pre-processing step, for each word in our corpus and
it will tell us how significant that word is in the corpus itself.
At this point, instead of using the raw word counts, we can compute the document vectors by weighting the counts with the IDF. For each document, we compute the count of each word, turn it into a frequency (that is, divide the count by the total number of words in the document), and then multiply it by the word’s IDF.
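A minimal sketch of this weighting scheme, applied to the same three sentences, is shown below. It follows the formula described above (frequency times base-10 IDF); note that library implementations such as scikit-learn’s TfidfVectorizer use slightly different conventions (natural logarithm, smoothing), so their numbers will not match exactly.

```python
import math
from collections import Counter

docs = [
    "we went to the pizza place and you ate no pizza at all",  # D1
    "i ate pizza with you yesterday at home",                  # D2
    "there's no place like home",                              # D3
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))

# IDF: log10(N / df), where df is the number of documents containing the word.
N = len(tokenized)
idf = {w: math.log10(N / sum(w in doc for doc in tokenized)) for w in vocab}

def tfidf_vector(doc):
    counts = Counter(doc)
    total = len(doc)
    # Term frequency (count / document length) weighted by IDF.
    return [counts[w] / total * idf[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = [tfidf_vector(doc) for doc in tokenized]
print(cosine(vectors[0], vectors[1]))  # D1 vs D2
print(cosine(vectors[0], vectors[2]))  # D1 vs D3
print(cosine(vectors[1], vectors[2]))  # D2 vs D3
```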
4.1. Pros and Cons of TF-IDF
The weighting factor provided by this method is a great advantage compared to using
raw word frequencies, and it’s important to note that its usefulness is not limited to the
handling of stopwords.
Every word will have a different weight, and since word usage varies with the topic of
discussion, this weight will be tailored according to what the input corpus is about.
For example, the word “lawyer” could have low importance in a collection of legal documents (in terms of establishing similarities between two of them), while having high importance in a set of news articles. Intuitively this makes sense, because most legal documents will talk about lawyers, but most news articles won’t.
The downside of this method as described is that it doesn’t take into account any semantic
aspect. Two words like “teacher” and “professor”, although similar in meaning, will
correspond to two different dimensions of the resulting document vectors, contributing 0 to
the overall similarity.
In any case, this method, or variations of it, is still very efficient and widely used, for example in ranking search engine results. In this scenario, we can use the similarity between the input query and the candidate documents to rank the most similar ones higher.
5. Word Embeddings
Word embeddings are dense numeric vectors that represent words. We can create them in an unsupervised way from a collection of documents, generally using neural networks, by analyzing all the contexts in which each word occurs.
This results in vectors that are similar (according to cosine similarity) for words that appear
in similar contexts, and thus have a similar meaning. For example, since the words “teacher”
and “professor” can sometimes be used interchangeably, their embeddings will be close
together.
For this reason, using word embeddings can enable us to handle synonyms or words with
similar meaning in the computation of similarity, which we couldn’t do by using word
frequencies.
However, word embeddings are just vector representations of individual words, and there are several ways to combine them when computing the similarity of whole texts.
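One common option, given here only as an illustrative sketch and not as something the text above prescribes, is to average the embeddings of the words in each document and then compare the resulting document vectors with cosine similarity. The lookup table `embedding` below would in practice come from a pretrained model such as word2vec or GloVe; the tiny 3-dimensional vectors used here are made-up placeholders just to keep the example self-contained.

```python
import numpy as np

# Placeholder embeddings: in practice these would come from a pretrained model
# (e.g. word2vec or GloVe); the values below are made up for illustration only.
embedding = {
    "teacher":   np.array([0.8, 0.1, 0.3]),
    "professor": np.array([0.7, 0.2, 0.3]),
    "explains":  np.array([0.1, 0.9, 0.2]),
    "the":       np.array([0.0, 0.1, 0.0]),
    "lesson":    np.array([0.2, 0.8, 0.4]),
    "lecture":   np.array([0.3, 0.7, 0.4]),
}

def doc_vector(text):
    # Average the embeddings of the known words in the document.
    vectors = [embedding[w] for w in text.lower().split() if w in embedding]
    return np.mean(vectors, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

d1 = doc_vector("The teacher explains the lesson")
d2 = doc_vector("The professor explains the lecture")
print(cosine(d1, d2))  # high, even though the documents share few exact words
```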
6. Conclusion
Text similarity is a very active research field, and techniques are continuously evolving and improving. In this lecture, we’ve given an overview of possible ways to implement text similarity, focusing on the Vector Space Model and Word Embeddings.
We’ve seen how methods like TF-IDF can help in weighting terms appropriately, but without taking any semantic aspects into account.
For these reasons, when choosing which method to use, it’s important to always consider our use case and requirements carefully.
Example 1: Using the vector space representation below, rank the documents by computing the Euclidean distance between the points representing the documents and the query “Tropical fish”.
Terms         D1   D2   D3   D4
aquarium       1    1    1    1
bowl           0    0    1    0
care           0    1    0    0
fish           1    1    2    1
fresh water    1    0    0    0
gold fish      0    0    1    0
home page      0    0    0    1
keep           0    0    1    0
setup          0    1    0    0
tank           0    1    0    1
tropical       1    1    1    2
The docs D1, D2, D3, and D4 are represented by the following vectors:
D1 - [1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]
D2 - [1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
D3 - [1, 1, 0, 2, 0, 1, 0, 1, 0, 0, 1]
D4 - [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 2]
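As a minimal sketch of how this exercise can be worked out, the query “Tropical fish” can be encoded over the same vocabulary as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1] (only “fish” and “tropical” are non-zero); this query encoding is my own, not given in the exercise. The Euclidean distances then follow directly, with a smaller distance meaning a higher rank.

```python
import math

vocab = ["aquarium", "bowl", "care", "fish", "fresh water", "gold fish",
         "home page", "keep", "setup", "tank", "tropical"]

docs = {
    "D1": [1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    "D2": [1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
    "D3": [1, 1, 0, 2, 0, 1, 0, 1, 0, 0, 1],
    "D4": [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 2],
}
# Query "Tropical fish": one occurrence each of "fish" and "tropical".
query = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Smaller distance = closer to the query = higher rank.
for name, vec in sorted(docs.items(), key=lambda kv: euclidean(query, kv[1])):
    print(name, round(euclidean(query, vec), 3))
```

With these vectors, D1 is closest to the query (distance √2 ≈ 1.414), D2 and D4 tie at 2.0, and D3 is farthest (≈ 2.236).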
Next, find the ranking of the documents by computing their cosine similarity with D4, given the following term counts:
Terms         D1   D2   D3   D4
the            2    1    1    2
best           2    1    1    2
Italian        1    0    0    0
restaurant     1    1    1    1
enjoy          1    1    1    0
pasta          1    0    0    0
American       0    1    0    1
hamburger      0    1    0    0
Korean         0    0    1    0
bibimbap       0    0    1    1
The docs D1, D2, D3, and D4 are represented by the following vectors:
D1 - [2, 2, 1, 1, 1, 1, 0, 0, 0, 0]
D2 - [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]
D3 - [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]
D4 - [2, 2, 0, 1, 0, 0, 1, 0, 0, 1]
The cosine similarity between a query vector q and a document vector d, over a vocabulary of size |V|, is:

\[
\cos(\vec{q}, \vec{d}) \;=\; \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} \;=\; \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} \;=\; \frac{\sum_{i=1}^{|V|} q_i\, d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^{2}}\;\sqrt{\sum_{i=1}^{|V|} d_i^{2}}}
\]
Doc                                                     TF on BOW                         Cosine similarity with D4
D1: The best Italian restaurant enjoy the best pasta.   [2, 2, 1, 1, 1, 1, 0, 0, 0, 0]    ?
D2: American restaurant enjoy the best hamburger.       [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]    ?
D3: Korean restaurant enjoy the best bibimbap.          [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]    ?
D4: the best the best American restaurant.              [2, 2, 0, 1, 0, 0, 1, 0, 0, 1]    1
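A minimal sketch of the computation, applying the cosine formula above to the bag-of-words vectors in the table; the printed values fill in the question marks.

```python
import math

bow = {
    "D1": [2, 2, 1, 1, 1, 1, 0, 0, 0, 0],
    "D2": [1, 1, 0, 1, 1, 0, 1, 1, 0, 0],
    "D3": [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
    "D4": [2, 2, 0, 1, 0, 0, 1, 0, 0, 1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

for name in ("D1", "D2", "D3"):
    print(name, "vs D4:", round(cosine(bow[name], bow["D4"]), 3))
# D1 vs D4: 0.783, D2 vs D4: 0.739, D3 vs D4: 0.739 -> D1 ranks highest.
```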
(ii) TF-IDF weighting:

\[
\mathrm{idf}_t = \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right)
\]

The tf-idf weight of a term is the product of its tf weight and its idf weight:

\[
w_{t,d} = \left(1 + \log_{10} \mathrm{tf}_{t,d}\right) \times \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right)
\]
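As a quick sketch, the weight formula can be applied to a few terms from the Example 1 matrix above (N = 4 documents); the counts below are taken from that table and are only a small subset of the full vocabulary.

```python
import math

# A few term counts from Example 1 (taken from the table above).
tf = {
    "fish":     {"D1": 1, "D2": 1, "D3": 2, "D4": 1},
    "tank":     {"D1": 0, "D2": 1, "D3": 0, "D4": 1},
    "tropical": {"D1": 1, "D2": 1, "D3": 1, "D4": 2},
}
N = 4  # number of documents

def tfidf_weight(term, doc):
    count = tf[term][doc]
    if count == 0:
        return 0.0
    df = sum(1 for c in tf[term].values() if c > 0)
    # w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t)
    return (1 + math.log10(count)) * math.log10(N / df)

print(tfidf_weight("tank", "D4"))      # df = 2 -> idf = log10(4/2) ~ 0.301
print(tfidf_weight("tropical", "D4"))  # appears in all docs -> idf = 0 -> weight 0
print(tfidf_weight("fish", "D3"))      # appears in all docs -> idf = 0 -> weight 0
```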