
How to Compute the Similarity Between Two Text Documents?

1. Introduction

Computing the similarity between two text documents is a common task in NLP, with
several practical applications.

2. What Is Text Similarity?

Let us try to define what we mean by similarity. Consider the sentences:

 The teacher gave his speech to an empty room


 There was almost nobody when the professor was talking

Although they convey a very similar meaning, they are written in a completely different way.
In fact, the two sentences just have one word in common (“the”), and not a really significant
one at that. But we need a similarity algorithm to return a high score for this pair. Now,
consider the sentences:

 The teacher gave his speech to a full room


 There was almost nobody when the professor was talking

The two sentences now have an opposite meaning.

When we want to compute similarity based on meaning, we call it semantic text similarity. Due to the complexities of natural language, this is a very hard task to accomplish, and it’s still an active research area. In any case, most modern methods to compute similarity try to take the semantics into account to some extent.

Traditional text similarity methods only work on a lexical level, that is, using only the
words in the sentence. These were mostly developed before the rise of deep learning but can
still be used today. They are faster to implement and run, and can provide a better trade-off
depending on the use case.

We’ll first cover the foundational aspects of traditional document similarity algorithms. In the second part, we’ll introduce word embeddings, which we can use to integrate at least some semantic considerations.

3. Document Vectors

The traditional approach to computing text similarity between documents is to transform the input documents into real-valued vectors. The goal is to have a vector space where similar documents are “close”, according to a chosen similarity measure. This approach takes the name of Vector Space Model, and it’s very convenient because it allows us to use simple linear algebra to compute similarities. We just have to define two things:

1. A way of transforming documents into vectors

2. A similarity measure for vectors

So, let’s see the possible ways of transforming a text document into a vector.

3.1. Document Vectors: an Example

The simplest way to build a vector from text is to use word counts. We’ll do this with
three example sentences and then compute their similarity. After that, we’ll go over actual
methods that can be used to compute the document vectors.

Let’s consider three documents:

D1. We went to the pizza place and you ate no pizza at all.
D2. I ate pizza with you yesterday at home.
D3. There’s no place like home.

To build our vectors, we’ll count the occurrences of each word in each sentence.

Before counting, the documents are typically pre-processed (tokenization, stemming or lemmatization, stopword removal, and so on) in order to reduce sparsity.
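As a small illustration (not part of the original lecture material), here is a minimal Python sketch that builds raw count vectors for the three documents; the crude tokenizer stands in for the pre-processing step:

# Minimal sketch: build raw word-count (bag-of-words) vectors for D1-D3.
from collections import Counter

docs = [
    "We went to the pizza place and you ate no pizza at all.",   # D1
    "I ate pizza with you yesterday at home.",                   # D2
    "There's no place like home.",                               # D3
]

def tokenize(text):
    # crude tokenization: lowercase and strip surrounding punctuation;
    # no stemming or stopword removal is applied here
    return [w.strip(".,!?\"") for w in text.lower().split()]

counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set(w for c in counts for w in c))       # shared vocabulary
vectors = [[c[w] for w in vocab] for c in counts]       # one count vector per document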

Once we have our vectors, we can apply a similarity measure: cosine similarity measures the angle between two vectors and returns a real value between -1 and 1.

If the vectors only have positive values, the output will actually lie between 0 and 1. It returns 0 when the two vectors are orthogonal, that is, the documents don’t share any terms, and 1 when the two vectors are parallel, that is, the documents have exactly the same word distribution.
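Continuing the sketch above (the cosine function below is just a plain-Python illustration of the formula, with our own helper names), we can compute the pairwise similarities of the three count vectors:

import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# pairwise similarities between D1, D2 and D3, reusing `vectors` from the previous sketch
for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        print(f"cos(D{i + 1}, D{j + 1}) = {cosine(vectors[i], vectors[j]):.3f}")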

As the computed values show, the first two documents have the highest similarity, since they share the most words. Note that since the word “pizza” appears twice in D1, it contributes more to the similarity than “at”, “ate” or “you”.

4. TF-IDF Vectors

The idea behind TF-IDF is that we first compute the number of documents in which a word appears. If a word appears in many documents, it will be less relevant in the computation of the similarity, and vice versa. We call the resulting value the inverse document frequency, or IDF:

idf_t = \log_{10}(N / df_t)

where N is the total number of documents and df_t is the number of documents containing the term t, assuming we use a base-10 logarithm. Note that if a word were to appear in all three documents, its IDF would be \log_{10}(3/3) = 0.

We can compute the IDF just once, as a pre-processing step, for each word in our corpus and
it will tell us how significant that word is in the corpus itself.

At this point, instead of using the raw word counts, we can compute the document vectors by weighting the counts with the IDF. For each document, we’ll compute the count of each word, transform it into a frequency (that is, divide the count by the total number of words in the document), and then multiply it by the IDF.
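A short sketch of this weighting scheme (relative term frequency multiplied by a base-10 IDF); the function and variable names are our own, not from the lecture:

import math
from collections import Counter

def tf_idf_vectors(tokenized_docs):
    # tokenized_docs: a list of token lists, one per document
    n_docs = len(tokenized_docs)
    vocab = sorted(set(w for doc in tokenized_docs for w in doc))
    # document frequency: in how many documents each word appears
    df = {w: sum(1 for doc in tokenized_docs if w in doc) for w in vocab}
    idf = {w: math.log10(n_docs / df[w]) for w in vocab}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        # relative frequency of each word, weighted by its IDF
        vectors.append([(counts[w] / total) * idf[w] for w in vocab])
    return vocab, vectors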
4.1. Pros and Cons of TF-IDF

The weighting factor provided by this method is a great advantage compared to using
raw word frequencies, and it’s important to note that its usefulness is not limited to the
handling of stopwords.

Every word will have a different weight, and since word usage varies with the topic of
discussion, this weight will be tailored according to what the input corpus is about.

For example, the word “lawyer” could have low importance in a collection of legal documents (in terms of establishing similarities between two of them), but high importance in a set of news articles. Intuitively this makes sense, because most legal documents will talk about lawyers, but most news articles won’t.

The downside of this method as described is that it doesn’t take into account any semantic
aspect. Two words like “teacher” and “professor”, although similar in meaning, will
correspond to two different dimensions of the resulting document vectors, contributing 0 to
the overall similarity.

In any case, this method, or variations of it, is still very efficient and widely used, for example in ranking search engine results. In this scenario, we can use the similarity between the input query and the candidate documents to rank the most similar ones higher.

5. Word Embeddings

Word embeddings are high-dimensional vectors that represent words. We can create them in
an unsupervised way from a collection of documents, in general using neural networks, by
analyzing all the contexts in which the word occurs.

This results in vectors that are similar (according to cosine similarity) for words that appear
in similar contexts, and thus have a similar meaning. For example, since the words “teacher”
and “professor” can sometimes be used interchangeably, their embeddings will be close
together.

For this reason, using word embeddings can enable us to handle synonyms or words with
similar meaning in the computation of similarity, which we couldn’t do by using word
frequencies.

However, word embeddings are just vector representations of single words, and there are several ways to integrate them into a text similarity computation.
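One common option (a sketch under the assumption that a pre-trained, dict-like embedding lookup is available, e.g. loaded from word2vec or GloVe; the names embedding and dim below are placeholders, not something defined in this lecture) is to average the embeddings of the words in each document and compare the resulting document vectors with cosine similarity:

import numpy as np

def document_vector(tokens, embedding, dim=300):
    # average the embeddings of the tokens we have vectors for;
    # `embedding` is assumed to map word -> np.ndarray of length `dim`
    vecs = [embedding[t] for t in tokens if t in embedding]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# usage (placeholder lookup):
# sim = cosine(document_vector(tokens_a, embedding), document_vector(tokens_b, embedding))

Averaging is only the simplest aggregation; for example, the word vectors can also be weighted by their IDF before averaging.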

Text similarity is a very active research field, and techniques are continuously evolving and
improving. In this article, we’ve given an overview of possible ways to implement text
similarity, focusing on the Vector Space Model and Word Embeddings.

We’ve seen how methods like TF-IDF can help in weighting terms appropriately, but without taking into account any semantic aspects.

For these reasons, when choosing which method to use, it’s important to always consider our use case and requirements carefully.

Example 1: Using the vector space representation, rank the documents by computing the Euclidean distance between the points representing the documents and the query “Tropical fish”.

D1 : Tropical freshwater Aquarium fish.

D2 : Tropical fish, aquarium care tank setup.

D3 : Keeping Tropical fish and goldfish in aquariums and Fish bowls.

D4 : The Tropical tank homepage - Tropical fish and Aquariums.

Term          D1   D2   D3   D4
Aquarium       1    1    1    1
bowl           0    0    1    0
care           0    1    0    0
fish           1    1    2    1
Fresh water    1    0    0    0
Gold fish      0    0    1    0
Home page      0    0    0    1
keep           0    0    1    0
Setup          0    1    0    0
tank           0    1    0    1
tropical       1    1    1    2

The docs D1, D2, D3, and D4 are represented by the following vectors:

D1 - [1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]

D2 - [1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]

D3 - [1, 1, 0, 2, 0, 1, 0, 1, 0, 0, 1]

D4 - [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 2]

The dimension of the space is t = 11, the size of the vocabulary of the collection.

The query “Tropical fish” is represented by the following vector:

Tropical fish - [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
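As a quick check of Example 1, the distances can be computed directly from the vectors above (a small sketch; a smaller distance means a higher rank):

import math

docs = {
    "D1": [1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    "D2": [1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
    "D3": [1, 1, 0, 2, 0, 1, 0, 1, 0, 0, 1],
    "D4": [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 2],
}
query = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]   # "Tropical fish"

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# rank documents by increasing distance to the query
for name, vec in sorted(docs.items(), key=lambda kv: euclidean(query, kv[1])):
    print(name, round(euclidean(query, vec), 3))

Running this ranks D1 first (it differs from the query in the fewest terms), with D2 and D4 tied behind it and D3 last.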

Example 2: Compute (i) the cosine similarity on the BOW vectors

(ii) the cosine similarity on the TF-IDF weighted BOW vectors

and use them to rank the following documents:

D1: The best Italian restaurant enjoy the best pasta.


D2: American restaurant enjoy the best hamburger.

D3: Korean restaurant enjoy the best bibimbap.

D4: the best the best American restaurant.

Term          D1   D2   D3   D4
the            2    1    1    2
best           2    1    1    2
Italian        1    0    0    0
restaurant     1    1    1    1
enjoy          1    1    1    0
pasta          1    0    0    0
American       0    1    0    1
hamburger      0    1    0    0
Korean         0    0    1    0
bibimbap       0    0    1    0

The docs D1, D2, D3, and D4 are represented by the following vectors:

D1 is represented – [ 2, 2, 1, 1, 1, 1, 0, 0, 0, 0]

D2 is represented – [ 1, 1, 0, 1, 1, 0, 1, 1, 0, 0]

D3 is represented – [ 1, 1, 0, 1, 1, 0, 0, 0, 1, 1]

D4 is represented – [ 2, 2, 0, 1, 0, 0, 1, 0, 0, 0]

The cosine similarity between \vec{q} and \vec{d} is given by

\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\;\sqrt{\sum_{i=1}^{|V|} d_i^2}}
Doc                                                      TF on BOW                        Cosine similarity with D4
D1: The best Italian restaurant enjoy the best pasta.    [2, 2, 1, 1, 1, 1, 0, 0, 0, 0]   ?
D2: American restaurant enjoy the best hamburger.        [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]   ?
D3: Korean restaurant enjoy the best bibimbap.           [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]   ?
D4: the best the best American restaurant.               [2, 2, 0, 1, 0, 0, 1, 0, 0, 0]   1
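The missing cosine values can be computed with a short sketch over the BOW vectors above (plain Python, our own helper names):

import math

bow = {
    "D1": [2, 2, 1, 1, 1, 1, 0, 0, 0, 0],
    "D2": [1, 1, 0, 1, 1, 0, 1, 1, 0, 0],
    "D3": [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
    "D4": [2, 2, 0, 1, 0, 0, 1, 0, 0, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

for name in ("D1", "D2", "D3"):
    print(f"cos({name}, D4) = {cosine(bow[name], bow['D4']):.3f}")

On raw counts D1 comes out closest to D4, mainly because the two documents share the repeated words “the” and “best”.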

(ii)

tf = how frequently a term occurs in a document.

idf = log( (total number of documents) / (number of documents containing the term) )

The idf (inverse document frequency) of a term t is given by

idf_t = \log_{10}(N / df_t)

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + \log_{10} tf_{t,d}) \times \log_{10}(N / df_t)

In the table below, however, tf is taken as the relative frequency of the term (its count divided by the length of the document), and the weight is simply tf × idf.

word         tf d1   tf d2   tf d3   tf d4   idf_t = log10(N/df_t)   tf-idf d1   tf-idf d2   tf-idf d3   tf-idf d4
the          2/8     1/6     1/6     2/6     log(4/4) = 0            0           0           0           0
best         2/8     1/6     1/6     2/6     log(4/4) = 0            0           0           0           0
Italian      1/8     0/6     0/6     0/6     log(4/1) = 0.6          0.075       0           0           0
restaurant   1/8     1/6     1/6     1/6     log(4/4) = 0            0           0           0           0
enjoy        1/8     1/6     1/6     0/6     log(4/3) = 0.13         0.016       0.02        0.02        0
pasta        1/8     0/6     0/6     0/6     log(4/1) = 0.6          0.075       0           0           0
American     0/8     1/6     0/6     1/6     log(4/2) = 0.3          0           0.05        0           0.05
hamburger    0/8     1/6     0/6     0/6     log(4/1) = 0.6          0           0.1         0           0
Korean       0/8     0/6     1/6     0/6     log(4/1) = 0.6          0           0           0.1         0
bibimbap     0/8     0/6     1/6     0/6     log(4/1) = 0.6          0           0           0.1         0

Doc                                                      TF-IDF on BOW                                Cosine similarity with D4
D1: The best Italian restaurant enjoy the best pasta.    [0, 0, 0.075, 0, 0.016, 0.075, 0, 0, 0, 0]   ?
D2: American restaurant enjoy the best hamburger.        [0, 0, 0, 0, 0.02, 0, 0.05, 0.1, 0, 0]       ?
D3: Korean restaurant enjoy the best bibimbap.           [0, 0, 0, 0, 0.02, 0, 0, 0, 0.1, 0.1]        ?
D4: the best the best American restaurant.               [0, 0, 0, 0, 0, 0, 0.05, 0, 0, 0]            1
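The same computation on the TF-IDF vectors (a sketch using the weights from the table above):

import math

tfidf = {
    "D1": [0, 0, 0.075, 0, 0.016, 0.075, 0, 0, 0, 0],
    "D2": [0, 0, 0, 0, 0.02, 0, 0.05, 0.1, 0, 0],
    "D3": [0, 0, 0, 0, 0.02, 0, 0, 0, 0.1, 0.1],
    "D4": [0, 0, 0, 0, 0, 0, 0.05, 0, 0, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

for name in ("D1", "D2", "D3"):
    print(f"cos({name}, D4) = {cosine(tfidf[name], tfidf['D4']):.3f}")

Under TF-IDF only D2 keeps a non-zero similarity to D4 (through the shared term “American”): the words “the”, “best” and “restaurant”, which drove the BOW similarities, receive zero weight because they appear in every document.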
