Text Similarity Cosine BOW TF-IDF Lecture
1. Introduction
Computing the similarity between two text documents is a common task in NLP, with
several practical applications.
Although the two sentences convey a very similar meaning, they are written in completely different ways. In fact, they have just one word in common (“the”), and not a particularly significant one at that. Yet we would want a similarity algorithm to return a high score for this pair.
2. Lexical Similarity Methods
Traditional text similarity methods only work on a lexical level, that is, they only use the words that appear in the sentences. They were mostly developed before the rise of deep learning but can still be used today. They are faster to implement and run, and, depending on the use case, can offer a better trade-off between quality and cost.
3. Document Vectors
The simplest way to build a vector from text is to use word counts. We’ll do this with
three example sentences and then compute their similarity. After that, we’ll go over actual
methods that can be used to compute the document vectors.
D1. We went to the pizza place and you ate no pizza at all.
D2. I ate pizza with you yesterday at home.
D3. There’s no place like home.
To build our vectors, we count the occurrences of each word in each sentence.
Once we have the vectors, we can compare them with a similarity measure: cosine similarity measures the cosine of the angle between two vectors and returns a real value between -1 and 1.
If the vectors only have positive values, as is the case here, the output will lie between 0 and 1. It returns 0 when the two vectors are orthogonal, that is, the documents have no words in common, and 1 when the two vectors are parallel, that is, the documents use the same words in the same proportions:
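As a minimal sketch of this step (plain Python, no external libraries), we can build the raw count vectors for D1–D3 and compute their pairwise cosine similarities. Tokenization here is a simple whitespace split on lowercased text, which is an assumption of the sketch; the exact counts depend on how punctuation and case are handled.

```python
import math
from collections import Counter

docs = [
    "we went to the pizza place and you ate no pizza at all",  # D1
    "i ate pizza with you yesterday at home",                  # D2
    "there's no place like home",                              # D3
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))

def count_vector(doc):
    # Bag-of-words vector: raw count of each vocabulary word in the document.
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v1, v2, v3 = (count_vector(doc) for doc in tokenized)
print(cosine(v1, v2), cosine(v1, v3), cosine(v2, v3))
```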
As we can see, the first two documents have the highest similarity, since they share four words (“pizza”, “ate”, “you”, and “at”). Note that since the word “pizza” appears twice in D1, it contributes more to the similarity than “at”, “ate”, or “you”.
4. TF-IDF Vectors
The idea behind TF-IDF is that we first compute, for each word, the number of documents in which it appears (its document frequency). If a word appears in many documents, it will be less relevant in the computation of the similarity, and vice versa. The inverse document frequency, or IDF, is the base-10 logarithm of the total number of documents divided by the word’s document frequency. Note that if a word were to appear in all three documents, its IDF would be log10(3/3) = 0.
We can compute the IDF just once, as a pre-processing step, for each word in our corpus and
it will tell us how significant that word is in the corpus itself.
At this point, instead of using the raw word counts, we can compute the document vectors by weighting the counts with the IDF. For each document, we compute the count of each word, turn it into a frequency (that is, divide the count by the total number of words in the document), and then multiply it by the word’s IDF.
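A minimal sketch of this weighting scheme, applied to the same three sentences, is shown below. It follows the formula described above (frequency times base-10 IDF); note that library implementations such as scikit-learn’s TfidfVectorizer use slightly different conventions (natural logarithm, smoothing), so their numbers will not match exactly.

```python
import math
from collections import Counter

docs = [
    "we went to the pizza place and you ate no pizza at all",  # D1
    "i ate pizza with you yesterday at home",                  # D2
    "there's no place like home",                              # D3
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))

# IDF: log10(N / df), where df is the number of documents containing the word.
N = len(tokenized)
idf = {w: math.log10(N / sum(w in doc for doc in tokenized)) for w in vocab}

def tfidf_vector(doc):
    counts = Counter(doc)
    total = len(doc)
    # Term frequency (count / document length) weighted by IDF.
    return [counts[w] / total * idf[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = [tfidf_vector(doc) for doc in tokenized]
print(cosine(vectors[0], vectors[1]))  # D1 vs D2
print(cosine(vectors[0], vectors[2]))  # D1 vs D3
print(cosine(vectors[1], vectors[2]))  # D2 vs D3
```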
4.1. Pros and Cons of TF-IDF
The weighting factor provided by this method is a great advantage compared to using
raw word frequencies, and it’s important to note that its usefulness is not limited to the
handling of stopwords.
Every word will have a different weight, and since word usage varies with the topic of
discussion, this weight will be tailored according to what the input corpus is about.
For example, the word “lawyer” could have low importance in a collection of legal documents (in terms of establishing similarities between two of them), while having high importance in a set of news articles. Intuitively this makes sense, because most legal documents will talk about lawyers, but most news articles won’t.
The downside of this method as described is that it doesn’t take into account any semantic
aspect. Two words like “teacher” and “professor”, although similar in meaning, will
correspond to two different dimensions of the resulting document vectors, contributing 0 to
the overall similarity.
In any case, this method, or variations of it, is still very efficient and widely used, for example in ranking search engine results. In this scenario, we can use the similarity between the input query and the candidate documents to rank the most similar ones higher.
5. Word Embeddings
Word embeddings are dense numeric vectors that represent words. We can create them in an unsupervised way from a collection of documents, generally using neural networks, by analyzing all the contexts in which each word occurs.
This results in vectors that are similar (according to cosine similarity) for words that appear
in similar contexts, and thus have a similar meaning. For example, since the words “teacher”
and “professor” can sometimes be used interchangeably, their embeddings will be close
together.
For this reason, using word embeddings can enable us to handle synonyms or words with
similar meaning in the computation of similarity, which we couldn’t do by using word
frequencies.
However, word embeddings are just vector representations of individual words, and there are several ways to combine them when computing the similarity of whole texts.
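One common option, given here only as an illustrative sketch and not as something the text above prescribes, is to average the embeddings of the words in each document and then compare the resulting document vectors with cosine similarity. The lookup table `embedding` below would in practice come from a pretrained model such as word2vec or GloVe; the tiny 3-dimensional vectors used here are made-up placeholders just to keep the example self-contained.

```python
import numpy as np

# Placeholder embeddings: in practice these would come from a pretrained model
# (e.g. word2vec or GloVe); the values below are made up for illustration only.
embedding = {
    "teacher":   np.array([0.8, 0.1, 0.3]),
    "professor": np.array([0.7, 0.2, 0.3]),
    "explains":  np.array([0.1, 0.9, 0.2]),
    "the":       np.array([0.0, 0.1, 0.0]),
    "lesson":    np.array([0.2, 0.8, 0.4]),
    "lecture":   np.array([0.3, 0.7, 0.4]),
}

def doc_vector(text):
    # Average the embeddings of the known words in the document.
    vectors = [embedding[w] for w in text.lower().split() if w in embedding]
    return np.mean(vectors, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

d1 = doc_vector("The teacher explains the lesson")
d2 = doc_vector("The professor explains the lecture")
print(cosine(d1, d2))  # high, even though the documents share few exact words
```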
6. Conclusion
Text similarity is a very active research field, and techniques are continuously evolving and improving. In this lecture, we’ve given an overview of possible ways to implement text similarity, focusing on the Vector Space Model and Word Embeddings.
We’ve seen how methods like TF-IDF can help in weighting terms appropriately, but without taking any semantic aspects into account.
For these reasons, when choosing which method to use, it’s important to always consider our use case and requirements carefully.
Example 1: Using the vector space representation below, rank the documents by computing the Euclidean distance between the points representing the documents and the query “Tropical fish”.
Terms         D1   D2   D3   D4
aquarium       1    1    1    1
bowl           0    0    1    0
care           0    1    0    0
fish           1    1    2    1
fresh water    1    0    0    0
gold fish      0    0    1    0
home page      0    0    0    1
keep           0    0    1    0
setup          0    1    0    0
tank           0    1    0    1
tropical       1    1    1    2
The docs D1, D2, D3, and D4 are represented by the following vectors:
D1 - [1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]
D2 - [1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
D3 - [1, 1, 0, 2, 0, 1, 0, 1, 0, 0, 1]
D4 - [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 2]
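As a minimal sketch of how this exercise can be worked out, the query “Tropical fish” can be encoded over the same vocabulary as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1] (only “fish” and “tropical” are non-zero); this query encoding is my own, not given in the exercise. The Euclidean distances then follow directly, with a smaller distance meaning a higher rank.

```python
import math

vocab = ["aquarium", "bowl", "care", "fish", "fresh water", "gold fish",
         "home page", "keep", "setup", "tank", "tropical"]

docs = {
    "D1": [1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    "D2": [1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
    "D3": [1, 1, 0, 2, 0, 1, 0, 1, 0, 0, 1],
    "D4": [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 2],
}
# Query "Tropical fish": one occurrence each of "fish" and "tropical".
query = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Smaller distance = closer to the query = higher rank.
for name, vec in sorted(docs.items(), key=lambda kv: euclidean(query, kv[1])):
    print(name, round(euclidean(query, vec), 3))
```

With these vectors, D1 is closest to the query (distance √2 ≈ 1.414), D2 and D4 tie at 2.0, and D3 is farthest (≈ 2.236).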
Next, find the ranking of the documents by computing their cosine similarity with D4, given the following term counts:
Terms         D1   D2   D3   D4
the            2    1    1    2
best           2    1    1    2
Italian        1    0    0    0
restaurant     1    1    1    1
enjoy          1    1    1    0
pasta          1    0    0    0
American       0    1    0    1
hamburger      0    1    0    0
Korean         0    0    1    0
bibimbap       0    0    1    1
The docs D1, D2, D3, and D4 are represented by the following vectors:
D1 - [2, 2, 1, 1, 1, 1, 0, 0, 0, 0]
D2 - [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]
D3 - [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]
D4 - [2, 2, 0, 1, 0, 0, 1, 0, 0, 1]
The cosine similarity between a query vector q and a document vector d, over a vocabulary of size |V|, is:

\[
\cos(\vec{q}, \vec{d}) \;=\; \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} \;=\; \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} \;=\; \frac{\sum_{i=1}^{|V|} q_i\, d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^{2}}\;\sqrt{\sum_{i=1}^{|V|} d_i^{2}}}
\]
Doc                                                     TF on BOW                         Cosine similarity with D4
D1: The best Italian restaurant enjoy the best pasta.   [2, 2, 1, 1, 1, 1, 0, 0, 0, 0]    ?
D2: American restaurant enjoy the best hamburger.       [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]    ?
D3: Korean restaurant enjoy the best bibimbap.          [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]    ?
D4: the best the best American restaurant.              [2, 2, 0, 1, 0, 0, 1, 0, 0, 1]    1
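A minimal sketch of the computation, applying the cosine formula above to the bag-of-words vectors in the table; the printed values fill in the question marks.

```python
import math

bow = {
    "D1": [2, 2, 1, 1, 1, 1, 0, 0, 0, 0],
    "D2": [1, 1, 0, 1, 1, 0, 1, 1, 0, 0],
    "D3": [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
    "D4": [2, 2, 0, 1, 0, 0, 1, 0, 0, 1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

for name in ("D1", "D2", "D3"):
    print(name, "vs D4:", round(cosine(bow[name], bow["D4"]), 3))
# D1 vs D4: 0.783, D2 vs D4: 0.739, D3 vs D4: 0.739 -> D1 ranks highest.
```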
(ii) TF-IDF weighting:

\[
\mathrm{idf}_t = \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right)
\]

The tf-idf weight of a term is the product of its tf weight and its idf weight:

\[
w_{t,d} = \left(1 + \log_{10} \mathrm{tf}_{t,d}\right) \times \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right)
\]
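As a quick sketch, the weight formula can be applied to a few terms from the Example 1 matrix above (N = 4 documents); the counts below are taken from that table and are only a small subset of the full vocabulary.

```python
import math

# A few term counts from Example 1 (taken from the table above).
tf = {
    "fish":     {"D1": 1, "D2": 1, "D3": 2, "D4": 1},
    "tank":     {"D1": 0, "D2": 1, "D3": 0, "D4": 1},
    "tropical": {"D1": 1, "D2": 1, "D3": 1, "D4": 2},
}
N = 4  # number of documents

def tfidf_weight(term, doc):
    count = tf[term][doc]
    if count == 0:
        return 0.0
    df = sum(1 for c in tf[term].values() if c > 0)
    # w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t)
    return (1 + math.log10(count)) * math.log10(N / df)

print(tfidf_weight("tank", "D4"))      # df = 2 -> idf = log10(4/2) ~ 0.301
print(tfidf_weight("tropical", "D4"))  # appears in all docs -> idf = 0 -> weight 0
print(tfidf_weight("fish", "D3"))      # appears in all docs -> idf = 0 -> weight 0
```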