Measuring Semantic Similarity Between Words Using Web Search Engines
Danushka Bollegala
Yutaka Matsuo
Mitsuru Ishizuka
danushka@mi.ci.i.u-tokyo.ac.jp
y.matsuo@aist.go.jp
ishizuka@i.u-tokyo.ac.jp
ABSTRACT
Semantic similarity measures play important roles in information retrieval and Natural Language Processing. Previous work in semantic web-related applications such as community mining, relation extraction, and automatic metadata
extraction has used various semantic similarity measures.
Despite the usefulness of semantic similarity measures in
these applications, robustly measuring semantic similarity
between two words (or entities) remains a challenging task.
We propose a robust semantic similarity measure that uses
the information available on the Web to measure similarity
between words or entities. The proposed method exploits
page counts and text snippets returned by a Web search
engine. We define various similarity scores for two given
words P and Q, using the page counts for the queries P, Q
and P AND Q. Moreover, we propose a novel approach to
compute semantic similarity using automatically extracted
lexico-syntactic patterns from text snippets. These different
similarity scores are integrated using support vector machines, to leverage a robust semantic similarity measure.
Experimental results on the Miller-Charles benchmark dataset
show that the proposed measure outperforms all existing
web-based semantic similarity measures by a wide margin,
achieving a correlation coefficient of 0.834. Moreover, the
proposed semantic similarity measure significantly improves
the accuracy (F-measure of 0.78) in a community mining
task and in an entity disambiguation task, thereby verifying
the capability of the proposed measure to capture semantic
similarity using web content.
General Terms
Algorithms
Keywords
semantic similarity, Web mining
1. INTRODUCTION

1 Page count may not necessarily be equal to the word frequency, because the queried word might appear many times on one page.
2 http://www.google.com
Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use,
and personal use by others.
WWW 2007, May 8-12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.
A snippet is a brief window of text extracted by a search engine around the query term in a document; snippets provide useful information about the local context of the query term. Semantic similarity measures defined over snippets have been
used in query expansion [36], personal name disambiguation [4] and community mining [6]. Processing snippets is
also efficient as it obviates the trouble of downloading web
pages, which might be time consuming depending on the size
of the pages. However, a widely acknowledged drawback of
using snippets is that, because of the huge scale of the web
and the large number of documents in the result set, only
those snippets for the top-ranking results for a query can
be processed efficiently. Ranking of search results, hence
snippets, is determined by a complex combination of various factors unique to the underlying search engine. Therefore, no guarantee exists that all the information we need to
measure semantic similarity between a given pair of words
is contained in the top-ranking snippets.
This paper proposes a method that considers both page
counts and lexico-syntactic patterns extracted from snippets,
thereby overcoming the problems described above.
For example, let us consider the following snippet from
Google for the query Jaguar AND cat.
2. RELATED WORK
Semantic similarity measures are important in many Web-related tasks. In query expansion [5, 25, 40] a user query is
modified using synonymous words to improve the relevancy
of the search. One method to find appropriate words to
include in a query is to compare the previous user queries
using semantic similarity measures. If there exists a previous
query that is semantically related to the current query, then
it can be suggested either to the user or internally used by
the search engine to modify the original query.
Semantic similarity measures have been used in Semantic
Web related applications such as automatic annotation of
Web pages [7], community mining [23, 19], and keyword
extraction for inter-entity relation representation [26].
Semantic similarity measures are necessary for various applications in natural language processing such as word-sense
disambiguation [32], language modeling [34], synonym extraction [16], and automatic thesaurus extraction [8]. Manually compiled taxonomies such as WordNet (http://wordnet.princeton.edu/) and large text
corpora have been used in previous work on semantic similarity [16, 31, 13, 17]. Regarding the Web as a live corpus has recently become an active research topic. Simple,
unsupervised models demonstrably perform better when n-gram counts are obtained from the Web rather than from a
large corpus [14, 15]. Resnik and Smith [33] extracted bilingual sentences from the Web to create a parallel corpus
for machine translation. Turney [38] defined a point-wise
mutual information (PMI-IR) measure using the number of
hits returned by a Web search engine to recognize synonyms.
Matsuo et al. [20] used a similar approach to measure the
similarity between words and applied their method in a graph-based word clustering algorithm.
Given a taxonomy of concepts, a straightforward method
to calculate similarity between two words (concepts) is to
find the length of the shortest path connecting the two words
in the taxonomy [30]. If a word is polysemous then multiple
paths might exist between the two words. In such cases,
only the shortest path between any two senses of the words
is considered for calculating similarity. A problem that is
frequently acknowledged with this approach is that it relies
on the notion that all links in the taxonomy represent a
uniform distance.
Resnik [31] proposed a similarity measure using information content. He defined the similarity between two concepts
C1 and C2 in the taxonomy as the maximum of the information content of all concepts C that subsume both C1 and
C2 . Then the similarity between two words is defined as
the maximum of the similarity between any concepts that
the words belong to. He used WordNet as the taxonomy;
information content is calculated using the Brown corpus.
Li et al. [41] combined structural semantic information
from a lexical taxonomy and information content from a corpus in a nonlinear model. They proposed a similarity measure that uses shortest path length, depth and local density
in a taxonomy. Their experiments reported a Pearson correlation coefficient of 0.8914 on the Miller and Charles [24]
benchmark dataset. They did not evaluate their method in
terms of similarities among named entities. Lin [17] defined
the similarity between two concepts as the information that
is in common to both concepts and the information contained in each individual concept.
3. METHOD

3.1 Outline
3.2 Page-count-based Similarity Scores

Let $H(P)$ denote the page count returned by the search engine for the query $P$, and $H(P \cap Q)$ the page count for the conjunctive query $P$ AND $Q$. With a threshold $c$ that zeroes out noisy low co-occurrence counts, we define four page-count-based similarity scores:

\[ \mathrm{WebJaccard}(P,Q) = \begin{cases} 0 & \text{if } H(P \cap Q) \le c \\ \frac{H(P \cap Q)}{H(P)+H(Q)-H(P \cap Q)} & \text{otherwise} \end{cases} \tag{1} \]

\[ \mathrm{WebOverlap}(P,Q) = \begin{cases} 0 & \text{if } H(P \cap Q) \le c \\ \frac{H(P \cap Q)}{\min(H(P),\,H(Q))} & \text{otherwise} \end{cases} \tag{2} \]

\[ \mathrm{WebDice}(P,Q) = \begin{cases} 0 & \text{if } H(P \cap Q) \le c \\ \frac{2\,H(P \cap Q)}{H(P)+H(Q)} & \text{otherwise} \end{cases} \tag{3} \]

\[ \mathrm{WebPMI}(P,Q) = \begin{cases} 0 & \text{if } H(P \cap Q) \le c \\ \log_2 \frac{H(P \cap Q)/N}{(H(P)/N)(H(Q)/N)} & \text{otherwise} \end{cases} \tag{4} \]

Here $N$ is the number of documents indexed by the search engine.
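To make the definitions concrete, the four page-count-based scores can be sketched in code as follows. The function name, the example page counts, and the index size N = 10^10 are illustrative assumptions, not values fixed by the paper.

```python
import math

def web_similarity_scores(h_p, h_q, h_pq, n=10**10, c=5):
    """Compute the four page-count-based scores (cf. Eqs. 1-4).

    h_p, h_q: page counts for the queries P and Q.
    h_pq: page count for the conjunctive query "P AND Q".
    n: assumed number of documents indexed by the engine.
    c: threshold that zeroes out noisy co-occurrence counts.
    """
    if h_pq <= c:
        return {"WebJaccard": 0.0, "WebOverlap": 0.0,
                "WebDice": 0.0, "WebPMI": 0.0}
    jaccard = h_pq / (h_p + h_q - h_pq)
    overlap = h_pq / min(h_p, h_q)
    dice = 2 * h_pq / (h_p + h_q)
    pmi = math.log2((h_pq / n) / ((h_p / n) * (h_q / n)))
    return {"WebJaccard": jaccard, "WebOverlap": overlap,
            "WebDice": dice, "WebPMI": pmi}
```

Note how the threshold c makes all four scores robust to spurious low co-occurrence counts: a pair that co-occurs on five or fewer pages is simply scored zero.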
3.3 Extracting Lexico-Syntactic Patterns from Snippets
"Y", "X and Y are", and "X and Y are two". Finally, the function CountFreq counts the frequency of each pattern we extracted. The
procedure described above yields a set of patterns with their
frequencies in text snippets obtained from a search engine.
It considers the words that fall between X and Y as well as
words that precede X and succeed Y.
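The extraction step just described can be sketched as follows; the whitespace tokenizer, the function name, and the sample snippet are simplifying assumptions for illustration.

```python
from collections import Counter

def extract_patterns(snippets, word_a, word_b, max_n=5):
    """Replace the two query words with wildcards X and Y, then count
    every n-gram (n = 2..5) containing exactly one X and one Y."""
    counts = Counter()
    for snippet in snippets:
        tokens = [
            "X" if t.lower() == word_a.lower()
            else "Y" if t.lower() == word_b.lower()
            else t
            for t in snippet.split()
        ]
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram.count("X") == 1 and gram.count("Y") == 1:
                    counts[" ".join(gram)] += 1
    return counts

# Toy snippet: only one window contains both wildcards exactly once.
patterns = extract_patterns(["the jaguar is a big cat"], "jaguar", "cat")
```

Because each kept n-gram must contain exactly one X and one Y, patterns made entirely of context words, or windows covering only one of the two query words, are discarded automatically.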
To leverage the pattern extraction process, we select 5000
pairs of synonymous nouns from WordNet synsets. For polysemous nouns we selected the synonyms for the dominant
sense. The pattern extraction algorithm described in Figure 3 yields 4,562,471 unique patterns. Of those patterns,
80% occur less than 10 times. It is impossible to train a classifier with such numerous sparse patterns. We must measure
the confidence of each pattern as an indicator of synonymy.
For that purpose, we employ the following procedure.
First, we run the pattern extraction algorithm described
in Figure 3 with a non-synonymous set of word-pairs and
count the frequency of the extracted patterns. We then use
a test of statistical significance to evaluate the probable applicability of a pattern as an indicator of synonymy. The
fundamental idea of this analysis is that, if a pattern appears significantly more often in snippets
for synonymous words than in snippets for non-synonymous
words, then it is a reliable indicator of synonymy.
To create a set of non-synonymous word-pairs, we select
two nouns from WordNet arbitrarily. If the two selected
nouns do not appear together in any WordNet synset, then we select
them as a non-synonymous word-pair. We repeat this procedure until we obtain 5000 pairs of non-synonymous words.
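The sampling loop can be sketched as follows, with a toy `synsets_of` mapping (noun to set of synset ids) standing in for a real WordNet lookup.

```python
import random

def sample_non_synonymous_pairs(nouns, synsets_of, k, seed=0):
    """Draw random noun pairs and keep only those that share no
    synset (i.e., are never listed as synonyms of each other)."""
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < k:
        a, b = rng.sample(nouns, 2)  # two distinct nouns
        if not (synsets_of[a] & synsets_of[b]):
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)

# Toy lexicon: "car" and "automobile" share a synset, so that pair is rejected.
synsets = {"car": {"car.n.01"}, "automobile": {"car.n.01"},
           "dog": {"dog.n.01"}, "tree": {"tree.n.01"}}
negatives = sample_non_synonymous_pairs(list(synsets), synsets, k=3)
```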
For each extracted pattern $v$, we create a contingency table, as shown in Table 1, using its frequency $p_v$ in snippets for synonymous word pairs and $n_v$ in snippets for non-synonymous word pairs. In Table 1, $P$ denotes the total frequency of all patterns in snippets for synonymous word pairs ($P = \sum_v p_v$) and $N$ is the same in snippets for non-synonymous word pairs ($N = \sum_v n_v$). Using the information in Table 1, we calculate the $\chi^2$ [18] value for each pattern as

\[ \chi^2 = \frac{(P+N)\,\bigl(p_v(N-n_v) - n_v(P-p_v)\bigr)^2}{PN(p_v+n_v)(P+N-p_v-n_v)}. \tag{5} \]
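The statistic is a direct transcription of Eq. 5; the counts used below are made up for illustration.

```python
def chi_squared(pv, nv, P, N):
    """Chi-squared value of one pattern (Eq. 5). pv, nv: the pattern's
    frequency in snippets for synonymous / non-synonymous word pairs;
    P, N: total pattern frequencies in the two snippet collections."""
    numerator = (P + N) * (pv * (N - nv) - nv * (P - pv)) ** 2
    denominator = P * N * (pv + nv) * (P + N - pv - nv)
    return numerator / denominator
```

A pattern that is equally frequent in both collections scores zero (no evidence of synonymy), while a pattern seen only with synonymous pairs scores high and is ranked as a reliable indicator.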
3.4
[Figure 5: correlation coefficient (y-axis, 0.780-0.800) vs. number of top-ranking pattern features.]

Figure 4: Creating a feature vector F for a word-pair (A, B):

    D ← GetSnippets("A AND B")
    N ← null
    for each snippet d ∈ D
        do N ← N + GetNgrams(d, A, B)
    SelPats ← SelectPatterns(N, GoodPats)
    PF ← Normalize(SelPats)
    F ← [PF, WebJaccard, WebOverlap, WebDice, WebPMI]
    return F
4. EXPERIMENTS
We conduct two sets of experiments to evaluate the proposed semantic similarity measure. First we compare the
similarity scores produced by the proposed measure against
the Miller-Charles benchmark dataset. We analyze the behavior of the proposed measure with the number of patterns
used as features, the number of snippets used to extract the
patterns, and the size of the training dataset. Secondly, we
apply the proposed measure in two real-world applications:
community mining and entity disambiguation.
For each pair of words (A, B), we create a feature vector F as shown in Figure 4. First, we query Google for
A AND B and collect snippets. Then we replace the
query words A and B with two wildcards X and Y , respectively in each snippet. Function GetNgrams extracts
n-grams for n = 2, 3, 4 and 5 from the snippets. We select
n-grams having exactly one X and one Y as we did in the
pattern extraction algorithm in Figure 3. Let us assume
the set of patterns selected based on their χ² values in section 3.3 to be GoodPats. Then, the function SelectPatterns
selects the n-grams from N which appear in GoodPats. In
Normalize(SelPats), we normalize the count of each pattern by dividing it by the total count of all
observed patterns. This function returns a vector of patterns in which each element is the normalized frequency of the
corresponding pattern in the snippets for the query A AND B.
We append similarity scores calculated using page counts in
section 3.2 to create the final feature vector F for the word-pair (A, B). This procedure yields a 204-dimensional (four
page-count-based similarity scores and 200 lexico-syntactic
patterns) feature vector F. We form such feature vectors
for all synonymous word-pairs (positive training examples)
as well as for non-synonymous word-pairs (negative training examples). We then train a two-class support vector
machine with the labelled feature vectors.
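A minimal sketch of this training step, assuming scikit-learn's SVC as the two-class SVM; the two patterns and all feature values are toy stand-ins for the 200 real patterns and the real page-count scores.

```python
import numpy as np
from sklearn.svm import SVC

def make_feature_vector(pattern_freqs, good_patterns, page_count_scores):
    """Normalized frequencies of the selected patterns followed by the
    four page-count-based similarity scores (cf. Figure 4)."""
    total = sum(pattern_freqs.get(p, 0) for p in good_patterns) or 1
    normalized = [pattern_freqs.get(p, 0) / total for p in good_patterns]
    return np.array(normalized + list(page_count_scores))

good_patterns = ["X is a Y", "X and Y are"]  # stand-ins for the 200 real patterns
pos = make_feature_vector({"X is a Y": 3, "X and Y are": 1},
                          good_patterns, [0.40, 0.50, 0.45, 3.2])
neg = make_feature_vector({}, good_patterns, [0.0, 0.0, 0.0, 0.0])

# Two-class SVM: label 1 = synonymous pair, 0 = non-synonymous pair.
clf = SVC(kernel="linear").fit(np.stack([pos, neg]), [1, 0])
```

In the full setup the same `make_feature_vector` routine would be applied to every positive and negative word pair before fitting, and again at test time to score an unseen pair.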
Once we have trained an SVM using synonymous and non-synonymous word pairs, we can use it to compute the semantic similarity between two given words. Following the same
method we used to generate feature vectors for training, we
create a feature vector F′ for the given pair of words (A′, B′),
between which we need to measure the semantic similarity.
We define the semantic similarity SemSim(A′, B′) between
4.1
4.2 Pattern Selection
We trained a linear kernel SVM with the top N pattern features (ranked according to their χ² values) and calculated
the correlation coefficient against the Miller-Charles benchmark dataset. Results of the experiment are shown in Figure 5. In Figure 5, a steep improvement of correlation with
the number of top-ranking patterns is apparent; it reaches
\[ \mathrm{CODC}(P,Q) = \begin{cases} 0 & \text{if } f(P@Q) = 0 \\ \exp\!\Bigl(\log\Bigl[\frac{f(P@Q)}{H(P)} \times \frac{f(Q@P)}{H(Q)}\Bigr]^{\alpha}\Bigr) & \text{otherwise} \end{cases} \tag{7} \]

4.3 Semantic Similarity

4.4 Taxonomy-Based Methods
7 We did not use any of the words in the benchmark dataset or their synsets for training.
[Figure 6: correlation coefficient (0.70-0.80) vs. number of snippets.]

4.6 Training Data
We used synonymous word pairs extracted from WordNet synsets as positive training examples and automatically
generated non-synonymous word pairs as negative training
examples to train a two-class support vector machine in section 3.4. To determine the optimum combination of positive
and negative training examples, we trained a linear kernel
SVM with different combinations of positive and negative
training examples and evaluated accuracy against the human ratings in the Miller-Charles benchmark dataset. Experimental results are summarized in Figure 7. Maximum
correlation coefficient of 0.8345 is achieved with 1900 positive training examples and 2400 negative training examples.
Moreover, Figure 7 reveals that correlation does not improve beyond 2500 positive and negative training examples.
Therefore, we can conclude that 2500 examples are sufficient
to leverage the proposed semantic similarity measure.
4.5 Number of Snippets
We computed the correlation with the Miller-Charles ratings for different numbers of snippets to investigate the effect
of the number of snippets used to extract patterns upon the
[Figure 7: correlation coefficient (0.45-0.85) for combinations of 500-4000 positive and 500-4000 negative training examples.]
name in the data set and average the results over the dataset.
For each person p in our data set, let us denote the cluster
that p belongs to by C(p). Moreover, we use A(p) to denote
the affiliation of person p, e.g., A(Tiger Woods) = Tennis
Player. Then we calculate precision and recall for person
p as,
4.7 Community Mining
\[ \mathrm{sim}(\Gamma) = \frac{1}{\frac{1}{2}|\Gamma|(|\Gamma|-1)} \sum_{(u,v) \in \Gamma} \mathrm{sim}(u,v) \tag{8} \]

Here, Γ is the merger of the two clusters A and B. |Γ| denotes the number of elements (persons) in Γ, and sim(u, v)
is the semantic similarity between two persons u and v in
Γ. We terminate the GAAC process when exactly five clusters
are formed. We adopt this clustering method with different semantic similarity measures sim(u, v) to compare their
accuracy in clustering people who belong to the same community.
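The merge criterion, i.e. the average pairwise similarity within the merged cluster Γ = A ∪ B with each unordered pair counted once, can be sketched as:

```python
from itertools import combinations

def merge_similarity(cluster_a, cluster_b, sim):
    """Average pairwise semantic similarity over the merged cluster,
    used to decide which two clusters GAAC merges next."""
    merged = list(cluster_a) + list(cluster_b)
    k = len(merged)
    pair_total = sum(sim(u, v) for u, v in combinations(merged, 2))
    return pair_total / (0.5 * k * (k - 1))
```

At each GAAC step, the pair of clusters maximizing this score would be merged, stopping once the desired number of clusters (five, in the experiment above) remains.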
We employed the B-CUBED metric [1] to evaluate the
clustering results. The B-CUBED evaluation metric was
originally proposed for evaluating cross-document co-reference
chains. We compute precision, recall and F -score for each
8 http://dmoz.org
\[ \mathrm{Precision}(p) = \frac{\text{No. of people in } C(p) \text{ with affiliation } A(p)}{|C(p)|} \tag{9} \]

\[ \mathrm{Recall}(p) = \frac{\text{No. of people in } C(p) \text{ with affiliation } A(p)}{\text{Total no. of people with affiliation } A(p)} \tag{10} \]

\[ F(p) = \frac{2 \times \mathrm{Precision}(p) \times \mathrm{Recall}(p)}{\mathrm{Precision}(p) + \mathrm{Recall}(p)} \tag{11} \]
Overall precision, recall and F-score are computed by averaging over all the names in the dataset:

\[ \mathrm{Precision} = \frac{1}{N} \sum_{p \in \mathrm{DataSet}} \mathrm{Precision}(p) \tag{12} \]

\[ \mathrm{Recall} = \frac{1}{N} \sum_{p \in \mathrm{DataSet}} \mathrm{Recall}(p) \tag{13} \]

\[ F\text{-}\mathrm{Score} = \frac{1}{N} \sum_{p \in \mathrm{DataSet}} F(p) \tag{14} \]
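A sketch of the B-CUBED computation, using hypothetical cluster assignments and gold affiliations in place of the real dataset:

```python
def b_cubed(cluster_of, affiliation_of):
    """B-CUBED precision, recall and F-score averaged over all names.
    cluster_of: person -> assigned cluster id;
    affiliation_of: person -> true community label."""
    people = list(cluster_of)
    precisions, recalls, fscores = [], [], []
    for p in people:
        same_cluster = [q for q in people if cluster_of[q] == cluster_of[p]]
        same_affil = [q for q in people if affiliation_of[q] == affiliation_of[p]]
        correct = [q for q in same_cluster
                   if affiliation_of[q] == affiliation_of[p]]
        prec = len(correct) / len(same_cluster)   # Eq. 9
        rec = len(correct) / len(same_affil)      # Eq. 10
        precisions.append(prec)
        recalls.append(rec)
        fscores.append(2 * prec * rec / (prec + rec))  # Eq. 11
    n = len(people)
    return sum(precisions) / n, sum(recalls) / n, sum(fscores) / n
```

Since every person trivially matches their own affiliation, the per-person precision and recall are always positive, so the F-score denominator never vanishes.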
4.8 Entity Disambiguation
Disambiguating named entities is important in various applications such as information retrieval [9], social network
extraction [19, 3, 4], Word Sense Disambiguation (WSD) [21],
citation matching [11] and cross-document co-reference resolution [28, 10].
\[ \frac{1}{|S_a||S_b|} \sum_{a \in S_a,\, b \in S_b} \mathrm{sim}(a, b) \tag{15} \]

4.9 Conclusion

robustly capture semantic similarity between named entities. In future research, we intend to apply the proposed semantic similarity measure in automatic synonym extraction, query suggestion and name alias recognition.

5. REFERENCES