CN104090918B - Sentence similarity calculation method based on information amount - Google Patents
Sentence similarity calculation method based on information amount

- Publication number: CN104090918B
- Application number: CN201410268361.4A
- Authority: CN (China)
- Legal status: Active
Classifications

- G06F40/30: Semantic analysis (hierarchy: G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; G06F40 Handling natural language data)
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a sentence similarity calculation method based on information content, comprising the following steps: first, the sense of each word is determined as the concept carrying the maximum information content between the words of the two sentences; next, the hierarchical structure of a semantic network and corpus statistics are used to compute the information content of individual words and the common information content shared by multiple words; the inclusion-exclusion principle from combinatorics is then applied to compute the total information content of multiple words, yielding the information content of each sentence as well as the combined information content of both sentences; finally, the sentence similarity is defined and computed according to the Jaccard similarity principle. The method closely simulates human judgments of sentence similarity, requires no corpus-trained or empirical parameters, does not depend on corpus size, and needs no other natural language processing techniques such as part-of-speech tagging. Its time performance is excellent: for sentence pairs of ordinary length, near-real-time efficiency is achieved on current mainstream multi-core PCs.
Description
Technical Field

The invention relates to a sentence similarity calculation method, and in particular to a sentence similarity calculation method based on information content, belonging to the technical field of natural language processing.
Background

Sentence or short-text similarity calculation is an important research topic in natural language processing, and in recent years it has played an increasingly important role in application areas such as information retrieval, machine translation, question answering, and automatic summarization. Traditional approaches mostly adapt document-similarity methods and treat the words of a sentence as mutually unrelated, meaningless symbols, which is not accurate enough for sentences containing only a few words. The hybrid methods in common use typically require training parameters on related data sets or rely on empirical parameters; their drawback is dependence on the training data, which limits their generality.
Summary of the Invention

The purpose of the present method is to solve the above problems by providing a sentence similarity calculation method based on information content. By using information content, an essential property of language, and applying the inclusion-exclusion principle to obtain the exact total information content of multiple words, the method produces sentence similarity results closer to human subjective judgment.

The idea of the method is as follows: first, the sense of each word is determined as the concept carrying the maximum information content between the words of the two sentences; then the hierarchical structure of a semantic network (such as WordNet) and corpus statistics (such as the BNC or Brown corpus) are used to compute the information content of individual words and the common information content shared by multiple words; next, the inclusion-exclusion principle from combinatorics is applied to compute the total information content of multiple words, yielding the information content of each sentence and the combined information content of both sentences; finally, the sentence similarity is defined and computed according to the Jaccard similarity principle.
To achieve the above object, the technical solution adopted by the present invention is:

A sentence similarity calculation method based on information content, comprising the following steps:

Step 1: Input the two sentences $s_a$ and $s_b$ to be compared, written as:

$$s_a = \{w_1^a, w_2^a, \ldots, w_n^a\}, \qquad s_b = \{w_1^b, w_2^b, \ldots, w_m^b\}$$

where $w_i^a$ and $w_i^b$ denote the $i$-th word of $s_a$ and $s_b$ respectively, and $n$ and $m$ denote the numbers of words in $s_a$ and $s_b$.
Step 2: Select the sense of each word in the input sentences, as follows.

The sense $c_i^a$ of word $w_i^a$ is determined according to Formula 1:

[Formula 1]
$$c_i^a = \operatorname*{arg\,max}_{c_1 \in concepts(w_i^a)} \; \max_{c_2 \in concepts(s_b)} \; \max_{c \in subsum(c_1, c_2)} \bigl(-\log P(c)\bigr)$$

where $subsum(c_1, c_2)$ is the set of all concepts in the semantic network that subsume both $c_1$ and $c_2$, $concepts(w_i^a)$ is the set of all concepts in the semantic network containing the word $w_i^a$, $concepts(s_b)$ is the set of concepts in the semantic network containing any word of sentence $s_b$, and $P(c)$ is the frequency of concept $c$ in the corpus. As a special case, if $P(c)$ is 0, then $\log P(c)$ is taken as 0. The value of $P(c)$ is determined by Formula 2:

[Formula 2]
$$P(c) = \sum_{w \in words(c)} count(w) \,/\, N$$

where $words(c)$ is the set of all words in concept $c$ and in all sub-concepts of $c$ in the semantic network, $count(w)$ is the frequency of word $w$ in the corpus, and $N$ is the sum of the frequencies of all concepts in the semantic network, the frequency of each concept being the total corpus frequency of all the words it contains.
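Formula 2 can be illustrated with a small sketch. The toy taxonomy, the word attachments, and the corpus counts below are entirely hypothetical; only the construction of $words(c)$, $N$, and $P(c)$ follows the definitions above.

```python
from typing import Dict, List

# A toy IS-A hierarchy: concept -> direct sub-concepts (hypothetical).
children: Dict[str, List[str]] = {
    "entity": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "vehicle": ["car"],
    "dog": [], "cat": [], "car": [],
}

# Words attached directly to each concept, with hypothetical corpus counts.
direct_words: Dict[str, Dict[str, int]] = {
    "entity": {"entity": 10},
    "animal": {"animal": 40},
    "vehicle": {"vehicle": 20},
    "dog": {"dog": 80, "puppy": 15},
    "cat": {"cat": 60},
    "car": {"car": 120, "automobile": 25},
}

def words_of(c: str) -> Dict[str, int]:
    """words(c): all words of concept c and of its sub-concepts, with counts."""
    acc = dict(direct_words[c])
    for sub in children[c]:
        acc.update(words_of(sub))  # word sets are disjoint in this toy example
    return acc

# N: sum of every concept's frequency, where a concept's frequency is the
# total corpus count of all the words it covers, as defined above.
N = sum(sum(words_of(c).values()) for c in children)

def P(c: str) -> float:
    """Formula 2: P(c) = sum over w in words(c) of count(w), divided by N."""
    return sum(words_of(c).values()) / N
```

As expected, a concept higher in the hierarchy covers more words and therefore has a higher frequency and lower information content: here `P("entity") > P("animal") > P("dog")`.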
Similarly, replacing $w_i^a$ with $w_i^b$ and $concepts(s_b)$ with $concepts(s_a)$ in Formula 1 yields the sense $c_i^b$ of the $i$-th word of sentence $s_b$.

Once the senses are determined, the sentences $s_a$ and $s_b$ can be written as:

$$s_a = \{c_1^a, c_2^a, \ldots, c_n^a\}, \qquad s_b = \{c_1^b, c_2^b, \ldots, c_m^b\}$$
Step 3: Using the sense-disambiguated sentences from Step 2, apply the inclusion-exclusion principle from combinatorics to compute the information content of $s_a$, of $s_b$, and of the two sentences combined, as follows.

The information content $IC(s_a)$ of sentence $s_a$ is computed by Formula 3:

[Formula 3]
$$IC(s_a) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le n} IC\bigl(c_{i_1}^a \cap \cdots \cap c_{i_k}^a\bigr)$$

where the intersections are taken in the semantic information space jointly constructed from the hierarchical structure of the semantic network and corpus statistics, and the information content of an intersection of concepts is computed by Formula 4:

[Formula 4]
$$IC\bigl(c_{i_1}^a \cap \cdots \cap c_{i_k}^a\bigr) = \max_{c \,\in\, \bigcap_{j=1}^{k} hyper(c_{i_j}^a)} \bigl(-\log P(c)\bigr)$$

where $hyper(c)$ is the set of all concepts in the semantic network that subsume concept $c$.

Similarly, replacing every index $a$ with $b$ and $n$ with $m$ in Formulas 3 and 4 yields the information content of sentence $s_b$.

Treating the set of all distinct words of sentences $s_a$ and $s_b$ as a new sentence, the total information content $IC(s_a \cup s_b)$ of the two sentences is obtained by Formula 5:

[Formula 5]
$$IC(s_a \cup s_b) = \sum_{k=1}^{p} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le p} IC\bigl(c_{i_1} \cap \cdots \cap c_{i_k}\bigr)$$

where $p$ is the total number of distinct words in sentences $s_a$ and $s_b$.
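The inclusion-exclusion computation of Formulas 3 to 5 can be sketched as follows. The hypernym closures and concept probabilities below are hypothetical stand-ins for the semantic network and corpus statistics; the alternating-sign sum over all non-empty word subsets is the part that follows the formulas above.

```python
import math
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Hypothetical hypernym closures: concept -> set of concepts subsuming it
# (including itself), with assumed probabilities P(c).
hyper: Dict[str, Set[str]] = {
    "dog": {"dog", "canine", "animal", "entity"},
    "cat": {"cat", "feline", "animal", "entity"},
    "car": {"car", "vehicle", "entity"},
}
prob: Dict[str, float] = {
    "dog": 0.05, "canine": 0.08, "animal": 0.2, "entity": 1.0,
    "cat": 0.04, "feline": 0.07, "car": 0.1, "vehicle": 0.3,
}

def ic_intersection(concepts: Tuple[str, ...]) -> float:
    """Formula 4: the information of an intersection is the largest -log P(c)
    over the concepts that subsume every member of the group."""
    common = set.intersection(*(hyper[c] for c in concepts))
    if not common:
        return 0.0
    return max(-math.log(prob[c]) for c in common)

def total_ic(concepts: List[str]) -> float:
    """Formulas 3/5: inclusion-exclusion over all non-empty subsets."""
    total = 0.0
    for k in range(1, len(concepts) + 1):
        sign = (-1) ** (k - 1)
        for subset in combinations(concepts, k):
            total += sign * ic_intersection(subset)
    return total
```

For example, `total_ic(["dog", "cat"])` adds the information of each word and subtracts their shared information (here carried by the common subsumer "animal"), so the total is less than the plain sum of the two word information values.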
Step 4: From the relationship between union and intersection, define the common information content $COMMONIC(s_a, s_b)$ of the two sentences $s_a$ and $s_b$, computed by Formula 6:

[Formula 6]
$$COMMONIC(s_a, s_b) = IC(s_a) + IC(s_b) - IC(s_a \cup s_b)$$

Step 5: According to the Jaccard similarity principle, define the similarity $sim(s_a, s_b)$ of sentences $s_a$ and $s_b$, computed by Formula 7:

[Formula 7]
$$sim(s_a, s_b) = \frac{COMMONIC(s_a, s_b)}{IC(s_a \cup s_b)}$$

Step 6: Output the similarity $sim(s_a, s_b)$ of the two sentences.
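The final two steps combine the quantities from Step 3. A minimal sketch, assuming the Jaccard form (common information divided by total information) and treating the zero-information case as a convention not specified in the source:

```python
def common_ic(ic_a: float, ic_b: float, ic_union: float) -> float:
    """Formula 6: shared information via the union/intersection relationship."""
    return ic_a + ic_b - ic_union

def sim(ic_a: float, ic_b: float, ic_union: float) -> float:
    """Formula 7 (assumed Jaccard form): common information / total information."""
    if ic_union == 0:
        return 1.0  # convention chosen here: two informationless sentences match
    return common_ic(ic_a, ic_b, ic_union) / ic_union
```

Sanity checks: two identical sentences, whose union adds no information, score 1.0 (`sim(5.0, 5.0, 5.0)`), while two sentences whose information simply adds up share nothing and score 0.0 (`sim(5.0, 4.0, 9.0)`).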
Beneficial Effects

Compared with the prior art, the method of the present invention further improves the accuracy of hybrid methods and closely simulates human judgments of sentence similarity. It requires no corpus-trained or empirical parameters, does not depend on corpus size, and needs no other natural language processing techniques such as part-of-speech tagging, so it can be applied directly to sentence similarity calculation with good generality. Its time performance is excellent: for sentence pairs of ordinary length, near-real-time efficiency is achieved on current mainstream multi-core PCs.
Description of the Drawings

Figure 1 is a flow chart of the method of the present invention.

Figure 2 compares the method of the present invention with other methods.

Detailed Description

The implementation of the present invention is described in detail below with reference to the accompanying drawings.

As shown in Figure 1, the method of the present invention consists of five main steps:
Step 1: Input the two sentences to be compared, written as:

$$s_a = \{w_1^a, w_2^a, \ldots, w_n^a\}, \qquad s_b = \{w_1^b, w_2^b, \ldots, w_m^b\}$$

where $w_i^a$ and $w_i^b$ denote the $i$-th word of $s_a$ and $s_b$ respectively, and $n$ and $m$ denote the numbers of words in $s_a$ and $s_b$.
Step 2: Because polysemy is pervasive, selecting a sense for each word of the input sentences removes the semantic ambiguity of the sentences and prepares for the subsequent similarity calculation. The procedure is as follows:

(1) Take one word from each of the two input sentences to form a word pair;

(2) Replace each word with its senses (concepts) in the semantic network (such as WordNet), turning the word pair into multiple sense pairs;

(3) Compute the common information content of each sense pair;

(4) Take the maximum common information content over the sense pairs as the common information content of the word pair;

(5) Sort the common information content of all word pairs in descending order;

(6) Define two arrays recording the sense of each word in each sentence, and initialize every element to "sense undetermined";

(7) Process the word pairs in the order of (5): if a word's sense has not yet been fixed, i.e. its element in the array is still "sense undetermined", record the sense of the word corresponding to this common information content into the sense array;

(8) Stop when no array element remains undetermined, i.e. the senses of all words have been fixed.
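The greedy sense-selection procedure above can be sketched as follows. The helper names `senses_of` and `common_ic_pair` are placeholders for the semantic-network sense lookup and the sense-pair information computation; only the pairing, sorting, and first-come sense fixing follow the steps above.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

def select_senses(
    words_a: List[str],
    words_b: List[str],
    senses_of: Callable[[str], List[str]],
    common_ic_pair: Callable[[str, str], float],
) -> Tuple[Dict[str, Optional[str]], Dict[str, Optional[str]]]:
    """Greedy word-sense selection: score every cross-sentence word pair by its
    best sense pair, then fix senses in descending score order, skipping words
    whose sense is already determined."""
    scored = []
    for wa, wb in product(words_a, words_b):
        # Best sense pair for this word pair (steps 2-4).
        score, ca, cb = max(
            (common_ic_pair(sa, sb), sa, sb)
            for sa in senses_of(wa)
            for sb in senses_of(wb)
        )
        scored.append((score, wa, ca, wb, cb))

    # Sense arrays initialized to "undetermined" (step 6), here None.
    sense_a: Dict[str, Optional[str]] = {w: None for w in words_a}
    sense_b: Dict[str, Optional[str]] = {w: None for w in words_b}

    # Fix senses in descending order of common information (steps 5, 7, 8).
    for _, wa, ca, wb, cb in sorted(scored, key=lambda t: -t[0]):
        if sense_a[wa] is None:
            sense_a[wa] = ca
        if sense_b[wb] is None:
            sense_b[wb] = cb
    return sense_a, sense_b
```

For instance, with hypothetical senses `bank_fin`/`bank_river` for "bank" and a single sense for "river", a higher sense-pair score for `(bank_river, river_1)` makes the procedure fix "bank" to its river-bank sense.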
The sense $c_i^a$ of word $w_i^a$ is then determined according to Formula 1:

[Formula 1]
$$c_i^a = \operatorname*{arg\,max}_{c_1 \in concepts(w_i^a)} \; \max_{c_2 \in concepts(s_b)} \; \max_{c \in subsum(c_1, c_2)} \bigl(-\log P(c)\bigr)$$

where

[Formula 2]
$$P(c) = \sum_{w \in words(c)} count(w) \,/\, N$$

Here $subsum(c_1, c_2)$ is the set of all concepts in the semantic network that subsume both $c_1$ and $c_2$, $concepts(w_i^a)$ is the set of all concepts in the semantic network containing the word $w_i^a$, $concepts(s_b)$ is the set of concepts in the semantic network containing any word of sentence $s_b$, and $P(c)$ is the frequency of concept $c$ in a corpus. As a special case, if $P(c)$ is 0, then $\log P(c)$ is taken as 0.

$words(c)$ is the set of all words in concept $c$ and in all sub-concepts of $c$ in the semantic network, $count(w)$ is the frequency of word $w$ in a corpus, and $N$ is the sum of the frequencies of all concepts in the semantic network, the frequency of each concept being the total corpus frequency of all the words it contains.
Similarly, the sense $c_i^b$ of the $i$-th word of sentence $s_b$ can be obtained. Once the senses are determined, the sentences $s_a$ and $s_b$ can be written as:

$$s_a = \{c_1^a, \ldots, c_n^a\}, \qquad s_b = \{c_1^b, \ldots, c_m^b\}$$
Step 3: Using the sense-disambiguated sentences from Step 2, apply the inclusion-exclusion principle from combinatorics to compute the information content of each sentence. The information content of sentence $s_a$ is computed by Formula 3:

[Formula 3]
$$IC(s_a) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le n} IC\bigl(c_{i_1}^a \cap \cdots \cap c_{i_k}^a\bigr)$$

where

[Formula 4]
$$IC\bigl(c_{i_1}^a \cap \cdots \cap c_{i_k}^a\bigr) = \max_{c \,\in\, \bigcap_{j=1}^{k} hyper(c_{i_j}^a)} \bigl(-\log P(c)\bigr)$$

Here $c_{i_1}^a \cap \cdots \cap c_{i_k}^a$ denotes an intersection of concepts in the semantic information space jointly constructed from the hierarchical structure of the semantic network and corpus statistics, $hyper(c)$ is the set of all concepts in the semantic network that subsume concept $c$, and for any $k$ between 1 and $n$, $\{c_{i_1}^a, \ldots, c_{i_k}^a\}$ is one combination of $k$ words taken from sentence $s_a$.

Similarly, replacing every index $a$ with $b$ and $n$ with $m$ in Formulas 3 and 4 yields the information content of sentence $s_b$.

Treating the set of all distinct words of sentences $s_a$ and $s_b$ as a new sentence, the total information content of the two sentences is obtained by Formula 5:

[Formula 5]
$$IC(s_a \cup s_b) = \sum_{k=1}^{p} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le p} IC\bigl(c_{i_1} \cap \cdots \cap c_{i_k}\bigr)$$

where $p$ is the total number of distinct words in the two sentences.
Step 4: From the respective information content of the two sentences and their total information content obtained in Step 3, the common information content $COMMONIC(s_a, s_b)$ of the two sentences follows from the relationship between union and intersection, as Formula 6 shows:

[Formula 6]
$$COMMONIC(s_a, s_b) = IC(s_a) + IC(s_b) - IC(s_a \cup s_b)$$

Step 5: From the total and common information content obtained in Steps 3 and 4, and according to the Jaccard similarity principle, the similarity of the two sentences $s_a$ and $s_b$ is computed by Formula 7:

[Formula 7]
$$sim(s_a, s_b) = \frac{COMMONIC(s_a, s_b)}{IC(s_a \cup s_b)}$$
As described above, the present invention provides a sentence similarity calculation method based on information content. Given a real sentence pair entered by the user, the system automatically computes a result that closely matches human judgments of sentence similarity.

The method of the present invention is compared below with four existing sentence similarity calculation methods. The experiments use the WordNet semantic network and the BNC corpus. Evaluation uses the Pearson correlation coefficient (PCC) for linear correlation, the Spearman rank correlation coefficient (SRCC) for general rank correlation, and the Kendall rank correlation coefficient (KRCC) for probabilistic rank correlation. Table 1 lists the scores assigned by the various methods to the sentence pairs corresponding to the word pairs.

Table 1: Human scores and method scores for the sentence pairs

Figure 2 is derived from Table 1. As Figure 2 shows, the present method outperforms the other four methods on PCC, SRCC, and KRCC, indicating that the model's judgment of sentence similarity is closer to human subjective judgment. In addition, on the manually annotated data set, the mean PCC between an individual volunteer's scores and the mean scores of all volunteers is 0.825, with a maximum of 0.921; the PCC of the present method lies above the single-annotator average and below the single-annotator maximum, indicating that the model judges sentence similarity better than the average human annotator and that the results are credible.
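The evaluation statistics above are standard; for reference, two of them can be computed with short pure-Python functions (in practice, library routines such as those in `scipy.stats` would typically be used instead). This is an illustrative sketch, not part of the patented method.

```python
import math
from typing import Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (PCC) between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kendall(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall rank correlation coefficient (KRCC), tau-a variant (no ties)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Two perfectly linearly related score lists give PCC = 1.0, and two fully reversed rankings give KRCC = -1.0, which is how a method's scores are compared against human scores in Table 1.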
The specific description above has further detailed the purpose, technical solution, and beneficial effects of the invention. It should be understood that the above is only a specific embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410268361.4A CN104090918B (en) | 2014-06-16 | 2014-06-16 | Sentence similarity calculation method based on information amount |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410268361.4A CN104090918B (en) | 2014-06-16 | 2014-06-16 | Sentence similarity calculation method based on information amount |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104090918A CN104090918A (en) | 2014-10-08 |
CN104090918B true CN104090918B (en) | 2017-02-22 |
Family
ID=51638634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410268361.4A Active CN104090918B (en) | 2014-06-16 | 2014-06-16 | Sentence similarity calculation method based on information amount |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104090918B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105354186A (en) * | 2015-11-05 | 2016-02-24 | 同济大学 | News event extraction method and system |
CN106780061A (en) * | 2016-11-30 | 2017-05-31 | 华南师范大学 | Social network user analysis method and device based on comentropy |
CN108628883B (en) * | 2017-03-20 | 2021-03-16 | 北京搜狗科技发展有限公司 | Data processing method and device and electronic equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102945228A (en) * | 2012-10-29 | 2013-02-27 | 广西工学院 | Multi-document summarization method based on text segmentation |
CN103136359A (en) * | 2013-03-07 | 2013-06-05 | 宁波成电泰克电子信息技术发展有限公司 | Generation method of single document summaries |
CN103699529A (en) * | 2013-12-31 | 2014-04-02 | 哈尔滨理工大学 | Method and device for fusing machine translation systems by aid of word sense disambiguation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4299963B2 (en) * | 2000-10-02 | 2009-07-22 | ヒューレット・パッカード・カンパニー | Apparatus and method for dividing a document based on a semantic group |
US9189482B2 (en) * | 2012-10-10 | 2015-11-17 | Abbyy Infopoisk Llc | Similar document search |
- 2014-06-16: Application CN201410268361.4A filed; patent CN104090918B granted and active
Non-Patent Citations (3)

- Li Haifang et al., "A query expansion method based on vague synonyms," Computer Applications and Software, vol. 28, no. 12, pp. 41-43, 47, Dec. 15, 2011.
- Cheng Chuanpeng et al., "A sentence similarity calculation method based on HowNet," Computer Engineering and Science, vol. 34, no. 2, pp. 172-175, Feb. 15, 2012.
- Jin Chunxia, "Applied research on multi-level structural sentence similarity calculation," Computer Applications and Software, vol. 26, no. 10, pp. 180-182, 202, Oct. 15, 2009.
Also Published As
Publication number | Publication date |
---|---|
CN104090918A (en) | 2014-10-08 |
Legal Events

- C06 / PB01: Publication
- C10 / SE01: Entry into substantive examination
- C14 / GR01: Patent grant