Disclosure of Invention
The invention provides a cross-language query expansion method that realizes rule back-part mining through weight comparison. Applied to practical cross-language search engines and web cross-language information retrieval systems, the method can alleviate the problems of query topic drift and word mismatch in cross-language information retrieval and improve cross-language retrieval performance.
The technical scheme of the invention is as follows:
The cross-language query expansion method for realizing rule back-part mining through weight comparison comprises the following steps:
Step 1: Use the source-language query to retrieve target-language documents across languages, then construct and preprocess the initial-retrieval relevance feedback document set. The specific steps are as follows:
(1-1) Translate the source-language user query into the target language with a machine translation system, and retrieve the target-language text document collection with a vector space retrieval model to obtain the top-ranked target-language documents of the initial retrieval.
The machine translation system may be, for example, the Microsoft Translator API or the Google machine translation interface.
(1-2) Construct the initial-retrieval relevance feedback document set by performing relevance judgment on the top-ranked target-language documents of the initial retrieval.
(1-3) Preprocess the initial-retrieval relevance feedback document set, and construct a target-language text document index library and a feature word library.
The preprocessing comprises: removing stop words, extracting feature words, and calculating the feature word weights according to formula (1):
w_ij = tf_j,i × idf_j   (1)
In formula (1), w_ij denotes the weight of feature word t_j in document d_i, and tf_j,i denotes the frequency of feature word t_j in document d_i. The invention normalizes tf_j,i, which means dividing the tf_j,i of each feature word in document d_i by the maximum word frequency of document d_i; idf_j is the inverse document frequency.
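To make the preprocessing step concrete, the following is a minimal sketch of the formula (1) weight computation, assuming the normalized tf × idf product defined above; the function name, the document representation as token lists, and the log-based form of idf_j are illustrative assumptions, not prescribed by the invention.

```python
import math
from collections import Counter

def feature_word_weights(docs):
    """Sketch of formula (1): w_ij = (tf_ji / max tf in d_i) * idf_j.

    docs: one token list per feedback document, stop words already removed.
    Returns one {feature word: weight} dict per document.
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))       # document frequency
    idf = {t: math.log(n / d) for t, d in df.items()}       # an assumed idf_j form
    weights = []
    for doc in docs:
        tf = Counter(doc)
        max_tf = max(tf.values())                           # normalization divisor
        weights.append({t: (f / max_tf) * idf[t] for t, f in tf.items()})
    return weights
```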
Step 2: mining a frequent item set containing original query terms in an initial examination related feedback document set through item set weight value comparison, and pruning the item set by using an item set relevance value and the maximum item weight value or the maximum item weight value of the item set, wherein the method comprises the following specific steps:
(2-1) mining text feature word 1_ frequent item set L1The method comprises the following specific steps:
(2-1-1) extracting text characteristic words from the characteristic word library as 1_ candidate item set C1;
(2-1-2) scanning the target language text document index library, and counting the total number n of text documents and C1Term set weight w [ C ]1];
(2-1-3) calculating a minimum weight support threshold (MWS). The MWS calculation formula is shown in formula (2).
MWS=n×ms (2)
In formula (2), ms is the minimum support threshold, and n is the total number of text documents in the target-language text document index library.
(2-1-4) If w[C_1] ≥ MWS, then C_1 is a text feature word 1_frequent item set L_1 and is added to the frequent item set collection FIS.
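A sketch of steps (2-1-1) to (2-1-4) under the same assumptions, reusing the per-document weights from the preprocessing sketch; representing item sets as frozensets is an implementation choice of this sketch, not part of the method.

```python
def itemset_weight(itemset, doc_weights):
    """Item set weight w[C]: the item weights of C accumulated over the index library."""
    return sum(w.get(t, 0.0) for w in doc_weights for t in itemset)

def mine_1_frequent(doc_weights, ms):
    n = len(doc_weights)                 # total number of text documents
    mws = n * ms                         # formula (2): MWS = n * ms
    vocab = {t for w in doc_weights for t in w}
    l1 = [frozenset([t]) for t in vocab
          if itemset_weight(frozenset([t]), doc_weights) >= mws]   # step (2-1-4)
    return l1, mws
```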
(2-2) Mine the text feature word 2_frequent item sets L_2. The specific steps are as follows:
(2-2-1) Use the Apriori join method to join the text feature word 1_frequent item sets L_1, deriving multiple 2_candidate item sets C_2.
The Apriori join method is described in detail in the literature (Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216).
(2-2-2) Prune the 2_candidate item sets C_2 that do not contain the original query terms;
(2-2-3) For the remaining 2_candidate item sets C_2, scan the target-language text document index library and count the item set weight w[C_2] of each remaining C_2;
(2-2-4) If w[C_2] ≥ MWS × 2, then the 2_candidate item set C_2 is a text feature word 2_frequent item set L_2 and is added to the frequent item set collection FIS;
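A sketch of step (2-2), continuing the helpers above. The join shown is a simplified form of the Apriori join (any union of two (k-1)_frequent sets that has size k); query_terms, the set of original query terms in the target language, is an assumed input.

```python
from itertools import combinations

def apriori_join(prev_frequent, k):
    """Simplified Apriori join: unions of two (k-1)_frequent sets that have size k."""
    return {a | b for a, b in combinations(prev_frequent, 2) if len(a | b) == k}

def mine_2_frequent(l1, doc_weights, mws, query_terms):
    l2 = []
    for c2 in apriori_join(l1, 2):
        if not (c2 & query_terms):       # (2-2-2): prune sets without original query terms
            continue
        if itemset_weight(c2, doc_weights) >= mws * 2:   # (2-2-4)
            l2.append(c2)
    return l2
```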
(2-3) Mine the text feature word k_frequent item sets L_k, k ≥ 3. The specific steps are as follows:
(2-3-1) Use the Apriori join method to join the text feature word (k-1)_frequent item sets L_(k-1), deriving multiple k_candidate item sets C_k = (i_1, i_2, …, i_k), k ≥ 3;
(2-3-2) Scan the target-language text document index library, count the item set weight w[C_k] of each C_k and the maximum item weight w_m within each C_k, and obtain the item i_m corresponding to the maximum item weight w_m of each C_k, m ∈ (1, 2, …, k);
(2-3-3) If the 1_item set (i_m) corresponding to the item i_m is infrequent, or w_m < MWS, prune the corresponding C_k;
(2-3-4) For each remaining C_k, calculate its item set relevancy IRe(C_k). If w[C_k] ≥ MWS × k and IRe(C_k) ≥ minIRe, then C_k is a text feature word k_frequent item set L_k and is added to the frequent item set collection FIS, where minIRe is the minimum item set relevancy threshold. The IRe(C_k) calculation formula is shown in formula (3):
IRe(C_k) = w_min[(i_q)] / w_max[(i_p)]   (3)
In formula (3), the meanings of w_min[(i_q)] and w_max[(i_p)] are as follows: for the k_candidate item set C_k = (i_1, i_2, …, i_k), each item i_1, i_2, …, i_k taken alone as a 1_item set corresponds to (i_1), (i_2), …, (i_k); w_min[(i_q)] and w_max[(i_p)] denote the smallest and the largest 1_item set weight among (i_1), (i_2), …, (i_k), respectively, where q ∈ (1, 2, …, k) and p ∈ (1, 2, …, k);
(2-3-5) If the text feature word k_frequent item set L_k is an empty set, the mining of text feature word frequent item sets ends; go to step 3 below. Otherwise, add 1 to k and go to step (2-3-1) to continue the loop;
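One level of the step (2-3) loop as a sketch, again reusing the helpers above; the IRe computation uses the min/max item-weight ratio of formula (3) as reconstructed here.

```python
def mine_k_frequent(prev_lk, k, l1, doc_weights, mws, min_ire):
    """Join (k-1)_frequent sets, then apply the (2-3-3) and (2-3-4) pruning."""
    frequent_1 = set(l1)
    lk = []
    for ck in apriori_join(prev_lk, k):
        item_w = {i: itemset_weight(frozenset([i]), doc_weights) for i in ck}
        i_m = max(item_w, key=item_w.get)                   # (2-3-2): max item weight w_m
        if frozenset([i_m]) not in frequent_1 or item_w[i_m] < mws:
            continue                                        # (2-3-3)
        ire = min(item_w.values()) / max(item_w.values())   # formula (3)
        if itemset_weight(ck, doc_weights) >= mws * k and ire >= min_ire:
            lk.append(ck)                                   # (2-3-4)
    return lk
```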
Step 3: Use a chi-square analysis-confidence evaluation framework to mine, from each text feature word k_frequent item set L_k (k ≥ 2) in the frequent item set collection FIS, the text feature word weighted association rule patterns containing the original query terms. The specific steps are as follows:
Take any text feature word k_frequent item set L_k from the frequent item set collection FIS, and mine all association rule patterns containing the original query terms in each L_k according to the following steps.
(3-1) Construct the set of all proper subset item sets of L_k;
(3-2) Arbitrarily take two proper subset item sets q_t and E_t from the set of proper subset item sets, such that q_t ∪ E_t = L_k and q_t ⊆ Q_TL, where Q_TL is the term set of the original query in the target language and E_t is a feature term set that does not contain the original query terms. Compute the chi-square value of the item set (q_t, E_t); the calculation formula of the chi-square value Chis(q_t, E_t) is shown in formula (4).
In formula (4), w[(q_t)] is the item set weight of the item set q_t in the target-language text document index library, k_1 is the length of the item set q_t, w[(E_t)] is the item set weight of the item set E_t in the target-language text document index library, k_2 is the length of the item set E_t, w[(q_t, E_t)] is the item set weight of the item set (q_t, E_t) in the target-language text document index library, k_L is the length of the item set (q_t, E_t), and n is the total number of text documents in the target-language text document index library.
(3-3) If Chis(q_t, E_t) > 0, calculate the confidence WConf(q_t→E_t) of the text feature word weighted association rule. If WConf(q_t→E_t) ≥ the minimum confidence threshold mc, the association rule q_t→E_t is a strong weighted association rule pattern and is added to the weighted association rule pattern set WAR. The WConf(q_t→E_t) calculation formula is shown in formula (5).
In formula (5), w[(q_t)], k_1, w[(q_t, E_t)] and k_L are defined as in formula (4).
(3-4) If and only if every proper subset item set of L_k has been taken once, the mining of the text feature word weighted association rule patterns in this L_k ends; take another L_k from the frequent item set collection FIS and go to step (3-1) to mine the weighted association rule patterns of that L_k; otherwise, go to step (3-2) and execute the steps in order. Once every L_k in the frequent item set collection FIS has been taken out, the mining of all weighted association rule patterns ends; go to step 4 below.
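A sketch of the step 3 enumeration and threshold logic. Formulas (4) and (5) are given in the source as images, so they are passed in here as callables rather than guessed at; query_terms plays the role of Q_TL.

```python
from itertools import combinations

def mine_weighted_rules(fis, query_terms, chis, wconf, mc):
    """Mine strong weighted rules q_t -> E_t from every L_k (k >= 2) in FIS.

    chis(q_t, e_t) and wconf(q_t, e_t) implement formulas (4) and (5).
    """
    war = []
    for lk in (s for s in fis if len(s) >= 2):
        for r in range(1, len(lk)):                       # (3-1): all proper subsets
            for q in combinations(sorted(lk), r):
                q_t = frozenset(q)
                e_t = lk - q_t                            # (3-2): q_t ∪ E_t = L_k
                if not q_t <= query_terms or (e_t & query_terms):
                    continue                              # q_t ⊆ Q_TL, E_t query-free
                x = chis(q_t, e_t)
                if x > 0 and wconf(q_t, e_t) >= mc:       # (3-3)
                    war.append((q_t, e_t, x, wconf(q_t, e_t)))
    return war
```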
Step 4: Extract the weighted association rule back-parts E_t from the weighted association rule pattern set WAR as query expansion words, and calculate the expansion word weights.
Extract the back-part E_t of each weighted association rule q_t→E_t from the weighted association rule pattern set WAR as a query expansion word. The calculation formula of the expansion word weight w_e is shown in formula (6).
w_e = 0.5 × max(WConf()) + 0.3 × max(Chis()) + 0.2 × max(IRe())   (6)
In formula (6), max(WConf()), max(Chis()) and max(IRe()) denote the maximum values of the weighted association rule confidence, chi-square value and item set relevancy, respectively; that is, when an expansion word appears repeatedly in multiple weighted association rule patterns, the maximum of each of these 3 measures is taken.
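A sketch of the step 4 weight computation, consuming the (q_t, E_t, Chis, WConf) tuples produced above; ire_of is an assumed helper that returns the item set relevancy IRe of a rule's underlying item set.

```python
from collections import defaultdict

def expansion_word_weights(war, ire_of):
    """Formula (6): w_e = 0.5*max(WConf) + 0.3*max(Chis) + 0.2*max(IRe) per word."""
    best = defaultdict(lambda: [0.0, 0.0, 0.0])   # term -> [max WConf, max Chis, max IRe]
    for q_t, e_t, chis, wconf in war:
        ire = ire_of(q_t | e_t)
        for term in e_t:                          # every back-part item is an expansion word
            m = best[term]
            m[0] = max(m[0], wconf)
            m[1] = max(m[1], chis)
            m[2] = max(m[2], ire)
    return {t: 0.5 * m[0] + 0.3 * m[1] + 0.2 * m[2] for t, m in best.items()}
```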
Step 5: Combine the expansion words with the original query words to form a new query, and retrieve the target-language documents again to complete the cross-language query expansion.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a cross-language query expansion method that realizes rule back-part mining through weight comparison. The method mines the frequent item sets containing the original query terms in the initial-retrieval relevance feedback target-language document set through item set weight comparison, prunes candidate item sets with the item set relevancy value and the maximum item weight of the item set, mines the text feature word association rule patterns containing the original query terms from the frequent item sets with a chi-square analysis-confidence evaluation framework, and takes the back-part item sets of the rules whose antecedents are the original query term set as query expansion words, thereby realizing cross-language query expansion; the expansion words and the original query words are combined into a new query to retrieve the target-language documents again. Experimental results show that the invention can improve cross-language text information retrieval performance.
(2) The internationally used standard data set NTCIR-5 CLIR is selected as the experimental corpus of the method, and existing mining methods are selected as the comparison methods of the invention. Experimental results show that the cross-language text retrieval P@15 and mean R-precision values of the method of the invention are both higher than those of the comparison methods and the improvement is significant, indicating that the retrieval performance of the method of the invention is superior to that of the comparison methods; the method can improve cross-language information retrieval performance, reduce the problems of query drift and word mismatch in cross-language information retrieval, and has high application value and broad popularization prospects.
Detailed Description
The following describes embodiments of the method of the present invention in conjunction with the accompanying drawings; the description should not be construed as limiting the scope of the claims.
The following introduces concepts related to the present invention:
1. Antecedent (front-part) and back-part of a text feature word association rule
Let T_1 and T_2 be arbitrary sets of text feature terms. An implication of the form T_1→T_2 is called a text feature word association rule, where T_1 is called the rule antecedent (front-part) and T_2 is called the rule back-part (consequent).
2. Let DS = {d_1, d_2, …, d_n} be a set of text documents (Document Set, DS), where d_i (1 ≤ i ≤ n) is the i-th document in the document set DS, d_i = {t_1, t_2, …, t_m, …, t_p}, and t_m (m = 1, 2, …, p) is a document feature word item (feature item for short), generally consisting of a word, term or phrase. The feature item weight set corresponding to d_i is W_i = {w_i1, w_i2, …, w_im, …, w_ip}, where w_im is the weight corresponding to the m-th feature item t_m of the i-th document d_i. T = {t_1, t_2, …, t_n} denotes the set of global feature items in DS, and each subset of T is called a feature item set (item set for short).
Suppose the item set weight w[C_k] of a k_candidate item set C_k = (i_1, i_2, …, i_k) is counted in the text document index library, and the weights corresponding to the items i_1, i_2, …, i_k of C_k are w_1, w_2, …, w_k. Then w_1, w_2, …, w_k are called item weights, and the item set weight of C_k is w[C_k] = w_1 + w_2 + … + w_k.
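A tiny worked example of this definition: if the items i_1, i_2, i_3 of a 3_candidate item set C_3 have item weights 1.2, 0.8 and 0.5 accumulated over the index library, then w[C_3] = 1.2 + 0.8 + 0.5 = 2.5, as the snippet below confirms.

```python
item_weights = [1.2, 0.8, 0.5]   # w_1, w_2, w_3 of C_3 = (i_1, i_2, i_3)
print(sum(item_weights))         # item set weight w[C_3] = 2.5
```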
Example 1
As shown in Fig. 1, the cross-language query expansion method for realizing rule back-part mining through weight comparison comprises the following steps:
Step 1: Use the source-language query to retrieve target-language documents across languages, then construct and preprocess the initial-retrieval relevance feedback document set. The specific steps are as follows:
(1-1) Translate the source-language user query into the target language with a machine translation system, and retrieve the target-language text document collection with a vector space retrieval model to obtain the top-ranked target-language documents of the initial retrieval.
The machine translation system may be, for example, the Microsoft Translator API or the Google machine translation interface.
(1-2) Construct the initial-retrieval relevance feedback document set by performing relevance judgment on the top-ranked target-language documents of the initial retrieval.
(1-3) Preprocess the initial-retrieval relevance feedback document set, and construct a target-language text document index library and a feature word library.
The preprocessing of the initial-retrieval relevance feedback document set adopts a method appropriate to the language. For example, if the target language is English, the preprocessing is: remove English stop words, extract English feature word stems with the Porter stemmer (see http://tartarus.org/martin/PorterStemmer for details), and calculate the English feature word weights. If the target language is Chinese, the preprocessing is: remove Chinese stop words, segment the Chinese documents and extract Chinese feature words, and calculate the Chinese feature word weights.
The invention gives the calculation formula of the feature word weights for the initial-retrieval relevance feedback documents, shown in formula (1):
w_ij = tf_j,i × idf_j   (1)
In formula (1), w_ij denotes the weight of feature word t_j in document d_i, and tf_j,i denotes the frequency of feature word t_j in document d_i; tf_j,i is commonly normalized, which means dividing the tf_j,i of each feature word in document d_i by the maximum word frequency of document d_i; idf_j is the Inverse Document Frequency.
The source of the cross-language query expansion words is the cross-language initial-retrieval relevance feedback documents. Therefore, in the cross-language initial-retrieval relevance feedback document set, the more initial-retrieval relevance feedback documents contain a certain text feature word, the more related that feature word is to the original query, the more important it is, and the higher its weight should be.
Step 2: mining a frequent item set containing original query terms in an initial examination related feedback document set through item set weight value comparison, and pruning the item set by using an item set relevance value and the maximum item weight value or the maximum item weight value of the item set, wherein the method comprises the following specific steps:
(2-1) mining text feature word 1_ frequent item set L1The method comprises the following specific steps:
(2-1-1) extracting text characteristic words from the characteristic word library as 1_ candidate item set C1;
(2-1-2) scanning the target language text document index library, and counting the total number n of text documents and C1Term set weight w [ C ]1];
(2-1-3) calculating a minimum weight support threshold (MWS). The MWS calculation formula is shown in formula (2).
MWS=n×ms (2)
In the formula (2), ms is a minimum support threshold, and n is the total number of text documents in the target language text document index library.
(2-1-4) if w [ C ]1]Not less than MWS, then C1That is, the text feature word 1_ frequent item set L1Add to frequent itemset set fis (frequency itemset).
(2-2) Mine the text feature word 2_frequent item sets L_2. The specific steps are as follows:
(2-2-1) Use the Apriori join method to join the text feature word 1_frequent item sets L_1, deriving multiple 2_candidate item sets C_2.
The Apriori join method is described in detail in the literature (Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216).
(2-2-2) Prune the 2_candidate item sets C_2 that do not contain the original query terms;
(2-2-3) For the remaining 2_candidate item sets C_2, scan the target-language text document index library and count the item set weight w[C_2] of each remaining C_2;
(2-2-4) If w[C_2] ≥ MWS × 2, then the 2_candidate item set C_2 is a text feature word 2_frequent item set L_2 and is added to the frequent item set collection FIS;
(2-3) Mine the text feature word k_frequent item sets L_k, k ≥ 3. The specific steps are as follows:
(2-3-1) Use the Apriori join method to join the text feature word (k-1)_frequent item sets L_(k-1), deriving multiple k_candidate item sets C_k = (i_1, i_2, …, i_k), k ≥ 3;
(2-3-2) Scan the target-language text document index library, count the item set weight w[C_k] of each C_k and the maximum item weight w_m within each C_k, and obtain the item i_m corresponding to the maximum item weight w_m of each C_k, m ∈ (1, 2, …, k);
(2-3-3) If the 1_item set (i_m) corresponding to the item i_m is infrequent, or w_m < MWS, prune the corresponding C_k;
(2-3-4) For each remaining C_k, calculate its item set relevancy IRe(C_k). If w[C_k] ≥ MWS × k and IRe(C_k) ≥ minIRe, then C_k is a text feature word k_frequent item set L_k and is added to the frequent item set collection FIS; otherwise, prune the C_k.
minIRe is the minimum item set relevancy threshold. The IRe(C_k) calculation formula is shown in formula (3):
IRe(C_k) = w_min[(i_q)] / w_max[(i_p)]   (3)
In formula (3), the meanings of w_min[(i_q)] and w_max[(i_p)] are as follows: for the k_candidate item set C_k = (i_1, i_2, …, i_k), each item i_1, i_2, …, i_k taken alone as a 1_item set corresponds to (i_1), (i_2), …, (i_k); w_min[(i_q)] and w_max[(i_p)] denote the smallest and the largest 1_item set weight among (i_1), (i_2), …, (i_k), respectively, where q ∈ (1, 2, …, k) and p ∈ (1, 2, …, k);
(2-3-5) If the text feature word k_frequent item set L_k is an empty set, the mining of text feature word frequent item sets ends; go to step 3 below. Otherwise, add 1 to k and go to step (2-3-1) to continue the loop;
The pruning methods are as follows:
(1) For a k_candidate item set C_k = (i_1, i_2, …, i_k): if its item set weight w[C_k] < MWS × k, then C_k is infrequent and is pruned; if its item set relevancy IRe(C_k) < minIRe, then C_k is an invalid item set and is pruned. In summary, the invention only mines the C_k with w[C_k] ≥ MWS × k and IRe(C_k) ≥ minIRe, where minIRe is the minimum item set relevancy threshold.
(2) If the maximum item weight in the k_candidate item set C_k = (i_1, i_2, …, i_k) is smaller than the minimum weight support threshold MWS, then C_k is infrequent and is pruned;
(3) Let the item corresponding to the maximum item weight in the k_candidate item set C_k = (i_1, i_2, …, i_k), taken alone as a 1_item set, be (i_m). If the 1_item set (i_m) is infrequent, then C_k is pruned.
(4) When mining the candidate 2_item sets, the candidate 2_item sets that do not contain the original query terms are deleted, and the candidate 2_item sets containing the original query terms are retained.
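The four pruning strategies can be collected into a single predicate, sketched below under the same assumptions and helpers as the earlier sketches; a candidate is kept only if no strategy fires.

```python
def should_prune(ck, k, doc_weights, frequent_1, mws, min_ire, query_terms):
    """True if the k_candidate item set ck is pruned by strategies (1)-(4)."""
    if k == 2 and not (ck & query_terms):                 # strategy (4)
        return True
    item_w = {i: itemset_weight(frozenset([i]), doc_weights) for i in ck}
    i_m = max(item_w, key=item_w.get)
    if item_w[i_m] < mws:                                 # strategy (2)
        return True
    if frozenset([i_m]) not in frequent_1:                # strategy (3)
        return True
    if itemset_weight(ck, doc_weights) < mws * k:         # strategy (1): infrequent
        return True
    if min(item_w.values()) / max(item_w.values()) < min_ire:
        return True                                       # strategy (1): invalid item set
    return False
```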
And step 3: from each k _ frequent item set L in the frequent item set FIS using a Chi-Square analysis-confidence evaluation frameworkkAnd mining a text feature word weighting association rule mode containing the original query terms, wherein k is more than or equal to 2. The method comprises the following specific steps:
taking out any one text characteristic word k _ frequent item set L from frequent item set FISkDigging each L according to the following stepskAll of the association rule patterns containing the original query terms.
(3-1) construction of LkAll proper subset item set sets of (a);
(3-2) Arbitrarily take two proper subset item sets q_t and E_t from the set of proper subset item sets, such that q_t ∪ E_t = L_k and q_t ⊆ Q_TL, where Q_TL is the term set of the original query in the target language and E_t is a feature term set that does not contain the original query terms. Compute the chi-square value of the item set (q_t, E_t); the calculation formula of the chi-square value Chis(q_t, E_t) is shown in formula (4).
In formula (4), w[(q_t)] is the item set weight of the item set q_t in the target-language text document index library, k_1 is the length of the item set q_t, w[(E_t)] is the item set weight of the item set E_t in the target-language text document index library, k_2 is the length of the item set E_t, w[(q_t, E_t)] is the item set weight of the item set (q_t, E_t) in the target-language text document index library, k_L is the length of the item set (q_t, E_t), and n is the total number of text documents in the target-language text document index library.
The core idea of chi-square analysis is to measure the correlation between data items. If Chis(q_t, E_t) = 0, the two proper subset item sets q_t and E_t are independent of each other and have no correlation; therefore, the emergence of spuriously correlated association rules can be avoided.
(3-3) If Chis(q_t, E_t) > 0, calculate the confidence WConf(q_t→E_t) of the text feature word weighted association rule. If WConf(q_t→E_t) ≥ the minimum confidence threshold mc, the association rule q_t→E_t is a strong weighted association rule pattern and is added to the weighted association rule pattern set WAR. The WConf(q_t→E_t) calculation formula is shown in formula (5).
In formula (5), w[(q_t)], k_1, w[(q_t, E_t)] and k_L are defined as in formula (4).
(3-4) If and only if every proper subset item set of L_k has been taken once, the mining of the text feature word weighted association rule patterns in this L_k ends; take another L_k from the frequent item set collection FIS and go to step (3-1) to mine the weighted association rule patterns of that L_k; otherwise, go to step (3-2) and execute the steps in order. Once every L_k in the frequent item set collection FIS has been taken out, the mining of all weighted association rule patterns ends; go to step 4 below.
Step 4: Extract the weighted association rule back-parts E_t from the weighted association rule pattern set WAR as query expansion words, and calculate the expansion word weights.
Extract the back-part E_t of each weighted association rule q_t→E_t from the weighted association rule pattern set WAR as a query expansion word. Since the item set relevancy is an important index for measuring the degree of association among the items of an item set, and the confidence and chi-square values are important indexes for measuring the correlation between the antecedent and back-part of an association rule pattern, the invention takes the relevancy, chi-square and confidence values as the basis for calculating the expansion word weight and, according to the importance of these 3 measures to the expansion word, proposes the expansion word weight w_e calculation formula shown in formula (6):
w_e = 0.5 × max(WConf()) + 0.3 × max(Chis()) + 0.2 × max(IRe())   (6)
In formula (6), max(WConf()), max(Chis()) and max(IRe()) denote the maximum values of the weighted association rule confidence, chi-square value and item set relevancy, respectively; that is, when an expansion word appears repeatedly in multiple weighted association rule patterns, the maximum of each of these 3 measures is taken.
Step 5: Combine the expansion words with the original query words to form a new query, and retrieve the target-language documents again to complete the cross-language query expansion.
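A sketch of the step 5 query reconstruction: the expansion words with their formula (6) weights are merged with the original query term weights before the second retrieval. How a term occurring in both sets is weighted is not fixed by the text; keeping the original query weight is an assumption of this sketch.

```python
def build_expanded_query(original_query, expansion_words):
    """Merge {term: weight} dicts of the original query and the expansion words."""
    new_query = dict(expansion_words)
    new_query.update(original_query)   # original query terms keep their own weights
    return new_query
```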
Experimental design and results:
To illustrate the effectiveness of the method, Indonesian and English are taken as the language objects, Indonesian-English cross-language information retrieval experiments based on the method of the invention and the comparison methods are carried out, and the cross-language retrieval performance of the method of the invention and the comparison methods is compared.
The experimental corpora:
The experimental corpus of the invention is the standard data set NTCIR-5 CLIR (see http://research.nii.ac.jp/ntcir/permission/ntcir-5/perm-en-CLIR.html), namely the news texts of the English document sets Mainichi Daily News 2000, Mainichi Daily News 2001 and Korea Times 2001 in the NTCIR-5 CLIR corpus, 26224 English documents in total, specifically 6608 news texts of Mainichi Daily News 2000 (m00), 5547 news texts of Mainichi Daily News 2001 (m01), and 14069 news texts of Korea Times 2001 (k01).
The NTCIR-5 CLIR corpus comprises a document test set, 50 query topic sets and the corresponding result sets. Each query topic has 4 types, namely Title, Desc, Narr and Conc, and the result sets have 2 evaluation criteria, namely the Rigid criterion (highly relevant) and the Relax criterion (highly relevant, relevant and partially relevant). The invention selects the Title and Desc query topic types for the experiments: the Title query is a short query that briefly describes the query topic with nouns and noun phrases, while the Desc query is a long query that briefly describes the query topic in sentence form.
The evaluation indexes of the experimental results are P@15 and mean R-precision. P@15 is the precision over the first 15 results returned for a test query; mean R-precision is the arithmetic mean of the R-precision over all queries, where R-precision is the precision computed after R documents have been retrieved.
The comparison method comprises the following steps:
(1) Comparison method 1: Indonesian-English cross-language baseline retrieval. Comparison method 1 refers to the retrieval results obtained by translating the Indonesian query into English with machine translation and then retrieving the English documents, without any query expansion during retrieval.
(2) Comparison method 2: Indonesian-English cross-language query post-translation expansion based on weighted association pattern mining. Comparison method 2 is the cross-language query expansion method of the literature (Huang Mingxuan. Vietnamese-English cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-). The experimental parameters are: minimum confidence threshold mc = 0.01, minimum interestingness threshold mi = 0.0001, and minimum support threshold ms = 0.007, 0.008, 0.009, 0.01 and 0.011.
(3) Comparison method 3: Indonesian-English cross-language query post-translation expansion based on pseudo-relevance feedback, i.e., the cross-language query expansion method of the literature (Wu Dan et al. Cross-language query expansion based on pseudo-relevance feedback [J]. Journal of the China Society for Scientific and Technical Information, 2010, 29(2): 232-). The experimental method is: extract the top 20 English documents of the Indonesian-English cross-language initial retrieval to construct the initial-retrieval relevant document set, extract the feature terms and calculate their weights, and take the top 20 feature terms in descending order of weight as the English expansion words to realize Indonesian-English cross-language query post-translation expansion.
The experimental methods and results are as follows:
The source programs of the method of the invention and the comparison methods are run. First, the Title and Desc queries of the 50 Indonesian query topics are translated into English by the machine translation system and the English documents are retrieved, realizing Indonesian-English cross-language information retrieval. In the experiments, user relevance feedback is performed on the top 50 English documents of the cross-language initial retrieval to obtain the initial-retrieval user relevance feedback documents (for simplicity, the documents among the top 50 initially retrieved documents that are relevant according to the known result sets are regarded as the initial-retrieval relevant documents); the mining method is then run to obtain the association rule patterns, and the rule back-parts are extracted as expansion words to realize cross-language query expansion. The Indonesian-English cross-language retrieval P@15 and mean R-precision of the method of the invention and the comparison methods are shown in Tables 1 and 2, respectively. Up to 3_item sets are mined in the experiments, and the experimental parameters of the method of the invention are: minimum confidence threshold mc of 0.5, 0.6, 0.7, 0.8 and 0.9, minimum support threshold ms = 0.5, and minimum item set relevancy threshold minIRe = 0.4.
Table 1. Retrieval performance comparison of the method of the invention and the comparison methods (Title query topics)
Table 2. Retrieval performance comparison of the method of the invention and the comparison methods (Desc query topics)
Tables 1 and 2 show that the cross-language retrieval P@15 and mean R-precision values of the method of the invention are higher than those of the 3 comparison methods, and the improvement is significant. The experimental results show that the method of the invention is effective, can indeed improve cross-language information retrieval performance, and has high application value and broad popularization prospects.