
CN109684464B - A Cross-Language Query Expansion Method for Rule Consequence Mining Through Weight Comparison - Google Patents


Info

Publication number
CN109684464B
CN109684464B CN201811646511.5A
Authority
CN
China
Prior art keywords
item set
text
item
weight
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811646511.5A
Other languages
Chinese (zh)
Other versions
CN109684464A
Inventor
黄名选 (Huang Mingxuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics filed Critical Guangxi University of Finance and Economics
Priority to CN201811646511.5A priority Critical patent/CN109684464B/en
Publication of CN109684464A publication Critical patent/CN109684464A/en
Application granted granted Critical
Publication of CN109684464B publication Critical patent/CN109684464B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-language query expansion method that realizes rule consequent mining through weight comparison. First, a cross-language initial retrieval constructs an initial-retrieval relevant feedback document set; frequent item sets containing the original query terms are then mined from this document set, and the candidate item sets are pruned using the item set relevancy value and either the item bearing the largest item weight or the largest item weight value. A chi-square analysis-confidence evaluation framework is used to mine, from the frequent item sets, text feature word association rule patterns containing the original query terms, and the consequent item sets of association rules whose antecedents are sets of original query terms are taken as query expansion words, realizing cross-language query expansion. The invention overcomes the defects of existing weighted association rule mining methods, improves mining efficiency, mines expansion words related to the original query, improves cross-language information retrieval performance, and reduces query topic drift and word mismatch in retrieval. It has good application value and promotion prospects in cross-language search engines and web cross-language retrieval systems.

Description

Cross-language query expansion method for realizing rule consequent mining through weight comparison
Technical Field
The invention belongs to the field of information retrieval, and particularly relates to a cross-language query expansion method for realizing rule consequent mining through weight comparison.
Background
At present, network information resources characterized by multiple languages are growing rapidly and have become network big data with huge hidden economic and research value. When network users search these big data resources for information in other languages using query expressions in a language familiar to them, they encounter serious problems such as query topic drift and word mismatch; cross-language query expansion is one of the key technologies for solving these problems.
Cross-language query expansion is one of the core technologies for improving cross-language information retrieval performance and can alleviate the problems, such as serious query topic drift and word mismatch, that have long troubled cross-language information retrieval. It refers to the process, during cross-language information retrieval, of finding expansion words related to the original query with a certain strategy, combining the expansion words with the original query into a new query, and retrieving again. In recent decades, scholars have conducted highly effective research on cross-language query expansion methods and obtained some research results, for example, a cross-language query expansion method based on latent semantic analysis (Computer Engineering, 2009, 35(10): 49-53) and a cross-language query expansion method based on pseudo-relevance feedback proposed by Wu Dan et al. (2010, 29(2): 232-).
Disclosure of Invention
The invention provides a cross-language query expansion method for realizing rule consequent mining through weight comparison. Applied to actual cross-language search engines and web cross-language information retrieval systems, it can alleviate query topic drift and word mismatch in cross-language information retrieval and improve cross-language retrieval performance.
The technical scheme of the invention is as follows:
The cross-language query expansion method for realizing rule consequent mining through weight comparison comprises the following steps:
Step 1: The source language query first retrieves target language documents across languages, and an initial-retrieval relevant feedback document set is constructed and preprocessed. The specific steps are as follows:
(1-1) Translate the source language user query into the target language through a machine translation system, and retrieve the target language text document set with a vector space retrieval model to obtain the top-ranked target language documents of the initial retrieval.
The machine translation system may be, for example, the Microsoft Translator API or the Google machine translation interface.
(1-2) Construct the initial-retrieval relevant feedback document set by performing relevance judgment on the top-ranked target language documents of the initial retrieval.
(1-3) Preprocess the initial-retrieval relevant feedback document set, and construct a target language text document index library and a feature word library.
The preprocessing method comprises: removing stop words, extracting feature words, and calculating the feature word weights according to formula (1).
wij = (tfj,i / maxtf(di)) × idfj (1)
In formula (1), wij represents the weight of feature word tj in document di, and tfj,i is the word frequency of tj in document di; the invention normalizes tfj,i, where normalization means dividing the tfj,i of each feature word in document di by the maximum word frequency maxtf(di) of document di; idfj is the inverse document frequency.
Step 2: mining a frequent item set containing original query terms in an initial examination related feedback document set through item set weight value comparison, and pruning the item set by using an item set relevance value and the maximum item weight value or the maximum item weight value of the item set, wherein the method comprises the following specific steps:
(2-1) Mine the text feature word 1_frequent item sets L1. The specific steps are as follows:
(2-1-1) Extract text feature words from the feature word library as 1_candidate item sets C1.
(2-1-2) Scan the target language text document index library, and count the total number n of text documents and the item set weight w[C1] of each C1.
(2-1-3) Calculate the minimum weight support threshold (MWS). The MWS calculation formula is shown in formula (2).
MWS=n×ms (2)
In formula (2), ms is the minimum support threshold and n is the total number of text documents in the target language text document index library.
(2-1-4) If w[C1] ≥ MWS, then C1 is a text feature word 1_frequent item set L1 and is added to the frequent item set set FIS.
(2-2) Mine the text feature word 2_frequent item sets L2. The specific steps are as follows:
(2-2-1) Derive multiple 2_candidate item sets C2 by joining the text feature word 1_frequent item sets L1 with the Apriori join method.
The Apriori join method is described in detail in the literature (Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).
(2-2-2) Prune the 2_candidate item sets C2 containing no original query terms.
(2-2-3) For each remaining 2_candidate item set C2, scan the target language text document index library and count its item set weight w[C2].
(2-2-4) If w[C2] ≥ MWS×2, then the 2_candidate item set C2 is a text feature word 2_frequent item set L2 and is added to the frequent item set set FIS.
(2-3) Mine the text feature word k_frequent item sets Lk, k ≥ 2. The specific steps are as follows:
(2-3-1) Derive multiple k_candidate item sets Ck = (i1, i2, …, ik), k ≥ 2, by joining the text feature word (k-1)_frequent item sets Lk-1 with the Apriori join method.
(2-3-2) Scan the target language text document index library; for each Ck, count its item set weight w[Ck] and its largest item weight wm, and obtain the item im corresponding to the largest item weight wm of each Ck, m ∈ (1, 2, …, k).
(2-3-3) If the 1_item set (im) corresponding to the item im is infrequent, or wm < MWS, prune the corresponding Ck.
(2-3-4) For each remaining Ck, calculate its item set relevancy IRe(Ck); if w[Ck] ≥ MWS×k and IRe(Ck) ≥ minIRe, then Ck is a text feature word k_frequent item set Lk and is added to the frequent item set set FIS. minIRe is the minimum item set relevancy threshold; IRe(Ck) is shown in formula (3):
IRe(Ck) = wmin[(iq)] / wmax[(ip)] (3)
In formula (3), the meanings of wmin[(iq)] and wmax[(ip)] are as follows: for a k_candidate item set Ck = (i1, i2, …, ik), each item i1, i2, …, ik taken alone as a 1_item set corresponds to (i1), (i2), …, (ik); wmin[(iq)] and wmax[(ip)] respectively represent the smallest and the largest 1_item set weight among (i1), (i2), …, (ik); q ∈ (1, 2, …, k), p ∈ (1, 2, …, k).
(2-3-5) If the text feature word k_frequent item set Lk is empty, the mining of text feature word frequent item sets ends; go to step 3 below. Otherwise, add 1 to k and go to step (2-3-1) to continue the loop.
Step 3: Using the chi-square analysis-confidence evaluation framework, mine the text feature word weighted association rule patterns containing the original query terms from each text feature word k_frequent item set Lk in the frequent item set set FIS, k ≥ 2. The specific steps are as follows:
Take out any text feature word k_frequent item set Lk from the frequent item set set FIS, and mine all association rule patterns containing the original query terms in each Lk according to the following steps.
(3-1) Construct the set of all proper subset item sets of Lk;
(3-2) Arbitrarily take two proper subset item sets qt and Et from the proper subset item set set such that qt ∩ Et = ∅, qt ∪ Et = Lk, qt ⊆ QTL and Et ∩ QTL = ∅, where QTL is the term set of the original query in the target language and Et is a feature term set containing no original query terms; compute the chi-square value of the item set (qt, Et). The chi-square value Chis(qt, Et) calculation formula is shown in formula (4).
Chis(qt,Et) = n×[ws(qt,Et) − ws(qt)×ws(Et)]² / [ws(qt)×ws(Et)×(1−ws(qt))×(1−ws(Et))] (4)
where ws(qt) = w[(qt)]/(n×k1), ws(Et) = w[(Et)]/(n×k2) and ws(qt,Et) = w[(qt,Et)]/(n×kL) are the weighted supports. In formula (4), w[(qt)] is the item set weight of item set qt in the target language text document index library and k1 is the length of qt; w[(Et)] is the item set weight of item set Et and k2 is the length of Et; w[(qt,Et)] is the item set weight of item set (qt,Et) and kL is the length of (qt,Et); n is the total number of text documents in the target language text document index library.
(3-3) If Chis(qt, Et) > 0, calculate the confidence WConf(qt→Et) of the text feature word weighted association rule. If WConf(qt→Et) is greater than or equal to the minimum confidence threshold mc, the association rule qt→Et is a strong weighted association rule pattern and is added to the weighted association rule pattern set WAR. WConf(qt→Et) is shown in formula (5).
WConf(qt→Et) = (w[(qt,Et)]/kL) / (w[(qt)]/k1) (5)
In formula (5), w[(qt)], k1, w[(qt,Et)] and kL are defined as in formula (4).
(3-4) If and only if each proper subset item set of Lk has been taken once, the mining of the text feature word weighted association rule patterns in this Lk ends; take out another Lk from the frequent item set set FIS and go to step (3-1) to mine the weighted association rule patterns of that Lk; otherwise, go to step (3-2) and execute the steps in sequence. If every Lk in the frequent item set set FIS has been taken out, the mining of the whole weighted association rule pattern set ends; go to step 4 below.
Step 4: Extract the weighted association rule consequents Et from the weighted association rule pattern set WAR as query expansion words, and calculate the expansion word weights.
Extract the consequent Et of each weighted association rule qt→Et in the weighted association rule pattern set WAR as a query expansion word; the expansion word weight we calculation formula is shown in formula (6).
we=0.5×max(WConf())+0.3×max(Chis())+0.2×max(IRe()) (6)
In formula (6), max(WConf()), max(Chis()) and max(IRe()) represent the maximum values of the weighted association rule confidence, chi-square value and item set relevancy, respectively; that is, when an expansion word appears repeatedly in multiple weighted association rule patterns, the maximum of each of the 3 measures is taken.
Step 5: Combine the expansion words with the original query words to form a new query, and retrieve the target language documents again to complete the cross-language query expansion.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a cross-language query expansion method for realizing rule consequent mining through weight comparison. The method mines frequent item sets containing original query terms in the initial-retrieval relevant feedback target language document set through item set weight comparison, prunes the candidate item sets using the item set relevancy value and either the item bearing the largest item weight or the largest item weight value, mines text feature word association rule patterns containing the original query terms from the frequent item sets with a chi-square analysis-confidence evaluation framework, and takes the consequent item sets of association rules whose antecedents are original query term sets as query expansion words, realizing cross-language query expansion; the expansion words and the original query words are combined into a new query to retrieve the target language documents again. Experimental results show that the invention can improve cross-language text information retrieval performance.
(2) The internationally common standard data set NTCIR-5 CLIR is selected as the experimental corpus of the method, and existing mining methods are selected as the comparison methods. Experimental results show that the cross-language text retrieval P@15 and average R-precision values of the method of the invention are both higher than those of the comparison methods and the effect is obvious; the retrieval performance of the method is superior to that of the comparison methods, cross-language information retrieval performance can be improved, and the query drift and word mismatch problems in cross-language information retrieval are reduced. The method has high application value and wide popularization prospects.
Drawings
FIG. 1 is a flow diagram of the cross-language query expansion method for realizing rule consequent mining through weight comparison according to the present invention.
Detailed Description
The following description of the embodiments of the method of the present invention is provided in conjunction with the accompanying drawings, and should not be construed as limiting the scope of the claims.
The following introduces concepts related to the present invention:
1. Antecedent and consequent of a text feature word association rule
Let T1 and T2 be arbitrary text feature term sets. An implication of the form T1→T2 is called a text feature word association rule, where T1 is called the rule antecedent and T2 is called the rule consequent.
2. Let DS = {d1, d2, …, dn} be a set of text documents (DS), where di (1 ≤ i ≤ n) is the i-th document in the document set DS; di = {t1, t2, …, tm, …, tp}, where tm (m = 1, 2, …, p) is a document feature word item, feature item for short, generally consisting of a word, term or phrase. The feature item weight set corresponding to di is Wi = {wi1, wi2, …, wim, …, wip}, where wim is the weight corresponding to the m-th feature item tm of the i-th document di. T = {t1, t2, …, tn} denotes the set of global feature items in DS, and each subset of T is called a feature item set, item set for short.
Suppose the item set weight w[Ck] of a k_candidate item set Ck = (i1, i2, …, ik) is counted in the text document index library, and the weights corresponding to the items i1, i2, …, ik of Ck are w1, w2, …, wk. Then w1, w2, …, wk are called item weights, and the item set weight of Ck is w[Ck] = w1 + w2 + … + wk.
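The item weight and item set weight definitions above can be illustrated with a short sketch (an illustrative example, not part of the patent; all names and values are hypothetical):

```python
# Illustrative sketch of definition 2: the item set weight w[Ck] of a
# k_candidate item set is the sum of the item weights of its items.

def itemset_weight(itemset, item_weights):
    """w[Ck] = w1 + w2 + ... + wk over the items of Ck."""
    return sum(item_weights[item] for item in itemset)

# Toy item weights accumulated over a document index library.
weights = {"t1": 0.6, "t2": 0.9, "t3": 0.3}
ck = ("t1", "t2", "t3")

w_ck = itemset_weight(ck, weights)       # 0.6 + 0.9 + 0.3 = 1.8
w_max = max(weights[i] for i in ck)      # largest item weight: 0.9
```

Here w[Ck] = 1.8 is the item set weight and 0.9 is the largest item weight, the two quantities compared against the thresholds in the pruning steps below.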
Example 1
As shown in FIG. 1, the cross-language query expansion method for realizing rule consequent mining through weight comparison comprises the following steps:
Step 1: The source language query first retrieves target language documents across languages, and an initial-retrieval relevant feedback document set is constructed and preprocessed. The specific steps are as follows:
(1-1) Translate the source language user query into the target language through a machine translation system, and retrieve the target language text document set with a vector space retrieval model to obtain the top-ranked target language documents of the initial retrieval.
The machine translation system may be, for example, the Microsoft Translator API or the Google machine translation interface.
(1-2) Construct the initial-retrieval relevant feedback document set by performing relevance judgment on the top-ranked target language documents of the initial retrieval.
(1-3) Preprocess the initial-retrieval relevant feedback document set, and construct a target language text document index library and a feature word library.
The preprocessing method for the initial-retrieval relevant feedback document set adopts a corresponding method for each language. For example, if the target language is English, the preprocessing method is: remove English stop words, extract English feature word stems with the Porter program (see http://tartarus.org/martin/PorterStemmer) and calculate the English feature word weights. If the target language is Chinese, the preprocessing method is: remove Chinese stop words, extract Chinese feature words after segmenting the Chinese documents, and calculate the Chinese feature word weights.
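The English branch of the preprocessing above can be sketched as follows (a minimal sketch with an illustrative stop-word list; a real implementation would also apply the Porter stemmer mentioned above):

```python
# Minimal English preprocessing sketch: lowercase, tokenize, drop stop words.
# The stop-word list is illustrative only; stemming is omitted for brevity.
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def preprocess(text):
    """Return the feature-word tokens of one feedback document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The expansion of a query in cross-language retrieval")
# tokens: ['expansion', 'query', 'cross', 'language', 'retrieval']
```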
The invention provides a calculation formula for the feature word weight of the initial-retrieval relevant feedback documents, as shown in formula (1).
wij = (tfj,i / maxtf(di)) × idfj (1)
In formula (1), wij represents the weight of feature word tj in document di, and tfj,i is the word frequency of tj in document di; tfj,i is commonly normalized, where normalization means dividing the tfj,i of each feature word in document di by the maximum word frequency maxtf(di) of document di; idfj is the inverse document frequency (Inverse Document Frequency).
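Formula (1) can be sketched as follows (a minimal sketch; the base-10 logarithm in idfj is an assumption, since the text does not fix the logarithm base):

```python
# Sketch of formula (1): word frequency normalized by the maximum word
# frequency in the document, multiplied by the inverse document frequency.
import math

def feature_weight(tf_ji, max_tf_i, n_docs, df_j):
    """w_ij = (tf_{j,i} / max tf in d_i) * idf_j, with idf_j = log10(n/df_j)."""
    idf_j = math.log10(n_docs / df_j)
    return (tf_ji / max_tf_i) * idf_j

w = feature_weight(tf_ji=3, max_tf_i=5, n_docs=1000, df_j=10)
# (3/5) * log10(1000/10) = 0.6 * 2 = 1.2
```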
The source of the cross-language query expansion words is the cross-language initial-retrieval relevant feedback documents. Therefore, in the cross-language initial-retrieval relevant feedback document set, the more initial-retrieval relevant feedback documents contain a certain text feature word, the more related that feature word is to the original query, the more important it is, and the higher its weight should be.
Step 2: mining a frequent item set containing original query terms in an initial examination related feedback document set through item set weight value comparison, and pruning the item set by using an item set relevance value and the maximum item weight value or the maximum item weight value of the item set, wherein the method comprises the following specific steps:
(2-1) Mine the text feature word 1_frequent item sets L1. The specific steps are as follows:
(2-1-1) Extract text feature words from the feature word library as 1_candidate item sets C1.
(2-1-2) Scan the target language text document index library, and count the total number n of text documents and the item set weight w[C1] of each C1.
(2-1-3) Calculate the minimum weight support threshold (MWS). The MWS calculation formula is shown in formula (2).
MWS=n×ms (2)
In formula (2), ms is the minimum support threshold and n is the total number of text documents in the target language text document index library.
(2-1-4) If w[C1] ≥ MWS, then C1 is a text feature word 1_frequent item set L1 and is added to the frequent item set set FIS (Frequent Item Set).
(2-2) Mine the text feature word 2_frequent item sets L2. The specific steps are as follows:
(2-2-1) Derive multiple 2_candidate item sets C2 by joining the text feature word 1_frequent item sets L1 with the Apriori join method.
The Apriori join method is described in detail in the literature (Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).
(2-2-2) Prune the 2_candidate item sets C2 containing no original query terms.
(2-2-3) For each remaining 2_candidate item set C2, scan the target language text document index library and count its item set weight w[C2].
(2-2-4) If w[C2] ≥ MWS×2, then the 2_candidate item set C2 is a text feature word 2_frequent item set L2 and is added to the frequent item set set FIS.
(2-3) Mine the text feature word k_frequent item sets Lk, k ≥ 2. The specific steps are as follows:
(2-3-1) Derive multiple k_candidate item sets Ck = (i1, i2, …, ik), k ≥ 2, by joining the text feature word (k-1)_frequent item sets Lk-1 with the Apriori join method.
(2-3-2) Scan the target language text document index library; for each Ck, count its item set weight w[Ck] and its largest item weight wm, and obtain the item im corresponding to the largest item weight wm of each Ck, m ∈ (1, 2, …, k).
(2-3-3) If the 1_item set (im) corresponding to the item im is infrequent, or wm < MWS, prune the corresponding Ck.
(2-3-4) For each remaining Ck, calculate its item set relevancy IRe(Ck); if w[Ck] ≥ MWS×k and IRe(Ck) ≥ minIRe, then Ck is a text feature word k_frequent item set Lk and is added to the frequent item set set FIS; otherwise, prune the Ck.
minIRe is the minimum item set relevancy threshold; IRe(Ck) is shown in formula (3):
IRe(Ck) = wmin[(iq)] / wmax[(ip)] (3)
In formula (3), the meanings of wmin[(iq)] and wmax[(ip)] are as follows: for a k_candidate item set Ck = (i1, i2, …, ik), each item i1, i2, …, ik taken alone as a 1_item set corresponds to (i1), (i2), …, (ik); wmin[(iq)] and wmax[(ip)] respectively represent the smallest and the largest 1_item set weight among (i1), (i2), …, (ik); q ∈ (1, 2, …, k), p ∈ (1, 2, …, k).
(2-3-5) If the text feature word k_frequent item set Lk is empty, the mining of text feature word frequent item sets ends; go to step 3 below. Otherwise, add 1 to k and go to step (2-3-1) to continue the loop.
The pruning methods are as follows:
(1) For a k_candidate item set Ck = (i1, i2, …, ik): if the item set weight w[Ck] < MWS×k, Ck is infrequent and is pruned; if the item set relevancy IRe(Ck) < minIRe, Ck is an invalid item set and is pruned. In summary, the invention only mines the Ck with w[Ck] ≥ MWS×k and IRe(Ck) ≥ minIRe, where minIRe is the minimum item set relevancy threshold.
(2) If the largest item weight in a k_candidate item set Ck = (i1, i2, …, ik) is smaller than the minimum weight support threshold MWS, then Ck is infrequent and is pruned.
(3) Let the item corresponding to the largest item weight in a k_candidate item set Ck = (i1, i2, …, ik), taken alone as a 1_item set, be (im); if the 1_item set (im) is infrequent, Ck is pruned.
(4) When mining candidate 2_item sets, the candidate 2_item sets containing no original query terms are deleted, leaving the candidate 2_item sets containing original query terms.
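The mining of step 2 together with the four pruning strategies can be compressed into the following sketch (an illustrative simplification, not the patent's exact procedure: candidate generation uses combinations of frequent 1_item sets instead of a full Apriori join, the query-term filter is applied at every k, and all names and thresholds are hypothetical):

```python
# Illustrative sketch of weight-comparison frequent item set mining with the
# pruning strategies above. item_weights maps each feature word to its item
# weight accumulated over the document index library.
from itertools import combinations

def mine_frequent_itemsets(item_weights, n_docs, ms, min_ire, query_terms, max_k=3):
    mws = n_docs * ms                                    # formula (2): MWS = n * ms
    l1 = {i for i, w in item_weights.items() if w >= mws}
    fis = [frozenset([i]) for i in sorted(l1)]           # 1_frequent item sets
    for k in range(2, max_k + 1):
        lk = []
        for cand in combinations(sorted(l1), k):
            cand = frozenset(cand)
            if not cand & query_terms:                   # pruning (4): drop sets with no query term
                continue
            w_items = [item_weights[i] for i in cand]
            w_ck = sum(w_items)                          # item set weight w[Ck]
            if max(w_items) < mws:                       # pruning (2)/(3); redundant here since
                continue                                 # candidates are drawn from L1 only
            ire = min(w_items) / max(w_items)            # formula (3): IRe(Ck)
            if w_ck >= mws * k and ire >= min_ire:       # pruning (1)
                lk.append(cand)
        if not lk:
            break
        fis.extend(lk)
    return fis
```

With toy weights {"q": 3.0, "a": 2.5, "b": 2.0, "c": 0.5}, n = 10, ms = 0.1 and query term "q", the sketch keeps {a,q}, {b,q} and {a,b,q} but prunes {a,b} (no query term) and anything containing the infrequent "c".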
Step 3: Using the chi-square analysis-confidence evaluation framework, mine the text feature word weighted association rule patterns containing the original query terms from each k_frequent item set Lk in the frequent item set set FIS, k ≥ 2. The specific steps are as follows:
Take out any text feature word k_frequent item set Lk from the frequent item set set FIS, and mine all association rule patterns containing the original query terms in each Lk according to the following steps.
(3-1) Construct the set of all proper subset item sets of Lk;
(3-2) Arbitrarily take two proper subset item sets qt and Et from the proper subset item set set such that qt ∩ Et = ∅, qt ∪ Et = Lk, qt ⊆ QTL and Et ∩ QTL = ∅, where QTL is the term set of the original query in the target language and Et is a feature term set containing no original query terms; compute the chi-square value of the item set (qt, Et). The chi-square value Chis(qt, Et) calculation formula is shown in formula (4).
Chis(qt,Et) = n×[ws(qt,Et) − ws(qt)×ws(Et)]² / [ws(qt)×ws(Et)×(1−ws(qt))×(1−ws(Et))] (4)
where ws(qt) = w[(qt)]/(n×k1), ws(Et) = w[(Et)]/(n×k2) and ws(qt,Et) = w[(qt,Et)]/(n×kL) are the weighted supports. In formula (4), w[(qt)] is the item set weight of item set qt in the target language text document index library and k1 is the length of qt; w[(Et)] is the item set weight of item set Et and k2 is the length of Et; w[(qt,Et)] is the item set weight of item set (qt,Et) and kL is the length of (qt,Et); n is the total number of text documents in the target language text document index library.
The core idea of chi-square analysis is to measure the correlation between data items: if Chis(qt, Et) = 0, the two proper subset item sets qt and Et are independent of each other and have no correlation. Therefore, the appearance of spuriously correlated association rules can be avoided.
(3-3) If Chis(qt, Et) > 0, calculate the confidence WConf(qt→Et) of the text feature word weighted association rule. If WConf(qt→Et) is greater than or equal to the minimum confidence threshold mc, the association rule qt→Et is a strong weighted association rule pattern and is added to the weighted association rule pattern set WAR. WConf(qt→Et) is shown in formula (5).
WConf(qt→Et) = (w[(qt,Et)]/kL) / (w[(qt)]/k1) (5)
In formula (5), w[(qt)], k1, w[(qt,Et)] and kL are defined as in formula (4).
(3-4) If and only if each proper subset item set of Lk has been taken once, the mining of the text feature word weighted association rule patterns in this Lk ends; take out another Lk from the frequent item set set FIS and go to step (3-1) to mine the weighted association rule patterns of that Lk; otherwise, go to step (3-2) and execute the steps in sequence. If every Lk in the frequent item set set FIS has been taken out, the mining of the whole weighted association rule pattern set ends; go to step 4 below.
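The acceptance test of step (3-3) can be sketched as follows (WConf is taken here as the ratio of length-normalized item set weights, consistent with the terms w[(qt)], k1, w[(qt,Et)], kL defined for formula (4); this reading is an assumption, not the patent's verbatim formula):

```python
# Sketch of the strong-rule test: confidence of qt -> Et from item set
# weights, then the joint Chis > 0 and WConf >= mc check of step (3-3).

def wconf(w_qt, k1, w_qt_et, kl):
    """Assumed WConf(qt -> Et) = (w[(qt,Et)]/kL) / (w[(qt)]/k1)."""
    return (w_qt_et / kl) / (w_qt / k1)

def is_strong_rule(chis, conf, mc):
    """Keep the rule when Chis > 0 and the confidence reaches threshold mc."""
    return chis > 0 and conf >= mc

c = wconf(w_qt=2.0, k1=1, w_qt_et=3.0, kl=2)   # (3/2) / (2/1) = 0.75
```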
Step 4: Extract the weighted association rule consequents Et from the weighted association rule pattern set WAR as query expansion words, and calculate the expansion word weights.
Extract the consequent Et of each weighted association rule qt→Et in the weighted association rule pattern set WAR as a query expansion word. The item set relevancy is an important index for measuring the degree of association of the items in an item set, and the confidence value and chi-square value are important indexes for measuring the association between the antecedent and consequent of an association rule pattern. In view of this, the invention takes the relevancy, chi-square value and confidence value as the calculation basis of the expansion word weight and, according to the importance of these 3 measures to the expansion word, proposes the expansion word weight we calculation formula shown in formula (6):
we=0.5×max(WConf())+0.3×max(Chis())+0.2×max(IRe()) (6)
In formula (6), max(WConf()), max(Chis()) and max(IRe()) represent the maximum values of the weighted association rule confidence, chi-square value and item set relevancy, respectively; that is, when an expansion word appears repeatedly in multiple weighted association rule patterns, the maximum of each of the 3 measures is taken.
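Formula (6) can be sketched as follows (an illustrative sketch; the triples of measure values are hypothetical):

```python
# Sketch of formula (6): when an expansion word appears in several weighted
# association rule patterns, take the maxima of its confidence, chi-square
# and relevancy values, then combine them with the 0.5/0.3/0.2 coefficients.

def expansion_word_weight(measures):
    """measures: list of (wconf, chis, ire) triples for one expansion word."""
    max_conf = max(m[0] for m in measures)
    max_chis = max(m[1] for m in measures)
    max_ire = max(m[2] for m in measures)
    return 0.5 * max_conf + 0.3 * max_chis + 0.2 * max_ire

w_e = expansion_word_weight([(0.4, 0.1, 0.5), (0.6, 0.05, 0.7)])
# 0.5*0.6 + 0.3*0.1 + 0.2*0.7 = 0.47
```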
Step 5: Combine the expansion words with the original query words to form a new query, and retrieve the target language documents again to complete the cross-language query expansion.
Experimental design and results:
In order to illustrate the effectiveness of the method, Indonesian and English are taken as the language objects, Indonesian-English cross-language information retrieval experiments based on the method of the invention and on the comparison methods are carried out, and their cross-language retrieval performance is compared.
The experimental corpora:
The experimental corpus of the invention is the standard data set NTCIR-5 CLIR corpus (see http://research.nii.ac.jp/ntcir/permission/ntcir-5/perm-en-CLIR.html). The news texts of the English document sets Mainichi Daily News 2000, 2001 and Korea Times 2001 in the NTCIR-5 CLIR corpus are selected, 26224 English documents in total, as the experimental data of the invention, specifically 6608 news texts of Mainichi Daily News 2000 (m00), 5547 news texts of Mainichi Daily News 2001 (m01) and 14069 news texts of Korea Times 2001 (k01).
The NTCIR-5 CLIR corpus comprises a document test set, a set of 50 query subjects and the corresponding result sets. Each query subject comprises 4 types, namely Title, Desc, Narr and Conc, and the result sets adopt 2 evaluation criteria, namely the Rigid criterion (highly relevant and relevant) and the Relax criterion (highly relevant, relevant and partially relevant). The invention selects the Title and Desc types of query subjects for the experiments: the Title query is a short query that briefly describes the query subject with nouns and noun phrases, while the Desc query is a long query that briefly describes the query subject in sentence form.
The evaluation indexes of the experimental results are P@15 and the average R-precision. P@15 is the precision over the first 15 results returned for a test query; the average R-precision is the arithmetic mean of the R-precision over all queries, where R-precision is the precision computed after R documents have been retrieved, R being the number of relevant documents for the query.
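The two evaluation indexes can be sketched as follows (an illustrative Python sketch; the function names and document-id representation are hypothetical):

```python
def precision_at_k(ranked, relevant, k=15):
    """P@k: fraction of the first k returned documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    """R-precision: precision after R documents are retrieved,
    where R is the number of relevant documents for the query."""
    r = len(relevant)
    return sum(1 for d in ranked[:r] if d in relevant) / r

# The average R-precision is then the arithmetic mean over all test queries.
```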
The comparison methods are as follows:
(1) Comparison method 1: the Indonesian-English cross-language baseline retrieval method. Comparison method 1 refers to the retrieval result obtained by machine-translating the Indonesian query into English and then retrieving the English documents, without any query expansion during retrieval.
(2) Comparison method 2: the Indonesian-English cross-language query post-translation expansion method based on weighted association pattern mining. Comparison method 2 is the cross-language query expansion method of the literature (Huang Mingxuan. Indonesian-English cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-). The experimental parameters are: minimum confidence threshold mc of 0.01, minimum interestingness threshold mi of 0.0001, and minimum support threshold ms of 0.007, 0.008, 0.009, 0.01 and 0.011.
(3) Comparison method 3: the Indonesian-English cross-language query post-translation expansion method based on pseudo-relevance feedback, i.e., the cross-language query expansion method of the literature (Wu Dan et al. Cross-language query expansion based on pseudo-relevance feedback [J]. Journal of the China Society for Scientific and Technical Information, 2010, 29(2): 232-). The experimental method is: the top 20 English documents of the Indonesian-English cross-language initial retrieval are extracted to construct the initial relevant document set, feature terms are extracted and their weights calculated, and the top 20 feature terms in descending order of weight are taken as English expansion words to realize Indonesian-English cross-language query post-translation expansion.
The experimental methods and results are as follows:
The source programs of the present method and the comparison methods are run. First, the Title and Desc queries of the 50 Indonesian query subjects are translated into English by a machine translation system, and the English documents are retrieved to realize Indonesian-English cross-language information retrieval. In the experiments, user relevance feedback is performed on the top 50 English documents of the cross-language initial retrieval to obtain the initial user relevance feedback documents (for simplicity, the relevant documents among the top 50 initially retrieved documents according to the known result set are regarded as the initial relevant documents); the association rule patterns are then obtained by running the mining method, and the rule consequents are extracted as expansion words to realize cross-language query expansion. The Indonesian-English cross-language retrieval results P@15 and average R-precision of the present method and the comparison methods are shown in Tables 1 and 2, respectively. 3_item sets are mined in the experiments, and the experimental parameters of the present method are: minimum confidence threshold mc of 0.5, 0.6, 0.7, 0.8 and 0.9, respectively, minimum support threshold ms of 0.5, and minimum item set relevancy threshold minIRe of 0.4.
Table 1 Retrieval performance comparison of the inventive method and the comparison methods (Title query subjects)
(The data of Table 1 are provided as an image in the original document.)
Table 2 Retrieval performance comparison of the inventive method and the comparison methods (Desc query subjects)
(The data of Table 2 are provided as an image in the original document.)
Tables 1 and 2 show that the cross-language retrieval results P@15 and average R-precision of the present method are higher than those of the 3 comparison methods, and the improvement is significant. The experimental results show that the method is effective, can indeed improve cross-language information retrieval performance, and has high application value and broad popularization prospects.

Claims (2)

1. A cross-language query expansion method for realizing rule consequent mining through weight comparison, characterized by comprising the following steps:
Step 1: translating a source language user query into a target language through a machine translation system; retrieving a target language text document set with a vector space retrieval model to obtain the top-ranked target language documents of the initial retrieval; establishing an initial user relevant document set by performing relevance judgment on the top-ranked target language documents; preprocessing the initial user relevant document set; and establishing a target language text document index library and a feature word library;
Step 2: mining frequent item sets containing original query terms in the initial user relevance feedback document set through item set weight comparison, and pruning candidate item sets by the item set relevancy value and the maximum item weight of the item set, with the following specific steps:
(2-1) mining text feature word 1_frequent item sets L1, with the following specific steps:
(2-1-1) extracting text feature words from the feature word library as 1_candidate item sets C1;
(2-1-2) scanning the target language text document index library, counting the total number n of text documents and the item set weight w[C1] of each C1;
(2-1-3) calculating the minimum weight support threshold MWS, where the MWS calculation formula is shown in formula (2):
MWS=n×ms (2)
in formula (2), ms is the minimum support threshold, and n is the total number of text documents in the target language text document index library;
(2-1-4) if w[C1]≥MWS, then C1 is a text feature word 1_frequent item set L1 and is added to the frequent item set FIS;
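Step (2-1) can be sketched as follows (an illustrative Python sketch, not part of the claims; it assumes an item set weight w[C] accumulated as the sum of the corresponding feature word weights over all indexed documents, and all names are hypothetical):

```python
def mine_frequent_1_itemsets(doc_index, ms):
    """doc_index: dict doc_id -> {feature word: weight w_ij}.
    Returns every 1_item set whose weight reaches MWS = n * ms (formula (2))."""
    n = len(doc_index)
    mws = n * ms                                 # formula (2)
    weights = {}
    for term_weights in doc_index.values():      # step (2-1-2): one scan
        for term, w in term_weights.items():
            weights[term] = weights.get(term, 0.0) + w
    # step (2-1-4): keep 1_item sets whose weight reaches MWS
    return {(t,): w for t, w in weights.items() if w >= mws}
```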
(2-2) mining text feature word 2_frequent item sets L2, with the following specific steps:
(2-2-1) joining the text feature word 1_frequent item sets L1 by the Apriori join method to derive multiple 2_candidate item sets C2;
(2-2-2) pruning the 2_candidate item sets C2 without original query terms;
(2-2-3) for each remaining 2_candidate item set C2, scanning the target language text document index library to count its item set weight w[C2];
(2-2-4) if w[C2]≥MWS×2, then the 2_candidate item set C2 is a text feature word 2_frequent item set L2 and is added to the frequent item set FIS;
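Steps (2-2-1) and (2-2-2) can be sketched as follows (an illustrative Python sketch; for 1_item sets the Apriori join reduces to pairing distinct items, and candidates without original query terms are pruned immediately; names are hypothetical):

```python
def gen_2_candidates(frequent_1, query_terms):
    """frequent_1: dict mapping 1_item sets (t,) to their weights.
    Returns the 2_candidate item sets that contain an original query term."""
    items = sorted(t for (t,) in frequent_1)
    cands = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):           # Apriori join of L1 with L1
            pair = (items[i], items[j])
            if any(t in query_terms for t in pair):  # step (2-2-2) pruning
                cands.append(pair)
    return cands
```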
(2-3) mining text feature word k_frequent item sets Lk, k≥2, with the following specific steps:
(2-3-1) joining the text feature word (k-1)_frequent item sets Lk-1 by the Apriori join method to derive multiple k_candidate item sets Ck=(i1,i2,…,ik), k≥2;
(2-3-2) scanning the target language text document index library, counting for each Ck its item set weight w[Ck] and its maximum item weight wm, and obtaining for each Ck the item im corresponding to the maximum item weight wm, where m∈(1,2,…,k);
(2-3-3) if the 1_item set (im) corresponding to the item im is infrequent, or wm<MWS, pruning the corresponding Ck;
(2-3-4) for each remaining Ck, calculating its item set relevancy IRe(Ck); if w[Ck]≥MWS×k and IRe(Ck)≥minIRe, then Ck is a text feature word k_frequent item set Lk and is added to the frequent item set FIS; minIRe is the minimum item set relevancy threshold; IRe(Ck) is shown in formula (3):
IRe(Ck)=wmin[(iq)]/wmax[(ip)] (3)
in formula (3), the meanings of wmin[(iq)] and wmax[(ip)] are as follows: for the k_candidate item set Ck=(i1,i2,…,ik), the items i1,i2,…,ik, each taken separately as a 1_item set, correspond to (i1),(i2),…,(ik); wmin[(iq)] and wmax[(ip)] respectively denote the smallest 1_item set weight and the largest 1_item set weight among (i1),(i2),…,(ik); q∈(1,2,…,k), p∈(1,2,…,k);
(2-3-5) if the text feature word k_frequent item set Lk is an empty set, the mining of text feature word frequent item sets is finished and the method proceeds to step 3 below; otherwise, k is increased by 1 and the method returns to step (2-3-1) to continue the loop in sequence;
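The pruning of steps (2-3-2) to (2-3-4) can be sketched as follows (an illustrative Python sketch; the item set relevancy IRe is taken here as the min/max ratio of the member 1_item set weights, an assumed reading of formula (3), and all names are hypothetical):

```python
def prune_k_candidates(cands, item_w, itemset_w, frequent_1, mws, min_ire, k):
    """cands: k_candidate item sets; item_w: weight of each item's 1_item set;
    itemset_w: item set weight w[Ck] per candidate; frequent_1: set of
    frequent 1_item sets; mws/min_ire: thresholds; k: candidate length."""
    survivors = []
    for c in cands:
        member_ws = [item_w.get(i, 0.0) for i in c]
        w_max = max(member_ws)
        i_max = c[member_ws.index(w_max)]
        # step (2-3-3): prune if (i_max) is infrequent or w_max < MWS
        if (i_max,) not in frequent_1 or w_max < mws:
            continue
        ire = min(member_ws) / w_max          # assumed form of IRe(Ck)
        # step (2-3-4): weight and relevancy thresholds
        if itemset_w[c] >= mws * k and ire >= min_ire:
            survivors.append(c)
    return survivors
```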
Step 3: mining text feature word weighted association rule patterns containing original query terms from each text feature word k_frequent item set Lk in the frequent item set FIS by a chi-square analysis-confidence evaluation framework, k≥2; the specific method is as follows:
extracting any one text feature word k_frequent item set Lk from the frequent item set FIS, and mining all association rule patterns containing original query terms of each Lk according to the following steps:
(3-1) constructing the set of all proper subset item sets of Lk;
(3-2) arbitrarily taking two proper subset item sets qt and Et from the proper subset item set set, such that qt∩Et=∅, qt∪Et=Lk, qt⊆QTL and Et∩QTL=∅, where QTL is the original query term set of the target language and Et is a feature term set not containing original query terms; calculating the chi-square value Chis(qt,Et) of the item set (qt,Et) according to formula (4):
Chis(qt,Et)=n×(ws(qt,Et)−ws(qt)×ws(Et))²/(ws(qt)×ws(Et)×(1−ws(qt))×(1−ws(Et))) (4)
where ws(qt)=w[(qt)]/(k1×n), ws(Et)=w[(Et)]/(k2×n) and ws(qt,Et)=w[(qt,Et)]/(kL×n); in formula (4), w[(qt)] is the item set weight of the item set qt in the target language text document index library, k1 is the length of the item set qt, w[(Et)] is the item set weight of the item set Et in the target language text document index library, k2 is the length of the item set Et, w[(qt,Et)] is the item set weight of the item set (qt,Et) in the target language text document index library, kL is the length of the item set (qt,Et), and n is the total number of text documents in the target language text document index library;
(3-3) if Chis(qt,Et)>0, calculating the confidence WConf(qt→Et) of the text feature word weighted association rule; if WConf(qt→Et) is greater than or equal to the minimum confidence threshold mc, the association rule qt→Et is a strong association rule pattern and is added to the weighted association rule pattern set WAR; WConf(qt→Et) is shown in formula (5):
WConf(qt→Et)=(k1×w[(qt,Et)])/(kL×w[(qt)]) (5)
in formula (5), w[(qt)], k1, w[(qt,Et)] and kL are defined as in formula (4);
(3-4) if and only if each proper subset item set of Lk has been taken once, the mining of the text feature word weighted association rule patterns in this Lk is finished; another Lk is then taken from the frequent item set FIS and the method returns to step (3-1) to mine the weighted association rule patterns of that Lk; otherwise, the method returns to step (3-2) and the steps are executed in sequence; if every Lk in the frequent item set FIS has been taken out for weighted association rule pattern mining, the mining of all weighted association rule patterns is finished, and the method proceeds to step 4 below;
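Step 3 for a single frequent item set Lk can be sketched as follows (an illustrative Python sketch, assuming a weighted support of the form ws(X)=w[X]/(|X|×n) underlying the chi-square and confidence measures; function and variable names are hypothetical):

```python
from itertools import chain, combinations

def mine_rules(lk, w, n, query_terms, mc):
    """lk: frozenset of feature words; w: dict frozenset -> item set weight;
    n: total documents; query_terms: QTL; mc: minimum confidence threshold."""
    rules = []
    proper_subsets = chain.from_iterable(
        combinations(sorted(lk), r) for r in range(1, len(lk)))
    for subset in proper_subsets:                      # step (3-2)
        qt, et = frozenset(subset), lk - set(subset)
        # antecedent drawn from query terms, consequent free of them
        if not (qt <= query_terms and not (et & query_terms)):
            continue
        ws = lambda x: w[frozenset(x)] / (len(x) * n)  # assumed weighted support
        s_q, s_e, s_qe = ws(qt), ws(et), ws(lk)
        chis = n * (s_qe - s_q * s_e) ** 2 / (
            s_q * s_e * (1 - s_q) * (1 - s_e))         # chi-square analysis
        if chis > 0:                                   # step (3-3)
            conf = s_qe / s_q                          # rule confidence
            if conf >= mc:
                rules.append((qt, et, conf, chis))
    return rules
```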
Step 4: extracting the consequent Et of each weighted association rule qt→Et from the weighted association rule pattern set WAR as a query expansion word, and calculating the expansion word weight we according to formula (6):
we=0.5×max(WConf())+0.3×max(Chis())+0.2×max(IRe()) (6)
in formula (6), max(WConf()), max(Chis()) and max(IRe()) respectively represent the maximum value of the weighted association rule confidence, the maximum chi-square value and the maximum relevancy;
Step 5: combining the query expansion words obtained in step 4 with the original query words into a new query, and retrieving the target language documents again to complete the cross-language query expansion.
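Step 5 can be sketched as follows (an illustrative Python sketch; the claims specify a vector space retrieval model, and the cosine similarity form used here is an assumption, as are all names):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def expanded_retrieval(query_w, expansion_w, doc_vectors):
    """Merge the original query weights with the expansion word weights w_e
    into a new query, then re-rank the target language documents."""
    new_query = dict(query_w)
    new_query.update(expansion_w)
    return sorted(doc_vectors,
                  key=lambda d: cosine(new_query, doc_vectors[d]),
                  reverse=True)
```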
2. The cross-language query expansion method for realizing rule consequent mining through weight comparison according to claim 1, wherein the preprocessing of the initial user relevant document set in step 1 specifically comprises: removing stop words, extracting feature words, and calculating feature word weights according to formula (1);
wij=tfj,i×idfj (1)
in formula (1), wij represents the weight of feature word tj in document di, tfj,i is the word frequency of feature word tj in document di, and idfj is the inverse document frequency of tj.
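Formula (1) can be sketched as follows (an illustrative Python sketch assuming the common inverse document frequency form idfj = log(n/dfj), where dfj is the number of documents containing tj; the claim does not fix the exact idf definition):

```python
import math

def feature_word_weights(docs):
    """docs: dict doc_id -> list of feature words after stop word removal.
    Returns w_ij = tf_{j,i} * idf_j for each (document, feature word) pair."""
    n = len(docs)
    df = {}
    for words in docs.values():
        for t in set(words):
            df[t] = df.get(t, 0) + 1
    weights = {}
    for d, words in docs.items():
        for t in set(words):
            # tf is the raw word frequency; idf assumed as log(n / df)
            weights[(d, t)] = words.count(t) * math.log(n / df[t])
    return weights
```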
CN201811646511.5A 2018-12-30 2018-12-30 A Cross-Language Query Expansion Method for Rule Consequence Mining Through Weight Comparison Expired - Fee Related CN109684464B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811646511.5A CN109684464B (en) 2018-12-30 2018-12-30 A Cross-Language Query Expansion Method for Rule Consequence Mining Through Weight Comparison


Publications (2)

Publication Number Publication Date
CN109684464A CN109684464A (en) 2019-04-26
CN109684464B true CN109684464B (en) 2021-06-04

Family

ID=66191526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811646511.5A Expired - Fee Related CN109684464B (en) 2018-12-30 2018-12-30 A Cross-Language Query Expansion Method for Rule Consequence Mining Through Weight Comparison

Country Status (1)

Country Link
CN (1) CN109684464B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897928A (en) * 2020-08-04 2020-11-06 广西财经学院 A Chinese query expansion method based on the union of query word embedding expansion words and statistical expansion words

Citations (5)

Publication number Priority date Publication date Assignee Title
CN101943952A (en) * 2010-01-27 2011-01-12 北京搜狗科技发展有限公司 Mixed input method of at least two languages and input method system
CN107526839A (en) * 2017-09-08 2017-12-29 广西财经学院 Based on weight positive negative mode completely consequent extended method is translated across language inquiry
WO2018018912A1 (en) * 2016-07-29 2018-02-01 北京搜狗科技发展有限公司 Search method and apparatus, and electronic device
CN108170778A (en) * 2017-12-26 2018-06-15 广西财经学院 Rear extended method is translated across language inquiry by China and Britain based on complete Weighted Rule consequent
CN108334526A (en) * 2017-01-20 2018-07-27 北京搜狗科技发展有限公司 The methods of exhibiting and device of search result items

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN1632793A (en) * 2004-12-29 2005-06-29 复旦大学 An Optimal Method for Publishing Relational Data as XML Documents Using Cache
CN101901249A (en) * 2009-05-26 2010-12-01 复旦大学 A Text-Based Query Expansion and Ranking Method in Image Retrieval
CN102033954B (en) * 2010-12-24 2012-10-17 东北大学 Extensible Markup Language Document Full Text Retrieval Query Indexing Method in Relational Database
CN104298676A (en) * 2013-07-18 2015-01-21 佳能株式会社 Topic mining method and equipment and query expansion method and equipment
CA2943513C (en) * 2014-03-29 2020-08-04 Thomson Reuters Global Resources Improved method, system and software for searching, identifying, retrieving and presenting electronic documents
CN104182527B (en) * 2014-08-27 2017-07-18 广西财经学院 Association rule mining method and its system between Sino-British text word based on partial order item collection
CN107609095B (en) * 2017-09-08 2019-07-09 广西财经学院 A Cross-Language Query Expansion Method Based on Weighted Positive and Negative Rule Antecedents and Relevant Feedback


Non-Patent Citations (2)

Title
Cross Language Query Expansion Approach for CIMS Based on Weighted D-S Evidence Theory; Xiaobo Wang et al.; Key Engineering Materials; 2014-12-31; full text *
Matrix-weighted association rule mining based on item weight variation; Zhou Xiumei et al.; Application Research of Computers; 2015-10-31 (No. 10); full text *

Also Published As

Publication number Publication date
CN109684464A (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN108763196A (en) A kind of keyword extraction method based on PMI
CN103064969A (en) Method for automatically creating keyword index table
CN106372241B Vietnamese-English cross-language text retrieval method and system based on weighted association patterns between words
CN110472005A (en) A kind of unsupervised keyword extracting method
CN111831786A (en) Full-text database accurate and efficient retrieval method for perfecting subject term
CN108920482A (en) Microblogging short text classification method based on Lexical Chains feature extension and LDA model
CN109299278B (en) Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
CN107609095B (en) A Cross-Language Query Expansion Method Based on Weighted Positive and Negative Rule Antecedents and Relevant Feedback
CN109726263B (en) Cross-language post-translation hybrid extension method based on feature word weighted association pattern mining
CN114580557B (en) Method and device for determining document similarity based on semantic analysis
CN109739953B (en) A Text Retrieval Method Based on Chi-Square Analysis-Confidence Framework and Consequence Expansion
CN109684464B (en) A Cross-Language Query Expansion Method for Rule Consequence Mining Through Weight Comparison
CN109684463B Cross-language query post-translation antecedent expansion method based on weight comparison mining
CN109299292B (en) A Text Retrieval Method Based on Matrix-Weighted Association Rules Mixed Expansion of Context and Context
CN107526839B Cross-language query post-translation consequent expansion method based on completely weighted positive and negative patterns
CN109684465B (en) Pattern Mining and Hybrid Extended Text Retrieval Method Based on Itemset Weight Comparison
CN109739952A (en) Pattern Mining and Extended Cross-Language Retrieval Method Integrating Relevance and Chi-square Values
Li et al. Multi-feature keyword extraction method based on TF-IDF and Chinese grammar analysis
CN108170778B Chinese-English cross-language query post-translation expansion method based on completely weighted rule consequents
CN107562904B (en) Mining method of weighted positive and negative association patterns between English words by fusing item weight and frequency
CN108416442B (en) A Chinese Interword Matrix Weighted Association Rule Mining Method Based on Item Frequency and Weight
CN113408286A (en) Chinese entity identification method and system for mechanical and chemical engineering field
Shahabi et al. A method for multi-text summarization based on multi-objective optimization use imperialist competitive algorithm
Li et al. Keyword extraction based on lexical chains and word co-occurrence for Chinese news web pages
CN109684462B (en) Mining method of association rules between text words based on weight comparison and chi-square analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20210604
Termination date: 20211230