
CN109684464B - A Cross-Language Query Expansion Method for Rule Consequence Mining Through Weight Comparison - Google Patents


Info

Publication number
CN109684464B
CN109684464B CN201811646511.5A
Authority
CN
China
Prior art keywords
item set
text
item
weight
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811646511.5A
Other languages
Chinese (zh)
Other versions
CN109684464A
Inventor
黄名选 (Huang Mingxuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics filed Critical Guangxi University of Finance and Economics
Priority to CN201811646511.5A priority Critical patent/CN109684464B/en
Publication of CN109684464A publication Critical patent/CN109684464A/en
Application granted granted Critical
Publication of CN109684464B publication Critical patent/CN109684464B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-language query expansion method that realizes rule consequent mining through weight comparison. First, a cross-language initial retrieval constructs an initial-retrieval relevant feedback document set; frequent item sets containing the original query terms are then mined from this document set, and the candidate item sets are pruned using the item set relevancy value and either the item bearing the largest item weight or the largest item weight value. A chi-square analysis-confidence evaluation framework is used to mine, from the frequent item sets, text feature word association rule patterns containing the original query terms, and the consequent item sets of association rules whose antecedents are sets of original query terms are taken as query expansion words, realizing cross-language query expansion. The invention overcomes the defects of existing weighted association rule mining methods, improves mining efficiency, mines expansion words related to the original query, improves cross-language information retrieval performance, and reduces query topic drift and word mismatch in retrieval. It has good application value and promotion prospects in cross-language search engines and web cross-language retrieval systems.

Description

Cross-language query expansion method for realizing rule consequent mining through weight comparison
Technical Field
The invention belongs to the field of information retrieval, and particularly relates to a cross-language query expansion method for realizing rule consequent mining through weight comparison.
Background
At present, network information resources characterized by multiple languages are growing rapidly and have become network big data with huge hidden economic and research value. When network users search these big data resources for information in other languages using query expressions in a language familiar to them, they encounter serious problems such as query topic drift and word mismatch; cross-language query expansion is one of the key technologies for solving these problems.
Cross-language query expansion is one of the core technologies for improving cross-language information retrieval performance and can alleviate the problems, such as serious query topic drift and word mismatch, that have long troubled cross-language information retrieval. It refers to the process, during cross-language information retrieval, of finding expansion words related to the original query with a certain strategy, combining the expansion words with the original query into a new query, and retrieving again. In recent decades, scholars have conducted highly effective research on cross-language query expansion methods and obtained some research results, for example, a cross-language query expansion method based on latent semantic analysis (Computer Engineering, 2009, 35(10): 49-53) and a cross-language query expansion method based on pseudo-relevance feedback proposed by Wu Dan et al. (2010, 29(2): 232-).
Disclosure of Invention
The invention provides a cross-language query expansion method for realizing rule consequent mining through weight comparison. Applied to actual cross-language search engines and web cross-language information retrieval systems, it can alleviate query topic drift and word mismatch in cross-language information retrieval and improve cross-language retrieval performance.
The technical scheme of the invention is as follows:
The cross-language query expansion method for realizing rule consequent mining through weight comparison comprises the following steps:
Step 1: The source language query first retrieves target language documents across languages, and an initial-retrieval relevant feedback document set is constructed and preprocessed. The specific steps are as follows:
(1-1) Translate the source language user query into the target language through a machine translation system, and retrieve the target language text document set with a vector space retrieval model to obtain the top-ranked target language documents of the initial retrieval.
The machine translation system may be, for example, the Microsoft Translator API or the Google machine translation interface.
(1-2) Construct the initial-retrieval relevant feedback document set by performing relevance judgment on the top-ranked target language documents of the initial retrieval.
(1-3) Preprocess the initial-retrieval relevant feedback document set, and construct a target language text document index library and a feature word library.
The preprocessing method comprises: removing stop words, extracting feature words, and calculating the feature word weights according to formula (1).
wij = (tfj,i / maxtf(di)) × idfj (1)
In formula (1), wij represents the weight of feature word tj in document di, and tfj,i is the word frequency of tj in document di; the invention normalizes tfj,i, where normalization means dividing the tfj,i of each feature word in document di by the maximum word frequency maxtf(di) of document di; idfj is the inverse document frequency.
Step 2: mining a frequent item set containing original query terms in an initial examination related feedback document set through item set weight value comparison, and pruning the item set by using an item set relevance value and the maximum item weight value or the maximum item weight value of the item set, wherein the method comprises the following specific steps:
(2-1) Mine the text feature word 1_frequent item sets L1. The specific steps are as follows:
(2-1-1) Extract text feature words from the feature word library as 1_candidate item sets C1.
(2-1-2) Scan the target language text document index library, and count the total number n of text documents and the item set weight w[C1] of each C1.
(2-1-3) Calculate the minimum weight support threshold (MWS). The MWS calculation formula is shown in formula (2).
MWS=n×ms (2)
In formula (2), ms is the minimum support threshold and n is the total number of text documents in the target language text document index library.
(2-1-4) If w[C1] ≥ MWS, then C1 is a text feature word 1_frequent item set L1 and is added to the frequent item set set FIS.
(2-2) Mine the text feature word 2_frequent item sets L2. The specific steps are as follows:
(2-2-1) Derive multiple 2_candidate item sets C2 by joining the text feature word 1_frequent item sets L1 with the Apriori join method.
The Apriori join method is described in detail in the literature (Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).
(2-2-2) Prune the 2_candidate item sets C2 containing no original query terms.
(2-2-3) For each remaining 2_candidate item set C2, scan the target language text document index library and count its item set weight w[C2].
(2-2-4) If w[C2] ≥ MWS×2, then the 2_candidate item set C2 is a text feature word 2_frequent item set L2 and is added to the frequent item set set FIS.
(2-3) Mine the text feature word k_frequent item sets Lk, k ≥ 2. The specific steps are as follows:
(2-3-1) Derive multiple k_candidate item sets Ck = (i1, i2, …, ik), k ≥ 2, by joining the text feature word (k-1)_frequent item sets Lk-1 with the Apriori join method.
(2-3-2) Scan the target language text document index library; for each Ck, count its item set weight w[Ck] and its largest item weight wm, and obtain the item im corresponding to the largest item weight wm of each Ck, m ∈ (1, 2, …, k).
(2-3-3) If the 1_item set (im) corresponding to the item im is infrequent, or wm < MWS, prune the corresponding Ck.
(2-3-4) For each remaining Ck, calculate its item set relevancy IRe(Ck); if w[Ck] ≥ MWS×k and IRe(Ck) ≥ minIRe, then Ck is a text feature word k_frequent item set Lk and is added to the frequent item set set FIS. minIRe is the minimum item set relevancy threshold; IRe(Ck) is shown in formula (3):
IRe(Ck) = wmin[(iq)] / wmax[(ip)] (3)
In formula (3), the meanings of wmin[(iq)] and wmax[(ip)] are as follows: for a k_candidate item set Ck = (i1, i2, …, ik), each item i1, i2, …, ik taken alone as a 1_item set corresponds to (i1), (i2), …, (ik); wmin[(iq)] and wmax[(ip)] respectively represent the smallest and the largest 1_item set weight among (i1), (i2), …, (ik); q ∈ (1, 2, …, k), p ∈ (1, 2, …, k).
(2-3-5) If the text feature word k_frequent item set Lk is empty, the mining of text feature word frequent item sets ends; go to step 3 below. Otherwise, add 1 to k and go to step (2-3-1) to continue the loop.
Step 3: Using the chi-square analysis-confidence evaluation framework, mine the text feature word weighted association rule patterns containing the original query terms from each text feature word k_frequent item set Lk in the frequent item set set FIS, k ≥ 2. The specific steps are as follows:
Take out any text feature word k_frequent item set Lk from the frequent item set set FIS, and mine all association rule patterns containing the original query terms in each Lk according to the following steps.
(3-1) Construct the set of all proper subset item sets of Lk;
(3-2) Arbitrarily take two proper subset item sets qt and Et from the proper subset item set set such that qt ∩ Et = ∅, qt ∪ Et = Lk, qt ⊆ QTL and Et ∩ QTL = ∅, where QTL is the term set of the original query in the target language and Et is a feature term set containing no original query terms; compute the chi-square value of the item set (qt, Et). The chi-square value Chis(qt, Et) calculation formula is shown in formula (4).
Chis(qt,Et) = n×[ws(qt,Et) − ws(qt)×ws(Et)]² / [ws(qt)×ws(Et)×(1−ws(qt))×(1−ws(Et))] (4)
where ws(qt) = w[(qt)]/(n×k1), ws(Et) = w[(Et)]/(n×k2) and ws(qt,Et) = w[(qt,Et)]/(n×kL) are the weighted supports. In formula (4), w[(qt)] is the item set weight of item set qt in the target language text document index library and k1 is the length of qt; w[(Et)] is the item set weight of item set Et and k2 is the length of Et; w[(qt,Et)] is the item set weight of item set (qt,Et) and kL is the length of (qt,Et); n is the total number of text documents in the target language text document index library.
(3-3) If Chis(qt, Et) > 0, calculate the confidence WConf(qt→Et) of the text feature word weighted association rule. If WConf(qt→Et) is greater than or equal to the minimum confidence threshold mc, the association rule qt→Et is a strong weighted association rule pattern and is added to the weighted association rule pattern set WAR. WConf(qt→Et) is shown in formula (5).
WConf(qt→Et) = (w[(qt,Et)]/kL) / (w[(qt)]/k1) (5)
In formula (5), w[(qt)], k1, w[(qt,Et)] and kL are defined as in formula (4).
(3-4) If and only if each proper subset item set of Lk has been taken once, the mining of the text feature word weighted association rule patterns in this Lk ends; take out another Lk from the frequent item set set FIS and go to step (3-1) to mine the weighted association rule patterns of that Lk; otherwise, go to step (3-2) and execute the steps in sequence. If every Lk in the frequent item set set FIS has been taken out, the mining of the whole weighted association rule pattern set ends; go to step 4 below.
Step 4: Extract the weighted association rule consequents Et from the weighted association rule pattern set WAR as query expansion words, and calculate the expansion word weights.
Extract the consequent Et of each weighted association rule qt→Et in the weighted association rule pattern set WAR as a query expansion word; the expansion word weight we calculation formula is shown in formula (6).
we=0.5×max(WConf())+0.3×max(Chis())+0.2×max(IRe()) (6)
In formula (6), max(WConf()), max(Chis()) and max(IRe()) represent the maximum values of the weighted association rule confidence, chi-square value and item set relevancy, respectively; that is, when an expansion word appears repeatedly in multiple weighted association rule patterns, the maximum of each of the 3 measures is taken.
Step 5: Combine the expansion words with the original query words to form a new query, and retrieve the target language documents again to complete the cross-language query expansion.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a cross-language query expansion method for realizing rule consequent mining through weight comparison. The method mines frequent item sets containing original query terms in the initial-retrieval relevant feedback target language document set through item set weight comparison, prunes the candidate item sets using the item set relevancy value and either the item bearing the largest item weight or the largest item weight value, mines text feature word association rule patterns containing the original query terms from the frequent item sets with a chi-square analysis-confidence evaluation framework, and takes the consequent item sets of association rules whose antecedents are original query term sets as query expansion words, realizing cross-language query expansion; the expansion words and the original query words are combined into a new query to retrieve the target language documents again. Experimental results show that the invention can improve cross-language text information retrieval performance.
(2) The internationally common standard data set NTCIR-5 CLIR is selected as the experimental corpus of the method, and existing mining methods are selected as the comparison methods. Experimental results show that the cross-language text retrieval P@15 and average R-precision values of the method of the invention are both higher than those of the comparison methods and the effect is obvious; the retrieval performance of the method is superior to that of the comparison methods, cross-language information retrieval performance can be improved, and the query drift and word mismatch problems in cross-language information retrieval are reduced. The method has high application value and wide popularization prospects.
Drawings
FIG. 1 is a flow diagram of the cross-language query expansion method for realizing rule consequent mining through weight comparison according to the present invention.
Detailed Description
The following description of the embodiments of the method of the present invention is provided in conjunction with the accompanying drawings, and should not be construed as limiting the scope of the claims.
The following introduces concepts related to the present invention:
1. Antecedent and consequent of a text feature word association rule
Let T1 and T2 be arbitrary text feature term sets. An implication of the form T1→T2 is called a text feature word association rule, where T1 is called the rule antecedent and T2 is called the rule consequent.
2. Let DS = {d1, d2, …, dn} be a set of text documents (DS), where di (1 ≤ i ≤ n) is the i-th document in the document set DS; di = {t1, t2, …, tm, …, tp}, where tm (m = 1, 2, …, p) is a document feature word item, feature item for short, generally consisting of a word, term or phrase. The feature item weight set corresponding to di is Wi = {wi1, wi2, …, wim, …, wip}, where wim is the weight corresponding to the m-th feature item tm of the i-th document di. T = {t1, t2, …, tn} denotes the set of global feature items in DS, and each subset of T is called a feature item set, item set for short.
Suppose the item set weight w[Ck] of a k_candidate item set Ck = (i1, i2, …, ik) is counted in the text document index library, and the weights corresponding to the items i1, i2, …, ik of Ck are w1, w2, …, wk. Then w1, w2, …, wk are called item weights, and the item set weight of Ck is w[Ck] = w1 + w2 + … + wk.
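The item weight and item set weight definitions above can be illustrated with a short sketch (an illustrative example, not part of the patent; all names and values are hypothetical):

```python
# Illustrative sketch of definition 2: the item set weight w[Ck] of a
# k_candidate item set is the sum of the item weights of its items.

def itemset_weight(itemset, item_weights):
    """w[Ck] = w1 + w2 + ... + wk over the items of Ck."""
    return sum(item_weights[item] for item in itemset)

# Toy item weights accumulated over a document index library.
weights = {"t1": 0.6, "t2": 0.9, "t3": 0.3}
ck = ("t1", "t2", "t3")

w_ck = itemset_weight(ck, weights)       # 0.6 + 0.9 + 0.3 = 1.8
w_max = max(weights[i] for i in ck)      # largest item weight: 0.9
```

Here w[Ck] = 1.8 is the item set weight and 0.9 is the largest item weight, the two quantities compared against the thresholds in the pruning steps below.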
Example 1
As shown in FIG. 1, the cross-language query expansion method for realizing rule consequent mining through weight comparison comprises the following steps:
Step 1: The source language query first retrieves target language documents across languages, and an initial-retrieval relevant feedback document set is constructed and preprocessed. The specific steps are as follows:
(1-1) Translate the source language user query into the target language through a machine translation system, and retrieve the target language text document set with a vector space retrieval model to obtain the top-ranked target language documents of the initial retrieval.
The machine translation system may be, for example, the Microsoft Translator API or the Google machine translation interface.
(1-2) Construct the initial-retrieval relevant feedback document set by performing relevance judgment on the top-ranked target language documents of the initial retrieval.
(1-3) Preprocess the initial-retrieval relevant feedback document set, and construct a target language text document index library and a feature word library.
The preprocessing method for the initial-retrieval relevant feedback document set adopts a corresponding method for each language. For example, if the target language is English, the preprocessing method is: remove English stop words, extract English feature word stems with the Porter program (see http://tartarus.org/martin/PorterStemmer) and calculate the English feature word weights. If the target language is Chinese, the preprocessing method is: remove Chinese stop words, extract Chinese feature words after segmenting the Chinese documents, and calculate the Chinese feature word weights.
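The English branch of the preprocessing above can be sketched as follows (a minimal sketch with an illustrative stop-word list; a real implementation would also apply the Porter stemmer mentioned above):

```python
# Minimal English preprocessing sketch: lowercase, tokenize, drop stop words.
# The stop-word list is illustrative only; stemming is omitted for brevity.
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def preprocess(text):
    """Return the feature-word tokens of one feedback document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The expansion of a query in cross-language retrieval")
# tokens: ['expansion', 'query', 'cross', 'language', 'retrieval']
```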
The invention provides a calculation formula for the feature word weight of the initial-retrieval relevant feedback documents, as shown in formula (1).
wij = (tfj,i / maxtf(di)) × idfj (1)
In formula (1), wij represents the weight of feature word tj in document di, and tfj,i is the word frequency of tj in document di; tfj,i is commonly normalized, where normalization means dividing the tfj,i of each feature word in document di by the maximum word frequency maxtf(di) of document di; idfj is the inverse document frequency (Inverse Document Frequency).
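Formula (1) can be sketched as follows (a minimal sketch; the base-10 logarithm in idfj is an assumption, since the text does not fix the logarithm base):

```python
# Sketch of formula (1): word frequency normalized by the maximum word
# frequency in the document, multiplied by the inverse document frequency.
import math

def feature_weight(tf_ji, max_tf_i, n_docs, df_j):
    """w_ij = (tf_{j,i} / max tf in d_i) * idf_j, with idf_j = log10(n/df_j)."""
    idf_j = math.log10(n_docs / df_j)
    return (tf_ji / max_tf_i) * idf_j

w = feature_weight(tf_ji=3, max_tf_i=5, n_docs=1000, df_j=10)
# (3/5) * log10(1000/10) = 0.6 * 2 = 1.2
```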
The source of the cross-language query expansion words is the cross-language initial-retrieval relevant feedback documents. Therefore, in the cross-language initial-retrieval relevant feedback document set, the more initial-retrieval relevant feedback documents contain a certain text feature word, the more related that feature word is to the original query, the more important it is, and the higher its weight should be.
Step 2: mining a frequent item set containing original query terms in an initial examination related feedback document set through item set weight value comparison, and pruning the item set by using an item set relevance value and the maximum item weight value or the maximum item weight value of the item set, wherein the method comprises the following specific steps:
(2-1) Mine the text feature word 1_frequent item sets L1. The specific steps are as follows:
(2-1-1) Extract text feature words from the feature word library as 1_candidate item sets C1.
(2-1-2) Scan the target language text document index library, and count the total number n of text documents and the item set weight w[C1] of each C1.
(2-1-3) Calculate the minimum weight support threshold (MWS). The MWS calculation formula is shown in formula (2).
MWS=n×ms (2)
In formula (2), ms is the minimum support threshold and n is the total number of text documents in the target language text document index library.
(2-1-4) If w[C1] ≥ MWS, then C1 is a text feature word 1_frequent item set L1 and is added to the frequent item set set FIS (Frequent Item Set).
(2-2) Mine the text feature word 2_frequent item sets L2. The specific steps are as follows:
(2-2-1) Derive multiple 2_candidate item sets C2 by joining the text feature word 1_frequent item sets L1 with the Apriori join method.
The Apriori join method is described in detail in the literature (Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).
(2-2-2) Prune the 2_candidate item sets C2 containing no original query terms.
(2-2-3) For each remaining 2_candidate item set C2, scan the target language text document index library and count its item set weight w[C2].
(2-2-4) If w[C2] ≥ MWS×2, then the 2_candidate item set C2 is a text feature word 2_frequent item set L2 and is added to the frequent item set set FIS.
(2-3) Mine the text feature word k_frequent item sets Lk, k ≥ 2. The specific steps are as follows:
(2-3-1) Derive multiple k_candidate item sets Ck = (i1, i2, …, ik), k ≥ 2, by joining the text feature word (k-1)_frequent item sets Lk-1 with the Apriori join method.
(2-3-2) Scan the target language text document index library; for each Ck, count its item set weight w[Ck] and its largest item weight wm, and obtain the item im corresponding to the largest item weight wm of each Ck, m ∈ (1, 2, …, k).
(2-3-3) If the 1_item set (im) corresponding to the item im is infrequent, or wm < MWS, prune the corresponding Ck.
(2-3-4) For each remaining Ck, calculate its item set relevancy IRe(Ck); if w[Ck] ≥ MWS×k and IRe(Ck) ≥ minIRe, then Ck is a text feature word k_frequent item set Lk and is added to the frequent item set set FIS; otherwise, prune the Ck.
minIRe is the minimum item set relevancy threshold; IRe(Ck) is shown in formula (3):
IRe(Ck) = wmin[(iq)] / wmax[(ip)] (3)
In formula (3), the meanings of wmin[(iq)] and wmax[(ip)] are as follows: for a k_candidate item set Ck = (i1, i2, …, ik), each item i1, i2, …, ik taken alone as a 1_item set corresponds to (i1), (i2), …, (ik); wmin[(iq)] and wmax[(ip)] respectively represent the smallest and the largest 1_item set weight among (i1), (i2), …, (ik); q ∈ (1, 2, …, k), p ∈ (1, 2, …, k).
(2-3-5) If the text feature word k_frequent item set Lk is empty, the mining of text feature word frequent item sets ends; go to step 3 below. Otherwise, add 1 to k and go to step (2-3-1) to continue the loop.
The pruning methods are as follows:
(1) For a k_candidate item set Ck = (i1, i2, …, ik): if the item set weight w[Ck] < MWS×k, Ck is infrequent and is pruned; if the item set relevancy IRe(Ck) < minIRe, Ck is an invalid item set and is pruned. In summary, the invention only mines the Ck with w[Ck] ≥ MWS×k and IRe(Ck) ≥ minIRe, where minIRe is the minimum item set relevancy threshold.
(2) If the largest item weight in a k_candidate item set Ck = (i1, i2, …, ik) is smaller than the minimum weight support threshold MWS, then Ck is infrequent and is pruned.
(3) Let the item corresponding to the largest item weight in a k_candidate item set Ck = (i1, i2, …, ik), taken alone as a 1_item set, be (im); if the 1_item set (im) is infrequent, Ck is pruned.
(4) When mining candidate 2_item sets, the candidate 2_item sets containing no original query terms are deleted, leaving the candidate 2_item sets containing original query terms.
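The mining of step 2 together with the four pruning strategies can be compressed into the following sketch (an illustrative simplification, not the patent's exact procedure: candidate generation uses combinations of frequent 1_item sets instead of a full Apriori join, the query-term filter is applied at every k, and all names and thresholds are hypothetical):

```python
# Illustrative sketch of weight-comparison frequent item set mining with the
# pruning strategies above. item_weights maps each feature word to its item
# weight accumulated over the document index library.
from itertools import combinations

def mine_frequent_itemsets(item_weights, n_docs, ms, min_ire, query_terms, max_k=3):
    mws = n_docs * ms                                    # formula (2): MWS = n * ms
    l1 = {i for i, w in item_weights.items() if w >= mws}
    fis = [frozenset([i]) for i in sorted(l1)]           # 1_frequent item sets
    for k in range(2, max_k + 1):
        lk = []
        for cand in combinations(sorted(l1), k):
            cand = frozenset(cand)
            if not cand & query_terms:                   # pruning (4): drop sets with no query term
                continue
            w_items = [item_weights[i] for i in cand]
            w_ck = sum(w_items)                          # item set weight w[Ck]
            if max(w_items) < mws:                       # pruning (2)/(3); redundant here since
                continue                                 # candidates are drawn from L1 only
            ire = min(w_items) / max(w_items)            # formula (3): IRe(Ck)
            if w_ck >= mws * k and ire >= min_ire:       # pruning (1)
                lk.append(cand)
        if not lk:
            break
        fis.extend(lk)
    return fis
```

With toy weights {"q": 3.0, "a": 2.5, "b": 2.0, "c": 0.5}, n = 10, ms = 0.1 and query term "q", the sketch keeps {a,q}, {b,q} and {a,b,q} but prunes {a,b} (no query term) and anything containing the infrequent "c".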
Step 3: Using the chi-square analysis-confidence evaluation framework, mine the text feature word weighted association rule patterns containing the original query terms from each k_frequent item set Lk in the frequent item set set FIS, k ≥ 2. The specific steps are as follows:
Take out any text feature word k_frequent item set Lk from the frequent item set set FIS, and mine all association rule patterns containing the original query terms in each Lk according to the following steps.
(3-1) Construct the set of all proper subset item sets of Lk;
(3-2) Arbitrarily take two proper subset item sets qt and Et from the proper subset item set set such that qt ∩ Et = ∅, qt ∪ Et = Lk, qt ⊆ QTL and Et ∩ QTL = ∅, where QTL is the term set of the original query in the target language and Et is a feature term set containing no original query terms; compute the chi-square value of the item set (qt, Et). The chi-square value Chis(qt, Et) calculation formula is shown in formula (4).
Chis(qt,Et) = n×[ws(qt,Et) − ws(qt)×ws(Et)]² / [ws(qt)×ws(Et)×(1−ws(qt))×(1−ws(Et))] (4)
where ws(qt) = w[(qt)]/(n×k1), ws(Et) = w[(Et)]/(n×k2) and ws(qt,Et) = w[(qt,Et)]/(n×kL) are the weighted supports. In formula (4), w[(qt)] is the item set weight of item set qt in the target language text document index library and k1 is the length of qt; w[(Et)] is the item set weight of item set Et and k2 is the length of Et; w[(qt,Et)] is the item set weight of item set (qt,Et) and kL is the length of (qt,Et); n is the total number of text documents in the target language text document index library.
The core idea of chi-square analysis is to measure the correlation between data items: if Chis(qt, Et) = 0, the two proper subset item sets qt and Et are independent of each other and have no correlation. Therefore, the appearance of spuriously correlated association rules can be avoided.
(3-3) If Chis(qt, Et) > 0, calculate the confidence WConf(qt→Et) of the text feature word weighted association rule. If WConf(qt→Et) is greater than or equal to the minimum confidence threshold mc, the association rule qt→Et is a strong weighted association rule pattern and is added to the weighted association rule pattern set WAR. WConf(qt→Et) is shown in formula (5).
WConf(qt→Et) = (w[(qt,Et)]/kL) / (w[(qt)]/k1) (5)
In formula (5), w[(qt)], k1, w[(qt,Et)] and kL are defined as in formula (4).
(3-4) If and only if each proper subset item set of Lk has been taken once, the mining of the text feature word weighted association rule patterns in this Lk ends; take out another Lk from the frequent item set set FIS and go to step (3-1) to mine the weighted association rule patterns of that Lk; otherwise, go to step (3-2) and execute the steps in sequence. If every Lk in the frequent item set set FIS has been taken out, the mining of the whole weighted association rule pattern set ends; go to step 4 below.
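The acceptance test of step (3-3) can be sketched as follows (WConf is taken here as the ratio of length-normalized item set weights, consistent with the terms w[(qt)], k1, w[(qt,Et)], kL defined for formula (4); this reading is an assumption, not the patent's verbatim formula):

```python
# Sketch of the strong-rule test: confidence of qt -> Et from item set
# weights, then the joint Chis > 0 and WConf >= mc check of step (3-3).

def wconf(w_qt, k1, w_qt_et, kl):
    """Assumed WConf(qt -> Et) = (w[(qt,Et)]/kL) / (w[(qt)]/k1)."""
    return (w_qt_et / kl) / (w_qt / k1)

def is_strong_rule(chis, conf, mc):
    """Keep the rule when Chis > 0 and the confidence reaches threshold mc."""
    return chis > 0 and conf >= mc

c = wconf(w_qt=2.0, k1=1, w_qt_et=3.0, kl=2)   # (3/2) / (2/1) = 0.75
```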
Step 4: Extract the weighted association rule consequents Et from the weighted association rule pattern set WAR as query expansion words, and calculate the expansion word weights.
Extract the consequent Et of each weighted association rule qt→Et in the weighted association rule pattern set WAR as a query expansion word. The item set relevancy is an important index for measuring the degree of association of the items in an item set, and the confidence value and chi-square value are important indexes for measuring the association between the antecedent and consequent of an association rule pattern. In view of this, the invention takes the relevancy, chi-square value and confidence value as the calculation basis of the expansion word weight and, according to the importance of these 3 measures to the expansion word, proposes the expansion word weight we calculation formula shown in formula (6):
we=0.5×max(WConf())+0.3×max(Chis())+0.2×max(IRe()) (6)
In formula (6), max(WConf()), max(Chis()) and max(IRe()) represent the maximum values of the weighted association rule confidence, chi-square value and item set relevancy, respectively; that is, when an expansion word appears repeatedly in multiple weighted association rule patterns, the maximum of each of the 3 measures is taken.
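Formula (6) can be sketched as follows (an illustrative sketch; the triples of measure values are hypothetical):

```python
# Sketch of formula (6): when an expansion word appears in several weighted
# association rule patterns, take the maxima of its confidence, chi-square
# and relevancy values, then combine them with the 0.5/0.3/0.2 coefficients.

def expansion_word_weight(measures):
    """measures: list of (wconf, chis, ire) triples for one expansion word."""
    max_conf = max(m[0] for m in measures)
    max_chis = max(m[1] for m in measures)
    max_ire = max(m[2] for m in measures)
    return 0.5 * max_conf + 0.3 * max_chis + 0.2 * max_ire

w_e = expansion_word_weight([(0.4, 0.1, 0.5), (0.6, 0.05, 0.7)])
# 0.5*0.6 + 0.3*0.1 + 0.2*0.7 = 0.47
```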
Step 5: Combine the expansion words with the original query words to form a new query, and retrieve the target language documents again to complete the cross-language query expansion.
Experimental design and results:
In order to illustrate the effectiveness of the method, Indonesian and English are taken as the language objects, Indonesian-English cross-language information retrieval experiments based on the method of the invention and on the comparison methods are carried out, and their cross-language retrieval performance is compared.
The experimental corpora:
The experimental corpus of the invention is the standard data set NTCIR-5 CLIR corpus (see http://research.nii.ac.jp/ntcir/permission/ntcir-5/perm-en-CLIR.html). The news texts of the English document sets Mainichi Daily News 2000, 2001 and Korea Times 2001 in the NTCIR-5 CLIR corpus are selected, 26224 English documents in total, as the experimental data of the invention, specifically 6608 news texts of Mainichi Daily News 2000 (m00), 5547 news texts of Mainichi Daily News 2001 (m01) and 14069 news texts of Korea Times 2001 (k01).
The NTCIR-5 CLIR corpus comprises a document test set, a set of 50 query subjects and the corresponding result sets. Each query subject comprises 4 types, namely Title, Desc, Narr and Conc, and the result sets adopt 2 evaluation criteria, namely the Rigid criterion (highly relevant and relevant) and the Relax criterion (highly relevant, relevant and partially relevant). The invention selects the Title and Desc types of query subjects for the experiments: the Title query is a short query that briefly describes the query subject with nouns and noun phrases, while the Desc query is a long query that briefly describes the query subject in sentence form.
The evaluation indexes of the experimental results are P@15 and the average R-precision. P@15 is the precision over the first 15 results returned for a test query; the average R-precision is the arithmetic mean of the R-precision over all queries, where R-precision is the precision computed after R documents have been retrieved, R being the number of relevant documents for the query.
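The two evaluation indexes can be sketched as follows (an illustrative Python sketch; the function names and document-id representation are hypothetical):

```python
def precision_at_k(ranked, relevant, k=15):
    """P@k: fraction of the first k returned documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    """R-precision: precision after R documents are retrieved,
    where R is the number of relevant documents for the query."""
    r = len(relevant)
    return sum(1 for d in ranked[:r] if d in relevant) / r

# The average R-precision is then the arithmetic mean over all test queries.
```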
The comparison methods are as follows:
(1) Comparison method 1: the Indonesian-English cross-language baseline retrieval method. Comparison method 1 refers to the retrieval result obtained by machine-translating the Indonesian query into English and then retrieving the English documents, without any query expansion during retrieval.
(2) Comparison method 2: the Indonesian-English cross-language query post-translation expansion method based on weighted association pattern mining. Comparison method 2 is the cross-language query expansion method of the literature (Huang Mingxuan. Indonesian-English cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-). The experimental parameters are: minimum confidence threshold mc of 0.01, minimum interestingness threshold mi of 0.0001, and minimum support threshold ms of 0.007, 0.008, 0.009, 0.01 and 0.011.
(3) Comparison method 3: the Indonesian-English cross-language query post-translation expansion method based on pseudo-relevance feedback, i.e., the cross-language query expansion method of the literature (Wu Dan et al. Cross-language query expansion based on pseudo-relevance feedback [J]. Journal of the China Society for Scientific and Technical Information, 2010, 29(2): 232-). The experimental method is: the top 20 English documents of the Indonesian-English cross-language initial retrieval are extracted to construct the initial relevant document set, feature terms are extracted and their weights calculated, and the top 20 feature terms in descending order of weight are taken as English expansion words to realize Indonesian-English cross-language query post-translation expansion.
The experimental methods and results are as follows:
The source programs of the present method and the comparison methods are run. First, the Title and Desc queries of the 50 Indonesian query subjects are translated into English by a machine translation system, and the English documents are retrieved to realize Indonesian-English cross-language information retrieval. In the experiments, user relevance feedback is performed on the top 50 English documents of the cross-language initial retrieval to obtain the initial user relevance feedback documents (for simplicity, the relevant documents among the top 50 initially retrieved documents according to the known result set are regarded as the initial relevant documents); the association rule patterns are then obtained by running the mining method, and the rule consequents are extracted as expansion words to realize cross-language query expansion. The Indonesian-English cross-language retrieval results P@15 and average R-precision of the present method and the comparison methods are shown in Tables 1 and 2, respectively. 3_item sets are mined in the experiments, and the experimental parameters of the present method are: minimum confidence threshold mc of 0.5, 0.6, 0.7, 0.8 and 0.9, respectively, minimum support threshold ms of 0.5, and minimum item set relevancy threshold minIRe of 0.4.
Table 1 Retrieval performance comparison of the inventive method and the comparison methods (Title query subjects)
(The data of Table 1 are provided as an image in the original document.)
Table 2 Retrieval performance comparison of the inventive method and the comparison methods (Desc query subjects)
(The data of Table 2 are provided as an image in the original document.)
Tables 1 and 2 show that the cross-language retrieval results P@15 and average R-precision of the present method are higher than those of the 3 comparison methods, and the improvement is significant. The experimental results show that the method is effective, can indeed improve cross-language information retrieval performance, and has high application value and broad popularization prospects.

Claims (2)

1. A cross-language query expansion method for realizing rule consequent mining through weight comparison, characterized by comprising the following steps:
Step 1: translating a source language user query into a target language through a machine translation system; retrieving a target language text document set with a vector space retrieval model to obtain the top-ranked target language documents of the initial retrieval; establishing an initial user relevant document set by performing relevance judgment on the top-ranked target language documents; preprocessing the initial user relevant document set; and establishing a target language text document index library and a feature word library;
Step 2: mining frequent item sets containing original query terms in the initial user relevance feedback document set through item set weight comparison, and pruning candidate item sets by the item set relevancy value and the maximum item weight of the item set, with the following specific steps:
(2-1) mining text feature word 1_frequent item sets L1, with the following specific steps:
(2-1-1) extracting text feature words from the feature word library as 1_candidate item sets C1;
(2-1-2) scanning the target language text document index library, counting the total number n of text documents and the item set weight w[C1] of each C1;
(2-1-3) calculating the minimum weight support threshold MWS, where the MWS calculation formula is shown in formula (2):
MWS=n×ms (2)
in formula (2), ms is the minimum support threshold, and n is the total number of text documents in the target language text document index library;
(2-1-4) if w[C1]≥MWS, then C1 is a text feature word 1_frequent item set L1 and is added to the frequent item set FIS;
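Step (2-1) can be sketched as follows (an illustrative Python sketch, not part of the claims; it assumes an item set weight w[C] accumulated as the sum of the corresponding feature word weights over all indexed documents, and all names are hypothetical):

```python
def mine_frequent_1_itemsets(doc_index, ms):
    """doc_index: dict doc_id -> {feature word: weight w_ij}.
    Returns every 1_item set whose weight reaches MWS = n * ms (formula (2))."""
    n = len(doc_index)
    mws = n * ms                                 # formula (2)
    weights = {}
    for term_weights in doc_index.values():      # step (2-1-2): one scan
        for term, w in term_weights.items():
            weights[term] = weights.get(term, 0.0) + w
    # step (2-1-4): keep 1_item sets whose weight reaches MWS
    return {(t,): w for t, w in weights.items() if w >= mws}
```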
(2-2) mining text feature word 2_frequent item sets L2, with the following specific steps:
(2-2-1) joining the text feature word 1_frequent item sets L1 by the Apriori join method to derive multiple 2_candidate item sets C2;
(2-2-2) pruning the 2_candidate item sets C2 without original query terms;
(2-2-3) for each remaining 2_candidate item set C2, scanning the target language text document index library to count its item set weight w[C2];
(2-2-4) if w[C2]≥MWS×2, then the 2_candidate item set C2 is a text feature word 2_frequent item set L2 and is added to the frequent item set FIS;
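Steps (2-2-1) and (2-2-2) can be sketched as follows (an illustrative Python sketch; for 1_item sets the Apriori join reduces to pairing distinct items, and candidates without original query terms are pruned immediately; names are hypothetical):

```python
def gen_2_candidates(frequent_1, query_terms):
    """frequent_1: dict mapping 1_item sets (t,) to their weights.
    Returns the 2_candidate item sets that contain an original query term."""
    items = sorted(t for (t,) in frequent_1)
    cands = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):           # Apriori join of L1 with L1
            pair = (items[i], items[j])
            if any(t in query_terms for t in pair):  # step (2-2-2) pruning
                cands.append(pair)
    return cands
```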
(2-3) mining text feature word k_frequent item sets Lk, k≥2, with the following specific steps:
(2-3-1) joining the text feature word (k-1)_frequent item sets Lk-1 by the Apriori join method to derive multiple k_candidate item sets Ck=(i1,i2,…,ik), k≥2;
(2-3-2) scanning the target language text document index library, counting for each Ck its item set weight w[Ck] and its maximum item weight wm, and obtaining for each Ck the item im corresponding to the maximum item weight wm, where m∈(1,2,…,k);
(2-3-3) if the 1_item set (im) corresponding to the item im is infrequent, or wm<MWS, pruning the corresponding Ck;
(2-3-4) for each remaining Ck, calculating its item set relevancy IRe(Ck); if w[Ck]≥MWS×k and IRe(Ck)≥minIRe, then Ck is a text feature word k_frequent item set Lk and is added to the frequent item set FIS; minIRe is the minimum item set relevancy threshold; IRe(Ck) is shown in formula (3):
IRe(Ck)=wmin[(iq)]/wmax[(ip)] (3)
in formula (3), the meanings of wmin[(iq)] and wmax[(ip)] are as follows: for the k_candidate item set Ck=(i1,i2,…,ik), the items i1,i2,…,ik, each taken separately as a 1_item set, correspond to (i1),(i2),…,(ik); wmin[(iq)] and wmax[(ip)] respectively denote the smallest 1_item set weight and the largest 1_item set weight among (i1),(i2),…,(ik); q∈(1,2,…,k), p∈(1,2,…,k);
(2-3-5) if the text feature word k_frequent item set Lk is an empty set, the mining of text feature word frequent item sets is finished and the method proceeds to step 3 below; otherwise, k is increased by 1 and the method returns to step (2-3-1) to continue the loop in sequence;
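The pruning of steps (2-3-2) to (2-3-4) can be sketched as follows (an illustrative Python sketch; the item set relevancy IRe is taken here as the min/max ratio of the member 1_item set weights, an assumed reading of formula (3), and all names are hypothetical):

```python
def prune_k_candidates(cands, item_w, itemset_w, frequent_1, mws, min_ire, k):
    """cands: k_candidate item sets; item_w: weight of each item's 1_item set;
    itemset_w: item set weight w[Ck] per candidate; frequent_1: set of
    frequent 1_item sets; mws/min_ire: thresholds; k: candidate length."""
    survivors = []
    for c in cands:
        member_ws = [item_w.get(i, 0.0) for i in c]
        w_max = max(member_ws)
        i_max = c[member_ws.index(w_max)]
        # step (2-3-3): prune if (i_max) is infrequent or w_max < MWS
        if (i_max,) not in frequent_1 or w_max < mws:
            continue
        ire = min(member_ws) / w_max          # assumed form of IRe(Ck)
        # step (2-3-4): weight and relevancy thresholds
        if itemset_w[c] >= mws * k and ire >= min_ire:
            survivors.append(c)
    return survivors
```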
Step 3: mining text feature word weighted association rule patterns containing original query terms from each text feature word k_frequent item set Lk in the frequent item set FIS by a chi-square analysis-confidence evaluation framework, k≥2; the specific method is as follows:
extracting any one text feature word k_frequent item set Lk from the frequent item set FIS, and mining all association rule patterns containing original query terms of each Lk according to the following steps:
(3-1) constructing the set of all proper subset item sets of Lk;
(3-2) arbitrarily taking two proper subset item sets qt and Et from the proper subset item set set, such that qt∩Et=∅, qt∪Et=Lk, qt⊆QTL and Et∩QTL=∅, where QTL is the original query term set of the target language and Et is a feature term set not containing original query terms; calculating the chi-square value Chis(qt,Et) of the item set (qt,Et) according to formula (4):
Chis(qt,Et)=n×(ws(qt,Et)−ws(qt)×ws(Et))²/(ws(qt)×ws(Et)×(1−ws(qt))×(1−ws(Et))) (4)
where ws(qt)=w[(qt)]/(k1×n), ws(Et)=w[(Et)]/(k2×n) and ws(qt,Et)=w[(qt,Et)]/(kL×n); in formula (4), w[(qt)] is the item set weight of the item set qt in the target language text document index library, k1 is the length of the item set qt, w[(Et)] is the item set weight of the item set Et in the target language text document index library, k2 is the length of the item set Et, w[(qt,Et)] is the item set weight of the item set (qt,Et) in the target language text document index library, kL is the length of the item set (qt,Et), and n is the total number of text documents in the target language text document index library;
(3-3) if Chis(qt,Et)>0, calculating the confidence WConf(qt→Et) of the text feature word weighted association rule; if WConf(qt→Et) is greater than or equal to the minimum confidence threshold mc, the association rule qt→Et is a strong association rule pattern and is added to the weighted association rule pattern set WAR; WConf(qt→Et) is shown in formula (5):
WConf(qt→Et)=(k1×w[(qt,Et)])/(kL×w[(qt)]) (5)
in formula (5), w[(qt)], k1, w[(qt,Et)] and kL are defined as in formula (4);
(3-4) if and only if each proper subset item set of Lk has been taken once, the mining of the text feature word weighted association rule patterns in this Lk is finished; another Lk is then taken from the frequent item set FIS and the method returns to step (3-1) to mine the weighted association rule patterns of that Lk; otherwise, the method returns to step (3-2) and the steps are executed in sequence; if every Lk in the frequent item set FIS has been taken out for weighted association rule pattern mining, the mining of all weighted association rule patterns is finished, and the method proceeds to step 4 below;
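Step 3 for a single frequent item set Lk can be sketched as follows (an illustrative Python sketch, assuming a weighted support of the form ws(X)=w[X]/(|X|×n) underlying the chi-square and confidence measures; function and variable names are hypothetical):

```python
from itertools import chain, combinations

def mine_rules(lk, w, n, query_terms, mc):
    """lk: frozenset of feature words; w: dict frozenset -> item set weight;
    n: total documents; query_terms: QTL; mc: minimum confidence threshold."""
    rules = []
    proper_subsets = chain.from_iterable(
        combinations(sorted(lk), r) for r in range(1, len(lk)))
    for subset in proper_subsets:                      # step (3-2)
        qt, et = frozenset(subset), lk - set(subset)
        # antecedent drawn from query terms, consequent free of them
        if not (qt <= query_terms and not (et & query_terms)):
            continue
        ws = lambda x: w[frozenset(x)] / (len(x) * n)  # assumed weighted support
        s_q, s_e, s_qe = ws(qt), ws(et), ws(lk)
        chis = n * (s_qe - s_q * s_e) ** 2 / (
            s_q * s_e * (1 - s_q) * (1 - s_e))         # chi-square analysis
        if chis > 0:                                   # step (3-3)
            conf = s_qe / s_q                          # rule confidence
            if conf >= mc:
                rules.append((qt, et, conf, chis))
    return rules
```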
Step 4: extracting the consequent Et of each weighted association rule qt→Et from the weighted association rule pattern set WAR as a query expansion word, and calculating the expansion word weight we according to formula (6):
we=0.5×max(WConf())+0.3×max(Chis())+0.2×max(IRe()) (6)
in formula (6), max(WConf()), max(Chis()) and max(IRe()) respectively represent the maximum value of the weighted association rule confidence, the maximum chi-square value and the maximum relevancy;
Step 5: combining the query expansion words obtained in step 4 with the original query words into a new query, and retrieving the target language documents again to complete the cross-language query expansion.
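Step 5 can be sketched as follows (an illustrative Python sketch; the claims specify a vector space retrieval model, and the cosine similarity form used here is an assumption, as are all names):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def expanded_retrieval(query_w, expansion_w, doc_vectors):
    """Merge the original query weights with the expansion word weights w_e
    into a new query, then re-rank the target language documents."""
    new_query = dict(query_w)
    new_query.update(expansion_w)
    return sorted(doc_vectors,
                  key=lambda d: cosine(new_query, doc_vectors[d]),
                  reverse=True)
```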
2. The cross-language query expansion method for realizing rule consequent mining through weight comparison according to claim 1, wherein the preprocessing of the initial user relevant document set in step 1 specifically comprises: removing stop words, extracting feature words, and calculating feature word weights according to formula (1);
wij=tfj,i×idfj (1)
in formula (1), wij represents the weight of feature word tj in document di, tfj,i is the word frequency of feature word tj in document di, and idfj is the inverse document frequency of tj.
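Formula (1) can be sketched as follows (an illustrative Python sketch assuming the common inverse document frequency form idfj = log(n/dfj), where dfj is the number of documents containing tj; the claim does not fix the exact idf definition):

```python
import math

def feature_word_weights(docs):
    """docs: dict doc_id -> list of feature words after stop word removal.
    Returns w_ij = tf_{j,i} * idf_j for each (document, feature word) pair."""
    n = len(docs)
    df = {}
    for words in docs.values():
        for t in set(words):
            df[t] = df.get(t, 0) + 1
    weights = {}
    for d, words in docs.items():
        for t in set(words):
            # tf is the raw word frequency; idf assumed as log(n / df)
            weights[(d, t)] = words.count(t) * math.log(n / df[t])
    return weights
```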
CN201811646511.5A 2018-12-30 2018-12-30 A Cross-Language Query Expansion Method for Rule Consequence Mining Through Weight Comparison Expired - Fee Related CN109684464B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811646511.5A CN109684464B (en) 2018-12-30 2018-12-30 A Cross-Language Query Expansion Method for Rule Consequence Mining Through Weight Comparison


Publications (2)

Publication Number Publication Date
CN109684464A CN109684464A (en) 2019-04-26
CN109684464B true CN109684464B (en) 2021-06-04

Family

ID=66191526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811646511.5A Expired - Fee Related CN109684464B (en) 2018-12-30 2018-12-30 A Cross-Language Query Expansion Method for Rule Consequence Mining Through Weight Comparison

Country Status (1)

Country Link
CN (1) CN109684464B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897928A (en) * 2020-08-04 2020-11-06 广西财经学院 A Chinese query expansion method based on the union of query word embedding expansion words and statistical expansion words

Citations (5)

Publication number Priority date Publication date Assignee Title
CN101943952A (en) * 2010-01-27 2011-01-12 北京搜狗科技发展有限公司 Mixed input method of at least two languages and input method system
CN107526839A (en) * 2017-09-08 2017-12-29 广西财经学院 Based on weight positive negative mode completely consequent extended method is translated across language inquiry
WO2018018912A1 (en) * 2016-07-29 2018-02-01 北京搜狗科技发展有限公司 Search method and apparatus, and electronic device
CN108170778A (en) * 2017-12-26 2018-06-15 广西财经学院 Rear extended method is translated across language inquiry by China and Britain based on complete Weighted Rule consequent
CN108334526A (en) * 2017-01-20 2018-07-27 北京搜狗科技发展有限公司 The methods of exhibiting and device of search result items

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN1632793A (en) * 2004-12-29 2005-06-29 复旦大学 An Optimal Method for Publishing Relational Data as XML Documents Using Cache
CN101901249A (en) * 2009-05-26 2010-12-01 复旦大学 A Text-Based Query Expansion and Ranking Method in Image Retrieval
CN102033954B (en) * 2010-12-24 2012-10-17 东北大学 Extensible Markup Language Document Full Text Retrieval Query Indexing Method in Relational Database
CN104298676A (en) * 2013-07-18 2015-01-21 佳能株式会社 Topic mining method and equipment and query expansion method and equipment
CA2943513C (en) * 2014-03-29 2020-08-04 Thomson Reuters Global Resources Improved method, system and software for searching, identifying, retrieving and presenting electronic documents
CN104182527B (en) * 2014-08-27 2017-07-18 广西财经学院 Association rule mining method and its system between Sino-British text word based on partial order item collection
CN107609095B (en) * 2017-09-08 2019-07-09 广西财经学院 A Cross-Language Query Expansion Method Based on Weighted Positive and Negative Rule Antecedents and Relevant Feedback


Non-Patent Citations (2)

Title
Cross Language Query Expansion Approach for CIMS Based on Weighted D-S Evidence Theory; Xiaobo Wang et al.; Key Engineering Materials; 2014-12-31; full text *
Matrix-weighted association rule mining based on item weight variation; Zhou Xiumei et al.; Application Research of Computers; 2015-10-31 (No. 10); full text *

Also Published As

Publication number Publication date
CN109684464A (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN108763196A (en) A kind of keyword extraction method based on PMI
CN103064969A (en) Method for automatically creating keyword index table
CN106372241B Vietnamese-English cross-language text retrieval method and system based on weighted association patterns between words
CN110472005A (en) A kind of unsupervised keyword extracting method
CN111831786A (en) Full-text database accurate and efficient retrieval method for perfecting subject term
CN108920482A (en) Microblogging short text classification method based on Lexical Chains feature extension and LDA model
CN109299278B (en) Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
CN107609095B (en) A Cross-Language Query Expansion Method Based on Weighted Positive and Negative Rule Antecedents and Relevant Feedback
CN109726263B (en) Cross-language post-translation hybrid extension method based on feature word weighted association pattern mining
CN114580557B (en) Method and device for determining document similarity based on semantic analysis
CN109739953B (en) A Text Retrieval Method Based on Chi-Square Analysis-Confidence Framework and Consequence Expansion
CN109684464B (en) A Cross-Language Query Expansion Method for Rule Consequence Mining Through Weight Comparison
CN109684463B Cross-language query post-translation antecedent expansion method based on weight comparison mining
CN109299292B (en) A Text Retrieval Method Based on Matrix-Weighted Association Rules Mixed Expansion of Context and Context
CN107526839B Cross-language query post-translation consequent expansion method based on completely weighted positive and negative patterns
CN109684465B (en) Pattern Mining and Hybrid Extended Text Retrieval Method Based on Itemset Weight Comparison
CN109739952A (en) Pattern Mining and Extended Cross-Language Retrieval Method Integrating Relevance and Chi-square Values
Li et al. Multi-feature keyword extraction method based on TF-IDF and Chinese grammar analysis
CN108170778B Chinese-English cross-language query post-translation expansion method based on completely weighted rule consequents
CN107562904B (en) Mining method of weighted positive and negative association patterns between English words by fusing item weight and frequency
CN108416442B (en) A Chinese Interword Matrix Weighted Association Rule Mining Method Based on Item Frequency and Weight
CN113408286A (en) Chinese entity identification method and system for mechanical and chemical engineering field
Shahabi et al. A method for multi-text summarization based on multi-objective optimization use imperialist competitive algorithm
Li et al. Keyword extraction based on lexical chains and word co-occurrence for Chinese news web pages
CN109684462B (en) Mining method of association rules between text words based on weight comparison and chi-square analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20210604
Termination date: 20211230