CN106484781A

CN106484781A - Indonesian-Chinese cross-language retrieval method and system fusing association patterns and user feedback

Info

Publication number: CN106484781A
Application number: CN201610827858.4A
Authority: CN
Inventors: 黄名选
Original assignee: Guangxi University of Finance and Economics
Current assignee: Guangxi University of Finance and Economics
Priority date: 2016-09-18
Filing date: 2016-09-18
Publication date: 2017-03-08
Anticipated expiration: 2036-09-18
Also published as: CN106484781B

Abstract

The invention discloses an Indonesian-Chinese cross-language retrieval method and system that fuses association patterns and user feedback. A machine translation module translates the Indonesian user query into a Chinese query, which is submitted to the search engine module to obtain an initial retrieval result document set. A user click behavior relevance feedback extraction module derives the user-feedback relevant document set from the initial results, and a document preprocessing module turns it into the initial-retrieval relevant document database. A fully weighted association rule mining module then builds a fully weighted association rule base, a cross-language query expansion word generation module builds an expansion word library, and a cross-language query expansion implementation module submits the combined new query to the search engine module to obtain the final Chinese result documents. A final result display module passes these documents to the machine translation module for translation into Indonesian and returns them to the user. The invention effectively improves cross-language retrieval performance and has good practical value and prospects for wider adoption.

Description

Indonesian-Chinese cross-language retrieval method and system fusing association patterns and user feedback

Technical field

The invention belongs to the field of text information retrieval. Specifically, it is an Indonesian-Chinese cross-language retrieval method and system that fuses association patterns and user feedback, applicable to cross-language text information retrieval in which Indonesian queries are used to retrieve Chinese documents.

Background

Cross-language information retrieval is the technique of using a query in one language to retrieve information resources in other languages. Indonesian-Chinese cross-language information retrieval addresses the retrieval of Chinese documents with Indonesian queries: Indonesian, the language of the query, is the source language, and Chinese, the language of the retrieved documents, is the target language. As exchanges between China and the ASEAN countries grow closer, research on cross-language information retrieval for ASEAN languages has become urgent and important.

Researchers worldwide have studied cross-language information retrieval methods and systems from many angles and have produced a rich body of results, but the open problems of the field are not yet fully solved. One of the most pressing and widely discussed is the severe query topic drift that arises during cross-language retrieval, together with a word mismatch problem that is more serious than in monolingual retrieval. These problems often leave cross-language retrieval performing worse than monolingual retrieval. In response, query-expansion-based cross-language retrieval has drawn growing attention in recent years. The work has concentrated on relevance feedback (Parton K, Gao J. Combining Signals for Cross-Lingual Relevance Feedback[C]. Proceedings of 8th Asia Information Retrieval Societies Conference (AIRS 2012), Tianjin, China. Springer-Verlag Berlin Heidelberg, 2012, LNCS 7675, Information Retrieval Technology. 2012: 356-365; Lee C J, Croft W B. Cross-Language Pseudo-Relevance Feedback Techniques for Informal Text[C]. Proceedings of 36th European Conference on IR Research (ECIR 2014), Amsterdam, The Netherlands. Advances in Information Retrieval. Springer International Publishing, 2014: 260-272), on latent semantics (Bi Jianting, Su Yidan. Cross-language query expansion method based on latent semantic analysis[J]. Computer Engineering, 2009, 35(10): 49-53; Ning Jian, Lin Hongfei. Cross-language retrieval based on improved latent semantic analysis[J]. Journal of Chinese Information Processing, 2010, 24(3): 105-111), and on language models and topic models (Ganguly Debasis, Leveling Johannes, Jones Gareth J. F. Cross-lingual topical relevance models[C]. In: 24th International Conference on Computational Linguistics (COLING 2012), 2012; Wang Xuwen, Zhang Qiang, Wang Xiaojie, et al. LDA based pseudo relevance feedback for cross language information retrieval[C]. IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS 2012). Hangzhou: IEEE, 2012: 1993-1998; Xuwen Wang, Qiang Zhang, Xiaojie Wang, et al. Cross-lingual Pseudo Relevance Feedback Based on Weak Relevant Topic Alignment. Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, PACLIC 29, Shanghai, China, 2015: 529-534). Most of these studies center on English and examine cross-language retrieval between English and other languages.

Since Nanning became the permanent host city of the China-ASEAN Expo, political, economic, and cultural exchanges between China and the ASEAN countries have become more frequent and closer. Research on cross-language information retrieval and cross-language information services for ASEAN languages has therefore become more urgent, and its importance is increasingly evident.

Summary of the invention

The purpose of the invention is to address the above problems in the prior art by applying fully weighted association rule mining, combined with user relevance feedback, to Indonesian-Chinese cross-language information retrieval. It provides an Indonesian-Chinese cross-language retrieval method and system that fuses association patterns and user feedback, improves Indonesian-Chinese cross-language retrieval performance, and works better still for long queries.

To achieve this purpose, the invention adopts the following technical solution:

An Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback, comprising the following steps:

(1) Translate the Indonesian user query into a Chinese query through the machine translation module and submit it to the search engine for initial retrieval on the Internet, obtaining the initial retrieval result document set;

(2) Extract the top r Chinese documents of the cross-language initial retrieval result document set and present them to the user;

(3) The user judges the relevance of these Chinese documents, yielding the user-feedback relevant document set; let n be the total number of documents in this set;

(4) Preprocess the user-feedback relevant document set, that is, perform Chinese word segmentation, stop-word removal, feature-word weight calculation, and feature-word extraction, and build the initial-retrieval relevant document database;

(5) Scan the initial-retrieval relevant document database and mine the fully weighted feature-word candidate 1-itemsets $C_1$. Compute the weight $w(C_1)$ of $C_1$, and count the maximum weight $maxCw_i(\neg C_1)$ of the items other than $C_1$ and the support count $n_{C_1}$ of $C_1$. With ms the minimum support threshold, compute the value of $KIWT(1,2)$ as: $KIWT(1,2) = n \times 1 \times ms - n_{C_1} \times maxCw_i(\neg C_1)$;

(6) Compute the support $FTISup(C_1)$ of $C_1$. If $FTISup(C_1) \geq ms$, $C_1$ is a frequent 1-itemset $L_1$ and is added to the set L of fully weighted feature-word frequent itemsets. $FTISup(C_1)$ is computed as: $FTISup(C_1) = \frac{w(C_1)}{n \times 1}$;

(7) Mine the k-itemsets, where $k \geq 2$, through steps (7.1) to (7.7):

(7.1) Compare the weight of each candidate (k-1)-itemset $C_{k-1}$ with the value $KIWT(k-1,k)$, and prune every candidate $C_{k-1}$ with $w(C_{k-1}) < KIWT(k-1,k)$;

(7.2) Perform the Apriori join on the remaining candidate (k-1)-itemsets $C_{k-1}$ to obtain $C_k$;

(7.3) When k = 2, prune the candidate 2-itemsets that contain no query term;

(7.4) Scan the initial-retrieval relevant document database, count the maximum weight $maxCw_i(\neg C_k)$ of the items other than $C_k$ and the support count $n_{C_k}$ of $C_k$, and compute the weight $w(C_k)$ and the value of $KIWT(k-1,k)$ as: $KIWT(k-1,k) = n \times k \times ms - n_{C_k} \times maxCw_i(\neg C_k)$;

(7.5) Prune the candidate itemsets $C_k$ whose support count $n_{C_k}$ is 0;

(7.6) For each remaining candidate k-itemset $C_k$, compute its support $FTISup(C_k)$. If $FTISup(C_k) \geq ms$, $C_k$ is a frequent k-itemset $L_k$ and is added to the set L of fully weighted feature-word frequent itemsets. $FTISup(C_k)$ is computed as: $FTISup(C_k) = \frac{w(C_k)}{n \times k}$;

(7.7) If k exceeds the candidate itemset length threshold or the set of candidate k-itemsets is empty, mining ends; otherwise, repeat steps (7.1) to (7.6);

(8) From the set L of fully weighted feature-word frequent itemsets, mine the fully weighted feature-word association rules that contain query terms, and build the fully weighted association rule base;

(9) Extract the cross-language expansion words related to the original query from the fully weighted association rule base and build the expansion word library;

(10) Combine the original query with the expansion words and submit the combination to the search engine for a second retrieval, obtaining the final Chinese result documents;

(11) Submit the final Chinese result documents to the machine translation module for translation into Indonesian, and return both the final Chinese result documents and their Indonesian translations to the user.
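The mining loop of steps (5) to (7) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each document is assumed to be a dict mapping feature words to weights, the weight of an itemset is assumed to be the sum of its items' weights over the documents that contain every item, and the support is assumed to be that weight divided by n × k. The pruning bound follows the KIWT expression as written in steps (5) and (7.4). The function names, the `max_len` parameter, and the sample data are hypothetical.

```python
from itertools import combinations


def weight_and_count(docs, itemset):
    """Return (w(C), n_C): summed item weights over the documents that
    contain every item of the itemset, and the number of such documents."""
    weight, count = 0.0, 0
    for doc in docs:
        if all(term in doc for term in itemset):
            weight += sum(doc[term] for term in itemset)
            count += 1
    return weight, count


def max_other_weight(docs, itemset):
    """Largest weight of any item outside the itemset (0.0 if there is none)."""
    others = [w for doc in docs for term, w in doc.items() if term not in itemset]
    return max(others, default=0.0)


def mine_frequent_itemsets(docs, query_terms, ms, max_len=3):
    """Mine fully weighted frequent itemsets; returns {itemset: support}."""
    n = len(docs)
    query_terms = set(query_terms)
    candidates = {frozenset([t]) for doc in docs for t in doc}
    frequent = {}
    k = 1
    while candidates and k <= max_len:
        survivors = []
        for cand in candidates:
            weight, count = weight_and_count(docs, cand)
            if count == 0:                      # step (7.5)
                continue
            support = weight / (n * k)          # assumed form of FTISup
            if support >= ms:                   # steps (6) and (7.6)
                frequent[cand] = support
            # KIWT bound as written in steps (5) and (7.4)
            kiwt = n * k * ms - count * max_other_weight(docs, cand)
            if weight >= kiwt:                  # step (7.1)
                survivors.append(cand)
        k += 1
        # Apriori join of the surviving candidates, step (7.2)
        candidates = {a | b for a, b in combinations(survivors, 2) if len(a | b) == k}
        if k == 2:                              # step (7.3)
            candidates = {c for c in candidates if c & query_terms}
    return frequent


# Hypothetical toy database: three relevant documents with term weights.
docs = [
    {"trade": 0.8, "asean": 0.6, "coop": 0.4},
    {"trade": 0.7, "asean": 0.5},
    {"trade": 0.9, "coop": 0.3},
]
frequent = mine_frequent_itemsets(docs, {"trade"}, ms=0.25)
for itemset, support in sorted(frequent.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), round(support, 3))
```

On this toy data, {coop} is not frequent while {coop, trade} is. Fully weighted support is therefore not downward closed, which is why the sketch joins the candidates that survive the KIWT bound rather than only the frequent itemsets.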

The feature-word weights in step (4) above are computed with the tf-idf method. In its formula, $tf_{m,n}$ is the number of occurrences of feature word $t_m$ in document $d_n$, $df_m$ is the number of documents containing $t_m$, and N is the total number of documents in the collection.
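As an illustration of tf-idf weighting, the sketch below uses one common variant, tf × ln(N / df). The exact formula of the invention is not reproduced in this text, so the variant, the function name `tfidf_weights`, and the sample tokens are assumptions.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Weight each term of each tokenized document as tf * ln(N / df).

    This is an assumed, common tf-idf variant, not necessarily the
    formula used by the invention.
    """
    n_docs = len(docs)
    # df: number of documents that contain each term at least once
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


# Hypothetical segmented documents (after stop-word removal).
docs = [["贸易", "贸易", "东盟"], ["东盟", "合作"]]
weights = tfidf_weights(docs)
print(weights[0])
```

With this variant, a term that occurs in every document receives weight 0, so it carries no information for the later mining steps.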

Step (8) above comprises steps (8.1) to (8.4):

(8.1) Take a fully weighted frequent i-itemset $tlL_i$ from the set L of fully weighted feature-word frequent itemsets and find all of its proper subsets;

(8.2) Take any two proper subsets $tlI_1$ and $tlI_2$ from the proper subsets of $tlL_i$ such that $tlI_1 \cap tlI_2 = \emptyset$ and $tlI_1 \cup tlI_2 = tlL_i$. If $FTARConf(tlI_1 \rightarrow tlI_2) \geq mc$, the strong fully weighted feature-word association rule $tlI_1 \rightarrow tlI_2$ is mined; if $FTARConf(tlI_2 \rightarrow tlI_1) \geq mc$, the strong rule $tlI_2 \rightarrow tlI_1$ is mined. Here mc is the minimum confidence threshold, $tlI_1$ and $tlI_2$ are fully weighted feature-word frequent itemsets that are proper subsets of $tlL_i$, and $FTARConf(tlI_1 \rightarrow tlI_2)$ is the confidence of the fully weighted feature-word association rule $tlI_1 \rightarrow tlI_2$, computed as: $FTARConf(tlI_1 \rightarrow tlI_2) = \frac{FTISup(tlL_i)}{FTISup(tlI_1)}$

where $FTISup(tlL_i)$ is the support of the fully weighted frequent itemset $tlL_i$ and $FTISup(tlI_1)$ is the support of the fully weighted frequent itemset $tlI_1$;

(8.3) Repeat step (8.2) until every proper subset of the fully weighted frequent i-itemset $tlL_i$ has been taken exactly once, then go to step (8.4);

(8.4) Repeat steps (8.1) to (8.3) until every itemset in the set L of fully weighted feature-word frequent itemsets has been taken exactly once, at which point mining ends.
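Steps (8.1) to (8.4) can be sketched as a loop over each frequent itemset and its proper subsets. The sketch assumes the confidence of a rule is the support of the whole itemset divided by the support of the antecedent, and that the supports of all subsets are available in a lookup table. The function name `mine_strong_rules` and the support values are hypothetical.

```python
from itertools import combinations


def mine_strong_rules(supports, frequent_itemsets, mc):
    """Return (antecedent, consequent, confidence) for every strong rule."""
    rules = []
    for itemset in frequent_itemsets:
        items = sorted(itemset)
        # Every non-empty proper subset serves once as the antecedent tlI1;
        # the consequent tlI2 is its complement within the itemset, so the
        # two are disjoint and their union is the whole itemset.
        for size in range(1, len(items)):
            for subset in combinations(items, size):
                antecedent = frozenset(subset)
                consequent = itemset - antecedent
                confidence = supports[itemset] / supports[antecedent]
                if confidence >= mc:
                    rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical supports for a single frequent 2-itemset and its subsets.
supports = {
    frozenset({"trade"}): 0.8,
    frozenset({"asean"}): 0.5,
    frozenset({"asean", "trade"}): 0.4,
}
rules = mine_strong_rules(supports, [frozenset({"asean", "trade"})], mc=0.6)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 3))
```

Enumerating each subset as an antecedent covers both rule directions of step (8.2), because the complement of every antecedent is visited as an antecedent in its own turn.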

A retrieval system for the above Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback comprises the following 4 modules and 3 databases:

Machine translation module: uses the Bing machine translation interface to translate Indonesian user queries into Chinese queries, and to translate the final Chinese result documents into Indonesian documents for the user;

Search engine module: a search engine that retrieves on the Internet with the translated Chinese query and obtains the cross-language initial retrieval result document set;

Fully weighted association pattern mining and user relevance feedback module: presents the top r documents of the cross-language initial retrieval result set to the user, who judges their relevance and thereby determines the initial-retrieval relevant document database. It then applies fully weighted association rule mining to that database to mine expansion words related to the query, implementing cross-language query expansion; the expansion words are combined with the original query and retrieval is run again to obtain the final Chinese result documents;

Final result display module: submits the final Chinese result documents to the machine translation module for translation into Indonesian, and returns both the Chinese documents and their Indonesian translations to the user;

Initial-retrieval relevant document database;

Fully weighted association rule base;

Expansion word library.

The fully weighted association pattern mining and user relevance feedback module comprises the following 5 modules:

User click behavior relevance feedback extraction module: captures the document download behavior that occurs while the user browses the initial retrieval result document set, and extracts the downloaded documents to build the user-feedback relevant document set;

Document preprocessing module: performs Chinese word segmentation, stop-word removal, feature-word weight calculation, and feature-word extraction on the user-feedback relevant document set, and builds the initial-retrieval relevant document database;

Fully weighted association rule mining module: performs fully weighted association rule mining on the initial-retrieval relevant document database, mines the fully weighted feature-term frequent itemsets and association rule patterns that contain the original query terms, and builds the fully weighted association rule base;

Cross-language query expansion word generation module: extracts the expansion words related to the original query from the fully weighted association rule base and builds the expansion word library;

Cross-language query expansion implementation module: extracts Chinese expansion words from the expansion word library, combines them with the original query into a new query, and submits it to the search engine for a second retrieval on the Internet, obtaining the final Chinese result documents.
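The way the modules hand data to one another can be sketched as a single function that receives each module as a callable. Everything here is a hypothetical stand-in: the parameter names, the language codes, and the stub translator and search engine are assumptions for illustration and do not reflect any real translation or search API.

```python
def cross_language_retrieve(query_id, translate, search, collect_feedback,
                            expand, r=10):
    """Run the retrieval pipeline with injected module implementations."""
    # Machine translation module: Indonesian query -> Chinese query
    query_zh = translate(query_id, source="id", target="zh")
    # Search engine module: initial retrieval, keep the top r documents
    initial_docs = search(query_zh)[:r]
    # Feedback extraction: documents the user marked (e.g. downloaded)
    relevant_docs = collect_feedback(initial_docs)
    # Preprocessing, rule mining and expansion word generation
    expansion_words = expand(relevant_docs, query_zh)
    # Query expansion implementation: combine and retrieve again
    new_query = " ".join([query_zh, *expansion_words])
    final_zh = search(new_query)
    # Final result display: translate the results back to Indonesian
    final_id = [translate(doc, source="zh", target="id") for doc in final_zh]
    return final_zh, final_id


# Stub modules so the sketch runs without any external service.
def fake_translate(text, source, target):
    return f"[{target}] {text}"


def fake_search(query):
    return [f"doc about {query} #{i}" for i in range(3)]


final_zh, final_id = cross_language_retrieve(
    "perdagangan ASEAN",
    fake_translate,
    fake_search,
    collect_feedback=lambda docs: docs[:1],
    expand=lambda docs, query: ["合作"],
)
print(final_zh[0])
print(final_id[0])
```

Passing the modules in as callables keeps the pipeline independent of any particular translation service or search engine, which matches the text's statement that the search engine module may be any existing engine.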

Compared with the prior art, the invention has the following advantages:

(1) The invention applies fully weighted association rule mining, combined with user relevance feedback, to Indonesian-Chinese cross-language information retrieval, proposing a retrieval method and system that fuses user click-and-download behavior with fully weighted association pattern mining. Compared with the monolingual Chinese text retrieval baseline MB, the Indonesian-Chinese cross-language retrieval baseline CLB, and the traditional pseudo-relevance-feedback cross-language retrieval method CLR_PRF, the method achieves substantially better retrieval performance. The experimental results show that all of its metric values exceed those of the baseline CLB and the CLR_PRF algorithm, that retrieval with description-type query topics outperforms retrieval with title-type topics, and that the MAP value of the retrieval results shows the largest gain.

(2) The experimental results show that the proposed Indonesian-Chinese cross-language retrieval method and system, which fuses fully weighted association pattern mining with user relevance feedback, is effective and improves cross-language retrieval performance. The main reason is as follows. In cross-language retrieval, the quality of the query translation strongly affects the results and often makes the cross-language initial retrieval results worse than monolingual ones, which is the query topic drift problem. Fusing user click behavior with fully weighted association pattern mining in the Indonesian-Chinese cross-language retrieval model yields the feedback information most relevant to the original query, and fully weighted association rule mining then produces expansion words related to the original query for cross-language query expansion. This avoids the severe topic drift of cross-language retrieval and improves Indonesian-Chinese cross-language retrieval performance.

Description of drawings

Fig. 1 is a block diagram of the Indonesian-Chinese cross-language retrieval method of the invention fusing association patterns and user feedback.

Fig. 2 is the overall flow chart of the Indonesian-Chinese cross-language retrieval system of the invention fusing association patterns and user feedback.

Fig. 3 is a structural block diagram of the Indonesian-Chinese cross-language retrieval system of the invention fusing association patterns and user feedback.

Fig. 4 is a structural block diagram of the fully weighted association pattern mining and user relevance feedback module of the invention.

Detailed description

The technical solution of the invention is described below in further, non-limiting detail with reference to the embodiments and the accompanying drawings.

1. To better explain the technical solution of the invention, the relevant concepts are introduced as follows:

Suppose the target language (Target Language, TL) initial-retrieval relevant document set obtained after the first cross-language retrieval and user relevance feedback is $TLdoc = \{tld_1, tld_2, \ldots, tld_n\}$, where $tld_i$ ($1 \leq i \leq n$) is the i-th document in TLdoc and $tld_i = \{t_1, t_2, \ldots, t_m, \ldots, t_p\}$. Each $t_m$ ($m = 1, 2, \ldots, p$) is called a target-language feature-term item (Feature-term Item, FTI), or feature item for short, and generally consists of a character, word, or phrase. The corresponding set of feature-item weights in $tld_i$ is $W_i = \{w_{i1}, w_{i2}, \ldots, w_{im}, \ldots, w_{ip}\}$, where $w_{im}$ is the weight of the m-th feature item $t_m$ in the i-th document $tld_i$. Let $tlI = \{t_1, t_2, \ldots, t_k\}$ denote the set of all feature items in TLdoc; a subset Y of tlI is then called a feature-term itemset (Feature-term Itemset) in TLdoc, or itemset Y.

For an itemset $(tlI_1, tlI_2)$ with $tlI_1 \subset tlI$, $tlI_2 \subset tlI$, and $tlI_1 \cap tlI_2 = \emptyset$, the following basic concepts are given, based on the theory of fully weighted association pattern mining (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining. Journal of Software, Vol.20, No.7, July 2009, pp.1854-1865).

Definition 1. The fully weighted support (Feature-term Itemset Support, FTISup) of a feature-term itemset I ($I = (tlI_1, tlI_2)$) is computed by formula (1):

$FTISup(I) = \frac{w_I}{n \times k}$ (1)

where $w_I$ is the sum of the weights of itemset I over the documents of TLdoc, k is the length of itemset I (its number of items), and n is the total number of documents in the initial-retrieval relevant document set TLdoc.

Definition 2. The fully weighted confidence (Feature-term Association Rule Confidence, FTARConf) of the inter-word association rule $tlI_1 \rightarrow tlI_2$ is given by formula (2):

$FTARConf(tlI_1 \rightarrow tlI_2) = \frac{FTISup(tlI_1, tlI_2)}{FTISup(tlI_1)}$ (2)

where $FTISup(tlI_1, tlI_2)$ is the fully weighted support of the itemset $(tlI_1, tlI_2)$.

Definition 3. Let ms be the minimum support threshold and mc the minimum confidence threshold. If $FTISup(tlI_1, tlI_2) \geq ms$ and $FTARConf(tlI_1 \rightarrow tlI_2) \geq mc$, then the feature-term itemset $(tlI_1, tlI_2)$ is called a frequent itemset and the inter-word association rule $tlI_1 \rightarrow tlI_2$ is called a strong association rule.

Definition 4. The feature-word k-itemset weight threshold (k-Item Weighted Threshold, KIWT) for itemsets containing a q-itemset ($q < k$) is a prediction of the weight of the subsequent itemsets that contain that q-itemset.

Let tlT be a fully weighted q-itemset with $q < k$. Among the items of $(tlI - tlT)$, let $w_1, w_2, \ldots, w_{k-q}$ be the weights of the $(k-q)$ items with the largest weights, and let $SC(tlT)$ be the support count of the q-itemset tlT in TLdoc. Based on the k-weight-threshold theory in (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining. Journal of Software, Vol.20, No.7, July 2009, pp.1854-1865), the weight threshold of feature-word k-itemsets containing the q-itemset is computed as shown in formula (3).

2. As shown in Fig. 1, the Indonesian-Chinese cross-language retrieval method of this embodiment, fusing association patterns and user feedback, comprises the following steps:

(1) Translate the Indonesian user query into a Chinese query through the machine translation module and submit it to the search engine for initial retrieval on the Internet, obtaining the initial retrieval result document set; the machine translation module uses the Bing machine translation interface, namely the Microsoft Translator API; the search engine module may be an existing search engine such as Baidu or Google;

(2) Extract the top r Chinese documents of the cross-language initial retrieval result document set and present them to the user;

The feature-word weights are computed with the tf-idf method. In its formula, $tf_{m,n}$ is the number of occurrences of feature word $t_m$ in document $d_n$, $df_m$ is the number of documents containing $t_m$, and N is the total number of documents in the collection;

(7) Mine the k-itemsets, where $k \geq 2$, through steps (7.1) to (7.7):

(8) From the set L of fully weighted feature-word frequent itemsets, mine the fully weighted feature-word association rules that contain query terms, and build the fully weighted association rule base, through steps (8.1) to (8.4):

(8.4) Repeat steps (8.1) to (8.3) until every itemset in the set L of fully weighted feature-word frequent itemsets has been taken exactly once, at which point mining ends;

3. As shown in Figs. 2 to 4, the retrieval system for the Indonesian-Chinese cross-language retrieval method of this embodiment, fusing association patterns and user feedback, comprises the following 4 modules and 3 databases:

Machine translation module: uses the Bing machine translation interface, namely the Microsoft Translator API, to translate Indonesian user queries into Chinese queries, and to translate the final Chinese result documents into Indonesian documents for the user;

Fully weighted association rule base;

Expansion word library.

The fully weighted association pattern mining and user relevance feedback module comprises the following 5 modules:

4. With reference to the technical solution of the invention, its beneficial effects are further illustrated below through experiments:

Because search engine research is broad in scope and involves many factors, the invention was instead evaluated in an Indonesian-Chinese cross-language retrieval system based on the vector space model; the experiment is therefore a simulation. Source programs implementing the method and system of the invention were written to run the experiments. The experimental corpus is the Chinese corpus of NTCIR-5 CLIR, the standard cross-language information retrieval test collection of the international multilingual processing evaluation workshop organized by Japan's National Institute of Informatics.

NTCIR-5CLIR有查询集、文档测试集以及结果集，其中，查询集有50个查询主题，分有TITLE、DESC、NARR和CONC等4种类型，本文实验选择TITLE和DESC类型，TITLE类型查询主题以名词和名词性短语简要描述，属于短查询，DESC类型的是以句子形式简要描述查询主题，属于长查询。其结果集有Rigid和Relax等2种评价标准，Rigid标准是指其答案都是与原查询高度相关或相关的，Relax标准的是指高度相关、相关或部分相关的。NTCIR-5CLIR has a query set, a document test set, and a result set. Among them, the query set has 50 query topics, which are divided into four types: TITLE, DESC, NARR, and CONC. In this paper, the TITLE and DESC types are selected, and the TITLE type query topics A brief description with nouns and noun phrases is a short query, and the DESC type is a brief description of the query subject in the form of a sentence, which is a long query. There are two evaluation criteria for the result set: Rigid and Relax. The Rigid criterion means that the answers are highly relevant or related to the original query. The Relax standard means that the answers are highly relevant, relevant or partially relevant.

为了进行本文印尼中跨语言信息检索模型的实验，邀请翻译机构专业翻译人士将NTCIR-5CLIR中文版50个查询主题人工翻译为印尼语查询。In order to conduct the experiment of the cross-lingual information retrieval model in Indonesia, professional translators from translation agencies were invited to manually translate 50 query topics of the Chinese version of NTCIR-5CLIR into Indonesian queries.

In the experiments, the Chinese lexical analysis system ICTCLAS, developed by the Institute of Computing Technology of the Chinese Academy of Sciences, is used to preprocess the Chinese corpus and the translated Chinese queries. Feature word weights are computed with the traditional tf-idf method. The weight of a translated query term, w_i,q, is computed by formula (4), taken from G. Salton, C. Buckley. Term-weighting approaches in automatic text retrieval[J]. Information Processing & Management, 1988, 24(5): 513-523.

Here tf_i,q is the raw frequency of the query term in the query text, N is the total number of initial-retrieval relevant documents, and df_i is the number of initial-retrieval relevant documents that contain the i-th query term.
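The text of formula (4) is not reproduced here, so the following is only a minimal sketch of a tf-idf style query term weight. It assumes the common form tf × log(N / df); the exact normalisation used in the cited paper and in the patent may differ, and the function name and sample numbers are hypothetical.

```python
import math


def query_term_weight(tf: float, n_docs: int, df: int) -> float:
    """Assumed tf-idf form tf * log(N / df); not necessarily the patent's formula (4)."""
    if n_docs <= 0 or df <= 0:
        # A term that occurs in no feedback document gets no weight.
        return 0.0
    return tf * math.log(n_docs / df)


# Hypothetical numbers: the term occurs twice in the query and appears
# in 5 of the 20 initial-retrieval relevant documents.
weight = query_term_weight(tf=2, n_docs=20, df=5)
```

Under this assumed form, a term that appears in every feedback document (df = N) receives weight 0, which is the usual idf behaviour.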

In the experiments, the weight of a Chinese expansion word is set as follows: the confidence of the matrix-weighted association rule is used as the weight of the expansion word, and when several association rules contain the same term, the highest of their confidences is taken as that expansion word's weight.
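The weighting rule just described can be sketched as follows. The rule layout (antecedent, consequent, confidence) and the choice to treat every non-query term in a rule as an expansion word are assumptions made for illustration; the names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# (antecedent terms, consequent terms, confidence); this layout is hypothetical.
Rule = Tuple[FrozenSet[str], FrozenSet[str], float]


def expansion_weights(rules: Iterable[Rule], query_terms: Set[str]) -> Dict[str, float]:
    """Give each expansion word the highest confidence of any rule containing it."""
    weights: Dict[str, float] = {}
    for antecedent, consequent, confidence in rules:
        # Assumption: any term in the rule that is not a query term is an expansion word.
        for term in (antecedent | consequent) - query_terms:
            if confidence > weights.get(term, 0.0):
                weights[term] = confidence
    return weights


rules = [
    (frozenset({"q1"}), frozenset({"a"}), 0.6),
    (frozenset({"q1"}), frozenset({"a", "b"}), 0.8),
    (frozenset({"q2"}), frozenset({"b"}), 0.5),
]
weights = expansion_weights(rules, {"q1", "q2"})
```

In this toy input both "a" and "b" occur in the rule with confidence 0.8, so each keeps 0.8 rather than the lower values from the other rules.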

The baselines used for experimental comparison are:

(1) Monolingual Baseline (MB): the retrieval results obtained by searching the Chinese documents directly with Chinese queries.

(2) Cross-language Baseline (CLB): the first cross-language retrieval results without any relevance feedback, i.e. the results obtained by searching the Chinese documents after the Indonesian query has been translated by the machine translation system.

(3) The traditional cross-language retrieval method based on pseudo-relevance feedback, CLR_PRF (Jianfeng Gao, Jianyun Nie, Jian Zhang, et al. TREC-9 CLIR Experiments at MSRCN[C]. In: Proc. of the 9th Text Retrieval Evaluation Conference, 2001: 343-353; Wu Dan, He Daqing, Wang Huilin. Cross-language query expansion based on pseudo-relevance[J]. Journal of the China Society for Scientific and Technical Information, 2010, 29(2): 232-239). In the experiments, the top 20 documents of the cross-language initial retrieval are extracted to build the initial-retrieval relevant document set, and the 20 feature words with the highest weights (in descending order) are taken as expansion words.
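A minimal sketch of the pseudo-relevance feedback selection used by the CLR_PRF baseline, under stated assumptions: each document is represented as a hypothetical term-to-weight mapping, a term's score is the sum of its weights over the feedback documents, and original query terms are excluded. The cited papers may score terms differently.

```python
from collections import defaultdict
from typing import Dict, List, Set


def prf_expansion_terms(
    ranked_docs: List[Dict[str, float]],
    query_terms: Set[str],
    top_docs: int = 20,
    top_terms: int = 20,
) -> List[str]:
    """Pick the highest-weighted terms from the top-ranked documents."""
    totals: Dict[str, float] = defaultdict(float)
    for doc in ranked_docs[:top_docs]:  # assume the top documents are relevant
        for term, weight in doc.items():
            if term not in query_terms:
                totals[term] += weight
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_terms]]


docs = [
    {"a": 1.0, "b": 0.5, "q": 2.0},
    {"b": 1.0, "c": 0.2},
    {"z": 9.0},  # falls outside the feedback window when top_docs=2
]
terms = prf_expansion_terms(docs, {"q"}, top_docs=2, top_terms=2)
```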

Experimental parameters for the method of the invention: the top 100 documents of the cross-language initial retrieval are extracted and submitted to the user, who judges their relevance to determine the initial-retrieval relevant document set. In the experiments, those of the top 100 documents that appear in the known result set are treated as the user's relevance feedback and are extracted to build the user's initial-retrieval relevant document set. Finally, fully weighted association rule mining is applied to that document set to mine expansion words and carry out query expansion.
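The simulated user feedback described above amounts to intersecting the top-ranked documents with the known relevant set. A short sketch, with hypothetical document identifiers and function name:

```python
from typing import Iterable, List


def simulated_feedback(
    ranked_doc_ids: List[str],
    known_relevant: Iterable[str],
    top_r: int = 100,
) -> List[str]:
    """Keep the top-r retrieved documents that appear in the known result set."""
    relevant = set(known_relevant)
    return [doc_id for doc_id in ranked_doc_ids[:top_r] if doc_id in relevant]


# Toy example with a window of 3 instead of 100.
feedback = simulated_feedback(["d3", "d1", "d7", "d2"], {"d1", "d2"}, top_r=3)
```

Here "d2" is relevant but ranked fourth, so it falls outside the window and is not treated as feedback.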

Source programs were written to run the method of the invention and the baseline methods MB, CLB and CLR_PRF on Indonesian-Chinese cross-language text retrieval over the NTCIR-5 CLIR test collection, and their cross-language retrieval performance was compared and analysed.

(1) Baseline results

The experimental programs were run with the title and description parts of the 50 NTCIR-5 CLIR query topics for Chinese monolingual retrieval, Indonesian-Chinese cross-language retrieval, and traditional Indonesian-Chinese cross-language retrieval based on pseudo-relevance feedback, i.e. the baseline algorithms MB, CLB and CLR_PRF. The retrieval results of the three baselines are shown in Table 1.

Table 1:

The results in Table 1 show that every evaluation measure for the Indonesian-Chinese cross-language baseline CLB and the traditional CLR_PRF method reaches only about 30% to 60% of the monolingual baseline MB, and that long description queries retrieve better than short title queries. For CLR_PRF, all measures other than MAP improve on the CLB baseline by roughly 5% to 30%, whereas MAP generally falls, by as much as 46%. These results indicate that cross-language retrieval is affected by query translation: its performance is generally low and falls short of the corresponding monolingual retrieval.

(2) Retrieval performance of the method of the invention compared with the baselines

Using the title and description forms of the 50 NTCIR-5 CLIR query topics, retrieval performance was measured in two settings, with the support threshold varied and with the confidence threshold varied, and compared with the Indonesian-Chinese cross-language baseline CLB, the traditional CLR_PRF method and the monolingual baseline MB. Table 2 compares retrieval performance as the support threshold varies, and Table 3 gives the MAP, P@5 and P@15 values of the retrieval results as the confidence threshold varies.

Table 2:

Table 3:

Table 2 shows that, as the fully weighted support threshold varies, every measure for the method of the invention is higher than for the Indonesian-Chinese cross-language baseline CLB and the traditional pseudo-relevance feedback method CLR_PRF, and reaches 60% to 102% of the monolingual baseline MB. Relative to CLB, the largest improvement is 91.55% (a P@5 value under Rigid evaluation) and the smallest is 36.06% (a P@15 value under Relax evaluation). Relative to CLR_PRF, the largest improvement is 244.97% (the MAP value for description queries under Rigid evaluation) and the smallest is 32.89%. Notably, the MAP value for description queries under Rigid evaluation exceeds the monolingual baseline MB by 2%. In addition, description queries retrieve better than title queries, and MAP shows the largest improvement.

Table 3 shows that the invention also retrieves well as the confidence threshold varies: every measure is higher than for the CLB baseline and the CLR_PRF algorithm and reaches 58.07% to 101.2% of the monolingual baseline MB. Description queries again retrieve better than title queries, and MAP again shows the largest improvement.
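For reference, the measures discussed above can be computed as in the sketch below. It uses the standard definitions of P@k and average precision (MAP is the mean of average precision over all query topics) and two helper ratios for "percentage of baseline" and "relative improvement". The function names and toy data are hypothetical, and the patent does not state which evaluation tool was used.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (P@k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def percent_of_baseline(value: float, baseline: float) -> float:
    """Value expressed as a percentage of the baseline (e.g. 102 means 2% above it)."""
    return 100.0 * value / baseline


def relative_gain(value: float, baseline: float) -> float:
    """Percentage improvement of value over the baseline."""
    return 100.0 * (value - baseline) / baseline


ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}
p_at_2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
```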

In summary, the invention has good value for wider application.

Claims

1. An Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback, characterized in that it comprises the following steps:

(1) Translate the Indonesian user query into a Chinese query through the machine translation module and submit it to the search engine for an initial retrieval on the Internet, obtaining the initial-retrieval result document set;

(2) Extract the top r Chinese documents from the cross-language initial-retrieval result document set and submit them to the user;

(3) The user judges the relevance of the Chinese documents in the cross-language initial-retrieval result document set to obtain the user-feedback relevant document set, whose total number of documents is denoted n;

(4) Preprocess the user-feedback relevant document set, i.e. perform Chinese word segmentation, stop-word removal, feature word weight calculation and feature word extraction, and build the initial-retrieval relevant document database;

(5) Scan the initial-retrieval relevant document database, mine the fully weighted feature-word candidate 1_itemsets C_1, compute the weight w(C_1), and count the maximum weight maxCw_i(!C_1) of the items other than C_1 and the support count n_c1 of C_1; with ms denoting the minimum support threshold, compute the value of KIWT(1,2) as: KIWT(1,2) = n × 1 × ms - n_c1 × maxCw_i(!C_1);

(6) Compute the support FTISup(C_1) of C_1; if FTISup(C_1) ≥ ms, mine the frequent 1_itemset L_1 from the candidate 1_itemset C_1 and add it to the fully weighted feature-word frequent itemset set L. The calculation formula of FTISup(C_1) is:

(7) Mine the k_itemsets, where k ≥ 2, through steps (7.1) to (7.7):

(7.1) Compare the weight of each candidate (k-1)_itemset C_{k-1} with the value of KIWT(k-1,k), and prune every candidate itemset C_{k-1} for which w(C_{k-1}) < KIWT(k-1,k);

(7.2) Perform the Apriori join on the remaining candidate (k-1)_itemsets C_{k-1} to obtain C_k;

(7.3) When k = 2, prune the candidate 2_itemsets that do not contain a query term;

(7.4) Scan the initial-retrieval relevant document database, count the maximum weight maxCw_i(!C_k) of the items other than C_k and the support count n_ck of C_k, and compute the weight w(C_k) and KIWT(k-1,k), where KIWT(k-1,k) = n × k × ms - n_ck × maxCw_i(!C_k);

(7.5) Prune the candidate itemsets C_k whose n_ck is 0;

(7.6) For each remaining candidate k_itemset C_k, compute its support FTISup(C_k); if FTISup(C_k) ≥ ms, mine the frequent k_itemset L_k from the candidate k_itemset C_k and add it to the fully weighted feature-word frequent itemset set L. The calculation formula of FTISup(C_k) is:

(7.7) If k exceeds the candidate itemset length threshold or the set of candidate k_itemsets is empty, mining ends; otherwise, continue looping over steps (7.1) to (7.6);

(8) Mine, from the fully weighted feature-word frequent itemset set L, the fully weighted feature-word association rules that contain a query term, and build the fully weighted association rule base;

(9) Extract the cross-language expansion words related to the original query from the fully weighted association rule base and build the extended thesaurus;

(10) Combine the original query with the expansion words and submit the combination to the search engine to retrieve again, obtaining the final Chinese result documents;

(11) Submit the final Chinese result documents to the machine translation module for translation into Indonesian documents, and finally return both the final Chinese result documents and their Indonesian translations to the user.
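The weight-threshold pruning in steps (5), (7.1) and (7.4) can be sketched as follows. The sketch only evaluates the KIWT expression as stated in the claim, n × k × ms - n_ck × maxCw_i(!C_k), and compares an itemset weight against it; how the itemset weight and FTISup are computed is not shown in this text and is not assumed here. Function names and sample values are hypothetical.

```python
def kiwt(n: int, k: int, ms: float, support_count: int, max_other_weight: float) -> float:
    """Weight threshold as written in the claim: n*k*ms - n_ck * maxCw_i(!C_k)."""
    return n * k * ms - support_count * max_other_weight


def survives_pruning(
    itemset_weight: float,
    n: int,
    k: int,
    ms: float,
    support_count: int,
    max_other_weight: float,
) -> bool:
    """Step (7.1) prunes a candidate whose weight is below the threshold."""
    return itemset_weight >= kiwt(n, k, ms, support_count, max_other_weight)


# Hypothetical values: 8 feedback documents, 2-itemsets, ms = 0.25,
# itemset found in 4 documents, largest weight among other items 0.5.
threshold = kiwt(n=8, k=2, ms=0.25, support_count=4, max_other_weight=0.5)
kept = survives_pruning(2.5, 8, 2, 0.25, 4, 0.5)
pruned = not survives_pruning(1.5, 8, 2, 0.25, 4, 0.5)
```

With these numbers the threshold is 2.0, so a candidate with weight 2.5 is kept for the join in step (7.2) and one with weight 1.5 is discarded.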

2. The Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to claim 1, characterized in that the feature word weights in step (4) are calculated with the tf-idf method, whose calculation formula is: Here, tf_m,n denotes the number of occurrences of feature word t_m in document d_n, df_m denotes the number of documents containing feature word t_m, and N denotes the total number of documents in the document collection.

3. The Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to claim 1, characterized in that step (8) comprises steps (8.1) to (8.4):

(8.1) Take a fully weighted frequent i_itemset tlL_i from the fully weighted feature-word frequent itemset set L and find all proper subsets of tlL_i;

(8.2) Take any two proper subsets tlI_1 and tlI_2 from the set of proper subsets of tlL_i; when tlI_1 ∩ tlI_2 = ∅ and tlI_1 ∪ tlI_2 = L_i, then if FTARConf(tlI_1→tlI_2) ≥ mc, the fully weighted feature-word strong association rule tlI_1→tlI_2 is mined, and if FTARConf(tlI_2→tlI_1) ≥ mc, the fully weighted feature-word strong association rule tlI_2→tlI_1 is mined. Here mc is the minimum confidence threshold, tlI_1 and tlI_2 are fully weighted feature-word frequent itemsets that are proper subsets of tlL_i, and FTARConf(tlI_1→tlI_2) is the confidence of the fully weighted feature-word association rule tlI_1→tlI_2, whose calculation formula is:

Here, FTISup(L_i) is the support of the fully weighted frequent itemset L_i, and FTISup(tlI_1) is the support of the fully weighted frequent itemset tlI_1;

(8.3) Repeat step (8.2) until every proper subset in the set of proper subsets of the fully weighted frequent i_itemset tlL_i has been taken out exactly once, then go to step (8.4);

(8.4) Repeat steps (8.1) to (8.3); mining ends once every itemset in the fully weighted feature-word frequent itemset set L has been taken out exactly once.
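Steps (8.1) to (8.4) can be sketched as below. The confidence formula itself is not reproduced in this text; the sketch assumes it is the ratio FTISup(L_i) / FTISup(tlI_1), which is what the two quantities named after the formula suggest. Supports are taken as precomputed values in a hypothetical dictionary, and the function name is illustrative only.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]


def mine_rules(frequent: Dict[FrozenSet[str], float], mc: float) -> List[Rule]:
    """Split each frequent itemset into antecedent and consequent; keep confident rules."""
    rules: List[Rule] = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        items = sorted(itemset)
        # Enumerating every proper non-empty subset as antecedent covers
        # both directions tlI_1 -> tlI_2 and tlI_2 -> tlI_1.
        for size in range(1, len(items)):
            for left in combinations(items, size):
                antecedent = frozenset(left)
                consequent = itemset - antecedent
                antecedent_support = frequent.get(antecedent)
                if not antecedent_support:
                    continue
                # Assumed confidence: support of the whole itemset over the antecedent's.
                confidence = support / antecedent_support
                if confidence >= mc:
                    rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical supports for a query term "q" and a candidate expansion word "a".
frequent = {
    frozenset({"q"}): 0.5,
    frozenset({"a"}): 0.25,
    frozenset({"q", "a"}): 0.25,
}
rules = mine_rules(frequent, mc=0.75)
```

With these toy supports only the rule a → q passes the threshold (confidence 1.0), while q → a has confidence 0.5 and is dropped. The restriction in step (8) to rules containing a query term is left out of the sketch for brevity.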

4. A retrieval system suited to the Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to claim 1, characterized in that it comprises the following 4 modules and 3 databases:

Machine translation module: this module uses the Bing machine translation interface to translate Indonesian user queries into Chinese queries and to translate the final Chinese result documents into Indonesian documents that are delivered to the user;

Search engine module: this module is a search engine used to search the Internet with the translated Chinese query and obtain the cross-language initial-retrieval result document set;

Fully weighted association pattern mining and user relevance feedback module: used to submit the top r documents of the cross-language initial-retrieval result document set to the user, who judges their relevance to determine the initial-retrieval relevant document database; fully weighted association rule mining is then applied to that database to mine expansion words related to the query and carry out cross-language query expansion, and the combination of the expansion words and the original query is searched again to obtain the final Chinese result documents;

Final result display module: used to submit the final Chinese result documents to the machine translation module for translation into Indonesian documents, and to return both the final Chinese result documents and their Indonesian translations to the user;

Initial-retrieval relevant document database;

Fully weighted association rule base;

Extended thesaurus.

5. The retrieval system according to claim 4, characterized in that the fully weighted association pattern mining and user relevance feedback module comprises the following 5 modules:

User click behavior relevance feedback extraction module: used to capture the document download behavior generated while the user browses the initial-retrieval result document set, and to extract the documents downloaded by the user to build the user-feedback relevant document set;

Document preprocessing module: used to perform Chinese word segmentation, stop-word removal, feature word weight calculation and feature word extraction on the user-feedback relevant document set, and to build the initial-retrieval relevant document database;

Fully weighted association rule mining module: used to perform fully weighted association rule mining on the initial-retrieval relevant document database, mining the fully weighted feature-word frequent itemsets and association rule patterns that contain the original query terms, and to build the fully weighted association rule base;

Cross-language query expansion word generation module: used to extract the expansion words related to the original query from the fully weighted association rule base and to build the extended thesaurus;

Cross-language query expansion implementation module: used to extract Chinese expansion words from the extended thesaurus, combine the expansion words with the original query into a new query, submit it to the search engine for retrieval on the Internet, and obtain the final Chinese result documents.
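The way the modules of claims 4 and 5 hand data to one another can be sketched as a single function. Everything here is an assumption made for illustration: the translation, search, feedback and mining steps are passed in as hypothetical callables rather than real API calls, the language codes "id" and "zh" are just labels for those callables, and joining the expansion words to the query with spaces is one possible way of combining them.

```python
from typing import Callable, List, Tuple


def cross_language_search(
    indonesian_query: str,
    translate: Callable[[str, str, str], str],          # (text, source, target) -> text
    search: Callable[[str], List[str]],                  # query -> ranked documents
    collect_feedback: Callable[[List[str]], List[str]],  # shown docs -> relevant docs
    mine_expansion_words: Callable[[List[str], str], List[str]],
    top_r: int = 100,
) -> List[Tuple[str, str]]:
    """Translate, search, gather feedback, expand the query, search again, translate back."""
    chinese_query = translate(indonesian_query, "id", "zh")
    initial_results = search(chinese_query)
    feedback_docs = collect_feedback(initial_results[:top_r])
    expansion_words = mine_expansion_words(feedback_docs, chinese_query)
    expanded_query = " ".join([chinese_query, *expansion_words])
    final_docs = search(expanded_query)
    # Return each final Chinese document together with its Indonesian translation.
    return [(doc, translate(doc, "zh", "id")) for doc in final_docs]


# Stub components, only to show the data flow.
issued_queries: List[str] = []


def fake_translate(text: str, source: str, target: str) -> str:
    return f"{target}:{text}"


def fake_search(query: str) -> List[str]:
    issued_queries.append(query)
    return ["d1", "d2"]


results = cross_language_search(
    "kueri",
    fake_translate,
    fake_search,
    collect_feedback=lambda docs: docs[:1],
    mine_expansion_words=lambda docs, query: ["x"],
)
```

Passing the components in as callables keeps the sketch independent of any particular translation service or search engine.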