
CN106980664B - A bilingual comparable corpus mining method and device - Google Patents

A bilingual comparable corpus mining method and device Download PDF

Info

Publication number
CN106980664B
CN106980664B CN201710169141.XA
Authority
CN
China
Prior art keywords
picture
similarity
query
target
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710169141.XA
Other languages
Chinese (zh)
Other versions
CN106980664A (en)
Inventor
洪宇
姚亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201710169141.XA priority Critical patent/CN106980664B/en
Publication of CN106980664A publication Critical patent/CN106980664A/en
Application granted granted Critical
Publication of CN106980664B publication Critical patent/CN106980664B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种双语可比较语料挖掘方法及装置,通过预先从不同语言的数据库中抓取多个图片以及对应的文字信息,建立包含图片以及文字信息的多模态知识库;将源语言知识库中的图片作为查询图片,在目标语言知识库中进行图片检索,查找出与查询图片相似的目标图片;根据目标图片对应的文字信息与查询图片对应的文字信息,构建双语可比较语料。本申请采用跨媒体信息检索技术,通过图片作为沟通源语言和目标语言的媒介,进而获取源语言在目标端的等价或可比较的文本,为互联网中的双语可比较资源挖掘提供了新方法,解决了特定双语资源稀缺的问题。

Figure 201710169141

The invention discloses a bilingual comparable corpus mining method and device. Multiple pictures and their corresponding text information are crawled in advance from databases in different languages, and a multimodal knowledge base containing the pictures and the text information is established; pictures in the source language knowledge base are used as query pictures, and picture retrieval is performed in the target language knowledge base to find target pictures similar to the query pictures; a bilingual comparable corpus is then constructed from the text information corresponding to the target pictures and the text information corresponding to the query pictures. The application adopts cross-media information retrieval technology, using pictures as the medium that bridges the source language and the target language, so as to obtain text on the target side that is equivalent or comparable to the source language. This provides a new method for mining bilingual comparable resources on the Internet and solves the problem that specific bilingual resources are scarce.


Description

一种双语可比较语料挖掘方法及装置A bilingual comparable corpus mining method and device

技术领域technical field

本发明涉及计算机技术领域,特别是涉及一种双语可比较语料挖掘方法及装置。The invention relates to the field of computer technology, in particular to a bilingual comparable corpus mining method and device.

背景技术Background Art

双语可比较语料是指不同语言中表征相似语义的文本集合。大规模的双语可比较语料中通常包含丰富多样的双语互译单元，例如短语级别、句子级别的互译对，以及双语词典。在小语种或者某些限定领域中，平行资源通常较少，但双语可比较语料相对容易获取。因此，双语可比较语料成为机器翻译和跨语言信息检索领域的重要资源。如何自动获取大规模的双语可比较语料成为机器翻译中一项基本任务。A bilingual comparable corpus is a collection of texts in different languages that express similar semantics. Large-scale bilingual comparable corpora usually contain rich and diverse bilingual translation units, such as phrase-level and sentence-level translation pairs as well as bilingual dictionaries. For low-resource languages or certain restricted domains, parallel resources are usually scarce, but bilingual comparable corpora are relatively easy to obtain. Bilingual comparable corpora have therefore become an important resource for machine translation and cross-lingual information retrieval, and automatically acquiring them at large scale has become a basic task in machine translation.

目前，双语可比较语料获取的研究方法大致可分为以下三类：一类是基于跨语言信息检索的双语可比较语料构建方法，该方法从源语言的文档中抽取关键词，并基于双语词典将关键词翻译到目标语言，进而将其作为检索查询，检索目标语言的候选文档集合，最终得到可比较的双语文档。第二类是基于内容和结构相似度的双语可比较语料构建方法，该方法利用翻译引擎(谷歌翻译或必应翻译等)将源语言文档翻译到目标语言，得到源文档的伪翻译结果。并进一步从词汇、主题、结构的相似度出发，评价伪翻译文档和目标语言文档相似度，并排序选择相似的文档。第三类方法是从结构化的知识库中(如维基百科等)抽取双语可比较语料。利用维基百科等知识库已有的分类条目和链接信息，获取文档级别的可比较资源。现有方法仅从文本相似度角度出发，借助双语词典或已有知识库构建双语可比较语料资源。这类方法依赖于大量人工标注的词典、知识库或现有翻译系统的性能，在面向小语种或者特定领域时，稀缺的语言资源将制约该类方法的通用性。At present, research methods for acquiring bilingual comparable corpora fall roughly into three categories. The first is construction based on cross-language information retrieval: keywords are extracted from source-language documents, translated into the target language with a bilingual dictionary, and then used as retrieval queries to retrieve a set of candidate documents in the target language, finally yielding comparable bilingual documents. The second is construction based on content and structural similarity: a translation engine (Google Translate, Bing Translate, etc.) translates the source-language document into the target language to obtain a pseudo-translation of the source document; the similarity between the pseudo-translated document and target-language documents is then evaluated in terms of vocabulary, topic, and structure, and similar documents are ranked and selected. The third category extracts bilingual comparable corpora from structured knowledge bases (such as Wikipedia), using their existing category entries and link information to obtain document-level comparable resources. Existing methods thus build bilingual comparable corpus resources only from the perspective of text similarity, with the help of bilingual dictionaries or existing knowledge bases. They depend on large amounts of manually annotated dictionaries, knowledge bases, or the performance of existing translation systems; when facing low-resource languages or specific domains, scarce language resources restrict the generality of such methods.

发明内容SUMMARY OF THE INVENTION

本发明的目的是提供一种双语可比较语料挖掘方法及装置，以解决现有方法仅从文本相似度角度出发、对特定的稀缺语言资源的通用性较差的问题。The purpose of the present invention is to provide a bilingual comparable corpus mining method and device, so as to solve the problem that existing methods start only from the perspective of text similarity and generalize poorly to specific scarce language resources.

为解决上述技术问题,本发明提供一种双语可比较语料挖掘方法,包括:In order to solve the above-mentioned technical problems, the present invention provides a bilingual comparable corpus mining method, including:

预先从不同语言的数据库中抓取多个图片以及对应的文字信息,建立包含图片以及所述文字信息的多模态知识库;Capture multiple pictures and corresponding text information from databases in different languages in advance, and establish a multimodal knowledge base containing the pictures and the text information;

将源语言知识库中的图片作为查询图片,在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片;Taking the picture in the source language knowledge base as the query picture, performing picture retrieval in the target language knowledge base, and finding out the target picture similar to the query picture;

根据所述目标图片对应的文字信息与所述查询图片对应的文字信息,构建双语可比较语料。According to the text information corresponding to the target image and the text information corresponding to the query image, a bilingual comparable corpus is constructed.

可选地,所述预先从不同语言的数据库中抓取多个图片以及对应的文字信息包括:Optionally, the pre-fetching of multiple pictures and corresponding text information from databases in different languages includes:

利用网络爬虫从新闻网站中抓取图片，所述文字信息为所述图片对应的主题和/或标题信息，将图片以及对应的文字信息作为二元组，存储于所述多模态知识库中。A web crawler is used to grab pictures from news websites; the text information is the topic and/or title information corresponding to each picture, and each picture together with its corresponding text information is stored in the multimodal knowledge base as a two-tuple.

可选地,所述在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片包括:Optionally, performing image retrieval in the target language knowledge base, and finding out a target image similar to the query image includes:

采用尺度不变特征转换算法提取所述查询图片的关键点,将所述查询图片表征为基于所述关键点的特征向量;Extract the key points of the query picture by using a scale-invariant feature transformation algorithm, and characterize the query picture as a feature vector based on the key points;

提取所述目标语言知识库中所有候选图片的特征向量,并匹配所述查询图片和所述候选图片的关键点;Extract the feature vectors of all candidate pictures in the target language knowledge base, and match the query picture and the key points of the candidate pictures;

计算所有匹配关键点之间的平均欧式距离,作为图片之间的图片相似度;Calculate the average Euclidean distance between all matching key points as the image similarity between images;

根据所述图片相似度对候选图片进行排序,选取与所述查询图片相似的目标图片。The candidate pictures are sorted according to the picture similarity, and a target picture similar to the query picture is selected.

可选地,所述在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片包括:Optionally, performing image retrieval in the target language knowledge base, and finding out a target image similar to the query image includes:

确定所述查询图片的主题分类以及发布时间信息;Determine the subject classification and release time information of the query picture;

滤除所述目标语言知识库中与所述主题分类以及发布时间信息不匹配的图片;Filtering out pictures that do not match the subject classification and release time information in the target language knowledge base;

在过滤后的所述目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片。Image retrieval is performed in the filtered target language knowledge base to find out a target image similar to the query image.

可选地,所述根据所述目标图片对应的文字信息与所述查询图片对应的文字信息,构建双语可比较语料包括:Optionally, according to the text information corresponding to the target picture and the text information corresponding to the query picture, constructing a bilingual comparable corpus includes:

计算所述查询图片对应的文字信息以及所述目标图片对应的文字信息的文本相似度;calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture;

根据所述文本相似度对所述目标图片进行重新排序;Reordering the target pictures according to the text similarity;

根据重新排序后的结果,构建双语可比较语料。Based on the re-ranked results, a bilingual comparable corpus is constructed.

可选地,所述计算所述查询图片对应的文字信息以及所述目标图片对应的文字信息的文本相似度包括:Optionally, the calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture includes:

计算所述查询图片对应的文字信息与所述目标图片对应的文字信息的内容相似度、实体相似度以及结构相似度;Calculate the content similarity, entity similarity and structural similarity between the text information corresponding to the query picture and the text information corresponding to the target picture;

对所述内容相似度、所述实体相似度以及所述结构相似度进行加权平均,计算得到对应文字信息的文本相似度。The content similarity, the entity similarity and the structural similarity are weighted and averaged, and the text similarity of the corresponding text information is calculated.

本发明还提供了一种双语可比较语料挖掘装置,包括:The present invention also provides a bilingual comparable corpus mining device, comprising:

知识库建立模块,用于预先从不同语言的数据库中抓取多个图片以及对应的文字信息,建立包含图片以及所述文字信息的多模态知识库;The knowledge base building module is used to capture multiple pictures and corresponding text information from databases in different languages in advance, and establish a multimodal knowledge base including the pictures and the text information;

查找模块,用于将源语言知识库中的图片作为查询图片,在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片;a search module, used for taking the picture in the source language knowledge base as a query picture, performing picture retrieval in the target language knowledge base, and finding out a target picture similar to the query picture;

构建模块,用于根据所述目标图片对应的文字信息与所述查询图片对应的文字信息,构建双语可比较语料。The construction module is configured to construct bilingual comparable corpus according to the text information corresponding to the target image and the text information corresponding to the query image.

可选地,所述查找模块包括:Optionally, the search module includes:

提取单元,用于采用尺度不变特征转换算法提取所述查询图片的关键点,将所述查询图片表征为基于所述关键点的特征向量;an extraction unit, used for extracting the key points of the query picture by adopting a scale-invariant feature conversion algorithm, and characterizing the query picture as a feature vector based on the key points;

匹配单元,用于提取所述目标语言知识库中所有候选图片的特征向量,并匹配所述查询图片和所述候选图片的关键点;a matching unit for extracting the feature vectors of all candidate pictures in the target language knowledge base, and matching the query picture and the key points of the candidate pictures;

相似度计算单元,用于计算所有匹配关键点之间的平均欧式距离,作为图片之间的图片相似度;The similarity calculation unit is used to calculate the average Euclidean distance between all matching key points as the image similarity between the images;

选取单元,用于根据所述图片相似度对候选图片进行排序,选取与所述查询图片相似的目标图片。A selection unit, configured to sort candidate pictures according to the picture similarity, and select a target picture similar to the query picture.

可选地,所述查找模块还包括:Optionally, the search module further includes:

确定单元,用于确定所述查询图片的主题分类以及发布时间信息;a determining unit for determining the subject classification and release time information of the query picture;

滤除单元,用于滤除所述目标语言知识库中与所述主题分类以及发布时间信息不匹配的图片;a filtering unit, configured to filter out pictures in the target language knowledge base that do not match the subject classification and release time information;

查找单元,用于在过滤后的所述目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片。A search unit, configured to perform image retrieval in the filtered target language knowledge base to find a target image similar to the query image.

可选地,所述构建模块包括:Optionally, the building blocks include:

文本相似度计算单元,用于计算所述查询图片对应的文字信息以及所述目标图片对应的文字信息的文本相似度;a text similarity calculation unit, configured to calculate the text similarity of the text information corresponding to the query picture and the text information corresponding to the target picture;

排序单元,用于根据所述文本相似度对所述目标图片进行重新排序;a sorting unit, configured to reorder the target pictures according to the text similarity;

构建单元,用于根据重新排序后的结果,构建双语可比较语料。The construction unit is used to construct bilingual comparable corpus according to the reordered results.

本发明所提供的双语可比较语料挖掘方法及装置，通过预先从不同语言的数据库中抓取多个图片以及对应的文字信息，建立包含图片以及文字信息的多模态知识库；将源语言知识库中的图片作为查询图片，在目标语言知识库中进行图片检索，查找出与查询图片相似的目标图片；根据目标图片对应的文字信息与查询图片对应的文字信息，构建双语可比较语料。本申请采用跨媒体信息检索技术，通过图片作为沟通源语言和目标语言的媒介，进而获取源语言在目标端的等价或可比较的文本，为互联网中的双语可比较资源挖掘提供了新方法，解决了特定双语资源稀缺的问题。With the bilingual comparable corpus mining method and device provided by the present invention, multiple pictures and their corresponding text information are crawled in advance from databases in different languages, and a multimodal knowledge base containing the pictures and text information is established; pictures in the source language knowledge base are used as query pictures, and picture retrieval is performed in the target language knowledge base to find target pictures similar to the query pictures; a bilingual comparable corpus is then constructed from the text information corresponding to the target pictures and the text information corresponding to the query pictures. The application adopts cross-media information retrieval technology, using pictures as the medium that bridges the source language and the target language, so as to obtain text on the target side that is equivalent or comparable to the source language. This provides a new method for mining bilingual comparable resources on the Internet and solves the problem that specific bilingual resources are scarce.

附图说明Description of drawings

为了更清楚的说明本发明实施例或现有技术的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单的介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the accompanying drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

图1为本发明所提供的双语可比较语料挖掘方法的一种具体实施方式的流程图;Figure 1 is a flow chart of a specific implementation of the bilingual comparable corpus mining method provided by the present invention;

图2为新闻文本中挖掘的相似图片和可比较标题示意图;Figure 2 is a schematic diagram of similar pictures and comparable titles mined in news texts;

图3为本发明实施例中查找出与所述查询图片相似的目标图片的过程示意图;Figure 3 is a schematic diagram of the process of finding a target picture similar to the query picture in an embodiment of the present invention;

图4为本发明所提供的双语可比较语料挖掘方法的另一种具体实施方式的流程图;Figure 4 is a flowchart of another specific implementation of the bilingual comparable corpus mining method provided by the present invention;

图5为图片中关键点的匹配结果示意图;Figure 5 is a schematic diagram of the matching result of the key points in the picture;

图6为本发明实施例提供的双语可比较语料挖掘装置的结构框图。FIG. 6 is a structural block diagram of a bilingual comparable corpus mining apparatus provided by an embodiment of the present invention.

具体实施方式Detailed Description of the Embodiments

为了使本技术领域的人员更好地理解本发明方案,下面结合附图和具体实施方式对本发明作进一步的详细说明。显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。In order to make those skilled in the art better understand the solution of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are only some, but not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

本发明所提供的双语可比较语料挖掘方法的一种具体实施方式的流程图如图1所示,该方法包括:A flowchart of a specific implementation of the bilingual comparable corpus mining method provided by the present invention is shown in FIG. 1 , and the method includes:

步骤S101:预先从不同语言的数据库中抓取多个图片以及对应的文字信息,建立包含图片以及所述文字信息的多模态知识库;Step S101: Capture multiple pictures and corresponding text information from databases in different languages in advance, and establish a multimodal knowledge base including the pictures and the text information;

具体地，本发明实施例可以利用网络爬虫从不同语言的新闻网站中抓取大规模的图片以及对应的文字信息，该文字信息可以为所述图片对应的主题和/或标题信息，将图片以及对应的文字信息作为二元组，存储于所述多模态知识库中。Specifically, in this embodiment of the present invention, a web crawler can be used to capture large-scale pictures and corresponding text information from news websites in different languages. The text information can be the topic and/or title information corresponding to the pictures, and each picture together with its corresponding text information is stored in the multimodal knowledge base as a two-tuple.
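As an illustration of how such a two-tuple record and the per-language knowledge bases might be represented in code, the following minimal Python sketch uses a small dataclass; the field names and the in-memory dictionary are illustrative assumptions, not part of the claimed method.

```python
from dataclasses import dataclass

@dataclass
class KBRecord:
    """One knowledge-base record: a crawled picture and its attached text."""
    image_path: str         # local path or URL of the picture
    caption: str            # topic and/or title text describing the picture
    topic: str = ""         # news column the page was crawled from
    publish_time: str = ""  # news release time, e.g. "2017-03-21"

# one knowledge base per language, e.g. "zh" for the source side, "en" for the target side
knowledge_bases: dict[str, list[KBRecord]] = {"zh": [], "en": []}

def add_record(lang: str, image_path: str, caption: str,
               topic: str = "", publish_time: str = "") -> None:
    knowledge_bases[lang].append(KBRecord(image_path, caption, topic, publish_time))
```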

需要指出的是，本发明具体基于跨媒体信息检索(Cross-Media Information Retrieval, CMIR)技术，其与跨语言信息检索(Cross-Language Information Retrieval, CLIR)技术十分相似，唯一的区别在于沟通源语言和目标语言的媒介不同。跨媒体信息检索使用图片而非文本，作为沟通源语言和目标语言的桥梁，进而获取源语言在目标端的等价或可比较的文本。It should be pointed out that the present invention is specifically based on Cross-Media Information Retrieval (CMIR) technology, which is very similar to Cross-Language Information Retrieval (CLIR) technology; the only difference is the medium used to bridge the source language and the target language. Cross-media information retrieval uses pictures instead of text as the bridge between the source language and the target language, and thereby obtains text on the target side that is equivalent or comparable to the source language.

作为一种具体实施方式,CMIR的知识库可以包含图片和图片标题两部分,图片标题描述了该图片的内容,且与图片一一对应,参考附图2实例。CMIR利用相似图片搜索引擎来建立不同语言文本的联系,即相似图片对应的标题认为是双语的可比较文本。CLIR系统则利用弱翻译系统或双语词典将源语言文本翻译到目标语言端,并将翻译结果作为检索查询,到目标语言端检索相似的文本。在某种程度上,CMIR相比CLIR更容易使用,CMIR唯一需要考虑的问题在于如何提升检索的质量,CLIR则需要额外的考虑双语词典的质量和弱翻译器翻译性能。因此,本发明实施例提出的基于跨媒体信息检索技术的双语可比较语料挖掘方法,首次尝试利用图片的相似性来挖掘双语可比较语料。As a specific implementation manner, the knowledge base of CMIR may include two parts, a picture and a picture title. The picture title describes the content of the picture and corresponds to the picture one-to-one. Refer to FIG. 2 for an example. CMIR uses a similar image search engine to establish the connection of texts in different languages, that is, the titles corresponding to similar images are considered to be bilingual comparable texts. The CLIR system uses a weak translation system or a bilingual dictionary to translate the source language text into the target language, and uses the translation result as a search query to retrieve similar texts in the target language. To a certain extent, CMIR is easier to use than CLIR. The only problem that CMIR needs to consider is how to improve the quality of retrieval. CLIR needs to additionally consider the quality of bilingual dictionaries and the translation performance of weak translators. Therefore, the bilingual comparable corpus mining method based on the cross-media information retrieval technology proposed in the embodiment of the present invention is the first attempt to mine bilingual comparable corpus by using the similarity of pictures.

步骤S102:将源语言知识库中的图片作为查询图片,在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片;Step S102: using the picture in the source language knowledge base as the query picture, perform picture retrieval in the target language knowledge base, and find out the target picture similar to the query picture;

在不同语言的多模态知识库之间,建立相似图片检索系统,用以在目标语言知识库中检索相似的图片。Between multimodal knowledge bases in different languages, a similar image retrieval system is established to retrieve similar images in the target language knowledge base.

参照图3,本发明实施例中在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片的过程可以具体包括:Referring to FIG. 3 , in the embodiment of the present invention, image retrieval is performed in the target language knowledge base, and the process of finding a target image similar to the query image may specifically include:

步骤S1021:采用尺度不变特征转换算法提取所述查询图片的关键点,将所述查询图片表征为基于所述关键点的特征向量;Step S1021: using a scale-invariant feature transformation algorithm to extract key points of the query picture, and characterizing the query picture as a feature vector based on the key points;

步骤S1022:提取所述目标语言知识库中所有候选图片的特征向量,并匹配所述查询图片和所述候选图片的关键点;Step S1022: extract the feature vectors of all candidate pictures in the target language knowledge base, and match the query picture and the key points of the candidate pictures;

步骤S1023:计算所有匹配关键点之间的平均欧式距离,作为图片之间的图片相似度;Step S1023: Calculate the average Euclidean distance between all matching key points as the picture similarity between the pictures;

步骤S1024:根据所述图片相似度对候选图片进行排序,选取与所述查询图片相似的目标图片。Step S1024: Sort candidate pictures according to the picture similarity, and select a target picture similar to the query picture.

步骤S103:根据所述目标图片对应的文字信息与所述查询图片对应的文字信息,构建双语可比较语料。Step S103 : constructing a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.

本发明所提供的双语可比较语料挖掘方法，通过预先从不同语言的数据库中抓取多个图片以及对应的文字信息，建立包含图片以及文字信息的多模态知识库；将源语言知识库中的图片作为查询图片，在目标语言知识库中进行图片检索，查找出与查询图片相似的目标图片；根据目标图片对应的文字信息与查询图片对应的文字信息，构建双语可比较语料。本申请采用跨媒体信息检索技术，通过图片作为沟通源语言和目标语言的媒介，进而获取源语言在目标端的等价或可比较的文本，为互联网中的双语可比较资源挖掘提供了新方法，解决了特定双语资源稀缺的问题。With the bilingual comparable corpus mining method provided by the present invention, multiple pictures and their corresponding text information are crawled in advance from databases in different languages, and a multimodal knowledge base containing the pictures and text information is established; pictures in the source language knowledge base are used as query pictures, and picture retrieval is performed in the target language knowledge base to find target pictures similar to the query pictures; a bilingual comparable corpus is then constructed from the text information corresponding to the target pictures and the text information corresponding to the query pictures. The application adopts cross-media information retrieval technology, using pictures as the medium that bridges the source language and the target language, so as to obtain text on the target side that is equivalent or comparable to the source language. This provides a new method for mining bilingual comparable resources on the Internet and solves the problem that specific bilingual resources are scarce.

在上述任一实施例的基础上,为提升相似图片检索系统的精确率和运行效率,本发明实施例进一步提出结合新闻主题信息以及时间标签信息对相似图片进行优化检索的模型。具体地,本发明所提供的双语可比较语料挖掘方法中在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片的过程可以具体为:Based on any of the above embodiments, in order to improve the accuracy and operation efficiency of the similar image retrieval system, the embodiment of the present invention further proposes a model for optimizing retrieval of similar images in combination with news topic information and time tag information. Specifically, in the bilingual comparable corpus mining method provided by the present invention, image retrieval is performed in the target language knowledge base, and the process of finding out the target image similar to the query image can be specifically as follows:

确定所述查询图片的主题分类以及发布时间信息;Determine the subject classification and release time information of the query picture;

滤除所述目标语言知识库中与所述主题分类以及发布时间信息不匹配的图片;Filtering out pictures that do not match the subject classification and release time information in the target language knowledge base;

在过滤后的所述目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片。Image retrieval is performed in the filtered target language knowledge base to find out a target image similar to the query image.

参照图4本发明所提供的双语可比较语料挖掘方法的另一种具体实施方式的流程图,下面对该方法的具体实施过程进行进一步详细阐述。Referring to the flowchart of another specific implementation of the bilingual comparable corpus mining method provided by the present invention in FIG. 4 , the specific implementation process of the method will be further elaborated below.

该过程包括:The process includes:

步骤S201：多模态知识库构建，该模块利用网络爬虫从多种语言的新闻网站中爬取图片及对应图片标题，将图片和对应标题对作为二元组，构建多模态知识库。Step S201: multimodal knowledge base construction. This module uses web crawlers to crawl pictures and the corresponding picture titles from news websites in multiple languages, and builds the multimodal knowledge base by storing each picture and its corresponding title as a two-tuple.

具体可分为以下步骤：根据新闻网站栏目划分，分别在不同栏目或主题下爬取对应的新闻页面；解析新闻页面，通过网页结构化分析提取图片和图片标题，并提取新闻发布时间；将图片和对应图片标题构成二元组，存储到多模态知识库中。Specifically, this can be divided into the following steps: according to the column division of the news website, crawl the corresponding news pages under different columns or topics; parse the news pages, extract pictures and picture titles through structural analysis of the web pages, and extract the news release time; and combine each picture and its corresponding picture title into a two-tuple stored in the multimodal knowledge base.

按新闻网站主题栏目抓取对应主题或栏目下的新闻网页,并分别存储在对应主题目录下。根据图片标签(如:img,pic等)匹配网页中图片和图片标题,同时去除标题文本中与内容无关的网页标签(如:<a>,</br>等)。将图片和图片标题表示成二元组,作为多模态知识库的一条记录。Grab the news web pages under the corresponding theme or column according to the theme column of the news website, and store them in the corresponding theme directory respectively. Match images and image titles in web pages according to image tags (eg: img, pic, etc.), and remove web page tags (eg: <a>, </br>, etc.) that are not related to content in the title text. The image and image title are represented as two-tuples as a record in the multimodal knowledge base.

以中-英可比较资源挖掘为例，本发明从主流中文新闻网站(如：新华网、凤凰网、新浪网)和英文新闻网站(FOX,CNN,BBC)爬取新闻，根据新闻网站给出的不同栏目(如：军事、财经、科技、体育、娱乐等)将新闻进行分类和主题划分，同时记录新闻发布的时间。对于抓取的新闻，本发明实施例对新闻网页进行结构化分析，提取网页中的图片及图片标题，存储为二元组(image,caption)，作为多模态数据库中的一个记录。以此方式，建立不同语言下的多模态数据库。Taking Chinese-English comparable resource mining as an example, the present invention crawls news from mainstream Chinese news websites (e.g., Xinhuanet, Fenghuang.com, Sina.com) and English news websites (FOX, CNN, BBC), classifies the news by topic according to the different columns given by the news websites (e.g., military, finance, technology, sports, entertainment), and records the release time of each news item. For the crawled news, the embodiment of the present invention performs structural analysis on the news web pages, extracts the pictures and picture titles in the pages, and stores each as a two-tuple (image, caption) constituting one record of the multimodal database. In this way, multimodal databases in different languages are established.
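A minimal sketch of the page-parsing step described above, assuming a simple figure/figcaption layout and the requests/BeautifulSoup libraries; real news sites differ in structure, so the selectors here are illustrative assumptions rather than the actual crawler of the invention.

```python
import requests
from bs4 import BeautifulSoup

def extract_image_caption_pairs(url: str) -> list[tuple[str, str]]:
    """Return (image URL, caption text) two-tuples found on one news page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # assumed layout: <figure><img src=...><figcaption>caption</figcaption></figure>;
    # real news sites differ and need site-specific parsing rules
    for fig in soup.find_all("figure"):
        img, cap = fig.find("img"), fig.find("figcaption")
        if img and img.get("src") and cap:
            caption = cap.get_text(strip=True)  # strips residual markup, keeps plain text
            pairs.append((img["src"], caption))
    return pairs
```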

步骤S202:确定所述查询图片的主题分类以及发布时间信息;滤除目标语言知识库中与所述主题分类以及发布时间信息不匹配的图片;Step S202: Determine the subject classification and release time information of the query picture; filter out pictures that do not match the subject classification and release time information in the target language knowledge base;

人工整合新闻网站已有的主题分类标签体系，对于不同语言的新闻网站，利用翻译词典或借助机器翻译技术实现跨语言的主题标签映射。不同语言和不同新闻网站都拥有各自的新闻分类体系，因此需要建立统一的分类标准。例如将"体育"映射到"Sports"。The existing topic classification and labeling systems of the news websites are manually integrated; for news websites in different languages, translation dictionaries or machine translation are used to map topic labels across languages. Different languages and different news websites each have their own news classification systems, so a unified classification standard needs to be established, for example mapping "体育" to "Sports".
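A minimal sketch of such a unified topic-label mapping, using the "体育" → "Sports" example above; the concrete table entries are assumptions and would in practice come from a translation dictionary or machine translation output.

```python
# unified topic labels: map each site's native column name to a shared tag
TOPIC_MAP = {
    "体育": "Sports",
    "军事": "Military",
    "财经": "Finance",
    "科技": "Technology",
    "娱乐": "Entertainment",
    "Sports": "Sports",   # English sites may already use the shared tag
}

def unify_topic(raw_label: str) -> str:
    return TOPIC_MAP.get(raw_label, "Other")
```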

融合新闻主题的图片检索优化，根据新闻所在的主题类别将图片标记为指定标签。在进行相似图片检索时，将搜索范围限定在相同主题下。可比较新闻往往具有相同或相似的主题，因此在相似图片检索时，限制图片来源的主题，在相同的主题下检索候选图片并排序，以过滤无效的候选图片。Picture retrieval optimization incorporating news topics: each picture is tagged with a specified label according to the topic category of the news it belongs to, and the search scope of similar-picture retrieval is limited to the same topic. Comparable news items often share the same or similar topics, so during similar-picture retrieval the topic of the picture source is restricted, and candidate pictures are retrieved and ranked under the same topic in order to filter out invalid candidates.

融合时间标签的图片检索优化，根据新闻发布的时间对新闻进行自动划分，在进行相似图片检索时，将检索范围限定在指定的时间窗口内，并改变时间窗口阈值，获取不同的检索结果。新闻事件的发生具有时效性，导致相似的新闻图片往往出现在相近的时间段内。基于此，利用新闻图片的发布时间作为标签，通过设置不同的时间窗口d，并在对应时间区间内{t-d,t+d}进行相似图片的检索。Picture retrieval optimization incorporating time labels: news items are automatically partitioned according to their release time, and during similar-picture retrieval the search range is limited to a specified time window; changing the time-window threshold yields different retrieval results. News events are time-sensitive, so similar news pictures tend to appear within close time periods. Based on this, the release time of a news picture is used as a label, different time windows d are set, and similar pictures are retrieved within the corresponding time interval {t-d, t+d}.

本步骤利用图片所在新闻的主题分类和发布时间等标签，限制相似图片检索模块的搜索空间，以过滤不相似的图片候选，使得面向大规模数据集时，相似图片检索系统的效率和性能得到稳定提升。This step uses labels such as the topic classification and release time of the news in which each picture appears to restrict the search space of the similar-picture retrieval module and to filter out dissimilar picture candidates, so that the efficiency and performance of the similar-picture retrieval system are steadily improved when facing large-scale datasets.
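The filtering of step S202 can be sketched as follows, restricting candidates to the query's unified topic and to the time interval {t-d, t+d} around the query's release time; the record fields (topic, publish_time) and the assumed date format are consistent with the knowledge-base sketch earlier and are illustrative assumptions.

```python
from datetime import datetime, timedelta

def filter_candidates(candidates, query_topic: str, query_time_str: str, window_days: int):
    """Keep target-language records whose topic matches the query and whose
    release time falls inside [t-d, t+d] around the query's release time t."""
    fmt = "%Y-%m-%d"                       # assumed date format of publish_time
    t = datetime.strptime(query_time_str, fmt)
    d = timedelta(days=window_days)
    kept = []
    for rec in candidates:                 # rec: KBRecord-like, with .topic / .publish_time
        if rec.topic != query_topic or not rec.publish_time:
            continue
        rec_t = datetime.strptime(rec.publish_time, fmt)
        if t - d <= rec_t <= t + d:
            kept.append(rec)
    return kept
```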

步骤S203:在过滤后的所述目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片;Step S203: Perform image retrieval in the filtered target language knowledge base to find out a target image similar to the query image;

通过抽取图片关键点特征，并基于抽取特征计算图片之间的相似度，用以在不同语言的多模态知识库中实现相似图片的检索。Keypoint features are extracted from the pictures, and the similarity between pictures is computed from the extracted features, so that similar pictures can be retrieved across the multimodal knowledge bases of different languages.

具体可分为以下步骤：将源语言知识库中图片作为查询图片，采用尺度不变特征转换算法提取图片的关键点，将图片表征为基于关键点的特征向量；对目标图片库所有候选图片提取其特征向量，并匹配查询图片和目标候选图片的关键点；计算所有匹配关键点之间的平均欧氏距离，作为图片之间的相似度度量；对于查询图片，根据图片的相似度对目标图片库中所有图片进行排序，选择Top-N图片作为相似图片集。Specifically, this can be divided into the following steps: take a picture in the source language knowledge base as the query picture, extract its keypoints with the scale-invariant feature transform algorithm, and represent the picture as a feature vector based on the keypoints; extract the feature vectors of all candidate pictures in the target picture library and match the keypoints of the query picture and each target candidate picture; compute the average Euclidean distance between all matched keypoints as the similarity measure between pictures; and for the query picture, rank all pictures in the target picture library by this similarity and select the Top-N pictures as the similar picture set.

具体地，对于检索图片中的关键点x，在目标候选图片中寻找两个最相似的关键点y,z。两个关键点之间的相似度，通过关键点特征向量的欧式距离获得。Specifically, for a keypoint x in the query picture, the two most similar keypoints y and z are found in the target candidate picture. The similarity between two keypoints is obtained from the Euclidean distance between their feature vectors.

判断关键点x和目标图片关键点y是否匹配,规则是根据最近关键点和次近关键点的欧氏距离的比值是否小于指定阈值来确定。即若d(x,y)<d(x,z)*θ,则认为关键点x和关键点y相匹配,其中阈值θ∈[0,1],一般设为0.8。图5示出了两个图片中关键点的匹配结果。To determine whether the key point x and the key point y of the target image match, the rule is determined according to whether the ratio of the Euclidean distance between the nearest key point and the next nearest key point is less than the specified threshold. That is, if d(x,y)<d(x,z)*θ, it is considered that the key point x and the key point y match, where the threshold θ∈[0,1] is generally set to 0.8. Figure 5 shows the matching results of keypoints in the two pictures.
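A sketch of this keypoint matching using OpenCV's SIFT implementation and the ratio test d(x,y) < d(x,z)·θ with θ = 0.8; OpenCV and the brute-force matcher are implementation choices assumed here for illustration, not mandated by the text.

```python
import cv2

def match_keypoints(query_path: str, cand_path: str, theta: float = 0.8):
    """Match SIFT keypoints between a query picture and a candidate picture
    using the ratio test d(x, y) < d(x, z) * theta."""
    sift = cv2.SIFT_create()
    img_q = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    img_c = cv2.imread(cand_path, cv2.IMREAD_GRAYSCALE)
    _, desc_q = sift.detectAndCompute(img_q, None)
    _, desc_c = sift.detectAndCompute(img_c, None)
    if desc_q is None or desc_c is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = []
    for pair in matcher.knnMatch(desc_q, desc_c, k=2):
        # pair holds the nearest (y) and second-nearest (z) candidate keypoints for x
        if len(pair) == 2 and pair[0].distance < theta * pair[1].distance:
            matches.append(pair[0])
    return matches
```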

计算图片之间的相似度。根据图片中所有匹配的关键点计算整个图片之间的相似度。具体的，计算所有匹配点之间的平均欧式距离。公式如下：Calculating the similarity between pictures: the similarity between two whole pictures is computed from all of their matched keypoints. Specifically, the average Euclidean distance between all matched points is calculated. The formula is as follows:

Figure BDA0001250644140000101

m表示两个图片中相匹配的关键点个数,n表示关键点特征向量的维度。m represents the number of matching keypoints in the two images, and n represents the dimension of the keypoint feature vector.

根据图片相似度，对目标候选图片集进行排序，选择topN的相似图片。According to the picture similarity, the target candidate picture set is ranked and the top-N similar pictures are selected.
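Continuing the sketch above, the picture-level similarity is taken as the average Euclidean distance over the matched keypoint descriptors (smaller means more similar), and candidates are ranked to pick the top-N; since the exact formula appears only as a figure in the source, a plain average of descriptor distances is assumed here.

```python
def picture_distance(query_path: str, cand_path: str) -> float:
    """Average Euclidean distance over matched keypoint descriptors
    (smaller value = more similar pictures); uses match_keypoints() from the sketch above."""
    matches = match_keypoints(query_path, cand_path)
    if not matches:
        return float("inf")   # no matched keypoints: treat as maximally dissimilar
    return sum(m.distance for m in matches) / len(matches)

def top_n_similar(query_path: str, candidate_paths: list[str], n: int) -> list[str]:
    """Rank candidates by ascending average distance and keep the Top-N."""
    ranked = sorted(candidate_paths, key=lambda p: picture_distance(query_path, p))
    return ranked[:n]
```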

步骤S204:根据所述目标图片对应的文字信息与所述查询图片对应的文字信息,构建双语可比较语料。Step S204: According to the text information corresponding to the target image and the text information corresponding to the query image, construct a bilingual comparable corpus.

通过借助相似图片检索系统,在与目标领域相关的新闻主题下进行相似图片检索,并进而抽取双语可比较文本。本发明提出基于图像相似度建模的标题可比较性度量,以及融合图像相似度和文本相似度建模的标题可比较性度量。With the help of a similar image retrieval system, similar images are retrieved under news topics related to the target domain, and bilingual comparable texts are extracted. The invention proposes a title comparability measure based on image similarity modeling, and a title comparability measure integrating image similarity and text similarity modeling.

基于图像相似度建模的标题可比较性度量方法,具体可分为以下步骤:The title comparability measurement method based on image similarity modeling can be divided into the following steps:

利用相似图片检索系统,将源语言的图片作为输入,到目标端检索相似的图片集合,并基于SIFT算法计算出查询图片和检索结果图片的相似度。Using the similar image retrieval system, the image in the source language is used as input, and the similar image collection is retrieved at the target end, and the similarity between the query image and the retrieval result image is calculated based on the SIFT algorithm.

利用图片相似度建模图片标题的可比较性,利用图片的排序结果,作为图片标题可比较度的排序结果,标题之间的可比较性使用以下公式度量:Use the image similarity to model the comparability of image titles, and use the ranking result of the images as the ranking result of the comparability of image titles. The comparability between titles is measured using the following formula:

Figure BDA0001250644140000102

其中，S(scap,tcap)表示标题scap和标题tcap之间的可比较度。v(simg,timg)表示图片simg和图片timg之间的相似度，利用SIFT算法计算得到，b是常数。where S(scap,tcap) represents the comparability between the title scap and the title tcap, v(simg,timg) represents the similarity between the picture simg and the picture timg, computed with the SIFT algorithm, and b is a constant.

融合图像相似度和文本相似度的标题可比较性度量方法,具体可分为以下步骤:The title comparability measurement method combining image similarity and text similarity can be divided into the following steps:

利用相似图片检索系统,将源语言的图片作为输入,到目标端检索相似的图片集合,并基于SIFT算法计算出查询图片和检索结果图片的相似度。Using the similar image retrieval system, the image in the source language is used as input, and the similar image collection is retrieved at the target end, and the similarity between the query image and the retrieval result image is calculated based on the SIFT algorithm.

计算查询图片标题和目标候选图片标题的文本相似度。本发明从内容相似度(fc),实体相似度(fe),以及结构相似度(fl)三个角度,综合评价查询标题和候选标题的文本相似度。具体公式如下:Calculate the text similarity between the title of the query image and the title of the target candidate image. The present invention comprehensively evaluates the text similarity between the query title and the candidate title from three angles of content similarity (fc), entity similarity (fe), and structural similarity (fl). The specific formula is as follows:

S = αfc + βfe + γfl

进一步融合图像相似度和图片标题之间的文本相似度得分，对相似图片检索结果(基于图像相似度)进行重排序。根据重排序后的结果，构建可比较文本对。本发明采用如下公式，计算查询与候选综合相似度：The image similarity is further fused with the text similarity score between the picture titles, and the similar-picture retrieval results (ranked by image similarity) are re-ranked. Comparable text pairs are then constructed from the re-ranked results. The present invention uses the following formula to compute the comprehensive similarity between a query and a candidate:

Figure BDA0001250644140000111

其中Stxt(scap,tcap)表示查询标题和检索标题之间的文本相似度,b是常量,用于控制图片相似度对最终相似度得分的贡献。where S txt (s cap ,t cap ) represents the textual similarity between the query title and the retrieval title, and b is a constant used to control the contribution of image similarity to the final similarity score.

查询图片标题和候选图片标题的文本相似度的计算方法如下:使用Google在线翻译系统将查询标题翻译到目标语言,以保留原文本的词序、句子结构和命名实体等重要信息。The calculation method of the text similarity between the query image title and the candidate image title is as follows: use the Google online translation system to translate the query title into the target language to retain important information such as word order, sentence structure and named entities of the original text.

基于如下三个特征评价查询标题翻译和目标图片标题的可比较性:The comparability of query title translation and target image title is evaluated based on the following three features:

内容相似度：将查询标题翻译后的文本描述进行分词、去停用词，词根化等操作，得到bag-of-words表示。基于向量空间模型，计算查询标题和目标标题的余弦相似度(记作fc)。Content similarity: the translated text of the query title is segmented, stop words are removed, and the words are stemmed to obtain a bag-of-words representation; based on the vector space model, the cosine similarity between the query title and the target title is computed (denoted fc).

实体相似度:利用斯坦福命名实体识别工具识别图片标题中的命名实体,得到命名实体词袋集合,并基于向量空间模型计算双语文本之间的命名实体相似度(记作fe)。Entity similarity: Use the Stanford named entity recognition tool to identify named entities in image captions, obtain a set of named entity word bags, and calculate the named entity similarity (denoted as fe) between bilingual texts based on a vector space model.

结构相似度:将查询标题和目标标题中内容词(包括名词,动词,副词,形容词,专有名词等)数目的比值作为一个可比较性评价标准(记作fl)。Structural similarity: The ratio of the number of content words (including nouns, verbs, adverbs, adjectives, proper nouns, etc.) in the query title and the target title is used as a comparability evaluation standard (denoted as fl).

采用加权平均方式融合上述三种特征,得到查询标题和目标标题之间可比较性得分。A weighted average method is used to fuse the above three features to obtain the comparability score between the query title and the target title.

S = αfc + βfe + γfl

其中,根据各特征对可比较性的贡献程度,经验地设置α=0.8,β=0.15,γ=0.05。Among them, α=0.8, β=0.15, and γ=0.05 are empirically set according to the contribution degree of each feature to the comparability.
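A minimal sketch of the fused caption comparability S = αfc + βfe + γfl with the weights above; the inputs are assumed to be already tokenized, stemmed, and NER-tagged (the source uses word segmentation, stop-word removal, stemming, and the Stanford NER tool for those steps), and the min/max form of the content-word ratio is an assumption to keep fl in [0, 1].

```python
from collections import Counter
from math import sqrt

def cosine(bag_a: Counter, bag_b: Counter) -> float:
    dot = sum(bag_a[w] * bag_b[w] for w in bag_a)
    na = sqrt(sum(v * v for v in bag_a.values()))
    nb = sqrt(sum(v * v for v in bag_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def caption_comparability(query_tokens, target_tokens,
                          query_entities, target_entities,
                          query_content_words, target_content_words,
                          alpha=0.8, beta=0.15, gamma=0.05) -> float:
    """Weighted fusion S = alpha*fc + beta*fe + gamma*fl of the three caption features."""
    fc = cosine(Counter(query_tokens), Counter(target_tokens))      # content similarity
    fe = cosine(Counter(query_entities), Counter(target_entities))  # named-entity similarity
    # structural similarity: ratio of content-word counts, clipped to [0, 1] (assumption)
    a, b = len(query_content_words), len(target_content_words)
    fl = min(a, b) / max(a, b) if max(a, b) else 0.0
    return alpha * fc + beta * fe + gamma * fl
```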

本发明提出的基于跨媒体信息检索的双语可比较语料挖掘方法,为互联网中的双语可比较资源挖掘提供了新方法。不仅提出构建大规模多模态数据库的方法,还提出基于图片相似度辅助可比较资源的挖掘,并基于此提出两种可比较性度量方法,用以获取大规模可比较文本。The bilingual comparable corpus mining method based on cross-media information retrieval proposed by the present invention provides a new method for bilingual comparable resource mining in the Internet. Not only a method of constructing a large-scale multimodal database is proposed, but also the mining of comparable resources based on image similarity is proposed. Based on this, two comparability measurement methods are proposed to obtain large-scale comparable texts.

下面对本发明实施例提供的双语可比较语料挖掘装置进行介绍,下文描述的双语可比较语料挖掘装置与上文描述的双语可比较语料挖掘方法可相互对应参照。The following describes the bilingual comparable corpus mining device provided by the embodiment of the present invention. The bilingual comparable corpus mining device described below and the bilingual comparable corpus mining method described above may refer to each other correspondingly.

图6为本发明实施例提供的双语可比较语料挖掘装置的结构框图，参照图6双语可比较语料挖掘装置可以包括：FIG. 6 is a block diagram of the structure of a bilingual comparable corpus mining device provided by an embodiment of the present invention. Referring to FIG. 6, the bilingual comparable corpus mining device may include:

知识库建立模块100,用于预先从不同语言的数据库中抓取多个图片以及对应的文字信息,建立包含图片以及所述文字信息的多模态知识库;The knowledge base building module 100 is used for grabbing multiple pictures and corresponding text information from databases in different languages in advance, and establishing a multimodal knowledge base including the pictures and the text information;

查找模块200,用于将源语言知识库中的图片作为查询图片,在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片;The search module 200 is configured to use the picture in the source language knowledge base as a query picture, perform picture retrieval in the target language knowledge base, and find out a target picture similar to the query picture;

构建模块300,用于根据所述目标图片对应的文字信息与所述查询图片对应的文字信息,构建双语可比较语料。The construction module 300 is configured to construct a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.

作为一种具体实施方式,本发明所提供的双语可比较语料挖掘装置中,查找模块可以具体包括:As a specific embodiment, in the bilingual comparable corpus mining device provided by the present invention, the search module may specifically include:

提取单元,用于采用尺度不变特征转换算法提取所述查询图片的关键点,将所述查询图片表征为基于所述关键点的特征向量;an extraction unit, used for extracting the key points of the query picture by adopting a scale-invariant feature conversion algorithm, and characterizing the query picture as a feature vector based on the key points;

匹配单元,用于提取所述目标语言知识库中所有候选图片的特征向量,并匹配所述查询图片和所述候选图片的关键点;a matching unit for extracting the feature vectors of all candidate pictures in the target language knowledge base, and matching the query picture and the key points of the candidate pictures;

相似度计算单元,用于计算所有匹配关键点之间的平均欧式距离,作为图片之间的图片相似度;The similarity calculation unit is used to calculate the average Euclidean distance between all matching key points as the image similarity between the images;

选取单元,用于根据所述图片相似度对候选图片进行排序,选取与所述查询图片相似的目标图片。A selection unit, configured to sort candidate pictures according to the picture similarity, and select a target picture similar to the query picture.

作为一种具体实施方式,上述查找模块还可以包括:As a specific implementation manner, the above-mentioned search module may also include:

确定单元,用于确定所述查询图片的主题分类以及发布时间信息;a determining unit for determining the subject classification and release time information of the query picture;

滤除单元,用于滤除所述目标语言知识库中与所述主题分类以及发布时间信息不匹配的图片;a filtering unit, configured to filter out pictures in the target language knowledge base that do not match the subject classification and release time information;

查找单元,用于在过滤后的所述目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片。A search unit, configured to perform image retrieval in the filtered target language knowledge base to find a target image similar to the query image.

具体地,上述构建模块可以具体包括:Specifically, the above-mentioned building modules may specifically include:

文本相似度计算单元,用于计算所述查询图片对应的文字信息以及所述目标图片对应的文字信息的文本相似度;a text similarity calculation unit, configured to calculate the text similarity of the text information corresponding to the query picture and the text information corresponding to the target picture;

排序单元,用于根据所述文本相似度对所述目标图片进行重新排序;a sorting unit, configured to reorder the target pictures according to the text similarity;

构建单元,用于根据重新排序后的结果,构建双语可比较语料。The construction unit is used to construct bilingual comparable corpus according to the reordered results.

本实施例的双语可比较语料挖掘装置用于实现前述的双语可比较语料挖掘方法，因此双语可比较语料挖掘装置中的具体实施方式可见前文中的双语可比较语料挖掘方法的实施例部分，例如，知识库建立模块100，查找模块200，构建模块300，分别用于实现上述双语可比较语料挖掘方法中步骤S101，S102，S103，所以，其具体实施方式可以参照相应的各个部分实施例的描述，在此不再赘述。The bilingual comparable corpus mining device of this embodiment is used to implement the aforementioned bilingual comparable corpus mining method, so specific implementations of the device can be found in the foregoing embodiments of the method; for example, the knowledge base building module 100, the search module 200, and the construction module 300 are respectively used to implement steps S101, S102, and S103 of the above bilingual comparable corpus mining method, so their specific implementations may refer to the descriptions of the corresponding embodiments and will not be repeated here.

本发明所提供的双语可比较语料挖掘装置，通过预先从不同语言的数据库中抓取多个图片以及对应的文字信息，建立包含图片以及文字信息的多模态知识库；将源语言知识库中的图片作为查询图片，在目标语言知识库中进行图片检索，查找出与查询图片相似的目标图片；根据目标图片对应的文字信息与查询图片对应的文字信息，构建双语可比较语料。本申请采用跨媒体信息检索技术，通过图片作为沟通源语言和目标语言的媒介，进而获取源语言在目标端的等价或可比较的文本，为互联网中的双语可比较资源挖掘提供了新方法，解决了特定双语资源稀缺的问题。With the bilingual comparable corpus mining device provided by the present invention, multiple pictures and their corresponding text information are crawled in advance from databases in different languages, and a multimodal knowledge base containing the pictures and text information is established; pictures in the source language knowledge base are used as query pictures, and picture retrieval is performed in the target language knowledge base to find target pictures similar to the query pictures; a bilingual comparable corpus is then constructed from the text information corresponding to the target pictures and the text information corresponding to the query pictures. The application adopts cross-media information retrieval technology, using pictures as the medium that bridges the source language and the target language, so as to obtain text on the target side that is equivalent or comparable to the source language. This provides a new method for mining bilingual comparable resources on the Internet and solves the problem that specific bilingual resources are scarce.

本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其它实施例的不同之处,各个实施例之间相同或相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。The various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same or similar parts between the various embodiments may be referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant part can be referred to the description of the method.

专业人员还可以进一步意识到，结合本文中所公开的实施例描述的各示例的单元及算法步骤，能够以电子硬件、计算机软件或者二者的结合来实现，为了清楚地说明硬件和软件的可互换性，在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本发明的范围。Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.

结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件模块，或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

以上对本发明所提供的双语可比较语料挖掘方法以及装置进行了详细介绍。本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想。应当指出,对于本技术领域的普通技术人员来说,在不脱离本发明原理的前提下,还可以对本发明进行若干改进和修饰,这些改进和修饰也落入本发明权利要求的保护范围内。The method and device for mining a bilingual comparable corpus provided by the present invention are described above in detail. The principles and implementations of the present invention are described herein by using specific examples, and the descriptions of the above embodiments are only used to help understand the method and the core idea of the present invention. It should be pointed out that for those skilled in the art, without departing from the principle of the present invention, several improvements and modifications can also be made to the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims (8)

1.一种双语可比较语料挖掘方法,其特征在于,包括:1. a bilingual comparable corpus mining method, is characterized in that, comprises: 预先从不同语言的数据库中抓取多个图片以及对应的文字信息,建立包含图片以及所述文字信息的多模态知识库;Capture multiple pictures and corresponding text information from databases in different languages in advance, and establish a multimodal knowledge base containing the pictures and the text information; 将源语言知识库中的图片作为查询图片,在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片;Taking the picture in the source language knowledge base as the query picture, performing picture retrieval in the target language knowledge base, and finding out the target picture similar to the query picture; 计算所述查询图片对应的文字信息以及所述目标图片对应的文字信息的文本相似度;calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture; 根据所述查询图片和所述目标图片的图片相似度和所述文本相似度计算出综合相似度,根据所述综合相似度对所述目标图片进行重新排序;其中,计算综合相似度的方法包括:The comprehensive similarity is calculated according to the picture similarity and the text similarity between the query picture and the target picture, and the target pictures are reordered according to the comprehensive similarity; wherein, the method for calculating the comprehensive similarity includes: :
Figure FDA0002627542310000011
S(scap,tcap)表示查询标题scap和检索标题tcap之间的综合相似度;v(simg,timg)表示查询图片simg和目标图片timg之间的图片相似度;Stxt(scap,tcap)表示查询标题scap和检索标题tcap之间的文本相似度,b是常量,用于控制图片相似度对最终相似度得分的贡献;其中,所述检索标题具体为所述目标图片的标题;S(s cap ,t cap ) represents the comprehensive similarity between the query title s cap and the retrieval title t cap ; v(s img ,t img ) represents the image similarity between the query image s img and the target image t img ; S txt (s cap , t cap ) represents the text similarity between the query title s cap and the retrieval title t cap , b is a constant, and is used to control the contribution of the image similarity to the final similarity score; wherein, the retrieval title Specifically, the title of the target image; 根据重新排序后的结果,构建双语可比较语料。Based on the re-ranked results, a bilingual comparable corpus is constructed.
2.如权利要求1所述的双语可比较语料挖掘方法,其特征在于,所述预先从不同语言的数据库中抓取多个图片以及对应的文字信息包括:2. The bilingual comparable corpus mining method as claimed in claim 1, wherein the pre-fetching of multiple pictures and corresponding text information from databases of different languages comprises: 利用网络爬虫从新闻网站中抓取图片,所述文字信息为所述图片对应的标题信息,将图片以及对应的文字信息作为二元组,存储于所述多模态知识库中。A picture is grabbed from a news website by a web crawler, the text information is the title information corresponding to the picture, and the picture and the corresponding text information are stored in the multimodal knowledge base as a two-tuple. 3.如权利要求2所述的双语可比较语料挖掘方法,其特征在于,所述在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片包括:3. The bilingual comparable corpus mining method as claimed in claim 2, wherein the image retrieval is performed in the target language knowledge base, and the target image that is similar to the query image is found to include: 采用尺度不变特征转换算法提取所述查询图片的关键点,将所述查询图片表征为基于所述关键点的特征向量;Extract the key points of the query picture by using a scale-invariant feature transformation algorithm, and characterize the query picture as a feature vector based on the key points; 提取所述目标语言知识库中所有候选图片的特征向量,并匹配所述查询图片和所述候选图片的关键点;Extract the feature vectors of all candidate pictures in the target language knowledge base, and match the query picture and the key points of the candidate pictures; 计算所有匹配关键点之间的平均欧式距离,作为图片之间的图片相似度;Calculate the average Euclidean distance between all matching key points as the image similarity between images; 根据所述图片相似度对候选图片进行排序,选取与所述查询图片相似的目标图片。The candidate pictures are sorted according to the picture similarity, and a target picture similar to the query picture is selected. 4.如权利要求1至3任一项所述的双语可比较语料挖掘方法,其特征在于,所述在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片包括:4. The bilingual comparable corpus mining method according to any one of claims 1 to 3, wherein the image retrieval is performed in the target language knowledge base, and the target image that is similar to the query image is found to include: 确定所述查询图片的主题分类以及发布时间信息;Determine the subject classification and release time information of the query picture; 滤除所述目标语言知识库中与所述主题分类以及发布时间信息不匹配的图片;Filtering out pictures that do not match the subject classification and release time information in the target language knowledge base; 在过滤后的所述目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片。Image retrieval is performed in the filtered target language knowledge base to find out a target image similar to the query image. 5.如权利要求1所述的双语可比较语料挖掘方法,其特征在于,所述计算所述查询图片对应的文字信息以及所述目标图片对应的文字信息的文本相似度包括:5. The bilingual comparable corpus mining method according to claim 1, wherein the calculating the text similarity of the text information corresponding to the query picture and the text information corresponding to the target picture comprises: 计算所述查询图片对应的文字信息与所述目标图片对应的文字信息的内容相似度、实体相似度以及结构相似度;Calculate the content similarity, entity similarity and structural similarity between the text information corresponding to the query picture and the text information corresponding to the target picture; 对所述内容相似度、所述实体相似度以及所述结构相似度进行加权平均,计算得到对应文字信息的文本相似度。The content similarity, the entity similarity and the structural similarity are weighted and averaged, and the text similarity of the corresponding text information is calculated. 6.一种双语可比较语料挖掘装置,其特征在于,包括:6. 
A bilingual comparable corpus mining device, characterized in that, comprising: 知识库建立模块,用于预先从不同语言的数据库中抓取多个图片以及对应的文字信息,建立包含图片以及所述文字信息的多模态知识库;The knowledge base building module is used to capture multiple pictures and corresponding text information from databases in different languages in advance, and establish a multimodal knowledge base including the pictures and the text information; 查找模块,用于将源语言知识库中的图片作为查询图片,在目标语言知识库中进行图片检索,查找出与所述查询图片相似的目标图片;a search module, used for taking the picture in the source language knowledge base as a query picture, performing picture retrieval in the target language knowledge base, and finding out a target picture similar to the query picture; 构建模块,用于计算所述查询图片对应的文字信息以及所述目标图片对应的文字信息的文本相似度;a building module, configured to calculate the text similarity between the text information corresponding to the query image and the text information corresponding to the target image; 根据所述查询图片和所述目标图片的图片相似度和所述文本相似度计算出综合相似度,根据所述综合相似度对所述目标图片进行重新排序;其中,计算综合相似度的方法包括:The comprehensive similarity is calculated according to the picture similarity and the text similarity between the query picture and the target picture, and the target pictures are reordered according to the comprehensive similarity; wherein, the method for calculating the comprehensive similarity includes: :
[Formula FDA0002627542310000021 — the comprehensive similarity formula is reproduced only as an image in the original patent publication.]
where S(s_cap, t_cap) denotes the comprehensive similarity between the query title s_cap and the retrieved title t_cap; v(s_img, t_img) denotes the picture similarity between the query picture s_img and the target picture t_img; S_txt(s_cap, t_cap) denotes the text similarity between the query title s_cap and the retrieved title t_cap; and b is a constant controlling the contribution of the picture similarity to the final similarity score, the retrieved title being the title of the target picture;
and to construct the bilingual comparable corpus from the re-ranked results.
7. The bilingual comparable corpus mining device according to claim 6, wherein the search module comprises:
an extraction unit, configured to extract key points of the query picture with a scale-invariant feature transform algorithm, and to represent the query picture as a feature vector based on the key points;
a matching unit, configured to extract the feature vectors of all candidate pictures in the target language knowledge base, and to match the key points of the query picture and the candidate pictures;
a similarity calculation unit, configured to calculate the average Euclidean distance between all matched key points as the picture similarity between the pictures;
a selection unit, configured to sort the candidate pictures according to the picture similarity and to select a target picture similar to the query picture.
8. The bilingual comparable corpus mining device according to claim 6 or 7, wherein the search module further comprises:
a determining unit, configured to determine the topic classification and publication time information of the query picture;
a filtering unit, configured to filter out the pictures in the target language knowledge base that do not match the topic classification and publication time information;
a search unit, configured to perform picture retrieval in the filtered target language knowledge base to find a target picture similar to the query picture.
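In this extraction the granted comprehensive-similarity formula of claim 6 survives only as the image reference FDA0002627542310000021, so its exact form cannot be recovered from the text. One plausible form consistent with the variable definitions above, and the one assumed in the re-ranking sketch further below, is a simple linear interpolation in which the constant b weights the picture similarity against the text similarity; this is an assumption for illustration, not the formula actually claimed:

S(s_{cap}, t_{cap}) = b \cdot v(s_{img}, t_{img}) + (1 - b) \cdot S_{txt}(s_{cap}, t_{cap}), \qquad 0 \le b \le 1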
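Claims 3, 4, 7 and 8 together describe the retrieval step: SIFT key-point extraction, key-point matching, the average Euclidean distance between matched key points as the picture similarity, and an optional pre-filter on topic classification and publication time. Below is a minimal sketch of that pipeline, assuming OpenCV for SIFT, a knowledge base stored as a list of dicts with image, title, topic and pub_date fields, a three-day time window, and a mapping of the average distance to a (0, 1] score; these concrete choices are assumptions for illustration, not details fixed by the patent.

import cv2  # OpenCV; SIFT is available in recent releases via cv2.SIFT_create()

def sift_descriptors(image_path):
    """Detect SIFT key points in a picture and return their descriptors."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(image, None)
    return descriptors

def picture_similarity(query_desc, candidate_desc):
    """Average Euclidean distance over matched key points, mapped to (0, 1].

    The claims use the average distance itself as the picture similarity;
    the 1 / (1 + distance) mapping is only a convenience so that larger
    values mean 'more similar' and the score can later be combined with
    the text similarity.
    """
    if query_desc is None or candidate_desc is None:
        return 0.0
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.match(query_desc, candidate_desc)
    if not matches:
        return 0.0
    average_distance = sum(m.distance for m in matches) / len(matches)
    return 1.0 / (1.0 + average_distance)

def retrieve(query, target_kb, top_k=10):
    """Rank candidate pictures from the target-language knowledge base.

    `query` and each knowledge-base entry are assumed to be dicts holding
    the picture path ('image'), its title ('title'), a topic label
    ('topic') and a datetime.date publication date ('pub_date').
    """
    query_desc = sift_descriptors(query["image"])
    # Pre-filter by topic classification and publication time (claims 4 and 8);
    # the three-day window is an illustrative choice.
    candidates = [entry for entry in target_kb
                  if entry["topic"] == query["topic"]
                  and abs((entry["pub_date"] - query["pub_date"]).days) <= 3]
    scored = [(picture_similarity(query_desc, sift_descriptors(entry["image"])), entry)
              for entry in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]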
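Claims 5 and 6 describe the text side: a weighted average of content, entity and structural similarity yields S_txt, which is combined with the picture similarity into the comprehensive score used to re-rank the retrieved pictures. The sketch below continues the example above and assumes the linear combination given earlier, illustrative weights of 0.4/0.3/0.3, Jaccard overlap for the content and entity measures and a length ratio for the structural measure; the patent fixes only the weighted average and the re-ranking, not these component measures, and a real system would need bilingual resources (e.g., a lexicon or translation step) to compare titles across languages.

def tokens(text):
    """Whitespace tokenization; a stand-in for real bilingual preprocessing."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard overlap of two sets, 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def text_similarity(query_title, candidate_title,
                    query_entities, candidate_entities,
                    weights=(0.4, 0.3, 0.3)):
    """Weighted average of content, entity and structural similarity (claim 5)."""
    content = jaccard(tokens(query_title), tokens(candidate_title))
    entity = jaccard(set(query_entities), set(candidate_entities))
    shorter, longer = sorted([len(tokens(query_title)), len(tokens(candidate_title))])
    structure = shorter / longer if longer else 0.0  # length ratio as a stand-in
    w_content, w_entity, w_structure = weights
    return w_content * content + w_entity * entity + w_structure * structure

def comprehensive_similarity(picture_sim, text_sim, b=0.5):
    """Assumed linear combination; the granted formula is only an image here."""
    return b * picture_sim + (1.0 - b) * text_sim

def rerank(query_title, query_entities, scored_candidates, b=0.5):
    """Re-rank retrieval results by comprehensive similarity (claim 6).

    `scored_candidates` is a list of (picture_similarity, entry) pairs such
    as the output of `retrieve` above; each entry is assumed to carry
    'title' and 'entities' fields. The re-ranked (query title, candidate
    title) pairs form the basis of the bilingual comparable corpus.
    """
    rescored = []
    for picture_sim, entry in scored_candidates:
        text_sim = text_similarity(query_title, entry["title"],
                                   query_entities, entry["entities"])
        rescored.append((comprehensive_similarity(picture_sim, text_sim, b), entry))
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored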
CN201710169141.XA 2017-03-21 2017-03-21 A bilingual comparable corpus mining method and device Active CN106980664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710169141.XA CN106980664B (en) 2017-03-21 2017-03-21 A bilingual comparable corpus mining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710169141.XA CN106980664B (en) 2017-03-21 2017-03-21 A bilingual comparable corpus mining method and device

Publications (2)

Publication Number Publication Date
CN106980664A (en) 2017-07-25
CN106980664B (en) 2020-11-10

Family

ID=59338807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710169141.XA Active CN106980664B (en) 2017-03-21 2017-03-21 A bilingual comparable corpus mining method and device

Country Status (1)

Country Link
CN (1) CN106980664B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110078B (en) * 2018-01-11 2024-04-30 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN109522554B (en) * 2018-11-06 2022-12-02 中国人民解放军战略支援部队信息工程大学 Low-resource document classification method and classification system
CN109710923B (en) * 2018-12-06 2020-09-01 浙江大学 Cross-language entity matching method based on cross-media information
CN112069838A (en) * 2019-06-11 2020-12-11 阿里巴巴集团控股有限公司 Evaluation method of translation quality, translation checking method and device
CN112818212B (en) * 2020-04-23 2023-10-13 腾讯科技(深圳)有限公司 Corpus data acquisition method, corpus data acquisition device, computer equipment and storage medium
CN111597830A (en) * 2020-05-20 2020-08-28 腾讯科技(深圳)有限公司 Multi-modal machine learning-based translation method, device, equipment and storage medium
CN111881900B (en) * 2020-07-01 2022-08-23 腾讯科技(深圳)有限公司 Corpus generation method, corpus translation model training method, corpus translation model translation method, corpus translation device, corpus translation equipment and corpus translation medium
CN114004236B (en) * 2021-09-18 2024-04-30 昆明理工大学 Cross-language news event retrieval method integrating knowledge of event entity
CN114139559B (en) * 2021-12-01 2025-01-28 中科合肥技术创新工程院 A cross-domain method for measuring the comparability of bilingual texts
CN115512146A (en) * 2022-11-01 2022-12-23 北京百度网讯科技有限公司 POI information mining method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473327A (en) * 2013-09-13 2013-12-25 广东图图搜网络科技有限公司 Image retrieval method and image retrieval system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102053991B (en) * 2009-10-30 2014-07-02 国际商业机器公司 Method and system for multi-language document retrieval
CN103473280B (en) * 2013-08-28 2017-02-08 中国科学院合肥物质科学研究院 Method for mining comparable network language materials
RU2607975C2 (en) * 2014-03-31 2017-01-11 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Constructing corpus of comparable documents based on universal measure of similarity

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473327A (en) * 2013-09-13 2013-12-25 广东图图搜网络科技有限公司 Image retrieval method and image retrieval system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Building and using comparable corpora for domain-specific bilingual lexicon extraction; Darja Fišer, Špela Vintar, Nikola Ljubešić, et al.; BUCC; 2011-06-24; pp. 19-26 *
Building Comparable Corpora Based on Bilingual LDA Model; Zhu Z, Li M, Chen L, et al.; ACL; 2013-08-09; pp. 282-287 *
A sentence similarity computation method integrating multiple features; Wu Quan'e; Xiong Hailing; Computer Systems & Applications; 2010-05-22; Vol. 19, No. 11; full text *

Also Published As

Publication number Publication date
CN106980664A (en) 2017-07-25

Similar Documents

Publication Publication Date Title
CN106980664B (en) A bilingual comparable corpus mining method and device
Asim et al. The use of ontology in retrieval: a study on textual, multilingual, and multimedia retrieval
CN102053991B (en) Method and system for multi-language document retrieval
US7788262B1 (en) Method and system for creating context based summary
US20090182547A1 (en) Adaptive Web Mining of Bilingual Lexicon for Query Translation
JP4066600B2 (en) Multilingual document search system
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
US20120278302A1 (en) Multilingual search for transliterated content
WO2005124599A2 (en) Content search in complex language, such as japanese
TWI656450B (en) Method and system for extracting knowledge from Chinese corpus
CN107193892B (en) A kind of document subject matter determines method and device
JP2010538374A (en) Resolving the same instructions in an ambiguous natural language processing system
Saenko et al. Unsupervised learning of visual sense models for polysemous words
JP2024091709A (en) Sentence preparation apparatus, sentence preparation method, and sentence preparation program
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
Jain et al. Context sensitive text summarization using k means clustering algorithm
TWI682286B (en) System for document searching using results of text analysis and natural language input
JP2021501387A (en) Methods, computer programs and computer systems for extracting expressions for natural language processing
Ferrés et al. PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles
Fauzi et al. Image understanding and the web: a state-of-the-art review
KR101037091B1 (en) Ontology-based Semantic Retrieval System and Method for Multilingual Authorized Headings through Automatic Language Translation
Hazman et al. An ontology based approach for automatically annotating document segments
Brenner et al. QMUL@ MediaEval 2012: Social Event Detection in Collaborative Photo Collections.
Poornima et al. Automatic Annotation of Educational Videos for Enhancing Information Retrieval.
Tsapatsoulis Web image indexing using WICE and a learning-free language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant