CN102375839A - Method and device for acquiring target data set from candidate data set, and translation machine - Google Patents

Info

Publication number
CN102375839A
Authority
CN
China
Prior art keywords
target data
data set
candidate
candidate data
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201010257678XA
Other languages
Chinese (zh)
Inventor
郑仲光
何中军
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201010257678XA
Publication of CN102375839A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method and an apparatus for obtaining a target data set from a candidate data set, and to a translation machine. Features are extracted from a target data sample, and the features are used to extract target data from the candidate data set, forming a target data set. According to embodiments of the invention, target data can be extracted from the candidate data set on the basis of the provided sample.

Description

Method and apparatus for obtaining a target data set from a candidate data set, and translation machine

Technical field

The present application relates to data extraction, and in particular to a method and an apparatus for obtaining a target data set from a candidate data set. The application further relates to a translation machine.

Background

Traditionally, specific target data are obtained from a candidate data set on the basis of a specific target data sample by manual selection, in which a person judges the similarity between the data in the candidate data set and the target data sample, or even by randomly picking some data from the candidate data set as target data. Such conventional approaches clearly cannot provide high-quality target data.

Summary of the invention

A brief summary of the invention is given below in order to provide a basic understanding of some of its aspects. This summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to delimit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description that follows.

According to an embodiment of the present application, features are extracted from a target data sample, and the features are used to extract target data from the candidate data set, thereby forming a target data set.

In this way, a subset is extracted from the candidate data set on the basis of a target data sample prepared for a specific purpose, and this subset forms the target data set. The target data set is generated more quickly, and it better meets the requirements of subsequent processing.

Brief description of the drawings

The invention can be better understood by referring to the description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals denote the same or similar parts throughout. The drawings, together with the detailed description below, are incorporated in and form part of this specification, and serve to further illustrate preferred embodiments of the invention and to explain its principles and advantages. In the drawings:

FIG. 1 shows a flowchart of a method for obtaining a target data set from a candidate data set according to one embodiment of the invention;

FIG. 2 shows a flowchart of a method for obtaining a target data set from a candidate data set according to another embodiment of the invention;

FIG. 3 shows a flowchart of a method for obtaining a target data set from a candidate data set according to another embodiment of the invention;

FIG. 4 shows a flowchart of a method for obtaining a target data set from a candidate data set according to another embodiment of the invention;

FIG. 5 shows a schematic structural diagram of an apparatus for obtaining a target data set from a candidate data set according to one embodiment of the invention;

FIG. 6 shows a schematic structural diagram of the extraction unit of an apparatus for obtaining a target data set from a candidate data set according to one embodiment of the invention;

FIG. 7 shows a schematic structural diagram of the extraction unit of an apparatus for obtaining a target data set from a candidate data set according to another embodiment of the invention;

FIG. 8 shows a schematic structural diagram of the extraction unit of an apparatus for obtaining a target data set from a candidate data set according to another embodiment of the invention; and

FIG. 9 shows a schematic block diagram of a computer that can be used to implement embodiments of the invention.

Detailed description

Exemplary embodiments of the invention are described below with reference to the accompanying drawings. For the sake of clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that many implementation-specific decisions may be made while developing any such actual implementation in order to achieve the developer's specific goals, and that these decisions may vary from one implementation to another.

It should also be noted that, to avoid obscuring the invention with unnecessary detail, the drawings show only the device structures closely related to the solution according to the invention and omit other details that have little to do with it.

First embodiment

FIG. 1 shows a flowchart of a method for obtaining a target data set from a candidate data set according to an embodiment of the present application. To obtain the target data set from the candidate data set, features are extracted from a target data sample in S110. The target data sample may comprise one or more data items, each of which consists of data elements. A data item may be a character string, a sentence or a set of pictures; correspondingly, a data element may be a character, a word or a picture. The features may be of any kind. As a non-limiting example, an extracted feature may consist of at least some of the data elements. If the target data sample is a sentence, for instance, its data elements are the words that make up the sentence, and an extracted feature is at least one word of that sentence. When several features are extracted from the target data sample, the weight of each feature is determined from the frequency with which the feature occurs in the target data sample: the more often a feature occurs in the sample, the higher its weight. The features with high weights are selected as the features of the target data sample.
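Step S110 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the sample is a list of English sentences, that features are single whitespace-separated words, and that the weight is the relative frequency of a word; the function name `extract_features` and the parameter `top_k` are hypothetical and do not appear in the application.

```python
from collections import Counter

def extract_features(sample_sentences, top_k=2):
    """Return the top_k most frequent words of the sample, mapped to
    their relative frequency (used here as the feature weight)."""
    words = [w.lower() for s in sample_sentences for w in s.split()]
    counts = Counter(words)
    total = sum(counts.values())
    # most_common() orders by descending count, so the highest-weight
    # features come first.
    return {w: c / total for w, c in counts.most_common(top_k)}

sample = [
    "relay switches motor",
    "motor drives pump",
    "controller starts motor relay",
]
features = extract_features(sample, top_k=2)
print(features)
```

A real system would probably tokenize more carefully and might filter very common function words; the text itself only specifies that weights follow frequency.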

In S120, the features are used to extract target data from the candidate data set, forming the target data set. The candidate data set may be a data set that covers the target data set, and it may contain pictures, text, corpus material and so on. The target data set to be formed is specific to the target data sample. In S120, candidate data that carry the features extracted from the target data sample in S110 are located in the candidate data set and form the target data set. The candidate data may likewise consist of data elements, which may again be characters, words or pictures.

With this method it is possible, for example, to extract relevant information (here the target data) from a large information base or collection (here the candidate data set) according to information about a field the user is interested in (here the target data sample), and thus to assemble user-specific information (here the target data set). A user interested in the computer field, for instance, can use an article about that field as the sample, search the large volume of information on the Internet for information related to the field, and have the information most closely related to the computer field delivered as the target data set.

Second embodiment

As can be seen from FIG. 2, this embodiment is a refinement of the embodiment shown in FIG. 1. For brevity, parts that have the same role and function as in FIG. 1 are not described again.

In S110, features are extracted from the target data sample. S110 in this embodiment is the same as S110 in the embodiment shown in FIG. 1 and is not described again here.

In S130, the features are used to query the candidate data set. The candidate data of the candidate data set are queried using the features of the target data sample extracted in S110 as information-retrieval keywords. A keyword may be a character, a word or a picture. In other words, each candidate data item in the candidate data set is compared against the extracted features in order to find the candidate data that carry them.

In S140, the target data are obtained according to the similarity between the retrieved candidate data and the target data sample. The similarity may be determined by how many of the features a retrieved candidate data item contains: the more features it contains, the more similar it is to the target sample, and the fewer it contains, the less similar it is. The frequency with which a feature occurs in the candidate data may also serve as the basis for the similarity evaluation: among several candidate data items carrying a feature, the more frequently the feature occurs in an item, the higher the similarity of that item. Several evaluation criteria may also be combined into an overall similarity evaluation. In addition, the similarity may be determined from the similarity score returned by an information-retrieval method.
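Steps S130 and S140 can be sketched as follows. The sketch assumes word features and uses the first similarity criterion mentioned above, the number of distinct features a candidate contains. The threshold `min_score` is an illustrative assumption; the text does not specify how the cut-off is chosen.

```python
def query(candidates, features):
    """S130 (sketch): keep candidates containing at least one feature."""
    return [c for c in candidates if any(f in c.split() for f in features)]

def similarity(candidate, features):
    """Number of distinct features that occur in the candidate."""
    words = set(candidate.split())
    return sum(1 for f in features if f in words)

def select_target_data(candidates, features, min_score=2):
    """S140 (sketch): rank retrieved candidates by similarity and keep
    those reaching min_score (a hypothetical threshold)."""
    hits = query(candidates, features)
    ranked = sorted(hits, key=lambda c: similarity(c, features), reverse=True)
    return [c for c in ranked if similarity(c, features) >= min_score]

features = ["motor", "relay", "voltage"]
candidates = [
    "the relay protects the motor",
    "the pump moves water",
    "motor voltage and relay voltage",
    "a relay clicks",
]
target_data = select_target_data(candidates, features)
print(target_data)
```

Swapping `similarity` for a frequency-based count, or for the score of a retrieval engine, would give the other criteria described in the text.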

Third embodiment

As can be seen from FIG. 3, this embodiment is a refinement of the embodiment shown in FIG. 1. For brevity, parts that have the same role and function as in FIG. 1 are not described again.

In S110, features are extracted from the target data sample. S110 in this embodiment is the same as S110 in the embodiment shown in FIG. 1 and is not described again here.

In S150, the candidate data in the candidate data set are clustered using the features of the target data sample extracted in S110.

In S160, a suitable class is selected as the target data according to the similarity between the classes produced by the clustering and the target data sample. A suitable class is one that, among the classes produced by the clustering, has a high similarity to the target data sample.

Fourth embodiment

As can be seen from FIG. 4, the embodiment shown there is a combination of the second and third embodiments.

In S110, features are extracted from the target data sample.

In S130, the features are used to query the candidate data set, and in S170 the similarity between the retrieved candidate data and the target data sample is determined.

In S150, the candidate data in the candidate data set are clustered using the features of the target data sample extracted in S110, and in S180 the similarity between the classes produced by the clustering and the target data sample is determined. For brevity, the parts of FIG. 4 that have the same role and function as in FIG. 2 and FIG. 3 are not described in detail; see the descriptions of FIG. 2 and FIG. 3.

In S190, the similarity between the classes produced by the clustering and the target data sample is compared with the similarity between the retrieved candidate data and the target data sample, and suitable candidate data are selected as the target data according to the result of the comparison. That is, the similarities to the target data sample of the candidate data obtained by the methods of FIG. 2 and FIG. 3 are compared, and the candidate data with high similarity are selected as the target data. This combines the advantages of the two methods and yields more accurate target data.
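The comparison in S190 is described only in general terms. One possible reading, offered here strictly as an assumption, is that each candidate carries one similarity score from the query branch (S170) and one from the clustering branch (S180, for example the score of the class it belongs to), that the better of the two scores is kept, and that candidates reaching a threshold are selected. The names `combine` and `threshold` are hypothetical.

```python
def combine(query_scores, cluster_scores, threshold):
    """Keep each candidate's best score from either branch and return
    the candidates whose best score reaches the threshold, best first."""
    best = {}
    for scores in (query_scores, cluster_scores):
        for candidate, score in scores.items():
            best[candidate] = max(score, best.get(candidate, 0.0))
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [candidate for candidate, score in ranked if score >= threshold]

# Illustrative scores in [0, 1]; not taken from the application.
query_scores = {"a": 0.9, "b": 0.4}
cluster_scores = {"b": 0.7, "c": 0.2}
selected = combine(query_scores, cluster_scores, threshold=0.5)
print(selected)
```

Other readings, such as requiring a high score from both branches, would fit the wording equally well.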

Fifth embodiment

In this embodiment, the method for obtaining a target data set from a candidate data set according to an embodiment of the present application is applied, by way of example, to a target data sample in text form. Features can then be extracted from the target data sample by means of interval n-gram phrases, n-gram phrases or a vocabulary. The candidate data set may correspondingly comprise several texts, each of which can be regarded as a set of sentences. When the target data sample is a single sentence, the sample can therefore be regarded as a text containing just one sentence. The description here uses texts as the example, but the method according to the embodiment of the present application applies equally to a target data sample consisting of a single sentence.

In the embodiment that queries with interval n-gram phrases or n-gram phrases, a window of size $S$ is applied to the target data sample, so that the sample is represented as $[w_1, w_2, \ldots, w_S]$. An n-gram phrase generated from the target data sample is written $P(w_1, w_2, \ldots, w_S)$, where $w_i$ is the $i$-th word in the window, $|P| \le n$, and $n$ is a natural number. The words of the phrase may or may not be contiguous: if they are contiguous the phrase is an n-gram phrase, otherwise it is an interval n-gram phrase. For example, representing the sentence "促进国家经济的稳定发展" ("promote the stable development of the national economy") with interval n-gram phrases and $n = 3$ yields "促进", "促进国家", "促进国家经济", "促进国家的", "促进国家稳定", "促进经济", "促进经济的" and "促进经济稳定", where 促进 means "promote", 国家 "nation", 经济 "economy", 稳定 "stable", and 的 is a grammatical particle. The candidate data set is queried with the n-gram phrases extracted from the target data sample, and the candidate data are determined according to an overall evaluation, thereby forming the target data.
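The difference between contiguous and interval n-gram phrases can be made concrete with a short sketch. It assumes that the text is already split into words and that every phrase starts at the first word of its window, which matches the listed example but is not stated explicitly in the application; the function names are hypothetical. The listed example also shows only some of the phrases such an enumeration produces.

```python
from itertools import combinations

def ngrams(words, n):
    """Contiguous phrases of 1 to n words."""
    return {
        tuple(words[i:i + k])
        for k in range(1, n + 1)
        for i in range(len(words) - k + 1)
    }

def interval_ngrams(words, n, window):
    """Phrases of at most n words drawn in order from each window of the
    given size; gaps between the chosen words are allowed."""
    phrases = set()
    for start in range(len(words)):
        win = words[start:start + window]
        head, rest = win[0], win[1:]
        # Choose 0 .. n-1 further words after the head, keeping order.
        for k in range(n):
            for combo in combinations(rest, k):
                phrases.add((head,) + combo)
    return phrases

words = ["promote", "national", "economic", "stability"]
contiguous = ngrams(words, 3)
with_gaps = interval_ngrams(words, 3, window=4)
print(sorted(with_gaps - contiguous))
```

The printed difference contains only phrases that skip at least one word, such as ("promote", "economic"), which a plain n-gram extraction would miss.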

In the embodiment that queries with a vocabulary, the vocabulary is formed from the target data sample: all words of the target data sample are sorted in descending order of their frequency of occurrence in the sample. The candidate data set can then be queried with groups of the highest-frequency words, or alternatively with groups of medium-frequency words, and the candidate data are determined according to the similarity, thereby forming the target data. The similarity may be determined, for example, by how many of the highest-frequency words a candidate contains.
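A minimal sketch of the vocabulary-based query follows. It assumes whitespace tokenization and models "highest-frequency" versus "medium-frequency" groups as a slice of the sorted vocabulary selected by the hypothetical parameters `start` and `group_size`.

```python
from collections import Counter

def build_vocabulary(sample_text):
    """Words of the sample in descending order of frequency."""
    counts = Counter(sample_text.split())
    return [word for word, _ in counts.most_common()]

def query_with_group(candidates, vocabulary, group_size, start=0):
    """Query with one group of vocabulary words and rank the hits by how
    many words of the group each candidate contains."""
    group = set(vocabulary[start:start + group_size])
    scored = [(len(group & set(c.split())), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored if score > 0]

vocabulary = build_vocabulary("relay motor relay voltage motor relay pump")
candidates = ["motor and relay wiring", "garden pump", "relay coil"]

top_hits = query_with_group(candidates, vocabulary, group_size=2)
mid_hits = query_with_group(candidates, vocabulary, group_size=2, start=2)
print(top_hits, mid_hits)
```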

Likewise, the candidate data set can be clustered by means of interval n-gram phrases, n-gram phrases or a vocabulary formed from the target data sample. In the embodiment that clusters the candidate data with interval n-gram phrases, for example, a feature set $F$ of n-gram phrases is formed from the target data sample. Every data item in the candidate data set is converted into a feature vector $V_s = \langle f_1 : w_1, f_2 : w_2, \ldots, f_m : w_m \rangle$, where $f_i$ is an n-gram phrase found in $F$, $w_i$ is the corresponding weight, and $i$ is a natural number. Preferably, the weight of a feature is determined from the frequency with which the feature occurs in the target data sample. All vectors together form a feature matrix. Clustering is performed, and the top $N$ classes are selected from the clustering result according to similarity in order to obtain the target data, where $N$ is a natural number that can be chosen empirically.
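The clustering variant can be sketched as follows. The application does not name a clustering algorithm; plain k-means with a deterministic initialisation is substituted here as an assumption, single words stand in for the n-gram phrases of $F$, and similarity to the sample is measured as squared distance between a class centroid and the sample's own weight vector. All function names are hypothetical.

```python
def to_vector(text, feature_weights):
    """Feature vector: the feature's weight if it occurs in the text, else 0."""
    words = set(text.split())
    return [w if f in words else 0.0 for f, w in feature_weights.items()]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors serve as initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:  # keep the old centroid if a class is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def select_top_clusters(candidates, feature_weights, k, top_n):
    """Cluster the candidates and return the members of the top_n classes
    whose centroids lie closest to the sample's weight vector."""
    vectors = [to_vector(c, feature_weights) for c in candidates]
    labels, centroids = kmeans(vectors, k)
    sample_vec = list(feature_weights.values())
    order = sorted(range(k), key=lambda j: sq_dist(centroids[j], sample_vec))
    chosen = set(order[:top_n])
    return [c for c, l in zip(candidates, labels) if l in chosen]

feature_weights = {"relay": 0.5, "motor": 0.3, "voltage": 0.2}
candidates = [
    "relay motor voltage wiring",
    "garden pump hose",
    "relay motor starter",
    "water pump",
]
selected = select_top_clusters(candidates, feature_weights, k=2, top_n=1)
print(selected)
```

The number of classes `k` and the value of `top_n` correspond to choices the text leaves to experience.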

The embodiment that clusters the candidate data with a vocabulary is similar to the embodiment described above that clusters them with interval n-gram phrases, and is not described again here.

Furthermore, in the embodiment that applies the method for obtaining a target data set from a candidate data set to the field of translation, the candidate data set is a translation corpus containing mutually corresponding corpus material in at least two languages. "Mutually corresponding" here means that the material in the two languages is a translation of each other, and "corpus material" means terms, sentences or articles. The target data set is a subset extracted from the translation corpus for a specific purpose. The target data sample is a text, prepared for that specific purpose, in at least one of the at least two languages, for example an article or a sentence written in one of those languages.

By way of example, the candidate data set is an English-Chinese corpus for the field of control, which covers electromechanical control, electrical control, mechanical control and so on, whereas the target data sample consists only of English material from the field of electrical control. With the method according to the embodiment of the present application, features can be extracted from the target data sample (here the English or Chinese material from the field of electrical control), and these features can be used to extract target data from the candidate data set (here the English-Chinese corpus for the control field), forming the target data set (here an English-Chinese corpus for the field of electrical control). The n-gram phrases, interval n-gram phrases or vocabulary mentioned above can be formed from the English (or Chinese) electrical-control material and used to query the English-Chinese control corpus, to cluster it, or both. The candidate data are then determined according to the similarity between the candidate data (here the English or the Chinese side of the English-Chinese control corpus) and the target data sample (correspondingly, the English or Chinese material), and form the target data set, that is, a bilingual English-Chinese corpus for the field of electrical control. A corpus for a specific purpose is thereby obtained.
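The bilingual case can be sketched as a filter over sentence pairs. The sketch assumes that the corpus is a list of (English, Chinese) pairs and that matching is done on the English side only, so that the Chinese side simply travels with its pair. The sentence pairs, the feature words and the threshold `min_hits` are illustrative and not taken from the application.

```python
def select_parallel_subset(corpus, features, min_hits=1):
    """Return the sentence pairs whose English side contains at least
    min_hits of the feature words."""
    subset = []
    for english, chinese in corpus:
        words = set(english.lower().split())
        hits = sum(1 for f in features if f in words)
        if hits >= min_hits:
            subset.append((english, chinese))
    return subset

# Hypothetical control-domain corpus.
corpus = [
    ("the relay opens the circuit", "继电器断开电路"),
    ("the gear transmits torque", "齿轮传递扭矩"),
    ("check the circuit voltage", "检查电路电压"),
]
# Hypothetical features extracted from an English electrical-control sample.
features = ["relay", "circuit", "voltage"]

electrical_subset = select_parallel_subset(corpus, features)
print(electrical_subset)
```

Because whole pairs are kept, a sample in only one language is enough to produce a bilingual target data set, which is the point made in the paragraph above.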

Those skilled in the art will understand that the candidate data set may also cover other language pairs, such as English-German or English-French. The target data sample and the candidate data set may also cover a broader scope than a technical field, for example a general-purpose dictionary.

Similarly, the method for obtaining a target data set from a candidate data set according to an embodiment of the present application can be applied to specialist dictionaries, in order to form a target data set that explains idioms, set phrases and technical terms. The difference from the translation application is that the candidate data set comprises a subset of terms and an explanatory corpus. The rest of the processing is the same as in the translation application and is not described again here.

It should also be noted that, although the method of the embodiments of the present application has been described here using only translation and explanation corpora as examples, those skilled in the art will understand that the embodiments are not limited to text processing. They can be applied in any field in which the data most relevant to a target data sample are selected from a candidate data set according to the features of that sample. If the target data sample is an image, for example, features are extracted from the image, and images carrying the extracted features are retrieved from the candidate data set, in this case an image library; several images may be retrieved.

Seventh embodiment

FIG. 5 shows a schematic structural diagram of an apparatus for obtaining a target data set from a candidate data set according to one embodiment of the present application. The apparatus has a feature extraction unit 510 for extracting features from a target data sample, and an extraction unit 520 for using the features to extract target data from the candidate data set in order to form the target data set. The feature extraction unit 510 supplies the extracted features to the extraction unit 520, which uses them to extract the target data from the candidate data set and form the target data set. As before, the target data sample may comprise one or more data items, each of which consists of data elements. A data item may be a character string, a sentence or a set of pictures; correspondingly, a data element may be a character, a word or a picture. The features may be of any kind. As a non-limiting example, an extracted feature may consist of at least some of the data elements. If the target data sample is a sentence, for instance, its data elements are the words that make up the sentence, and an extracted feature is at least one word of that sentence. When several features are extracted from the target data sample, the weight of each feature is determined from the frequency with which the feature occurs in the target data sample: the more often a feature occurs in the sample, the higher its weight. The features with high weights are selected as the features of the target data sample. The details of how the apparatus operates correspond to the flowcharts of the methods described with reference to FIGS. 1 to 4 and are not repeated here.
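The division of the apparatus into units 510 and 520 can be mirrored in software as two small classes. This is only a structural sketch under the assumption of word features and a simple containment test; the class and method names are hypothetical, and the application leaves open whether the units are realised in software, firmware or hardware.

```python
from collections import Counter

class FeatureExtractionUnit:
    """Corresponds to unit 510: extracts features from the sample."""
    def extract(self, sample, top_k=2):
        counts = Counter(sample.split())
        return [word for word, _ in counts.most_common(top_k)]

class ExtractionUnit:
    """Corresponds to unit 520: extracts target data using the features."""
    def extract(self, candidates, features):
        return [c for c in candidates
                if any(f in c.split() for f in features)]

class Apparatus:
    """Wires unit 510 to unit 520, as in FIG. 5."""
    def __init__(self):
        self.feature_unit = FeatureExtractionUnit()
        self.extraction_unit = ExtractionUnit()

    def run(self, sample, candidates):
        features = self.feature_unit.extract(sample)
        return self.extraction_unit.extract(candidates, features)

apparatus = Apparatus()
target_data_set = apparatus.run("relay motor relay",
                                ["motor starter", "garden hose"])
print(target_data_set)
```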

In another embodiment, which differs from the one above only in this respect, the extraction unit 520 further includes a query unit 521 and a generating unit 522, as shown in FIG. 6. The extraction unit 520 of this embodiment thus extracts candidate data and generates the target data set by means of information retrieval. The query unit 521 queries the candidate data set using the features. The generating unit 522 obtains the target data according to the similarity between the candidate data retrieved by the query unit 521 and the target data sample. For the detailed operation of the query unit 521 and the generating unit 522, see the method embodiments described above.

In another embodiment, which differs from the one above only in this respect, the extraction unit 520 includes a clustering unit 523 and a generating unit 522, as shown in FIG. 7. The extraction unit 520 of this embodiment thus extracts candidate data and generates the target data set by means of clustering. The clustering unit 523 clusters the data in the candidate data set using the features. The generating unit 522 selects a suitable class as the target data according to the similarity between the classes produced by the clustering and the target data sample. For the detailed operation of the clustering unit 523 and the generating unit 522, see the method embodiments described above.

In a further embodiment, the extraction unit 520 includes a query unit 521, a clustering unit 523, a comparing unit 524 and a generating unit 522, as shown in FIG. 8. The extraction unit 520 of this embodiment extracts candidate data by both information retrieval and clustering, and generates the target data set by comparing the similarities to the target data sample of the candidate data obtained in these two ways. In other words, it combines the information-retrieval and clustering approaches described above. The clustering unit 523 clusters the candidate data in the candidate data set using the features. The comparing unit 524 compares the similarity to the target data sample of the classes produced by the clustering with that of the retrieved candidate data. The generating unit 522 is configured to select suitable candidate data as the target data according to the result of the comparison.

In one embodiment, the device according to the embodiments of the present application is applied to the field of translation. In this case, the candidate data set is a translation corpus that contains mutually corresponding corpora in at least two languages, the target data set is a subset extracted from the translation corpus for a specific purpose, and the target data sample is text, prepared for that specific purpose, in at least one of the at least two languages. The details of how the device operates are the same as the process described in the fifth embodiment and are not repeated here.
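For the translation use case, a rough sketch is given below. It assumes the corpus is a list of (source sentence, target sentence) pairs, that the sample is text in the source language, that features are single words weighted by their count in the sample, and that the top `top_k` scoring pairs form the subset; `extract_subcorpus` and `top_k` are hypothetical names, and the example sentences are invented.

```python
from collections import Counter


def extract_subcorpus(sample_text, corpus, top_k=2):
    """Select sentence pairs whose source side best matches the sample.

    corpus: list of (source_sentence, target_sentence) pairs.
    Returns at most top_k pairs with a non-zero score, best first.
    """
    # Feature weights: how often each word occurs in the sample (an assumption).
    weights = Counter(sample_text.lower().split())

    def score(pair):
        source_tokens = set(pair[0].lower().split())
        return sum(w for feature, w in weights.items() if feature in source_tokens)

    ranked = sorted(corpus, key=score, reverse=True)
    return [pair for pair in ranked[:top_k] if score(pair) > 0]


corpus = [
    ("the weather is nice", "天气很好"),
    ("the patent describes a device", "该专利描述了一种装置"),
    ("claims are listed", "列出了权利要求"),
]
subset = extract_subcorpus("patent device claims patent", corpus)
```

Because whole pairs are selected, the extracted subset stays aligned across both languages even though only one side is matched against the sample.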

An embodiment of the present application also provides a translation machine having a corpus database that contains mutually corresponding corpora in at least two languages, the corpus database including a target data set obtained by the method according to an embodiment of the present application.

In addition, an embodiment of the present application provides a translation machine having the device according to an embodiment of the present application.

The feature extraction unit and the extraction unit of the device for obtaining a target data set from a candidate data set may be configured by software, firmware, hardware, or a combination thereof. The specific means or manners of configuration are well known to those skilled in the art and are not described here. When the units are implemented by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 900 shown in FIG. 9); with the various programs installed, the computer can perform various functions.

In FIG. 9, a central processing unit (CPU) 901 executes various processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores, as needed, data required when the CPU 901 executes the various processes. The CPU 901, the ROM 902 and the RAM 903 are connected to one another via a bus 904. An input/output interface 905 is also connected to the bus 904.

The following components are connected to the input/output interface 905: an input section 906 (including a keyboard, a mouse, and the like), an output section 907 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like), a storage section 908 (including a hard disk and the like), and a communication section 909 (including a network interface card such as a LAN card, a modem, and the like). The communication section 909 performs communication processing via a network such as the Internet. A drive 910 may also be connected to the input/output interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it is installed into the storage section 908 as needed.

When the series of processes described above is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 911.

Those skilled in the art will understand that such a storage medium is not limited to the removable medium 911 shown in FIG. 9, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 911 include magnetic disks (including floppy disks (registered trademark)), optical disks (including compact disc read-only memory (CD-ROM) and digital versatile discs (DVD)), magneto-optical disks (including MiniDisc (MD) (registered trademark)), and semiconductor memories. Alternatively, the storage medium may be the ROM 902, a hard disk contained in the storage section 908, or the like, in which the program is stored and which is distributed to the user together with the device that contains it.

The present invention also provides a program product storing machine-readable instruction codes. When the instruction codes are read and executed by a machine, the method according to the embodiments of the present invention described above can be performed.

Accordingly, a storage medium carrying the above program product storing machine-readable instruction codes is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick, and the like.

Finally, it should also be noted that the terms "comprise", "include", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Furthermore, absent further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

Although the embodiments of the present invention have been described in detail above with reference to the accompanying drawings, it should be understood that the embodiments described above merely illustrate the present invention and do not limit it. Those skilled in the art can make various modifications and changes to the above embodiments without departing from the spirit and scope of the present invention. Accordingly, the scope of the present invention is defined only by the appended claims and their equivalents.

As can be seen from the above description, the embodiments of the present invention provide the following solutions:

Supplementary Note 1. A method for obtaining a target data set from a candidate data set, comprising:

extracting features from a target data sample; and

using the features to extract target data from the candidate data set to form the target data set.

Supplementary Note 2. The method according to Supplementary Note 1, wherein

extracting target data from the candidate data set comprises: querying the candidate data set using the features, and obtaining the target data according to the similarity between the retrieved candidate data and the target data sample.

Supplementary Note 3. The method according to Supplementary Note 1, wherein

extracting target data from the candidate data set comprises: clustering the candidate data in the candidate data set using the features, and selecting a suitable cluster as the target data according to the similarity between each cluster produced by the clustering and the target data sample.

Supplementary Note 4. The method according to Supplementary Note 2, wherein

extracting target data from the candidate data set comprises: clustering the data in the candidate data set using the features, comparing the similarity between the clusters produced by the clustering and the target data sample with the similarity between the retrieved candidate data and the target data sample, and selecting suitable candidate data as the target data according to the comparison result.

Supplementary Note 5. The method according to any one of Supplementary Notes 1 to 4, wherein

each item of the data comprises data elements, and at least some of the data elements constitute the features.

Supplementary Note 6. The method according to any one of Supplementary Notes 1 to 4, wherein

the weight of a feature is determined based on the frequency with which the feature occurs in the target data sample.

Supplementary Note 7. The method according to any one of Supplementary Notes 1 to 4, wherein the data elements are characters, words or pictures and, correspondingly, the data comprises character strings, sentences or picture sets.

Supplementary Note 8. The method according to any one of Supplementary Notes 1 to 4, wherein the candidate data set is a translation corpus that contains mutually corresponding corpora in at least two languages, the target data set is a subset extracted from the translation corpus for a specific purpose, and the target data sample is text, prepared for the specific purpose, in at least one of the at least two languages.

Supplementary Note 9. The method according to Supplementary Note 8, wherein the features are n-grams, gapped n-grams, or a vocabulary list.

Supplementary Note 10. A device for obtaining a target data set from a candidate data set, comprising:

a feature extraction unit configured to extract features from a target data sample; and

an extraction unit configured to use the features to extract target data from the candidate data set to form the target data set.

Supplementary Note 11. The device according to Supplementary Note 10, wherein

the extraction unit comprises a query unit and a generation unit, the query unit queries the candidate data set using the features, and the generation unit obtains the target data according to the similarity between the candidate data retrieved by the query unit and the target data sample.

Supplementary Note 12. The device according to Supplementary Note 10, wherein

the extraction unit comprises a clustering unit and a generation unit, the clustering unit clusters the data in the candidate data set using the features, and the generation unit selects a suitable cluster as the target data according to the similarity between each cluster produced by the clustering and the target data sample.

Supplementary Note 13. The device according to Supplementary Note 11, wherein

the extraction unit further comprises a clustering unit and a comparison unit, the clustering unit clusters the candidate data in the candidate data set using the features, the comparison unit compares the similarity between the clusters produced by the clustering and the target data sample with the similarity between the retrieved candidate data and the target data sample, and the generation unit is configured to select suitable candidate data as the target data according to the comparison result.

Supplementary Note 14. The device according to any one of Supplementary Notes 10 to 13, wherein

each item of the data comprises data elements, and at least some of the data elements constitute the features.

Supplementary Note 15. The device according to any one of Supplementary Notes 10 to 13, wherein

the feature extraction unit is configured to determine the weight of a feature based on the frequency with which the feature occurs in the target data sample.

Supplementary Note 16. The device according to any one of Supplementary Notes 10 to 13, wherein the data elements are characters, words or pictures and, correspondingly, the data comprises character strings, sentences or picture sets.

Supplementary Note 17. The device according to any one of Supplementary Notes 10 to 13, wherein the candidate data set is a translation corpus that contains mutually corresponding corpora in at least two languages, the target data set is a subset extracted from the translation corpus for a specific purpose, and the target data sample is text, prepared for the specific purpose, in at least one of the at least two languages.

Supplementary Note 18. The device according to Supplementary Note 17, wherein the features are n-grams, gapped n-grams, or a vocabulary list.

Supplementary Note 19. A translation machine having a corpus database that contains mutually corresponding corpora in at least two languages, the corpus database including a target data set obtained by the method according to any one of Supplementary Notes 1 to 9.

Supplementary Note 20. A translation machine having the device according to any one of Supplementary Notes 10 to 18.
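The feature types and frequency-based weights named in the supplementary notes (n-grams, gapped n-grams, weights derived from frequency in the target data sample) can be sketched as follows. The application gives no formulas, so two points are assumptions: a gapped bigram is taken to be a pair of words separated by a fixed number of skipped words, and a weight is taken to be the feature's relative frequency in the sample. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-word sequences in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def gapped_bigrams(tokens, gap=1):
    """Word pairs separated by exactly `gap` skipped words (assumed definition)."""
    step = gap + 1
    return [(tokens[i], tokens[i + step]) for i in range(len(tokens) - step)]


def weighted_features(sample_sentences, n=2):
    """Map each n-gram of the sample to its relative frequency (assumed weighting)."""
    counts = Counter()
    for sentence in sample_sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    total = sum(counts.values())
    return {feature: c / total for feature, c in counts.items()} if total else {}
```

The resulting dictionary can serve directly as a weighted query against the candidate data set, or as the feature space in which candidates are clustered.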

Claims (10)

1. A method for obtaining a target data set from a candidate data set, comprising:

extracting features from a target data sample; and

using the features to extract target data from the candidate data set to form the target data set.

2. The method according to claim 1, wherein

extracting target data from the candidate data set comprises: querying the candidate data set using the features, and obtaining the target data according to the similarity between the retrieved candidate data and the target data sample.

3. The method according to claim 1, wherein

extracting target data from the candidate data set comprises: clustering the candidate data in the candidate data set using the features, and selecting a suitable cluster as the target data according to the similarity between each cluster produced by the clustering and the target data sample.

4. A device for obtaining a target data set from a candidate data set, comprising:

a feature extraction unit configured to extract features from a target data sample; and

an extraction unit configured to use the features to extract target data from the candidate data set to form the target data set.

5. The device according to claim 4, wherein

the extraction unit comprises a query unit and a generation unit, the query unit queries the candidate data set using the features, and the generation unit obtains the target data according to the similarity between the candidate data retrieved by the query unit and the target data sample.

6. The device according to claim 4, wherein

the extraction unit comprises a clustering unit and a generation unit, the clustering unit clusters the data in the candidate data set using the features, and the generation unit selects a suitable cluster as the target data according to the similarity between each cluster produced by the clustering and the target data sample.

7. The device according to claim 5, wherein

the extraction unit further comprises a clustering unit and a comparison unit, the clustering unit clusters the candidate data in the candidate data set using the features, the comparison unit compares the similarity between the clusters produced by the clustering and the target data sample with the similarity between the retrieved candidate data and the target data sample, and the generation unit is configured to select suitable candidate data as the target data according to the comparison result.

8. The device according to any one of claims 4 to 7, wherein

each item of the data comprises data elements, and at least some of the data elements constitute the features.

9. A translation machine having a corpus database that contains mutually corresponding corpora in at least two languages, the corpus database including a target data set obtained by the method according to any one of claims 1 to 3.

10. A translation machine having the device according to any one of claims 4 to 8.
CN201010257678XA 2010-08-17 2010-08-17 Method and device for acquiring target data set from candidate data set, and translation machine Pending CN102375839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010257678XA CN102375839A (en) 2010-08-17 2010-08-17 Method and device for acquiring target data set from candidate data set, and translation machine

Publications (1)

Publication Number Publication Date
CN102375839A true CN102375839A (en) 2012-03-14

Family

ID=45794462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010257678XA Pending CN102375839A (en) 2010-08-17 2010-08-17 Method and device for acquiring target data set from candidate data set, and translation machine

Country Status (1)

Country Link
CN (1) CN102375839A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040068493A1 (en) * 2002-10-04 2004-04-08 International Business Machines Corporation Data retrieval method, system and program product
CN1801140A (en) * 2004-12-30 2006-07-12 中国科学院自动化研究所 Method and apparatus for automatic acquisition of machine translation template
CN101008943A (en) * 2006-01-23 2007-08-01 富士施乐株式会社 Word alignment apparatus, example sentence bilingual dictionary, word alignment method, and program product for word alignment
CN101571852A (en) * 2008-04-28 2009-11-04 富士通株式会社 Dictionary generating device and information retrieving device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832305A (en) * 2020-07-03 2020-10-27 广州小鹏车联网科技有限公司 A method, device, server and medium for identifying user intent
CN111832305B (en) * 2020-07-03 2023-08-25 北京小鹏汽车有限公司 User intention recognition method, device, server and medium
CN114781409A (en) * 2022-05-12 2022-07-22 北京百度网讯科技有限公司 Text translation method and device, electronic equipment and storage medium
CN114781409B (en) * 2022-05-12 2023-12-01 北京百度网讯科技有限公司 Text translation method, device, electronic equipment and storage medium
CN114970763A (en) * 2022-06-23 2022-08-30 西南交通大学 Radiation source working mode identification method based on transfer learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120314