CN112417896B

CN112417896B - Domain data acquisition method, machine translation method and related equipment

Info

Publication number: CN112417896B
Application number: CN202011210710.9A
Authority: CN
Inventors: 宋锐; 张为泰; 刘丹; 刘俊华; 魏思
Original assignee: iFlytek Co Ltd; University of Science and Technology of China USTC
Current assignee: iFlytek Co Ltd; University of Science and Technology of China USTC
Priority date: 2020-11-03
Filing date: 2020-11-03
Publication date: 2024-02-02
Anticipated expiration: 2040-11-03
Also published as: CN112417896A

Abstract

The application provides a domain data acquisition method, a machine translation method and related equipment. The domain data acquisition method determines a first target value for each piece of training corpus in a general-domain training corpus set. Because the first target value represents the degree to which the corresponding training corpus matches a specified domain, training corpora of the specified domain can be screened out of the general-domain training corpus set based on these values. On this basis, the application also provides a machine translation method that translates text of the specified domain using a pre-established domain translation model. Because the domain translation model is obtained by fine-tuning a general translation model with a large amount of training corpora of the specified domain, it is adapted to the specified domain, and translating specified-domain text with it yields relatively accurate results.

Description

A domain data acquisition method, machine translation method and related equipment

Technical field

This application relates to the technical field of natural language processing, and in particular to a domain data acquisition method, a machine translation method and related equipment.

Background

Communication across languages is a major challenge when groups speaking different languages interact, and barrier-free communication at any time, in any place and in any language is a long-standing human aspiration. The traditional language service industry addresses language barriers with escort interpretation, consecutive interpretation and simultaneous interpretation, but limited manpower and cost mean it cannot meet ordinary people's need to communicate across languages.

Machine translation is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). It greatly reduces translation time and improves efficiency, meets the translation needs of time-sensitive content such as news and of massive volumes of text, and greatly lowers labor costs. More importantly, it makes cross-language communication an ability available to everyone, so that language differences are no longer an obstacle to obtaining information and services.

Translation tasks sometimes arise in specific domains. However, most current machine translation methods are based on general-domain translation models, and their accuracy is low when they are used to translate text of a specific domain, so a translation model for the specific domain needs to be built. Building such a model usually requires training corpora of that domain. In some domains, however, training corpora are difficult to collect and therefore scarce, and with too few training corpora it is hard to build a domain-specific translation model that performs well. A method for obtaining training corpora of a specific domain is therefore urgently needed.

Summary of the invention

In view of this, the application provides a domain data acquisition method, a machine translation method and related equipment, which screen training corpora of a specified domain out of a general-domain training corpus set, so that a domain translation model with high translation accuracy can be built from the training corpora of the specified domain and text of the specified domain can then be translated accurately. The technical solution is as follows:

A domain data acquisition method, including:

Obtaining a general-domain training corpus set and an initial training corpus set of a specified domain;

Establishing a general translation model using the general-domain training corpus set;

Determining, based on the general translation model and the initial training corpus set of the specified domain, first target values corresponding to the training corpora in the general-domain training corpus set, wherein each piece of training corpus corresponds to one first target value, and the first target value represents the degree to which the corresponding training corpus matches the specified domain;

Screening training corpora of the specified domain out of the general-domain training corpus set based on the first target values corresponding to the training corpora in the general-domain training corpus set.

Optionally, the domain data acquisition method further includes:

Establishing a general language model using the general-domain training corpus set, and establishing a domain language model using the initial training corpus set of the specified domain;

Determining, based on the general language model and the domain language model, second target values corresponding to the training corpora in the general-domain training corpus set, wherein each piece of training corpus corresponds to one second target value, and the second target value represents the degree to which the corresponding training corpus is relevant to the specified domain;

In this case, screening training corpora of the specified domain out of the general-domain training corpus set based on the first target values includes:

Screening training corpora of the specified domain out of the general-domain training corpus set based on both the first target values and the second target values corresponding to the training corpora in the general-domain training corpus set.

Optionally, determining the first target values corresponding to the training corpora in the general-domain training corpus set includes:

Screening candidate training corpora out of the general-domain training corpus set based on the second target values corresponding to the training corpora in the general-domain training corpus set;

Determining the first target value corresponding to each candidate training corpus screened out;

Screening training corpora of the specified domain out of the candidate training corpora based on the first target value corresponding to each candidate training corpus.
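The two-stage variant above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `two_stage_filter`, the cut-off parameters `num_candidates` and `num_selected`, and the toy scores are all hypothetical. It also assumes that a higher second target value means more relevant and that a lower first target value means a better match (as with a distance); the text does not fix either convention at this point.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def two_stage_filter(
    corpus: List[T],
    second_value: Callable[[T], float],
    first_value: Callable[[T], float],
    num_candidates: int,
    num_selected: int,
) -> List[T]:
    """Screen a corpus in two stages (hypothetical helper)."""
    # Stage 1: rank every item by its second target value and keep the
    # top `num_candidates` (assumption: higher means more relevant).
    candidates = sorted(corpus, key=second_value, reverse=True)[:num_candidates]

    # Stage 2: compute the first target value for the candidates only and
    # keep the `num_selected` best (assumption: lower means a better match).
    scored = [(first_value(item), item) for item in candidates]
    scored.sort(key=lambda pair: pair[0])
    return [item for _, item in scored[:num_selected]]


# Toy data: made-up scores keyed by item id, purely for illustration.
second_scores = {"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.7}
first_scores = {"a": 0.5, "b": 0.0, "c": 0.2, "d": 0.3}

selected = two_stage_filter(
    corpus=["a", "b", "c", "d"],
    second_value=second_scores.__getitem__,
    first_value=first_scores.__getitem__,
    num_candidates=3,
    num_selected=2,
)
print(selected)  # ['c', 'd']
```

Item `b` is dropped in stage 1 even though its first target value is the lowest, which shows that the first target value is only ever computed for the candidates that survive the second-target-value screen.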

Optionally, determining, based on the general translation model and the initial training corpus set of the specified domain, the first target value corresponding to a target training corpus in the general-domain training corpus set includes:

Determining the gradient of each piece of training corpus in the initial training corpus set of the specified domain on the general translation model;

Calculating the average of the gradients of all pieces of training corpus in the initial training corpus set of the specified domain on the general translation model, as the domain gradient average;

Determining the gradient of the target training corpus on the general translation model;

Calculating the distance between the gradient of the target training corpus on the general translation model and the domain gradient average, as the first target value corresponding to the target training corpus.

Optionally, each piece of training corpus in the general-domain training corpus set and each piece of training corpus in the initial training corpus set of the specified domain includes a source language text and the corresponding target language text;

Establishing the general language model using the general-domain training corpus set and establishing the domain language model using the initial training corpus set of the specified domain includes:

Training a language model on the source language texts in the general-domain training corpus set, the trained language model serving as the source-side general language model;

Training a language model on the target language texts in the general-domain training corpus set, the trained language model serving as the target-side general language model;

Training a language model on the source language texts in the initial training corpus set of the specified domain, the trained language model serving as the source-side domain language model;

Training a language model on the target language texts in the initial training corpus set of the specified domain, the trained language model serving as the target-side domain language model.

Optionally, determining, based on the general language model and the domain language model, the second target value corresponding to a target training corpus in the general-domain training corpus set includes:

Calculating the posterior probabilities of the target training corpus on the general language model and on the domain language model respectively;

Determining the second target value corresponding to the target training corpus from the calculated posterior probabilities.

Optionally, calculating the posterior probabilities of the target training corpus on the general language model and on the domain language model respectively includes:

Calculating the posterior probabilities of the target training corpus on the source-side general language model, the source-side domain language model, the target-side general language model and the target-side domain language model respectively;

Determining the second target value corresponding to the target training corpus from the calculated posterior probabilities includes:

Determining the score of the target training corpus on the source-side language models from its posterior probability on the source-side general language model and its posterior probability on the source-side domain language model;

Determining the score of the target training corpus on the target-side language models from its posterior probability on the target-side general language model and its posterior probability on the target-side domain language model;

Taking the score of the target training corpus on the source-side language models and its score on the target-side language models as the second target value corresponding to the target training corpus.
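The four-language-model scoring above can be illustrated with a small sketch. Several things here are assumptions rather than content of this section: the text does not say how the general and domain probabilities are combined into a score, so the sketch uses the difference of length-normalised log-probabilities (domain minus general), one common choice for this kind of data selection; the add-one-smoothed unigram model `UnigramLM` is only a stand-in for whatever language model is actually trained; and all names and toy sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

Sentence = List[str]


class UnigramLM:
    """Toy add-one-smoothed unigram language model (stand-in for a real LM)."""

    def __init__(self, sentences: List[Sentence], vocab_size: int) -> None:
        self.counts = Counter(tok for sent in sentences for tok in sent)
        self.total = sum(self.counts.values())
        self.vocab_size = vocab_size

    def log_prob(self, sentence: Sentence) -> float:
        """Length-normalised log-probability of a tokenised sentence."""
        if not sentence:
            return 0.0
        denom = self.total + self.vocab_size
        total = sum(math.log((self.counts[tok] + 1) / denom) for tok in sentence)
        return total / len(sentence)


def build_lms(general: List[Sentence], domain: List[Sentence]) -> Tuple[UnigramLM, UnigramLM]:
    """Train a (general, domain) language model pair for one language side."""
    vocab = {tok for sent in general + domain for tok in sent}
    return UnigramLM(general, len(vocab)), UnigramLM(domain, len(vocab))


def side_score(sentence: Sentence, lms: Tuple[UnigramLM, UnigramLM]) -> float:
    """Assumed combination: domain log-prob minus general log-prob."""
    general_lm, domain_lm = lms
    return domain_lm.log_prob(sentence) - general_lm.log_prob(sentence)


def second_target_value(src: Sentence, tgt: Sentence, src_lms, tgt_lms) -> Tuple[float, float]:
    """Return (source-side score, target-side score) for one text pair."""
    return side_score(src, src_lms), side_score(tgt, tgt_lms)


# Toy corpora (made up): a mixed general set and a small medical set.
src_lms = build_lms(
    general=[["the", "cat", "sat"], ["stocks", "fell", "today"], ["the", "patient", "has", "fever"]],
    domain=[["the", "patient", "has", "fever"], ["the", "doctor", "prescribed", "medicine"]],
)
tgt_lms = build_lms(
    general=[["le", "chat", "dort"], ["les", "actions", "baissent"], ["le", "patient", "a", "de", "la", "fievre"]],
    domain=[["le", "patient", "a", "de", "la", "fievre"], ["le", "medecin", "prescrit", "un", "medicament"]],
)

medical = second_target_value(["patient", "fever"], ["patient", "fievre"], src_lms, tgt_lms)
finance = second_target_value(["stocks", "fell"], ["actions", "baissent"], src_lms, tgt_lms)
print(medical, finance)
```

With this convention, the medical pair scores higher than the finance pair on both the source side and the target side, so a higher score indicates greater relevance to the specified domain.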

A machine translation method for a specified domain, including:

Obtaining a source language text of the specified domain to be translated;

Inputting the source language text to be translated into a pre-established domain translation model to obtain the corresponding target language text;

Wherein the domain translation model is obtained by adjusting a general translation model using the initial training corpus set of the specified domain together with training corpora obtained from the general-domain training corpus set by any of the domain data acquisition methods described above, and the general translation model is trained on the general-domain training corpus set.

Optionally, the process of adjusting the general translation model using the initial training corpus set of the specified domain and the training corpora screened out of the general-domain training corpus set includes:

Mixing the training corpora in the initial training corpus set of the specified domain with the training corpora screened out of the general-domain training corpus set;

Adjusting the general translation model using the mixed training corpora.

Optionally, the process of adjusting the general translation model using the initial training corpus set of the specified domain and the training corpora screened out of the general-domain training corpus set further includes:

Further adjusting the translation model obtained with the mixed training corpora, using the training corpora in the initial training corpus set of the specified domain.
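The adjustment schedule above (mix, adjust, then optionally adjust again on the initial domain set alone) can be sketched as follows. This is only an outline under stated assumptions: `adapt_to_domain` and its parameters are hypothetical names, `fine_tune` stands for whatever training routine is actually used, and shuffling the mixed set is an illustrative choice that the text does not prescribe.

```python
import random
from typing import Callable, List, Tuple, TypeVar

Pair = Tuple[str, str]  # (source language text, target language text)
Model = TypeVar("Model")


def adapt_to_domain(
    general_model: Model,
    initial_domain_set: List[Pair],
    selected_from_general: List[Pair],
    fine_tune: Callable[[Model, List[Pair]], Model],
    second_stage: bool = True,
    seed: int = 0,
) -> Model:
    """Adjust a general translation model towards the specified domain."""
    # Mix the initial domain corpora with the corpora screened out of the
    # general-domain set (shuffling is an assumption made for this sketch).
    mixed = list(initial_domain_set) + list(selected_from_general)
    random.Random(seed).shuffle(mixed)

    # First adjustment: fine-tune on the mixed corpora.
    model = fine_tune(general_model, mixed)

    # Optional further adjustment: fine-tune again on the initial set only.
    if second_stage:
        model = fine_tune(model, list(initial_domain_set))
    return model


def record_stage(model: List[int], data: List[Pair]) -> List[int]:
    """Stub trainer for illustration: the 'model' is a log of batch sizes."""
    return model + [len(data)]


history = adapt_to_domain(
    general_model=[],
    initial_domain_set=[("src-1", "tgt-1")],
    selected_from_general=[("src-2", "tgt-2"), ("src-3", "tgt-3")],
    fine_tune=record_stage,
)
print(history)  # [3, 1]
```

The stub trainer shows the order of the two stages: three mixed pairs first, then the single initial domain pair.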

A domain data acquisition apparatus, including: a data acquisition module, a general translation model establishment module, a first target value determination module and a data screening module;

The data acquisition module is configured to obtain a general-domain training corpus set and an initial training corpus set of a specified domain;

The general translation model establishment module is configured to establish a general translation model using the general-domain training corpus set;

The first target value determination module is configured to determine, based on the general translation model and the initial training corpus set of the specified domain, first target values corresponding to the training corpora in the general-domain training corpus set, wherein each piece of training corpus corresponds to one first target value, and the first target value represents the degree to which the corresponding training corpus matches the specified domain;

The data screening module is configured to screen training corpora of the specified domain out of the general-domain training corpus set based on the first target values corresponding to the training corpora in the general-domain training corpus set.

Optionally, when determining, based on the general translation model and the initial training corpus set of the specified domain, the first target value corresponding to a target training corpus in the general-domain training corpus set, the first target value determination module is specifically configured to: determine the gradient of each piece of training corpus in the initial training corpus set of the specified domain on the general translation model; calculate the average of these gradients as the domain gradient average; determine the gradient of the target training corpus on the general translation model; and calculate the distance between the gradient of the target training corpus on the general translation model and the domain gradient average, as the first target value corresponding to the target training corpus.

A machine translation apparatus for a specified domain, including: a source language text acquisition module and a translation module;

The source language text acquisition module is configured to obtain a source language text of the specified domain to be translated;

The translation module is configured to input the source language text to be translated into a pre-established domain translation model to obtain the corresponding target language text;

Wherein the domain translation model is obtained by adjusting a general translation model using the initial training corpus set of the specified domain together with training corpora screened out of the general-domain training corpus set by any of the domain data acquisition apparatuses described above, and the general translation model is trained on the general-domain training corpus set.

A domain data screening device, including: a memory and a processor;

The memory is configured to store a program;

The processor is configured to execute the program to implement the steps of any of the domain data acquisition methods described above.

A readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of any of the domain data acquisition methods described above are implemented.

As the above solution shows, the domain data acquisition method provided by this application first obtains a general-domain training corpus set and an initial training corpus set of a specified domain, then establishes a general translation model using the general-domain training corpus set, then determines, based on the general translation model and the initial training corpus set of the specified domain, the first target values corresponding to the training corpora in the general-domain training corpus set, and finally screens training corpora of the specified domain out of the general-domain training corpus set based on those first target values. The method thus obtains training corpora of the specified domain from a general-domain training corpus set. On this basis, the application also provides a machine translation method that translates text of the specified domain using a pre-established domain translation model. Because the domain translation model is obtained by fine-tuning the general translation model with a large amount of training corpora of the specified domain, it is adapted to the specified domain, and translating specified-domain text with it yields relatively accurate results.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic flowchart of a domain data acquisition method provided by an embodiment of the present application;

Figure 2 is a schematic flowchart of another domain data acquisition method provided by an embodiment of the present application;

Figure 3 is a schematic flowchart of yet another domain data acquisition method provided by an embodiment of the present application;

Figure 4 is a schematic flowchart, provided by an embodiment of the present application, of adjusting the general translation model using the initial training corpus set of the specified domain and the training corpora screened out of the general-domain training corpus set;

Figure 5 is a schematic diagram, provided by an embodiment of the present application, of using a KL regularization constraint to prevent the model from drifting;

Figure 6 is a schematic structural diagram of a domain data acquisition apparatus provided by an embodiment of the present application;

Figure 7 is a schematic structural diagram of a domain data acquisition device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Because training corpora of a specific domain are difficult to collect, the inventors considered, in order to obtain enough training corpora, screening training corpora of the specific domain out of general-domain training corpora, so that the screened training corpora and the training corpora collected directly from the specific domain can together serve as the training corpora for building a domain translation model.

To screen training corpora of a specific domain out of general-domain training corpora, the inventors conducted in-depth research and arrived at a solution. Its general idea is to determine the degree to which each piece of training corpus in the general-domain training corpus set matches the specific domain and, based on that degree of matching, to screen training corpora of the specific domain out of the general-domain training corpus set.

Building on this domain data acquisition scheme, the inventors also provide a machine translation method for a specific domain. Its general idea is to build a general-domain translation model from the general-domain training corpus set, and then fine-tune that model with the training corpora collected in the specific domain together with the training corpora screened out of the general-domain training corpus set, thereby obtaining a translation model for the specific domain. Inputting a source language text of the specific domain into this model yields the corresponding target language text.

The domain data acquisition method and machine translation method provided by this application can be applied to a terminal with data processing capability, to a single server, or to a server cluster composed of multiple servers. The methods are introduced through the following embodiments.

First embodiment

Referring to Figure 1, which shows a schematic flowchart of the domain data acquisition method provided by an embodiment of the present application, the method may include:

Step S101: Obtain a general-domain training corpus set and an initial training corpus set of a specified domain.

Each piece of training corpus in the general-domain training corpus set and in the initial training corpus set of the specified domain includes a source language text and its corresponding target language text; that is, each piece of training corpus is a text pair.

It should be noted that the general-domain training corpus set contains training corpora from multiple domains mixed together, while the training corpora in the initial training corpus set of the specified domain are collected directly from the specified domain. In this embodiment, the general-domain training corpus set includes training corpora of the specified domain. The aim of this embodiment is to screen those training corpora out of the general-domain training corpus set, so that they can be combined with the training corpora in the initial training corpus set of the specified domain to form the training corpus set of the specified domain.

Step S102: Establish a general translation model using the general-domain training corpus set.

Specifically, a translation model is trained on the training corpora in the general-domain training corpus set, and the trained translation model is the general translation model. Training a translation model on the training corpora of a general-domain training corpus set is existing technology and is not described further in this embodiment.

Step S103: Based on the general translation model and the initial training corpus set of the specified domain, determine the first target value corresponding to each piece of training corpus in the general-domain training corpus set.

The first target value represents the degree to which the corresponding training corpus matches the specified domain.

It should be noted that the higher the degree to which a training corpus matches the specified domain, the more likely it is to be a training corpus of the specified domain; conversely, the lower the degree of matching, the less likely it is to be a training corpus of the specified domain.

In deep learning and machine learning, training corpora with similar features or characteristics tend to produce consistent back-propagated update gradients, and training corpora of the same domain necessarily share similar features. In other words, training corpora belonging to the same domain have consistent backward gradients on the model. Based on this, the inventors realized that gradient distance can be used to measure how well a training corpus matches the specified domain.

In view of this, the process of determining, based on the general translation model and the initial training corpus set of the specified domain, the first target value corresponding to a target training corpus in the general-domain training corpus set (this application calls a training corpus whose target value is to be determined a "target training corpus") may include:

Step a1: Determine the gradient of each piece of training corpus in the initial training corpus set of the specified domain on the general translation model.

具体的，针对指定领域的初始训练语料集中的每条训练语料，将其输入通用翻译模型，计算其在通用翻译模型中的梯度梯度/>的计算公式如下：Specifically, for each piece of training corpus in the initial training corpus in the specified field, input it into the universal translation model, and calculate its gradient in the universal translation model. Gradient/> The calculation formula is as follows:

其中，表示求梯度运算符，θ指的是通用翻译模型中的模型参数，需要说明的是，θ可以是整个通用翻译模型的参数，也可以是其中部分参数，可选的，θ可以为通用翻译模型的最后一层模型参数。in, Represents the gradient operator, θ refers to the model parameters in the universal translation model. It should be noted that θ can be a parameter of the entire universal translation model, or some of the parameters. Optionally, θ can be a universal translation model. The last layer of model parameters.

Step a2: Calculate the average of the gradients, on the general translation model, of all training corpora in the initial training corpus set of the specified domain, and take it as the domain gradient average $\bar{g}_{in}$.

Step a3: Determine the gradient of the target training corpus on the general translation model.

The gradient of the target training corpus on the general translation model is determined in the same way as the gradient of a training corpus in the initial training corpus set of the specified domain.

Step a4: Calculate the distance between the gradient $g$ of the target training corpus on the general translation model and the domain gradient average $\bar{g}_{in}$, and take it as the first target value corresponding to the target training corpus.

That is, the first target value $S_d$ corresponding to the target training corpus is:

$$S_d = \mathrm{dist}(g, \bar{g}_{in})$$

where $\mathrm{dist}(\cdot, \cdot)$ computes the distance between two tensors; the distance may be, but is not limited to, the cosine distance.
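Steps a2 and a4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gradients are taken to be already flattened into plain lists of floats (obtaining them in steps a1 and a3 requires an actual translation model and is not shown), the function names are hypothetical, and the "distance" is realized as cosine similarity so that a larger value means a closer match, in line with the text's selection of the highest first target values.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_gradient(gradients: Sequence[Vector]) -> List[float]:
    """Step a2: element-wise average of the in-domain gradients."""
    if not gradients:
        raise ValueError("need at least one in-domain gradient")
    dim = len(gradients[0])
    total = [0.0] * dim
    for g in gradients:
        if len(g) != dim:
            raise ValueError("all gradients must have the same length")
        for i, value in enumerate(g):
            total[i] += value
    return [value / len(gradients) for value in total]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def first_target_value(target_gradient: Vector, domain_mean: Vector) -> float:
    """Step a4 (assumed form): similarity of one gradient to the domain mean."""
    return cosine_similarity(target_gradient, domain_mean)


# Made-up gradients standing in for the outputs of steps a1 and a3.
in_domain_gradients = [[1.0, 0.0], [0.8, 0.2]]
domain_mean = mean_gradient(in_domain_gradients)
aligned = first_target_value([2.0, 0.1], domain_mean)   # points the same way
opposed = first_target_value([-1.0, 0.0], domain_mean)  # points the other way
```

With these toy values, the gradient pointing in the same direction as the domain mean scores close to 1 and the opposing one scores below 0. Restricting the gradient to the last layer, as the text allows, would keep such vectors short.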

Step S104: Based on the first target value corresponding to each training corpus in the training corpus set of the general domain, screen out training corpora of the specified domain from the training corpus set of the general domain.

This screening can be implemented in several ways. In one possible implementation, the training corpora in the training corpus set of the general domain are sorted by first target value from high to low, and the first N training corpora are taken as the training corpora of the specified domain; equivalently, the training corpora may be sorted by first target value from low to high and the last N taken. The value of N can be set according to the specific application. In another possible implementation, a threshold T is set, and the training corpora whose first target value is greater than the threshold T are selected from the training corpus set of the general domain as the training corpora of the specified domain.
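The two screening strategies in step S104 might look as follows in Python. This is a minimal sketch: the function names and the toy samples and scores are invented for illustration and do not come from the application.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_top_n(samples: Sequence[T], scores: Sequence[float], n: int) -> List[T]:
    """Sort by score from high to low and keep the first n samples."""
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in order[:n]]


def select_above_threshold(
    samples: Sequence[T], scores: Sequence[float], threshold: float
) -> List[T]:
    """Keep every sample whose score is strictly greater than the threshold."""
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    return [s for s, v in zip(samples, scores) if v > threshold]


# Made-up samples and first target values.
samples = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.5, 0.7]
top_two = select_top_n(samples, scores, n=2)            # highest two scores
above = select_above_threshold(samples, scores, 0.4)    # original order kept
```

Top-N gives a fixed amount of data regardless of score scale, while a threshold gives a variable amount but a fixed quality bar; which is preferable depends on the application, as the text notes for N.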

The domain data acquisition method provided by this embodiment of the application determines, based on the general translation model and the initial training corpus set of the specified domain, the first target value corresponding to each training corpus in the training corpus set of the general domain. Because the first target value characterizes the degree of matching between the corresponding training corpus and the specified domain, training corpora that match the specified domain well can be screened out from the training corpus set of the general domain based on these values. The method can therefore screen out training corpora of the specified domain from the training corpus set of the general domain relatively accurately. In addition, because the screening is performed at the level of model gradients, the screened training corpora are directly coupled to the model when they are later used to build a translation model for the specified domain, which can effectively improve the domain translation capability of that model.

Second embodiment

To improve the screening accuracy of training corpora, this embodiment provides another domain data acquisition method. Referring to Figure 2, which shows a schematic flow chart of the method, the method may include:

Step S201: Obtain a training corpus set of the general domain and an initial training corpus set of the specified domain.

The training corpus set of the general domain contains training corpora from multiple domains mixed together, and the initial training corpus set of the specified domain consists of training corpora collected directly in the specified domain.

Step S202a: Establish a general translation model using the training corpus set of the general domain.

Specifically, a translation model is trained with the training corpora in the training corpus set of the general domain, and the trained translation model is the general translation model.

Step S202b: Establish a general language model using the training corpus set of the general domain, and establish a domain language model using the initial training corpus set of the specified domain.

The process of establishing the general language model may include: training a language model with the source language texts in the training corpus set of the general domain to obtain the source-side general language model; and training a language model with the target language texts in the training corpus set of the general domain to obtain the target-side general language model. That is, the general language model in this embodiment includes a source-side general language model and a target-side general language model.

The process of establishing the domain language model may include: training a language model with the source language texts in the initial training corpus set of the specified domain to obtain the source-side domain language model; and training a language model with the target language texts in the initial training corpus set of the specified domain to obtain the target-side domain language model. That is, the domain language model in this embodiment includes a source-side domain language model and a target-side domain language model.

Step S203a: Based on the general translation model and the initial training corpus set of the specified domain, determine the first target value corresponding to each training corpus in the training corpus set of the general domain.

For the process of determining the first target value corresponding to each training corpus in the training corpus set of the general domain, refer to steps a1 to a4 in the first embodiment; it is not repeated here.

Step S203b: Based on the general language model and the domain language model, determine the second target value corresponding to each training corpus in the training corpus set of the general domain.

The second target value characterizes the degree of relevance between the corresponding training corpus and the specified domain.

It should be noted that the higher the degree of relevance between a training corpus and the specified domain, the more likely it is that the training corpus belongs to the specified domain; conversely, the lower the degree of relevance, the less likely it is that the training corpus belongs to the specified domain.

Specifically, the process of determining, based on the general language model and the domain language model, the second target value corresponding to a target training corpus in the training corpus set of the general domain may include:

Step b1: Calculate the posterior probabilities of the target training corpus on the general language model and on the domain language model.

Specifically, calculate the posterior probabilities of the target training corpus on the source-side general language model, the source-side domain language model, the target-side general language model and the target-side domain language model, respectively.

Step b2: Determine the second target value corresponding to the target training corpus according to the calculated posterior probabilities.

Specifically, this process includes:

Determine the score of the target training corpus on the source-side language models according to its posterior probabilities on the source-side general language model and the source-side domain language model; determine the score of the target training corpus on the target-side language models according to its posterior probabilities on the target-side general language model and the target-side domain language model; and take these two scores together as the second target value corresponding to the target training corpus.

More specifically, let $P^{src}_{gen}(x)$ denote the posterior probability of the target training corpus on the source-side general language model and $P^{src}_{in}(x)$ its posterior probability on the source-side domain language model; the score $S_{src}$ of the target training corpus on the source-side language models is computed from $P^{src}_{gen}(x)$ and $P^{src}_{in}(x)$.

Similarly, let $P^{tgt}_{gen}(y)$ denote the posterior probability of the target training corpus on the target-side general language model and $P^{tgt}_{in}(y)$ its posterior probability on the target-side domain language model; the score $S_{tgt}$ of the target training corpus on the target-side language models is computed from $P^{tgt}_{gen}(y)$ and $P^{tgt}_{in}(y)$.

Here, $x$ is the source language text in the target training corpus, and $y$ is the target language text in the target training corpus.
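The following Python sketch illustrates how a score could be derived from the two posterior probabilities on one language side. It rests on assumptions that are not stated in the text: the score is taken to be the per-token difference between the log-probability under the domain language model and that under the general language model (so that a larger value indicates higher relevance), and a toy add-one-smoothed unigram model named `UnigramLM` stands in for a real language model. All names and sentences are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List


class UnigramLM:
    """Toy add-one-smoothed unigram language model (illustrative stand-in)."""

    def __init__(self, sentences: Iterable[List[str]]) -> None:
        self.counts = Counter(tok for sent in sentences for tok in sent)
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen tokens

    def log_prob(self, sentence: List[str]) -> float:
        """Natural-log probability of a tokenized sentence."""
        denom = self.total + self.vocab_size
        return sum(math.log((self.counts[tok] + 1) / denom) for tok in sentence)


def lm_score(sentence: List[str], domain_lm: UnigramLM, general_lm: UnigramLM) -> float:
    """Assumed score: per-token log-probability difference (domain minus general)."""
    if not sentence:
        return 0.0
    diff = domain_lm.log_prob(sentence) - general_lm.log_prob(sentence)
    return diff / len(sentence)


# Made-up source-side texts; the target side would be handled the same way.
general_lm = UnigramLM([
    ["the", "cat", "sat"],
    ["the", "patient", "slept"],
    ["stocks", "fell", "today"],
])
domain_lm = UnigramLM([
    ["the", "patient", "received", "treatment"],
    ["treatment", "helped", "the", "patient"],
])
in_domain_score = lm_score(["patient", "treatment"], domain_lm, general_lm)
out_of_domain_score = lm_score(["stocks", "fell"], domain_lm, general_lm)
```

Under these assumptions the medical-sounding sentence receives a positive score and the financial one a negative score. Applying the same function to the target language text with the two target-side models would give the second component of the second target value.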

Step S204: Based on the first target value and the second target value corresponding to each training corpus in the training corpus set of the general domain, screen out training corpora of the specified domain from the training corpus set of the general domain.

In one possible implementation, for each training corpus in the training corpus set of the general domain, the first target value is taken as the score in the first dimension, the score on the source-side language models as the score in the second dimension, and the score on the target-side language models as the score in the third dimension. The scores in these three dimensions are fused, and the fused score is taken as the target score of the training corpus. After the target score of each training corpus in the training corpus set of the general domain is obtained, training corpora of the specified domain are screened out according to the target scores; specifically, the N training corpora with the highest target scores may be selected as the training corpora of the specified domain.

Suppose the first target value corresponding to a training corpus is $S_d$, its score on the source-side language models is $S_{src}$, and its score on the target-side language models is $S_{tgt}$. In one possible implementation, $S_d$, $S_{src}$ and $S_{tgt}$ are summed directly, and the sum is taken as the target score of the training corpus. In another possible implementation, weights $\alpha$, $\beta$ and $\gamma$ corresponding to $S_d$, $S_{src}$ and $S_{tgt}$ respectively are determined in advance, and the weighted sum $S$ is taken as the target score of the training corpus, that is:

$$S = \alpha S_d + \beta S_{src} + \gamma S_{tgt}$$
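Both fusion variants reduce to one small function, sketched below in Python. The function name, the sample identifiers and the score triples are made up for illustration; with all weights left at 1.0 the function gives the plain sum, and other weights give the weighted sum.

```python
from typing import Dict, List, Tuple


def fused_score(
    s_d: float,
    s_src: float,
    s_tgt: float,
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 1.0,
) -> float:
    """Weighted sum of the three per-sample scores (plain sum by default)."""
    return alpha * s_d + beta * s_src + gamma * s_tgt


# Made-up (S_d, S_src, S_tgt) triples for three sentence pairs.
scores: Dict[str, Tuple[float, float, float]] = {
    "pair_a": (0.5, 1.0, -0.25),
    "pair_b": (0.25, -0.5, 0.5),
    "pair_c": (0.75, 0.5, 0.5),
}

plain = {name: fused_score(*triple) for name, triple in scores.items()}
weighted = {
    name: fused_score(*triple, alpha=2.0, beta=0.5, gamma=0.5)
    for name, triple in scores.items()
}

# Keep the N = 2 pairs with the highest fused score.
top_n: List[str] = sorted(plain, key=lambda name: plain[name], reverse=True)[:2]
```

Because the three scores may live on different scales (a similarity in a bounded range versus log-probability differences), the weights also act as a crude rescaling; how they are chosen is left open by the text.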

The domain data acquisition method provided by this embodiment of the application determines the first target value corresponding to each training corpus in the training corpus set of the general domain based on the general translation model and the initial training corpus set of the specified domain, and also determines the second target value corresponding to each training corpus based on the general language model and the domain language model. Because the first target value characterizes the degree of matching between the corresponding training corpus and the specified domain, and the second target value characterizes the degree of relevance between the corresponding training corpus and the specified domain, using both values allows training corpora of the specified domain to be screened out from the training corpus set of the general domain more accurately.

Third embodiment

To improve both the screening efficiency and the screening accuracy of training corpora, this embodiment provides yet another domain data acquisition method. Referring to Figure 3, which shows a schematic flow chart of the method, the method may include:

Step S301: Obtain a training corpus set of the general domain and an initial training corpus set of the specified domain.

Step S302: Establish a general language model using the training corpus set of the general domain, and establish a domain language model using the initial training corpus set of the specified domain.

Step S303: Based on the general language model and the domain language model, determine the second target value corresponding to each training corpus in the training corpus set of the general domain.

For the specific implementation of this step, refer to step S203b in the second embodiment ("Based on the general language model and the domain language model, determine the second target value corresponding to each training corpus in the training corpus set of the general domain"); it is not repeated here.

Step S304: Based on the second target values corresponding to the training corpora in the training corpus set of the general domain, select candidate training corpora from the training corpus set of the general domain to form a candidate training corpus set.

In one possible implementation, the training corpora in the training corpus set of the general domain are sorted by second target value from large to small, and the first M training corpora are taken as candidate training corpora to form the candidate training corpus set (equivalently, the training corpora may be sorted by second target value from small to large and the last M taken). In another possible implementation, a threshold T1 is set, and the training corpora in the training corpus set of the general domain whose second target value is greater than the threshold T1 are taken as candidate training corpora to form the candidate training corpus set.

Step S305: Establish a general translation model using the training corpus set of the general domain.

It should be noted that this embodiment does not require step S305 to be executed after step S304; it is sufficient that step S305 is executed after step S301 and before step S306.

Step S306: Based on the general translation model and the initial training corpus set of the specified domain, determine the first target value corresponding to each training corpus in the candidate training corpus set.

Specifically, first determine the gradient, on the general translation model, of each training corpus in the initial training corpus set of the specified domain. Then calculate the average of these gradients as the domain gradient average. Next, determine the gradient of each training corpus in the candidate training corpus set on the general translation model. Finally, for each training corpus in the candidate training corpus set, calculate the distance between its gradient on the general translation model and the domain gradient average, and take this distance as its first target value, thereby obtaining the first target value corresponding to each training corpus in the candidate training corpus set. For the process of determining the gradient of a training corpus on the general translation model, refer to the first embodiment; it is not repeated here.

Step S307: Based on the first target value corresponding to each training corpus in the candidate training corpus set, screen out training corpora of the specified domain from the candidate training corpus set.

Optionally, the training corpora in the candidate training corpus set are sorted by first target value from large to small, and the first N training corpora (N can be set according to the specific application) are taken as the training corpora of the specified domain; equivalently, they may be sorted by first target value from small to large and the last N taken. Alternatively, a threshold T2 may be set, and the training corpora in the candidate training corpus set whose first target value is greater than the threshold T2 are selected as the training corpora of the specified domain.

Because gradient computation is relatively time-consuming, the domain data acquisition method of this embodiment improves screening efficiency by first selecting candidate training corpora from the training corpus set of the general domain based on the second target value (language model score) of each training corpus, and then screening out training corpora of the specified domain from the candidate training corpus set based on the first target value (gradient distance) of each candidate training corpus. The method not only screens out training corpora of the specified domain accurately from the training corpus set of the general domain, but also reduces the amount of computation, because gradients need to be computed only for the candidate training corpora rather than for all training corpora, thereby improving screening efficiency.
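The coarse-to-fine order of the third embodiment can be sketched as a two-stage filter in Python. The scoring functions are passed in as callables, and the toy scores below are invented; the point of the sketch is only that the expensive gradient-based score is evaluated for the M candidates rather than for every training corpus. The function and variable names are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def two_stage_filter(
    samples: Sequence[T],
    second_target_value: Callable[[T], float],
    first_target_value: Callable[[T], float],
    m: int,
    n: int,
) -> List[T]:
    """Keep M candidates by the cheap score, then N of them by the costly score."""
    # Stage 1 (steps S303 to S304): rank everything by the language model score.
    candidates = sorted(samples, key=second_target_value, reverse=True)[:m]
    # Stage 2 (steps S306 to S307): gradient-based score on the candidates only.
    return sorted(candidates, key=first_target_value, reverse=True)[:n]


# Made-up scores for four samples.
lm_scores = {"s1": 0.9, "s2": 0.1, "s3": 0.7, "s4": 0.4}
grad_scores = {"s1": 0.2, "s2": 0.3, "s3": 0.8, "s4": 0.6}
gradient_calls: List[str] = []


def toy_gradient_score(sample: str) -> float:
    """Stand-in for the costly gradient computation; records each call."""
    gradient_calls.append(sample)
    return grad_scores[sample]


selected = two_stage_filter(
    list(lm_scores), lambda s: lm_scores[s], toy_gradient_score, m=3, n=2
)
```

In this toy run the sample with the lowest language model score is discarded in the first stage, so its gradient is never computed. The choice of M trades efficiency against the risk of discarding a sample that the gradient-based score would have favored.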

Fourth embodiment

On the basis of the above embodiments, this embodiment provides a machine translation method for a specified domain. The method may include:

Obtain the source language text to be translated in the specified domain; input the source language text to be translated into a pre-established domain translation model to obtain the corresponding target language text.

The domain translation model is obtained by adjusting the general translation model using the initial training corpus set of the specified domain together with the training corpora obtained from the training corpus set of the general domain by the domain data acquisition method of any of the above embodiments; the general translation model is trained on the training corpus set of the general domain.

Referring to Figure 4, which shows a schematic flow chart of adjusting the general translation model using the initial training corpus set of the specified domain and the training corpora screened out from the training corpus set of the general domain, the process may include:

Step S401: Mix the training corpora in the initial training corpus set of the specified domain with the training corpora screened out from the training corpus set of the general domain.

Step S402: Fine-tune the general translation model with the mixed training corpora, and take the fine-tuned model as the domain translation model T_in1.

Fine-tuning the general translation model with the mixed training corpora means training the general translation model further on the mixed training corpora. The translation model obtained after this training is the domain translation model T_in1, which can translate text to be translated in the specified domain relatively accurately.

Preferably, to obtain a domain translation model with better performance, this embodiment may further include the following step:

Step S403: Fine-tune the domain translation model T_in1 with the training corpora in the initial training corpus set of the specified domain, and take the fine-tuned model as the final domain translation model T_in.

After the domain translation model T_in1 is obtained through step S402, it is fine-tuned more finely with the high-quality training corpora of the specified domain, that is, T_in1 is trained further on the training corpora in the initial training corpus set of the specified domain.

Because high-quality training corpora are scarce, this embodiment adds a KL regularization constraint when fine-tuning the domain translation model T_in1, to prevent the model from drifting.

Specifically, the strategy for preventing the model from drifting through the KL regularization constraint is as follows. After T_in1 is obtained, a second copy of T_in1 is created, as shown in Figure 5. The parameters of one copy are kept fixed (T_in1-fixed in Figure 5), while the parameters of the other copy are adjusted (T_in in Figure 5). The high-quality training corpora are input into both T_in and T_in1-fixed; T_in outputs the probability distribution P(y|x), and T_in1-fixed outputs the probability distribution Q(y|x). To prevent the model from drifting, the output probability distribution of T_in is kept as close as possible to the output probability distribution of T_in1-fixed.

The closeness of two probability distributions can be measured by relative entropy, also known as KL divergence or information divergence, which is an asymmetric measure of the difference between two probability distributions and can be used to gauge the distance between them. When the two probability distributions are identical, their relative entropy is zero; as the difference between them increases, their relative entropy also increases. Specifically, the relative entropy (KL divergence) of probability distribution P with respect to probability distribution Q can be calculated as:

$$D_{KL}(P \,\|\, Q) = \sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)}$$
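As a small numeric illustration, relative entropy for discrete distributions can be computed as below. The function `regularized_loss` is an assumption about how the constraint might enter training: it adds a weighted KL term to the ordinary fine-tuning loss, and both the additive form and the `weight` parameter are hypothetical, since the text does not specify how the constraint is combined with the loss.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is clamped to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_loss(
    task_loss: float, p: Sequence[float], q: Sequence[float], weight: float = 1.0
) -> float:
    """Assumed combination: fine-tuning loss plus a weighted KL penalty."""
    return task_loss + weight * kl_divergence(p, q)


# Made-up output distributions over a two-word vocabulary.
p_tuned = [0.5, 0.5]   # P(y|x) from the model being adjusted (T_in)
q_fixed = [0.9, 0.1]   # Q(y|x) from the frozen copy (T_in1-fixed)

same = kl_divergence(p_tuned, p_tuned)      # identical distributions
forward = kl_divergence(p_tuned, q_fixed)   # D_KL(P || Q)
backward = kl_divergence(q_fixed, p_tuned)  # D_KL(Q || P)
penalized = regularized_loss(2.0, p_tuned, q_fixed, weight=0.5)
```

Identical distributions give zero, and swapping the arguments changes the value, which is the asymmetry mentioned in the text. The penalty grows as T_in moves away from the frozen copy, which is what discourages drift when only a small amount of high-quality data is available.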

After training converges, the resulting model T_in is the translation model for the specified domain.

In the machine translation method for a specified domain provided by this embodiment, the domain translation model is obtained by fine-tuning the general translation model with a large number of training corpora of the specified domain. It is therefore a translation model adapted to the specified domain, and translating text of the specified domain with it yields accurate translation results.

It should be noted that the domain data screening method provided by the above embodiments can be used to obtain training corpora for multiple different domains. The training corpora of each domain (which may include both the screened training corpora and the directly collected corpora) can then be used to fine-tune the general translation model separately, yielding domain translation models for multiple different domains and thus enabling accurate translation of text to be translated in those domains.

Fifth embodiment

This embodiment of the application further provides a domain data acquisition device, which is described below. The domain data acquisition device described below and the domain data acquisition method described above may be referred to in correspondence with each other.

Referring to Figure 6, which shows a schematic structural diagram of the domain data acquisition device provided by this embodiment, the device may include: a data acquisition module 601, a general translation model establishment module 602, a first target value determination module 603 and a data screening module 604.

The data acquisition module 601 is configured to obtain a training corpus set of the general domain and an initial training corpus set of the specified domain.

The general translation model establishment module 602 is configured to establish a general translation model using the training corpus set of the general domain.

The first target value determination module 603 is configured to determine, based on the general translation model and the initial training corpus set of the specified domain, the first target values corresponding to the training corpora in the training corpus set of the general domain, where each training corpus corresponds to one first target value, and the first target value characterizes the degree of matching between the corresponding training corpus and the specified domain.

The data screening module 604 is configured to screen out training corpora of the specified domain from the training corpus set of the general domain based on the first target values corresponding to the training corpora in the training corpus set of the general domain.

Optionally, the domain data acquisition apparatus provided by the embodiment of the present application may further include: a general language model establishment module, a domain language model establishment module and a second target value determination module.

The general language model establishment module is configured to establish a general language model by using the training corpus set of the general domain.

The domain language model establishment module is configured to establish a domain language model by using the initial training corpus set of the specified domain.

The second target value determination module is configured to determine, based on the general language model and the domain language model, the second target values corresponding to the training corpora in the training corpus set of the general domain, wherein each piece of training corpus corresponds to one second target value, and the second target value can represent the degree of relevance between the corresponding training corpus and the specified domain;

In this case, the data screening module 604 is specifically configured to screen out training corpora of the specified domain from the training corpus set of the general domain based on both the first target values and the second target values corresponding to the training corpora in the training corpus set of the general domain.

Optionally, the first target value determination module 603 includes: a general language model establishment module, a domain language model establishment module, a second target value determination module, a candidate training corpus screening submodule, and a candidate corpus first target value determination module.

The candidate training corpus screening submodule is configured to screen candidate training corpora from the training corpus set of the general domain based on the second target values corresponding to the training corpora in the training corpus set of the general domain.

The candidate corpus first target value determination module is configured to determine the first target value corresponding to each screened candidate training corpus.

In this case, the data screening module 604 is specifically configured to screen out training corpora of the specified domain from the screened candidate training corpora based on the first target value corresponding to each screened candidate training corpus.
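The two-stage screening described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the threshold `min_score`, the `keep` count, the rule that both language-model scores must reach the threshold, and the rule that smaller first target values are preferred are all hypothetical choices made for the example. The second target value is represented as a (source-side score, target-side score) pair.

```python
def screen_candidates(second_values, min_score):
    """Stage 1 (assumed rule): keep corpus indices whose source-side and
    target-side language-model scores both reach min_score."""
    return [i for i, (src, tgt) in enumerate(second_values)
            if src >= min_score and tgt >= min_score]


def screen_domain_corpus(candidates, first_value_fn, keep):
    """Stage 2 (assumed rule): compute the first target value only for the
    candidates and keep the `keep` candidates with the smallest values."""
    scored = [(first_value_fn(i), i) for i in candidates]
    scored.sort()
    return [i for _, i in scored[:keep]]


# Hypothetical scores for a four-item general-domain corpus set.
second_values = [(0.8, 1.0), (-0.3, -0.4), (0.5, 0.2), (0.1, -0.2)]
first_values = {0: 0.4, 1: 0.9, 2: 0.1, 3: 0.7}
evaluated = []  # records which items had their first target value computed


def first_value_fn(index):
    evaluated.append(index)
    return first_values[index]


candidates = screen_candidates(second_values, min_score=0.0)
selected = screen_domain_corpus(candidates, first_value_fn, keep=1)
print(candidates, selected, evaluated)
```

In this sketch the first target value is evaluated only for items that pass the first stage, which is one plausible reason for ordering the stages this way.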

Optionally, when determining, based on the general translation model and the initial training corpus set of the specified domain, the first target value corresponding to a target training corpus in the training corpus set of the general domain, the first target value determination module 603 is specifically configured to:

determine the gradient of each piece of training corpus in the initial training corpus set of the specified domain on the general translation model; calculate the average of the gradients of the pieces of training corpus in the initial training corpus set of the specified domain on the general translation model, as the domain gradient average; determine the gradient of the target training corpus on the general translation model; and calculate the distance between the gradient of the target training corpus on the general translation model and the domain gradient average, as the first target value corresponding to the target training corpus.
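The gradient-based computation above can be illustrated with a small sketch. Several things here are assumptions made only for illustration: a two-parameter linear model with squared loss stands in for the general translation model, the distance is taken to be Euclidean, and all function names are hypothetical. A real translation model would produce much higher-dimensional gradients.

```python
import numpy as np


def per_example_gradient(w, x, y):
    """Gradient of the toy loss 0.5 * (w.x - y)^2 with respect to w."""
    return (np.dot(w, x) - y) * x


def domain_gradient_average(w, domain_examples):
    """Average of the per-example gradients over the initial domain corpus set."""
    grads = [per_example_gradient(w, x, y) for x, y in domain_examples]
    return np.mean(grads, axis=0)


def first_target_value(w, example, domain_avg):
    """Distance (assumed Euclidean) between one example's gradient and the
    domain gradient average."""
    x, y = example
    return float(np.linalg.norm(per_example_gradient(w, x, y) - domain_avg))


w = np.array([1.0, -1.0])  # parameters of the stand-in "general model"
domain = [(np.array([1.0, 0.0]), 2.0), (np.array([1.0, 0.0]), 4.0)]
domain_avg = domain_gradient_average(w, domain)

near = (np.array([1.0, 0.0]), 3.0)  # resembles the domain examples
far = (np.array([0.0, 1.0]), 3.0)   # does not

near_value = first_target_value(w, near, domain_avg)
far_value = first_target_value(w, far, domain_avg)
print(near_value, far_value)
```

Under the assumption that a smaller distance indicates a closer match to the specified domain, `near` would be preferred over `far` in this example.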

Each piece of training corpus in the training corpus set of the general domain and each piece of training corpus in the initial training corpus set of the specified domain includes: a source language text and a corresponding target language text.

When establishing a general language model by using the training corpus set of the general domain and establishing a domain language model by using the initial training corpus set of the specified domain, the general language model establishment module is specifically configured to:

train a language model by using the source language texts in the training corpus set of the general domain, the trained language model serving as the source-language-side general language model; and train a language model by using the target language texts in the training corpus set of the general domain, the trained language model serving as the target-language-side general language model.

When establishing a domain language model by using the initial training corpus set of the specified domain, the domain language model establishment module is specifically configured to:

train a language model by using the source language texts in the initial training corpus set of the specified domain, the trained language model serving as the source-language-side domain language model; and train a language model by using the target language texts in the initial training corpus set of the specified domain, the trained language model serving as the target-language-side domain language model.

When determining, based on the general language model and the domain language model, the second target value corresponding to a target training corpus in the training corpus set of the general domain, the second target value determination module is specifically configured to:

calculate the posterior probabilities of the target training corpus on the general language model and on the domain language model respectively; and determine the second target value corresponding to the target training corpus according to the calculated posterior probabilities.

When calculating the posterior probabilities of the target training corpus on the general language model and on the domain language model respectively, the second target value determination module is specifically configured to:

calculate the posterior probabilities of the target training corpus on the source-language-side general language model, the source-language-side domain language model, the target-language-side general language model and the target-language-side domain language model respectively.

When determining the second target value corresponding to the target training corpus according to the calculated posterior probabilities, the second target value determination module is specifically configured to:

determine the score of the target training corpus on the source-language-side language models according to its posterior probability on the source-language-side general language model and its posterior probability on the source-language-side domain language model; determine the score of the target training corpus on the target-language-side language models according to its posterior probability on the target-language-side general language model and its posterior probability on the target-language-side domain language model; and take the score on the source-language-side language models together with the score on the target-language-side language models as the second target value corresponding to the target training corpus.
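A minimal sketch of the four-model scoring described above follows. The passage does not say how the two probabilities on each side are combined into a score, so this example assumes a cross-entropy-difference style score (per-token log-probability under the domain model minus that under the general model). The add-one-smoothed unigram model, the toy English/German sentences and all names are likewise hypothetical stand-ins for real language models and corpora.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram language model (a stand-in only)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen tokens

    def avg_logprob(self, sentence):
        toks = sentence.split()
        logp = sum(math.log((self.counts[t] + 1) / (self.total + self.vocab))
                   for t in toks)
        return logp / max(len(toks), 1)


def side_score(text, domain_lm, general_lm):
    """Assumed score: domain log-probability minus general log-probability."""
    return domain_lm.avg_logprob(text) - general_lm.avg_logprob(text)


def second_target_value(pair, lms):
    """Return the (source-side score, target-side score) pair."""
    src, tgt = pair
    return (side_score(src, lms["src_domain"], lms["src_general"]),
            side_score(tgt, lms["tgt_domain"], lms["tgt_general"]))


lms = {
    "src_general": UnigramLM(["the cat sat", "the dog ran", "the weather is nice"]),
    "tgt_general": UnigramLM(["die katze sass", "der hund lief", "das wetter ist schoen"]),
    "src_domain": UnigramLM(["the patient took aspirin", "the doctor prescribed aspirin"]),
    "tgt_domain": UnigramLM(["der patient nahm aspirin", "der arzt verschrieb aspirin"]),
}

in_domain = second_target_value(("the patient took aspirin", "der patient nahm aspirin"), lms)
out_domain = second_target_value(("the cat sat", "die katze sass"), lms)
print(in_domain, out_domain)
```

With this assumed scoring, a positive score on a side means the text is more probable under the domain model than under the general model on that side.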

The domain data acquisition apparatus provided in this embodiment can screen out training corpora of the specified domain from the training corpus set of the general domain relatively accurately.

Sixth embodiment

This embodiment provides a machine translation apparatus for a specified domain, which is described below. The machine translation apparatus described below and the machine translation method described above may be referred to in correspondence with each other.

The machine translation apparatus for a specified domain provided in this embodiment may include: a source language text acquisition module and a translation module.

The source language text acquisition module is configured to acquire a to-be-translated source language text of the specified domain.

The translation module is configured to input the to-be-translated source language text into a pre-established domain translation model to obtain the target language text corresponding to the to-be-translated source language text.

The domain translation model is obtained by adjusting a general translation model using the initial training corpus set of the specified domain together with the training corpora acquired from the training corpus set of the general domain by the domain data acquisition apparatus provided in the above embodiment; the general translation model is trained using the training corpus set of the general domain.

The machine translation apparatus for a specified domain provided in this embodiment may further include: a domain translation model construction module.

The domain translation model construction module may include a first adjustment module.

The first adjustment module is configured to mix the training corpora in the initial training corpus set of the specified domain with the training corpora screened from the training corpus set of the general domain, and to adjust the general translation model using the mixed training corpora.

Optionally, the domain translation model construction module may further include a second adjustment module.

The second adjustment module is configured to further adjust the translation model that has been adjusted using the mixed training corpora, by using the training corpora in the initial training corpus set of the specified domain.
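The two adjustment stages can be sketched as below. This is an illustration only: a two-parameter linear regressor trained by full-batch gradient descent stands in for the translation model, and the data, learning rate, step count and function names are all hypothetical. The point of the sketch is the data flow: first adjust on the mixture of initial domain data and screened data, then adjust further on the initial domain data alone.

```python
import numpy as np


def mse(w, data):
    """Mean squared error of the stand-in linear model on (X, y)."""
    X, y = data
    return float(np.mean((X @ w - y) ** 2))


def fine_tune(w, data, lr=0.05, steps=50):
    """Full-batch gradient descent starting from the given parameters."""
    X, y = data
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


def mix(a, b):
    """Concatenate two (X, y) data sets into one mixed set."""
    return np.vstack([a[0], b[0]]), np.concatenate([a[1], b[1]])


# Hypothetical data: initial domain set and screened general-domain set.
X_dom = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
initial_domain = (X_dom, X_dom @ np.array([2.0, -1.0]))
X_scr = np.array([[2.0, 1.0], [1.0, 2.0]])
screened = (X_scr, X_scr @ np.array([1.8, -0.8]))

w_general = np.zeros(2)  # stand-in for the trained general model

# First adjustment: mixed corpora. Second adjustment: initial domain corpora only.
w_stage1 = fine_tune(w_general, mix(initial_domain, screened))
w_stage2 = fine_tune(w_stage1, initial_domain)

print(mse(w_general, initial_domain), mse(w_stage1, initial_domain),
      mse(w_stage2, initial_domain))
```

In this toy setting the error on the initial domain data decreases after each stage; whether the same holds for a real translation model depends on the data and training setup.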

Because the domain translation model used by the machine translation apparatus provided in this embodiment is obtained by fine-tuning the general translation model with a large amount of training corpora of the specified domain, it is a translation model adapted to the specified domain, and translating texts of the specified domain with it can yield accurate translation results.

Seventh embodiment

An embodiment of the present application further provides a domain data acquisition device. Referring to Figure 7, which shows a schematic structural diagram of the domain data acquisition device, the device may include: at least one processor 701, at least one communication interface 702, at least one memory 703 and at least one communication bus 704;

In the embodiment of the present application, there is at least one each of the processor 701, the communication interface 702, the memory 703 and the communication bus 704, and the processor 701, the communication interface 702 and the memory 703 communicate with one another through the communication bus 704;

The processor 701 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention;

The memory 703 may include high-speed RAM and may also include non-volatile memory, for example at least one disk memory;

The memory stores a program, and the processor can call the program stored in the memory, the program being used for:

Optionally, for the refined and extended functions of the program, reference may be made to the description above.

Eighth embodiment

An embodiment of the present application further provides a readable storage medium, which may store a program suitable for execution by a processor, the program being used for:

Ninth embodiment

An embodiment of the present application further provides a machine translation device for a specified domain. The machine translation device may include: at least one processor, at least one communication interface, at least one memory and at least one communication bus;

In the embodiment of the present application, there is at least one each of the processor, the communication interface, the memory and the communication bus, and the processor, the communication interface and the memory communicate with one another through the communication bus;

The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention;

The memory may include high-speed RAM and may also include non-volatile memory, for example at least one disk memory;

The domain translation model is obtained by adjusting a general translation model using the initial training corpus set of the specified domain together with the training corpora acquired from the training corpus set of the general domain by the domain data acquisition method provided in any of the above embodiments; the general translation model is trained using the training corpus set of the general domain.

Tenth embodiment

Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

With the above description of the disclosed embodiments, in which the features described in the various embodiments of this specification may be substituted for or combined with one another, those skilled in the art are enabled to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A domain data acquisition method, characterized by comprising:

acquiring a training corpus set of a general domain and an initial training corpus set of a specified domain;

establishing a general translation model by using the training corpus set of the general domain;

determining, based on the general translation model and the initial training corpus set of the specified domain, first target values corresponding to training corpora in the training corpus set of the general domain, wherein each piece of training corpus corresponds to one first target value, and the first target value can represent the degree of matching between the corresponding training corpus and the specified domain;

screening out training corpora of the specified domain from the training corpus set of the general domain based on the first target values corresponding to the training corpora in the training corpus set of the general domain;

wherein determining, based on the general translation model and the initial training corpus set of the specified domain, the first target value corresponding to a target training corpus in the training corpus set of the general domain comprises:

determining the gradient of each piece of training corpus in the initial training corpus set of the specified domain on the general translation model;

calculating the average of the gradients of the pieces of training corpus in the initial training corpus set of the specified domain on the general translation model, as a domain gradient average;

determining the gradient of the target training corpus on the general translation model;

and calculating the distance between the gradient of the target training corpus on the general translation model and the domain gradient average, as the first target value corresponding to the target training corpus.

2. The domain data acquisition method according to claim 1, characterized by further comprising:

establishing a general language model by using the training corpus set of the general domain, and establishing a domain language model by using the initial training corpus set of the specified domain;

determining, based on the general language model and the domain language model, second target values corresponding to the training corpora in the training corpus set of the general domain, wherein each piece of training corpus corresponds to one second target value, and the second target value can represent the degree of relevance between the corresponding training corpus and the specified domain;

wherein the screening out training corpora of the specified domain from the training corpus set of the general domain based on the first target values corresponding to the training corpora in the training corpus set of the general domain comprises:

screening out training corpora of the specified domain from the training corpus set of the general domain based on the first target values and the second target values corresponding to the training corpora in the training corpus set of the general domain.

3. The method according to claim 1, wherein determining the first target values corresponding to the training corpora in the training corpus set of the general domain comprises:

screening candidate training corpora from the training corpus set of the general domain based on second target values corresponding to the training corpora in the training corpus set of the general domain;

determining the first target value corresponding to each screened candidate training corpus;

and screening out training corpora of the specified domain from the screened candidate training corpora according to the first target value corresponding to each screened candidate training corpus.

4. The domain data acquisition method according to claim 2 or 3, wherein each piece of training corpus in the training corpus set of the general domain and each piece of training corpus in the initial training corpus set of the specified domain includes: a source language text and a corresponding target language text;

the establishing a general language model by using the training corpus set of the general domain and establishing a domain language model by using the initial training corpus set of the specified domain comprises:

training a language model by using the source language texts in the training corpus set of the general domain, wherein the trained language model serves as a source-language-side general language model;

training a language model by using the target language texts in the training corpus set of the general domain, wherein the trained language model serves as a target-language-side general language model;

training a language model by using the source language texts in the initial training corpus set of the specified domain, wherein the trained language model serves as a source-language-side domain language model;

and training a language model by using the target language texts in the initial training corpus set of the specified domain, wherein the trained language model serves as a target-language-side domain language model.

5. The domain data acquisition method according to claim 2 or 3, wherein determining, based on the general language model and the domain language model, the second target value corresponding to a target training corpus in the training corpus set of the general domain comprises:

calculating posterior probabilities of the target training corpus on the general language model and on the domain language model respectively;

and determining the second target value corresponding to the target training corpus according to the calculated posterior probabilities.

6. The method of claim 5, wherein the calculating posterior probabilities of the target training corpus on the general language model and on the domain language model respectively comprises:

calculating posterior probabilities of the target training corpus on a source-language-side general language model, a source-language-side domain language model, a target-language-side general language model and a target-language-side domain language model respectively;

the determining the second target value corresponding to the target training corpus according to the calculated posterior probabilities comprises:

determining the score of the target training corpus on the source-language-side language models according to its posterior probability on the source-language-side general language model and its posterior probability on the source-language-side domain language model;

determining the score of the target training corpus on the target-language-side language models according to its posterior probability on the target-language-side general language model and its posterior probability on the target-language-side domain language model;

and taking the score of the target training corpus on the source-language-side language models and its score on the target-language-side language models as the second target value corresponding to the target training corpus.

7. A machine translation method for a specified domain, comprising:

acquiring a to-be-translated source language text of a specified domain;

inputting the to-be-translated source language text into a pre-established domain translation model to obtain a target language text corresponding to the to-be-translated source language text;

wherein the domain translation model is obtained by adjusting a general translation model using an initial training corpus set of the specified domain and training corpora acquired from a training corpus set of a general domain by the domain data acquisition method according to any one of claims 1-6, and the general translation model is trained using the training corpus set of the general domain.

8. The machine translation method according to claim 7, wherein the process of adjusting the general translation model using the initial training corpus set of the specified domain and the training corpora screened from the training corpus set of the general domain comprises:

mixing the training corpora in the initial training corpus set of the specified domain with the training corpora screened from the training corpus set of the general domain;

and adjusting the general translation model by using the mixed training corpora.

9. The machine translation method according to claim 8, wherein the process of adjusting the general translation model using the initial training corpus set of the specified domain and the training corpora screened from the training corpus set of the general domain further comprises:

further adjusting, by using the training corpora in the initial training corpus set of the specified domain, the translation model that has been adjusted using the mixed training corpora.

10. A domain data acquisition apparatus, comprising: a data acquisition module, a general translation model establishment module, a first target value determination module and a data screening module;

the data acquisition module is configured to acquire a training corpus set of a general domain and an initial training corpus set of a specified domain;

the general translation model establishment module is configured to establish a general translation model by using the training corpus set of the general domain;

the first target value determination module is configured to determine, based on the general translation model and the initial training corpus set of the specified domain, first target values corresponding to training corpora in the training corpus set of the general domain, wherein each piece of training corpus corresponds to one first target value, and the first target value can represent the degree of matching between the corresponding training corpus and the specified domain;

the data screening module is configured to screen out training corpora of the specified domain from the training corpus set of the general domain based on the first target values corresponding to the training corpora in the training corpus set of the general domain;

wherein, when determining, based on the general translation model and the initial training corpus set of the specified domain, the first target value corresponding to a target training corpus in the training corpus set of the general domain, the first target value determination module is specifically configured to: determine the gradient of each piece of training corpus in the initial training corpus set of the specified domain on the general translation model; calculate the average of the gradients of the pieces of training corpus in the initial training corpus set of the specified domain on the general translation model, as a domain gradient average; determine the gradient of the target training corpus on the general translation model; and calculate the distance between the gradient of the target training corpus on the general translation model and the domain gradient average, as the first target value corresponding to the target training corpus.

11. A machine translation apparatus for a specified domain, comprising: a source language text acquisition module and a translation module;

the source language text acquisition module is configured to acquire a to-be-translated source language text of a specified domain;

the translation module is configured to input the to-be-translated source language text into a pre-established domain translation model to obtain a target language text corresponding to the to-be-translated source language text;

wherein the domain translation model is obtained by adjusting a general translation model using an initial training corpus set of the specified domain and training corpora acquired from a training corpus set of a general domain by the domain data acquisition apparatus according to claim 10, and the general translation model is trained using the training corpus set of the general domain.

12. A domain data acquisition device, characterized by comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of the domain data acquisition method according to any one of claims 1 to 6.

13. A readable storage medium having stored thereon a computer program, which, when executed by a processor, implements the respective steps of the domain data acquisition method according to any one of claims 1 to 6.