CN112417896B - Domain data acquisition method, machine translation method and related equipment - Google Patents
Domain data acquisition method, machine translation method and related equipment
- Publication number
- CN112417896B (CN202011210710.9A)
- Authority
- CN
- China
- Prior art keywords
- training corpus
- general
- domain
- field
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
Description
Technical field
This application relates to the field of natural language processing, and in particular to a domain data acquisition method, a machine translation method, and related equipment.
Background
Communication across languages is a major challenge for exchanges between groups of different languages and ethnicities, and barrier-free communication at any time, in any place, and in any language has long been a human aspiration. The traditional language services industry addresses language barriers with escort interpreting, consecutive interpreting, and simultaneous interpreting, but limited manpower and high cost keep these services from meeting ordinary people's needs for cross-language communication.
Machine translation is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). It greatly reduces translation time, improves translation efficiency, meets the demand for translating time-sensitive content such as news as well as very large volumes of text, and sharply lowers labor costs. More importantly, it puts cross-language communication within everyone's reach, so that language differences no longer stand between people and the information and services they need.
Some translation tasks are domain specific. Most current machine translation methods, however, rely on a general-domain translation model, and translation accuracy is low when such a method is applied to text from a specific domain, so a translation model tailored to that domain is needed. Building one usually requires training samples from the domain, but in some domains such samples are hard to collect and therefore scarce, and with too few of them it is difficult to build a domain translation model that performs well. A method for obtaining training samples of a specific domain is therefore urgently needed.
Summary of the invention
In view of this, the present application provides a domain data acquisition method, a machine translation method, and related equipment, which select training samples of a specified domain from a general-domain training corpus, so that a domain translation model with high translation accuracy can be built from the specified-domain training samples and text in the specified domain can then be translated accurately. The technical solution is as follows:
A domain data acquisition method, comprising:
obtaining a general-domain training corpus and an initial training corpus of a specified domain;
building a general translation model using the general-domain training corpus;
determining, based on the general translation model and the initial training corpus of the specified domain, the first target values corresponding to the training samples in the general-domain training corpus, wherein each training sample corresponds to one first target value, and the first target value characterizes how well the corresponding training sample matches the specified domain;
selecting training samples of the specified domain from the general-domain training corpus based on the first target values corresponding to the training samples in the general-domain training corpus.
Optionally, the domain data acquisition method further comprises:
building a general language model using the general-domain training corpus, and building a domain language model using the initial training corpus of the specified domain;
determining, based on the general language model and the domain language model, the second target values corresponding to the training samples in the general-domain training corpus, wherein each training sample corresponds to one second target value, and the second target value characterizes how relevant the corresponding training sample is to the specified domain;
wherein selecting training samples of the specified domain from the general-domain training corpus based on the first target values corresponding to the training samples in the general-domain training corpus comprises:
selecting the training samples of the specified domain from the general-domain training corpus based on both the first target values and the second target values corresponding to the training samples in the general-domain training corpus.
Optionally, determining the first target values corresponding to the training samples in the general-domain training corpus comprises:
building a general language model using the general-domain training corpus, and building a domain language model using the initial training corpus of the specified domain;
determining, based on the general language model and the domain language model, the second target values corresponding to the training samples in the general-domain training corpus, wherein each training sample corresponds to one second target value, and the second target value characterizes how relevant the corresponding training sample is to the specified domain;
selecting candidate training samples from the general-domain training corpus based on the second target values corresponding to the training samples in the general-domain training corpus;
determining the first target value corresponding to each selected candidate training sample;
wherein selecting training samples of the specified domain from the general-domain training corpus based on the first target values corresponding to the training samples in the general-domain training corpus comprises:
selecting the training samples of the specified domain from the selected candidate training samples based on the first target value corresponding to each selected candidate training sample.
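The two-stage variant above (pre-filter by the second target value, then score only the candidates with the first target value) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `two_stage_select`, the threshold on the second target value, and the top-k cut on the first target value are all hypothetical, since the text above does not fix a concrete selection rule. The sketch further assumes that a higher second target value means more relevant and a lower first target value means a closer match.

```python
from typing import Callable, List, Sequence, TypeVar

Sample = TypeVar("Sample")


def two_stage_select(
    general_corpus: Sequence[Sample],
    second_target_value: Callable[[Sample], float],
    first_target_value: Callable[[Sample], float],
    relevance_threshold: float,
    keep_top_k: int,
) -> List[Sample]:
    """Pre-filter with the second target value, then rank by the first."""
    # Stage 1 (assumption: higher second target value = more relevant):
    # keep only samples that reach the relevance threshold.
    candidates = [
        s for s in general_corpus
        if second_target_value(s) >= relevance_threshold
    ]
    # Stage 2 (assumption: lower first target value = closer match):
    # the first target value is evaluated for the candidates only.
    ranked = sorted(candidates, key=first_target_value)
    return ranked[:keep_top_k]
```

Because the first target value requires a pass through the translation model, evaluating it only on the pre-filtered candidates is what makes this ordering of the two stages attractive.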
Optionally, determining, based on the general translation model and the initial training corpus of the specified domain, the first target value corresponding to a target training sample in the general-domain training corpus comprises:
determining the gradient, on the general translation model, of each training sample in the initial training corpus of the specified domain;
computing the average of the gradients, on the general translation model, of all training samples in the initial training corpus of the specified domain, and taking it as the domain gradient mean;
determining the gradient of the target training sample on the general translation model;
computing the distance between the gradient of the target training sample on the general translation model and the domain gradient mean, and taking it as the first target value corresponding to the target training sample.
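The four steps above can be sketched with a toy model standing in for the general translation model. Everything concrete here is an assumption made for illustration: `toy_gradient` uses a linear model with squared loss instead of a neural translation model, a (features, label) pair stands in for a source/target text pair, and Euclidean distance is chosen because the text above only says "distance". A smaller value is read here as a closer match to the specified domain.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# (features, label) is a stand-in for a (source text, target text) pair.
Example = Tuple[Sequence[float], float]


def toy_gradient(theta: Sequence[float], example: Example) -> Vector:
    """Gradient of 0.5 * (theta . x - y)^2 with respect to theta."""
    x, y = example
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi for xi in x]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def first_target_values(
    theta: Sequence[float],
    domain_initial: Sequence[Example],
    general_corpus: Sequence[Example],
    grad_fn: Callable[[Sequence[float], Example], Vector] = toy_gradient,
) -> List[float]:
    # Per-sample gradients of the initial specified-domain corpus,
    # averaged into the domain gradient mean.
    domain_mean = mean_vector([grad_fn(theta, ex) for ex in domain_initial])
    # For each general-domain sample: its own gradient, then its
    # distance to the domain gradient mean.
    return [euclidean(grad_fn(theta, ex), domain_mean) for ex in general_corpus]
```

With a real translation model, `grad_fn` would be replaced by a per-sample backward pass, possibly restricted to a subset of the parameters to keep the gradient vectors small.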
Optionally, each training sample in the general-domain training corpus and each training sample in the initial training corpus of the specified domain comprises a source-language text and the corresponding target-language text;
building a general language model using the general-domain training corpus, and building a domain language model using the initial training corpus of the specified domain, comprises:
training a language model on the source-language texts in the general-domain training corpus, the trained language model serving as the source-side general language model;
training a language model on the target-language texts in the general-domain training corpus, the trained language model serving as the target-side general language model;
training a language model on the source-language texts in the initial training corpus of the specified domain, the trained language model serving as the source-side domain language model;
training a language model on the target-language texts in the initial training corpus of the specified domain, the trained language model serving as the target-side domain language model.
Optionally, determining, based on the general language model and the domain language model, the second target value corresponding to a target training sample in the general-domain training corpus comprises:
computing the posterior probabilities of the target training sample on the general language model and on the domain language model, respectively;
determining the second target value corresponding to the target training sample according to the computed posterior probabilities.
Optionally, computing the posterior probabilities of the target training sample on the general language model and on the domain language model, respectively, comprises:
computing the posterior probabilities of the target training sample on the source-side general language model, the source-side domain language model, the target-side general language model, and the target-side domain language model, respectively;
determining the second target value corresponding to the target training sample according to the computed posterior probabilities comprises:
determining the score of the target training sample on the source-side language models according to its posterior probability on the source-side general language model and its posterior probability on the source-side domain language model;
determining the score of the target training sample on the target-side language models according to its posterior probability on the target-side general language model and its posterior probability on the target-side domain language model;
taking the score of the target training sample on the source-side language models and its score on the target-side language models as the second target value corresponding to the target training sample.
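A minimal sketch of the source-side and target-side scores follows. The text above does not state how the two posterior probabilities are combined, so the combination used here is an assumption: the difference between the length-normalized log-probability under the domain language model and under the general language model, in the spirit of cross-entropy-difference data selection. The `UnigramLM` class with add-one smoothing is a hypothetical stand-in for whatever language model is actually trained.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


class UnigramLM:
    """Toy add-one-smoothed unigram language model over whitespace tokens."""

    def __init__(self, sentences: Iterable[str]) -> None:
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        # One extra vocabulary slot reserves probability mass for unseen tokens.
        self.vocab = len(self.counts) + 1

    def logprob(self, sentence: str) -> float:
        """Average per-token log-probability of the sentence."""
        tokens = sentence.split()
        total = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab))
            for t in tokens
        )
        return total / max(len(tokens), 1)


def side_score(text: str, domain_lm: UnigramLM, general_lm: UnigramLM) -> float:
    # Assumed combination: positive when the domain model finds the text
    # more likely than the general model does.
    return domain_lm.logprob(text) - general_lm.logprob(text)


def second_target_value(
    pair: Tuple[str, str],
    src_domain: UnigramLM,
    src_general: UnigramLM,
    tgt_domain: UnigramLM,
    tgt_general: UnigramLM,
) -> Tuple[float, float]:
    """Return (source-side score, target-side score) for one text pair."""
    src, tgt = pair
    return (
        side_score(src, src_domain, src_general),
        side_score(tgt, tgt_domain, tgt_general),
    )
```

Keeping the two scores as a pair mirrors the wording above, which takes both the source-side and the target-side score as the second target value rather than merging them into one number.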
A machine translation method for a specified domain, comprising:
obtaining a source-language text to be translated in the specified domain;
inputting the source-language text to be translated into a pre-built domain translation model to obtain the target-language text corresponding to the source-language text to be translated;
wherein the domain translation model is obtained by adjusting a general translation model using the initial training corpus of the specified domain together with the training samples obtained from the general-domain training corpus by any of the domain data acquisition methods described above, and the general translation model is obtained by training on the general-domain training corpus.
Optionally, the process of adjusting the general translation model using the initial training corpus of the specified domain and the training samples selected from the general-domain training corpus comprises:
mixing the training samples in the initial training corpus of the specified domain with the training samples selected from the general-domain training corpus;
adjusting the general translation model using the mixed training samples.
Optionally, the process of adjusting the general translation model using the initial training corpus of the specified domain and the training samples selected from the general-domain training corpus further comprises:
further adjusting the translation model that has been adjusted with the mixed training samples, using the training samples in the initial training corpus of the specified domain.
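The two adjustment stages described above (first on the mixed data, then on the initial specified-domain data alone) can be laid out as a simple data schedule. This sketch only prepares the data for each stage; the function name `build_finetune_stages`, the stage labels, and the seeded shuffle are hypothetical choices, and the actual fine-tuning of the translation model is left out.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")


def build_finetune_stages(
    domain_initial: Sequence[Sample],
    selected_from_general: Sequence[Sample],
    seed: int = 0,
) -> List[Tuple[str, List[Sample]]]:
    """Return the training data for each adjustment stage, in order."""
    # Stage 1: initial specified-domain samples mixed with the samples
    # selected from the general-domain corpus.
    mixed = list(domain_initial) + list(selected_from_general)
    random.Random(seed).shuffle(mixed)  # seeded so the order is reproducible
    # Stage 2: the initial specified-domain samples alone, used to adjust
    # the stage-1 model further.
    return [("mixed", mixed), ("domain_only", list(domain_initial))]
```

A training loop would iterate over the returned stages in order, starting from the general translation model and carrying the adjusted parameters from one stage into the next.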
A domain data acquisition apparatus, comprising: a data acquisition module, a general translation model building module, a first target value determination module, and a data screening module;
the data acquisition module is configured to obtain a general-domain training corpus and an initial training corpus of a specified domain;
the general translation model building module is configured to build a general translation model using the general-domain training corpus;
the first target value determination module is configured to determine, based on the general translation model and the initial training corpus of the specified domain, the first target values corresponding to the training samples in the general-domain training corpus, wherein each training sample corresponds to one first target value, and the first target value characterizes how well the corresponding training sample matches the specified domain;
the data screening module is configured to select the training samples of the specified domain from the general-domain training corpus based on the first target values corresponding to the training samples in the general-domain training corpus.
Optionally, when determining, based on the general translation model and the initial training corpus of the specified domain, the first target value corresponding to a target training sample in the general-domain training corpus, the first target value determination module is specifically configured to: determine the gradient, on the general translation model, of each training sample in the initial training corpus of the specified domain; compute the average of those gradients and take it as the domain gradient mean; determine the gradient of the target training sample on the general translation model; and compute the distance between the gradient of the target training sample on the general translation model and the domain gradient mean, taking it as the first target value corresponding to the target training sample.
A machine translation apparatus for a specified domain, comprising: a source-language text acquisition module and a translation module;
the source-language text acquisition module is configured to obtain a source-language text to be translated in the specified domain;
the translation module is configured to input the source-language text to be translated into a pre-built domain translation model to obtain the target-language text corresponding to the source-language text to be translated;
wherein the domain translation model is obtained by adjusting a general translation model using the initial training corpus of the specified domain together with the training samples selected from the general-domain training corpus by any of the domain data acquisition apparatuses described above, and the general translation model is obtained by training on the general-domain training corpus.
A domain data screening device, comprising: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of any of the domain data acquisition methods described above.
A readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of any of the domain data acquisition methods described above are implemented.
As the above solution shows, the domain data acquisition method provided by this application first obtains a general-domain training corpus and an initial training corpus of a specified domain, then builds a general translation model using the general-domain training corpus, then determines the first target values corresponding to the training samples in the general-domain training corpus based on the general translation model and the initial training corpus of the specified domain, and finally selects training samples of the specified domain from the general-domain training corpus based on those first target values. The method thus obtains specified-domain training samples from a general-domain training corpus. On this basis, this application also provides a machine translation method that translates specified-domain text with a pre-built domain translation model. Because the domain translation model is obtained by fine-tuning the general translation model on a large number of specified-domain training samples, it is adapted to the specified domain, and translating specified-domain text with it yields comparatively accurate results.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic flowchart of a domain data acquisition method provided by an embodiment of the present application;
Figure 2 is a schematic flowchart of another domain data acquisition method provided by an embodiment of the present application;
Figure 3 is a schematic flowchart of yet another domain data acquisition method provided by an embodiment of the present application;
Figure 4 is a schematic flowchart, provided by an embodiment of the present application, of adjusting the general translation model using the initial training corpus of the specified domain and the training samples selected from the general-domain training corpus;
Figure 5 is a schematic diagram, provided by an embodiment of the present application, of using a KL regularization constraint to keep the model from drifting;
Figure 6 is a schematic structural diagram of a domain data acquisition apparatus provided by an embodiment of the present application;
Figure 7 is a schematic structural diagram of a domain data acquisition device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Because training samples in a specific domain are hard to collect, the inventor realized that, to obtain enough of them, training samples of the specific domain can be selected from general-domain training data, so that the selected samples and the samples collected from the specific domain together serve as the training data for building the domain translation model.
To select training samples of a specific domain from general-domain training data, the inventor conducted in-depth research and arrived at a solution whose general idea is as follows: determine how well each training sample in the general-domain training corpus matches the specific domain, and select the training samples of the specific domain from the general-domain training corpus based on that degree of match.
On the basis of this domain data acquisition solution, the inventor also provides a machine translation method for a specific domain. Its general idea is to build a general-domain translation model from the general-domain training corpus, and then fine-tune that model with the training samples collected in the specific domain and the training samples selected from the general-domain training corpus, yielding a translation model for the specific domain. Inputting a source-language text of the specific domain into this model produces the corresponding target-language text.
The domain data acquisition method and the machine translation method provided by this application can be applied to a terminal with data processing capability, to a single server, or to a server cluster composed of multiple servers. The following embodiments describe the two methods.
First embodiment
Referring to Figure 1, which shows a schematic flowchart of the domain data acquisition method provided by an embodiment of the present application, the method may include:
Step S101: Obtain a general-domain training corpus and an initial training corpus of a specified domain.
Each training sample in the general-domain training corpus and each training sample in the initial training corpus of the specified domain comprises a source-language text and its corresponding target-language text; that is, each training sample is a text pair.
It should be noted that the general-domain training corpus contains training samples from multiple domains mixed together, whereas the training samples in the initial training corpus of the specified domain are collected directly from the specified domain. In this embodiment the general-domain training corpus includes training samples of the specified domain, and the aim of this embodiment is to select those samples from the general-domain training corpus, so that the selected samples together with the samples in the initial training corpus of the specified domain can form the training corpus of the specified domain.
Step S102: Build a general translation model using the general-domain training corpus.
Specifically, a translation model is trained on the training samples in the general-domain training corpus, and the trained translation model is the general translation model. Training a translation model on a general-domain training corpus is existing technology and is not described further in this embodiment.
Step S103: Based on the general translation model and the initial training corpus of the specified domain, determine the first target value corresponding to each training sample in the general-domain training corpus.
The first target value characterizes how well the corresponding training sample matches the specified domain.
It should be noted that the better a training sample matches the specified domain, the more likely it is to be a training sample of the specified domain; conversely, the worse it matches, the less likely it is to be one.
In deep learning and machine learning, training samples with similar features or characteristics tend to produce consistent gradients during backpropagation, and training samples from the same domain necessarily share similar features. In other words, the backpropagated gradients of training samples from the same domain are consistent on the model. On this basis, the inventor realized that a gradient distance can be used to measure how well a training sample matches the specified domain.
In view of this, the process of determining, based on the general translation model and the initial training corpus of the specified domain, the first target value corresponding to a target training sample in the general-domain training corpus (this application calls a training sample whose target value is to be determined a "target training sample") may include:
步骤a1、确定指定领域的初始训练语料集中每条训练语料在通用翻译模 型上的梯度。Step a1: Determine the gradient of each training corpus in the initial training corpus in the specified field on the general translation model.
Specifically, each piece of training corpus in the initial training corpus set of the specified field is input into the general translation model, and its gradient on the general translation model is calculated; that is, the gradient operator ∇ is applied, with respect to the model parameters θ, to the model's loss on that piece of training corpus.
Here θ refers to the model parameters of the general translation model. It should be noted that θ may be all of the parameters of the general translation model or only a subset of them; optionally, θ may be the parameters of the last layer of the general translation model.
Step a2: Calculate the average of the gradients, on the general translation model, of all pieces of training corpus in the initial training corpus set of the specified field, and use it as the domain gradient average.
步骤a3、确定目标训练语料在通用翻译模型上的梯度。Step a3: Determine the gradient of the target training corpus on the general translation model.
目标训练语料在通用翻译模型上的梯度的确定方式与指定领域的初始训 练语料集中训练语料在通用翻译模型上的梯度的确定方式相同。The gradient of the target training corpus on the general translation model is determined in the same way as the gradient of the initial training corpus in the specified field on the general translation model.
Step a4: Calculate the distance between the gradient of the target training corpus on the general translation model and the domain gradient average, and use it as the first target value corresponding to the target training corpus.
That is, writing $g$ for the gradient of the target training corpus on the general translation model and $\bar{g}$ for the domain gradient average, the first target value $S_d$ corresponding to the target training corpus is:
$$S_d = \mathrm{dist}(g, \bar{g})$$
where $\mathrm{dist}(\cdot,\cdot)$ computes the distance between two tensors; this distance may be, but is not limited to, the cosine distance.
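Steps a1 to a4 can be illustrated with a minimal sketch. It assumes that the per-corpus gradients have already been computed by whatever training framework is in use and flattened into plain lists of floats; the function names are hypothetical. It also assumes cosine similarity as the distance measure, so that a larger value means a closer match, which is the reading under which the highest-scoring training corpus is selected later.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of equally sized gradient vectors (step a2)."""
    if not vectors:
        raise ValueError("need at least one gradient vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def first_target_value(target_grad: Vector, domain_grads: List[Vector]) -> float:
    """Steps a2 and a4: compare the target gradient with the domain gradient average.

    Assumption: cosine similarity stands in for dist(., .), so higher is closer.
    """
    return cosine_similarity(target_grad, mean_vector(domain_grads))
```

With this convention, a target gradient pointing in the same direction as the domain gradient average scores close to 1, and an orthogonal one scores 0.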
步骤S104:基于通用领域的训练语料集中每条训练语料对应的第一目标 值,从通用领域的训练语料集中筛选出指定领域的训练语料。Step S104: Based on the first target value corresponding to each training corpus in the training corpus in the general field, select the training corpus in the specified field from the training corpus in the general field.
There are several ways to select the training corpus of the specified field from the training corpus set of the general field based on the first target value corresponding to each piece of training corpus. In one possible implementation, the training corpora in the general-field set are sorted in descending order of first target value and the first N are taken as the training corpus of the specified field; equivalently, they may be sorted in ascending order of first target value and the last N taken. The value of N can be set according to the specific application. In another possible implementation, a threshold T is set, and the training corpora whose first target value is greater than T are selected from the general-field set as the training corpus of the specified field.
The domain data acquisition method provided by this embodiment determines, based on the general translation model and the initial training corpus set of the specified field, the first target value corresponding to each piece of training corpus in the training corpus set of the general field. Because the first target value represents how well the corresponding training corpus matches the specified field, it can be used to select from the general-field set the training corpora that match the specified field closely. The method can therefore select training corpus of the specified field from the general-field set fairly accurately. In addition, because the selection operates at the level of model gradients, the selected training corpus is directly coupled to the model when it is later used to build a translation model for the specified field, which can effectively improve that model's translation ability in the field.
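The two selection strategies for step S104 can be sketched as follows. This is only an illustration: the dictionary mapping a corpus identifier to its first target value, and both function names, are assumptions rather than part of the described method.

```python
from typing import Dict, List


def select_top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Sort by first target value, highest first, and keep the first n identifiers."""
    ranked = sorted(scores, key=lambda key: scores[key], reverse=True)
    return ranked[:n]


def select_above_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep every identifier whose first target value is greater than the threshold."""
    return [key for key, value in scores.items() if value > threshold]
```

Top-N selection fixes the amount of selected data in advance, whereas the threshold variant lets the amount vary with how many training corpora actually match the specified field.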
第二实施例Second embodiment
为了能够提高训练语料的筛选准确度,本实施例提供了另一种领域数据 获取方法,请参阅图2,示出了该领域数据获取方法的流程示意图,该方法可 以包括:In order to improve the screening accuracy of training corpus, this embodiment provides another domain data acquisition method. Please refer to Figure 2, which shows a schematic flow chart of the domain data acquisition method. The method may include:
步骤S201:获取通用领域的训练语料集和指定领域的初始训练语料集。Step S201: Obtain a training corpus in a general field and an initial training corpus in a specified field.
其中,通用领域的训练语料集中包括混合在一起的多个领域的训练语料, 指定领域的初始训练语料集为在指定领域直接收集而来的训练语料。Among them, the training corpus in a general field includes training corpus in multiple fields mixed together, and the initial training corpus in a specified field is training corpus collected directly in the specified field.
步骤S202a:利用通用领域的训练语料集建立通用翻译模型。Step S202a: Establish a general translation model using training corpus in general fields.
具体的,利用通用领域的训练语料集中的训练语料训练翻译模型,训练 得到的翻译模型即为通用翻译模型。Specifically, the translation model is trained using the training corpus in the general domain training corpus, and the trained translation model is the universal translation model.
步骤S202b:利用通用领域的训练语料集建立通用语言模型,并利用指定 领域的初始训练语料集建立领域语言模型。Step S202b: Use the training corpus in the general field to establish a general language model, and use the initial training corpus in the specified field to establish a domain language model.
其中,利用通用领域的训练语料集建立通用语言模型的过程可以包括: 利用通用领域的训练语料集中的源语言文本训练语言模型,训练得到的语言 模型作为源语言端通用语言模型;利用通用领域的训练语料集中的目标语言 文本训练语言模型,训练得到的语言模型作为目标语言端通用语言模型。即, 本实施例中的通用语言模型包括源语言端通用语言模型和目标语言端通用语 言模型。Among them, the process of establishing a general language model using a training corpus in a general field may include: using the source language text in the training corpus in the general field to train the language model, and the trained language model serves as a source language side general language model; using the source language text in the general field The target language text in the training corpus is used to train the language model, and the trained language model is used as a general language model on the target language side. That is, the universal language model in this embodiment includes a source language side universal language model and a target language side universal language model.
其中,利用指定领域的初始训练语料集建立领域语言模型的过程可以包 括:利用指定领域的初始训练语料集中的源语言文本训练语言模型,训练得 到的语言模型作为源语言端领域语言模型;利用指定领域的初始训练语料集 中的目标语言文本训练语言模型,训练得到的语言模型作为目标语言端领域 语言模型。即,本实施例中的领域语言模型包括源语言端领域语言模型和目 标语言端领域语言模型。Among them, the process of establishing a domain language model using the initial training corpus of the specified field may include: using the source language text in the initial training corpus of the specified field to train the language model, and the trained language model is used as the source language side domain language model; using the specified The target language text in the domain's initial training corpus is used to train the language model, and the trained language model is used as the target language side domain language model. That is, the domain language model in this embodiment includes a source language side domain language model and a target language side domain language model.
步骤S203a:基于通用翻译模型和指定领域的初始训练语料集,确定通用 领域的训练语料集中每条训练语料对应的第一目标值。Step S203a: Based on the general translation model and the initial training corpus in the specified field, determine the first target value corresponding to each training corpus in the training corpus in the general field.
其中,第一目标值能够表征对应的训练语料与指定领域的匹配程度。Among them, the first target value can represent the degree of matching between the corresponding training corpus and the specified field.
确定通用领域的训练语料集中每条训练语料对应的第一目标值的过程可 参见第一实施例中的步骤a1~步骤a4,本实施例在此不做赘述。The process of determining the first target value corresponding to each piece of training corpus in the general field training corpus set can be referred to steps a1 to a4 in the first embodiment, which will not be described in detail in this embodiment.
步骤S203b:基于通用语言模型和领域语言模型,确定通用领域的训练语 料集中每条训练语料对应的第二目标值。Step S203b: Based on the general language model and the domain language model, determine the second target value corresponding to each training corpus in the general domain training corpus.
其中,第二目标值能够表征对应的训练语料与指定领域的相关程度。Among them, the second target value can represent the degree of correlation between the corresponding training corpus and the specified field.
It should be noted that the higher the degree of correlation between a training corpus and the specified field, the more likely it is that the training corpus belongs to the specified field; conversely, the lower the degree of correlation, the less likely it is that the training corpus belongs to the specified field.
Specifically, the process of determining, based on the general language model and the domain language model, the second target value corresponding to a target training corpus in the training corpus set of the general field may include:
步骤b1、计算目标训练语料分别在通用语言模型和领域语言模型上的后 验概率。Step b1: Calculate the posterior probabilities of the target training corpus on the general language model and domain language model respectively.
具体的,计算所述目标训练语料分别在源语言端通用语言模型、源语言 端领域语言模型、目标语言端通用语言模型、目标语言端领域语言模型上的 后验概率。Specifically, the posterior probabilities of the target training corpus on the source language side general language model, the source language side domain language model, the target language side general language model, and the target language side domain language model are calculated respectively.
步骤b2、根据确定出的后验概率,确定目标训练语料对应的第二目标值。Step b2: Determine the second target value corresponding to the target training corpus according to the determined posterior probability.
具体的,根据确定出的后验概率,确定目标训练语料对应的第二目标值 的过程包括:Specifically, based on the determined posterior probability, the process of determining the second target value corresponding to the target training corpus includes:
根据目标训练语料分别在源语言端通用语言模型上的后验概率和源语言 端领域语言模型上的后验概率,确定目标训练语料在源语言端语言模型上的 得分;根据目标训练语料分别在目标语言端通用语言模型上的后验概率和目 标语言端领域语言模型上的后验概率,确定目标训练语料在目标语言端语言 模型上的得分;将目标训练语料在源语言端语言模型上的得分和目标训练语 料在目标语言端语言模型上的得分作为目标训练语料对应的第二目标值。According to the posterior probability of the target training corpus on the source language side general language model and the posterior probability on the source language side domain language model, the score of the target training corpus on the source language side language model is determined; according to the target training corpus, respectively The posterior probability on the general language model on the target language side and the posterior probability on the domain language model on the target language side are used to determine the score of the target training corpus on the target language side language model; the score of the target training corpus on the source language side language model is determined. The score and the score of the target training corpus on the target language side language model are used as the second target value corresponding to the target training corpus.
More specifically, the score of the target training corpus on the source-language-side language model is computed from two quantities: the posterior probability of the target training corpus on the source-language-side general language model and its posterior probability on the source-language-side domain language model.
Likewise, the score of the target training corpus on the target-language-side language model is computed from its posterior probability on the target-language-side general language model and its posterior probability on the target-language-side domain language model.
Here x is the source-language text in the target training corpus, and y is the target-language text in the target training corpus.
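The scoring formulas themselves are not reproduced in this text, so the following is only a sketch under stated assumptions. It assumes the score is the length-normalized difference between the log-probability under the domain language model and the log-probability under the general language model, and it uses a toy add-one-smoothed unigram model (`UnigramLM`, a hypothetical stand-in) in place of a real language model. The same function would be applied once to the source-language text x and once to the target-language text y, each with its own pair of models.

```python
import math
from collections import Counter
from typing import Iterable, List


class UnigramLM:
    """Toy add-one-smoothed unigram model standing in for a real language model."""

    def __init__(self, sentences: Iterable[List[str]]) -> None:
        self.counts = Counter(tok for sent in sentences for tok in sent)
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen tokens

    def log_prob(self, sentence: List[str]) -> float:
        """Sum of smoothed per-token log-probabilities."""
        denom = self.total + self.vocab_size
        return sum(math.log((self.counts[tok] + 1) / denom) for tok in sentence)


def lm_score(sentence: List[str], domain_lm: UnigramLM, general_lm: UnigramLM) -> float:
    """Assumed score: per-token log-probability gain of the domain LM over the general LM."""
    if not sentence:
        return 0.0
    gain = domain_lm.log_prob(sentence) - general_lm.log_prob(sentence)
    return gain / len(sentence)
```

Under this assumed form, a positive score means the domain language model finds the text more probable than the general one does, which is the sense in which a higher second target value indicates stronger correlation with the specified field.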
步骤S204:以通用领域的训练语料集中每条训练语料对应的第一目标值 和第二目标值为依据,从通用领域的训练语料集中筛选出指定领域的训练语 料。Step S204: Based on the first target value and the second target value corresponding to each training corpus in the general field training corpus set, select the training corpus in the specified field from the general field training corpus set.
在一种可能的实现方式中,对于通用领域的训练语料集中的每条训练语 料:可将该训练语料对应的第一目标值作为第一个维度的得分,将该训练语 料在源语言端语言模型上的得分作为第二个维度的得分,将该训练语料在目 标语言端语言模型上的得分作为第三个维度的得分,将这三个维度的得分融 合,融合后的得分作为该训练语料对应的目标得分,在得到通用领域的训练 语料集中每条训练语料对应的目标得分后,以通用领域的训练语料集中的各 条训练语料分别对应的目标得分为依据,从通用领域的训练语料集中筛选出 指定领域的训练语料,具体的,可从通用领域的训练语料集中筛选目标得分 最高的N条训练语料,作为指定领域的训练语料。In a possible implementation, for each training corpus in the general field training corpus: the first target value corresponding to the training corpus can be used as the score of the first dimension, and the training corpus can be used in the source language The score on the model is used as the score of the second dimension, the score of the training corpus on the target language side language model is used as the score of the third dimension, the scores of these three dimensions are fused, and the fused score is used as the training corpus The corresponding target score, after obtaining the target score corresponding to each training corpus in the general field training corpus set, is based on the target score corresponding to each training corpus in the general field training corpus set, from the general field training corpus set Filter out the training corpus in the specified field. Specifically, the N training corpus with the highest target score can be selected from the training corpus in the general field as the training corpus in the specified field.
Suppose the first target value corresponding to a piece of training corpus is $S_d$, its score on the source-language-side language model is $S_{src}$, and its score on the target-language-side language model is $S_{tgt}$. In one possible implementation, $S_d$, $S_{src}$ and $S_{tgt}$ are summed directly and the sum is used as the target score of that training corpus. In another possible implementation, weights α, β and γ are determined in advance for $S_d$, $S_{src}$ and $S_{tgt}$ respectively, and the weighted sum $S$ is used as the target score of that training corpus, that is:
$$S = \alpha S_d + \beta S_{src} + \gamma S_{tgt}$$
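The fusion of the three scores can be sketched as a single weighted sum, where the plain sum is the special case in which all three weights equal 1. The function and parameter names are hypothetical, and the default weights are an assumption made only so that the unweighted variant falls out naturally.

```python
def fused_score(
    s_d: float,
    s_src: float,
    s_tgt: float,
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 1.0,
) -> float:
    """Weighted sum of the gradient-based score and the two language-model scores.

    With the default weights this reduces to the direct sum of the three scores.
    """
    return alpha * s_d + beta * s_src + gamma * s_tgt
```

Because the three scores may live on different scales, the weights also serve to balance them; how they are chosen is left open by the text.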
本申请实施例提供的领域数据获取方法,可基于通用翻译模型和指定领 域的初始训练语料集,确定通用领域的训练语料集中每条训练语料对应的第 一目标值,还可基于通用语言模型和领域语言模型,确定通用领域的训练语 料集中每条训练语料对应的第二目标值,由于第一目标值能够表征对应训练 语料与指定领域的匹配程度,第二目标值能够表征对应训练语料与指定领域 的相关程度,因此,以通用领域的训练语料集中每条训练语料对应的第一目 标值和第二目标值为依据,能够从通用领域的训练语料集中更准确地筛选出 指定领域的训练语料。The domain data acquisition method provided by the embodiments of the present application can determine the first target value corresponding to each training corpus in the general domain training corpus based on the general translation model and the initial training corpus in the specified field. It can also be based on the general language model and The domain language model determines the second target value corresponding to each training corpus in the general domain training corpus. Since the first target value can characterize the matching degree of the corresponding training corpus with the specified domain, the second target value can characterize the matching degree of the corresponding training corpus with the specified domain. The degree of relevance of the field. Therefore, based on the first target value and the second target value corresponding to each training corpus in the general field training corpus, the training corpus in the specified field can be more accurately screened out from the general field training corpus. .
第三实施例Third embodiment
为了能够提高训练语料的筛选效率和筛选准确度,本实施例提供了再一 种领域数据获取方法,请参阅图3,示出了该领域数据获取方法的流程示意图, 该方法可以包括:In order to improve the screening efficiency and screening accuracy of training corpus, this embodiment provides yet another domain data acquisition method. Please refer to Figure 3, which shows a schematic flow chart of the domain data acquisition method. The method may include:
步骤S301:获取通用领域的训练语料集和指定领域的初始训练语料集。Step S301: Obtain a training corpus in a general field and an initial training corpus in a specified field.
其中,通用领域的训练语料集中包括混合在一起的多个领域的训练语料, 指定领域的初始训练语料集为在指定领域直接收集而来的训练语料。Among them, the training corpus in a general field includes training corpus in multiple fields mixed together, and the initial training corpus in a specified field is training corpus collected directly in the specified field.
步骤S302:利用通用领域的训练语料集建立通用语言模型,并利用指定 领域的初始训练语料集建立领域语言模型。Step S302: Use the training corpus in the general field to establish a general language model, and use the initial training corpus in the specified field to establish a domain language model.
步骤S303:基于通用语言模型和领域语言模型,确定通用领域的训练语 料集中每条训练语料对应的第二目标值。Step S303: Based on the general language model and the domain language model, determine the second target value corresponding to each training corpus in the general domain training corpus.
其中,第二目标值能够表征对应的训练语料与指定领域的相关程度。Among them, the second target value can represent the degree of correlation between the corresponding training corpus and the specified field.
本步骤的具体实现过程可参见第二实施例中“步骤S203b:基于通用语言 模型和领域语言模型,确定通用领域的训练语料集中每条训练语料对应的第 二目标值”的具体实现过程,本实施例在此不做赘述。For the specific implementation process of this step, please refer to the specific implementation process of "Step S203b: Based on the general language model and the domain language model, determine the second target value corresponding to each training corpus in the general domain training corpus set" in the second embodiment. The embodiments will not be described in detail here.
步骤S304:以通用领域的训练语料集中训练语料对应的第二目标值为依 据,从通用领域的训练语料集中筛选候选训练语料,组成候选训练语料集。Step S304: Based on the second target value corresponding to the training corpus in the training corpus in the general field, select candidate training corpus from the training corpus in the general field to form a candidate training corpus.
在一种可能的实现方式中,可按第二目标值从大到小的顺序对通用领域 的训练语料集中的各训练语料进行排序,取前M个训练语料作为候选训练语 料,组成候选训练语料集(当然,也可按第二目标值从小到大的顺序对通用 领域的训练语料集中的各训练语料进行排序,取后M个训练语料作为候选训 练语料,组成候选训练语料集);在另一种可能的实现方式中,可设定一阈 值T1,将通用领域的训练语料集中第二目标值大于阈值T1的训练语料作为 候选训练语料,组成候选训练语料集。In a possible implementation, each training corpus in the training corpus in the general field can be sorted according to the order of the second target value from large to small, and the top M training corpus is taken as the candidate training corpus to form the candidate training corpus. set (of course, you can also sort the training corpus in the general field training corpus in ascending order of the second target value, and take the last M training corpus as candidate training corpus to form a candidate training corpus); in another In one possible implementation, a threshold T1 can be set, and the training corpus with a second target value greater than the threshold T1 in the general field training corpus is used as candidate training corpus to form a candidate training corpus set.
步骤S305:利用通用领域的训练语料集建立通用翻译模型。Step S305: Establish a general translation model using training corpus in general fields.
具体的,利用通用领域的训练语料集中的训练语料训练翻译模型,训练 得到的翻译模型即为通用翻译模型。Specifically, the translation model is trained using the training corpus in the general domain training corpus, and the trained translation model is the universal translation model.
需要说明的是,本实施例并不限定步骤S305在步骤S304之后执行,只 要步骤S305在步骤S301之后,步骤S306之前执行即可。It should be noted that this embodiment does not limit step S305 to be executed after step S304, as long as step S305 is executed after step S301 and before step S306.
步骤S306:基于通用翻译模型和指定领域的初始训练语料集,确定候选 训练语料集中每条训练语料对应的第一目标值。Step S306: Based on the universal translation model and the initial training corpus in the specified field, determine the first target value corresponding to each training corpus in the candidate training corpus.
具体的,首先确定指定领域的初始训练语料集中每条训练语料在通用翻 译模型上的梯度,然后计算指定领域的初始训练语料集中各条训练语料在通 用翻译模型上的梯度的平均值,作为领域梯度平均值,接着,确定候选训练 语料集中的每条训练语料在通用翻译模型上的梯度,最后,针对候选训练语 料集中的每条训练语料,计算其在通用翻译模型上的梯度与领域梯度平均值 的距离,作为其对应的第一目标值,从而得到候选训练语料集中每条训练语 料对应的第一目标值。其中,确定一条训练语料在通用翻译模型上的梯度的 过程可参见第一实施例,本实施例在此不做赘述。Specifically, first determine the gradient of each training corpus on the universal translation model in the initial training corpus of the specified field, and then calculate the average of the gradient of each training corpus on the universal translation model in the initial training corpus of the specified field, as the domain Gradient average, then, determine the gradient of each training corpus in the candidate training corpus on the general translation model, and finally, for each training corpus in the candidate training corpus, calculate its gradient on the general translation model and the domain gradient average The distance between the values is used as its corresponding first target value, thereby obtaining the first target value corresponding to each training corpus in the candidate training corpus set. The process of determining the gradient of a piece of training corpus on the universal translation model can be found in the first embodiment, and will not be described again in this embodiment.
步骤S307:以候选训练语料集中每条训练语料对应的第一目标值为依据, 从候选训练语料集中筛选出指定领域的训练语料。Step S307: Based on the first target value corresponding to each training corpus in the candidate training corpus set, select the training corpus in the specified field from the candidate training corpus set.
可选的,可按第一目标值由大到小的顺序对候选训练语料集中的各训练 语料进行排序,取前N(N的大小可根据具体应用情况设定)个训练语料, 作为指定领域的训练语料,当然,也可按第一目标值由小到大的顺序对候选 训练语料集中的各训练语料进行排序,取后N个训练语料作为指定领域的训 练语料,还可设置一阈值T2,从候选训练语料集中筛选第一目标值大于阈值 T2的训练语料作为指定领域的训练语料。Optionally, you can sort each training corpus in the candidate training corpus set in descending order of the first target value, and take the top N (the size of N can be set according to the specific application) training corpus as the designated field. training corpus. Of course, you can also sort the training corpus in the candidate training corpus in ascending order of the first target value, and take the last N training corpus as the training corpus in the specified field. You can also set a threshold T2 , select the training corpus whose first target value is greater than the threshold T2 from the candidate training corpus set as the training corpus in the specified field.
考虑到梯度的计算相对较费时,为了提高筛选效率,本实施例提供的领 域数据获取方法首先基于通用领域的训练语料集中每条训练语料对应的第二 目标值(语言模型得分),从通用领域的训练语料集中筛选候选训练语料, 然后再基于候选训练语料集中每条候选训练语料对应的第一目标值(梯度距 离),从候选训练语料集中筛选出指定领域的训练语料。本实施例提供的领 域数据获取方法不但可从通用领域的训练语料集中准确地筛选出指定领域的 训练语料,而且,由于只需要针对候选训练语料计算梯度,而不需要对所有 的训练语料计算梯度,因此,降低了运算量,从而提高了训练语料的筛选效 率。Considering that the calculation of the gradient is relatively time-consuming, in order to improve the screening efficiency, the domain data acquisition method provided in this embodiment is first based on the second target value (language model score) corresponding to each training corpus in the general domain training corpus, from the general domain Candidate training corpus is screened from the training corpus set, and then based on the first target value (gradient distance) corresponding to each candidate training corpus in the candidate training corpus set, training corpus in the specified field is screened out from the candidate training corpus set. The domain data acquisition method provided by this embodiment can not only accurately filter out the training corpus in the specified field from the training corpus in the general domain, but also, because it only needs to calculate the gradient for the candidate training corpus, there is no need to calculate the gradient for all the training corpus. , Therefore, the amount of calculation is reduced, thereby improving the efficiency of screening training corpus.
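The two-stage filtering of this embodiment can be sketched generically. The cheap scoring function stands for the language-model-based second target value and the costly one for the gradient-based first target value; both are passed in as callables, and all names here are hypothetical.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def two_stage_select(
    corpus: List[T],
    cheap_score: Callable[[T], float],
    costly_score: Callable[[T], float],
    m: int,
    n: int,
) -> List[T]:
    """Pre-filter with the cheap score, then rank only the survivors with the costly one."""
    # Stage 1: keep the m items with the highest cheap score (second target value).
    candidates = sorted(corpus, key=cheap_score, reverse=True)[:m]
    # Stage 2: the costly score (first target value) is evaluated on m items only.
    return sorted(candidates, key=costly_score, reverse=True)[:n]
```

The saving comes from the second stage: the costly function is evaluated once per candidate rather than once per item in the full general-field set.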
第四实施例Fourth embodiment
在上述实施例的基础上,本实施例提供了一种针对指定领域的机器翻译 方法,该方法可以包括:Based on the above embodiments, this embodiment provides a machine translation method for a specified field. The method may include:
获取指定领域的待翻译源语言文本;将待翻译源语言文本输入预先建立 的领域翻译模型,得到待翻译源语言文本对应的目标语言文本。Obtain the source language text to be translated in the specified field; input the source language text to be translated into the pre-established domain translation model to obtain the target language text corresponding to the source language text to be translated.
其中,领域翻译模型采用指定领域的初始训练语料集,以及采用上述任 一实施例提供的领域数据获取方法从通用领域的训练语料集中获取的训练语 料,对通用翻译模型进行调整得到,通用翻译模型采用通用领域的训练语料 集训练得到。Among them, the domain translation model adopts the initial training corpus of the specified domain and the training corpus obtained from the training corpus of the general domain using the domain data acquisition method provided by any of the above embodiments. The general translation model is obtained by adjusting the general translation model. It is trained using training corpus in general fields.
请参阅图4,示出了采用指定领域的初始训练语料集以及从通用领域的训 练语料集中筛选出的训练语料,对通用翻译模型进行调整的流程示意图,可 以包括:Please refer to Figure 4, which shows a schematic flow chart of adjusting the general translation model using the initial training corpus in the specified field and the training corpus filtered out from the training corpus in the general field, which can include:
步骤S401:将指定领域的初始训练语料集中的训练语料与从通用领域的 训练语料集中筛选出的训练语料混合。Step S401: Mix the training corpus from the initial training corpus in the specified field with the training corpus filtered out from the training corpus in the general field.
Step S402: Fine-tune the general translation model with the mixed training corpus; the fine-tuned model is the domain translation model $T_{in1}$.
Fine-tuning the general translation model with the mixed training corpus means training it further on that corpus. The translation model obtained after this training is the domain translation model $T_{in1}$, which can translate to-be-translated text of the specified field fairly accurately.
优选的,为了能够获得性能更优的领域翻译模型,本实施例还可以包括 如下步骤:Preferably, in order to obtain a domain translation model with better performance, this embodiment may also include the following steps:
Step S403: Fine-tune the domain translation model $T_{in1}$ with the training corpus in the initial training corpus set of the specified field; the fine-tuned model is the final domain translation model $T_{in}$.
After the domain translation model $T_{in1}$ is obtained through step S402, it is fine-tuned more finely with the high-quality training corpus of the specified field, that is, it is trained further on the training corpus in the initial training corpus set of the specified field.
Because high-quality training corpus is scarce, this embodiment adds a KL regularization constraint when fine-tuning the domain translation model $T_{in1}$, to prevent the model from drifting.
Specifically, the KL regularization constraint prevents the model from drifting as follows. After $T_{in1}$ is obtained, a second copy of $T_{in1}$ is created, as shown in Figure 5. The parameters of one copy are kept fixed ($T_{in1}$-fixed in Figure 5), while the parameters of the other copy are adjusted ($T_{in}$ in Figure 5). The high-quality training corpus is input into both $T_{in}$ and $T_{in1}$-fixed; $T_{in}$ outputs the probability distribution $P(y|x)$ and $T_{in1}$-fixed outputs the probability distribution $Q(y|x)$. To prevent the model from drifting, the output probability distribution of $T_{in}$ is kept as close as possible to that of $T_{in1}$-fixed.
The closeness of two probability distributions can be measured by relative entropy, also called KL divergence or information divergence, which is an asymmetric measure of the difference between two probability distributions. When two probability distributions are identical, their relative entropy is zero, and as the difference between them grows, their relative entropy grows as well. Specifically, the relative entropy (KL divergence) of probability distribution P with respect to probability distribution Q can be calculated as:
$$D_{KL}(P \,\|\, Q) = \sum_{y} P(y|x) \log \frac{P(y|x)}{Q(y|x)}$$
After training converges, the resulting model $T_{in}$ is the translation model for the specified field.
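The KL regularization described above can be sketched numerically. The example assumes that P and Q are given as lists of probabilities over the same output vocabulary for a single prediction step, and that the constraint is added to the ordinary fine-tuning loss with a weighting factor; that factor (`weight`) and both function names are assumptions, since the text does not state how the two terms are combined.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Relative entropy of P with respect to Q for two discrete distributions.

    Terms with p_i == 0 contribute nothing; q_i is clipped at eps to avoid log(0).
    """
    return sum(
        p_i * math.log(p_i / max(q_i, eps))
        for p_i, q_i in zip(p, q)
        if p_i > 0.0
    )


def regularized_loss(
    task_loss: float,
    p: Sequence[float],
    q: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Fine-tuning loss plus a KL penalty that keeps P close to the frozen model's Q."""
    return task_loss + weight * kl_divergence(p, q)
```

When the trainable model's output matches that of the frozen copy, the penalty vanishes and only the ordinary fine-tuning loss remains; the further the two outputs diverge, the larger the penalty becomes.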
本实施例提供的针对指定领域的机器翻译方法中,由于指定领域的领域 翻译模型采用大量指定领域的训练语料对通用翻译模型进行微调得到,因此, 其为能够适应于指定领域的翻译模型,利用该翻译模型对指定领域的文本进 行翻译,可获得准确的翻译结果。In the machine translation method for the designated field provided in this embodiment, since the domain translation model of the designated field is obtained by fine-tuning the general translation model using a large amount of training corpus in the designated field, it is a translation model that can be adapted to the designated field. This translation model translates texts in specified fields and can obtain accurate translation results.
需要说明的是,可采用上述实施例提供的领域数据筛选方法获得多个不 同领域的训练语料,进而可采用多个不同领域的训练语料(每个领域可包括 筛选出的训练语料和收集而来的语料)分别对通用翻译模型进行微调,以得 到多个不同领域的领域翻译模型,从而可实现对多个不同领域待翻译文本的 准确翻译。It should be noted that the domain data screening method provided in the above embodiments can be used to obtain training corpus in multiple different fields, and further training corpus in multiple different fields can be used (each domain can include filtered training corpus and collected training corpus. corpus) to fine-tune the general translation model to obtain domain translation models in multiple different fields, thereby achieving accurate translation of texts to be translated in multiple different fields.
Fifth embodiment
An embodiment of the present application further provides a domain data acquisition apparatus, which is described below. The domain data acquisition apparatus described below and the domain data acquisition method described above may be referred to in correspondence with each other.
Referring to FIG. 6, which shows a schematic structural diagram of the domain data acquisition apparatus provided by an embodiment of the present application, the apparatus may include: a data acquisition module 601, a general translation model building module 602, a first target value determination module 603, and a data screening module 604.
The data acquisition module 601 is configured to acquire a training corpus set of the general domain and an initial training corpus set of the specified domain;
The general translation model building module 602 is configured to build a general translation model using the training corpus set of the general domain;
The first target value determination module 603 is configured to determine, based on the general translation model and the initial training corpus set of the specified domain, the first target values corresponding to the training corpus entries in the training corpus set of the general domain, where each training corpus entry corresponds to one first target value, and the first target value characterizes the degree to which the corresponding training corpus entry matches the specified domain;
The data screening module 604 is configured to screen out training corpus of the specified domain from the training corpus set of the general domain based on the first target values corresponding to the training corpus entries in that set.
Optionally, the domain data acquisition apparatus provided by the embodiment of the present application may further include: a general language model building module, a domain language model building module, and a second target value determination module.
The general language model building module is configured to build a general language model using the training corpus set of the general domain.
The domain language model building module is configured to build a domain language model using the initial training corpus set of the specified domain.
The second target value determination module is configured to determine, based on the general language model and the domain language model, the second target values corresponding to the training corpus entries in the training corpus set of the general domain, where each training corpus entry corresponds to one second target value, and the second target value characterizes the degree to which the corresponding training corpus entry is relevant to the specified domain;
The data screening module 604 is specifically configured to screen out training corpus of the specified domain from the training corpus set of the general domain based on the first target values and the second target values corresponding to the training corpus entries in that set.
Optionally, the first target value determination module 603 includes: a general language model building module, a domain language model building module, a second target value determination module, a candidate training corpus screening submodule, and a candidate-corpus first target value determination module.
The general language model building module is configured to build a general language model using the training corpus set of the general domain.
The domain language model building module is configured to build a domain language model using the initial training corpus set of the specified domain.
The second target value determination module is configured to determine, based on the general language model and the domain language model, the second target values corresponding to the training corpus entries in the training corpus set of the general domain, where each training corpus entry corresponds to one second target value, and the second target value characterizes the degree to which the corresponding training corpus entry is relevant to the specified domain;
The candidate training corpus screening submodule is configured to screen candidate training corpus entries from the training corpus set of the general domain based on the second target values corresponding to the training corpus entries in that set.
The candidate-corpus first target value determination module is configured to determine the first target value corresponding to each screened-out candidate training corpus entry.
The data screening module 604 is specifically configured to screen out training corpus of the specified domain from the screened-out candidate training corpus entries based on the first target value corresponding to each of them.
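The two-stage screening described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `two_stage_select`, the count parameters, and the ranking rules (a larger second target value is treated as more relevant, a smaller first target value as a closer match, and the second target value is collapsed to a single number) are all hypothetical and are not specified in this section.

```python
def two_stage_select(corpus, second_value, first_value,
                     num_candidates, num_selected):
    """Screen candidates by second target value, then select by first target value.

    second_value and first_value are callables that map a corpus entry to a number.
    """
    # Stage 1: keep the entries the language-model-based score rates as most
    # relevant (assumption: larger means more relevant).
    candidates = sorted(corpus, key=second_value, reverse=True)[:num_candidates]
    # Stage 2: the first target value is computed only for the candidates
    # (assumption: it is a distance, so smaller means a closer match).
    return sorted(candidates, key=first_value)[:num_selected]

# Hypothetical scores for four general-domain entries.
corpus = ["s1", "s2", "s3", "s4"]
relevance = {"s1": 0.9, "s2": 0.1, "s3": 0.7, "s4": 0.8}
# No distance is needed for "s2": it is dropped in stage 1.
distance = {"s1": 3.0, "s3": 1.0, "s4": 2.0}

selected = two_stage_select(corpus, relevance.get, distance.__getitem__,
                            num_candidates=3, num_selected=2)
```

In this arrangement the first target value only has to be computed for the entries that survive the first stage.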
Optionally, when determining, based on the general translation model and the initial training corpus set of the specified domain, the first target value corresponding to a target training corpus entry in the training corpus set of the general domain, the first target value determination module 603 is specifically configured to:
determine the gradient of each training corpus entry in the initial training corpus set of the specified domain on the general translation model; calculate the average of these gradients as the domain gradient average; determine the gradient of the target training corpus entry on the general translation model; and calculate the distance between the gradient of the target training corpus entry on the general translation model and the domain gradient average as the first target value corresponding to the target training corpus entry.
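The gradient-based computation above can be illustrated with a toy model. The sketch below assumes a linear model with squared-error loss as a stand-in for the general translation model, and Euclidean distance as the distance measure; both are assumptions for illustration, since this section does not fix the distance measure, and all names are hypothetical.

```python
import math

def example_gradient(weights, features, target):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one example."""
    error = sum(w * x for w, x in zip(weights, features)) - target
    return [error * x for x in features]

def domain_gradient_average(weights, domain_examples):
    """Element-wise mean of the per-example gradients of the domain set."""
    grads = [example_gradient(weights, x, y) for x, y in domain_examples]
    return [sum(column) / len(grads) for column in zip(*grads)]

def first_target_value(weights, example, domain_mean):
    """Distance between one example's gradient and the domain gradient average."""
    features, target = example
    return math.dist(example_gradient(weights, features, target), domain_mean)

# Toy "general model" and toy in-domain examples (features, target).
weights = [1.0, 0.0]
domain_examples = [([1.0, 0.0], 2.0), ([1.0, 0.0], 4.0)]
domain_mean = domain_gradient_average(weights, domain_examples)

# Two toy general-domain candidates.
close_example = ([1.0, 0.0], 3.0)
far_example = ([0.0, 1.0], 5.0)
close_value = first_target_value(weights, close_example, domain_mean)
far_value = first_target_value(weights, far_example, domain_mean)
```

Here the candidate whose gradient points the same way as the in-domain gradients receives the smaller value.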
Each training corpus entry in the training corpus set of the general domain and each training corpus entry in the initial training corpus set of the specified domain includes a source language text and a corresponding target language text.
When the general language model is built using the training corpus set of the general domain and the domain language model is built using the initial training corpus set of the specified domain, the general language model building module is specifically configured to:
train a language model using the source language texts in the training corpus set of the general domain, the trained language model serving as the source-side general language model; and train a language model using the target language texts in the training corpus set of the general domain, the trained language model serving as the target-side general language model.
When building the domain language model using the initial training corpus set of the specified domain, the domain language model building module is specifically configured to:
train a language model using the source language texts in the initial training corpus set of the specified domain, the trained language model serving as the source-side domain language model; and train a language model using the target language texts in the initial training corpus set of the specified domain, the trained language model serving as the target-side domain language model.
When determining, based on the general language model and the domain language model, the second target value corresponding to a target training corpus entry in the training corpus set of the general domain, the second target value determination module is specifically configured to:
calculate the posterior probabilities of the target training corpus entry on the general language model and on the domain language model respectively; and determine the second target value corresponding to the target training corpus entry according to the calculated posterior probabilities.
When calculating the posterior probabilities of the target training corpus entry on the general language model and on the domain language model respectively, the second target value determination module is specifically configured to:
calculate the posterior probabilities of the target training corpus entry on the source-side general language model, the source-side domain language model, the target-side general language model, and the target-side domain language model respectively.
When determining the second target value corresponding to the target training corpus entry according to the calculated posterior probabilities, the second target value determination module is specifically configured to:
determine the score of the target training corpus entry on the source-side language models according to its posterior probability on the source-side general language model and its posterior probability on the source-side domain language model; determine the score of the target training corpus entry on the target-side language models according to its posterior probability on the target-side general language model and its posterior probability on the target-side domain language model; and take the score on the source-side language models and the score on the target-side language models as the second target value corresponding to the target training corpus entry.
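One possible way to turn the four language model probabilities into the two scores is sketched below. The add-one-smoothed unigram language model, the class name `UnigramLM`, and the scoring rule (a length-normalized difference of log-probabilities, a common choice in language-model-based data selection) are assumptions for illustration; this section does not give the exact scoring formula.

```python
import math
from collections import Counter

class UnigramLM:
    """Toy add-one-smoothed unigram language model over whitespace tokens."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen tokens

    def log_prob(self, sentence):
        return sum(math.log((self.counts[tok] + 1) / (self.total + self.vocab))
                   for tok in sentence.split())

def side_score(sentence, domain_lm, general_lm):
    """Per-token log-probability gain of the domain LM over the general LM."""
    length = max(len(sentence.split()), 1)
    return (domain_lm.log_prob(sentence) - general_lm.log_prob(sentence)) / length

def second_target_value(pair, src_domain, src_general, tgt_domain, tgt_general):
    """Return (source-side score, target-side score) for a sentence pair."""
    source, target = pair
    return (side_score(source, src_domain, src_general),
            side_score(target, tgt_domain, tgt_general))

# Hypothetical tiny corpora for the four language models.
src_general = UnigramLM(["the cat sat", "the dog ran", "the patient slept"])
src_domain = UnigramLM(["the patient took the dose", "the dose was low"])
tgt_general = UnigramLM(["le chat dort", "le chien court"])
tgt_domain = UnigramLM(["le patient prend la dose", "la dose est faible"])

in_domain = second_target_value(("the dose", "la dose"),
                                src_domain, src_general, tgt_domain, tgt_general)
out_of_domain = second_target_value(("the cat", "le chat"),
                                    src_domain, src_general, tgt_domain, tgt_general)
```

Under this scoring rule, a pair that looks like the specified domain on both sides scores higher on both components than a pair that does not.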
The domain data acquisition apparatus provided in this embodiment can screen out training corpus of the specified domain from the training corpus set of the general domain with relatively high accuracy.
Sixth embodiment
This embodiment provides a machine translation apparatus for a specified domain, which is described below. The machine translation apparatus described below and the machine translation method described above may be referred to in correspondence with each other.
The machine translation apparatus for a specified domain provided in this embodiment may include: a source language text acquisition module and a translation module.
The source language text acquisition module is configured to acquire the source language text to be translated in the specified domain.
The translation module is configured to input the source language text to be translated into a pre-established domain translation model to obtain the target language text corresponding to the source language text to be translated.
The domain translation model is obtained by adjusting a general translation model using the initial training corpus set of the specified domain together with the training corpus acquired from the training corpus set of the general domain by the domain data acquisition apparatus provided in the above embodiment; the general translation model is obtained by training on the training corpus set of the general domain.
The machine translation apparatus for a specified domain provided in this embodiment may further include: a domain translation model construction module.
The domain translation model construction module may include a first adjustment module.
The first adjustment module is configured to mix the training corpus in the initial training corpus set of the specified domain with the training corpus screened out from the training corpus set of the general domain, and to adjust the general translation model using the mixed training corpus.
Optionally, the domain translation model construction module may further include a second adjustment module.
The second adjustment module is configured to further adjust, using the training corpus in the initial training corpus set of the specified domain, the translation model that has been adjusted using the mixed training corpus.
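The order of the two adjustment steps can be sketched as a small training schedule. In the sketch below, `fine_tune` is a hypothetical callable supplied by the caller that takes a model and a list of training entries and returns the adjusted model; the function name `build_domain_model`, the shuffling, and the `seed` parameter are likewise assumptions for illustration.

```python
import random

def build_domain_model(general_model, domain_initial, selected_from_general,
                       fine_tune, second_stage=True, seed=0):
    """Adjust a general model on mixed data, then optionally on in-domain data only."""
    # First adjustment: mix the initial in-domain entries with the entries
    # screened out of the general-domain set.
    mixed = list(domain_initial) + list(selected_from_general)
    random.Random(seed).shuffle(mixed)
    model = fine_tune(general_model, mixed)
    # Optional second adjustment: continue with the initial in-domain entries only.
    if second_stage:
        model = fine_tune(model, list(domain_initial))
    return model

def recording_fine_tune(model, data):
    """Stand-in for real training: records which data each stage saw."""
    return model + [sorted(data)]

stages = build_domain_model([], ["d1", "d2"], ["g1"], recording_fine_tune)
single_stage = build_domain_model([], ["d1", "d2"], ["g1"], recording_fine_tune,
                                  second_stage=False)
```

With the recording stand-in, `stages` lists the data seen in each adjustment step, which makes the schedule easy to inspect before a real training routine is plugged in.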
The domain translation model used by the machine translation apparatus provided in this embodiment is obtained by fine-tuning the general translation model with a large amount of training corpus from the specified domain. It is therefore a translation model adapted to the specified domain, and translating text of the specified domain with it yields accurate translation results.
Seventh embodiment
An embodiment of the present application further provides a domain data acquisition device. Referring to FIG. 7, which shows a schematic structural diagram of the domain data acquisition device, the device may include: at least one processor 701, at least one communication interface 702, at least one memory 703, and at least one communication bus 704;
In the embodiment of the present application, there is at least one each of the processor 701, the communication interface 702, the memory 703, and the communication bus 704, and the processor 701, the communication interface 702, and the memory 703 communicate with one another through the communication bus 704;
The processor 701 may be a central processing unit (CPU), an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present invention;
The memory 703 may include high-speed RAM, and may further include non-volatile memory, for example at least one disk memory;
The memory stores a program, and the processor can call the program stored in the memory. The program is used to:
acquire a training corpus set of the general domain and an initial training corpus set of the specified domain;
build a general translation model using the training corpus set of the general domain;
determine, based on the general translation model and the initial training corpus set of the specified domain, the first target values corresponding to the training corpus entries in the training corpus set of the general domain, where each training corpus entry corresponds to one first target value, and the first target value characterizes the degree to which the corresponding training corpus entry matches the specified domain;
screen out training corpus of the specified domain from the training corpus set of the general domain based on the first target values corresponding to the training corpus entries in that set.
Optionally, for the refined and extended functions of the program, refer to the description above.
Eighth embodiment
An embodiment of the present application further provides a readable storage medium, which may store a program suitable for execution by a processor. The program is used to:
acquire a training corpus set of the general domain and an initial training corpus set of the specified domain;
build a general translation model using the training corpus set of the general domain;
determine, based on the general translation model and the initial training corpus set of the specified domain, the first target values corresponding to the training corpus entries in the training corpus set of the general domain, where each training corpus entry corresponds to one first target value, and the first target value characterizes the degree to which the corresponding training corpus entry matches the specified domain;
screen out training corpus of the specified domain from the training corpus set of the general domain based on the first target values corresponding to the training corpus entries in that set.
Optionally, for the refined and extended functions of the program, refer to the description above.
Ninth embodiment
An embodiment of the present application further provides a machine translation device for a specified domain. The machine translation device may include: at least one processor, at least one communication interface, at least one memory, and at least one communication bus;
In the embodiment of the present application, there is at least one each of the processor, the communication interface, the memory, and the communication bus, and the processor, the communication interface, and the memory communicate with one another through the communication bus;
The processor may be a central processing unit (CPU), an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present invention;
The memory may include high-speed RAM, and may further include non-volatile memory, for example at least one disk memory;
The memory stores a program, and the processor can call the program stored in the memory. The program is used to:
acquire the source language text to be translated in the specified domain;
input the source language text to be translated into a pre-established domain translation model to obtain the target language text corresponding to the source language text to be translated;
The domain translation model is obtained by adjusting a general translation model using the initial training corpus set of the specified domain together with the training corpus acquired from the training corpus set of the general domain by the domain data acquisition method provided in any of the above embodiments; the general translation model is obtained by training on the training corpus set of the general domain.
Optionally, for the refined and extended functions of the program, refer to the description above.
Tenth embodiment
An embodiment of the present application further provides a readable storage medium, which may store a program suitable for execution by a processor. The program is used to:
acquire the source language text to be translated in the specified domain;
input the source language text to be translated into a pre-established domain translation model to obtain the target language text corresponding to the source language text to be translated;
The domain translation model is obtained by adjusting a general translation model using the initial training corpus set of the specified domain together with the training corpus acquired from the training corpus set of the general domain by the domain data acquisition method provided in any of the above embodiments; the general translation model is obtained by training on the training corpus set of the general domain.
Optionally, for the refined and extended functions of the program, refer to the description above.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The features described in the embodiments of this specification may be substituted for or combined with one another, and the above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (13)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011210710.9A CN112417896B (en) | 2020-11-03 | 2020-11-03 | Domain data acquisition method, machine translation method and related equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112417896A | 2021-02-26 |
CN112417896B | 2024-02-02 |
Family
ID=74827333
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202011210710.9A (CN112417896B) | Active | 2020-11-03 | 2020-11-03 | Domain data acquisition method, machine translation method and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112417896B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0665507B1 (en) * | 1994-01-14 | 2000-05-10 | Raytheon Company | Position and orientation estimation neural network system and method |
CN108920468A (en) * | 2018-05-07 | 2018-11-30 | 内蒙古工业大学 | A kind of bilingual kind of inter-translation method of illiteracy Chinese based on intensified learning |
CN109933804A (en) * | 2019-03-27 | 2019-06-25 | 北京信息科技大学 | A Keyword Extraction Method by Fusing Topic Information and Bidirectional LSTM |
CN110543643A (en) * | 2019-08-21 | 2019-12-06 | 语联网(武汉)信息技术有限公司 | Training method and device of text translation model |
CN111460838A (en) * | 2020-04-23 | 2020-07-28 | 腾讯科技(深圳)有限公司 | Pre-training method and device of intelligent translation model and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9235567B2 (en) * | 2013-01-14 | 2016-01-12 | Xerox Corporation | Multi-domain machine translation model adaptation |
Non-Patent Citations (2)
Title |
---|
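Derek F. Wong et al., "Bilingual recursive neural network based data selection for statistical machine translation", Knowledge-Based Systems, Vol. 108, pp. 15-24 * |
Yao Liang (姚亮), "Research on domain adaptation optimization methods for machine translation based on semantic analysis" (基于语义分析的机器翻译领域适应性优化方法研究), China Master's Theses Full-text Database, Information Science and Technology, No. 4, pp. I138-3646 * |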
Legal Events
Date | Code | Title | Description |
---|---|---|---|
- | PB01 | Publication | - |
- | SE01 | Entry into force of request for substantive examination | - |
- | TA01 | Transfer of patent application right | Effective date of registration: 20230515. Address after: No. 96 Jinzhai Road, Baohe District, Hefei, Anhui Province, 230026. Applicant after: University of Science and Technology of China; IFLYTEK Co.,Ltd. Address before: No. 666, Wangjiang West Road, Hi-tech Zone, Hefei City, Anhui Province. Applicant before: IFLYTEK Co.,Ltd. |
- | GR01 | Patent grant | - |