
CN115206296A - Method and device for speech recognition - Google Patents

Method and device for speech recognition

Info

Publication number: CN115206296A
Application number: CN202110380567.6A
Authority: CN (China)
Prior art keywords: training, speech recognition, training set, test set, keywords
Other languages: Chinese (zh)
Other versions: CN115206296B (en)
Inventors: 王润宇, 资礼波, 付立
Current Assignee: Jingdong Technology Holding Co Ltd
Original Assignee: Jingdong Technology Holding Co Ltd
Application filed by Jingdong Technology Holding Co Ltd
Priority to CN202110380567.6A
Publication of CN115206296A
Application granted; publication of CN115206296B
Legal status: Granted, Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present disclosure disclose a method and apparatus for speech recognition. A specific implementation of the method includes: obtaining a keyword to be recognized; searching the original scene training set for audio containing the keyword to form a first training set, and obtaining audio that contains the keyword but is not in the original scene training set to form a first test set; performing a first round of incremental training on the first speech recognition model of the original scene based on the first training set to obtain a second speech recognition model; recognizing the first test set with the second speech recognition model and adding the audio in the first test set whose keyword is correctly recognized to the first training set, thereby obtaining a second training set and a second test set; performing a second round of incremental training to obtain a third speech recognition model; and inputting the second test set into the third speech recognition model to obtain an initial recognition result. This implementation greatly reduces the training time and the amount of data required, and also addresses the overfitting problem of incremental learning and the limitations of hot word decoding.

Description

Method and device for speech recognition

Technical Field

Embodiments of the present disclosure relate to the field of computer technology, and in particular to a method and apparatus for speech recognition.

Background

End-to-end deep neural networks have become a popular framework in the field of speech recognition; compared with traditional speech recognition frameworks, they simplify model construction and training. In practical applications, an existing speech recognition model is often required to recognize speech input in a new scene while maintaining its recognition accuracy in the original scene. For example, a speech recognition model trained on an original dataset may need stronger recognition of new keywords. Alternatively, for the cold start of a new speech recognition scene, the model needs to be adapted to the new domain based on a small new dataset while inheriting the recognition capability of the old model. Since new keywords or new scenes are usually absent from the past training data, directly applying the old model to the new task yields poor recognition performance. One feasible approach to this problem is to retrain the speech recognition model on a mixture of the old and new scene datasets. However, this approach may suffer from imbalanced training data, because the new dataset is usually much smaller than the old one; moreover, for data security and privacy reasons, the past dataset may not be available for training. Another approach is transfer learning on the new scene data, which reduces the time cost but tends to overfit the speech recognition model. Hot word decoding is also feasible, but it can only boost a decoding path, and thereby recall a keyword, when the keyword already appears in that path; when the keyword is absent from the decoding paths or its probability is low, hot word decoding cannot recall the keyword.

Summary of the Invention

Embodiments of the present disclosure propose a method and apparatus for speech recognition.

In a first aspect, an embodiment of the present disclosure provides a method for speech recognition, including: obtaining a keyword to be recognized; searching the original scene training set for audio containing the keyword to form a first training set, and obtaining audio that contains the keyword but is not in the original scene training set to form a first test set; performing a first round of incremental training on the first speech recognition model of the original scene based on the first training set to obtain a second speech recognition model; recognizing the first test set with the second speech recognition model, and adding the audio in the first test set whose keyword is correctly recognized to the first training set to obtain a second training set and a second test set; performing a second round of incremental training on the second speech recognition model based on the second training set to obtain a third speech recognition model; and inputting the second test set into the third speech recognition model to obtain an initial recognition result.
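To make this flow concrete, the following is a minimal high-level sketch in Python. Every helper name used here (build_keyword_sets, incremental_train, promote_recalled, recognize) is an illustrative assumption rather than part of the disclosure; possible forms of these helpers are sketched later in the detailed description.

```python
# High-level sketch of the method of the first aspect; all helpers are assumed
# functions, not defined by the disclosure.

def speech_recognition_method(keyword, original_train_set, collected_audio, first_model):
    # Form the first training set and first test set around the keyword.
    train1, test1 = build_keyword_sets(keyword, original_train_set, collected_audio)
    # First round of incremental training on the original-scene model.
    model2 = incremental_train(first_model, train1)
    # Move correctly recognized test audio into the training set.
    train2, test2 = promote_recalled(keyword, train1, test1, model2)
    # Second round of incremental training.
    model3 = incremental_train(model2, train2)
    # Initial recognition result on the second test set.
    return [recognize(model3, sample["audio"]) for sample in test2]
```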

In some embodiments, the method further includes: adjusting the initial recognition result through a hot word decoder, and calculating the recall rate of the keyword.

In some embodiments, adding the audio in the first test set whose keyword is correctly recognized to the first training set to obtain a second training set and a second test set includes: adjusting, through a hot word decoder, the recognition result obtained by recognizing the first test set with the second speech recognition model; determining, according to the adjusted recognition result, the audio whose keyword is correctly recognized; adding the determined audio to the first training set to obtain the second training set; and deleting the determined audio from the first test set to obtain the second test set.

In some embodiments, the number of audios in the first training set is greater than a first threshold.

In some embodiments, the method further includes: if the number of audios containing the keyword in the original scene training set is not greater than the first threshold, recording audio containing the keyword and adding it to the first training set.

In some embodiments, the number of audios in the first test set is greater than a second threshold.

In some embodiments, the method further includes: if all the keywords in the first test set are correctly recognized, or the number of audios in the second test set is less than a third threshold, recording audio containing the keyword and adding it to the second test set.

In a second aspect, an embodiment of the present disclosure provides an apparatus for speech recognition, including: an acquisition unit configured to obtain a keyword to be recognized; a composition unit configured to search the original scene training set for audio containing the keyword to form a first training set, and to obtain audio that contains the keyword but is not in the original scene training set to form a first test set; a first training unit configured to perform a first round of incremental training on the first speech recognition model of the original scene based on the first training set to obtain a second speech recognition model; a recognition unit configured to recognize the first test set with the second speech recognition model and to add the audio in the first test set whose keyword is correctly recognized to the first training set, obtaining a second training set and a second test set; a second training unit configured to perform a second round of incremental training on the second speech recognition model based on the second training set to obtain a third speech recognition model; and an output unit configured to input the second test set into the third speech recognition model to obtain an initial recognition result.

In some embodiments, the apparatus further includes a computing unit configured to adjust the initial recognition result through a hot word decoder and calculate the recall rate of the keyword.

In some embodiments, the recognition unit is further configured to: adjust, through a hot word decoder, the recognition result obtained by recognizing the first test set with the second speech recognition model; determine, according to the adjusted recognition result, the audio whose keyword is correctly recognized; add the determined audio to the first training set to obtain the second training set; and delete the determined audio from the first test set to obtain the second test set.

In some embodiments, the number of audios in the first training set is greater than a first threshold.

In some embodiments, the apparatus further includes a first recording unit configured to: if the number of audios containing the keyword in the original scene training set is not greater than the first threshold, record audio containing the keyword and add it to the first training set.

In some embodiments, the number of audios in the first test set is greater than a second threshold.

In some embodiments, the apparatus further includes a second recording unit configured to: if all the keywords in the first test set are correctly recognized, or the number of audios in the second test set is less than a third threshold, record audio containing the keyword and add it to the second test set.

In a third aspect, an embodiment of the present disclosure provides an electronic device for speech recognition, including: one or more processors; and a storage device on which one or more programs are stored, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of the implementations of the first aspect.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processor, implements the method according to any one of the implementations of the first aspect.

The method and apparatus for speech recognition provided by the embodiments of the present disclosure perform a first round of incremental training using a small amount of training data containing the keyword. After the first round of training, the model obtained in that round is used to add the correctly recalled data from the test set to the training set for a second round of incremental training. This greatly reduces the training time and the amount of data required, and also addresses the overfitting problem of incremental learning and the limitations of hot word decoding.

Brief Description of the Drawings

Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:

Fig. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure may be applied;

Fig. 2 is a flowchart of an embodiment of the method for speech recognition according to the present disclosure;

Fig. 3 is a schematic diagram of an application scenario of the method for speech recognition according to the present disclosure;

Fig. 4 is a flowchart of another embodiment of the method for speech recognition according to the present disclosure;

Fig. 5 is a flowchart of hot word decoding in the method for speech recognition according to the present disclosure;

Fig. 6 is a schematic structural diagram of an embodiment of the apparatus for speech recognition according to the present disclosure;

Fig. 7 is a schematic structural diagram of a computer system suitable for implementing an electronic device of an embodiment of the present disclosure.

Detailed Description

The present disclosure is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, not to limit it. It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings.

It should be noted that the embodiments of the present disclosure and the features of the embodiments may be combined with each other when there is no conflict. The present disclosure is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

Fig. 1 shows an exemplary system architecture 100 to which the method for speech recognition or the apparatus for speech recognition of an embodiment of the present application may be applied.

As shown in Fig. 1, the system architecture 100 may include terminals 101 and 102, a network 103, a database server 104, and a server 105. The network 103 serves as a medium for providing communication links between the terminals 101 and 102, the database server 104, and the server 105. The network 103 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

A user 110 may use the terminals 101 and 102 to interact with the server 105 through the network 103, for example to receive or send messages. Various client applications may be installed on the terminals 101 and 102, such as model training applications, speech recognition applications, shopping applications, payment applications, web browsers, and instant messaging tools.

The terminals 101 and 102 here may be hardware or software. When the terminals 101 and 102 are hardware, they may be various electronic devices with microphones, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, laptop computers, and desktop computers. When the terminals 101 and 102 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple software programs or software modules (for example, to provide distributed services) or as a single software program or software module. No specific limitation is imposed here.

When the terminals 101 and 102 are hardware, an audio collection device may also be installed on them. The audio collection device may be any device capable of collecting audio, such as a microphone. The user 110 may use the audio collection device on the terminals 101 and 102 to collect his or her own voice or the voices of others.

The database server 104 may be a database server providing various services. For example, a sample set may be stored in the database server. The sample set contains a large number of samples, where a sample may include audio and the annotation information corresponding to the audio. In this way, the user 110 may also select samples from the sample set stored in the database server 104 through the terminals 101 and 102.

The server 105 may also be a server providing various services, for example a background server supporting the various applications displayed on the terminals 101 and 102. The background server may train an initial model using the samples in the sample set sent by the terminals 101 and 102, and may send the training result (such as the generated speech recognition model) to the terminals 101 and 102. In this way, the user can apply the generated speech recognition model to perform speech recognition.

The database server 104 and the server 105 here may likewise be hardware or software. When they are hardware, they may be implemented as a distributed server cluster composed of multiple servers or as a single server. When they are software, they may be implemented as multiple software programs or software modules (for example, to provide distributed services) or as a single software program or software module. No specific limitation is imposed here.

It should be noted that the method for speech recognition provided by the embodiments of the present application is generally executed by the server 105. Accordingly, the apparatus for speech recognition is generally also provided in the server 105.

It should be pointed out that, when the server 105 can implement the relevant functions of the database server 104, the database server 104 may be omitted from the system architecture 100.

It should be understood that the numbers of terminals, networks, database servers, and servers in Fig. 1 are merely illustrative. There may be any number of terminals, networks, database servers, and servers according to implementation needs.

Continuing to refer to Fig. 2, a flow 200 of an embodiment of the method for speech recognition according to the present application is shown. The method for speech recognition may include the following steps:

Step 201: obtain a keyword to be recognized.

In this embodiment, the execution body of the method for speech recognition (for example, the server 105 shown in Fig. 1) may obtain the keyword to be recognized in various ways. For example, the execution body may obtain an existing keyword to be recognized stored in a database server (for example, the database server 104 shown in Fig. 1) through a wired or wireless connection. For another example, the user may collect the keyword to be recognized through a terminal (for example, the terminals 101 and 102 shown in Fig. 1). In this way, the execution body may receive the keywords to be recognized collected by the terminal and store them locally.

The keyword to be recognized applies to a new scene, and the speech recognition model trained on the original scene recognizes it poorly. Keywords are typically technical terms or product names, such as "ETF" or "XY贷".

Step 202: search the original scene training set for audio containing the keyword to form a first training set, and obtain audio that contains the keyword but is not in the original scene training set to form a first test set.

In this embodiment, the present application performs incremental learning on top of the first speech recognition model of the original scene: the original speech recognition model is trained on a small batch of new-scene training data so that it fits the new scene while preserving, to the greatest extent, its recognition performance on the original scene. Incremental learning means that a model learns knowledge of new tasks or scenes from new data, without accessing the original data and without forgetting the task or scene knowledge it has already learned. It can optimize the performance of the model in the new scene while retaining the model's accuracy in the original scene as much as possible.

The first speech recognition model of the original scene was trained on the original scene training set. The original scene training set includes a large amount of audio, each audio corresponding to annotation information, where the annotation information annotates the content of the audio. Some of the audio in the original scene training set contains the keyword; this audio can be picked out to form the first training set. Audio that contains the keyword and is not in the original scene training set is then collected to form the first test set, which is used to verify the performance of the trained model.

In some optional implementations of this embodiment, the number of audios in the first training set is greater than a first threshold (for example, 30). If the training data is insufficient, the training effect is not ideal, so audio containing the keyword needs to be recorded and added to the first training set so that the total number of audios containing the keyword is greater than the first threshold.

In some optional implementations of this embodiment, the number of audios in the first test set is greater than a second threshold (for example, 15). The second threshold may be smaller than the first threshold. A sufficiently large test set is needed to verify the performance of the model, and also to ensure that, after some of the test data is added to the training set, the test set still contains enough audio for performance verification.
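As a rough illustration of how these two sets might be assembled with the minimum-size checks above, consider the following sketch. The sample format (dicts with "text" and "audio" fields) and the record_audio helper are assumptions, and the thresholds of 30 and 15 are only the example values given in this description.

```python
MIN_TRAIN = 30   # first threshold: minimum size of the first training set
MIN_TEST = 15    # second threshold: minimum size of the first test set

def build_keyword_sets(keyword, original_train_set, collected_audio):
    # First training set: original-scene samples whose transcript contains the keyword.
    train1 = [s for s in original_train_set if keyword in s["text"]]
    # If there is too little data, record additional keyword audio (assumed helper).
    train1 += [record_audio(keyword) for _ in range(max(0, MIN_TRAIN - len(train1)))]

    # First test set: keyword audio that is not part of the original training set.
    test1 = [s for s in collected_audio if keyword in s["text"]]
    test1 += [record_audio(keyword) for _ in range(max(0, MIN_TEST - len(test1)))]
    return train1, test1
```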

Step 203: perform a first round of incremental training on the first speech recognition model of the original scene based on the first training set to obtain a second speech recognition model.

In this embodiment, incremental training does not require a large amount of training data to be available before the training process starts; instead, new training data is used for training continuously over time. The audio and the annotation information in the first training set are used as the input and the expected output of the first speech recognition model, respectively, and the first speech recognition model is trained in a supervised manner to obtain the second speech recognition model. The specific training process is known in the prior art and is not repeated here.
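As a rough illustration only, one round of supervised incremental training could be implemented as ordinary fine-tuning of the existing model on the small keyword training set. The sketch below uses PyTorch and assumes a model object that returns its training loss (for example a CTC or attention loss); this interface is an assumption, since the disclosure does not fix a particular end-to-end architecture.

```python
import torch

def incremental_train(model, train_samples, epochs=5, lr=1e-5):
    # Continue training the existing model on the small keyword training set;
    # a small learning rate and few epochs help limit forgetting of the original scene.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for sample in train_samples:
            # Assumption: the model exposes a loss for (audio features, transcript),
            # e.g. a CTC or attention loss of an end-to-end recognizer.
            loss = model.compute_loss(sample["audio"], sample["text"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```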

Step 204: recognize the first test set with the second speech recognition model, and add the audio in the first test set whose keyword is correctly recognized to the first training set to obtain a second training set and a second test set.

In this embodiment, the trained second speech recognition model is used to recognize the first test set. Audio whose keyword is correctly recognized is added to the first training set to obtain the second training set, and is deleted from the first test set to obtain the second test set. In other words, after the first round of incremental training, the number of audios in the training set increases and the number of audios in the test set decreases.
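A minimal sketch of this promotion step is given below; the recognize helper (returning the decoded text for an audio sample) and the sample format are assumptions.

```python
def promote_recalled(keyword, train_set, test_set, model):
    # Split the test set: recalled audio moves into the training set,
    # the rest forms the new test set.
    train2, test2 = list(train_set), []
    for sample in test_set:
        hypothesis = recognize(model, sample["audio"])  # assumed decoding helper
        if keyword in hypothesis:       # keyword correctly recalled
            train2.append(sample)       # promoted into the second training set
        else:
            test2.append(sample)        # kept in the second test set
    return train2, test2
```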

In some optional implementations of this embodiment, if all the keywords in the first test set are correctly recognized, or the number of audios in the second test set is less than a third threshold, audio containing the keyword is recorded and added to the second test set. The third threshold (for example, 5) may be smaller than both the second threshold (for example, 15) and the first threshold (for example, 30). If the keywords of all the audios in the test set are correctly recognized by the model, or the new test set contains fewer than 5 audios, new audio containing the keyword is recorded so that the new test set contains no fewer than 5 audios.

In some optional implementations of this embodiment, the recognition result obtained by recognizing the first test set with the second speech recognition model is adjusted through a hot word decoder; the audio whose keyword is correctly recognized is determined according to the adjusted recognition result; the determined audio is added to the first training set to obtain the second training set; and the determined audio is deleted from the first test set to obtain the second test set. In the model recognition process, this method uses the hot word decoder shown in Fig. 5. During speech recognition decoding, the hot word decoder matches the end of each decoding path at every time step; if the end of a decoding path matches a specified hot word, a corresponding bonus is added to the score of that path. As shown in Fig. 5, the specified decoding hot word is "ETF"; at a certain time step, "ETF" appears at the end of a decoding path, and after being matched by the hot word decoder, the score of that path is boosted. When the decoding search is completed, the decoder outputs the highest-scoring decoding path, which is the path containing the keyword that received the hot word bonus, as the recognition result. In this way, through the adjustment of the acoustic model and the adjustment of the hot words, the method effectively improves the recall rate of the keyword.
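The hot word bonus itself can be illustrated with a small sketch: at each time step, any partial hypothesis whose tail matches a specified hot word receives a score bonus before pruning. The beam representation and the bonus value below are assumptions, not taken from the disclosure.

```python
def apply_hotword_bonus(beam, hotwords, bonus=3.0):
    """beam: list of (text, score) partial hypotheses at the current time step."""
    boosted = []
    for text, score in beam:
        if any(text.endswith(word) for word in hotwords):
            score += bonus            # reward paths whose tail matches a hot word
        boosted.append((text, score))
    # The decoder ultimately returns the highest-scoring path as the recognition result.
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)
```

For example, with hotwords=["ETF"], a partial path ending in "ETF" is boosted, so it is more likely to survive beam pruning and to be returned as the final hypothesis.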

Step 205: perform a second round of incremental training on the second speech recognition model based on the second training set to obtain a third speech recognition model.

In this embodiment, the audio and the annotation information in the second training set are used as the input and the expected output of the second speech recognition model, respectively, and the second speech recognition model is trained in a supervised manner to obtain the third speech recognition model. The third speech recognition model recognizes the keyword noticeably better.

With the method for speech recognition in this embodiment, when there are few training samples, incremental learning is used to adjust the speech recognition model and, combined with hot words, improve the recall rate of the keyword. At the same time, compared with other approaches, this method effectively preserves the recognition performance of the new model on the old scene and solves the problem that hot word decoding is limited by the acoustic probability distribution.

Step 206: input the second test set into the third speech recognition model to obtain an initial recognition result.

In this embodiment, the performance of the third speech recognition model is verified on the second test set to obtain the initial recognition result, that is, the accuracy of keyword recognition. Referring further to Fig. 3, which is a schematic diagram of an application scenario of the method for speech recognition according to this embodiment, the specific process is as follows:

1) The user can input the keyword to be recognized, "ETF", to the server through a terminal device.

2) For the specified keyword, training data containing the keyword is found in the original scene training set to form the training set for the first round of incremental training. If the training set contains fewer than 30 audios with the keyword, audio containing the keyword is recorded so that the training set contains no fewer than 30 audios. At the same time, no fewer than 15 audios that contain the keyword and are not in the original scene training set are collected as the test set.

3) The speech recognition model of the original scene is incrementally trained: the first round of incremental training is performed using the training set from 2).

4) After training, the new model obtained in 3) is used to recognize the test set. Audio whose keyword is correctly recognized is added to the training set from 2); audio that is not correctly recognized is put into a new test set. If the keywords of all the audios in the test set are correctly recognized by the model, or the new test set contains fewer than 5 audios, new audio containing the keyword is recorded so that the new test set contains no fewer than 5 audios.

In the model recognition process, this method uses the hot word decoder shown in Fig. 5. During speech recognition decoding, the hot word decoder matches the end of each decoding path at every time step; if the end of a decoding path matches a specified hot word, a corresponding bonus is added to the score of that path. The specified decoding hot word is "ETF"; at a certain time step, "ETF" appears at the end of a decoding path, and after being matched by the hot word decoder, the score of that path is boosted. When the decoding search is completed, the decoder outputs the highest-scoring decoding path, which is the path containing the keyword that received the hot word bonus, as the recognition result. In this way, through the adjustment of the speech recognition model (in particular the acoustic model) and the adjustment of the hot words, the method effectively improves the recall rate of the keyword.

5) Using the training set from 4), a second round of incremental training is performed on the model obtained in 3).

6) After training, the model obtained in 5) is the final speech recognition model.

Continuing to refer to Fig. 4, the present application provides a flowchart of another embodiment of the method for speech recognition.

As shown in Fig. 4, the method 400 for speech recognition in this embodiment may include:

Step 401: obtain a keyword to be recognized.

Step 402: search the original scene training set for audio containing the keyword to form a first training set, and obtain audio that contains the keyword but is not in the original scene training set to form a first test set.

Step 403: perform a first round of incremental training on the first speech recognition model of the original scene based on the first training set to obtain a second speech recognition model.

Step 404: recognize the first test set with the second speech recognition model, and add the audio in the first test set whose keyword is correctly recognized to the first training set to obtain a second training set and a second test set.

Step 405: perform a second round of incremental training on the second speech recognition model based on the second training set to obtain a third speech recognition model.

Steps 401 to 405 are basically the same as steps 201 to 205 and are not repeated here.

Step 406: input the second test set into the third speech recognition model to obtain an initial recognition result.

In this embodiment, the performance of the third speech recognition model is verified on the second test set to obtain the initial recognition result, that is, the accuracy of keyword recognition.

Step 407: adjust the initial recognition result through the hot word decoder, and calculate the recall rate of the keyword.

In this embodiment, the hot word decoder shown in Fig. 5 is used. During speech recognition decoding, the hot word decoder matches the end of each decoding path at every time step; if the end of a decoding path matches a specified hot word, a corresponding bonus is added to the score of that path. As shown in Fig. 5, the specified decoding hot word is "ETF"; at a certain time step, "ETF" appears at the end of a decoding path, and after being matched by the hot word decoder, the score of that path is boosted. When the decoding search is completed, the decoder outputs the highest-scoring decoding path, which is the path containing the keyword that received the hot word bonus, as the recognition result. In this way, through the adjustment of the acoustic model and the adjustment of the hot words, the method effectively improves the recall rate of the keyword.

Optionally, if the recall rate does not reach the desired threshold, steps 404 to 407 may be repeated: audio whose keyword is accurately recognized continues to be selected from the test set and added to the training set, and the model is retrained with the updated training set, until the recall rate of the trained model reaches the desired threshold.
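This optional outer loop could look roughly as follows, reusing the assumed helpers from the earlier sketches (incremental_train and recognize); the target recall and round limit are illustrative values. In practice the remaining test set would also be topped up with newly recorded keyword audio whenever it falls below the third threshold, as described above.

```python
def train_until_recall(keyword, model, train_set, test_set,
                       target_recall=0.9, max_rounds=5):
    recall = 0.0
    for _ in range(max_rounds):
        # Step 404: promote audio whose keyword the current model recalls correctly.
        hits = [s for s in test_set if keyword in recognize(model, s["audio"])]
        train_set = train_set + hits
        test_set = [s for s in test_set if s not in hits]
        # Step 405: retrain on the enlarged training set.
        model = incremental_train(model, train_set)
        # Steps 406-407: measure keyword recall on the remaining test set.
        recalled = sum(keyword in recognize(model, s["audio"]) for s in test_set)
        recall = recalled / max(len(test_set), 1)
        if recall >= target_recall:
            break
    return model, recall
```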

As can be seen from Fig. 4, compared with the embodiment corresponding to Fig. 2, the flow 400 of the method for speech recognition in this embodiment embodies the step of adjusting the recognition result of the speech recognition model through the hot word decoder. The solution described in this embodiment can therefore improve the recall rate of the keyword.

Continuing to refer to Fig. 6, as an implementation of the methods shown in the figures above, the present application provides an embodiment of an apparatus for speech recognition. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus can be applied to various electronic devices.

As shown in Fig. 6, the apparatus 600 for speech recognition in this embodiment may include: an acquisition unit 601, a composition unit 602, a first training unit 603, a recognition unit 604, a second training unit 605, and an output unit 606. The acquisition unit 601 is configured to obtain a keyword to be recognized; the composition unit 602 is configured to search the original scene training set for audio containing the keyword to form a first training set, and to obtain audio that contains the keyword but is not in the original scene training set to form a first test set; the first training unit 603 is configured to perform a first round of incremental training on the first speech recognition model of the original scene based on the first training set to obtain a second speech recognition model; the recognition unit 604 is configured to recognize the first test set with the second speech recognition model and to add the audio in the first test set whose keyword is correctly recognized to the first training set, obtaining a second training set and a second test set; the second training unit 605 is configured to perform a second round of incremental training on the second speech recognition model based on the second training set to obtain a third speech recognition model; and the output unit 606 is configured to input the second test set into the third speech recognition model to obtain an initial recognition result.

In this embodiment, for the specific processing of the acquisition unit 601, the composition unit 602, the first training unit 603, the recognition unit 604, the second training unit 605, and the output unit 606 of the apparatus 600 for speech recognition, reference may be made to steps 201 to 206 in the embodiment corresponding to Fig. 2.

In some optional implementations of this embodiment, the apparatus 600 further includes a computing unit (not shown in the drawings) configured to adjust the initial recognition result through a hot word decoder and calculate the recall rate of the keyword.

In some optional implementations of this embodiment, the recognition unit 604 is further configured to: adjust, through a hot word decoder, the recognition result obtained by recognizing the first test set with the second speech recognition model; determine, according to the adjusted recognition result, the audio whose keyword is correctly recognized; add the determined audio to the first training set to obtain the second training set; and delete the determined audio from the first test set to obtain the second test set.

In some optional implementations of this embodiment, the number of audios in the first training set is greater than a first threshold.

In some optional implementations of this embodiment, the apparatus 600 further includes a first recording unit (not shown in the drawings) configured to: if the number of audios containing the keyword in the original scene training set is not greater than the first threshold, record audio containing the keyword and add it to the first training set.

In some optional implementations of this embodiment, the number of audios in the first test set is greater than a second threshold.

In some optional implementations of this embodiment, the apparatus 600 further includes a second recording unit (not shown in the drawings) configured to: if all the keywords in the first test set are correctly recognized, or the number of audios in the second test set is less than a third threshold, record audio containing the keyword and add it to the second test set.

According to embodiments of the present disclosure, the present disclosure further provides an electronic device and a readable storage medium.

Fig. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.

As shown in Fig. 7, the device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. Various programs and data required for the operation of the device 700 can also be stored in the RAM 703. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Multiple components in the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; the storage unit 708, such as a magnetic disk or an optical disc; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The computing unit 701 executes the methods and processes described above, such as the method for speech recognition. For example, in some embodiments, the method for speech recognition may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method for speech recognition described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method for speech recognition in any other appropriate manner (for example, by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, so that, when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described here may be implemented on a computer having a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described here may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described here), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and usually interact through a communication network. The relationship between client and server arises from computer programs running on the respective computers and having a client-server relationship with each other. The server may be a server of a distributed system or a server combined with a blockchain. The server may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.

It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed herein can be achieved; no limitation is imposed here.

The specific embodiments described above do not limit the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (10)

1. A method of speech recognition, comprising:
acquiring a keyword to be recognized;
searching an original scene training set for audio containing the keyword to form a first training set, and acquiring audio that contains the keyword but is not in the original scene training set to form a first test set;
performing a first round of incremental training on a first speech recognition model of an original scene based on the first training set to obtain a second speech recognition model;
recognizing the first test set using the second speech recognition model, and adding audio in the first test set whose keyword is correctly recognized to the first training set to obtain a second training set and a second test set;
performing a second round of incremental training on the second speech recognition model based on the second training set to obtain a third speech recognition model;
and inputting the second test set into the third speech recognition model to obtain an initial recognition result.
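For illustration only, the following is a minimal Python sketch of the two-round incremental training flow recited in claim 1. The callables train_incremental and recognize, and the transcript attribute on audio samples, are assumptions introduced for the sketch and are not part of the claimed method.

    def split_by_keyword(audios, keyword):
        # Select audio samples whose reference transcript contains the keyword.
        return [a for a in audios if keyword in a.transcript]

    def two_round_incremental_training(keyword, scene_train_set, candidate_audios,
                                       model_1, train_incremental, recognize):
        # Step 1: first training set = original-scene audio containing the keyword.
        train_1 = split_by_keyword(scene_train_set, keyword)
        # Step 2: first test set = keyword audio not in the original-scene training set.
        test_1 = [a for a in split_by_keyword(candidate_audios, keyword)
                  if a not in scene_train_set]
        # Step 3: first round of incremental training on the original-scene model.
        model_2 = train_incremental(model_1, train_1)
        # Step 4: audio in the first test set whose keyword is correctly recognized
        # by the second model moves into the training data.
        correct = [a for a in test_1 if keyword in recognize(model_2, a)]
        train_2 = train_1 + correct
        test_2 = [a for a in test_1 if a not in correct]
        # Step 5: second round of incremental training.
        model_3 = train_incremental(model_2, train_2)
        # Step 6: initial recognition result of the second test set.
        return model_3, [recognize(model_3, a) for a in test_2]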
2. The method of claim 1, wherein the method further comprises:
adjusting the initial recognition result through a hot word decoder, and calculating a recall rate of the keyword.
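One plausible reading of the recall-rate calculation in claim 2, assuming each test utterance carries a reference transcript, is sketched below: keyword recall is the fraction of utterances that truly contain the keyword for which the adjusted hypothesis also contains it.

    def keyword_recall(keyword, references, hypotheses):
        # Recall = utterances whose adjusted hypothesis contains the keyword,
        # divided by utterances whose reference actually contains it.
        relevant = [(ref, hyp) for ref, hyp in zip(references, hypotheses)
                    if keyword in ref]
        if not relevant:
            return 0.0
        hits = sum(1 for ref, hyp in relevant if keyword in hyp)
        return hits / len(relevant)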
3. The method of claim 1, wherein adding the audio in the first test set whose keyword is correctly recognized to the first training set to obtain a second training set and a second test set comprises:
adjusting, through a hot word decoder, a recognition result obtained by recognizing the first test set with the second speech recognition model;
determining, according to the adjusted recognition result, the audio whose keyword is correctly recognized;
adding the determined audio to the first training set to obtain the second training set;
and deleting the determined audio from the first test set to obtain the second test set.
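The split recited in claim 3 may be pictured with the sketch below; adjust_with_hotwords stands in for the hot word decoder and, like recognize, is an assumed callable rather than a defined interface.

    def split_with_hotword_decoder(keyword, train_1, test_1, model_2,
                                   recognize, adjust_with_hotwords):
        correct, test_2 = [], []
        for audio in test_1:
            # Adjust the raw hypothesis of the second speech recognition model
            # with the hot word decoder, biased towards the keyword.
            hypothesis = adjust_with_hotwords(recognize(model_2, audio), [keyword])
            # Audio whose adjusted hypothesis contains the keyword is treated
            # as correctly recognized.
            (correct if keyword in hypothesis else test_2).append(audio)
        train_2 = train_1 + correct    # second training set
        return train_2, test_2         # second test set is the remainder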
4. The method of claim 1, wherein the number of audios in the first training set is greater than a first threshold.
5. The method of claim 4, wherein the method further comprises:
if the number of audios containing the keyword in the original scene training set is not greater than the first threshold, recording audio containing the keyword and adding the recorded audio to the first training set.
6. The method of claim 1, wherein the number of audios in the first test set is greater than a second threshold.
7. The method of claim 1, wherein the method further comprises:
if all the keywords in the first test set are correctly recognized, or the number of audios in the second test set is smaller than a third threshold, recording audio containing the keyword and adding the recorded audio to the second test set.
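The size constraints of claims 4 to 7 can be expressed as the guard below, where record_keyword_audio is a hypothetical helper for collecting additional keyword recordings and the three thresholds are assumed to be supplied by the caller.

    def ensure_set_sizes(keyword, train_1, test_1, test_2,
                         first_threshold, second_threshold, third_threshold,
                         record_keyword_audio, all_test1_correct):
        # Claims 4-5: the first training set must exceed the first threshold; if the
        # original scene does not supply enough keyword audio, record more and add it.
        while len(train_1) <= first_threshold:
            train_1.append(record_keyword_audio(keyword))
        # Claim 6 only requires the first test set to exceed the second threshold.
        assert len(test_1) > second_threshold, "first test set too small"
        # Claim 7: if every keyword in the first test set was recognized, or the
        # second test set falls below the third threshold, record keyword audio
        # and add it to the second test set.
        if all_test1_correct or len(test_2) < third_threshold:
            test_2.append(record_keyword_audio(keyword))
        return train_1, test_1, test_2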
8. An apparatus for speech recognition, comprising:
an acquisition unit configured to acquire a keyword to be recognized;
a composing unit configured to search an original scene training set for audio containing the keyword to form a first training set, and to acquire audio that contains the keyword but is not in the original scene training set to form a first test set;
a first training unit configured to perform a first round of incremental training on a first speech recognition model of an original scene based on the first training set to obtain a second speech recognition model;
a recognition unit configured to recognize the first test set using the second speech recognition model, and to add audio in the first test set whose keyword is correctly recognized to the first training set to obtain a second training set and a second test set;
a second training unit configured to perform a second round of incremental training on the second speech recognition model based on the second training set, resulting in a third speech recognition model;
and an output unit configured to input the second test set into the third speech recognition model to obtain an initial recognition result.
9. An electronic device for speech recognition, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
10. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the method according to any one of claims 1-7.
CN202110380567.6A 2021-04-09 2021-04-09 Speech recognition method and device Active CN115206296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110380567.6A CN115206296B (en) 2021-04-09 2021-04-09 Speech recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110380567.6A CN115206296B (en) 2021-04-09 2021-04-09 Speech recognition method and device

Publications (2)

Publication Number Publication Date
CN115206296A true CN115206296A (en) 2022-10-18
CN115206296B CN115206296B (en) 2025-06-13

Family

ID=83571449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110380567.6A Active CN115206296B (en) 2021-04-09 2021-04-09 Speech recognition method and device

Country Status (1)

Country Link
CN (1) CN115206296B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019105899A (en) * 2017-12-11 2019-06-27 国立研究開発法人情報通信研究機構 Learning method
CN109979440A (en) * 2019-03-13 2019-07-05 广州市网星信息技术有限公司 Keyword sample determines method, audio recognition method, device, equipment and medium
CN110058689A (en) * 2019-04-08 2019-07-26 深圳大学 A kind of smart machine input method based on face's vibration
US20210065699A1 (en) * 2019-08-29 2021-03-04 Sony Interactive Entertainment Inc. Customizable keyword spotting system with keyword adaptation
CN110648659A (en) * 2019-09-24 2020-01-03 上海依图信息技术有限公司 Voice recognition and keyword detection device and method based on multitask model
CN110705717A (en) * 2019-09-30 2020-01-17 支付宝(杭州)信息技术有限公司 Training method, device and equipment of machine learning model executed by computer
US20210104235A1 (en) * 2019-10-02 2021-04-08 Nuance Communications, Inc. Arbitration of Natural Language Understanding Applications
CN111540363A (en) * 2020-04-20 2020-08-14 合肥讯飞数码科技有限公司 Keyword model and decoding network construction method, detection method and related equipment
CN111627427A (en) * 2020-05-15 2020-09-04 北京青牛技术股份有限公司 Method for constructing speech recognition model in specific field

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ROSENBERG, A.: "End-to-End Speech Recognition and Keyword Search on Low-Resource Languages", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 31 December 2017 (2017-12-31) *
赵泽宇: "End-to-end keyword search system without speech recognition using attention mechanism and multi-task training", 《信号处理》 (Journal of Signal Processing), 30 June 2020 (2020-06-30) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118553234A (en) * 2024-06-05 2024-08-27 北京百度网讯科技有限公司 Speech recognition model training, testing, speech recognition method and device

Also Published As

Publication number Publication date
CN115206296B (en) 2025-06-13

Similar Documents

Publication Publication Date Title
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
WO2019174450A1 (en) Dialogue generation method and apparatus
CN113392920B (en) Method, apparatus, device, medium, and program product for generating cheating prediction model
CN106354856A (en) Deep neural network enhanced search method and device based on artificial intelligence
CN114238611B (en) Method, apparatus, device and storage medium for outputting information
CN116610784A (en) Insurance business scene question-answer recommendation method and related equipment thereof
CN114078274A (en) Face image detection method, device, electronic device and storage medium
CN111078849B (en) Method and device for outputting information
CN112650830B (en) Keyword extraction method and device, electronic equipment and storage medium
CN115206296B (en) Speech recognition method and device
CN114281990A (en) Document classification method and device, electronic equipment and medium
CN114676227B (en) Sample generation method, model training method and retrieval method
CN110675865A (en) Method and apparatus for training hybrid language recognition models
CN112784600B (en) Information ordering method, device, electronic equipment and storage medium
CN115906987A (en) Deep learning model training method, virtual image driving method and device
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN113535125A (en) Method and device for generating financial demand items
CN113360696A (en) Image pairing method, device, equipment and storage medium
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN114692866B (en) Method, apparatus and computer program product for assisting model training
CN119312943B (en) Method and device for generating target business model and processing data based on large model
US20240037410A1 (en) Method for model aggregation in federated learning, server, device, and storage medium
CN114550240A (en) Image recognition method and device, electronic equipment and storage medium
CN113869406A (en) Noise sample identification method and device, electronic equipment and storage medium
CN117609531A (en) Image searching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant