
CN109461436B - A method and system for correcting pronunciation errors in speech recognition - Google Patents

Info

Publication number
CN109461436B
CN109461436B (granted from application CN201811239934.5A)
Authority
CN
China
Prior art keywords
acoustic model
audio
pronunciation
prone
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811239934.5A
Other languages
Chinese (zh)
Other versions
CN109461436A (en)
Inventor
魏誉荧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Genius Technology Co Ltd
Original Assignee
Guangdong Genius Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Genius Technology Co Ltd
Priority to CN201811239934.5A
Publication of CN109461436A
Application granted
Publication of CN109461436B
Status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/01: Assessment or evaluation of speech recognition systems
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/60: Speech or voice analysis techniques for measuring the quality of voice signals
    • G10L 2015/0635: Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a method and system for correcting pronunciation errors in speech recognition. The method comprises: establishing a mapping table between a standard acoustic model and an error acoustic model corresponding to error-prone characters; acquiring user voice information; recognizing the user voice information and, when it contains an error-prone character, extracting the audio segment corresponding to the word containing that character; and, when the audio segment matches a speech audio entry in the error acoustic model, prompting the user that the error-prone character is mispronounced and outputting the corresponding speech audio from the standard acoustic model according to the mapping table. By establishing a mapping table between the standard acoustic model and the error acoustic model, the invention prompts the user and outputs the corresponding correct audio whenever it recognizes that the user has mispronounced an error-prone character.

Description

A Method and System for Correcting Pronunciation Errors in Speech Recognition

Technical Field

The present invention relates to the technical field of speech recognition, and in particular to a method and system for correcting pronunciation errors in speech recognition.

Background

With the rapid development of the Internet, people's lives are becoming more and more intelligent. Voice interaction, as one of the mainstream forms of human-computer interaction on intelligent terminals, is increasingly favored by users. An intelligent terminal takes action based on the voice input by the user, so the accuracy of that voice input strongly affects the feedback the terminal produces.

Chinese contains a large number of polyphonic characters (characters with multiple readings) and characters with similar shapes. Some users find it hard to distinguish the less common and rarer polyphonic or similar-shaped characters; worse, for some of these characters the pronunciation a user habitually uses is itself wrong.

In addition, primary school students are still in the process of learning, and especially while their vocabulary is small they often mispronounce words containing polyphonic characters or confuse similar-shaped characters. When an intelligent terminal tries to recognize such speech, this causes recognition errors, and the terminal cannot return the correct query results or give accurate feedback. A method and system for correcting pronunciation errors in speech recognition is therefore needed to solve these problems.

Summary of the Invention

The purpose of the present invention is to provide a method and system for correcting pronunciation errors in speech recognition: by establishing a mapping table between a standard acoustic model and an error acoustic model, the system prompts the user and outputs the corresponding correct audio when it recognizes that the user has mispronounced an error-prone character.

The technical solution provided by the present invention is as follows:

The present invention provides a method for correcting pronunciation errors in speech recognition, comprising:

establishing a mapping table between a standard acoustic model and an error acoustic model corresponding to error-prone characters;

acquiring user voice information;

recognizing the user voice information and, when the voice information contains an error-prone character, extracting the audio segment corresponding to the word containing that error-prone character from the user voice information;

when the audio segment matches a speech audio entry in the error acoustic model, prompting the user that the error-prone character is mispronounced, and outputting the corresponding speech audio from the standard acoustic model according to the mapping table.

Further, before establishing the mapping table between the standard acoustic model and the error acoustic model corresponding to error-prone characters, the method further comprises:

acquiring the error-prone characters and generating target words from them;

acquiring the speech audio of the target words and generating the standard acoustic model from that speech audio;

acquiring the confusable counterparts of each error-prone character, and replacing the error-prone character in each target word with its confusable counterpart to generate confusion words;

acquiring the speech audio of the confusion words and generating the error acoustic model from that speech audio.

Further, the method further comprises:

when the audio segment matches a speech audio entry in the standard acoustic model, prompting the user that the error-prone character is pronounced correctly.

Further, the method further comprises:

when the audio segment matches neither acoustic model's speech audio, converting the audio segment into recognized text, where the acoustic models comprise the standard acoustic model and the error acoustic model;

if the target words contain the recognized text, judging whether the pronunciation in the audio segment is correct; if so, updating the standard acoustic model with the audio segment; otherwise updating the error acoustic model with the audio segment, prompting the user that the error-prone character is mispronounced, and outputting the corresponding speech audio from the standard acoustic model according to the mapping table;

if the target words do not contain the recognized text, updating the target words according to the recognized text and updating the acoustic models with the audio segment.

Further, updating the target words according to the recognized text and updating the acoustic models with the audio segment, when the target words do not contain the recognized text, specifically comprises:

when the target words do not contain the recognized text, updating the target words according to the recognized text;

if the audio segment is pronounced correctly, updating the standard acoustic model with the audio segment, updating the confusion words according to the updated target words, and then updating the error acoustic model with the speech audio of the updated confusion words;

if the audio segment is mispronounced, acquiring the correct speech audio of the recognized text, updating the standard acoustic model with that correct speech audio, updating the confusion words according to the updated target words, and then updating the error acoustic model with the speech audio of the updated confusion words.

The present invention also provides a system for correcting pronunciation errors in speech recognition, comprising:

a mapping table establishment module, which establishes a mapping table between a standard acoustic model and an error acoustic model corresponding to error-prone characters;

an acquisition module, which acquires user voice information;

an extraction module, which recognizes the user voice information acquired by the acquisition module and, when the voice information contains an error-prone character, extracts the audio segment corresponding to the word containing that error-prone character from the user voice information;

a processing module which, when the audio segment extracted by the extraction module matches a speech audio entry in the error acoustic model of the mapping table establishment module, prompts the user that the error-prone character is mispronounced and outputs the corresponding speech audio from the standard acoustic model according to the mapping table established by the mapping table establishment module.

Further, the system further comprises:

an error-prone character acquisition module, which acquires the error-prone characters;

a target word generation module, which generates target words from the error-prone characters acquired by the error-prone character acquisition module;

an audio acquisition module, which acquires the speech audio of the target words generated by the target word generation module;

an acoustic model generation module, which generates the standard acoustic model from the speech audio of the target words acquired by the audio acquisition module;

a confusable character acquisition module, which acquires the confusable counterparts of the error-prone characters acquired by the error-prone character acquisition module;

a confusion word generation module, which replaces the error-prone character in each target word generated by the target word generation module with the confusable counterpart acquired by the confusable character acquisition module, to generate confusion words;

the audio acquisition module further acquires the speech audio of the confusion words generated by the confusion word generation module;

the acoustic model generation module further generates the error acoustic model from the speech audio of the confusion words acquired by the audio acquisition module.

Further, the system further comprises:

the processing module, which, when the audio segment extracted by the extraction module matches a speech audio entry in the standard acoustic model of the mapping table establishment module, prompts the user that the error-prone character is pronounced correctly.

Further, the system further comprises:

the processing module, which, when the audio segment extracted by the extraction module matches neither acoustic model's speech audio in the mapping table establishment module, converts the audio segment into recognized text, where the acoustic models comprise the standard acoustic model and the error acoustic model;

a control module which, if the target words contain the recognized text converted by the processing module, judges whether the pronunciation in the audio segment is correct; if so, updates the standard acoustic model with the audio segment; otherwise updates the error acoustic model with the audio segment, prompts the user that the error-prone character is mispronounced, and outputs the corresponding speech audio from the standard acoustic model according to the mapping table;

the control module, which, if the target words do not contain the recognized text converted by the processing module, updates the target words according to the recognized text and updates the acoustic models with the audio segment.

Further, the control module specifically comprises:

a target word update unit, which updates the target words according to the recognized text when the target words do not contain the recognized text converted by the processing module;

a control unit which, if the audio segment extracted by the extraction module is pronounced correctly, updates the standard acoustic model with the audio segment, updates the confusion words according to the target words updated by the target word update unit, and then updates the error acoustic model with the speech audio of the updated confusion words;

the control unit, which, if the audio segment extracted by the extraction module is mispronounced, acquires the correct speech audio of the recognized text, updates the standard acoustic model with that correct speech audio, updates the confusion words according to the target words updated by the target word update unit, and then updates the error acoustic model with the speech audio of the updated confusion words.

The method and system for correcting pronunciation errors in speech recognition provided by the present invention can bring at least one of the following beneficial effects:

1. In the present invention, a mapping table between the standard acoustic model and the error acoustic model corresponding to error-prone characters is established, so that user voice information can subsequently be matched and recognized and incorrect pronunciations corrected.

2. In the present invention, when matching user voice information, only the audio segments corresponding to words containing error-prone characters are extracted for matching, which reduces the difficulty and workload of the matching process, speeds up matching, and improves matching accuracy.

Brief Description of the Drawings

The preferred embodiments will be described below in a clear and easily understandable manner with reference to the accompanying drawings, further explaining the above characteristics, technical features, advantages and implementations of the method and system for correcting pronunciation errors in speech recognition.

Fig. 1 is a flowchart of a first embodiment of the method for correcting pronunciation errors in speech recognition of the present invention;

Fig. 2 is a flowchart of a second embodiment of the method for correcting pronunciation errors in speech recognition of the present invention;

Fig. 3 is a flowchart of a third embodiment of the method for correcting pronunciation errors in speech recognition of the present invention;

Fig. 4 is a schematic structural diagram of a fourth embodiment, a system for correcting pronunciation errors in speech recognition of the present invention;

Fig. 5 is a schematic structural diagram of a fifth embodiment of the system for correcting pronunciation errors in speech recognition of the present invention;

Fig. 6 is a schematic structural diagram of a sixth embodiment of the system for correcting pronunciation errors in speech recognition of the present invention.

Description of reference numerals:

1000 system for correcting pronunciation errors in speech recognition

1100 mapping table establishment module; 1200 acquisition module; 1300 extraction module;

1400 processing module; 1500 error-prone character acquisition module; 1600 target word generation module;

1700 audio acquisition module; 1800 acoustic model generation module; 1850 confusable character acquisition module;

1900 confusion word generation module;

1950 control module; 1951 target word update unit; 1952 control unit

Detailed Description of Embodiments

In order to describe the embodiments of the present invention and the technical solutions in the prior art more clearly, specific embodiments of the present invention are described below with reference to the accompanying drawings. Obviously, the drawings in the following description show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings, and other implementations, from them without creative effort.

To keep the drawings concise, only the parts related to the present invention are shown schematically; they do not represent the actual structure of a product. In addition, so that the drawings remain concise and easy to understand, where several components share the same structure or function only one of them is drawn or labeled. Herein, "one" means not only "only one" but can also mean "more than one".

In a first embodiment of the present invention, as shown in Fig. 1, a method for correcting pronunciation errors in speech recognition comprises:

S100: Establish a mapping table between a standard acoustic model and an error acoustic model corresponding to error-prone characters.

Specifically, polyphonic characters and Chinese characters that have several similar-shaped counterparts are taken as error-prone characters. Their correct pronunciations form the standard acoustic model, while the other readings of the polyphonic characters and the readings of the similar-shaped characters form the error acoustic model. A mapping table linking the two models is then established, so that the correct pronunciation corresponding to a wrong one can later be looked up when correcting it.
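As a rough illustration (not taken from the patent), the mapping table of step S100 could be keyed by error-prone character, with each known wrong reading pointing at the correct one. The pinyin strings here merely stand in for the stored acoustic-model audio entries:

```python
# Hypothetical sketch of the S100 mapping table: for each error-prone
# character, known wrong readings (error acoustic model side) map to the
# correct reading (standard acoustic model side). Pinyin strings are
# illustrative placeholders for the stored audio entries.
PRONUNCIATION_MAP = {
    "莘": {"xin1": "shen1"},   # 莘莘学子: shen1, often misread as xin1
    "载": {"zai3": "zai4"},    # 下载: zai4, often misread as zai3
    "轴": {"zhou2": "zhou4"},  # 压轴: zhou4, often misread as zhou2
}

def lookup_correct(char, wrong_reading):
    """Return the standard reading mapped to a known wrong reading, or None."""
    return PRONUNCIATION_MAP.get(char, {}).get(wrong_reading)
```

A hit on the wrong-reading side is exactly the situation in which step S400 later prompts the user and plays the standard audio.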

S200: Acquire user voice information.

Specifically, user voice information is acquired. It may be input in real time, for example when a user reads aloud on an intelligent terminal and needs pronunciation correction, or when a primary school student learns new vocabulary through the terminal and the learning results need to be checked. It may also be pre-recorded audio, for example when verifying whether the pronunciation in audio recorded by students is accurate. When the system detects a wrong pronunciation it issues a corrective prompt, and the user can choose whether to enable this prompt-and-correct function system-wide or per application.

S300: Recognize the user voice information and, when the voice information contains an error-prone character, extract the audio segment corresponding to the word containing that error-prone character from the user voice information.

Specifically, the acquired user voice information is first converted into text, and the text is checked for any of the error-prone characters described above. If one is present, word segmentation is used to analyze the sentence constituents and the parts of speech of the words, the words containing error-prone characters are marked, and finally one or more audio segments corresponding to the marked words are extracted from the user voice information according to the marking result.
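Assuming a recognizer that returns word-level timestamps (a common ASR capability, though the patent does not specify one), the extraction in step S300 could be sketched as follows; the input format is an assumption for illustration:

```python
# Sketch of the S300 extraction: keep only recognized words that contain
# an error-prone character, together with their time boundaries, so that
# only those audio segments need to be matched later.
ERROR_PRONE = {"莘", "载", "轴"}  # illustrative set of error-prone characters

def extract_segments(words_with_times, error_prone=ERROR_PRONE):
    """words_with_times: list of (word, start_sec, end_sec) from the ASR.
    Returns the subset whose word contains an error-prone character."""
    return [(w, s, e) for (w, s, e) in words_with_times
            if any(ch in error_prone for ch in w)]
```

The returned (word, start, end) triples identify which slices of the original audio to cut out for matching.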

S400: When the audio segment matches a speech audio entry in the error acoustic model, prompt the user that the error-prone character is mispronounced and output the corresponding speech audio from the standard acoustic model according to the mapping table.

Specifically, for the error-prone character corresponding to each extracted audio segment, the speech audio entries for that character are looked up in the standard acoustic model and the error acoustic model, and the extracted audio segment is then matched against those entries one by one.
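One way to realize this one-by-one matching (a sketch only; the patent does not prescribe a distance measure, and the feature vectors and threshold here are invented for illustration) is to compare a feature vector of the segment against vectors for the stored standard and error audio and take the closer one, subject to a threshold:

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_segment(segment_vec, standard_vec, error_vec, threshold=1.0):
    """Classify one audio segment as 'correct', 'error', or 'no_match'.
    Vectors and threshold are illustrative assumptions, not the patent's."""
    d_std = distance(segment_vec, standard_vec)
    d_err = distance(segment_vec, error_vec)
    if d_err < d_std and d_err <= threshold:
        return "error"       # matches the error model: prompt + play standard audio
    if d_std <= threshold:
        return "correct"
    return "no_match"        # matches neither model: fall through to update logic
```

The "no_match" branch corresponds to the third embodiment's handling, where the segment is converted to recognized text and the models are updated.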

If the matching entry belongs to the error acoustic model, then the part of the user's voice information containing that error-prone character is mispronounced, so the user is prompted that the error-prone character is mispronounced.

Since the same error-prone character may occur several times in the user's voice information but not every occurrence is mispronounced, the mispronunciation prompt must use the surrounding context of the error-prone character in the voice information to point out its specific position.
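To point at the specific occurrence, the prompt could quote the characters around each hit; a minimal sketch (function and parameter names are illustrative):

```python
def occurrences_with_context(text, word, ctx=4):
    """Return (index, snippet) for every occurrence of `word` in `text`,
    with up to `ctx` characters of context kept on each side, so the
    prompt can point at the exact position that was mispronounced."""
    out, i = [], text.find(word)
    while i != -1:
        out.append((i, text[max(0, i - ctx): i + len(word) + ctx]))
        i = text.find(word, i + 1)
    return out
```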

In addition, the correct speech audio of the word formed by the error-prone character at that position is found through the mapping table between the standard acoustic model and the error acoustic model and output to the user, so that the user can correct the pronunciation.

In this embodiment, after the user voice information is acquired, the error-prone characters it contains are identified first, and then the audio segments corresponding to the words containing those characters are extracted and matched against the speech audio in the standard and error acoustic models. Extracting only the relevant audio segments for matching reduces the length of speech that must be matched, lowers the demands on the system's matching capability, speeds up matching, and increases matching accuracy.

A second embodiment of the present invention is an optimization of the first embodiment above and, as shown in Fig. 2, comprises:

S010: Acquire the error-prone characters and generate target words from them.

Specifically, polyphonic characters and Chinese characters with several similar-shaped counterparts are taken as error-prone characters, and target words, i.e. words containing an error-prone character, are generated from them. Because the error-prone character is polyphonic or has similar-shaped counterparts, its pronunciation within the target word is easily confused, as in 莘莘学子, 下载 or 压轴. Some words are also commonly pronounced in a way that, although it does not hinder everyday communication, is in fact wrong, for example 倔强 or 框架; these should be corrected as well. In particular, when primary school students first learn new characters and words, they should form the correct pronunciation from the start to avoid later correction.

S020: Acquire the speech audio of the target words and generate the standard acoustic model from it.

Specifically, the speech audio of each target word is acquired, where each error-prone character may correspond to one or more speech audio entries. For every target word, audio in standard Mandarin pronunciation is acquired; if the user's pronunciation later needs to be corrected, the system preferentially outputs the standard Mandarin audio.

However, because of the influence of the user's gender, age, region, dialect accent and intonation, a user may pronounce a word correctly and yet fail to match the standard Mandarin audio. Therefore, besides the standard Mandarin audio, each target word should as far as possible also have audio of correct pronunciations by people of different ages and regions; for pronunciations that are correct but non-standard, the user can choose whether subsequent prompts and corrections should be given.

The standard acoustic model is generated from the speech audio of the target words and is organized by pronunciation-prone character. When a pronunciation-prone character yields many target words, or a target word corresponds to many audio recordings, a secondary classification by target word may be applied under that character's category.
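As one possible illustration of this two-level classification (a minimal sketch only; the data structure, function name and file names are assumptions, not part of the embodiment), the standard acoustic model can be indexed first by pronunciation-prone character and then by target word:

```python
# Hypothetical sketch: a two-level index for the standard acoustic model,
# keyed first by pronunciation-prone character, then by target word.
# Each leaf holds the recordings for that word (here just file names).
from collections import defaultdict

def build_standard_model(samples):
    """samples: iterable of (prone_char, target_word, audio_path) tuples."""
    model = defaultdict(lambda: defaultdict(list))
    for prone_char, target_word, audio_path in samples:
        model[prone_char][target_word].append(audio_path)
    return model

model = build_standard_model([
    ("载", "下载", "xiazai_std.wav"),    # standard Mandarin reading
    ("载", "下载", "xiazai_north.wav"),  # regional but correct reading
    ("轴", "压轴", "yazhou_std.wav"),
])
# All recordings filed under the character "载" for the word "下载":
print(model["载"]["下载"])
```

The secondary classification by target word is what the surrounding text describes as optional: with few recordings, a single list per character would suffice.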

S030: Acquire the confusable characters of each pronunciation-prone character, and replace the pronunciation-prone character in the target word with a confusable character to generate a confusion word.

Specifically, the confusable characters of a pronunciation-prone character are acquired; a confusable character is either the same character read with one of its other pronunciations or one of its similar-looking characters. Replacing the pronunciation-prone character in the target word with a confusable character generates a confusion word, i.e. the character's other readings, or its similar-looking characters, are combined with the rest of the target word to form confusion words.

A confusion word corresponds to speech the user might produce when misreading the target word. In terms of composition, confusion words do not follow word-formation rules and do not actually exist as words; they are constructed solely so that matching speech audio can be acquired later.
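The replacement described above can be sketched as follows (a minimal illustration; the confusable-character table is a made-up assumption):

```python
# Hypothetical sketch: generate confusion words by substituting each
# confusable character for the pronunciation-prone character it confuses.
def generate_confusion_words(target_word, prone_char, confusable_chars):
    """Replace prone_char in target_word with every confusable character."""
    return [target_word.replace(prone_char, c) for c in confusable_chars]

# "载" in "下载" is easily misread; "栽" is a similar-looking character.
confusions = generate_confusion_words("下载", "载", ["栽"])
print(confusions)  # ["下栽"] — not a real word, used only to match misreadings
```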

S040: Acquire the speech audio of the confusion words, and generate the error acoustic model from the speech audio of the confusion words.

Specifically, the speech audio of each confusion word is acquired. As before, because of the user's gender, age, region, dialect and accent, the same confusion word may correspond to multiple audio recordings. The error acoustic model is generated from this audio and is organized by the corresponding pronunciation-prone character; when a character corresponds to many confusion words, or a confusion word to many recordings, a secondary classification by confusion word may be applied under that character's category.

S100: Establish a mapping table between the standard acoustic model and the error acoustic model corresponding to each pronunciation-prone character.
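A minimal sketch of such a mapping table (the dictionary layout and file names are assumptions for illustration) keys both models by the same pronunciation-prone character, so that the correct recording can be reached from any hit in the error model:

```python
# Hypothetical sketch: map each pronunciation-prone character to its
# standard-model audio and its error-model audio, so a hit in the error
# model leads directly to the corresponding correct recording.
mapping_table = {
    "载": {
        "standard": {"下载": ["xiazai_std.wav"]},
        "error":    {"下栽": ["xiazai_wrong.wav"]},
    },
}

def correct_audio_for(prone_char, target_word):
    """Look up the standard-model recordings used to correct a misreading."""
    return mapping_table[prone_char]["standard"][target_word]

print(correct_audio_for("载", "下载"))  # ["xiazai_std.wav"]
```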

S200: Acquire the user's speech information.

S300: Recognize the user's speech information and, when it contains a pronunciation-prone character, extract from the user's speech the audio segment corresponding to the word containing that character.

S400: When the audio segment matches speech audio in the error acoustic model, prompt the user that the pronunciation-prone character is mispronounced, and output the corresponding speech audio from the standard acoustic model according to the mapping table.

S500: When the audio segment matches speech audio in the standard acoustic model, prompt the user that the pronunciation-prone character is pronounced correctly.

Specifically, the speech audio corresponding to the pronunciation-prone character of the extracted segment is located in both the standard acoustic model and the error acoustic model, and the extracted segment is then matched against these recordings one by one.
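The one-by-one matching can be sketched as below; the `matches` similarity check stands in for whatever acoustic comparison the implementation actually uses and is purely an assumption:

```python
# Hypothetical sketch: match an extracted segment against the standard and
# error recordings of one pronunciation-prone character, one by one.
def classify_segment(segment, standard_audio, error_audio, matches):
    """matches(a, b) -> bool is a stand-in for acoustic similarity scoring."""
    for ref in error_audio:
        if matches(segment, ref):
            return "error"        # S400: prompt and play the standard audio
    for ref in standard_audio:
        if matches(segment, ref):
            return "standard"     # S500: pronunciation is correct
    return "unknown"              # neither model matched

# Toy stand-in: represent recordings as pinyin strings and compare equality.
result = classify_segment("xia1zai4", ["xia4zai4"], ["xia1zai4"],
                          lambda a, b: a == b)
print(result)  # "error"
```

The "unknown" case corresponds to the third embodiment below, where the unmatched segment is converted into recognized text.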

If the matching recording belongs to the standard acoustic model, the pronunciation-prone character in the user's speech is pronounced correctly, but it must further be determined whether the matching recording is the standard Mandarin audio. If it is, the user's pronunciation is both correct and standard; if not, the user is told that the pronunciation is correct but non-standard and asked whether the standard Mandarin audio should be output.

In this embodiment, a standard acoustic model of the correct pronunciations of the pronunciation-prone characters is generated, together with an error acoustic model of the mispronunciations formed by combining the characters' other readings and similar-looking characters with the target words, and a mapping table between the two models is established. These are used to recognize the user's speech and perform the corresponding matching, so that mispronunciations in the user's speech can be accurately identified and corrected.

The third embodiment of the present invention is an optimization of the first embodiment described above. As shown in FIG. 3, it includes:

S100: Establish a mapping table between the standard acoustic model and the error acoustic model corresponding to each pronunciation-prone character.

S200: Acquire the user's speech information.

S300: Recognize the user's speech information and, when it contains a pronunciation-prone character, extract from the user's speech the audio segment corresponding to the word containing that character.

S400: When the audio segment matches speech audio in the error acoustic model, prompt the user that the pronunciation-prone character is mispronounced, and output the corresponding speech audio from the standard acoustic model according to the mapping table.

S600: When the audio segment matches neither acoustic model, convert the audio segment into recognized text, the acoustic models comprising the standard acoustic model and the error acoustic model.

Specifically, the speech audio corresponding to the pronunciation-prone character of the extracted segment is located in the acoustic models, i.e. the standard acoustic model and the error acoustic model, and the extracted segment is then matched against these recordings one by one.

If the segment matches none of the recordings in the acoustic models, the pronunciation it contains has not yet been collected into the models, so the target words, confusion words and acoustic models should be updated from this segment. The segment is recognized and converted into recognized text.

S610: If the target words include the recognized text, determine whether the segment's pronunciation is correct. If so, update the standard acoustic model with the segment; otherwise update the error acoustic model with the segment, prompt the user that the pronunciation-prone character is mispronounced, and output the corresponding speech audio from the standard acoustic model according to the mapping table.

Specifically, the recognized text is matched against the target words. If it matches a target word, it is determined whether the segment's pronunciation is correct. If correct, the segment may have been absent from the standard acoustic model only because, for example, the accent is rare; the standard acoustic model is therefore updated with the segment, the user is told that the pronunciation of the pronunciation-prone character is correct but non-standard, and is asked whether the standard Mandarin audio should be output.

If the segment's pronunciation is incorrect, the segment is a mispronunciation of a target word that the error acoustic model has not yet collected. The error acoustic model is therefore updated with the segment, the user is told that the pronunciation-prone character is mispronounced, and the corresponding standard Mandarin audio is output to correct the user.
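The branching in S610 can be sketched as follows (the function, the flat dictionary models and the return labels are illustrative assumptions):

```python
# Hypothetical sketch of the S610 branch: the recognized text is already a
# known target word, so only the acoustic models need updating.
def handle_known_word(segment, is_correct, standard_model, error_model,
                      prone_char, word):
    if is_correct:
        # Correct but previously uncollected (e.g. a rare accent):
        standard_model.setdefault(prone_char, {}).setdefault(word, []).append(segment)
        return "correct-but-nonstandard"   # offer the Mandarin audio
    # A new mispronunciation: collect it, then correct the user.
    error_model.setdefault(prone_char, {}).setdefault(word, []).append(segment)
    return "mispronounced"                 # play the standard audio

std, err = {}, {}
print(handle_known_word("seg.wav", False, std, err, "载", "下载"))
print(err)  # the error model now contains the new recording
```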

S620: If the target words do not include the recognized text, update the target words with the recognized text, and update the acoustic models with the audio segment.

Specifically, the recognized text is matched against the target words. If it matches none of them, the text has not yet been collected as a target word, so the target words are updated with the recognized text, and the acoustic models are then updated with the segment.

Step S620, in which the target words are updated with the recognized text and the acoustic models are updated with the audio segment when the target words do not include the recognized text, specifically comprises:

S621: When the target words do not include the recognized text, update the target words with the recognized text.

S622: If the segment is pronounced correctly, update the standard acoustic model with the segment, update the confusion words from the updated target words, and then update the error acoustic model with the speech audio of the updated confusion words.

Specifically, after the target words are updated with the recognized text, the segment's pronunciation is judged. If it is correct, the standard acoustic model is updated with the segment; further pronunciation audio of the recognized text may subsequently be acquired to enrich the standard acoustic model.

Then the pronunciation-prone characters in the updated target words are replaced with their confusable characters so that the confusion words are updated correspondingly, the speech audio of the updated confusion words is re-acquired, and the error acoustic model is updated with the re-acquired audio.
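The S621–S622 update chain can be sketched as below (everything here, including the helper name and the flat model layout, is an illustrative assumption):

```python
# Hypothetical sketch of S621/S622: a new correctly pronounced word is
# added as a target word, then its confusion words are regenerated so the
# error model can be refreshed from their (re-acquired) audio.
def add_target_word(word, segment, prone_char, confusable_chars,
                    target_words, standard_model):
    target_words.add(word)                                # S621
    standard_model.setdefault(word, []).append(segment)   # S622, correct case
    # Regenerate the confusion words for the new target word:
    return [word.replace(prone_char, c) for c in confusable_chars]

targets, std_model = set(), {}
new_confusions = add_target_word("装载", "zhuangzai.wav", "载", ["栽"],
                                 targets, std_model)
print(new_confusions)  # ["装栽"] — audio for these is then re-acquired
```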

S623: If the segment is pronounced incorrectly, acquire the correct speech audio of the recognized text, update the standard acoustic model with the correct audio, update the confusion words from the updated target words, and then update the error acoustic model with the speech audio of the updated confusion words.

Specifically, after the target words are updated with the recognized text, the segment's pronunciation is judged. If it is incorrect, the correct speech audio of the recognized text, including the standard Mandarin audio and audio in other dialects and accents, is acquired, and the standard acoustic model is updated with this correct audio.

Then the pronunciation-prone characters in the updated target words are replaced with their confusable characters so that the confusion words are updated correspondingly, the speech audio of the updated confusion words is re-acquired, and the error acoustic model is updated with the re-acquired audio.

In this embodiment, when the extracted audio segment matches neither acoustic model, the segment is converted into recognized text, the text and the segment are then analyzed, and the different cases are handled separately, so that the user's pronunciation is judged and handled quickly and accurately.

The fourth embodiment of the present invention, as shown in FIG. 4, is a system 1000 for correcting pronunciation errors in speech recognition, comprising:

A mapping table establishment module 1100, which establishes a mapping table between the standard acoustic model and the error acoustic model corresponding to each pronunciation-prone character.

Specifically, the mapping table establishment module 1100 treats polyphonic characters and Chinese characters with multiple similar-looking characters as pronunciation-prone characters, builds the standard acoustic model from their correct pronunciations and the error acoustic model from the characters' other readings and similar-looking characters, and then establishes a mapping table between the two models, so that the correct pronunciation can later be found to correct a mispronunciation.

An acquisition module 1200, which acquires the user's speech information.

Specifically, the acquisition module 1200 acquires the user's speech information. This may be input in real time, for example when a user reads aloud on a smart terminal and needs speech correction, or when a primary school student learns new vocabulary through the terminal and the learning results need to be checked. It may also be pre-recorded audio, for example when checking whether the pronunciation in audio recorded by students is accurate. When the system detects a mispronunciation it issues a corrective prompt, and the user can choose whether this prompt-and-correct function is enabled system-wide or per application.

An extraction module 1300, which recognizes the user's speech information acquired by the acquisition module 1200 and, when it contains a pronunciation-prone character, extracts from the user's speech the audio segment corresponding to the word containing that character.

Specifically, the extraction module 1300 first converts the acquired speech into text, then checks whether the text contains any of the pronunciation-prone characters. If it does, word segmentation is used to analyze the sentence constituents and the parts of speech of the words, the words containing pronunciation-prone characters are marked, and finally one or more audio segments corresponding to the marked words are extracted from the user's speech according to the marks.
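The marking step can be sketched as below; the segmentation is reduced to a pre-split word list and a character scan, standing in as an assumption for real word segmentation and forced alignment:

```python
# Hypothetical sketch: mark words containing pronunciation-prone characters
# in the recognized text. A real system would use word segmentation and
# forced alignment to map each marked word to its audio time span.
PRONE_CHARS = {"载", "轴"}

def mark_words(words):
    """words: the recognized text already split into words."""
    return [(i, w) for i, w in enumerate(words)
            if any(ch in PRONE_CHARS for ch in w)]

marked = mark_words(["我", "正在", "下载", "文件"])
print(marked)  # [(2, "下载")] — only this word's audio span is extracted
```

Extracting only the marked spans is what the embodiment credits with shortening the audio that must be matched.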

A processing module 1400, which, when the audio segment extracted by the extraction module 1300 matches speech audio in the error acoustic model of the mapping table establishment module 1100, prompts the user that the pronunciation-prone character is mispronounced, and outputs the corresponding speech audio from the standard acoustic model according to the mapping table established by the mapping table establishment module 1100.

Specifically, the processing module 1400 locates, in the standard acoustic model and the error acoustic model, the speech audio corresponding to the pronunciation-prone character of the extracted segment, and matches the segment against these recordings one by one. If the matching recording belongs to the error acoustic model, the pronunciation-prone character in the user's speech is mispronounced, and the user is prompted accordingly.

Since the same pronunciation-prone character may occur several times in the user's speech but not be mispronounced at every occurrence, when the processing module 1400 reports a mispronunciation it should use the surrounding context in the user's speech to indicate the exact position of the character in question.
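Pinpointing an occurrence by its context can be sketched as follows (a simple illustration; the window size and prompt format are assumptions):

```python
# Hypothetical sketch: report the mispronounced occurrence together with a
# little surrounding context, since the same character may occur several
# times in the speech but be misread only once.
def describe_occurrence(words, index, window=1):
    """Return the marked word with its neighbours for an unambiguous prompt."""
    lo, hi = max(0, index - window), min(len(words), index + window + 1)
    return "…" + "".join(words[lo:hi]) + "…"

words = ["请", "下载", "文件", "再", "下载", "附件"]
# Suppose only the second "下载" (index 4) was mispronounced:
print(describe_occurrence(words, 4))  # "…再下载附件…"
```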

In addition, the correct speech audio of the word formed by the character at that position is found through the mapping table between the standard acoustic model and the error acoustic model, and output to the user so that the pronunciation can be corrected.

In this embodiment, after the user's speech is acquired, the pronunciation-prone characters it contains are identified first, and only the audio segments corresponding to the words containing those characters are extracted and matched against the standard and error acoustic models. Extracting only the relevant segments shortens the speech that must be matched, lowers the demands on the system's matching capability, speeds up matching and improves its accuracy.

The fifth embodiment of the present invention is an optimization of the fourth embodiment described above. As shown in FIG. 5, it includes:

A pronunciation-prone character acquisition module 1500, which acquires the pronunciation-prone characters.

A target word generation module 1600, which generates target words from the pronunciation-prone characters acquired by the pronunciation-prone character acquisition module 1500.

Specifically, the pronunciation-prone character acquisition module 1500 treats polyphonic characters and Chinese characters with multiple similar-looking characters as pronunciation-prone characters, and the target word generation module 1600 generates target words from them. A target word contains a pronunciation-prone character; because such a character is polyphonic or has several similar-looking characters, its pronunciation within the target word is easily confused, for example 莘莘学子, 下载, 压轴. Some words whose common mispronunciation does not hinder everyday communication, for example 倔强, 框架, should also be corrected, especially when primary school students first learn new words, so that correct pronunciation is formed from the start and later correction is avoided.

An audio acquisition module 1700, which acquires the speech audio of the target words generated by the target word generation module 1600.

Specifically, the audio acquisition module 1700 acquires the speech audio of each target word; each pronunciation-prone character may correspond to one or more audio recordings. For every target word, audio in standard Mandarin pronunciation is collected, and if the user's pronunciation later needs to be corrected, the system preferentially outputs the standard Mandarin audio.

However, owing to differences in gender, age, region, dialect and accent, a user's pronunciation may be correct yet fail to match the standard Mandarin audio. Therefore, in addition to the standard Mandarin audio, correctly pronounced audio from speakers of different ages and regions should be collected for each target word where possible; when a pronunciation is correct but non-standard, the user decides whether subsequent prompts and corrections are given.

An acoustic model generation module 1800, which generates the standard acoustic model from the speech audio of the target words acquired by the audio acquisition module 1700.

Specifically, the acoustic model generation module 1800 generates the standard acoustic model from the speech audio of the target words; the model is organized by pronunciation-prone character. When a pronunciation-prone character yields many target words, or a target word corresponds to many audio recordings, a secondary classification by target word may be applied under that character's category.

A confusable character acquisition module 1850, which acquires the confusable characters of the pronunciation-prone characters acquired by the pronunciation-prone character acquisition module 1500.

A confusion word generation module 1900, which replaces the pronunciation-prone characters in the target words generated by the target word generation module 1600 with the confusable characters acquired by the confusable character acquisition module 1850, thereby generating confusion words.

Specifically, the confusable character acquisition module 1850 acquires the confusable characters of each pronunciation-prone character; a confusable character is either the same character read with one of its other pronunciations or one of its similar-looking characters. The confusion word generation module 1900 replaces the pronunciation-prone character in the target word with a confusable character to generate a confusion word, i.e. the character's other readings, or its similar-looking characters, are combined with the rest of the target word to form confusion words.

A confusion word corresponds to speech the user might produce when misreading the target word. In terms of composition, confusion words do not follow word-formation rules and do not actually exist as words; they are constructed solely so that matching speech audio can be acquired later.

The audio acquisition module 1700 acquires the speech audio of the confusion words generated by the confusion word generation module 1900.

The acoustic model generation module 1800 generates the error acoustic model from the speech audio of the confusion words acquired by the audio acquisition module 1700.

Specifically, the audio acquisition module 1700 acquires the speech audio of each confusion word. As before, because of the user's gender, age, region, dialect and accent, the same confusion word may correspond to multiple audio recordings. The acoustic model generation module 1800 generates the error acoustic model from this audio; the model is organized by the corresponding pronunciation-prone character, and when a character corresponds to many confusion words, or a confusion word to many recordings, a secondary classification by confusion word may be applied under that character's category.

The mapping table establishment module 1100 establishes, from the standard acoustic model and the error acoustic model generated by the acoustic model generation module 1800, a mapping table between the two models corresponding to each pronunciation-prone character.

The acquisition module 1200 acquires the user's speech information.

The extraction module 1300 recognizes the user's speech information acquired by the acquisition module 1200 and, when it contains a pronunciation-prone character, extracts from the user's speech the audio segment corresponding to the word containing that character.

The processing module 1400, when the audio segment extracted by the extraction module 1300 matches speech audio in the error acoustic model of the mapping table establishment module 1100, prompts the user that the pronunciation-prone character is mispronounced, and outputs the corresponding speech audio from the standard acoustic model according to the mapping table established by the mapping table establishment module 1100.

The processing module 1400, when the audio segment extracted by the extraction module 1300 matches speech audio in the standard acoustic model of the mapping table establishment module 1100, prompts the user that the pronunciation-prone character is pronounced correctly.

Specifically, the processing module 1400 locates, in the standard acoustic model and the error acoustic model, the speech audio corresponding to the pronunciation-prone character of the extracted segment, and matches the segment against these recordings one by one.

If the recording matched by the processing module 1400 belongs to the standard acoustic model, the pronunciation-prone character in the user's speech is pronounced correctly, but it must further be determined whether the matched recording is the standard Mandarin audio. If it is, the user's pronunciation is both correct and standard; if not, the user is told that the pronunciation is correct but non-standard and asked whether the standard Mandarin audio should be output.

In this embodiment, a standard acoustic model of the correct pronunciations of the pronunciation-prone characters is generated, together with an error acoustic model of the mispronunciations formed by combining the characters' other readings and similar-looking characters with the target words, and a mapping table between the two models is established. These are used to recognize the user's speech and perform the corresponding matching, so that mispronunciations in the user's speech can be accurately identified and corrected.

The sixth embodiment of the present invention is an optimization of the fourth embodiment described above. As shown in FIG. 6, it includes:

A mapping table establishment module 1100, which establishes a mapping table between the standard acoustic model and the error acoustic model corresponding to each pronunciation-prone character.

An acquisition module 1200, which acquires the user's speech information.

An extraction module 1300, which recognizes the user's speech information acquired by the acquisition module 1200 and, when it contains a pronunciation-prone character, extracts from the user's speech the audio segment corresponding to the word containing that character.

A processing module 1400, which, when the audio segment extracted by the extraction module 1300 matches speech audio in the error acoustic model of the mapping table establishment module 1100, prompts the user that the pronunciation-prone character is mispronounced, and outputs the corresponding speech audio from the standard acoustic model according to the mapping table established by the mapping table establishment module 1100.

The processing module 1400, when the audio segment extracted by the extraction module 1300 matches no speech audio in the acoustic models held by the mapping table establishment module 1100, converts the audio segment into recognized text; the acoustic models include the standard acoustic model and the erroneous acoustic model.

Specifically, the processing module 1400 locates, in the acoustic models (that is, in both the standard acoustic model and the erroneous acoustic model), the speech audio of the error-prone character corresponding to the extracted audio segment, and then matches the extracted segment against each located audio one by one.
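The one-by-one matching step might be sketched as below; the similarity measure, feature representation, and threshold are all assumptions, since the patent does not specify a matching algorithm:

```python
def match_segment(segment_features, candidate_audios, similarity, threshold=0.8):
    """Compare an extracted audio segment against each candidate speech
    audio in turn; return the label of the first candidate whose
    similarity clears the threshold, or None if none matches."""
    for label, features in candidate_audios:
        if similarity(segment_features, features) >= threshold:
            return label
    return None

def cosine_like(a, b):
    """Toy cosine similarity over equal-length feature vectors (a
    stand-in for a real acoustic comparison)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

candidates = [("standard", [1.0, 0.0]), ("erroneous", [0.0, 1.0])]
result = match_segment([0.0, 0.9], candidates, cosine_like)  # → "erroneous"
```

When `match_segment` returns `None`, neither model contains the pronunciation, which is exactly the case the following paragraphs handle by converting the segment to recognized text.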

If the audio segment matches none of the speech audio in the acoustic models, the pronunciation it represents has not yet been recorded in either model, so the target words, the confusion words, and the acoustic models should be updated according to the segment. The audio segment is recognized and converted into recognized text.

The control module 1950, if the target words include the recognized text converted by the processing module 1400, judges whether the pronunciation in the audio segment is correct; if so, it updates the standard acoustic model according to the segment; otherwise, it updates the erroneous acoustic model according to the segment, prompts the user that the error-prone character is mispronounced, and outputs the corresponding speech audio from the standard acoustic model according to the mapping table.

Specifically, the recognized text is matched against the target words. If the matching result shows that the recognized text matches some target word, it is judged whether the pronunciation in the audio segment is correct. If it is correct, the segment may have been absent from the standard acoustic model because, for example, the accent is uncommon; the standard acoustic model is therefore updated according to the segment, and the user is informed that the error-prone character is pronounced correctly but non-standardly and asked whether standard Mandarin pronunciation audio should be output.

If the pronunciation in the audio segment is incorrect, the segment represents a mispronunciation of a target word that the erroneous acoustic model has not yet recorded. The erroneous acoustic model is therefore updated according to the segment, the user is prompted that the error-prone character is mispronounced, and the corresponding standard Mandarin pronunciation audio is output to correct the user.
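The two branches just described (recognized text matches a known target word; pronunciation judged correct or incorrect) can be sketched as a small handler. The function names, model structure, and outcome labels are illustrative assumptions, not the patent's implementation:

```python
def handle_known_word(char, segment, pronounced_correctly,
                      standard_model, error_model):
    """Record the segment in the appropriate acoustic model and report
    the outcome for the error-prone character `char`."""
    if pronounced_correctly:
        # Correct but previously unseen (e.g. an uncommon accent):
        # extend the standard acoustic model.
        standard_model.setdefault(char, []).append(segment)
        return "correct_but_nonstandard"
    # A mispronunciation the erroneous model has not yet recorded:
    # extend it and signal that the user should be corrected.
    error_model.setdefault(char, []).append(segment)
    return "mispronounced"

std, err = {}, {}
outcome = handle_known_word("血", "seg1.wav", True, std, err)
```

In the "mispronounced" case the caller would additionally look up and play the standard audio via the mapping table, as the paragraph above describes.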

The control module 1950, if the target words do not include the recognized text converted by the processing module 1400, updates the target words according to the recognized text and updates the acoustic models according to the audio segment.

Specifically, the recognized text is matched against the target words. If the matching result shows that the recognized text matches none of the target words, the text has not yet been recorded as a target word, so the target words are updated according to the recognized text and the acoustic models are then updated according to the audio segment.

The control module 1950 specifically includes:

The target word updating unit 1951 updates the target words according to the recognized text when the target words do not include the recognized text converted by the processing module 1400.

The control unit 1952, if the pronunciation in the audio segment extracted by the extraction module 1300 is correct, updates the standard acoustic model according to the segment, updates the confusion words according to the target words updated by the target word updating unit 1951, and then updates the erroneous acoustic model according to the speech audio of the updated confusion words.

Specifically, after the target words are updated according to the recognized text, the control unit 1952 judges the pronunciation in the audio segment. If it is correct, the standard acoustic model is updated according to the segment; other pronunciation audio of the recognized text can subsequently be acquired to further update the standard acoustic model.

Then the error-prone character in each updated target word is replaced with its confusable characters so as to update the confusion words correspondingly, the speech audio of the updated confusion words is re-acquired, and the erroneous acoustic model is updated according to the re-acquired audio.
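The confusion-word update, which substitutes the error-prone character in each target word with each of its confusable characters, might look as follows; the character lists used here are illustrative assumptions, not data from the patent:

```python
def update_confusion_words(target_words, confusable):
    """For each target word, replace any error-prone character it
    contains with each of that character's confusable characters,
    regenerating the list of confusion words."""
    confusion_words = []
    for word in target_words:
        for char, substitutes in confusable.items():
            if char in word:
                for sub in substitutes:
                    confusion_words.append(word.replace(char, sub))
    return confusion_words

# Illustrative (assumed) data: a target word containing the
# error-prone character 血, with one visually similar character 皿.
words = update_confusion_words(["流血"], {"血": ["皿"]})
```

The speech audio of each regenerated confusion word would then be re-acquired and fed into the erroneous acoustic model, as described above.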

The control unit 1952, if the pronunciation in the audio segment extracted by the extraction module 1300 is incorrect, acquires the correct speech audio of the recognized text, updates the standard acoustic model according to that audio, updates the confusion words according to the target words updated by the target word updating unit 1951, and then updates the erroneous acoustic model according to the speech audio of the updated confusion words.

Specifically, after the target words are updated according to the recognized text, the control unit 1952 judges the pronunciation in the audio segment. If it is incorrect, the correct speech audio corresponding to the recognized text is acquired, including standard Mandarin pronunciation audio as well as audio in other dialects and accents, and the standard acoustic model is updated according to the acquired correct audio.

Then the error-prone character in each updated target word is replaced with its confusable characters so as to update the confusion words correspondingly, the speech audio of the updated confusion words is re-acquired, and the erroneous acoustic model is updated according to the re-acquired audio.

In this embodiment, when the extracted audio segment matches none of the speech audio in the acoustic models, the segment is converted into recognized text; the recognized text and the segment are then analyzed and the different cases are handled separately, so that the user's pronunciation is judged and processed quickly and accurately.
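Taken together, the embodiment's case analysis can be summarized as a single decision function. This is a sketch under the assumption that audio matching and pronunciation judgment are supplied externally; the action labels are invented for illustration:

```python
def classify(match_result, text_is_target, pronounced_correctly):
    """Return the action this embodiment takes for an extracted segment.

    match_result: "standard", "erroneous", or None (no model matched).
    The remaining two flags only apply when match_result is None.
    """
    if match_result == "erroneous":
        # Known mispronunciation: correct the user.
        return "prompt_error_and_play_standard_audio"
    if match_result == "standard":
        # Known correct pronunciation: confirm it.
        return "pronunciation_correct"
    # No model matched: fall back on the recognized text.
    if text_is_target:
        return ("update_standard_model" if pronounced_correctly
                else "update_error_model_and_prompt")
    return "add_target_word_and_update_models"
```

Each return value corresponds to one of the branches handled by the processing module 1400 and the control module 1950 above.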

It should be noted that the above embodiments may be freely combined as required. The foregoing are merely preferred embodiments of the present invention. It should be pointed out that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.

Claims (8)

1. A method for correcting pronunciation errors in speech recognition, comprising:
establishing a mapping table between a standard acoustic model and an erroneous acoustic model corresponding to error-prone characters;
acquiring user voice information;
recognizing the user voice information, and when an error-prone character is contained in the voice information, extracting an audio segment corresponding to a word containing the error-prone character in the user voice information;
when the audio segment matches the speech audio in the erroneous acoustic model, prompting the user that the error-prone character is mispronounced, and outputting the corresponding speech audio in the standard acoustic model according to the mapping table;
wherein, before establishing the mapping table between the standard acoustic model and the erroneous acoustic model corresponding to the error-prone characters, the method further comprises: acquiring the error-prone characters, and generating target words according to the error-prone characters;
when the audio segment matches no speech audio in the acoustic models, converting the audio segment into recognized text, wherein the acoustic models comprise the standard acoustic model and the erroneous acoustic model;
if the target words contain the recognized text, judging whether the pronunciation of the audio segment is correct, and if so, updating the standard acoustic model according to the audio segment; otherwise, updating the erroneous acoustic model according to the audio segment;
and if the target words do not contain the recognized text, updating the target words according to the recognized text, and updating the acoustic models according to the audio segment.
2. The method for correcting pronunciation errors in speech recognition according to claim 1, wherein establishing the mapping table between the standard acoustic model and the erroneous acoustic model corresponding to the error-prone characters further comprises:
acquiring the speech audio of the target words, and generating the standard acoustic model according to the speech audio of the target words;
acquiring the confusable characters of the error-prone characters, and replacing the error-prone characters in the target words with the confusable characters to generate confusion words;
and acquiring the speech audio of the confusion words, and generating the erroneous acoustic model according to the speech audio of the confusion words.
3. The method for correcting pronunciation errors in speech recognition according to claim 1 or 2, further comprising:
when the audio segment matches the speech audio in the standard acoustic model, prompting the user that the error-prone character is pronounced correctly.
4. The method according to claim 2, wherein updating the target words according to the recognized text and updating the acoustic models according to the audio segment when the target words do not contain the recognized text specifically comprises:
when the target words do not contain the recognized text, updating the target words according to the recognized text;
if the pronunciation of the audio segment is correct, updating the standard acoustic model according to the audio segment, updating the confusion words according to the updated target words, and then updating the erroneous acoustic model according to the speech audio of the updated confusion words;
and if the audio segment is mispronounced, acquiring the correct speech audio of the recognized text, updating the standard acoustic model according to the correct speech audio, updating the confusion words according to the updated target words, and then updating the erroneous acoustic model according to the speech audio of the updated confusion words.
5. A system for correcting pronunciation errors in speech recognition, comprising:
a mapping table establishing module, which establishes a mapping table between a standard acoustic model and an erroneous acoustic model corresponding to error-prone characters;
an acquisition module, which acquires user voice information;
an extraction module, which recognizes the user voice information acquired by the acquisition module and, when an error-prone character is contained in the voice information, extracts an audio segment corresponding to a word containing the error-prone character in the user voice information;
a processing module, which prompts the user that the error-prone character is mispronounced when the audio segment extracted by the extraction module matches the speech audio in the erroneous acoustic model in the mapping table establishing module, and outputs the corresponding speech audio in the standard acoustic model according to the mapping table established by the mapping table establishing module;
an error-prone character acquisition module, which acquires the error-prone characters before the mapping table establishing module establishes the mapping table between the standard acoustic model and the erroneous acoustic model corresponding to the error-prone characters;
a target word generation module, which generates target words according to the error-prone characters acquired by the error-prone character acquisition module;
the processing module converting the audio segment into recognized text when the audio segment extracted by the extraction module matches no speech audio in the acoustic models in the mapping table establishing module, wherein the acoustic models comprise the standard acoustic model and the erroneous acoustic model;
and a control module, which, if the target words contain the recognized text converted by the processing module, judges whether the pronunciation of the audio segment is correct and, if so, updates the standard acoustic model according to the audio segment; otherwise, updates the erroneous acoustic model according to the audio segment;
and if the target words do not contain the recognized text converted by the processing module, the control module updates the target words according to the recognized text and updates the acoustic models according to the audio segment.
6. The system for correcting pronunciation errors in speech recognition according to claim 5, further comprising:
an audio acquisition module, which acquires the speech audio of the target words generated by the target word generation module;
an acoustic model generation module, which generates the standard acoustic model according to the speech audio of the target words acquired by the audio acquisition module;
a confusable character acquisition module, which acquires the confusable characters of the error-prone characters acquired by the error-prone character acquisition module;
a confusion word generation module, which replaces the error-prone characters in the target words generated by the target word generation module with the confusable characters acquired by the confusable character acquisition module to generate confusion words;
the audio acquisition module acquiring the speech audio of the confusion words generated by the confusion word generation module;
and the acoustic model generation module generating the erroneous acoustic model according to the speech audio of the confusion words acquired by the audio acquisition module.
7. The system for correcting pronunciation errors in speech recognition according to claim 5 or 6, wherein:
the processing module prompts the user that the error-prone character is pronounced correctly when the audio segment extracted by the extraction module matches the speech audio in the standard acoustic model in the mapping table establishing module.
8. The system for correcting pronunciation errors in speech recognition according to claim 6, wherein the control module specifically comprises:
a target word updating unit, which updates the target words according to the recognized text when the target words do not contain the recognized text converted by the processing module;
and a control unit, which, if the pronunciation of the audio segment extracted by the extraction module is correct, updates the standard acoustic model according to the audio segment, updates the confusion words according to the target words updated by the target word updating unit, and then updates the erroneous acoustic model according to the speech audio of the updated confusion words;
and, if the audio segment extracted by the extraction module is mispronounced, acquires the correct speech audio of the recognized text, updates the standard acoustic model according to the correct speech audio, updates the confusion words according to the target words updated by the target word updating unit, and then updates the erroneous acoustic model according to the speech audio of the updated confusion words.
CN201811239934.5A 2018-10-23 2018-10-23 A method and system for correcting pronunciation errors in speech recognition Active CN109461436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811239934.5A CN109461436B (en) 2018-10-23 2018-10-23 A method and system for correcting pronunciation errors in speech recognition


Publications (2)

Publication Number Publication Date
CN109461436A CN109461436A (en) 2019-03-12
CN109461436B true CN109461436B (en) 2020-12-15

Family

ID=65608234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811239934.5A Active CN109461436B (en) 2018-10-23 2018-10-23 A method and system for correcting pronunciation errors in speech recognition

Country Status (1)

Country Link
CN (1) CN109461436B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349576A (en) * 2019-05-16 2019-10-18 国网上海市电力公司 Power system operation instruction executing method, apparatus and system based on speech recognition
CN110415679B (en) 2019-07-25 2021-12-17 北京百度网讯科技有限公司 Voice error correction method, device, equipment and storage medium
CN110600004A (en) * 2019-09-09 2019-12-20 腾讯科技(深圳)有限公司 Voice synthesis playing method and device and storage medium
CN111353066B (en) * 2020-02-20 2023-11-21 联想(北京)有限公司 Information processing method and electronic equipment
CN113920803B (en) * 2020-07-10 2024-05-10 上海流利说信息技术有限公司 Error feedback method, device, equipment and readable storage medium
CN112259092B (en) * 2020-10-15 2023-09-01 深圳市同行者科技有限公司 Voice broadcasting method and device and voice interaction equipment
CN112786052B (en) * 2020-12-30 2024-05-31 科大讯飞股份有限公司 Speech recognition method, electronic equipment and storage device
CN113672144A (en) * 2021-09-06 2021-11-19 北京搜狗科技发展有限公司 A data processing method and device
CN113938708B (en) * 2021-10-14 2024-04-09 咪咕文化科技有限公司 Live audio error correction method, device, computing equipment and storage medium
CN114141249A (en) * 2021-12-02 2022-03-04 河南职业技术学院 Teaching voice recognition optimization method and system
CN116894442B (en) * 2023-09-11 2023-12-05 临沂大学 Language translation method and system for correcting guide pronunciation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853702A (en) * 2012-12-06 2014-06-11 富士通株式会社 Device and method for correcting idiom error in linguistic data
CN105302795A (en) * 2015-11-11 2016-02-03 河海大学 Chinese text verification system and method based on Chinese vague pronunciation and voice recognition
CN105374356A (en) * 2014-08-29 2016-03-02 株式会社理光 Speech recognition method, speech assessment method, speech recognition system, and speech assessment system
CN107741928A (en) * 2017-10-13 2018-02-27 四川长虹电器股份有限公司 A kind of method to text error correction after speech recognition based on field identification
CN107993653A (en) * 2017-11-30 2018-05-04 南京云游智能科技有限公司 The incorrect pronunciations of speech recognition apparatus correct update method and more new system automatically

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8494852B2 (en) * 2010-01-05 2013-07-23 Google Inc. Word-level correction of speech input


Also Published As

Publication number Publication date
CN109461436A (en) 2019-03-12

Similar Documents

Publication Publication Date Title
CN109461436B (en) A method and system for correcting pronunciation errors in speech recognition
US11043213B2 (en) System and method for detection and correction of incorrectly pronounced words
CN108447486B (en) Voice translation method and device
US5787230A (en) System and method of intelligent Mandarin speech input for Chinese computers
US8024179B2 (en) System and method for improving interaction with a user through a dynamically alterable spoken dialog system
US9471568B2 (en) Speech translation apparatus, speech translation method, and non-transitory computer readable medium thereof
CN111402862B (en) Speech recognition method, device, storage medium and equipment
CN108431883B (en) Language learning systems and language learning programs
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
KR102212332B1 (en) Apparatus and method for evaluating pronunciation accuracy for foreign language education
JP2017058674A (en) Apparatus and method for speech recognition, apparatus and method for training transformation parameter, computer program and electronic apparatus
JP2001296880A (en) Method and device to generate plural plausible pronunciation of intrinsic name
CN109166569B (en) Detection method and device for phoneme mislabeling
KR20160122542A (en) Method and apparatus for measuring pronounciation similarity
KR20140071070A (en) Method and apparatus for learning pronunciation of foreign language using phonetic symbol
Yarra et al. Indic TIMIT and Indic English lexicon: A speech database of Indian speakers using TIMIT stimuli and a lexicon from their mispronunciations
CN105374248A (en) A method, device and system for correcting pronunciation
CN112992184B (en) Pronunciation evaluation method and device, electronic equipment and storage medium
KR20170056253A (en) Method of and system for scoring pronunciation of learner
US20150127352A1 (en) Methods, Systems, and Tools for Promoting Literacy
US20210166681A1 (en) Multi-lingual speech recognition and theme-semanteme analysis method and device
JP2014164261A (en) Information processor and information processing method
KR102405547B1 (en) Pronunciation evaluation system based on deep learning
CN112309429A (en) Method, device and equipment for explosion loss detection and computer readable storage medium
CN111951827B (en) Continuous reading identification correction method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant