CN107481718B - Voice recognition method, voice recognition device, storage medium and electronic equipment - Google Patents

Info

Publication number
CN107481718B
CN107481718B (application CN201710854125.4A)
Authority
CN
China
Prior art keywords: voice, speech, voice data, feature vector, vector sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710854125.4A
Other languages
Chinese (zh)
Other versions
CN107481718A (en)
Inventor
梁昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910473083.9A (granted as CN110310623B)
Priority to CN201710854125.4A (granted as CN107481718B)
Publication of CN107481718A
Application granted
Publication of CN107481718B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/04 — Segmentation; Word boundary detection
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/26 — Speech to text systems
    • G10L 2015/0631 — Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An embodiment of the present application discloses a speech recognition method, apparatus, storage medium and electronic device. The method comprises: acquiring first voice data; inputting the first voice data into a pre-built screening model for screening, and obtaining a speech segment output by the screening model with the set speech features filtered out; and recognizing the speech segment to obtain the corresponding text. This technical solution can effectively reduce the amount of computation in the speech recognition process and increase the recognition speed.

Description

Speech recognition method, apparatus, storage medium and electronic device

TECHNICAL FIELD

Embodiments of the present application relate to speech recognition technology, and in particular to a speech recognition method, apparatus, storage medium and electronic device.

BACKGROUND

With the rapid development of technology applied to electronic devices, such devices now possess powerful processing capabilities and have gradually become indispensable tools for people's daily life, entertainment and work.

Taking a smartphone as an example, most current smartphones are equipped with a voice assistant so that the user can operate the phone conveniently while driving, carrying items, or in other scenarios where operating it through the touch screen is inconvenient. The voice assistant converts voice data input by the user into text. However, current speech recognition solutions involve a large amount of computation and recognize speech slowly.

SUMMARY OF THE INVENTION

Embodiments of the present application provide a speech recognition method, apparatus, storage medium and electronic device, which can reduce the amount of computation in the speech recognition process and increase the recognition speed.

In a first aspect, an embodiment of the present application provides a speech recognition method, including:

acquiring first voice data;

inputting the first voice data into a pre-built screening model for screening, and obtaining a speech segment output by the screening model with the set speech features filtered out, wherein the screening model is trained on voice data samples to which speech features without actual meaning have been added; and

recognizing the speech segment to obtain the corresponding text.

In a second aspect, an embodiment of the present application further provides a speech recognition apparatus, including:

a voice acquisition module, configured to acquire first voice data;

a voice screening module, configured to input the first voice data into a pre-built screening model for screening and to obtain a speech segment output by the screening model with the set speech features filtered out, wherein the screening model is trained on voice data samples to which speech features without actual meaning have been added; and

a speech recognition module, configured to recognize the speech segment to obtain the corresponding text.

In a third aspect, embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the speech recognition method described in the embodiments of the present application.

In a fourth aspect, embodiments of the present application further provide an electronic device, including a voice collector for collecting first voice data, a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the speech recognition method described in the embodiments of the present application is implemented.

The present application provides a speech recognition solution: acquire first voice data; input the first voice data into a pre-built screening model for screening, and obtain a speech segment output by the screening model with the set speech features filtered out; recognize the speech segment to obtain the corresponding text. In this solution, the acquired first voice data is fed into the screening model before recognition. Because the model is trained on voice data samples to which speech features without actual meaning have been added, running the first voice data through it filters out the meaningless phonemes the data contains, yielding a speech segment free of such phonemes. The data volume of the speech segment output by the screening model is therefore smaller than that of the first voice data, and recognizing this reduced segment effectively cuts the amount of computation in the recognition process and increases recognition speed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of a speech recognition method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of the basic structure of a single neuron provided by an embodiment of the present application;

FIG. 3 is a flowchart of another speech recognition method provided by an embodiment of the present application;

FIG. 4 is a flowchart of yet another speech recognition method provided by an embodiment of the present application;

FIG. 5 is a structural block diagram of a speech recognition apparatus provided by an embodiment of the present application;

FIG. 6 is a structural block diagram of an electronic device provided by an embodiment of the present application.

DETAILED DESCRIPTION

The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely explain the present application and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present application rather than the entire structure.

Before the exemplary embodiments are discussed in more detail, it should be mentioned that some of them are described as processes or methods depicted in flowcharts. Although a flowchart describes the steps as a sequential process, many of the steps may be performed in parallel or concurrently. Furthermore, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure. A process may correspond to a method, function, procedure, subroutine, subprogram, and the like.

In the related art, speech recognition generally comprises endpoint detection, feature extraction and matching. To locate the start and end of speech precisely, a double-threshold detection algorithm is usually adopted: the short-time zero-crossing rate and the short-time average energy are each used to examine the voice data, and the two measures together determine the endpoints (start time and end time) of the speech signal. Feature extraction essentially converts the voice data from an analog signal into a digital signal and represents it by a series of characteristic parameters that reflect its properties. Mel-frequency cepstral coefficients (MFCCs) are derived from an auditory model of the human ear; because they approximate human hearing, they improve recognition performance considerably, so the feature extraction procedure is illustrated here with MFCC extraction. It includes the following steps: frame the audio signal with a preset window function using a fixed frame length and frame shift (for example, a frame length of 25 ms and a frame shift of 10 ms); apply a fast Fourier transform (FFT) to convert the time-domain signal into a power spectrum; process that spectrum with a bank of mel filters to obtain the mel spectrum; and perform cepstral analysis on the mel spectrum (taking the logarithm followed by a discrete cosine transform) to obtain the MFCC parameters. The MFCC parameters of each sound frame serve as that frame's speech feature vector sequence. The feature vector sequences of the frames are input into a hidden Markov model, and the state output by the model that matches at least one frame is obtained (that is, the probabilities that the frame matches each state are compared, and the state with the highest probability is taken as the match). Three states acquired in sequence form a phoneme, and the pronunciation of a word is determined from the phonemes, thereby realizing speech recognition. However, this scheme cannot distinguish phonemes with actual meaning from phonemes without actual meaning (for example, fillers such as "this", "that", "how should I put it" and "that is to say" in a user's speech habits), so the recognition process involves a large amount of computation and is slow.
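
As a concrete illustration of this related-art pipeline, the following is a minimal sketch of MFCC extraction using the librosa library; the file name, 16 kHz sampling rate and coefficient count are assumptions for illustration, not values prescribed by this application.

```python
import librosa

# Load an utterance (file name and 16 kHz sampling rate assumed for illustration).
signal, sr = librosa.load("utterance.wav", sr=16000)

# Frame the signal with a 25 ms window and a 10 ms shift, then apply an FFT,
# a mel filter bank, a logarithm and a DCT to obtain 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(
    y=signal,
    sr=sr,
    n_mfcc=13,                   # number of cepstral coefficients per frame
    n_fft=int(0.025 * sr),       # 25 ms frame length
    hop_length=int(0.010 * sr),  # 10 ms frame shift
)

# mfcc has shape (13, n_frames): one 13-dimensional feature vector per sound frame.
print(mfcc.shape)
```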

FIG. 1 is a flowchart of a speech recognition method provided by an embodiment of the present application. The method may be executed by a speech recognition apparatus, which may be implemented in software and/or hardware and is generally integrated in an electronic device. As shown in FIG. 1, the method includes:

Step 110: acquire first voice data.

The first voice data includes a voice signal input by a user: for example, the voice signal input when using the voice input function of an SMS application, of a memo application, of a mail application, or of an instant messaging application.

A voice collector is integrated in the electronic device, and the first voice data can be acquired through it. The voice collector includes a microphone as well as wireless earphones such as Bluetooth or infrared earphones. Taking a smartphone as an example, when the user enables the voice input function of the SMS application, voice input can replace manual input: the user speaks a voice instruction to the smartphone, and the smartphone converts the corresponding voice signal into text and displays it in the SMS application interface. Preprocessing the voice signal corresponding to the user's voice instruction yields the first voice data; the preprocessing includes filtering, analog-to-digital conversion and the like. It should be noted that because users often slip into colloquial expression unconsciously while speaking, the first voice data may include words with no actual meaning such as "this", "that", "how should I put it" and "that is to say".

Step 120: input the first voice data into a pre-built screening model for screening, and obtain a speech segment output by the screening model with the set speech features filtered out.

The screening model is trained on voice data samples to which speech features without actual meaning have been added. Taking a neural network model as an example, the training process of the screening model includes:

Model initialization: set the number of hidden layers, the number of nodes in the input, hidden and output layers, the connection weights between layers, and the initial thresholds of the hidden and output layers, to obtain a preliminary framework of the neural network model.

Speech recognition: compute the output parameters of the hidden layers and of the output layer according to the formulas of the neural network model; the output of each node is computed from the results of the previous layer, the connection weights between the two layers, and the node's own external bias value.

Error calculation: adjust the parameters of the neural network model by supervised learning. The voice data entered by voice input in the user's historically sent short messages, together with the corresponding text, is acquired. Because a short message the user confirmed and sent can be regarded as data that has been adjusted to contain no meaningless words and that conforms to the user's expression habits, it can serve as a standard voice data sample. Correspondingly, the expected output for a voice data sample is the speech (pronunciation) of the text corresponding to that voice data. Training samples are obtained by adding speech features without actual meaning to the voice data samples. Such features can be obtained by collecting the expression habits of a sample population of a set size and extracting the meaningless words that occur with high probability, by letting the user select the meaningless words he or she commonly uses, or by automatically tallying the meaningless words the user commonly uses, among other approaches.

Compute the actual output and the expected output of the neural network model to obtain the error signal between them, and then update the connection weight and external bias value of each neuron according to this error signal. FIG. 2 shows the basic structure of a single neuron provided by an embodiment of the present application. In FIG. 2, ωi1 is the connection weight between neuron i and a neuron in the previous layer, which can also be understood as the weight of input x1; θi is the external bias of the neuron. According to the network prediction error, the error is propagated backwards through the network, modifying each neuron's connection weight and external bias value. Whether the algorithm iteration has finished is then judged; if so, construction of the screening model is complete.
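
The computation and update of a single neuron described above can be sketched in plain numpy as follows; the sigmoid activation, squared-error loss and learning rate are assumptions for illustration, since the application does not fix them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs x from the previous layer, connection weights w (e.g., the ωi1 of FIG. 2)
# and the external bias theta of neuron i; all values are illustrative.
x = np.array([0.4, 0.1, 0.7])
w = np.array([0.2, -0.5, 0.3])
theta = 0.1

# Forward pass: weighted sum of the previous layer's outputs plus the bias.
y = sigmoid(np.dot(w, x) + theta)

# Error signal between the actual and the expected output.
expected = 1.0
error = y - expected

# Backward pass: update the connection weights and the external bias
# in proportion to the error, as in error back-propagation.
learning_rate = 0.05           # assumed value
grad = error * y * (1.0 - y)   # derivative of the squared error through the sigmoid
w -= learning_rate * grad * x
theta -= learning_rate * grad
```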

Input the first voice data into the constructed screening model. For paths corresponding to pronunciations without actual meaning in the first voice data, the connection weights are small; as the input parameters pass between the hidden layers of the model, or from a hidden layer to the output layer, multiplication by these weights progressively shrinks them, so after repeated computation the speech features (e.g., phonemes) without actual meaning in the first voice data are filtered out. The output of the screening model is a speech segment with the meaningless speech features filtered out.

Step 130: recognize the speech segment to obtain the corresponding text.

Compute the distance between the speech segment and preset reference templates; for each sound frame in the segment, the pronunciation in the reference template at the shortest distance is taken as that frame's pronunciation, and the combination of the frames' pronunciations is the speech of the segment. Once the speech of the segment is known, a preset dictionary can be queried to determine the corresponding text.

In the technical solution of this embodiment, the acquired first voice data is input into the screening model before recognition. Because the model's training samples are voice data samples to which speech features without actual meaning have been added, running the first voice data through the model filters out the meaningless phonemes it contains, yielding a speech segment free of them. The data volume of that segment is smaller than that of the first voice data, and recognizing the reduced segment effectively cuts the amount of computation in the recognition process and increases recognition speed.

FIG. 3 is a flowchart of another speech recognition method provided by an embodiment of the present application. As shown in FIG. 3, the method includes:

Step 301: acquire first voice data.

Step 302: judge whether the user corresponding to the first voice data is a registered user; if so, execute step 303; otherwise execute step 306.

When the first voice data is detected, the camera of the electronic device is turned on to capture at least one frame of a user image. Through image processing and image recognition on the user image, it is determined whether the user who input the first voice data is a registered user; this can be done by image matching. For example, when a user registers, a user image is acquired as a matching template; when first voice data is detected, a user image is captured and matched against the template, thereby determining whether the user corresponding to the first voice data is registered.

Step 303: acquire historical voice data of at least one registered user, and determine each registered user's speech rate and pause interval from the historical voice data.

When the user corresponding to the first voice data is a registered user, that user's historical voice data is acquired, including historical call data, historical voice control data and historical voice messages. Analyzing the historical voice data yields each registered user's average speech rate and average pause interval, both obtained by weighted calculation. Each registered user's speech rate and pause interval in different scenarios can also be determined.
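
One hedged way to realize the weighted calculation mentioned above is to weight recent recordings more heavily; the recency weights and all numeric values below are assumptions for illustration.

```python
import numpy as np

# (speech rate in words per minute, pause interval in seconds) per historical
# recording, ordered from oldest to newest; the values are illustrative.
history = np.array([
    [180.0, 0.50],
    [195.0, 0.42],
    [205.0, 0.38],
])

# Assumed recency weights: newer recordings count more toward the user's habit.
weights = np.array([0.2, 0.3, 0.5])

avg_rate, avg_pause = weights @ history
print(avg_rate, avg_pause)  # weighted average speech rate and pause interval
```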

Step 304: query a preset set of framing strategies according to the speech rate and pause habit, and determine the framing strategy corresponding to the registered user.

A framing strategy comprises the choice of window function, the frame length and the frame shift, and it is associated with different users' language habits. The framing strategy set is a collection of framing strategies that stores the correspondence between speech rate intervals and pause interval ranges on the one hand and window function, frame length and frame shift on the other.

Using the speech rate and pause interval determined above, query the speech rate intervals and pause interval ranges stored in the framing strategy set, locate the interval they fall into, and take the window function, frame length and frame shift corresponding to that interval as the framing strategy for the current voice data input by the registered user.
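
A minimal sketch of such a lookup table follows; the interval boundaries, window names and frame values are assumptions for illustration, since the application does not fix them.

```python
# Each entry maps a (speech-rate interval, pause-interval range) to a framing
# strategy (window function, frame length in ms, frame shift in ms).
# All boundary and strategy values below are assumed for illustration.
STRATEGY_SET = [
    # (rate_low, rate_high, pause_low, pause_high, window, frame_ms, shift_ms)
    (0,   160, 0.45, 10.0, "hamming", 30, 12),  # slow speakers, long pauses
    (160, 220, 0.30, 0.45, "hamming", 25, 10),  # typical speakers
    (220, 999, 0.00, 0.30, "hann",    20, 8),   # fast speakers, short pauses
]

DEFAULT = ("hamming", 25, 10)

def lookup_strategy(rate, pause):
    """Locate the interval the user's speech rate and pause interval fall into."""
    for r_lo, r_hi, p_lo, p_hi, window, frame_ms, shift_ms in STRATEGY_SET:
        if r_lo <= rate < r_hi and p_lo <= pause < p_hi:
            return window, frame_ms, shift_ms
    return DEFAULT  # fall back to the default framing strategy

print(lookup_strategy(195.0, 0.42))  # -> ('hamming', 25, 10)
```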

Step 305: frame the first voice data according to the registered user's framing strategy to obtain at least two pieces of second voice data, and then execute step 307.

Because voice data is stationary only over short durations, it must be divided into short segments, i.e., sound frames.

For example, the first voice data is processed with the window function included in the framing strategy determined above, using the frame shift included in that strategy, to obtain at least two pieces of second voice data; the window length of the window function equals the frame length of the strategy. After the second voice data is obtained, execution proceeds to step 307. Since the division of the first voice data depends on the registered user's speech rate and pause interval, the frame length of the resulting second voice data varies with them rather than being fixed. This reduces the cases in which meaningful and meaningless speech are placed in the same sound frame, which helps improve the efficiency of speech recognition.
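
Framing the signal with the chosen window function, frame length and frame shift can be sketched with numpy as follows; the Hamming window is an assumption carried over from the strategy lookup sketched above.

```python
import numpy as np

def frame_signal(signal, sr, frame_ms, shift_ms, window="hamming"):
    """Split a 1-D signal into overlapping windowed frames (the second voice data)."""
    frame_len = int(sr * frame_ms / 1000)  # window length equals the frame length
    shift = int(sr * shift_ms / 1000)
    win = np.hamming(frame_len) if window == "hamming" else np.hanning(frame_len)
    n_frames = max(1, 1 + (len(signal) - frame_len) // shift)
    frames = np.stack([
        signal[i * shift : i * shift + frame_len] * win
        for i in range(n_frames)
    ])
    return frames  # shape (n_frames, frame_len)

sr = 16000
signal = np.random.randn(sr)  # one second of illustrative audio
frames = frame_signal(signal, sr, frame_ms=25, shift_ms=10)
print(frames.shape)           # (98, 400)
```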

Step 306: frame the first voice data according to a default framing strategy to obtain at least two pieces of second voice data.

When the user corresponding to the first voice data is not a registered user, the first voice data is processed with a default window function and default frame shift to obtain at least two pieces of second voice data, the window length being the default frame length. The frame length of the second voice data obtained this way is fixed, so meaningful and meaningless speech are more often divided into the same sound frame.

Step 307: extract the first speech feature vector sequence corresponding to the second voice data.

The first speech feature vector sequence includes MFCC features. They are extracted from the second voice data as follows: filter the spectrogram of the second voice data with a bank of mel filters to obtain the mel spectrum; perform cepstral analysis on the mel spectrum to obtain the mel-frequency cepstral coefficients; and use these coefficients as the dynamic feature vector input to the screening model, i.e., the first speech feature vector sequence.

Step 308: normalize the first speech feature vector sequence, then input it into a pre-built recurrent neural network model for screening.

Optionally, before the first speech feature vector sequence is input into the pre-built recurrent neural network model, it may be normalized; it should be understood that normalization is not a mandatory step. Normalization maps every first speech feature vector sequence to numbers in [0, 1] or [-1, 1], which eliminates the influence of differing units and value ranges of the input data on speech recognition and reduces recognition error.
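
A minimal sketch of the min-max normalization described here, mapping a feature vector sequence into [0, 1] or [-1, 1]:

```python
import numpy as np

def normalize(features, low=0.0, high=1.0, eps=1e-8):
    """Min-max map a feature vector sequence into [low, high]."""
    f_min, f_max = features.min(), features.max()
    scaled = (features - f_min) / (f_max - f_min + eps)  # now in [0, 1]
    return low + scaled * (high - low)

mfcc = np.random.randn(98, 13)             # illustrative feature vector sequence
unit = normalize(mfcc)                     # mapped into [0, 1]
sym = normalize(mfcc, low=-1.0, high=1.0)  # mapped into [-1, 1]
```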

After the first speech feature vector sequence is normalized, it is input into the pre-built neural network model for screening, where the neural network model is a recurrent neural network model.

Step 309: obtain the output result of the recurrent neural network model, where the output result is a second speech feature vector sequence with phonemes of no actual meaning filtered out.

A phoneme is the smallest unit of speech. Analyzed by the articulatory actions within a syllable, each action constitutes one phoneme; phonemes include vowels and consonants.

Because the recurrent neural network model is constructed by learning from training samples to which phonemes of no actual meaning were added, its output is a speech segment with such phonemes filtered out. Therefore, after the first speech feature vector sequence is input into the recurrent neural network model, the output speech segment is the second speech feature vector sequence with the meaningless phonemes removed.
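
One hedged way to realize such a screening model is a recurrent network that scores each frame as meaningful or not and drops the low-scoring frames; the GRU architecture, layer sizes and threshold below are assumptions, as the application does not specify the network internals.

```python
import torch
import torch.nn as nn

class ScreeningModel(nn.Module):
    """Assumed sketch: a GRU scores each frame; frames judged meaningless are dropped."""
    def __init__(self, feat_dim=13, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.keep = nn.Linear(hidden, 1)  # per-frame keep/drop score

    def forward(self, x, threshold=0.5):
        h, _ = self.rnn(x)                          # (batch, frames, hidden)
        p_keep = torch.sigmoid(self.keep(h))        # (batch, frames, 1)
        mask = p_keep.squeeze(-1) > threshold       # frames judged to carry meaning
        return [seq[m] for seq, m in zip(x, mask)]  # filtered feature sequences

model = ScreeningModel()
first_seq = torch.randn(1, 98, 13)  # normalized first speech feature vector sequence
second_seq = model(first_seq)[0]    # second sequence, meaningless frames removed
```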

Step 310: judge whether the length of the second speech feature vector sequence equals that of the preset reference template; if so, execute step 313; otherwise execute step 311.

Obtain the length of the second speech feature vector sequence and compare it with the length of the preset reference template. If the lengths differ, execute step 311; if they are the same, execute step 313.

Step 311: compute the frame matching distance between the second speech feature vector sequence and the reference template using the dynamic time warping algorithm.

Dynamic time warping (DTW) is a method for measuring the similarity between two time series; in speech recognition it is mainly used to decide whether two utterances represent the same word.

For example, if the second speech feature vector sequence and the preset reference template differ in length, the DTW algorithm can compute the frame matching distance matrix between them and find the best path through it, the best path being the one with the minimum matching distance.
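
A minimal sketch of the DTW distance-matrix computation follows; the Euclidean frame distance and the illustrative templates are assumptions.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Frame matching distance between two feature sequences of unequal length.

    Fills the cumulative distance matrix and returns the cost of the best
    (minimum-distance) path; Euclidean frame distance is assumed.
    """
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    return acc[n, m]

# Pick the reference template at the minimum frame matching distance.
segment = np.random.randn(80, 13)  # second speech feature vector sequence
templates = {"hello": np.random.randn(95, 13), "stop": np.random.randn(60, 13)}
pronunciation = min(templates, key=lambda k: dtw_distance(segment, templates[k]))
```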

Step 312: determine the pronunciation corresponding to the minimum frame matching distance, and then execute step 314.

Determine the speech in the reference template, and the second speech feature vector sequence, at the endpoint corresponding to the minimum frame matching distance, and take the speech in that reference template as the pronunciation of the second speech feature vector sequence.

Step 313: directly match the second speech feature vector sequence against the reference template to determine the pronunciation corresponding to the speech segment.

If the second speech feature vector sequence and the preset reference template have the same length, match them directly to determine the pronunciation corresponding to the speech segment.

Step 314: match the corresponding text according to the pronunciation, as the speech recognition result.

In the technical solution of this embodiment, before speech recognition the framing strategy is determined from the user's speech rate and pause interval, and the first voice data is framed with this personalized strategy. Personalized framing effectively reduces the number of sound frames in which speech features with actual meaning and speech features without actual meaning are mixed. Inputting the first speech feature vector sequence corresponding to the framed second voice data into the screening model can further improve speech recognition efficiency.

FIG. 4 is a flowchart of yet another speech recognition method provided by an embodiment of the present application. As shown in FIG. 4, the method includes:

Step 401: judge whether the model update condition is satisfied; if so, execute step 402; otherwise execute step 408.

The model update condition may be that the system time reaches a preset time, or that a preset update period has elapsed. For example, if the condition is set to update the screening model at 12 p.m. every Friday, then when the system time is detected to be 12 p.m. on Friday, the condition is deemed satisfied. As another example, if the condition is an update every 7 days, then when the time since the last model update is detected to satisfy the update period, the condition is deemed satisfied.

Step 402: acquire sent short messages that were input by voice, and/or stored memos that were input by voice.

Acquire sent short messages input by voice and stored memos. Because a short message the user confirmed and sent can be regarded as data that has been adjusted to contain no meaningless words and that conforms to the user's expression habits, it can serve as a standard voice data sample; a saved memo can be regarded the same way and used likewise.

The speech feature vector sequence of the voice data corresponding to the body of each sent voice-input short message is saved in advance, together with the voice data the user dictated, the dictated voice data serving as the historical voice data. For example, when sending a short message by voice input, the user may dictate "As for this problem, how should I put it, it really is hard to solve", while the short message actually sent after processing is "As for this problem, it really is hard to solve". The speech feature vector sequence of the user's dictated voice data is stored in correspondence with the voice data of the short message actually sent.

Step 403: acquire the speech feature vector sequence of the voice data corresponding to the body content of the short message and/or memo.

Acquire the speech feature vector sequence of the voice data in the body of a sent short message; optionally, the sequence for the voice data in the body of a stored memo may be acquired instead.

Step 404: acquire the historical voice data of the short message and/or memo.

Acquire the user's dictated input corresponding to a sent short message as the historical voice data; optionally, the dictated input corresponding to a stored memo may be acquired instead.

Step 405: determine, from the historical voice data, the personalized phonemes without actual meaning and the positions at which they appear.

Analyzing the historical voice data reveals a given user's language habits, i.e., the phonemes without actual meaning and where they appear. For example, during voice input the user may like to insert a meaningless phrase such as "how should I put it" in the middle of a sentence.
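
A hedged sketch of mining a user's personalized fillers and their positions from transcribed history follows; the filler candidates, illustrative transcripts and position buckets are assumptions.

```python
from collections import Counter

# Transcripts of the user's dictated historical voice data (illustrative).
history = [
    "about this problem how should I put it it is hard to solve",
    "the meeting how should I put it went fine",
]

# Assumed candidate filler phrases to search for.
FILLERS = ["how should I put it", "that is to say"]

positions = Counter()
for utterance in history:
    for filler in FILLERS:
        idx = utterance.find(filler)
        if idx == -1:
            continue
        # Bucket the appearance position: start, middle or end of the sentence.
        rel = idx / max(1, len(utterance))
        bucket = "start" if rel < 0.2 else ("end" if rel > 0.8 else "middle")
        positions[(filler, bucket)] += 1

print(positions.most_common())  # e.g. [(('how should I put it', 'middle'), 2)]
```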

Step 406: add the phonemes to the speech feature vector sequence at the appearance positions as training samples, take the speech feature vector sequence as the expected output, and train the screening model by supervised learning.

Normalizing the training samples eliminates the influence of differing units and value ranges of the input data on speech recognition; at the same time, it helps map the input data into the effective range of the activation function, reducing network training error and training time.

Step 407: adjust the parameters of the screening model according to the training result, the parameters including connection weights and external bias values.

The network prediction error can be determined by analyzing the training samples and the expected output. Following the way error is propagated from back (output layer) to front (input layer) in a neural network model, each neuron's connection weight and external bias value are modified accordingly.

Step 408: acquire first voice data.

If the model update process above has not finished when the first voice data is acquired, the first voice data is not recognized, and the user is prompted that the screening model is currently being updated.

Step 409: input the first voice data into the pre-built screening model for screening, and obtain the speech segment output by the screening model with the set speech features filtered out.

If no model update operation is in progress when the first voice data is acquired, input the first voice data into the screening model, which screens it to obtain a speech segment with the meaningless speech features filtered out.

Step 410: recognize the speech segment to obtain the corresponding text.

Step 411: judge whether the text is command information; if so, execute step 412; otherwise execute step 413.

The association between text combinations and command information is stored in advance in a whitelist. When the text corresponding to a speech segment is recognized, the whitelist is queried with that text's combination. If a corresponding combination is found, the text is determined to represent command information and step 412 is executed. If not, the user is prompted to choose whether it is command information: if the user indicates that the text represents command information, the combination the user confirmed is added to the whitelist and step 412 is executed; if the user indicates it does not, step 413 is executed.
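
A minimal sketch of the whitelist lookup with a user-confirmation fallback; the stored entries and the prompt function are assumptions for illustration.

```python
# Whitelist mapping recognized text combinations to command information (assumed entries).
COMMAND_WHITELIST = {
    "call mom": "dial_contact:mom",
    "open camera": "launch_app:camera",
}

def ask_user_is_command(text):
    """Placeholder for prompting the user; assumed to return a boolean."""
    return False

def handle_recognized_text(text):
    command = COMMAND_WHITELIST.get(text)
    if command is not None:
        return ("execute", command)     # step 412: run the command
    if ask_user_is_command(text):
        COMMAND_WHITELIST[text] = text  # learn the user-confirmed combination
        return ("execute", text)
    return ("display", text)            # step 413: show the text in the UI

print(handle_recognized_text("open camera"))  # -> ('execute', 'launch_app:camera')
```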

Step 412: execute the operation corresponding to the command information.

Step 413: display the text in the user interface.

In the technical solution of this embodiment, when the update condition of the screening model is satisfied, sent voice-input short messages and/or stored voice-input memos are used as training samples to train the screening model. This keeps the model's output adapted to the user's changing expression habits and effectively reduces the false recognition rate and the miss rate.

FIG. 5 is a structural block diagram of a speech recognition apparatus provided by an embodiment of the present application. The apparatus may be implemented in software and/or hardware and is generally integrated in an electronic device. As shown in FIG. 5, the apparatus may include:

a voice acquisition module 510, configured to acquire first voice data;

a voice screening module 520, configured to input the first voice data into a pre-built screening model for screening and to obtain a speech segment output by the screening model with the set speech features filtered out, wherein the screening model is trained on voice data samples to which speech features without actual meaning have been added; and

a speech recognition module 530, configured to recognize the speech segment to obtain the corresponding text.

An embodiment of the present application provides a speech recognition apparatus that inputs the acquired first voice data into the screening model before recognition. Because the model's training samples are voice data samples to which speech features without actual meaning have been added, running the first voice data through the model filters out the meaningless phonemes it contains, yielding a speech segment free of them. The data volume of that segment is smaller than that of the first voice data, and recognizing the reduced segment effectively cuts the amount of computation in the recognition process and increases recognition speed.

Optionally, the apparatus further includes:

a user judgment module, configured to judge, when first voice data is detected, whether the user corresponding to the first voice data is a registered user;

and the apparatus further includes:

a framing module, configured to determine the corresponding framing strategy according to the judgment result before the first voice data is input into the pre-built screening model for screening, and to frame the first voice data according to the framing strategy to obtain at least two pieces of second voice data;

wherein the framing strategy includes the choice of window function, the frame length and the frame shift, and is associated with different users' language habits.

Optionally, the framing module is specifically configured to:

acquire historical voice data of at least one registered user, and determine each registered user's speech rate and pause interval from the historical voice data; and

query the preset set of framing strategies according to the speech rate and pause habit, and determine the framing strategy corresponding to the registered user.

Optionally, the voice screening module 520 is specifically configured to:

extract the first speech feature vector sequence corresponding to the second voice data;

normalize the first speech feature vector sequence and input it into a pre-built recurrent neural network model for screening; and

obtain the output result of the recurrent neural network model, where the output result is a second speech feature vector sequence with phonemes of no actual meaning filtered out.

Optionally, the speech recognition module 530 is specifically configured to:

judge whether the length of the second speech feature vector sequence equals that of the preset reference template;

when they are not equal, compute the frame matching distance between the second speech feature vector sequence and the reference template using the dynamic time warping algorithm; and

determine the pronunciation corresponding to the minimum frame matching distance, and take the text matching that pronunciation as the speech recognition result.

Optionally, the apparatus further includes:

a text processing module, configured to judge, after the speech segment is recognized into the corresponding text, whether the text is command information;

if so, execute the operation corresponding to the command information;

if not, display the text in the user interface.

Optionally, the apparatus further includes:

a model update module, configured to: when the model update condition is satisfied, acquire sent short messages input by voice and/or stored memos input by voice;

acquire the speech feature vector sequence of the voice data corresponding to the body content of the short message and/or memo;

acquire the historical voice data of the short message and/or memo;

determine, from the historical voice data, the personalized phonemes without actual meaning and the positions at which they appear;

add the phonemes to the speech feature vector sequence at the appearance positions as training samples, take the speech feature vector sequence as the expected output, and train the screening model by supervised learning; and

adjust the parameters of the screening model according to the training result, the parameters including connection weights and external bias values.

Embodiments of the present application further provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a speech recognition method including:

acquiring first voice data;

inputting the first voice data into a pre-built screening model for screening, and obtaining a speech segment output by the screening model with the set speech features filtered out, wherein the screening model is trained on voice data samples to which speech features without actual meaning have been added; and

recognizing the speech segment to obtain the corresponding text.

Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include installation media, such as CD-ROMs, floppy disks or tape devices; computer system memory or random access memory, such as DRAM, DDR RAM, SRAM, EDO RAM or Rambus RAM; non-volatile memory, such as flash memory or magnetic media (e.g., a hard disk or optical storage); and registers or other similar types of memory elements. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the first computer system in which the program is executed, or in a different, second computer system connected to the first through a network such as the Internet; the second computer system may provide program instructions to the first computer for execution. The term "storage medium" may include two or more storage media residing in different locations (e.g., in different computer systems connected by a network). The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.

Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present application, the computer-executable instructions are not limited to the speech recognition operations described above and may also perform related operations in the speech recognition method provided by any embodiment of the present application.

An embodiment of the present application provides an electronic device into which the speech recognition apparatus provided by the embodiments of the present application may be integrated. Electronic devices include smartphones, tablet computers, handheld game consoles, notebook computers and smart watches. FIG. 6 is a structural block diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 6, the electronic device may include a memory 601, a central processing unit (CPU) 602 (also called a processor, hereinafter CPU), a voice collector 606 and a touch screen 611. The touch screen 611 converts user operations into electrical signals input to the processor and displays visual output; the voice collector 606 collects the first voice data; the memory 601 stores a computer program; and the CPU 602 reads and executes the computer program stored in the memory 601. When executing the computer program, the CPU 602 implements the following steps: acquiring first voice data; inputting the first voice data into a pre-built screening model for screening, and obtaining a speech segment output by the screening model with the set speech features filtered out, wherein the screening model is trained on voice data samples to which speech features without actual meaning have been added; and recognizing the speech segment to obtain the corresponding text.

The electronic device further includes: a peripheral interface 603, an RF (radio frequency) circuit 605, a power management chip 608, an input/output (I/O) subsystem 609, other input/control devices 610, and an external port 604. These components communicate through one or more communication buses or signal lines 607.

It should be understood that the illustrated electronic device 600 is only one example of an electronic device, and that the electronic device 600 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different configuration of components. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal-processing and/or application-specific integrated circuits.

The electronic device integrated with the voice recognition apparatus provided by this embodiment is described in detail below, taking a mobile phone as an example.

Memory 601: the memory 601 can be accessed by the CPU 602, the peripheral interface 603, and so on. The memory 601 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.

Peripheral interface 603: the peripheral interface 603 can connect the input and output peripherals of the device to the CPU 602 and the memory 601.

I/O subsystem 609: the I/O subsystem 609 can connect the input/output peripherals on the device, such as the touch screen 611 and the other input/control devices 610, to the peripheral interface 603. The I/O subsystem 609 may include a display controller 6091 and one or more input controllers 6092 for controlling the other input/control devices 610. The one or more input controllers 6092 receive electrical signals from, or send electrical signals to, the other input/control devices 610, which may include physical buttons (push buttons, rocker buttons, etc.), dial pads, slide switches, joysticks, and click wheels. It is worth noting that an input controller 6092 may be connected to any of the following: a keyboard, an infrared port, a USB interface, or a pointing device such as a mouse.

The display controller 6091 in the I/O subsystem 609 receives electrical signals from, or sends electrical signals to, the touch screen 611. The touch screen 611 detects contact on its surface, and the display controller 6091 converts the detected contact into interaction with the user interface objects displayed on the touch screen 611, thereby realizing human-computer interaction. The user interface objects displayed on the touch screen 611 may be icons for running games, icons for connecting to a corresponding network, and the like. It is worth noting that the device may also include an optical mouse, which is a touch-sensitive surface that does not display visual output, or an extension of the touch-sensitive surface formed by the touch screen.

RF circuit 605: mainly used to establish communication between the mobile phone and the wireless network (i.e., the network side) and to receive and send data between the mobile phone and the wireless network, for example sending and receiving short messages and e-mails. Specifically, the RF circuit 605 receives and transmits RF signals, also known as electromagnetic signals: it converts electrical signals into electromagnetic signals, or electromagnetic signals into electrical signals, and communicates with communication networks and other devices through these electromagnetic signals. The RF circuit 605 may include known circuitry for performing these functions, including but not limited to an antenna system, an RF transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a CODEC (coder-decoder) chipset, a subscriber identity module (SIM), and so on.

Voice collector 606: includes a microphone, as well as wireless earphones such as Bluetooth earphones and infrared earphones; it is mainly used to receive audio data and convert the audio data into electrical signals.

Power management chip 608: used to supply power to, and manage the power of, the hardware connected to the CPU 602, the I/O subsystem, and the peripheral interface.

In the electronic device provided by this embodiment of the present application, the acquired first voice data is input into the screening model before voice recognition. Because the training samples of the screening model are voice data samples to which voice features without actual meaning have been added, running the first voice data through the screening model filters out the phonemes without actual meaning contained in the first voice data, yielding a speech segment free of such phonemes. The data volume of the speech segment output by the screening model is therefore smaller than that of the first voice data, and recognizing this reduced speech segment effectively reduces the amount of computation in the voice recognition process and improves recognition speed.
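To illustrate why the filtering shrinks the data, suppose the screening model emits a per-frame keep-probability over the feature vector sequence; dropping low-probability frames leaves the recognizer less to process. This is a minimal sketch under assumed interfaces — the 0.5 threshold and all names are illustrative, not the patent's specification.

```python
import numpy as np

def screen_frames(features: np.ndarray, keep_prob: np.ndarray,
                  threshold: float = 0.5) -> np.ndarray:
    """Drop frames the screening model judges to carry no actual meaning.

    features:  (T, D) feature vector sequence of the first voice data.
    keep_prob: (T,) per-frame keep-probabilities from the screening model.
    Returns a (T', D) sequence with T' <= T, so the recognizer processes
    less data than the original input.
    """
    mask = keep_prob >= threshold
    return features[mask]

# Toy example: 100 frames of 13-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 13))
probs = rng.uniform(size=100)
print(screen_frames(feats, probs).shape)  # fewer frames than (100, 13)
```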

The voice recognition apparatus, storage medium, and electronic device provided in the above embodiments can execute the voice recognition method provided by any embodiment of the present application, and have the corresponding functional modules and beneficial effects for executing that method. For technical details not described in full above, refer to the voice recognition method provided by any embodiment of the present application.

Note that the above are only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present application. Therefore, although the present application has been described in some detail through the above embodiments, it is not limited to them; it may include more equivalent embodiments without departing from its concept, and its scope is determined by the scope of the appended claims.

Claims (9)

1. A voice recognition method, characterized by comprising:
acquiring first voice data;
when the first voice data is detected, judging whether the user corresponding to the first voice data is a registered user;
determining a corresponding framing strategy according to the judgment result, and framing the first voice data according to the framing strategy to obtain at least two pieces of second voice data, wherein the framing strategy comprises the selection of a window function, the value of the frame length, and the value of the frame shift, and the framing strategy is associated with the language habits of different users;
extracting a first voice feature vector sequence corresponding to the second voice data, inputting the first voice feature vector sequence into a pre-built screening model for screening, and obtaining a speech segment, output by the screening model, with the set voice features filtered out, wherein the screening model is trained on voice data samples to which voice features without actual meaning have been added; and
recognizing the speech segment to obtain the corresponding text.

2. The method according to claim 1, wherein determining the corresponding framing strategy according to the judgment result comprises:
acquiring historical voice data of at least one registered user, and determining the speech rate and pause interval of each registered user according to the historical voice data; and
querying a preset set of framing strategies according to the speech rate and the pause interval, and determining the framing strategy corresponding to the registered user.

3. The method according to claim 1, wherein extracting the first voice feature vector sequence corresponding to the second voice data and inputting the first voice feature vector sequence into the pre-built screening model for screening comprises:
extracting the first voice feature vector sequence corresponding to the second voice data;
normalizing the first voice feature vector sequence and then inputting it into a pre-built recurrent neural network model for screening; and
acquiring the output of the recurrent neural network model, wherein the output is a second voice feature vector sequence with the phonemes without actual meaning filtered out.

4. The method according to claim 3, wherein recognizing the speech segment to obtain the corresponding text comprises:
judging whether the second voice feature vector sequence and a preset reference template are equal in length;
when they are not equal, calculating the frame matching distance between the second voice feature vector sequence and the reference template using a dynamic time warping algorithm; and
determining the pronunciation corresponding to the minimum frame matching distance, and taking the text matching that pronunciation as the voice recognition result.

5. The method according to claim 1, further comprising, after recognizing the speech segment to obtain the corresponding text:
judging whether the text is command information;
if so, executing the operation corresponding to the command information; and
if not, displaying the text in the user interface.

6. The method according to any one of claims 1 to 5, further comprising:
when a model update condition is met, acquiring sent short messages entered by voice and/or stored memos entered by voice;
acquiring the voice feature vector sequence of the voice data corresponding to the body content of the short messages and/or memos;
acquiring the historical voice data of the short messages and/or memos;
determining, according to the historical voice data, personalized phonemes without actual meaning and the positions where those phonemes appear;
adding the phonemes to the voice feature vector sequence as training samples according to those positions, taking the voice feature vector sequence as the expected output, and training the screening model in a supervised learning manner; and
adjusting the parameters of the screening model according to the training results, the parameters including connection weights and external bias values.

7. A voice recognition apparatus, characterized by comprising:
a voice acquisition module, configured to acquire first voice data;
a user judgment module, configured to judge, when the first voice data is detected, whether the user corresponding to the first voice data is a registered user;
a framing module, configured to determine a corresponding framing strategy according to the judgment result and frame the first voice data according to the framing strategy to obtain at least two pieces of second voice data, wherein the framing strategy comprises the selection of a window function, the value of the frame length, and the value of the frame shift, and the framing strategy is associated with the language habits of different users;
a voice screening module, configured to extract the first voice feature vector sequence corresponding to the second voice data, input the first voice feature vector sequence into a pre-built screening model for screening, and obtain a speech segment, output by the screening model, with the set voice features filtered out, wherein the screening model is trained on voice data samples to which voice features without actual meaning have been added; and
a voice recognition module, configured to recognize the speech segment to obtain the corresponding text.

8. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the voice recognition method according to any one of claims 1 to 6 is implemented.

9. An electronic device, comprising a voice collector for collecting first voice data, a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the voice recognition method according to any one of claims 1 to 6 when executing the computer program.
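The matching step in claim 4 — the dynamic time warping (DTW) frame matching distance — can be sketched as follows. This is a generic textbook DTW under assumed conventions (Euclidean local distance between frames, path-length normalization); the claims do not fix these details, and all names are illustrative.

```python
import numpy as np

def dtw_frame_distance(seq: np.ndarray, template: np.ndarray) -> float:
    """Frame matching distance between a (T1, D) second voice feature
    vector sequence and a (T2, D) reference template via DTW."""
    t1, t2 = len(seq), len(template)
    # Pairwise local distances between frames (Euclidean, by assumption).
    local = np.linalg.norm(seq[:, None, :] - template[None, :, :], axis=-1)
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],
                                                  acc[i, j - 1],
                                                  acc[i - 1, j - 1])
    # Normalize by a path-length proxy so templates of different lengths
    # compare fairly.
    return float(acc[t1, t2]) / (t1 + t2)

def best_match(seq: np.ndarray, templates: dict) -> str:
    """Return the pronunciation whose reference template yields the minimum
    frame matching distance; per claim 4, the text matching that
    pronunciation is the recognition result."""
    return min(templates, key=lambda word: dtw_frame_distance(seq, templates[word]))
```

When the input sequence and a template happen to be equal in length, claim 4 implies the warping step can be skipped and a direct frame-by-frame distance used instead.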

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910473083.9A CN110310623B (en) 2017-09-20 2017-09-20 Sample generation method, model training method, device, medium, and electronic apparatus
CN201710854125.4A CN107481718B (en) 2017-09-20 2017-09-20 Voice recognition method, voice recognition device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710854125.4A CN107481718B (en) 2017-09-20 2017-09-20 Voice recognition method, voice recognition device, storage medium and electronic equipment

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201910473083.9A Division CN110310623B (en) 2017-09-20 2017-09-20 Sample generation method, model training method, device, medium, and electronic apparatus

Publications (2)

Publication Number Publication Date
CN107481718A CN107481718A (en) 2017-12-15
CN107481718B true CN107481718B (en) 2019-07-05

Family

ID=60587053

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910473083.9A Expired - Fee Related CN110310623B (en) 2017-09-20 2017-09-20 Sample generation method, model training method, device, medium, and electronic apparatus
CN201710854125.4A Active CN107481718B (en) 2017-09-20 2017-09-20 Voice recognition method, voice recognition device, storage medium and electronic equipment

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910473083.9A Expired - Fee Related CN110310623B (en) 2017-09-20 2017-09-20 Sample generation method, model training method, device, medium, and electronic apparatus

Country Status (1)

Country Link
CN (2) CN110310623B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176387A1 (en) * 2017-03-31 2018-10-04 深圳市红昌机电设备有限公司 Voice control method and system for winding-type coil winder
CN108717851B (en) * 2018-03-28 2021-04-06 深圳市三诺数字科技有限公司 Voice recognition method and device
CN108847221B (en) * 2018-06-19 2021-06-15 Oppo广东移动通信有限公司 Speech recognition method, device, storage medium and electronic device
CN110752973B (en) * 2018-07-24 2020-12-25 Tcl科技集团股份有限公司 Terminal equipment control method and device and terminal equipment
CN109003619A (en) * 2018-07-24 2018-12-14 Oppo(重庆)智能科技有限公司 Voice data generation method and relevant apparatus
CN108922531B (en) * 2018-07-26 2020-10-27 腾讯科技(北京)有限公司 Slot position identification method and device, electronic equipment and storage medium
CN110853631A (en) * 2018-08-02 2020-02-28 珠海格力电器股份有限公司 Voice recognition method and device for smart home
CN109145124B (en) * 2018-08-16 2022-02-25 格力电器(武汉)有限公司 Information storage method and device, storage medium and electronic device
CN109192211A (en) * 2018-10-29 2019-01-11 珠海格力电器股份有限公司 Method, device and equipment for recognizing voice signal
CN109448707A (en) * 2018-12-18 2019-03-08 北京嘉楠捷思信息技术有限公司 Speech recognition method and device, device and medium
CN109637524A (en) * 2019-01-18 2019-04-16 徐州工业职业技术学院 A kind of artificial intelligence exchange method and artificial intelligence interactive device
CN110265001B (en) * 2019-05-06 2023-06-23 平安科技(深圳)有限公司 Corpus screening method and device for speech recognition training and computer equipment
CN110288988A (en) * 2019-05-16 2019-09-27 平安科技(深圳)有限公司 Target data screening technique, device and storage medium
CN111862946B (en) * 2019-05-17 2024-04-19 北京嘀嘀无限科技发展有限公司 Order processing method and device, electronic equipment and storage medium
CN110288976B (en) * 2019-06-21 2021-09-07 北京声智科技有限公司 Data screening method and device and intelligent sound box
CN112329457B (en) * 2019-07-17 2024-07-23 北京声智科技有限公司 Input voice recognition method and related equipment
WO2021134546A1 (en) * 2019-12-31 2021-07-08 李庆远 Input method for increasing speech recognition rate
CN111292766B (en) * 2020-02-07 2023-08-08 抖音视界有限公司 Method, apparatus, electronic device and medium for generating voice samples
CN113516994B (en) * 2021-04-07 2022-04-26 北京大学深圳研究院 Real-time voice recognition method, device, equipment and medium
CN113422875B (en) * 2021-06-22 2022-11-25 中国银行股份有限公司 Voice seat response method, device, equipment and storage medium
CN113887176A (en) * 2021-09-19 2022-01-04 上海明我信息技术有限公司 A method for generating meeting minutes
CN114550716A (en) * 2022-02-10 2022-05-27 北京沃东天骏信息技术有限公司 A method and device for controlling equipment based on sound signal
CN115457961B (en) * 2022-11-10 2023-04-07 广州小鹏汽车科技有限公司 Voice interaction method, vehicle, server, system and storage medium
CN119541517A (en) * 2023-08-29 2025-02-28 华为技术有限公司 Method and electronic device for processing sound signal
CN117672201A (en) * 2023-12-05 2024-03-08 舜泰汽车有限公司 Unmanned voice recognition's of agricultural machinery control system
CN119252412A (en) * 2024-12-04 2025-01-03 温州市中西医结合医院 Ultrasonic examination report entry system and method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645064A (en) * 2008-12-16 2010-02-10 中国科学院声学研究所 Superficial natural spoken language understanding system and method thereof
CN102543071A (en) * 2011-12-16 2012-07-04 安徽科大讯飞信息科技股份有限公司 Voice recognition system and method used for mobile equipment
CN104143326A (en) * 2013-12-03 2014-11-12 腾讯科技(深圳)有限公司 Voice command recognition method and device
CN105096941A (en) * 2015-09-02 2015-11-25 百度在线网络技术(北京)有限公司 Voice recognition method and device
CN106875936A (en) * 2017-04-18 2017-06-20 广州视源电子科技股份有限公司 Voice recognition method and device
CN107146605A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 A kind of audio recognition method, device and electronic equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604520A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 Spoken language voice recognition method based on statistical model and syntax rule
CN103366740B (en) * 2012-03-27 2016-12-14 联想(北京)有限公司 Voice command identification method and device
CN103514878A (en) * 2012-06-27 2014-01-15 北京百度网讯科技有限公司 Acoustic modeling method and device, and speech recognition method and device
CN103544952A (en) * 2012-07-12 2014-01-29 百度在线网络技术(北京)有限公司 Voice self-adaption method, device and system
CN103680495B (en) * 2012-09-26 2017-05-03 中国移动通信集团公司 Speech recognition model training method, speech recognition model training device and speech recognition terminal
US9275638B2 (en) * 2013-03-12 2016-03-01 Google Technology Holdings LLC Method and apparatus for training a voice recognition model database
CN104157286B (en) * 2014-07-31 2017-12-29 深圳市金立通信设备有限公司 A kind of phrasal acquisition methods and device
CN104134439B (en) * 2014-07-31 2018-01-12 深圳市金立通信设备有限公司 A kind of phrasal acquisition methods, apparatus and system
CN106601238A (en) * 2015-10-14 2017-04-26 阿里巴巴集团控股有限公司 Application operation processing method and application operation processing device

Also Published As

Publication number Publication date
CN110310623B (en) 2021-12-28
CN110310623A (en) 2019-10-08
CN107481718A (en) 2017-12-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 523860 No. 18, Wu Sha Beach Road, Changan Town, Dongguan, Guangdong

Applicant after: OPPO Guangdong Mobile Communications Co., Ltd.

Address before: 523860 No. 18, Wu Sha Beach Road, Changan Town, Dongguan, Guangdong

Applicant before: Guangdong OPPO Mobile Communications Co., Ltd.

GR01 Patent grant
GR01 Patent grant