
CN108198547B - Voice endpoint detection method, apparatus, computer equipment and storage medium - Google Patents


Info

Publication number
CN108198547B
CN108198547B (granted publication of application CN201810048223.3A)
Authority
CN
China
Prior art keywords
feature vector
voice
noisy speech
spectral
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810048223.3A
Other languages
Chinese (zh)
Other versions
CN108198547A (en)
Inventor
黄石磊
刘轶
王昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN201810048223.3A
Publication of CN108198547A
Application granted
Publication of CN108198547B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/032 Quantisation or dequantisation of spectral components
    • G10L 19/038 Vector quantisation, e.g. TwinVQ audio
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a voice endpoint detection method, apparatus, computer equipment and storage medium. The method comprises: acquiring a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal; converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors; parsing the speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors to obtain a corresponding speech signal; and determining a start point and an end point of the speech signal according to the time sequence of the speech signal. The method can effectively improve the accuracy of voice endpoint detection.

Description

Voice endpoint detection method, apparatus, computer equipment and storage medium

Technical Field

The present application relates to the technical field of signal processing, and in particular to a voice endpoint detection method, apparatus, computer equipment and storage medium.

Background

With the continuous development of speech technology, voice endpoint detection has come to occupy a very important position in speech recognition. Voice endpoint detection locates the start point and end point of the speech portion within a continuous noisy signal, so that the speech can be recognized effectively.

There are two traditional approaches to voice endpoint detection. The first exploits the differing time-domain and frequency-domain characteristics of speech and noise: features are extracted from each segment of the signal and compared against preset thresholds. However, this approach only works under stationary noise conditions; its noise robustness is poor and it struggles to distinguish clean speech from noise, so the accuracy of endpoint detection is low. The second approach is neural-network based and uses a trained model to detect endpoints in the speech signal. However, the input vectors of most such models contain only features of the noisy speech, which again yields poor noise robustness and low detection accuracy. How to effectively improve the accuracy of voice endpoint detection is therefore a technical problem that currently needs to be solved.

Summary of the Invention

In view of the above, it is necessary to provide a voice endpoint detection method, apparatus, computer equipment and storage medium that can effectively improve the accuracy of voice endpoint detection.

A voice endpoint detection method, comprising:

acquiring a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;

converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;

obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors;

parsing the speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors to obtain a corresponding speech signal; and

determining a start point and an end point of the speech signal according to the time sequence of the speech signal.

In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further comprises:

converting the noisy speech signal into a noisy speech spectrum; and

performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further comprises:

converting the noisy speech signal into a noisy speech spectrum, and computing a noisy speech magnitude spectrum from the noisy speech spectrum;

performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech magnitude spectrum to obtain a noise magnitude spectrum;

estimating the speech magnitude spectrum of the clean speech signal from the noisy speech magnitude spectrum and the noise magnitude spectrum; and

generating the spectral features corresponding to the noisy speech signal from the noisy speech magnitude spectrum, the noise magnitude spectrum and the speech magnitude spectrum.
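The steps above can be sketched in code. This is a minimal illustration, not the patent's exact method: the noise magnitude spectrum is estimated here simply as the average magnitude of the first few (assumed speech-free) frames, standing in for the patent's dynamic noise estimation, and the clean speech magnitude is estimated by plain spectral subtraction. The function name and parameter values are assumptions for illustration.

```python
import numpy as np

def spectral_features(noisy_spec, n_noise_frames=10, floor=1e-3):
    """Build per-frame spectral features from a (frames x bins) noisy spectrum."""
    noisy_mag = np.abs(noisy_spec)                       # noisy speech magnitude spectrum
    # Crude noise estimate: average magnitude of the first few frames,
    # assumed to contain no speech (stand-in for dynamic noise estimation)
    noise_mag = noisy_mag[:n_noise_frames].mean(axis=0)  # noise magnitude spectrum
    # Spectral subtraction with a small floor -> estimated clean speech magnitude
    clean_mag = np.maximum(noisy_mag - noise_mag, floor * noisy_mag)
    # Feature vector per frame: the three magnitude spectra stacked side by side
    return np.concatenate(
        [noisy_mag, np.broadcast_to(noise_mag, noisy_mag.shape), clean_mag], axis=1)

spec = np.ones((20, 5)) * (1 + 1j)   # toy "spectrum": 20 frames, 5 frequency bins
feats = spectral_features(spec)
print(feats.shape)  # (20, 15)
```

Stacking the three magnitude spectra per frame mirrors the claim that the spectral feature is generated from the noisy speech, noise, and speech magnitude spectra together.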

In one embodiment, converting the acoustic features and spectral features comprises:

extracting a preset number of frames before and after the current frame from the acoustic features and the spectral features;

computing a mean vector and/or a variance vector for the current frame using the preset number of frames before and after it; and

performing a logarithmic-domain conversion on the acoustic features and spectral features after the mean vector and/or variance vector of the current frame has been computed, to obtain the converted acoustic feature vectors and spectral feature vectors.
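A minimal sketch of this conversion follows, assuming a symmetric context window (`context` frames on each side stands in for the "preset number of frames") with edge padding at the boundaries; these choices are illustrative, not specified by the patent.

```python
import numpy as np

def contextual_vectors(features, context=5, eps=1e-8):
    """Per-frame mean/variance over a window of `context` frames on each side,
    followed by a logarithmic-domain conversion."""
    n, d = features.shape
    # Repeat the first/last frame at the boundaries so every frame has full context
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    windows = np.stack([padded[i:i + 2 * context + 1] for i in range(n)])
    mean_vec = windows.mean(axis=1)   # mean vector for each current frame
    var_vec = windows.var(axis=1)     # variance vector for each current frame
    # Log-domain conversion; abs() and eps keep the log defined and finite
    return np.log(np.abs(mean_vec) + eps), np.log(var_vec + eps)

feats = np.arange(20.0).reshape(10, 2)   # toy feature matrix: 10 frames, 2 dims
log_mean, log_var = contextual_vectors(feats)
print(log_mean.shape, log_var.shape)  # (10, 2) (10, 2)
```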

In one embodiment, before the step of obtaining the classifier, the method further comprises:

acquiring noisy speech data labeled with speech categories, and training on the noisy speech data to obtain an initial classifier;

acquiring a first verification set comprising a plurality of first speech data;

inputting the plurality of first speech data into the classifier to obtain category probabilities corresponding to the plurality of first speech data;

screening the category probabilities of the plurality of first speech data, and adding category labels to the selected first speech data to obtain a labeled verification set;

training with the labeled verification set and the training set to obtain a verification classifier;

acquiring a second verification set comprising a plurality of second speech data;

inputting the plurality of second speech data into the verification classifier to obtain category probabilities corresponding to the plurality of second speech data; and

obtaining the required classifier when the category probabilities of the plurality of second speech data reach a preset probability value.
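The screening step in this training loop amounts to confidence-based pseudo-labeling: keep only the verification samples the classifier is sure about and assign them their most likely category. The patent does not state how the probabilities are screened, so the confidence threshold below is an assumption for illustration.

```python
import numpy as np

def screen_pseudo_labels(probs, threshold=0.9):
    """Screen per-sample category probabilities from a verification set.

    Returns the indices of confidently classified samples and their assigned
    category labels. `threshold` is an assumed confidence cutoff.
    """
    probs = np.asarray(probs)
    confident = probs.max(axis=1) >= threshold   # which samples pass screening
    labels = probs.argmax(axis=1)                # label = most probable category
    return np.flatnonzero(confident), labels[confident]

# Three verification samples with two-class probabilities
idx, labels = screen_pseudo_labels([[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]])
print(idx, labels)  # [0 2] [0 1]
```

The selected samples and their labels would then be merged with the original training set to train the verification classifier.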

In one embodiment, the step of classifying the acoustic feature vectors and spectral feature vectors using the classifier comprises:

using the acoustic feature vectors and spectral feature vectors as input to the classifier to obtain decision values corresponding to the acoustic feature vectors and spectral feature vectors;

adding a speech label to an acoustic feature vector or spectral feature vector when its decision value equals a first threshold; and

adding a non-speech label to an acoustic feature vector or spectral feature vector when its decision value equals a second threshold.

A voice endpoint detection apparatus, comprising:

an extraction module, configured to acquire a noisy speech signal and extract acoustic features and spectral features corresponding to the noisy speech signal;

a conversion module, configured to convert the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;

a classification module, configured to obtain a classifier and input the acoustic feature vectors and spectral feature vectors into the classifier to obtain speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors; and

a parsing module, configured to parse the speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors to obtain a corresponding speech signal, and to determine the start point and end point of the speech signal according to the time sequence of the speech signal.

In one embodiment, the conversion module is further configured to extract a preset number of frames before and after the current frame from the acoustic features and spectral features; to compute a mean vector and/or variance vector for the current frame using those frames; and to perform a logarithmic-domain conversion on the acoustic features and spectral features after the mean vector and/or variance vector has been computed, to obtain the converted acoustic feature vectors and spectral feature vectors.

A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps:

acquiring a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;

converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;

obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors;

parsing the speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors to obtain a corresponding speech signal; and

determining a start point and an end point of the speech signal according to the time sequence of the speech signal.

A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:

acquiring a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;

converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;

obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors;

parsing the speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors to obtain a corresponding speech signal; and

determining a start point and an end point of the speech signal according to the time sequence of the speech signal.

With the above voice endpoint detection method, apparatus, computer equipment and storage medium, a noisy speech signal is acquired and its corresponding acoustic and spectral features are extracted; converting these features yields the corresponding acoustic and spectral feature vectors. A classifier is obtained, and inputting the acoustic and spectral feature vectors into it yields speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors, so the feature vectors can be classified effectively and speech distinguished from non-speech. The speech-labeled acoustic and spectral feature vectors are parsed to obtain the corresponding speech signal, and the time sequence of the speech signal determines its start point and end point, so the start and end points of the noisy speech signal are identified accurately and the accuracy of voice endpoint detection is effectively improved.

Brief Description of the Drawings

FIG. 1 is a flowchart of a voice endpoint detection method in one embodiment;

FIG. 2 is an internal structure diagram of a voice endpoint detection apparatus in one embodiment;

FIG. 3 is an internal structure diagram of a computer device in one embodiment.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the application and do not limit it. It will be understood that terms such as "first" and "second" may be used herein to describe various elements, but these elements are not limited by those terms; the terms serve only to distinguish one element from another.

In one embodiment, as shown in FIG. 1, a voice endpoint detection method is provided. The method is described here as applied to a terminal by way of example, and comprises the following steps.

Step 102: acquire a noisy speech signal, and extract the acoustic features and spectral features corresponding to the noisy speech signal.

In practice, a collected speech signal usually contains noise of some intensity. When the noise is strong, it noticeably degrades speech applications, for example lowering speech recognition efficiency and endpoint detection accuracy.

The terminal can acquire the voice input by the user through a voice input device. The terminal may be a smart phone, tablet computer, notebook computer, desktop computer or similar device, and further includes a voice input device such as a microphone or other device capable of recording speech. The voice captured by the terminal is usually a noisy speech signal, such as call audio, a recording, or a voice command input by the user. After acquiring the noisy speech signal, the terminal extracts the corresponding acoustic features and spectral features. The acoustic features may include characteristics such as unvoiced and voiced sounds, vowels and consonants of the noisy speech signal; the spectral features may include the vibration frequency and amplitude of the noisy speech signal, as well as characteristics such as its loudness and timbre.

Specifically, after acquiring the noisy speech signal, the terminal applies windowing and framing to it. For example, a Hanning window may be used to divide the noisy speech signal into frames of 10-30 ms (milliseconds) each, with a frame shift of 10 ms, yielding a multi-frame noisy speech signal. After windowing and framing, the terminal applies a fast Fourier transform to the windowed frames to obtain the spectrum of the noisy speech signal, from which it can extract the corresponding acoustic features and spectral features.
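The windowing, framing, and FFT steps just described can be sketched as follows. The 25 ms frame length and 512-point FFT are illustrative values within the ranges mentioned above, not values fixed by the patent.

```python
import numpy as np

def frame_spectrum(signal, sample_rate=16000, frame_ms=25, shift_ms=10, n_fft=512):
    """Split a noisy speech signal into overlapping frames, apply a Hanning
    window to each frame, and take its FFT to obtain the per-frame spectrum."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 10 ms frame shift
    window = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    frames = np.stack([signal[i * shift: i * shift + frame_len] * window
                       for i in range(n_frames)])
    # One-sided FFT of each windowed frame -> noisy speech spectrum
    return np.fft.rfft(frames, n=n_fft, axis=1)

# Example: 1 second of synthetic noise at 16 kHz
rng = np.random.default_rng(0)
spec = frame_spectrum(rng.standard_normal(16000))
print(spec.shape)  # (98, 257): 98 frames, n_fft // 2 + 1 frequency bins
```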

Step 104: convert the acoustic features and spectral features to obtain the corresponding acoustic feature vectors and spectral feature vectors.

After extracting the acoustic features and spectral features corresponding to the noisy speech signal, the terminal converts them: the acoustic features are converted into corresponding acoustic feature vectors, and the spectral features into corresponding spectral feature vectors.

Step 106: obtain a classifier, input the acoustic feature vectors and spectral feature vectors into the classifier, and obtain speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors.

The terminal obtains a classifier that was trained before endpoint detection is performed. By adding speech labels and non-speech labels to the input acoustic and spectral feature vectors, the classifier divides them into a speech class and a non-speech class. The terminal inputs the acoustic and spectral feature vectors corresponding to the noisy speech into the classifier, which classifies them: when an input acoustic or spectral feature vector belongs to the speech category, a speech label is added to it; when it belongs to the non-speech category, a non-speech label is added. Speech and non-speech can thus be identified accurately. After the classifier has classified the acoustic and spectral feature vectors, the terminal obtains the speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors.

Further, with the acoustic and spectral feature vectors as classifier input, the terminal can also obtain decision values corresponding to them, and can add a speech or non-speech label to each vector according to its decision value, thereby classifying the acoustic and spectral feature vectors accurately.

Step 108: parse the speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors to obtain the speech-labeled speech signal.

Step 110: determine the start point and end point of the speech signal according to its speech labels and time sequence.

After classifying the acoustic and spectral feature vectors, the terminal needs to parse the speech-labeled acoustic feature vectors and speech-labeled spectral feature vectors. Specifically, the terminal parses them to obtain the spectrum corresponding to the speech-labeled acoustic and spectral features, then converts that spectrum into the corresponding speech signal according to the time sequence of the noisy speech signal, so the corresponding speech signal is obtained by parsing.

The noisy speech signal is ordered in time, and the time sequence of the speech-labeled speech signal still corresponds to that of the noisy speech signal. Having parsed the speech-labeled acoustic and spectral feature vectors into the corresponding speech-labeled speech signal, the terminal can therefore determine the start point and end point of the noisy speech signal from the speech labels and time sequence of the speech signal.

For example, after the classifier processes the input acoustic and spectral feature vectors, the resulting decision value may lie between 0 and 1. When the decision value is 1, the terminal adds a speech label to the acoustic or spectral feature vector; when it is 0, the terminal adds a non-speech label. The acoustic and spectral feature vectors are thus classified accurately. After parsing the labeled vectors, the terminal obtains the speech-labeled speech signal. In its time sequence, the first frame carrying a speech label marks the start point of the noisy speech signal, and the last frame carrying a speech label marks its end point. Further, a start point can be determined from a jump of the decision value from 0 to 1, and an end point from a jump from 1 to 0. The start and end points of the noisy speech signal can thus be determined accurately.
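The decision-value jumps described above reduce to a simple scan over the per-frame decisions in time order; a minimal sketch (returning frame indices, which a caller could multiply by the frame shift to get times):

```python
def endpoints_from_decisions(decisions):
    """Given per-frame decision values in time order (1 = speech, 0 = non-speech),
    return (start, end) frame-index pairs for each speech segment."""
    segments, start = [], None
    for i, d in enumerate(decisions):
        if d == 1 and start is None:        # 0 -> 1 jump: speech starts
            start = i
        elif d == 0 and start is not None:  # 1 -> 0 jump: speech ends
            segments.append((start, i - 1))
            start = None
    if start is not None:                   # speech ran through the final frame
        segments.append((start, len(decisions) - 1))
    return segments

print(endpoints_from_decisions([0, 0, 1, 1, 1, 0, 1, 1]))  # [(2, 4), (6, 7)]
```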

In this embodiment, after acquiring the noisy speech signal, the terminal extracts the corresponding acoustic and spectral features and converts them into corresponding acoustic and spectral feature vectors. Inputting these vectors into the classifier yields speech-labeled acoustic and spectral feature vectors, so the vectors can be classified effectively and speech distinguished from non-speech. The terminal parses the speech-labeled acoustic and spectral feature vectors to obtain the corresponding speech signal, and determines its start and end points from its time sequence, so the start and end points of the noisy speech signal are identified accurately and the accuracy of voice endpoint detection is effectively improved.

In one embodiment, before the acoustic and spectral features of the noisy speech signal are extracted, the method further includes: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain and/or frequency-domain and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In phonetics, speech can be divided into acoustic classes such as vowels, consonants, unvoiced sounds, voiced sounds, and silence. After acquiring the noisy speech signal, the terminal windows the signal and splits it into frames. For example, a Hanning window can be used to divide the noisy speech signal into frames of 10-30 ms (milliseconds), with a frame shift of 10 ms, so the signal becomes a sequence of noisy speech frames. After windowing and framing, the terminal applies a fast Fourier transform (FFT) to the windowed frames, which yields the spectrum of the noisy speech signal.
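The windowing, framing, and FFT steps above can be sketched in NumPy as follows; the 25 ms frame length, 16 kHz sampling rate, and function name are assumptions chosen for the example, not values fixed by the patent:

```python
import numpy as np

def frame_spectrum(signal, sr=16000, frame_ms=25, shift_ms=10):
    """Split a signal into Hanning-windowed frames; return per-frame FFT magnitudes."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 160 samples = 10 ms frame shift
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift:i * shift + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len//2 + 1)

spec = frame_spectrum(np.random.randn(16000))   # 1 s of noise at 16 kHz
print(spec.shape)  # (98, 201)
```

Each row of the result is the magnitude spectrum of one 10-30 ms frame, which is the input the later feature-extraction steps operate on.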

Further, the terminal may perform time-domain and/or frequency-domain and/or transform-domain analysis on the noisy speech spectrum, so as to obtain the acoustic features corresponding to the noisy speech signal.

For example, the terminal may extract the acoustic features of the noisy speech signal using MFCCs (Mel-Frequency Cepstrum Coefficients). After windowing and framing the noisy speech signal, the terminal converts it into its spectrum. The terminal then transforms the spectrum of the noisy speech signal into the noisy speech cepstrum, performs cepstral analysis on it, and applies a discrete cosine transform to obtain the acoustic features of each frame, thereby obtaining effective acoustic features of the noisy speech.
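A bare-bones version of the MFCC pipeline just described (mel filterbank, log energies, DCT) might look like the following. The filter count, coefficient count, and helper names are illustrative assumptions, not values specified by the patent:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filterbank, shape (n_filters, n_fft//2 + 1)."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                    # rising edge of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                    # falling edge
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(power_spec, sr, n_fft, n_ceps=13):
    """MFCCs from a per-frame power spectrum of shape (n_frames, n_fft//2 + 1)."""
    fb = mel_filterbank(26, n_fft, sr)
    mel_energy = np.maximum(power_spec @ fb.T, 1e-10)   # avoid log(0)
    return dct(np.log(mel_energy), type=2, axis=1, norm='ortho')[:, :n_ceps]

spec = np.abs(np.fft.rfft(np.random.randn(10, 400), axis=1)) ** 2
print(mfcc(spec, sr=16000, n_fft=400).shape)  # (10, 13)
```

Keeping only the first 13 DCT coefficients is the conventional choice; the patent does not fix this number.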

In one embodiment, before the acoustic and spectral features of the noisy speech signal are extracted, the method further includes: converting the noisy speech signal into a noisy speech spectrum and computing the noisy speech magnitude spectrum from it; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech magnitude spectrum to obtain a noise magnitude spectrum; estimating the speech magnitude spectrum of the clean speech signal from the noisy speech magnitude spectrum and the noise magnitude spectrum; and generating the spectral features corresponding to the noisy speech signal from the noisy speech magnitude spectrum, the noise magnitude spectrum, and the speech magnitude spectrum.

After acquiring the noisy speech signal, the terminal windows the signal and splits it into frames. For example, a Hanning window can be used to divide the noisy speech signal into frames of 10-30 ms (milliseconds), with a frame shift of 10 ms, so the signal becomes a sequence of noisy speech frames. After windowing and framing, the terminal applies a fast Fourier transform to the windowed frames, obtaining the spectrum of the noisy speech signal. Here, the spectrum of the noisy speech signal may be the energy magnitude spectrum of the noisy speech after the fast Fourier transform.

Further, the terminal can compute the noisy speech magnitude spectrum and the noisy speech phase spectrum from the noisy speech spectrum, and performs dynamic noise estimation on the noisy speech spectrum according to the magnitude and phase spectra. Specifically, the terminal can use the improved minima-controlled recursive averaging (IMCRA) algorithm for the dynamic noise estimation, thereby obtaining the noise magnitude spectrum. The terminal then estimates the speech magnitude spectrum of the speech signal from the noisy speech magnitude spectrum, the noisy speech phase spectrum, and the noise magnitude spectrum. For example, the terminal may estimate the speech magnitude spectrum with the minimum mean-square error log-spectral amplitude estimator.

The terminal generates the spectral features corresponding to the noisy speech signal from the estimated noisy speech magnitude spectrum, the noise magnitude spectrum, and the speech magnitude spectrum of the clean speech signal; in this way the terminal can effectively extract the spectral features of the noisy speech signal.
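The noise-estimation and clean-magnitude steps can be illustrated with a much simpler recursive tracker than IMCRA. Everything here is an assumed stand-in for the algorithms named above: the minimum-follower update replaces IMCRA, and the spectral-subtraction estimate of the clean magnitude replaces the log-spectral amplitude estimator:

```python
import numpy as np

def noise_and_snr_features(noisy_mag, alpha=0.95, init_frames=5):
    """Track a noise magnitude spectrum and derive spectral features.

    noisy_mag: (n_frames, n_bins) magnitude spectrogram.
    Returns (noise_track, posterior_snr, speech_mag). The tracker is a
    recursive minimum-follower, a crude stand-in for IMCRA; the clean
    magnitude is a spectral-subtraction estimate, not MMSE-LSA.
    """
    noise = noisy_mag[:init_frames].mean(axis=0)   # seed from leading frames
    noise_track = np.empty_like(noisy_mag)
    for t, frame in enumerate(noisy_mag):
        # lower the estimate where the frame dips below it (noise-dominated bins)
        noise = np.where(frame < noise,
                         alpha * noise + (1 - alpha) * frame, noise)
        noise_track[t] = noise
    posterior_snr = noisy_mag ** 2 / np.maximum(noise_track ** 2, 1e-10)
    speech_mag = np.sqrt(np.maximum(noisy_mag ** 2 - noise_track ** 2, 0.0))
    return noise_track, posterior_snr, speech_mag
```

Stacking the three returned spectra per frame gives a feature of the kind the embodiment describes: noisy magnitude, estimated noise magnitude, and estimated clean-speech magnitude.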

In one embodiment, converting the acoustic and spectral features includes: extracting a preset number of frames before and after the current frame of the acoustic and spectral features; computing the mean vector and/or variance vector corresponding to the current frame from those surrounding frames; and performing log-domain conversion on the acoustic and spectral features after the mean and/or variance vectors are computed, obtaining the converted acoustic and spectral feature vectors.

After acquiring the noisy speech signal, the terminal windows the signal and splits it into frames, so the signal becomes a sequence of noisy speech frames. The terminal then applies a fast Fourier transform to the windowed frames, obtaining the spectrum of the noisy speech signal, from which it can extract the acoustic and spectral features corresponding to the noisy speech signal.

After extracting the acoustic and spectral features corresponding to the noisy speech signal, the terminal converts them into acoustic and spectral feature vectors. The terminal extracts a preset number of frames before and after the current frame of each feature vector and uses them to compute the mean vector or variance vector corresponding to the current frame, thereby smoothing the acoustic and spectral features and obtaining the smoothed acoustic and spectral feature vectors.

For example, the terminal may take the five frames before and the five frames after the current frame of an acoustic or spectral feature, eleven frames of noisy speech spectrum in total. Averaging these eleven frames gives the mean vector of the current frame. Specifically, the terminal may use a filter bank in which each filter is triangular, the triangular window being the filter window; within the noisy speech spectrum range these filters may have equal bandwidth. The terminal uses the filter bank to compute the mean vector of the current frame, thereby smoothing the noisy speech spectrum and obtaining the smoothed acoustic and spectral feature vectors.

After smoothing the noisy speech spectrum, the terminal converts the smoothed acoustic and spectral feature vectors into the log domain, obtaining the converted acoustic and spectral feature vectors. Specifically, the terminal can compute the log energy of the acoustic and spectral features output by each filter, which gives the log-domain acoustic feature vector and log-domain spectral feature vector, so the converted feature vectors are obtained effectively.
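The ±5-frame mean smoothing followed by log-domain conversion can be sketched as below; the edge padding and the small floor constant inside the log are assumptions made so the sketch is well defined at sequence boundaries:

```python
import numpy as np

def smooth_and_log(features, context=5):
    """Mean-smooth each frame over +/- `context` neighbours, then take logs."""
    n = len(features)
    padded = np.pad(features, ((context, context), (0, 0)), mode='edge')
    smoothed = np.stack([padded[i:i + 2 * context + 1].mean(axis=0)
                         for i in range(n)])   # e.g. an 11-frame mean per frame
    return np.log(np.maximum(smoothed, 1e-10))  # log-domain conversion
```

With the default `context=5`, each output frame is exactly the eleven-frame mean vector described above, taken into the log domain.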

In one embodiment, before the step of obtaining the classifier, the method further includes: obtaining noisy speech data to which speech-category labels have been added, and training on the noisy speech data to obtain an initial classifier; obtaining a first validation set containing a plurality of first speech data items; inputting the first speech data into the initial classifier to obtain their corresponding class probabilities; screening those class probabilities and adding category labels to the selected first speech data, obtaining a labeled validation set; training on the labeled validation set together with the training set to obtain a validation classifier; obtaining a second validation set containing a plurality of second speech data items; inputting the second speech data into the validation classifier to obtain their corresponding class probabilities; and, when the class probabilities of the second speech data reach a preset probability value, obtaining the required classifier.

Before the classifier is obtained, it must be trained on a large amount of noisy speech data, which the terminal may obtain from a database or from the Internet. When training the classifier, the noisy speech data is first labeled manually, and the classifier is obtained by training on the manually labeled noisy speech data.

Specifically, after extracting the acoustic and spectral features corresponding to the noisy speech data, the terminal converts them into the corresponding acoustic and spectral feature vectors. An operator can annotate the acoustic and spectral feature vectors according to a category lookup table, adding a speech or non-speech tag to each frame of the noisy speech signal. The terminal then obtains the noisy speech data annotated by the operator according to the category lookup table.

The terminal combines the labeled acoustic and spectral feature vectors and feeds them to the input layer of a bidirectional long short-term memory (BLSTM) neural network. The nonlinear hidden layers of the network learn new features from the input vectors and compute the class of each input vector through activation functions. Specifically, each LSTM unit has three gates: a forget gate, a candidate gate, and an output gate. The forget gate can be computed as:

f_t = σ(W_f·h_{t-1} + U_f·x_t + b_f)

where σ denotes the activation function, W_f denotes the forget-gate weight matrix, U_f is the weight matrix between the forget gate's input layer and hidden layer, and b_f denotes the forget-gate bias. The forget gate linearly combines the previous hidden-layer output h_{t-1} with the current input x_t and then compresses the output value to between 0 and 1 with the activation function. The closer the output value is to 1, the more information the memory cell retains; conversely, the closer it is to 0, the less information it retains.

The candidate gate computes the cell state for the current input; the formula can be:

C_t = tanh(W_c·h_{t-1} + U_c·x_t + b_c)

where C_t denotes the cell state for the current input; the tanh activation function normalizes the output value to between -1 and 1.

The output gate controls how much memory information is passed on for the next network layer's update; the formula can be expressed as:

O_t = σ(W_o·h_{t-1} + U_o·x_t + b_o)

where O_t denotes the amount of memory information used for the next network layer's update.

The final output of the LSTM unit can then be computed; the formula can be expressed as:

h_t = O_t × tanh(C_t)

The final acoustic or spectral feature vector is obtained from the forward and backward computations; the formula can be expressed as:

h_i = [h_i^f, h_i^b]

where h_i^f is the forward output vector, h_i^b is the backward output vector, and h_i is the final class-labeled acoustic or spectral feature vector.
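The gate equations above can be checked against a small NumPy implementation. This sketch uses the standard four-gate LSTM cell (adding the input gate, which the three-gate description above leaves implicit in the candidate computation) and concatenates forward and backward passes as in the last formula; all parameter shapes and names are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W (4H,H), U (4H,D), b (4H,) stack the gates i, f, o, c."""
    H = h_prev.shape[0]
    z = W @ h_prev + U @ x_t + b
    i_t = sigmoid(z[:H])                    # input gate
    f_t = sigmoid(z[H:2 * H])               # forget gate, output in (0, 1)
    o_t = sigmoid(z[2 * H:3 * H])           # output gate
    c_cand = np.tanh(z[3 * H:])             # candidate state, in (-1, 1)
    c_t = f_t * c_prev + i_t * c_cand       # new cell state
    h_t = o_t * np.tanh(c_t)                # h_t = O_t x tanh(C_t)
    return h_t, c_t

def bilstm(xs, params_fwd, params_bwd, H):
    """Per-frame concatenation of forward and backward outputs: h_i = [h_i^f, h_i^b]."""
    def run(seq, params):
        h, c = np.zeros(H), np.zeros(H)
        outs = []
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
            outs.append(h)
        return outs
    fwd = run(xs, params_fwd)
    bwd = run(xs[::-1], params_bwd)[::-1]   # backward pass, re-aligned in time
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Because o_t lies in (0, 1) and tanh in (-1, 1), every component of h_t is bounded in magnitude by 1, which matches the squashing behaviour the text describes.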

Further, the output layer of the LSTM can compute the value of the output unit C_i according to a preset decision function, where C_i takes a value between 0 and 1: 1 represents the speech class and 0 represents the non-speech class.

Using the multiple acoustic and spectral feature vectors labeled with speech-category labels, the terminal computes the probability that each acoustic and spectral feature belongs to the speech class and to the non-speech class in the category lookup table, picks the class with the largest probability value for each vector, and adds to the acoustic or spectral feature vector the speech-category label corresponding to that class.

The terminal trains on the noisy speech data with speech-category labels added and obtains the initial classifier. The terminal then obtains a first validation set containing multiple first speech data items, inputs them into the classifier to obtain their corresponding class probabilities, and screens those probabilities. An operator uses the terminal to add speech-category labels to the selected first speech data; the terminal obtains the newly labeled first speech data and uses it to build a labeled validation set. The terminal trains again on the labeled validation set together with the noisy speech data, obtaining a validation classifier. The terminal next obtains a second validation set containing multiple second speech data items and inputs them into the validation classifier to obtain their class probabilities. The terminal selects the second speech data whose class probabilities fall within a preset range, has this data re-annotated, and retrains on the annotated second speech data together with the labeled noisy speech data to obtain a new classifier. Training continues in this way until, across all validation sets, the probability values of a preset number of acoustic or spectral feature vectors lie within the preset probability range; training then stops and the required classifier is obtained. A classifier with high accuracy can thus be obtained, so the acoustic and spectral feature vectors are classified accurately and speech and non-speech are identified accurately.
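The iterative train-screen-relabel-retrain loop can be sketched as follows. `train_fn` and `label_fn` are placeholders standing in for the actual training routine and the operator's manual annotation step, and the stopping rule is simplified to a single confidence threshold; none of these names come from the patent:

```python
def train_vad_classifier(labeled_data, validation_sets, train_fn, label_fn,
                         threshold=0.9):
    """Iterative self-training loop for the VAD classifier (sketch).

    train_fn(data) -> model with a predict_proba(item) method;
    label_fn(items) -> manually labeled items (stands in for the operator).
    Stops once a validation set is classified entirely with
    confidence >= threshold.
    """
    model = train_fn(labeled_data)                      # initial classifier
    for val_set in validation_sets:
        probs = [model.predict_proba(item) for item in val_set]
        # screen: keep the items the current model is unsure about
        uncertain = [item for item, p in zip(val_set, probs)
                     if max(p) < threshold]
        if not uncertain:                               # all confident: done
            break
        labeled_data = labeled_data + label_fn(uncertain)  # add manual labels
        model = train_fn(labeled_data)                  # retrain: new classifier
    return model
```

The design mirrors active learning: each round spends annotation effort only on the validation items the current classifier finds ambiguous.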

In one embodiment, the step of classifying the acoustic and spectral feature vectors with the classifier includes: taking the acoustic and spectral feature vectors as the classifier's input to obtain the decision values corresponding to them; adding a speech tag to the acoustic or spectral feature vector when the decision value is a first threshold; and adding a non-speech tag to the acoustic or spectral feature vector when the decision value is a second threshold.

After acquiring the noisy speech signal, the terminal extracts its acoustic and spectral features and converts them into the corresponding acoustic and spectral feature vectors. Having obtained the classifier, the terminal inputs the feature vectors into it; after the classifier classifies the input vectors, the decision values corresponding to the acoustic and spectral feature vectors are obtained. When the decision value equals the preset first threshold, the terminal adds a speech tag to the acoustic or spectral feature vector; the first threshold may be a range of values. When the decision value equals the preset second threshold, the terminal adds a non-speech tag. By classifying the acoustic and spectral feature vectors accurately with the classifier, the speech and non-speech portions of the noisy speech signal can be identified accurately.

For example, the decision value may be a value between 0 and 1; the preset first threshold may be 1, and the preset second threshold may be 0. When the decision value is 1, the terminal adds a speech tag to the acoustic or spectral feature vector; when it is 0, the terminal adds a non-speech tag. The acoustic and spectral feature vectors can thus be classified accurately.

In one embodiment, as shown in FIG. 2, a voice endpoint detection apparatus is provided, comprising an extraction module 202, a conversion module 204, a classification module 206, and a parsing module 208, wherein:

The extraction module 202 is configured to acquire a noisy speech signal and extract the acoustic and spectral features corresponding to it.

The conversion module 204 is configured to convert the acoustic and spectral features into the corresponding acoustic and spectral feature vectors.

The classification module 206 is configured to obtain a classifier and input the acoustic and spectral feature vectors into it, obtaining speech-tagged acoustic feature vectors and speech-tagged spectral feature vectors.

The parsing module 208 is configured to parse the speech-tagged acoustic and spectral feature vectors to obtain the corresponding speech signal, and to determine the starting and end points of the speech signal from its time order.

In one embodiment, the extraction module 202 is further configured to convert the noisy speech signal into a noisy speech spectrum and perform time-domain and/or frequency-domain and/or transform-domain analysis on it, obtaining the acoustic features corresponding to the noisy speech signal.

In one embodiment, the extraction module 202 is further configured to convert the noisy speech signal into a noisy speech spectrum and compute the noisy speech magnitude spectrum from it; perform dynamic noise estimation on the noisy speech spectrum according to the noisy speech magnitude spectrum to obtain a noise magnitude spectrum; estimate the speech magnitude spectrum of the clean speech signal from the noisy speech magnitude spectrum and the noise magnitude spectrum; and generate the spectral features corresponding to the noisy speech signal from the noisy speech magnitude spectrum, the noise magnitude spectrum, and the speech magnitude spectrum.

In one embodiment, the conversion module 204 is further configured to extract a preset number of frames before and after the current frame of the acoustic and spectral features; compute the mean vector and/or variance vector corresponding to the current frame from those frames; and perform log-domain conversion on the acoustic and spectral features after the mean and/or variance vectors are computed, obtaining the converted acoustic and spectral feature vectors.

In one embodiment, the apparatus further includes a training module configured to: obtain noisy speech data with speech-category labels added and train on it to obtain an initial classifier; obtain a first validation set containing multiple first speech data items; input the first speech data into the initial classifier to obtain their class probabilities; screen those class probabilities and add category labels to the selected first speech data, obtaining a labeled validation set; train on the labeled validation set together with the labeled noisy speech data, obtaining a validation classifier; obtain a second validation set containing multiple second speech data items; input the second speech data into the validation classifier to obtain their class probabilities; and, when the class probabilities of the second speech data reach a preset probability value, obtain the required classifier.

In one embodiment, the classification module 206 is further configured to take the acoustic and spectral feature vectors as the classifier's input, obtaining the decision values corresponding to them; add a speech tag to the acoustic or spectral feature vector when the decision value is a first threshold; and add a non-speech tag to the acoustic or spectral feature vector when the decision value is a second threshold.

In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 3. For example, the terminal may be, but is not limited to, a smartphone, tablet computer, notebook computer, personal computer, portable wearable device, or any other device capable of voice input. The computer device includes a processor, a memory, a network interface, and a voice input apparatus connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals over a network connection. When executed by the processor, the computer program implements a voice endpoint detection method. The voice input apparatus of the computer device may include a microphone, and may also include an external headset or the like.

Those skilled in the art will understand that the structure shown in FIG. 3 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the server to which the solution is applied; a specific server may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, the following steps are implemented: acquiring a noisy speech signal and extracting the acoustic and spectral features corresponding to it; converting the acoustic and spectral features into the corresponding acoustic and spectral feature vectors; obtaining a classifier and inputting the feature vectors into it to obtain speech-tagged acoustic feature vectors and speech-tagged spectral feature vectors; parsing the speech-tagged acoustic and spectral feature vectors to obtain the corresponding speech signal; and determining the starting and end points of the speech signal from its time order.

In one embodiment, when executing the computer program, the processor further implements the following steps: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain and/or frequency-domain and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In one embodiment, the processor further implements the following steps when executing the computer program: converting the noisy speech signal into a noisy speech spectrum and calculating the noisy speech amplitude spectrum from it; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral features corresponding to the noisy speech signal from the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.
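One common way to realize the estimation steps above is spectral subtraction with recursive noise smoothing. This is a hedged sketch, not the patent's method: the spectral floor, the smoothing factor, and the per-bin formulas are assumptions chosen for illustration.

```python
# Assumed sketch of the spectral-feature step: estimate the clean speech
# amplitude spectrum from the noisy amplitude spectrum and a dynamically
# updated noise amplitude spectrum.

def dynamic_noise_estimate(noise_mag, noisy_mag, alpha=0.95):
    """Recursive smoothing of the noise amplitude spectrum, typically
    applied during frames judged to contain no speech."""
    return [alpha * n + (1 - alpha) * y for n, y in zip(noise_mag, noisy_mag)]

def estimate_clean_spectrum(noisy_mag, noise_mag, floor=0.01):
    """Per-bin spectral subtraction: |S| ≈ max(|Y| - |N|, floor * |Y|).
    The floor keeps bins positive where noise exceeds the observation."""
    return [max(y - n, floor * y) for y, n in zip(noisy_mag, noise_mag)]

noisy = [1.0, 0.8, 0.3]
noise = [0.2, 0.2, 0.4]
print(estimate_clean_spectrum(noisy, noise))
```

The three spectra — noisy, noise, and estimated clean — can then be stacked per frame to form the spectral feature the patent describes.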

In one embodiment, the processor further implements the following steps when executing the computer program: extracting a preset number of frames before and after the current frame from the acoustic features and the spectral features; computing the mean vector and/or variance vector corresponding to the current frame using those surrounding frames; and applying a log-domain transformation to the acoustic and spectral features after computing the mean vector and/or variance vector, yielding the transformed acoustic feature vectors and spectral feature vectors.
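The context-window statistics and log-domain step can be sketched as below. The context width and the exact combination of mean/variance with the log transform are assumptions; the patent only states that surrounding frames contribute to the current frame's statistics before the features move to the log domain.

```python
import math

# Minimal sketch of the feature-conversion step: per-dimension mean over a
# context window around the current frame, followed by a log-domain
# transformation. eps is an assumed guard against log(0).

def context_mean(features, t, context=2):
    """Per-dimension mean over frames [t - context, t + context],
    clipped to the valid frame range."""
    lo, hi = max(0, t - context), min(len(features), t + context + 1)
    window = features[lo:hi]
    dims = len(window[0])
    return [sum(f[d] for f in window) / len(window) for d in range(dims)]

def to_log_domain(vec, eps=1e-10):
    """Move a non-negative feature vector into the log domain."""
    return [math.log(v + eps) for v in vec]

feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
m = context_mean(feats, t=1, context=1)
print(m)                 # mean over all three frames
print(to_log_domain(m))
```

A variance vector over the same window could be computed analogously and appended, giving the classifier a smoothed, dynamic-range-compressed view of each frame.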

In one embodiment, the processor further implements the following steps when executing the computer program: acquiring noisy speech data with speech-category labels and training on it to obtain an initial classifier; acquiring a first validation set comprising a plurality of first speech data; inputting the plurality of first speech data into the classifier to obtain their corresponding class probabilities; screening those class probabilities and adding category labels to the selected first speech data, yielding a labeled validation set; training with the labeled validation set and the training set to obtain a verification classifier; acquiring a second validation set comprising a plurality of second speech data; inputting the plurality of second speech data into the verification classifier to obtain their corresponding class probabilities; and, when the class probabilities of the plurality of second speech data reach a preset probability value, obtaining the required classifier.
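This training loop is a form of self-training with pseudo-labels, which the following toy sketch illustrates. Everything here is an assumption for illustration: the one-dimensional nearest-mean "classifier", the confidence thresholds, and the acceptance criterion all stand in for whatever real model and preset probability value an implementation would use.

```python
# Hedged sketch of the described loop: train an initial model, pseudo-label
# the confident samples of a first validation set, retrain to get the
# verification classifier, then accept once the second validation set is
# classified confidently.

def train(data):
    """Toy 'training': store the mean feature value of each class."""
    return {label: sum(x for x, y in data if y == label) /
                   sum(1 for _, y in data if y == label)
            for label in (0, 1)}

def predict_prob(model, x):
    """Probability of class 1, from distances to the two class means."""
    d0, d1 = abs(x - model[0]), abs(x - model[1])
    return d0 / (d0 + d1)

train_set = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
model = train(train_set)  # initial classifier

# Screen the first validation set: keep only confidently classified samples
# and give them pseudo-labels (confidence margin 0.3 is an assumption).
val1 = [0.15, 0.85, 0.5]
pseudo = [(x, int(predict_prob(model, x) > 0.5))
          for x in val1 if abs(predict_prob(model, x) - 0.5) > 0.3]

model = train(train_set + pseudo)  # verification classifier

# Accept when the second validation set reaches the preset confidence.
val2 = [0.05, 0.95]
ok = all(abs(predict_prob(model, x) - 0.5) > 0.3 for x in val2)
print(ok)
```

The ambiguous sample (0.5) is dropped during screening, so only reliable pseudo-labels enlarge the training set — the point of the two-stage validation the patent describes.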

In one embodiment, the processor further implements the following steps when executing the computer program: using the acoustic feature vectors and spectral feature vectors as inputs to the classifier to obtain the decision values corresponding to the acoustic feature vectors and spectral feature vectors; when the decision value is the first threshold, adding a speech tag to the acoustic feature vector or spectral feature vector; and when the decision value is the second threshold, adding a non-speech tag to the acoustic feature vector or spectral feature vector.
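The tagging rule above reduces to a simple comparison of each decision value against the two thresholds. In this sketch the decision values are assumed to be binary (1 for the first threshold, 0 for the second), which the patent does not mandate:

```python
# Minimal sketch of the tagging step: compare each feature vector's
# decision value against the two thresholds. Binary values 1/0 are an
# assumption for illustration.

SPEECH, NON_SPEECH = "speech", "non-speech"

def tag_vectors(decision_values, first_threshold=1, second_threshold=0):
    tags = []
    for d in decision_values:
        if d == first_threshold:
            tags.append(SPEECH)       # speech tag
        elif d == second_threshold:
            tags.append(NON_SPEECH)   # non-speech tag
    return tags

print(tag_vectors([1, 0, 1, 1, 0]))
```

The resulting tag sequence is what the parsing step consumes to reassemble the speech signal and locate its endpoints.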

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the following steps are implemented: acquiring a noisy speech signal and extracting the acoustic features and spectral features corresponding to the noisy speech signal; converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; obtaining a classifier and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain speech-tagged acoustic feature vectors and speech-tagged spectral feature vectors; parsing the speech-tagged acoustic feature vectors and speech-tagged spectral feature vectors to obtain the corresponding speech signal; and determining the start point and end point of the speech signal according to its time sequence.

In one embodiment, the computer program, when executed by the processor, further implements the following steps: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In one embodiment, the computer program, when executed by the processor, further implements the following steps: converting the noisy speech signal into a noisy speech spectrum and calculating the noisy speech amplitude spectrum from it; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral features corresponding to the noisy speech signal from the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.

In one embodiment, the computer program, when executed by the processor, further implements the following steps: extracting a preset number of frames before and after the current frame from the acoustic features and the spectral features; computing the mean vector and/or variance vector corresponding to the current frame using those surrounding frames; and applying a log-domain transformation to the acoustic and spectral features after computing the mean vector and/or variance vector, yielding the transformed acoustic feature vectors and spectral feature vectors.

In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring noisy speech data with speech-category labels and training on it to obtain an initial classifier; acquiring a first validation set comprising a plurality of first speech data; inputting the plurality of first speech data into the classifier to obtain their corresponding class probabilities; screening those class probabilities and adding category labels to the selected first speech data, yielding a labeled validation set; training with the labeled validation set and the training set to obtain a verification classifier; acquiring a second validation set comprising a plurality of second speech data; inputting the plurality of second speech data into the verification classifier to obtain their corresponding class probabilities; and, when the class probabilities of the plurality of second speech data reach a preset probability value, obtaining the required classifier.

In one embodiment, the computer program, when executed by the processor, further implements the following steps: using the acoustic feature vectors and spectral feature vectors as inputs to the classifier to obtain the decision values corresponding to the acoustic feature vectors and spectral feature vectors; when the decision value is the first threshold, adding a speech tag to the acoustic feature vector or spectral feature vector; and when the decision value is the second threshold, adding a non-speech tag to the acoustic feature vector or spectral feature vector.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.

The above embodiments represent only several implementations of the present application; their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A voice endpoint detection method, comprising: acquiring a noisy speech signal and extracting the acoustic features corresponding to the noisy speech signal; extracting the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum of the noisy speech signal; generating the spectral features corresponding to the noisy speech signal according to the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum; converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; obtaining a classifier and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain speech-tagged acoustic feature vectors and speech-tagged spectral feature vectors; parsing the speech-tagged acoustic feature vectors and speech-tagged spectral feature vectors to obtain the corresponding speech signal; and determining the start point and end point corresponding to the speech signal according to the time sequence of the speech signal.

2. The method according to claim 1, wherein before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further comprises: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

3. The method according to claim 1, wherein extracting the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum of the noisy speech signal comprises: converting the noisy speech signal into a noisy speech spectrum and calculating the noisy speech amplitude spectrum according to the noisy speech spectrum; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain the noise amplitude spectrum; and estimating the speech amplitude spectrum of the clean speech signal according to the noisy speech amplitude spectrum and the noise amplitude spectrum.

4. The method according to claim 1, wherein converting the acoustic features and spectral features comprises: extracting a preset number of frames before and after the current frame from the acoustic features and the spectral features; computing the mean vector and/or variance vector corresponding to the current frame using the preset number of frames before and after the current frame; and applying a log-domain transformation to the acoustic and spectral features after computing the mean vector and/or variance vector corresponding to the current frame, to obtain the transformed acoustic feature vectors and spectral feature vectors.

5. The method according to claim 1, wherein before the step of obtaining the classifier, the method further comprises: acquiring noisy speech data with speech-category labels and training on the noisy speech data to obtain an initial classifier; acquiring a first validation set comprising a plurality of first speech data; inputting the plurality of first speech data into the initial classifier to obtain the class probabilities corresponding to the plurality of first speech data; screening the class probabilities corresponding to the plurality of first speech data and adding category labels to the selected first speech data to obtain a labeled validation set; training with the labeled validation set and the labeled noisy speech data to obtain a verification classifier; acquiring a second validation set comprising a plurality of second speech data; inputting the plurality of second speech data into the verification classifier to obtain the class probabilities corresponding to the plurality of second speech data; and when the class probabilities corresponding to the plurality of second speech data reach a preset probability value, obtaining the required classifier.

6. The method according to any one of claims 1 to 5, wherein classifying the acoustic feature vectors and spectral feature vectors with the classifier comprises: using the acoustic feature vectors and spectral feature vectors as inputs to the classifier to obtain the decision values corresponding to the acoustic feature vectors and spectral feature vectors; when the decision value is the first threshold, adding a speech tag to the acoustic feature vector or spectral feature vector; and when the decision value is the second threshold, adding a non-speech tag to the acoustic feature vector or spectral feature vector.

7. A voice endpoint detection device, comprising: an extraction module configured to acquire a noisy speech signal, extract the acoustic features corresponding to the noisy speech signal, extract the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum of the noisy speech signal, and generate the spectral features corresponding to the noisy speech signal according to the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum; a conversion module configured to convert the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; a classification module configured to obtain a classifier and input the acoustic feature vectors and spectral feature vectors into the classifier to obtain speech-tagged acoustic feature vectors and speech-tagged spectral feature vectors; and a parsing module configured to parse the speech-tagged acoustic feature vectors and speech-tagged spectral feature vectors to obtain the corresponding speech signal, and determine the start point and end point corresponding to the speech signal according to the time sequence of the speech signal.

8. The device according to claim 7, wherein the conversion module is further configured to extract a preset number of frames before and after the current frame from the acoustic features and the spectral features, compute the mean vector and/or variance vector of the current frame using the preset number of frames before and after the current frame, and apply a log-domain transformation to the acoustic and spectral features after computing the mean vector and/or variance vector corresponding to the current frame, to obtain the transformed acoustic feature vectors and spectral feature vectors.

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN201810048223.3A 2018-01-18 2018-01-18 Voice endpoint detection method, apparatus, computer equipment and storage medium Active CN108198547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810048223.3A CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method, apparatus, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810048223.3A CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method, apparatus, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108198547A CN108198547A (en) 2018-06-22
CN108198547B true CN108198547B (en) 2020-10-23

Family

ID=62589616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810048223.3A Active CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method, apparatus, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108198547B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922556B (en) * 2018-07-16 2019-08-27 百度在线网络技术(北京)有限公司 Sound processing method, device and equipment
CN110752973B (en) * 2018-07-24 2020-12-25 Tcl科技集团股份有限公司 Terminal equipment control method and device and terminal equipment
CN109036471B (en) * 2018-08-20 2020-06-30 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device
CN110070884B (en) * 2019-02-28 2022-03-15 北京字节跳动网络技术有限公司 Audio starting point detection method and device
CN110265032A (en) * 2019-06-05 2019-09-20 平安科技(深圳)有限公司 Conferencing data analysis and processing method, device, computer equipment and storage medium
CN110322872A (en) * 2019-06-05 2019-10-11 平安科技(深圳)有限公司 Conference voice data processing method, device, computer equipment and storage medium
CN110415704A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Trial record data processing method, device, computer equipment and storage medium
CN110428853A (en) * 2019-08-30 2019-11-08 北京太极华保科技股份有限公司 Voice activity detection method, Voice activity detection device and electronic equipment
CN110808061B (en) * 2019-11-11 2022-03-15 广州国音智能科技有限公司 Voice separation method and device, mobile terminal and computer readable storage medium
CN110910906A (en) * 2019-11-12 2020-03-24 国网山东省电力公司临沂供电公司 Audio endpoint detection and noise reduction method based on power intranet
CN111179972A (en) * 2019-12-12 2020-05-19 中山大学 Human voice detection algorithm based on deep learning
CN111192600A (en) * 2019-12-27 2020-05-22 北京网众共创科技有限公司 Sound data processing method and device, storage medium and electronic device
CN111626061B (en) * 2020-05-27 2025-06-27 深圳前海微众银行股份有限公司 Meeting record generation method, device, equipment and readable storage medium
CN111916060B (en) * 2020-08-12 2022-03-01 四川长虹电器股份有限公司 Deep learning voice endpoint detection method and system based on spectral subtraction
CN112652324A (en) * 2020-12-28 2021-04-13 深圳万兴软件有限公司 Speech enhancement optimization method, speech enhancement optimization system and readable storage medium
CN113327626B (en) * 2021-06-23 2023-09-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device, equipment and storage medium
CN113744725B (en) * 2021-08-19 2024-07-05 清华大学苏州汽车研究院(相城) Training method of voice endpoint detection model and voice noise reduction method
CN114974258B (en) * 2022-07-27 2022-12-16 深圳市北科瑞声科技股份有限公司 Speaker separation method, device, equipment and storage medium based on voice processing
CN115497511B (en) * 2022-10-31 2025-01-07 广州方硅信息技术有限公司 Training and detecting method, device, equipment and medium for voice activity detection model

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040064314A1 (en) * 2002-09-27 2004-04-01 Aubert Nicolas De Saint Methods and apparatus for speech end-point detection
KR100745976B1 (en) * 2005-01-12 2007-08-06 삼성전자주식회사 Method and device for distinguishing speech and non-voice using acoustic model
JP4950930B2 (en) * 2008-04-03 2012-06-13 株式会社東芝 Apparatus, method and program for determining voice / non-voice
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
CN101599269B (en) * 2009-07-02 2011-07-20 中国农业大学 Phonetic end point detection method and device therefor
CN103489454B (en) * 2013-09-22 2016-01-20 浙江大学 Based on the sound end detecting method of wave configuration feature cluster
CN103730124A (en) * 2013-12-31 2014-04-16 上海交通大学无锡研究院 Noise robustness endpoint detection method based on likelihood ratio test
CN105023572A (en) * 2014-04-16 2015-11-04 王景芳 Noised voice end point robustness detection method
CN104021789A (en) * 2014-06-25 2014-09-03 厦门大学 Self-adaption endpoint detection method using short-time time-frequency value
US10121471B2 (en) * 2015-06-29 2018-11-06 Amazon Technologies, Inc. Language model speech endpointing
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN107393526B (en) * 2017-07-19 2024-01-02 腾讯科技(深圳)有限公司 Voice silence detection method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN108198547A (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN108198547B (en) Voice endpoint detection method, apparatus, computer equipment and storage medium
CN108877775B (en) Voice data processing method and device, computer equipment and storage medium
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
WO2020029404A1 (en) Speech processing method and device, computer device and readable storage medium
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
CN111145782B (en) Overlapped speech recognition method, device, computer equipment and storage medium
Hibare et al. Feature extraction techniques in speech processing: a survey
Muckenhirn et al. Understanding and Visualizing Raw Waveform-Based CNNs.
Dişken et al. A review on feature extraction for speaker recognition under degraded conditions
Chaudhary et al. Gender identification based on voice signal characteristics
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
Ajmera et al. Fractional Fourier transform based features for speaker recognition using support vector machine
Seo et al. A maximum a posterior-based reconstruction approach to speech bandwidth expansion in noise
Archana et al. Gender identification and performance analysis of speech signals
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
Zou et al. Improved voice activity detection based on support vector machine with high separable speech feature vectors
Nidhyananthan et al. Language and text-independent speaker identification system using GMM
Ahmed et al. CNN-based speech segments endpoints detection framework using short-time signal energy features
Sood et al. Speech recognition employing mfcc and dynamic time warping algorithm
Hafen et al. Speech information retrieval: a review
CN112216285B (en) Multi-user session detection method, system, mobile terminal and storage medium
Sunija et al. Comparative study of different classifiers for Malayalam dialect recognition system
Marković et al. Partial mutual information based input variable selection for supervised learning approaches to voice activity detection
Daqrouq et al. Arabic vowels recognition based on wavelet average framing linear prediction coding and neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant