CN111402908A - Voice processing method, device, electronic device and storage medium

Info

Publication number: CN111402908A
Application number: CN202010235282.9A
Authority: CN
Inventors: 李泽帅; 黄远望
Original assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Current assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-07-10

Abstract

The application provides a voice processing method and apparatus, an electronic device, and a storage medium. The method comprises: decoding original encoded data obtained by voice sampling to obtain decoded audio data; if the sampling rate and/or the sampling bit depth of the decoded audio data is determined to be greater than a set threshold, down-sampling the decoded audio data to obtain target audio data; and sending the target audio data to a server to obtain, from the server, the text produced by voice recognition of the target audio data. Because audio data with a high sampling rate and/or high sampling bit depth is down-sampled before it is transmitted to the server, the amount of data transmitted is reduced and the data transmission rate is improved.

Description

Voice processing method, device, electronic device and storage medium

Technical Field

The present application relates to the technical field of voice processing, and in particular to a voice processing method, apparatus, electronic device, and storage medium.

Background

A speech-to-text (STT) system converts spoken words into text files for subsequent use. The common STT approach at present is to transmit the collected audio file (for example, audio in MP3, M4A, or AMR format) directly to a server; the server performs voice conversion on the audio data and returns the converted text.

To ensure sound quality, the sampling rate, sampling bit depth, and bit rate are raised substantially during recording. This increases the size of the transmitted audio file, adds to the burden of transmitting the file to the server, and reduces transmission efficiency.

Summary of the Invention

The present application aims to solve, at least to some extent, one of the technical problems in the related art.

An embodiment of the first aspect of the present application provides a voice processing method, including:

decoding original encoded data obtained by voice sampling to obtain decoded audio data;

if it is determined that the sampling rate and/or the sampling bit depth of the decoded audio data is greater than a set threshold, down-sampling the decoded audio data to obtain target audio data;

sending the target audio data to a server to obtain, from the server, the text produced by voice recognition of the target audio data.

As a first possible implementation of the embodiment of the present application, down-sampling the decoded audio data includes:

down-sampling the decoded audio data using the Synchronous Sample Rate Converter (SSRC) algorithm.

As a second possible implementation of the embodiment of the present application, down-sampling the decoded audio data using the SSRC algorithm includes:

filtering a sequence of a set length in the decoded audio data with a finite impulse response (FIR) filter;

appending a target sequence of the same set length to the filtered sequence to obtain the input sequence of the Fourier transform, where every element of the target sequence is zero;

performing a fast Fourier transform on the input sequence to obtain a frequency-domain sequence;

filtering the frequency-domain sequence and then performing an inverse fast Fourier transform to obtain a time-domain sequence;

resampling the time-domain sequence according to a set down-sampling rate to obtain the target audio data.

As a third possible implementation of the embodiment of the present application, before sending the target audio data to the server, the method further includes:

if the target audio data includes two-channel data, removing the data of one channel from the two-channel data.

As a fourth possible implementation of the embodiment of the present application, removing the data of one channel from the two-channel data includes:

determining the data length occupied by single-channel data in the target audio data;

at every interval of that data length in the target audio data, removing a segment of data of that data length.

As a fifth possible implementation of the embodiment of the present application, before sending the target audio data to the server, the method further includes:

performing voice endpoint detection on the target audio data to extract the voiced part and the unvoiced part from the target audio data and remove the silent part;

where the energy value of the voiced part is greater than a first energy threshold;

the energy value of the unvoiced part is greater than a second energy threshold;

and the first energy threshold is greater than the second energy threshold.

As a sixth possible implementation of the embodiment of the present application, before sending the target audio data to the server, the method further includes:

if the bit rate of the target audio data is lower than a set bit rate, performing compression encoding using linear predictive coding;

if the bit rate of the target audio data is not lower than the set bit rate, performing compression encoding using transform coding.
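The bit-rate rule of the sixth implementation can be sketched as a small selection function. This is only an illustration: the names `CompressionMode` and `select_compression_mode` are hypothetical, and the value used for the set bit rate is an assumption, since the source does not give one. The sketch only picks the coding family and does not implement either encoder.

```python
from enum import Enum


class CompressionMode(Enum):
    LINEAR_PREDICTION = "linear_prediction"
    TRANSFORM = "transform"


# Hypothetical "set bit rate" in bits per second; the source does not specify a value.
SET_BIT_RATE_BPS = 16_000


def select_compression_mode(bit_rate_bps: int,
                            set_bit_rate_bps: int = SET_BIT_RATE_BPS) -> CompressionMode:
    """Pick linear predictive coding below the set bit rate, transform coding otherwise."""
    if bit_rate_bps < set_bit_rate_bps:
        return CompressionMode.LINEAR_PREDICTION
    return CompressionMode.TRANSFORM
```

A bit rate exactly equal to the set bit rate is "not lower than" it, so it falls into the transform coding branch.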

In the voice processing method of the embodiment of the present application, original encoded data obtained by voice sampling is decoded to obtain decoded audio data; if it is determined that the sampling rate and/or the sampling bit depth of the decoded audio data is greater than a set threshold, the decoded audio data is down-sampled to obtain target audio data; and the target audio data is sent to a server to obtain, from the server, the text produced by voice recognition of the target audio data. Audio data with a high sampling rate and/or high sampling bit depth is thus down-sampled before being transmitted to the server for voice recognition, which reduces the amount of data transmitted and improves the data transmission rate.

An embodiment of the second aspect of the present application provides a voice processing apparatus, including:

a decoding module, configured to decode original encoded data obtained by voice sampling to obtain decoded audio data;

a down-sampling module, configured to down-sample the decoded audio data to obtain target audio data if it is determined that the sampling rate and/or the sampling bit depth of the decoded audio data is greater than a set threshold;

a sending module, configured to send the target audio data to a server to obtain, from the server, the text produced by voice recognition of the target audio data.

The voice processing apparatus of the embodiment of the present application decodes original encoded data obtained by voice sampling to obtain decoded audio data; if it is determined that the sampling rate and/or the sampling bit depth of the decoded audio data is greater than a set threshold, it down-samples the decoded audio data to obtain target audio data; and it sends the target audio data to a server to obtain, from the server, the text produced by voice recognition of the target audio data. Audio data with a high sampling rate and/or high sampling bit depth is thus down-sampled before being transmitted to the server for voice recognition, which reduces the amount of data transmitted and improves the data transmission rate.

An embodiment of the third aspect of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the voice processing method described in the embodiment of the first aspect is implemented.

An embodiment of the fourth aspect of the present application provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice processing method described in the embodiment of the first aspect.

Additional aspects and advantages of the present application will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present application.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic flowchart of a first voice processing method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a second voice processing method provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of a third voice processing method provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of a fourth voice processing method provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a voice processing apparatus provided by an embodiment of the present application.

Detailed Description

Embodiments of the present application are described in detail below. Examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary and are intended to explain the present application; they should not be construed as limiting it.

In the related art, STT already achieves a very high recognition rate on mono audio with a sampling rate of 16 kHz and a sampling bit depth of 16 bits. Using audio with a sampling rate above 16 kHz or a bit depth above 16 bits in the STT process therefore no longer improves the recognition rate significantly; instead, the larger audio file increases the resources consumed during audio transmission.
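To make the size argument concrete, the short calculation below compares the uncompressed PCM data rate of a high-quality recording with that of 16 kHz, 16-bit mono audio. The 48 kHz, 24-bit stereo recording format is an assumed example and does not come from the source; the function name is likewise illustrative.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8


# Assumed high-quality recording format (not taken from the source).
high_quality = pcm_bytes_per_second(48_000, 24, 2)   # 288000 bytes/s
# Format on which STT is described as already recognizing well.
stt_target = pcm_bytes_per_second(16_000, 16, 1)     # 32000 bytes/s

reduction_factor = high_quality / stt_target         # 9.0
```

Under these assumptions, each second of audio shrinks by a factor of nine before any compression encoding is applied.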

To address the above technical problem, the present application proposes a voice processing method: original encoded data obtained by voice sampling is decoded to obtain decoded audio data; if it is determined that the sampling rate and/or the sampling bit depth of the decoded audio data is greater than a set threshold, the decoded audio data is down-sampled to obtain target audio data; and the target audio data is sent to a server to obtain, from the server, the text produced by voice recognition of the target audio data. By down-sampling audio data with a high sampling rate and/or high sampling bit depth before transmitting it to the server for voice recognition, the method reduces the amount of data transmitted and improves the data transmission rate.

The voice processing method, apparatus, electronic device, and storage medium of the embodiments of the present application are described below with reference to the accompanying drawings.

FIG. 1 is a schematic flowchart of a first voice processing method provided by an embodiment of the present application.

The embodiment of the present application is described by taking as an example the voice processing method being configured in a voice processing apparatus. The apparatus can be applied to any electronic device so that the electronic device can perform the voice processing function.

The electronic device may be a personal computer (PC), a cloud device, a mobile device, or the like. The mobile device may be, for example, a hardware device running any of various operating systems, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or a vehicle-mounted device.

As shown in FIG. 1, the voice processing method includes the following steps:

Step 101: decode original encoded data obtained by voice sampling to obtain decoded audio data.

Here, the original encoded data obtained by voice sampling refers to the original encoded data obtained by collecting a voice signal from a hardware device and performing analog-to-digital conversion on it.

Communication systems have two main categories of signal source: analog signals and digital signals. For example, the voice signal output by a microphone is an analog signal, while text and computer data are digital signals. Compared with analog signals, digital signals offer strong resistance to interference and do not accumulate noise. Therefore, if the input is an analog signal, it must be digitized in the source coding part of a digital communication system.

After the voice signal is collected from the hardware device, digitizing it takes three steps: sampling, quantization, and encoding. Sampling replaces the original time-continuous signal with a sequence of signal samples taken at regular intervals, that is, it discretizes the analog signal in time. Quantization approximates the original continuously varying amplitude with a finite set of amplitude values, turning the continuous amplitude of the analog signal into a finite number of discrete values at fixed intervals. Encoding represents each quantized sample as a group of multi-bit binary code, completing the conversion from analog signal to digital signal.

It should be noted that the audio obtained by encoding the collected raw data is stored in an audio file; audio file formats include MP3, M4A, AMR, WAV, and so on.

As a possible implementation, pulse code modulation (PCM) may be used to encode the collected raw data. The main encoding process is to sample an analog signal such as voice or image at regular intervals to discretize it, round each sampled value to the nearest quantization level, and represent the amplitude of each sampled pulse with a group of binary code.
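The three PCM stages (sampling, quantization, encoding) can be illustrated with a minimal sketch. It assumes signed 16-bit little-endian output and an input given as a function of time with values in [-1, 1]; the function name `pcm_encode` and these format choices are assumptions made for illustration, not details taken from the source.

```python
import math
import struct

MAX_LEVEL_16BIT = 2 ** 15 - 1  # largest quantization level for signed 16-bit PCM


def pcm_encode(signal, num_samples: int, sample_rate_hz: int) -> bytes:
    """Sample signal(t), quantize to 16-bit levels, and pack as little-endian bytes."""
    levels = []
    for i in range(num_samples):
        t = i / sample_rate_hz                       # sampling: discretize in time
        value = max(-1.0, min(1.0, signal(t)))       # clip to the representable range
        level = math.floor(value * MAX_LEVEL_16BIT + 0.5)  # quantization: nearest level
        levels.append(level)
    return struct.pack("<%dh" % num_samples, *levels)      # encoding: binary code words


# Usage: 10 ms of a 440 Hz tone sampled at 16 kHz gives 160 samples, 2 bytes each.
tone = pcm_encode(lambda t: math.sin(2 * math.pi * 440 * t), 160, 16_000)
```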

In the embodiment of the present application, after the original encoded data obtained by voice sampling is acquired, it needs to be decoded to obtain the decoded audio data.

As a possible implementation, the byte stream of the audio file storing the original encoded data can be fed into the input data buffer of MediaCodec. MediaCodec works in consumer mode: it asynchronously reads the byte stream from the data buffer and decodes it, finally yielding the decoded audio data.

Step 102: if it is determined that the sampling rate and/or the sampling bit depth of the decoded audio data is greater than a set threshold, down-sample the decoded audio data to obtain target audio data.

The sampling rate of audio data is the number of times the sound signal is sampled per second; the higher the sampling rate, the more faithful and natural the reproduced sound. Down-sampling, also called decimation, is a multi-rate digital signal processing technique that lowers the sampling rate of a signal, and is commonly used to reduce the data transmission rate or the data size.

In the embodiment of the present application, a high sampling rate and/or sampling bit depth in the decoded audio data makes the transmitted audio file large, which increases the transmission burden and lowers transmission efficiency. Therefore, in the present application, after the original encoded data obtained by voice sampling is decoded to obtain the decoded audio data, if it is determined that the sampling rate and/or the sampling bit depth of the decoded audio data is greater than the set threshold, the decoded audio data is down-sampled to obtain the target audio data. Down-sampling reduces the size of the audio data, which helps increase the data transmission rate.

For example, when it is determined that the sampling rate of the decoded audio data is greater than 16 kHz, or that the sampling bit depth is greater than 16 bits, the decoded audio data is down-sampled to obtain the target audio data, so as to reduce the amount of audio data to be transmitted.
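The threshold check in this example can be written as a one-line predicate. The sketch below uses the 16 kHz and 16-bit values from the example as thresholds; the constant and function names are illustrative only.

```python
SAMPLE_RATE_THRESHOLD_HZ = 16_000  # threshold from the example above
BIT_DEPTH_THRESHOLD = 16           # threshold from the example above


def needs_downsampling(sample_rate_hz: int, bit_depth: int) -> bool:
    """True if the sampling rate and/or the bit depth exceeds its threshold."""
    return (sample_rate_hz > SAMPLE_RATE_THRESHOLD_HZ
            or bit_depth > BIT_DEPTH_THRESHOLD)
```

Audio that is already at 16 kHz and 16 bits is left untouched, because neither value is strictly greater than its threshold.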

As a possible implementation, when it is determined that the sampling rate and/or the sampling bit depth of the decoded audio data is greater than the set threshold, the Synchronous Sample Rate Converter (SSRC) algorithm may be used to down-sample the decoded audio data to obtain the target audio data.

It should be noted that when the SSRC algorithm is used to down-sample the decoded audio data, the sampling rates before and after conversion must both be integers; the SSRC algorithm does not support conversion between arbitrary frequencies.
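One way to enforce the integer-rate constraint before conversion is sketched below. Reducing the two rates to a rational ratio with the greatest common divisor is an assumption about how a converter might use integer rates; the source only states that both rates must be integers. The function name `resampling_ratio` is hypothetical.

```python
from math import gcd


def resampling_ratio(src_rate_hz: int, dst_rate_hz: int) -> tuple:
    """Return (up, down) such that dst_rate = src_rate * up / down, in lowest terms."""
    for rate in (src_rate_hz, dst_rate_hz):
        if not isinstance(rate, int) or rate <= 0:
            raise ValueError("sampling rates must be positive integers")
    divisor = gcd(src_rate_hz, dst_rate_hz)
    return dst_rate_hz // divisor, src_rate_hz // divisor
```

For instance, converting 48 000 Hz to 16 000 Hz reduces to the ratio 1/3, while 44 100 Hz to 16 000 Hz reduces to 160/441.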

In the embodiment of the present application, when the SSRC algorithm is used to down-sample the decoded audio data, a sequence of a set length in the decoded audio data is first filtered with a finite impulse response (FIR) filter; a target sequence of the same set length, whose elements are all zero, is appended to the filtered sequence to obtain the input sequence of the Fourier transform; a fast Fourier transform is performed on the input sequence to obtain a frequency-domain sequence; the frequency-domain sequence is filtered and then an inverse fast Fourier transform is performed to obtain a time-domain sequence; and the time-domain sequence is resampled according to the set down-sampling rate to obtain the target audio data.

As an example, the SSRC algorithm uses an n-point fast Fourier transform (FFT) to down-sample the decoded audio data. First, the first n/2 encoded input samples are passed through one 9th-order FIR digital filter, y(n) = a0*x(n) + a1*x(n-1) + ... + a8*x(n-8), to produce the input of the FFT. Then n/2 zeros are appended to the n/2 filter outputs, and an FFT is performed on the resulting n data points. In the frequency domain, complex-domain windowing and filtering are applied to the data, followed by an inverse FFT, which yields time-domain data again. Finally, the data is trimmed and envelope-processed according to the required length of the resampled output, and the resampled target audio data is output.
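The sequence of steps described above (FIR filtering, zero padding, FFT, frequency-domain filtering, inverse FFT, resampling) can be sketched for a single block as follows. This is a simplified illustration and not the SSRC implementation itself: it assumes an integer down-sampling factor, a block length that is a power of two, caller-supplied FIR coefficients, and a plain brick-wall low-pass in the frequency domain. The windowing and envelope processing mentioned in the text are omitted, and all function names are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. The inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fir_filter(block, coeffs):
    """y(i) = sum over k of coeffs[k] * x(i - k); samples before the block count as zero."""
    return [
        sum(c * block[i - k] for k, c in enumerate(coeffs) if i - k >= 0)
        for i in range(len(block))
    ]


def downsample_block(block, factor, coeffs):
    """Down-sample one block of n/2 samples by an integer factor."""
    half = len(block)
    n = 2 * half                                         # FFT size
    padded = fir_filter(block, coeffs) + [0.0] * half    # append n/2 zeros
    spectrum = fft(padded)                               # time domain -> frequency domain
    cutoff = n // (2 * factor)                           # bin of the new Nyquist frequency
    for k in range(n):
        if min(k, n - k) >= cutoff:                      # positive and negative frequencies
            spectrum[k] = 0j                             # brick-wall low-pass
    time_domain = [v.real / n for v in fft(spectrum, inverse=True)]
    return time_domain[:half:factor]                     # keep every factor-th sample
```

Calling `downsample_block(block, 3, coeffs)` on a 1024-sample block taken from 48 kHz audio would return 342 samples at a nominal 16 kHz. A production resampler would additionally overlap successive blocks so that the filter tails carried into the zero-padded half are not lost.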

Step 103: send the target audio data to the server to obtain, from the server, the text produced by voice recognition of the target audio data.

In the embodiment of the present application, after the decoded audio data is down-sampled to obtain the target audio data, the target audio data is sent to the server. The server then performs voice recognition on the received target audio data, and once it has obtained the corresponding text, that text can be retrieved from the server.

In the voice processing method of the embodiment of the present application, original encoded data obtained by voice sampling is decoded to obtain decoded audio data; if it is determined that the sampling rate and/or the sampling bit depth of the decoded audio data is greater than a set threshold, the decoded audio data is down-sampled to obtain target audio data; and the target audio data is sent to the server to obtain, from the server, the text produced by voice recognition of the target audio data. Audio data with a high sampling rate and/or high sampling bit depth is thus down-sampled before being transmitted to the server for voice recognition, which reduces the amount of data transmitted and improves the data transmission rate.

On the basis of the above embodiment, after the decoded audio data is down-sampled in step 102 to obtain the target audio data, if the target audio data includes two-channel data, the data of one channel also needs to be removed, so as to reduce the size of the target audio data to be transmitted and thereby help improve the data transmission rate. The specific implementation is shown in FIG. 2, a schematic flowchart of a second voice processing method provided by an embodiment of the present application.

As shown in FIG. 2, the voice processing method may further include the following steps:

Step 201: determine the data length occupied by single-channel data in the target audio data.

Here, the data length refers to the number of bytes the data occupies.

In the embodiment of the present application, the target audio data obtained by down-sampling decoded audio data whose sampling rate and/or sampling bit depth is greater than the set threshold may include two-channel data. Because two-channel data occupies twice the space of single-channel data, the data of one channel can be removed from the two-channel data to reduce the amount of data transmitted.

Specifically, after the target audio data is obtained, the data length occupied by single-channel data in the target audio data can be determined. For example, the data of one channel may occupy 2 bytes or 1 byte.

Step 202: at every interval of the data length in the target audio data, remove a segment of data of that data length.

In the embodiment of the present application, once the data length occupied by single-channel data in the target audio data has been determined, a segment of data of that length can be removed at every interval of that length.

For example, if single-channel data in the target audio data occupies 2 bytes, 2 bytes of data can be removed after every 2 bytes, which yields single-channel target audio data on its own.

For example, the following formula can be used to remove the left channel data from the target audio data:

f(n)=f(0)+f(1)+f(4)+f(5)+...+f(2n-1)+f(2n);

The following formula can be used to remove the right channel data from the target audio data:

f(n)=f(2)+f(3)+f(6)+f(7)+...+f(2n-3)+f(2n-2).
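The byte selection expressed by the two formulas can be sketched as a function over an interleaved two-channel byte buffer. The function name `drop_channel` and its parameters are illustrative; which of the two interleaved channels is the left one depends on the audio format and is not assumed here.

```python
def drop_channel(interleaved: bytes, sample_width: int, keep_first: bool = True) -> bytes:
    """Keep one channel of interleaved two-channel data.

    sample_width is the data length (in bytes) occupied by single-channel data.
    """
    frame_width = 2 * sample_width
    if len(interleaved) % frame_width != 0:
        raise ValueError("buffer length is not a whole number of two-channel frames")
    offset = 0 if keep_first else sample_width
    kept = bytearray()
    for start in range(offset, len(interleaved), frame_width):
        kept += interleaved[start:start + sample_width]  # keep one segment, skip the next
    return bytes(kept)
```

With a 2-byte data length, `drop_channel(data, 2)` keeps bytes 0, 1, 4, 5, and so on, as in the first formula, while `drop_channel(data, 2, keep_first=False)` keeps bytes 2, 3, 6, 7, and so on, as in the second.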

本申请实施例的语音处理方法，若目标音频数据中包括双声道数据，剔除双声道数据中一个声道数据时，可以通过确定目标音频数据中单一声道数据占用的数据长度，对目标音频数据每间隔数据长度，剔除一段符合数据长度的数据。由此，对双声道的目标音频数据进行剔除后，得到单声道的目标音频数据，从而减小了需要传输的目标音频数据的体积，有利于提高数据的传输速率。In the speech processing method of the embodiment of the present application, if the target audio data includes two-channel data, when one channel of data in the two-channel data is excluded, the data length occupied by the single-channel data in the target audio data can be determined to determine the data length of the target audio data. For each data length of the audio data, remove a piece of data that matches the data length. Therefore, after the target audio data of two channels is eliminated, the target audio data of single channel is obtained, thereby reducing the volume of the target audio data to be transmitted, which is beneficial to improve the transmission rate of data.

在一种可能的情况下，对解码音频数据降采样，得到目标音频数据后，在向服务器端发送目标音频数据之前，还可以对目标音频数据进行语音端点检测，以从目标音频数据中提取出浊音部分和清音部分，并去除静音部分。由此，实现了从目标音频数据中识别和消除长时间的静音期，以达到在不降低业务质量的情况下节省话路资源的作用。下面结合图3对上述过程进行详细介绍，图3为本申请实施例提供的第三种语音处理方法的流程示意图。In a possible case, after down-sampling the decoded audio data to obtain the target audio data, before sending the target audio data to the server, voice endpoint detection may also be performed on the target audio data to extract the target audio data from the target audio data. Voiced parts and unvoiced parts, and remove the silent parts. Thus, it is realized to identify and eliminate long silent periods from the target audio data, so as to achieve the effect of saving voice channel resources without reducing service quality. The above process will be described in detail below with reference to FIG. 3 , which is a schematic flowchart of a third speech processing method provided by an embodiment of the present application.

如图3所示，该语音处理方法，还可以包括以下步骤：As shown in Figure 3, the voice processing method may further include the following steps:

步骤301，对语音采样得到的原始编码数据解码，得到解码音频数据。Step 301: Decode the original encoded data obtained by the speech sampling to obtain decoded audio data.

步骤302，若确定解码音频数据的采样率和/或采样位数大于设定阈值，则对解码音频数据降采样，得到目标音频数据。Step 302, if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data are greater than the set threshold, down-sample the decoded audio data to obtain target audio data.

本申请实施例中，步骤301和步骤302的实现过程，可以参见上述实施例中步骤101和步骤102的实现过程，在此不再赘述。In this embodiment of the present application, for the implementation process of step 301 and step 302, reference may be made to the implementation process of step 101 and step 102 in the foregoing embodiment, and details are not described herein again.

Step 303: perform voice endpoint detection on the target audio data to extract the voiced part and the unvoiced part from the target audio data and remove the silent part.

Voice endpoint detection (voice activity detection, VAD) identifies where speech appears and disappears in the target audio data.

In the present application, the target audio data obtained by down-sampling the decoded audio data may include a voiced part, an unvoiced part, and a silent part. To reduce the size of the audio file when the target audio data is transmitted, voice endpoint detection can be performed on the target audio data to extract the voiced and unvoiced parts and remove the silent part.

When performing voice endpoint detection on the target audio data, the audio data may first be divided into frames and features extracted from each frame. A classifier is trained on a set of frames taken from regions known to be speech or silence and is then used to classify unknown frames as belonging to the voiced part, the unvoiced part, or the silent part.

In one possible case, the feature extracted from each frame of the target audio data is its energy. Note that the energy value of the voiced part of the target audio data is greater than a first energy threshold and the energy value of the unvoiced part is greater than a second energy threshold, where the first energy threshold is greater than the second energy threshold. The voiced and unvoiced parts of the target audio data can therefore be extracted by setting energy thresholds.

As one possible implementation, when voice endpoint detection is performed on the target audio data, the audio data is divided into frames, the energy value of each frame is extracted, and each frame's energy value is compared with the first energy threshold. If a frame's energy value is greater than the first energy threshold, that frame is determined to be part of the voiced part. The voiced part can thus be extracted from the target audio data.

Further, after the voiced part has been extracted, the energy value of each remaining frame of the target audio data is compared with the second energy threshold. If a frame's energy value is greater than the second energy threshold, that frame is determined to be part of the unvoiced part.
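The two-threshold comparison described above can be sketched as follows. The frame length, the threshold values, and the use of the sum of squared samples as the frame energy are assumptions made for illustration; the application does not fix any of them.

```python
def split_into_frames(samples, frame_len):
    """Split samples into consecutive frames; a trailing partial frame is dropped."""
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


def frame_energy(frame):
    """Short-time energy, taken here as the sum of squared samples."""
    return sum(s * s for s in frame)


def classify_frame(energy, first_threshold, second_threshold):
    """Compare against the first (higher) threshold, then the second (lower) one."""
    if energy > first_threshold:
        return "voiced"
    if energy > second_threshold:
        return "unvoiced"
    return "silent"


def remove_silence(samples, frame_len, first_threshold, second_threshold):
    """Keep voiced and unvoiced frames and discard silent frames."""
    if first_threshold <= second_threshold:
        raise ValueError("the first energy threshold must exceed the second")
    kept = []
    for frame in split_into_frames(samples, frame_len):
        label = classify_frame(frame_energy(frame), first_threshold, second_threshold)
        if label != "silent":
            kept.extend(frame)
    return kept


samples = [100, -100, 100, -100, 10, -10, 10, -10, 0, 1, 0, -1]
print(remove_silence(samples, frame_len=4, first_threshold=1000, second_threshold=100))
# [100, -100, 100, -100, 10, -10, 10, -10]
```

In this example the three frames have energies 40000, 400, and 2, so they are labelled voiced, unvoiced, and silent respectively, and only the last frame is discarded.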

In this embodiment, a short-time zero-crossing-rate threshold may be used to distinguish the unvoiced part from the silent part of the target audio data. Once the unvoiced part has been distinguished, the silent part can be removed. Discarding the silent audio segments shrinks the audio file itself and so reduces the size of the speech file that must be transmitted during the STT process.
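A short-time zero-crossing rate can be computed per frame as sketched below. The application does not define the formula or the direction of the comparison; this sketch assumes the common definition (the fraction of adjacent sample pairs whose signs differ) and assumes that frames above the threshold are treated as unvoiced. The function names and the threshold are hypothetical.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (zero counts as positive)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def is_unvoiced(frame, zcr_threshold):
    """Assumed rule: a frame whose rate exceeds the threshold is unvoiced, else silent."""
    return zero_crossing_rate(frame) > zcr_threshold


noisy = [3, -2, 4, -5, 2, -3, 1, -4]
quiet = [1, 1, 2, 1, 1, 0, 1, 1]
print(zero_crossing_rate(noisy))  # 1.0
print(zero_crossing_rate(quiet))  # 0.0
```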

Step 304: send the target audio data to the server to obtain from the server the text produced by speech recognition of the target audio data.

Note that the target audio data sent to the server in step 304 is the data that remains after the silent part has been removed.

In this embodiment, step 304 is implemented in the same way as step 103 in the foregoing embodiment, and the details are not repeated here.

In the speech processing method of this embodiment, the original encoded data obtained by speech sampling is decoded to obtain decoded audio data. If it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold, the decoded audio data is down-sampled to obtain target audio data. Voice endpoint detection is performed on the target audio data to extract the voiced and unvoiced parts and remove the silent part, and the target audio data is sent to the server to obtain from the server the text produced by speech recognition of the target audio data. Because the silent part is removed before the target audio data is sent to the server, the amount of audio data transmitted is reduced, which helps increase the data transmission rate.

In one possible case, after the decoded audio data is down-sampled to obtain the target audio data and before the target audio data is sent to the server, the bit rate of the target audio data may also be compared with a set bit rate to determine which coding mode to use for compression coding of the target audio data. The compression-coded target audio data is then sent to the server. This process is described in detail below with reference to FIG. 4, a schematic flowchart of a fourth speech processing method provided by an embodiment of the present application.

As shown in FIG. 4, before step 103 above, the method may further include the following steps:

Step 401: decode the original encoded data obtained by speech sampling to obtain decoded audio data.

Step 402: if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold, down-sample the decoded audio data to obtain target audio data.

In this embodiment, steps 401 and 402 are implemented in the same way as steps 101 and 102 in the foregoing embodiment, and the details are not repeated here.

Step 403: compare the bit rate of the target audio data with the set bit rate.

The bit rate of audio data is the amount of binary data per unit time after an analog sound signal has been converted into a digital sound signal, and it is an indirect measure of audio quality. The higher the bit rate, the better the audio quality but the larger the encoded audio file; the lower the bit rate, the worse the audio quality but the smaller the encoded audio file.
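For uncompressed PCM audio, the bit rate follows directly from the sampling parameters: sampling rate multiplied by the number of sampling bits multiplied by the number of channels. The sketch below works through that arithmetic; the two example formats are illustrative assumptions and do not come from the application.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


def pcm_size_bytes(bit_rate_bps: int, seconds: int) -> int:
    """Size in bytes of a PCM stream of the given duration."""
    return bit_rate_bps * seconds // 8


high = pcm_bit_rate(48_000, 16, 2)   # 48 kHz, 16-bit, two channels
low = pcm_bit_rate(16_000, 16, 1)    # 16 kHz, 16-bit, one channel
print(high, low)                                         # 1536000 256000
print(pcm_size_bytes(high, 10), pcm_size_bytes(low, 10))  # 1920000 320000
```

Under these assumed formats, ten seconds of audio shrinks from 1,920,000 bytes to 320,000 bytes before any compression coding is applied.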

It can be understood that if the target audio data were transmitted to the server directly, without compression coding, it would occupy a great deal of bandwidth, and the large amount of data would put pressure on audio data transmission and storage. Therefore, after the target audio data is obtained, it can be compression-coded to reduce the size of the audio file during transmission and thus reduce the amount of data transmitted.

For example, sending audio data that has been down-sampled and compression-coded reduces the amount of data transmitted by about 80% compared with sending the decoded audio data directly.

In this embodiment, after the target audio data is obtained, its bit rate is compared with the set bit rate to determine how the target audio data is to be compression-coded.

As one possible implementation, after the target audio data is obtained, it may be compression-coded using Opus.

Opus combines two coding algorithms, SILK and CELT. It has low algorithmic delay and a very high compression ratio. Both the encoder and the decoder use filters provided by Broadcom; during encoding, the pre-filter preserves the low-frequency part of the audio signal and attenuates the high-frequency part, which improves coding efficiency.

Opus can switch seamlessly between high and low bit rates: inside the encoder it uses linear predictive coding at lower bit rates and transform coding at higher bit rates. Therefore, in this embodiment, the bit rate of the target audio data is compared with the set bit rate to determine which coding mode to use for compression coding.

Step 404: if the bit rate of the target audio data is lower than the set bit rate, perform compression coding using linear predictive coding.

In one possible case, the bit rate of the target audio data is lower than the set bit rate, and the target audio data is compression-coded using linear predictive coding.

Linear predictive coding is a tool used mainly in audio signal processing and speech processing to represent the spectral envelope of a digital speech signal in compressed form, based on the information of a linear prediction model.
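To illustrate what a linear prediction model is, the sketch below estimates predictor coefficients with the textbook autocorrelation method and the Levinson-Durbin recursion. This is a generic illustration, not the encoder's internal implementation and not a procedure given in the application; the function names and the test signal are assumptions.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x[n] * x[n - lag], for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def lpc_coefficients(x, order):
    """Return a[1..order] such that x[n] is approximated by sum_k a[k] * x[n - k]."""
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error == 0:
            break  # all-zero signal: nothing to predict
        # Reflection coefficient for this model order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1 - k * k
    return a[1:]


# A decaying exponential is predicted exactly by one coefficient: x[n] = 0.9 * x[n-1].
signal = [0.9 ** n for n in range(200)]
coeffs = lpc_coefficients(signal, order=1)
residual = [signal[n] - coeffs[0] * signal[n - 1] for n in range(1, len(signal))]
print(round(coeffs[0], 3))  # 0.9
```

Because the residual is close to zero here, a coder would only need to transmit the coefficient and a very small error signal instead of every sample, which is the compression idea behind linear predictive coding.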

Step 405: if the bit rate of the target audio data is not lower than the set bit rate, perform compression coding using transform coding.

Transform coding does not encode the source signal directly (the classic description uses a spatial-domain image signal; for audio, the source is a time-domain signal). Instead, it first maps the signal into another orthogonal vector space (the transform domain or frequency domain), producing a set of transform coefficients, and then encodes those coefficients. Transform coding is an indirect coding method. The key point is that data described in the time domain or spatial domain are highly correlated and highly redundant, whereas after transformation, described in the transform domain, the correlation and redundancy are greatly reduced, the parameters are independent, and the amount of data is small. Quantizing and encoding the coefficients therefore yields a high compression ratio.
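The energy-compaction idea behind transform coding can be illustrated with a small discrete cosine transform (DCT). The choice of the DCT-II, the 8-sample block, and the rule of keeping only the largest coefficients are assumptions for illustration; the application does not name a specific transform or quantization scheme.

```python
import math


def dct(x):
    """Unnormalized DCT-II of a block."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]


def idct(c):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(c)
    return [(c[0] / 2 + sum(c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                            for k in range(1, n))) * 2 / n
            for i in range(n)]


def keep_largest(coeffs, count):
    """Zero all but the `count` largest-magnitude coefficients."""
    ranked = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(ranked[:count])
    return [c if k in kept else 0.0 for k, c in enumerate(coeffs)]


# A smooth 8-sample block: a constant offset plus one slow cosine.
signal = [10 + 4 * math.cos(math.pi * (i + 0.5) / 8) for i in range(8)]
coeffs = dct(signal)
approx = idct(keep_largest(coeffs, 2))
max_error = max(abs(a - b) for a, b in zip(signal, approx))
print(max_error < 1e-9)  # True
```

All eight time-domain samples are non-zero and strongly correlated, yet in the transform domain only two coefficients carry the block, so only those two need to be quantized and encoded.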

Step 406: send the compression-coded target audio data to the server to obtain from the server the text produced by speech recognition of the target audio data.

In the speech processing method of this embodiment, the original encoded data obtained by speech sampling is decoded to obtain decoded audio data. If it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold, the decoded audio data is down-sampled to obtain target audio data. If the bit rate of the target audio data is lower than the set bit rate, compression coding is performed using linear predictive coding; if the bit rate of the target audio data is not lower than the set bit rate, compression coding is performed using transform coding. The compression-coded target audio data is sent to the server to obtain from the server the text produced by speech recognition of the target audio data. Because the audio data is compression-coded before it is transmitted, the volume of data transmitted is reduced and the data transmission rate is increased.
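The choice between the two coding modes in steps 404 and 405 can be sketched as a single comparison. The enum and function names are hypothetical, and the sketch only selects a mode; it does not call any real codec library.

```python
from enum import Enum


class CodingMode(Enum):
    LINEAR_PREDICTION = "linear_prediction"
    TRANSFORM = "transform"


def choose_coding_mode(bit_rate_bps: int, set_bit_rate_bps: int) -> CodingMode:
    """Below the set bit rate: linear predictive coding; otherwise: transform coding."""
    if bit_rate_bps < set_bit_rate_bps:
        return CodingMode.LINEAR_PREDICTION
    return CodingMode.TRANSFORM


print(choose_coding_mode(12_000, 16_000).value)  # linear_prediction
print(choose_coding_mode(16_000, 16_000).value)  # transform
```

A bit rate equal to the set bit rate falls into the transform coding branch, matching the "not lower than" wording of step 405.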

It should be noted that the foregoing embodiments can be combined. The original encoded data obtained by speech sampling is decoded to obtain decoded audio data, and if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold, the decoded audio data is down-sampled to obtain target audio data. If the target audio data is determined to include two-channel data, the data of one channel is removed, and voice endpoint detection is then performed on the resulting mono target audio data to remove the silent part. Further, the audio data remaining after the silent part is removed is compression-coded, and the compression-coded audio data is sent to the server. This reduces the amount of data transmitted and improves data transmission efficiency.

To implement the above embodiments, the present application further provides a speech processing apparatus.

As shown in FIG. 5, the speech processing apparatus 500 may include a decoding module 510, a down-sampling module 520, and a sending module 530.

The decoding module 510 is configured to decode the original encoded data obtained by speech sampling to obtain decoded audio data.

The down-sampling module 520 is configured to down-sample the decoded audio data to obtain target audio data if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold.

The sending module 530 is configured to send the target audio data to the server to obtain from the server the text produced by speech recognition of the target audio data.

As one possible case, the down-sampling module 520 may further be configured to:

down-sample the decoded audio data using a Synchronous Sample Rate Conversion (SSRC) algorithm.

As another possible case, the down-sampling module 520 may further be configured to:

filter a sequence of set length in the decoded audio data with a finite impulse response (FIR) filter;

append a target sequence of the set length, each element of which is zero, to the filtered sequence of set length to obtain the input sequence of a Fourier transform;

perform a fast Fourier transform on the input sequence to obtain a frequency-domain sequence;

filter the frequency-domain sequence and then perform an inverse fast Fourier transform to obtain a time-domain sequence;

resample the time-domain sequence according to a set down-sampling rate to obtain the target audio data.
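The five operations listed above can be sketched for a single block as follows. The application does not specify the FIR taps, the shape of the frequency-domain filter, or the block length, so the three-tap smoothing filter, the brick-wall low-pass at the new Nyquist frequency, and the power-of-two block length (required by this simple radix-2 FFT) are all assumptions. This is an illustration of the listed steps, not a reference SSRC implementation.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT via the conjugation identity."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]


def downsample_block(block, factor, fir_taps=(0.25, 0.5, 0.25)):
    n = len(block)
    if n == 0 or n & (n - 1):
        raise ValueError("block length must be a power of two in this sketch")
    # 1. FIR-filter the set-length sequence (direct convolution, zero initial state).
    filtered = [sum(tap * block[i - j] for j, tap in enumerate(fir_taps) if i >= j)
                for i in range(n)]
    # 2. Append an all-zero sequence of the same length.
    padded = filtered + [0.0] * n
    # 3. Fast Fourier transform.
    spectrum = fft(padded)
    # 4. Frequency-domain filtering: zero every bin at or above the new Nyquist limit.
    m = len(padded)
    cutoff = m // (2 * factor)
    for k in range(m):
        if min(k, m - k) >= cutoff:
            spectrum[k] = 0j
    # 5. Inverse transform, then keep every `factor`-th sample of the original span.
    time_domain = [v.real for v in ifft(spectrum)]
    return time_domain[:n:factor]


def energy(x):
    return sum(v * v for v in x)


low = [math.cos(2 * math.pi * 2 * i / 64) for i in range(64)]    # below the new Nyquist
high = [math.cos(2 * math.pi * 28 * i / 64) for i in range(64)]  # above the new Nyquist
low_out = downsample_block(low, 2)
high_out = downsample_block(high, 2)
print(len(low_out), energy(high_out) < 0.1 * energy(low_out))  # 32 True
```

The low-frequency tone survives with half as many samples, while the tone above the new Nyquist frequency is almost entirely suppressed, which is what prevents aliasing when the sampling rate is reduced.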

As another possible case, the speech processing apparatus 500 may further include:

a removing module, configured to remove the data of one channel from two-channel data if the target audio data includes two-channel data.

As another possible case, the removing module may further be configured to:

determine the data length occupied by a single channel's data in the target audio data;

at intervals of that data length, remove from the target audio data a segment of data of that length.

A detection module, configured to perform voice endpoint detection on the target audio data to extract the voiced part and the unvoiced part from the target audio data and remove the silent part;

wherein the energy value of the voiced part is greater than a first energy threshold;

the energy value of the unvoiced part is greater than a second energy threshold;

the first energy threshold is greater than the second energy threshold.

A compression coding module, configured to perform compression coding using linear predictive coding if the bit rate of the target audio data is lower than a set bit rate, and to perform compression coding using transform coding if the bit rate of the target audio data is not lower than the set bit rate.

It should be noted that the foregoing explanation of the speech processing method embodiments also applies to the speech processing apparatus of this embodiment and is not repeated here.

The speech processing apparatus of this embodiment decodes the original encoded data obtained by speech sampling to obtain decoded audio data; if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold, it down-samples the decoded audio data to obtain target audio data; and it sends the target audio data to the server to obtain from the server the text produced by speech recognition of the target audio data. By down-sampling audio data that has a high sampling rate and/or a high number of sampling bits and then transmitting the down-sampled target audio data to the server to obtain the recognized text, the amount of data transmitted is reduced and the data transmission rate is increased.

To implement the above embodiments, the present application further provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the speech processing method of the above embodiments is implemented.

To implement the above embodiments, the present application further provides a non-transitory computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the speech processing method of the above embodiments is implemented.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, such expressions do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

In addition, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality of" means at least two, for example two or three, unless expressly and specifically defined otherwise.

Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logical function or process. The scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.

The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that the parts of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will understand that all or some of the steps of the methods of the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.

In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present application have been shown and described above, it should be understood that these embodiments are exemplary and should not be construed as limiting the present application. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present application.

Claims

1. A method of speech processing, the method comprising:

decoding original coded data obtained by voice sampling to obtain decoded audio data;

if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold, down-sampling the decoded audio data to obtain target audio data;

and sending the target audio data to a server side so as to obtain a text obtained by voice recognition of the target audio data from the server side.

2. The speech processing method of claim 1, wherein the downsampling the decoded audio data comprises:

down-sampling the decoded audio data using a Synchronous Sample Rate Conversion (SSRC) algorithm.

3. The speech processing method of claim 2 wherein the down-sampling the decoded audio data using a Synchronous Sample Rate Conversion (SSRC) algorithm comprises:

filtering a sequence of set length in the decoded audio data with a finite impulse response (FIR) filter;

appending a target sequence of the set length to the filtered sequence of set length to obtain an input sequence of a Fourier transform, wherein the value of each element in the target sequence is zero;

performing fast Fourier transform on the input sequence to obtain a frequency domain sequence;

filtering the frequency domain sequence, and performing inverse fast Fourier transform to obtain a time domain sequence;

and resampling the time domain sequence according to a set down-sampling rate to obtain the target audio data.

4. The speech processing method according to claim 1, wherein before sending the target audio data to the server, the method further comprises:

if the target audio data comprises two-channel data, removing the data of one channel from the two-channel data.

5. The speech processing method of claim 4, wherein said removing the data of one channel from the two-channel data comprises:

determining the data length occupied by single-channel data in the target audio data;

and, at intervals of the data length, removing from the target audio data a segment of data of that data length.

6. The speech processing method according to claim 1, wherein before sending the target audio data to the server, the method further comprises:

performing voice endpoint detection according to the target audio data to extract a voiced part and an unvoiced part from the target audio data and remove a silent part;

wherein the energy value of the voiced part is greater than a first energy threshold;

the energy value of the unvoiced part is greater than a second energy threshold;

the first energy threshold is greater than the second energy threshold.

7. The speech processing method according to any one of claims 1 to 6, wherein before sending the target audio data to the server, the method further comprises:

if the bit rate of the target audio data is lower than the set bit rate, performing compression coding by adopting a linear prediction coding mode;

and if the bit rate of the target audio data is not lower than the set bit rate, performing compression coding by adopting a transform coding mode.

8. A speech processing apparatus, comprising:

the decoding module is used for decoding the original coded data obtained by voice sampling to obtain decoded audio data;

the down-sampling module is used for down-sampling the decoded audio data to obtain target audio data if the sampling rate and/or the sampling bit number of the decoded audio data are/is determined to be greater than a set threshold value;

and the sending module is used for sending the target audio data to a server so as to obtain a text obtained by voice recognition of the target audio data from the server.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the speech processing method according to any of claims 1-7 when executing the program.

10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the speech processing method according to any one of claims 1 to 7.