
CN114822574A - Human voice recognition and enhancement method, device and storage medium for speech equipment - Google Patents


Info

Publication number: CN114822574A (application CN202210469220.3A)
Authority: CN (China)
Prior art keywords: signal, sound, noise, voice, frequency
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 汤凯, 任崇瀚
Current assignee: Nanjing Yaoze Electronic Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Nanjing Yaoze Electronic Technology Co ltd
Application filed by Nanjing Yaoze Electronic Technology Co ltd; priority to CN202210469220.3A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 — Speaker identification or verification techniques
    • G10L 21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 — Noise filtering
    • G10L 21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232 — Processing in the frequency domain
    • G10L 2021/02082 — Noise filtering, the noise being echo or reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

Embodiments of the present invention disclose a human voice recognition and enhancement method, device and storage medium for speech equipment, relating to the field of communication technology. They can be applied to emergency-rescue scenarios in which the voice from a walkie-talkie must be amplified and enhanced inside a face mask. The invention includes: performing echo cancellation on the collected sound signal with the NLMS algorithm, where the collected sound signal contains both environmental noise and a voice signal; performing spectrum analysis on the sound processed in step 1 and extracting the sound signal that matches human-voice characteristics; enhancing the sound signal that matches human-voice characteristics with an IIR filter; and sending the enhanced sound signal to the intercom module.

Description

Human voice recognition and enhancement method, device and storage medium for speech equipment

Technical Field

The present invention relates to the field of communication technologies, and in particular to a human voice recognition and enhancement method, device and storage medium for speech equipment.

Background

At present, walkie-talkies are widely used for voice communication in on-site coordination during emergency rescue, and a shoulder-microphone device is usually needed when carrying out tasks: once connected to the walkie-talkie, the user can talk with the device clipped to the shoulder without holding the walkie-talkie. In other words, when a walkie-talkie is used on urgent missions, an additional shoulder microphone is required for convenient real-time communication.

As a result, rescuers must carry a relatively large amount of communication equipment, the operating environment is complex, and both the call quality and the denoising performance are poor. In addition, because walkie-talkies serve special application scenarios, ultra-long standby time and reliability are the most important requirements, so such communication equipment generally adopts low-power, low-performance but relatively reliable system architectures. The computing power of these architectures is limited, so many current sound-optimization algorithms are not suitable for implementation on them. Moreover, there is still no effective processing method for emergency-rescue scenarios in which the walkie-talkie's voice must be amplified and enhanced inside a face mask.

Summary of the Invention

Embodiments of the present invention provide a human voice recognition and enhancement method, device and storage medium for speech equipment, so that they can be applied to emergency-rescue scenarios in which the voice of a walkie-talkie must be amplified and enhanced inside a face mask.

To achieve the above object, the embodiments of the present invention adopt the following technical solutions:

In a first aspect, the method provided by an embodiment of the present invention includes:

Step 1. Perform echo cancellation on the collected sound signal with the NLMS algorithm, where the collected sound signal contains both environmental noise and a voice signal;

Step 2. Perform spectrum analysis on the sound processed in Step 1, and extract the sound signal that matches human-voice characteristics;

Step 3. Enhance the sound signal that matches human-voice characteristics with an IIR filter;

Step 4. Send the enhanced sound signal to the intercom module.

The method further includes: playing and recording the enhanced sound signal, and then using the recorded sound data as the reference when the NLMS algorithm performs echo cancellation on the next collected sound signal.

Step 1 includes: feeding the collected sound signal into a codec for encoding; the encoded sound signal is stored in the speech-and-noise buffer and the recorded sound data is stored in the echo reference buffer, where the processor extracts data from the speech-and-noise buffer and the echo reference buffer.

m buffers are established in the speech-and-noise buffer area. While the n-th buffer is recording the sound signal, the processor is simultaneously processing the data of buffer [(n+m-1) mod m] and playing the data of buffer [(n+m-2) mod m], where buffer [(n+m-1) mod m] is the buffer immediately before the n-th buffer and buffer [(n+m-2) mod m] is the buffer two positions before it.

Performing echo cancellation on the collected sound signal with the NLMS algorithm includes: after the collected sound signal has been processed by the codec, applying NLMS normalized filtering with the system output signal as the reference, where the data output by the codec is a 24-bit signed integer type; then applying NLMS normalized filtering to the main signal with the noise signal as the reference to obtain a preliminarily denoised signal; and applying IIR notch filtering to the denoised signal, where the IIR filter runs at a frequency specified at compile time.
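A fixed-frequency notch of this kind can be realized as a second-order IIR (biquad) section. The sketch below uses the standard RBJ biquad notch design; the sample rate, notch frequency and Q below are illustrative assumptions standing in for the patent's compile-time constants, not values taken from the source.

```python
import math

# Illustrative "compile-time" constants (assumptions, not from the patent).
FS = 8000.0        # sample rate, Hz
NOTCH_HZ = 1000.0  # frequency to suppress
Q = 5.0            # notch sharpness

def notch_coeffs(f0=NOTCH_HZ, fs=FS, q=Q):
    """RBJ biquad notch, normalized so a[0] == 1."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a

def notch_filter(x, b, a):
    """Direct-form-I second-order IIR filter over a sample list."""
    y = []
    x1 = x2 = y1 = y2 = 0.0   # delay-line state
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y
```

A tone at the notch frequency is driven toward zero after the filter's transient, while tones well outside the notch bandwidth pass through almost unchanged.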

The spectrum analysis includes: determining the sound threshold from the maximum power density (also called the single-tone intensity); distinguishing sound signals belonging to wind noise from sound signals belonging to the human voice by the signal-to-noise ratio; and distinguishing sound signals belonging to breathing-valve resonance from sound signals belonging to microphone-popping resonance by the concentration of the power spectrum.

The IIR enhancement processing includes: collecting 1024 samples at a time as one cache and, in the chronological order in which the caches were recorded, taking out four caches to form a set of 4096 samples; this set of samples is then enhanced through a voice-activity statistics function.
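The windowing step above can be sketched as follows: the four most recent 1024-sample caches are concatenated, oldest first, into one 4096-sample analysis set. The function name is a hypothetical illustration; the source does not name this helper.

```python
# Assemble one 4096-sample analysis window from the four most recent
# 1024-sample caches, in chronological (oldest-first) order.
def analysis_window(caches):
    assert len(caches) >= 4 and all(len(c) == 1024 for c in caches[-4:])
    window = []
    for c in caches[-4:]:   # oldest of the four first
        window.extend(c)
    return window           # 4096 samples
```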

The IIR filter is a five-band sound equalizer; through the IIR filter, the sound signal matching human-voice characteristics is enhanced according to the different frequency bands, where the frequency bands are: sub-bass at 40-200 Hz, mid-bass at 200-800 Hz, mid-treble at 800-1600 Hz, high treble at 1600-4000 Hz, and overtones above 4000 Hz.
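The five-band split can be sketched as a band lookup plus one peaking-EQ biquad per band. The coefficient formula below is the standard RBJ peaking-EQ design, used here as a plausible stand-in; the sample rate, Q and any per-band gains are assumptions, since the patent does not give concrete filter parameters.

```python
import math

# The five equalizer bands named above (Hz); None = open-ended top band.
BANDS_HZ = [(40, 200), (200, 800), (800, 1600), (1600, 4000), (4000, None)]

def band_index(freq_hz):
    """Return the 0-based equalizer band for a frequency, or -1 below 40 Hz."""
    for i, (lo, hi) in enumerate(BANDS_HZ):
        if freq_hz >= lo and (hi is None or freq_hz < hi):
            return i
    return -1

def peaking_coeffs(f0, gain_db, fs=8000.0, q=1.0):
    """RBJ peaking-EQ biquad for one band, normalized so a[0] == 1."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha / A
    b = [(1.0 + alpha * A) / a0, -2.0 * math.cos(w0) / a0, (1.0 - alpha * A) / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha / A) / a0]
    return b, a
```

A sanity property of the design: at 0 dB gain the numerator and denominator coincide, so the band section is an all-pass identity.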

In a second aspect, the device provided by an embodiment of the present invention includes:

a preprocessing module, configured to perform echo cancellation on the collected sound signal with the NLMS algorithm, where the collected sound signal contains both environmental noise and a voice signal;

an analysis module, configured to perform spectrum analysis on the sound processed in step 1 and to extract the sound signal that matches human-voice characteristics;

an enhancement module, configured to enhance the sound signal that matches human-voice characteristics with the IIR filter;

a transmission module, configured to send the enhanced sound signal to the intercom module.

In a third aspect, an embodiment of the present invention provides a storage medium storing a computer program or instructions which, when executed, implement the method in this embodiment.

With the human voice recognition and enhancement method, device and storage medium for speech equipment provided by the embodiments of the present invention, echo suppression is active at all times, whether or not an auxiliary microphone is present. After A/D conversion, the collected signal is output as a 24-bit signed integer. The noise and the main signal are first NLMS-filtered with the system output signal as the reference, and the main signal is then NLMS-filtered with the noise signal as the reference to obtain a preliminarily denoised signal. The denoised signal is passed through an IIR notch filter, whose frequency is specified at compile time, to remove noise at specific frequencies. The method can therefore be conveniently applied to emergency-rescue scenarios in which the voice of a walkie-talkie must be amplified and enhanced inside a face mask.

Description of the Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the system architecture provided by an embodiment of the present invention;

Figs. 2 and 3 are schematic diagrams of the principle provided by an embodiment of the present invention;

Fig. 4 is a schematic flowchart of human voice recognition and enhancement provided by an embodiment of the present invention;

Fig. 5 is a schematic flowchart of spectrum comparison provided by an embodiment of the present invention;

Fig. 6 is a schematic flowchart of the method provided by an embodiment of the present invention;

Fig. 7 is a schematic flowchart of a possible spectrum analysis provided by an embodiment of the present invention.

Detailed Description

To enable those skilled in the art to better understand the technical solutions of the present invention, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Embodiments of the present invention are described in detail below, with examples shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary and serve only to explain the present invention; they are not to be construed as limiting it. Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the" and "said" as used here may also include the plural forms. It should be further understood that the word "comprising" used in this description refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. Furthermore, "connected" or "coupled" as used here may include wireless connection or coupling. The term "and/or" as used here includes any and all combinations of one or more of the associated listed items. Those skilled in the art will understand that, unless otherwise defined, all terms used here (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless defined as they are here, are not to be interpreted in an idealized or overly formal sense.

An embodiment of the present invention provides a human voice recognition and enhancement method for speech equipment, as shown in Fig. 6, including:

Step 1. Perform echo cancellation on the collected sound signal with the NLMS algorithm.

The collected sound signal contains both environmental noise and a voice signal. The NLMS (Normalized Least Mean Square) algorithm is a real-time processing algorithm that, given a reference whose noise characteristics are known, filters the noise out of a target signal; it has the advantages of fast convergence and low computational load.
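The NLMS recursion can be sketched in a few lines: the adaptive filter estimates the reference-correlated component of the observed signal (here, the echo of the played-back sound) and subtracts it. This is a minimal illustrative sketch, not the patent's firmware; the tap count and step size below are assumed values.

```python
# Minimal NLMS echo-cancellation sketch.
#   d: observed (desired) signal, e.g. the microphone input
#   x: reference signal, e.g. the played-back output whose echo should go away
# Returns the error signal e = d - y, i.e. the input with the
# reference-correlated component suppressed.
def nlms_cancel(d, x, taps=8, mu=0.5, eps=1e-8):
    w = [0.0] * taps                                  # adaptive weights
    e = []
    for n in range(len(d)):
        # most recent reference samples x[n], x[n-1], ... (zero before t=0)
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * uk for wk, uk in zip(w, u))      # echo estimate
        err = d[n] - y
        norm = eps + sum(uk * uk for uk in u)         # normalization term
        w = [wk + (mu / norm) * err * uk for wk, uk in zip(w, u)]
        e.append(err)
    return e
```

When the observed signal really is a filtered copy of the reference (a pure echo), the error converges toward zero, which is the behavior the echo-cancellation stage relies on.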

Step 2. Perform spectrum analysis on the sound processed in Step 1, and extract the sound signal that matches human-voice characteristics.

Step 3. Enhance the sound signal that matches human-voice characteristics with the IIR filter.

Here, IIR refers to a digital filter: a linear time-invariant system that filters digital signals. Digital filtering is essentially a computation that processes a signal: the input digital sequence is transformed into the output digital sequence through a specific operation, so a digital filter is in essence a numerical process that carries out a particular computation, and can also be understood as a small computer. The convolution and difference equations describing the relationship between the output and the input of a discrete system simply give the digital filter its computation rule, according to which it processes the input data.

Step 4. Send the enhanced sound signal to the intercom module.

This embodiment further includes: playing and recording the enhanced sound signal, the recording serving as the echo-cancellation reference in the next cycle; the specific sound data is recorded in the echo reference buffer.

The recorded sound data is then used by the NLMS algorithm to cancel the echo in the collected sound signal. For example: the enhanced sound signal 1 is played and stored in the echo reference buffer, where sound signal 1 is the sound enhanced after noise cancellation and echo cancellation, i.e. the sound that can be played and sent to the intercom module. If there are a main microphone and a secondary microphone, sound signal 2 picked up by the main microphone contains voice signal 2-1, echo signal 3 (the echo produced when the above sound signal 1 is played) and environmental noise 4, and is stored in the speech-and-noise buffer after A/D conversion; sound signal 5 picked up by the secondary microphone contains echo signal 3 and environmental noise 4, and is likewise stored in the speech-and-noise buffer after A/D conversion.

The echo-cancellation procedure follows Fig. 4; echo suppression is active at all times, whether or not an auxiliary microphone is present. For example, performing echo cancellation on the collected sound signal with the NLMS algorithm includes: after A/D conversion the collected signal is output as a 24-bit signed integer, the noise and the main signal are NLMS-filtered with the system output signal as the reference, and the main signal is then NLMS-filtered with the noise signal as the reference to obtain a preliminarily denoised signal. The system output signal may be sound signal 1 played by the system. In the dual-microphone (main/secondary) architecture shown in Fig. 2, the sound signals collected by both microphones contain the echo signal, the environmental noise signal and the voice signal; because the main microphone is close to the user's mouth and nose, the voice signal it collects is strong, while the secondary microphone, being farther from the user's mouth and nose, collects a weaker voice signal than the main microphone. The IIR filter is an infinite impulse response filter running at a frequency specified at compile time; if noise at a particular frequency must be removed from the signal, then after NLMS processing the signal is fed into an IIR notch filter whose parameters were fixed at programming time. The processing principle can be as shown in Fig. 2: S1: the signal picked up by the main microphone is processed together with the aforementioned sound signal 1 by the NLMS algorithm to remove echo signal 3. S2: the signal picked up by the secondary microphone is processed together with sound signal 1 by the NLMS algorithm to remove echo signal 3. S3: the two echo-free signals from S1 and S2 are processed by the NLMS algorithm to obtain voice signal 2-1, which is then IIR notch-filtered to remove noise at specific frequencies (S4), followed by an FFT and spectrum analysis (S5). The spectrum analysis mainly distinguishes human voice, microphone popping, breathing-valve noise and white noise; if a human voice is present, it is enhanced (S6) and then sent to the intercom module and/or played locally (S7).

For another example, if there is no secondary microphone, NLMS is applied to the main signal with the environmental noise signal as the reference, as shown in Fig. 3: the enhanced sound signal 1 is played and stored in the echo reference buffer; sound signal 2 picked up by the main microphone contains voice signal 2-1, echo signal 3 and environmental noise 4 and is stored in the speech-and-noise buffer after A/D conversion. S01: the signal picked up by the main microphone is processed together with the aforementioned sound signal 1 by the NLMS algorithm to remove echo signal 3. S02: the signal obtained in S01 is IIR notch-filtered to remove noise at specific frequencies, followed by an FFT and spectrum analysis (S03). The spectrum analysis mainly distinguishes human voice, environmental noise, microphone popping, breathing-valve noise and white noise (in this embodiment, white noise is noise without distinct spectral features, caused by errors introduced during A/D conversion); if a human voice is present, it is enhanced (S04) and then sent to the intercom module and/or played locally (S05).

At present the reference signal is the output signal after speech processing is finished; in subsequent walkie-talkie designs it will need to be taken from the output signal of the intercom module. In this embodiment, echo suppression is active at all times, whether or not an auxiliary microphone is present. After A/D conversion the collected signal is output as a 24-bit signed integer; the noise and the main signal are NLMS-filtered with the system output signal as the reference to obtain a preliminarily denoised signal, and optionally, in the dual-microphone architecture, the main signal can also be NLMS-filtered with the noise signal as the reference. The denoised signal is then IIR notch-filtered at a frequency specified at compile time to remove noise at specific frequencies (such as single-frequency mechanical sounds) and fed into a CFFT to separate the frequency features. Applied to an actual walkie-talkie, this achieves a voice-activated PTT function with good call quality and no need for a shoulder microphone.

In this embodiment, step 1 includes: feeding the collected sound signal into a codec (for example a WM8978) for encoding; the encoded sound signal is stored in the speech-and-noise buffer and the recorded sound data in the echo reference buffer. This human voice recognition and enhancement method works by analysing the difference in frequency characteristics between the human voice and natural noise. As shown in Fig. 1, audio encoding is done by the WM8978, the data stream is transmitted full-duplex over SAI, and the signal is processed in real time by the ARM DSP. Specifically, the processor extracts data from the speech-and-noise buffer and the echo reference buffer; for example, the encoded/decoded sound signals 2 and 5 are stored in the speech-and-noise buffer, while sound signal 1 is stored in the echo reference buffer.

In this embodiment, m buffers are established in the speech-and-noise buffer area. While the n-th buffer is recording the sound signal, the processor is simultaneously processing the data of buffer [(n+m-1) mod m] and playing the data of buffer [(n+m-2) mod m], where buffer [(n+m-1) mod m] is the buffer immediately before the n-th buffer and buffer [(n+m-2) mod m] is the buffer two positions before it. Here "mod" denotes the remainder operation; in the preferred solution m is 4. While one buffer is recording a sound signal, the processor is processing the previous buffer's data and playing the data (sound signal 1) from the buffer two positions back; this cache-writing scheme effectively reduces the processor's load. In this embodiment, full-duplex transfer is used and input and output share the same buffer, so recording and playback always finish at the same time: recording is received as the master and playback is sent as the slave, the playback data stream being synchronized to the recording data stream, so the playback buffer address should be mounted on the DMA before the recording buffer is mounted. With multiple ring buffers, if sound is currently being recorded in the n-th buffer, then the data of buffer [(n+m-1) mod m] must be being processed, and if sound is being played it may be the data of buffer [(n+m-1) mod m] or [(n+m-2) mod m] (lowest-latency mode).
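The modular index arithmetic above can be sketched directly; this is an illustrative helper (the name is hypothetical) showing which buffer records, which is processed, and which is played at any step, with m = 4 as in the preferred solution.

```python
# Roles of the ring buffers while buffer n is recording:
# buffer (n+m-1) mod m is being processed and (n+m-2) mod m is being played.
def buffer_roles(n, m=4):
    return {
        "record": n % m,
        "process": (n + m - 1) % m,   # the buffer just before n
        "play": (n + m - 2) % m,      # two buffers before n
    }
```

For example, with m = 4, while buffer 0 records, buffer 3 is processed and buffer 2 is played, so DMA, processing and playback never touch the same region at once.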

Optionally, the data of buffers [(n+m-2) mod m] and [(n+m-3) mod m] can also be played at the same time, preventing partial loss of speech and giving a better playback effect.

Specifically, "input and output use the same buffer" means the following: the speech-signal-and-noise buffer region is divided into m buffer areas; while one buffer area is recording the sound signal (the input), the processor is processing the data of the previous buffer area and playing the data from two buffer areas back (sound signal 1, the output). Since the data in the buffer areas is continuously overwritten, input and output share the same buffer.
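The modular index arithmetic above can be sketched as follows; the function name and the 0-based indexing are illustrative assumptions, not from the patent.

```python
def buffer_roles(n, m=4):
    """For m ring buffers with buffer n currently recording (0-based),
    return the buffer being processed (one back) and the buffer being
    played (two back), per the [(n+m-1) mod m] / [(n+m-2) mod m] rule."""
    processing = (n + m - 1) % m   # previous buffer: being denoised/analyzed
    playing = (n + m - 2) % m      # buffer before that: being played back
    return processing, playing
```

With the preferred m = 4 and buffer 0 recording, buffer 3 is being processed and buffer 2 played, so each captured block is played two periods after capture.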

In this embodiment, the spectrum analysis, shown in FIG. 5, proceeds as follows. The sound signal processed in step 1 undergoes CFFT processing, converting it from a time-domain signal into a frequency-domain signal. In the voice region of the sound signal, the frequency of maximum energy concentration is determined and the energy intensity at that frequency is recorded. The amplitude-frequency characteristic information of the voice region is acquired, namely the mean and the standard deviation of the voice-region amplitude-frequency values. In the noise region of the sound signal, the frequency of maximum energy concentration is likewise determined and the energy intensity at that frequency recorded. The effective sound intensity, the dispersion, the speech signal-to-noise ratio and the total signal-to-noise ratio are then compared against reference data to make a preliminary judgment on whether a cache contains human voice, where: effective sound intensity = maximum single-tone energy intensity of the voice region; dispersion = voice-region amplitude-frequency standard deviation / voice-region amplitude-frequency mean; speech signal-to-noise ratio = maximum single-tone energy intensity of the voice region / voice-region amplitude-frequency mean; total signal-to-noise ratio = maximum single-tone energy intensity of the voice region / maximum single-tone energy intensity of the noise region. If the preliminary judgment is that human voice is present, the TTL stabilizer is triggered once, and the number of times the TTL stabilizer is triggered is positively correlated with the confidence. In FIG. 5, "&" denotes an AND gate and "&." denotes a NAND gate.

In a specific implementation, the flow shown in FIG. 7 can be used: record a cache and apply CFFT (Complex Fast Fourier Transform) processing to convert the signal from the time domain to the frequency domain; find the frequency of maximum energy concentration in the voice region and record the energy intensity at that frequency; compute the amplitude-frequency characteristics of the voice region and obtain their mean and standard deviation; find the frequency of maximum energy concentration in the noise region and record the energy intensity at that frequency; compare the dimensionless parameters (effective sound intensity, dispersion, speech signal-to-noise ratio, total signal-to-noise ratio) against reference data (experimental values) and make a preliminary judgment on whether a cache contains human voice; if it does, trigger the TTL stabilizer once. Since speech itself is continuous, the TTL stabilizer automatically stabilizes the recognition output according to the scene and the number of triggers, guaranteeing the confidence of the result (that is, the voice-activity statistics function is invoked to enhance the signal and output it).
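The time-to-frequency conversion step can be illustrated with a naive DFT; the device presumably uses an optimized CFFT routine, so this O(N²) version and the bin arithmetic in the comment (derived from the stated 48 kHz rate and a 4096-point window) are for clarity only.

```python
import cmath

def dft(frame):
    """Naive DFT of a time-domain frame; bin k corresponds to k*fs/N Hz.
    At fs = 48 kHz and N = 4096 the 160-2000 Hz voice band spans roughly
    bins 14-171. Illustration only -- not the on-device FFT."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]
```

The magnitude of each bin (`abs(bins[k])`) is what feeds the amplitude-frequency statistics above.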

In this embodiment, the spectrum analysis of the processed sound includes the following.

The sound threshold is determined from the maximum power density (also called the single-tone intensity). The signal-to-noise ratio distinguishes sound signals belonging to wind noise from those belonging to human voice. The power-spectrum concentration distinguishes sound signals caused by breathing-valve resonance from those caused by microphone-pop resonance. All sound signals other than those belonging to human voice are, in essence, noise.

The sound threshold can be understood as a volume threshold, determined from the maximum single-tone intensity of the collected sound signal; a single tone is a single frequency. In practice, amplification inside a mask must distinguish three kinds of noise: 1. white noise caused by errors in A/D conversion; 2. airflow howling caused by the breathing valve, generally concentrated in a few fixed frequency bands; 3. wind noise caused by breath airflow striking the microphone.

Speech recognition in the industry is mostly semantic recognition or voice recognition in quiet environments. Amplification inside a mask places high demands on the accuracy of noise filtering, while traditional time-frequency analysis methods such as the Morlet wavelet transform demand too much computing power for real-time processing, so they are not suitable for an ARM Cortex-M4F processor running at a 60-120 MHz clock. Fast spectrum analysis first converts the signal from the time domain to the frequency domain. At a 48 kHz sampling rate, the fast Fourier transform typically occupies about 33% of kernel time, and the processor transforms the full-time-domain fundamental sin θ far more efficiently than it computes a Morlet wavelet. The energy of the human voice is concentrated mainly in 160-2000 Hz, so the analysis can focus on this frequency range. The noise types are then analyzed by their mathematical features. White noise has no distinct spectral features: its energy is distributed uniformly, with no concentrated band. Plotted as a distribution it forms a very steep curve with the energy all lying near the expected value (the mean), so it can be identified from the spectral standard deviation σ = sqrt(((x1−x̄)² + (x2−x̄)² + … + (xn−x̄)²)/n); setting a reasonable threshold on σ implements this function. For airflow howling, the concentrated energy in the affected band can be down-weighted; because its energy is concentrated, limiting abrupt changes in the amplitude-frequency curve and observing the characteristics of μ and σ of the distribution distinguishes this impact noise. The spectrum of the wind noise produced when airflow strikes the microphone is, on analysis, similar to white noise, so it is handled in the same way and not described further.
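The σ formula above can be applied directly to a magnitude spectrum; the relative threshold in the sketch below is an assumed placeholder, not a value from the patent.

```python
import math

def spectral_std(x):
    """Population standard deviation per the formula in the text:
    sigma = sqrt(((x1-mean)^2 + ... + (xn-mean)^2) / n)."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((xi - mean) ** 2 for xi in x) / len(x))

def looks_like_white_noise(spectrum, rel_threshold=0.2):
    """White noise has a roughly flat magnitude spectrum with no
    concentrated band, so sigma relative to the spectral mean stays
    small; a concentrated band pushes the ratio up."""
    mean = sum(spectrum) / len(spectrum)
    return spectral_std(spectrum) / mean < rel_threshold
```

A voiced frame, whose energy clusters in a few fundamental bins, fails this flatness test, which is exactly how the threshold separates the two cases.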

The human voice is a special sound source: because of the structure of the vocal cords, a person cannot produce a single tone (a single-frequency sound); what is perceived as a change of pitch is the effect of several fundamentals shifting in frequency together. In terms of distribution, the voice spectrum carries substantial energy and many overtones, so its amplitude-frequency distribution is relatively spread out; fitting a normal distribution X~N(μ, σ²) gives a fairly steep curve, but one without abrupt jumps. After the transform, choosing a suitable standard deviation therefore separates the voice band. Specifically, vocal timbre divides into the following sections: 40-200 Hz sub-bass, 200-800 Hz low-mid, 800-1600 Hz high-mid, 1600-4000 Hz treble, and overtones above 4000 Hz. In general the sound can be processed with a five-band IIR lattice filter: 1) sub-bass: attenuate, to avoid muddiness; 2) low-mid: boost, to make the voice clear; 3) high-mid: boost, to make the voice transparent; 4) treble: attenuate, with no obvious audible effect but filtering out mechanical noise; 5) overtones: notch, with no obvious audible effect, isolating high-frequency noise. The voice-activity statistics function compares the current spectral features with given thresholds to obtain an activity weight, then makes an overall judgment together with the most recent activity weights. For example, in the spectrum-comparison flow of FIG. 5: the main energy of the human voice is concentrated in the 160-2000 Hz fundamentals, with a high signal-to-noise ratio (peak-to-mean ratio) and low energy concentration (generally falling within the 1σ-2σ range of the normal distribution (μ, σ²)); the energy of wind noise is extremely dispersed with a very low signal-to-noise ratio, so it can be filtered out by signal-to-noise ratio; the breathing-valve and microphone-pop resonances also follow a t-distribution with a relatively high signal-to-noise ratio, but their energy peaks are mostly weak and close to the center μ (within 1σ), with a small overall standard deviation. From this analysis the following conclusions are drawn: a. the sound threshold can be judged from the maximum power density (also called the single-tone intensity); b. wind noise and human voice can be distinguished by signal-to-noise ratio; c. breathing-valve and microphone-pop sounds can be distinguished by the power-spectrum concentration σ/μ.
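Conclusions a-c amount to a small decision cascade; every threshold in the sketch below is a hypothetical placeholder that would have to be calibrated experimentally, as the text implies.

```python
def classify_frame(peak_density, snr, concentration,
                   gate=1.0, snr_min=3.0, conc_min=0.5):
    """Decision cascade per conclusions a-c: gate on maximum power
    density, reject wind noise by SNR, then separate valve/pop resonance
    from voice by the power-spectrum concentration sigma/mu.
    All three thresholds are assumed values."""
    if peak_density < gate:
        return "silence"        # below the sound threshold (a)
    if snr < snr_min:
        return "wind_noise"     # dispersed energy, very low SNR (b)
    if concentration < conc_min:
        return "valve_or_pop"   # energy close to mu, small sigma/mu (c)
    return "voice"
```

Only frames classified as "voice" would go on to trigger the TTL stabilizer and the enhancement stage.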

In this embodiment, the IIR enhancement process includes: collecting 1024 points at a time as one cache and, in the chronological order in which the caches were recorded, taking out 4 caches to form a set of 4096 sampling points. For direct amplification (i.e., local playback), too long a sampling time causes too much delay: when the output (playback) delay exceeds 100 ms it severely disturbs the user's auditory feedback of their own speech and thus impairs their speaking; yet too few sampling points reduce the accuracy of the system's judgment. A scheme of circular sample storage plus chained loading is therefore used: 1024 points are collected each time as one cache, and 4 caches are taken out first-in-first-out to form the 4096 most recent sampling points. Each cache corresponds to one time slot, and the signal features of multiple sampling slots are used to analyze the likelihood that human voice is present over consecutive slots.

Using this set of sampling points, enhancement is performed through the voice-activity statistics function. As above, for direct amplification too long a sampling time causes excessive delay (an output delay above 100 ms severely disturbs the user's auditory feedback and speaking ability), while too few sampling points reduce the accuracy of the system's judgment; the circular-storage-plus-chained-loading scheme (1024 points per cache, 4 caches combined first-in-first-out into the 4096 most recent sampling points) is therefore combined with the voice-activity statistics function for speech recognition and playback.
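The circular-storage-plus-chained-loading scheme maps naturally onto a fixed-length FIFO of caches; the class name and error handling below are illustrative assumptions.

```python
from collections import deque

class SampleChain:
    """Circular sample storage + chained loading: 1024-point caches,
    with the newest four concatenated into a 4096-sample window."""
    CACHE_POINTS = 1024
    NUM_CACHES = 4

    def __init__(self):
        # deque with maxlen evicts the oldest cache first-in-first-out.
        self._caches = deque(maxlen=self.NUM_CACHES)

    def push(self, cache):
        if len(cache) != self.CACHE_POINTS:
            raise ValueError("each cache must hold exactly 1024 samples")
        self._caches.append(cache)

    def window(self):
        # Oldest-first concatenation of the retained caches.
        return [s for c in self._caches for s in c]
```

Each `push` corresponds to one time slot; `window()` yields the 4096 most recent samples that the analysis stage operates on.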

The voice-activity statistics function can be modeled by those skilled in the art. Its purpose is to make the voice smoother, more continuous and more natural: without it, the enhanced sound signal 1 obtained after noise removal and echo cancellation would sound intermittent and disjointed, an artifact introduced by the spectrum analysis. In this embodiment, "voice activity" is used as the standard term of the field.
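The patent leaves the function's form to the practitioner; one plausible sketch, smoothing a per-frame activity weight over recent frames so the gated output does not stutter, is shown below. The history length and threshold are assumptions.

```python
from collections import deque

class VoiceActivitySmoother:
    """Combines the current frame's activity weight with the most recent
    weights, per the text's description of the statistics function."""
    def __init__(self, history=4, threshold=0.5):
        self._weights = deque(maxlen=history)
        self._threshold = threshold

    def update(self, weight):
        """weight in [0, 1]: how voice-like the current frame looks.
        Averaging over recent frames keeps the voice gate from flickering
        on isolated frame-level misclassifications."""
        self._weights.append(weight)
        return sum(self._weights) / len(self._weights) >= self._threshold
```

A single non-voice frame in the middle of speech then no longer closes the gate, which is what keeps the output continuous.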

In this embodiment, the IIR is a five-band sound equalizer. Through the IIR, the sound signal conforming to human-voice characteristics is enhanced according to different sound frequency sections, where the sections are: 40-200 Hz sub-bass, 200-800 Hz low-mid, 800-1600 Hz high-mid, 1600-4000 Hz treble, and overtones above 4000 Hz.
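The per-band policy can be tabulated as below. The patent specifies only the direction of each band (boost, attenuate, notch), so the gain values here are illustrative assumptions, and this sketch shows the gain policy rather than a full IIR lattice implementation.

```python
# (low_hz, high_hz, linear gain) -- gain amounts are assumed, not from
# the patent; only the boost/attenuate/notch direction is specified.
BANDS = [
    (40, 200, 0.7),      # sub-bass: attenuate to avoid muddiness
    (200, 800, 1.4),     # low-mid: boost for clarity
    (800, 1600, 1.3),    # high-mid: boost for transparency
    (1600, 4000, 0.8),   # treble: attenuate mechanical noise
    (4000, 24000, 0.0),  # overtones: notch out high-frequency noise
]

def band_gain(freq_hz):
    """Return the equalizer gain applied at a given frequency."""
    for lo, hi, gain in BANDS:
        if lo <= freq_hz < hi:
            return gain
    return 0.0  # outside all bands (below 40 Hz or above Nyquist)
```

In a real implementation each band would be realized as an IIR section (e.g., a peaking or shelving biquad) whose peak gain matches the table.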

This embodiment also provides a human voice recognition and enhancement apparatus for a speech device, comprising:

A preprocessing module is used to perform echo cancellation on the collected sound signal through the NLMS algorithm, where the collected sound signal contains both environmental noise and a speech signal.

An analysis module is used to perform spectrum analysis on the sound processed by the preprocessing module and to extract the sound signal conforming to human-voice characteristics.

An enhancement module is used to enhance, through the IIR, the sound signal conforming to human-voice characteristics.

A transmission module is used to send the enhanced sound signal to the intercom module.
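The preprocessing module's NLMS echo cancellation (step 1) can be sketched as follows; the filter order, step size and pure-Python form are assumptions for illustration, not the device's fixed-point implementation.

```python
def nlms_cancel(mic, ref, order=8, mu=0.5, eps=1e-6):
    """NLMS echo cancellation sketch: adaptively estimate the part of
    the microphone signal predictable from the playback reference and
    subtract it, returning the error (echo-cancelled) signal."""
    w = [0.0] * order
    cleaned = []
    for i in range(len(mic)):
        # Tap-delay line of the last `order` reference samples.
        x = [ref[i - k] if i - k >= 0 else 0.0 for k in range(order)]
        echo_est = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[i] - echo_est               # echo-free output sample
        norm = eps + sum(xk * xk for xk in x)
        # Normalized update: step size scaled by reference signal power.
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        cleaned.append(e)
    return cleaned
```

The normalization by the reference power is what distinguishes NLMS from plain LMS and keeps the adaptation stable across loud and quiet playback.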

An embodiment of the present invention further provides a storage medium storing a computer program or instructions which, when run, implement the method of this embodiment.

The embodiments in this specification are described in a progressive manner; for identical or similar parts of the embodiments, reference may be made between them, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant points, refer to the description of the method embodiments. The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited to them: any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. The protection scope of the present invention shall therefore be defined by the protection scope of the claims.

Claims (11)

1. A human voice recognition and enhancement method for a speech device, characterized by comprising the following steps in the processing of each period:
step 1, performing echo cancellation on the collected sound signal through an NLMS algorithm;
step 2, performing spectrum analysis on the sound processed in step 1, and extracting a sound signal conforming to human-voice characteristics;
step 3, enhancing the sound signal conforming to human-voice characteristics through an IIR; and
step 4, sending the enhanced sound signal to the intercom module.
2. The method of claim 1, further comprising:
playing and recording the enhanced sound signal;
and then using the recorded sound data to perform echo cancellation on the collected sound signal through the NLMS algorithm.
3. The method of claim 2, wherein step 1 comprises:
inputting the collected sound signal into an encoder for encoding;
storing the encoded sound signal in a speech-signal-and-noise buffer and the recorded sound data in an echo reference buffer, wherein the processor extracts data from the speech-signal-and-noise buffer and the echo reference buffer.
4. The method of claim 3, wherein m buffers are created in the speech-signal-and-noise buffer, and wherein, while the nth buffer is recording the sound signal, the processor is simultaneously processing data of the [(n+m-1) mod m] buffer and playing data of the [(n+m-2) mod m] buffer, the [(n+m-1) mod m] buffer being the buffer immediately before the nth buffer and the [(n+m-2) mod m] buffer being the buffer two before the nth buffer.
5. The method of claim 1, wherein performing echo cancellation on the collected sound signal through the NLMS algorithm comprises:
after the collected sound signal has been processed by the codec, performing NLMS normalization filtering with the system output signal as reference, wherein the data output by the codec is a signed integer with 24-bit precision and the system output signal is the sound signal enhanced in the previous period;
performing NLMS normalization filtering on the main signal with the noise signal as reference to obtain an initially processed noise-cancelled signal;
notch-filtering the noise-cancelled signal through an IIR, wherein the IIR operates at the frequency specified in the coding.
6. The method of claim 1, wherein the spectrum analysis performed comprises:
performing CFFT processing on the sound signal processed in step 1, converting it from a time-domain signal into a frequency-domain signal;
determining, in the voice region of the sound signal, the frequency of maximum energy concentration of the voice region, and recording the energy intensity at that frequency;
acquiring amplitude-frequency characteristic information of the voice region, the amplitude-frequency characteristic information comprising the mean and standard deviation of the voice-region amplitude-frequency values;
determining, in the noise region of the sound signal, the frequency of maximum energy concentration of the noise region, and recording the energy intensity at that frequency;
comparing the effective sound intensity, the dispersion, the speech signal-to-noise ratio and the total signal-to-noise ratio with reference data to preliminarily judge whether a cache contains human voice; and
if human voice is preliminarily judged to be present, triggering the TTL stabilizer once, wherein the number of times the TTL stabilizer is triggered is positively correlated with the confidence.
7. The method of claim 6, wherein the effective sound intensity is the maximum single-tone energy intensity of the voice region;
the dispersion is the voice-region amplitude-frequency standard deviation divided by the voice-region amplitude-frequency mean;
the speech signal-to-noise ratio is the maximum single-tone energy intensity of the voice region divided by the voice-region amplitude-frequency mean; and
the total signal-to-noise ratio is the maximum single-tone energy intensity of the voice region divided by the maximum single-tone energy intensity of the noise region.
8. The method of claim 1 or 7, wherein, in performing the spectrum analysis, the method comprises:
determining a sound threshold from the maximum power density (the single-tone intensity);
distinguishing sound signals belonging to wind noise from sound signals belonging to human voice by signal-to-noise ratio; and
distinguishing sound signals belonging to breathing-valve resonance from sound signals belonging to microphone-pop resonance by power-spectrum concentration.
9. The method of claim 1, wherein, during the IIR enhancement processing, the method comprises:
collecting 1024 points at a time as one cache, and taking out 4 caches in chronological order to form a set of 4096 sampling points;
performing enhancement through a voice-activity statistics function using the set of sampling points;
wherein the IIR is a five-band sound equalizer; and
enhancing, through the IIR, the sound signal conforming to human-voice characteristics according to different sound frequency sections, the sound frequency sections comprising: 40-200 Hz sub-bass, 200-800 Hz low-mid, 800-1600 Hz high-mid, 1600-4000 Hz treble, and overtones above 4000 Hz.
10. A human voice recognition and enhancement apparatus for a speech device, comprising:
a preprocessing module for performing echo cancellation on the collected sound signal through an NLMS algorithm;
an analysis module for performing spectrum analysis on the sound processed by the preprocessing module and extracting a sound signal conforming to human-voice characteristics;
an enhancement module for enhancing, through an IIR, the sound signal conforming to human-voice characteristics; and
a transmission module for sending the enhanced sound signal to the intercom module.
11. A storage medium, storing a computer program or instructions which, when executed, implement the method of any one of claims 1 to 8.

Publications (1)

Publication Number Publication Date
CN114822574A true CN114822574A (en) 2022-07-29


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119418711A (en) * 2025-01-06 2025-02-11 南京正泽科技股份有限公司 A method, system, storage medium and program product for voice enhancement in a mask

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120022101A (en) * 2010-09-01 2012-03-12 (주)제이유디지탈 Noise reduction method and device in voice communication of iptv
CN103828392A (en) * 2012-01-30 2014-05-28 三菱电机株式会社 Reverberation suppression device
CN111863001A (en) * 2020-06-17 2020-10-30 广州华燎电气科技有限公司 A method for suppressing background noise in a multi-party call system
CN111902866A (en) * 2018-03-19 2020-11-06 伯斯有限公司 Echo control in a binaural adaptive noise cancellation system in a headphone
CN113163286A (en) * 2021-03-22 2021-07-23 九音(南京)集成电路技术有限公司 Call noise reduction method, earphone and computer storage medium
CN113495714A (en) * 2020-04-01 2021-10-12 阿里巴巴集团控股有限公司 Data processing device, audio processing apparatus, echo cancellation system, and echo cancellation method
CN114121031A (en) * 2021-12-08 2022-03-01 思必驰科技股份有限公司 Device voice noise reduction, electronic device, and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination