CN116547753A

CN116547753A - Machine learning assisted spatial noise estimation and suppression

Info

Publication number: CN116547753A
Application number: CN202180080882.5A
Authority: CN
Inventors: R·J·卡特莱特; 王宁
Original assignee: Dolby Laboratories Licensing Corp
Current assignee: Dolby Laboratories Licensing Corp
Priority date: 2020-11-05
Filing date: 2021-11-04
Publication date: 2023-08-04

Abstract

In an embodiment, a method includes: a frequency band of a power spectrum of the input audio signal and microphone covariance are received, and for each frequency band: estimating respective probabilities of speech and noise using a classifier; estimating a set of averages of speech and noise, or a set of averages and covariances of speech and noise, using a directivity model based on the microphone covariances and the probabilities of the frequency bands; estimating means and covariance of noise power based on the probability and the power spectrum using a level model; determining a first noise suppression gain based on the directivity model; determining a second noise suppression gain based on the level model; selecting a first noise suppression gain or a second noise suppression gain or a sum of both based on a signal-to-noise ratio of the input audio signal; and scaling the time-frequency representation of the input signal by the selected noise suppression gain.

Description

Machine Learning-Aided Spatial Noise Estimation and Suppression

相关申请的交叉引用Cross References to Related Applications

本申请要求于2020年11月5日提交的美国临时申请号63/110,228以及于2021年6月14日提交的美国临时申请号63/210,215的优先权，所述专利申请中的每一个均通过引用以其全文并入本文。This application claims priority to U.S. Provisional Application No. 63/110,228, filed November 5, 2020, and U.S. Provisional Application No. 63/210,215, filed June 14, 2021, each of which is filed by References are incorporated herein in their entirety.

技术领域technical field

本公开总体上涉及音频信号处理，并且特别是涉及语音通信中的噪声估计和抑制。The present disclosure relates generally to audio signal processing, and in particular to noise estimation and suppression in speech communications.

背景技术Background technique

用于语音通信的噪声抑制算法已经有效地实施于如电话、膝上型计算机和会议系统等边缘设备。双向语音通信的常见问题是，在每个用户的位置处的背景噪声会与用户的语音信号一起传输。如果在边缘设备处接收到的组合信号的信噪比(SNR)太低，则重构语音的可理解性将降低，从而导致用户体验不佳。Noise suppression algorithms for voice communications have been effectively implemented in edge devices such as phones, laptops, and conferencing systems. A common problem with two-way voice communications is that background noise at each user's location is transmitted along with the user's voice signal. If the signal-to-noise ratio (SNR) of the combined signal received at the edge device is too low, the intelligibility of the reconstructed speech will be reduced, resulting in a poor user experience.

发明内容Contents of the invention

公开了用于机器学习辅助的空间噪声估计和抑制的实施方式。在一些实施例中，一种音频处理方法包括：接收输入音频信号的功率频谱的频带和每个频带的麦克风协方差，其中，所述麦克风协方差基于用于捕获所述输入音频信号的麦克风的配置；针对每个频带：使用机器学习分类器来估计语音和噪声的相应概率；基于所述频带的所述麦克风协方差和所述概率，使用方向性模型(directionality model)来估计语音和噪声的一组均值、或语音和噪声的一组均值和协方差；使用电平模型(level model)基于所述概率和所述功率频谱来估计噪声功率的均值和协方差；基于所述方向性模型的第一输出来确定第一噪声抑制增益；基于所述电平模型的第二输出来确定第二噪声抑制增益；基于所述输入音频信号的信噪比来选择所述第一噪声抑制增益或所述第二噪声抑制增益之一、或者所述第一噪声抑制增益与所述第二噪声抑制增益之和；通过用于所述频带的所选择的第一噪声抑制增益或第二噪声抑制增益缩放所述输入信号的时频表示；以及将所述时频表示转换为输出音频信号。Embodiments for machine learning assisted spatial noise estimation and suppression are disclosed. In some embodiments, a method of audio processing includes receiving frequency bands of a power spectrum of an input audio signal and a microphone covariance for each frequency band, wherein the microphone covariance is based on the frequency band of the microphone used to capture the input audio signal configuration; for each frequency band: use a machine learning classifier to estimate the corresponding probabilities of speech and noise; use a directionality model to estimate speech and noise based on the microphone covariance and the probabilities for the frequency band A set of means, or a set of mean and covariance of speech and noise; using a level model (level model) to estimate the mean and covariance of noise power based on said probability and said power spectrum; based on said directional model determining a first noise suppression gain based on a first output; determining a second noise suppression gain based on a second output of the level model; selecting either the first noise suppression gain or the one of the second noise suppression gains, or the sum of the first noise suppression gain and the second noise suppression gain; scaling by the selected first noise suppression gain or second noise suppression gain for the frequency band a time-frequency representation of the input signal; and converting the time-frequency representation into an output audio signal.

在一些实施例中，所述方法进一步包括：使用所述至少一个处理器接收包括一定数量的块/帧的输入音频信号；针对每个块/帧：使用所述至少一个处理器将所述块/帧转换为子带，每个子带具有与其他子带不同的频谱；使用所述至少一个处理器将所述子带组合为频带；以及使用所述至少一个处理器来确定分带功率。In some embodiments, the method further comprises: using said at least one processor, receiving an input audio signal comprising a certain number of blocks/frames; for each block/frame: using said at least one processor to convert said blocks Converting/frames to subbands, each subband having a different frequency spectrum than other subbands; combining the subbands into frequency bands using the at least one processor; and determining subband powers using the at least one processor.

在一些实施例中，机器学习分类器是包括输入层、输出层以及一个或多个隐藏层的神经网络。在示例中，神经网络是深度神经网络，其包括三个或更多个层、优选地多于3个层。In some embodiments, a machine learning classifier is a neural network comprising an input layer, an output layer, and one or more hidden layers. In an example, the neural network is a deep neural network comprising three or more layers, preferably more than 3 layers.

在一些实施例中，所述麦克风协方差被表示为归一化向量。In some embodiments, the microphone covariance is represented as a normalized vector.

在一些实施例中，所述方法进一步包括：确定所述第一噪声抑制增益进一步包括：计算针对所述频带的语音的概率；如果针对所述频带的语音的所述概率小于阈值，则将所述第一噪声抑制增益设置为等于最大抑制增益；以及如果针对所述频带的语音的概率大于阈值，则基于增益斜坡设置所述第一噪声抑制增益。In some embodiments, the method further comprises: determining the first noise suppression gain further comprises: calculating a probability for speech in the frequency band; if the probability for speech in the frequency band is less than a threshold value, calculating the setting the first noise suppression gain equal to a maximum suppression gain; and setting the first noise suppression gain based on a gain ramp if the probability of speech for the frequency band is greater than a threshold.

在一些实施例中，使用通过所述方向性模型估计的语音和噪声的所述一组均值和所述协方差来计算所述语音的概率。In some embodiments, the probability of speech is calculated using said set of means and said covariance of speech and noise estimated by said directional model.

在一些实施例中，使用通过所述方向性模型估计的语音和噪声的所述一组均值和所述协方差向量、以及多变量联合高斯密度函数来计算语音的所述概率。In some embodiments, said probability of speech is calculated using said set of means of speech and noise estimated by said directional model and said covariance vector, and a multivariate joint Gaussian density function.

在一些实施例中，所述方法进一步包括：确定所述第二噪声抑制增益进一步包括：如果所述频带功率小于第一阈值，则将所述第二噪声抑制增益设置为等于最大抑制增益；如果所述频带功率在第一阈值与第二阈值之间(其中，第二阈值高于第一阈值)，则将所述第二噪声抑制增益设置为等于零；以及如果所述频带功率高于第二阈值，则基于增益斜坡来设置所述第二噪声抑制增益。In some embodiments, the method further comprises: determining the second noise suppression gain further comprises: setting the second noise suppression gain equal to a maximum suppression gain if the band power is less than a first threshold; if said frequency band power is between a first threshold and a second threshold (wherein the second threshold is higher than the first threshold), setting said second noise suppression gain equal to zero; and if said frequency band power is higher than the second threshold, the second noise suppression gain is set based on a gain ramp.

在一些实施例中，使用所述方向性模型进行的所述估计使用被分类为语音和噪声的时频片，但排除被分类为混响的时频片。In some embodiments, said estimating using said directional model uses time-frequency slices classified as speech and noise, but excludes time-frequency slices classified as reverberation.

在一些实施例中，所述方法进一步包括：基于所述频带的所述麦克风协方差和语音的所述概率使用方向性模型或电平模型来估计语音的均值进一步包括：使用一阶低通滤波器计算语音的所述均值的时间平均估计，其中，语音的所述均值和所述麦克风协方差向量作为所述滤波器的输入；以及通过语音的所述概率对所述滤波器的输入进行加权。In some embodiments, the method further comprises: using a directional model or a level model to estimate a mean of speech based on the microphone covariance of the frequency band and the probability of speech further comprising: using a first order low pass filter calculating a time-averaged estimate of the mean of speech, wherein the mean of speech and the microphone covariance vector are inputs to the filter; and weighting the input of the filter by the probability of speech .

在一些实施例中，所述方法进一步包括：基于所述频带的所述麦克风协方差和噪声的所述概率使用方向性模型或电平模型来估计噪声的均值进一步包括：使用一阶低通滤波器计算噪声的所述均值的时间平均估计，其中，噪声的所述均值和所述麦克风协方差向量作为所述滤波器的输入；以及通过噪声的所述概率对所述滤波器的输入进行加权。In some embodiments, the method further comprises: using a directional model or a level model based on the microphone covariance of the frequency band and the probability of noise to estimate the mean of noise further comprising: using first order low pass filtering calculating a time-averaged estimate of the mean of noise, wherein the mean of noise and the microphone covariance vector are inputs to the filter; and weighting the input of the filter by the probability of noise .

在一些实施例中，所述方法进一步包括：基于所述频带的所述麦克风协方差和语音的所述概率使用方向性模型或电平模型来估计语音的协方差进一步包括：使用一阶低通滤波器计算语音的所述协方差的时间平均估计，其中，语音的所述协方差和所述麦克风协方差向量作为所述滤波器的输入；以及通过语音的所述概率对所述滤波器的输入进行加权。In some embodiments, the method further comprises: estimating the covariance of speech using a directional model or a level model based on the microphone covariance of the frequency band and the probability of speech further comprising: using a first-order low-pass a filter computing a time-averaged estimate of said covariance of speech, wherein said covariance of speech and said microphone covariance vector are inputs to said filter; Inputs are weighted.

在一些实施例中，所述方法进一步包括：基于所述频带的所述麦克风协方差和噪声的所述概率使用方向性模型或电平模型来估计噪声的协方差进一步包括：使用一阶低通滤波器计算语音的所述协方差的时间平均估计，其中，语音的所述协方差和所述麦克风协方差向量作为所述滤波器的输入；以及通过噪声的所述概率对所述滤波器的输入进行加权。In some embodiments, the method further comprises estimating the covariance of noise using a directional model or a level model based on the microphone covariance of the frequency band and the probability of noise further comprising using a first order low pass a filter computing a time-averaged estimate of said covariance of speech, wherein said covariance of speech and said microphone covariance vector are inputs to said filter; and said probability of passing noise is a function of said filter's Inputs are weighted.

在一些实施例中，一种系统包括：一个或多个计算机处理器；以及存储有指令的非暂态计算机可读介质，所述指令在由所述一个或多个计算机处理器执行时，使所述一个或多个处理器执行前述方法中的任何一种方法的操作。In some embodiments, a system includes: one or more computer processors; and a non-transitory computer-readable medium storing instructions that, when executed by the one or more computer processors, cause The one or more processors perform the operations of any one of the aforementioned methods.

在一些实施例中，一种非暂态计算机可读介质存储有指令，所述指令在由一个或多个计算机处理器执行时，使所述一个或多个处理器执行前述方法中的任何一种方法。In some embodiments, a non-transitory computer-readable medium stores instructions that, when executed by one or more computer processors, cause the one or more processors to perform any one of the aforementioned methods way.

本文公开的其他实施方式涉及一种系统、一种装置和一种计算机可读介质。下文的附图和描述中阐述了所公开的实施方式的细节。根据所述描述、附图和权利要求，其他特征、目的和优点是显而易见的。Other embodiments disclosed herein relate to a system, an apparatus, and a computer-readable medium. The details of the disclosed embodiments are set forth in the accompanying drawings and the description below. Other features, objects and advantages are apparent from the description, drawings and claims.

本文公开的特定实施方式提供了以下优点中的一个或多个。所公开的实施例使用方向性和机器学习(例如，神经网络)为语音通信应用提供低成本高质量的噪声估计和抑制。所公开的噪声估计和抑制的实施例可以在各种边缘设备上实施并且无需多个麦克风。神经网络的使用扩展至用于多种多样的背景噪声。Certain embodiments disclosed herein provide one or more of the following advantages. The disclosed embodiments use directionality and machine learning (eg, neural networks) to provide low-cost high-quality noise estimation and suppression for voice communication applications. The disclosed embodiments of noise estimation and suppression can be implemented on various edge devices and do not require multiple microphones. The use of neural networks extends to a wide variety of background noises.

附图说明Description of drawings

在附图中，为了便于描述，示出了示意性元素的特定布置或排序，如那些表示设备、单元、指令块和数据元素的示意性元素。然而，本领域技术人员应当理解，附图中示意性元素的特定排序或布置并不意味着暗示需要特定的处理次序或顺序或者单独的过程。进一步地，在附图中包括示意性元素并不意味着暗示在所有实施例中都需要这种元素，或者在一些实施方式中，由该元素表示的特征可以不包括在其他元素中或与其他元素组合。In the drawings, specific arrangements or orderings of schematic elements, such as those representing devices, units, instruction blocks and data elements, are shown for ease of description. Those skilled in the art will understand, however, that the specific ordering or arrangement of schematic elements in the figures is not meant to imply a specific order or sequence of processing or separate processes. Further, the inclusion of a schematic element in a drawing does not imply that such element is required in all embodiments, or that in some implementations, the feature represented by the element may not be included in other elements or be combined with other elements. Combination of elements.

进一步地，在附图中，在使用如实线或虚线或箭头等连接元素来说明两个或更多个其他示意性元素之间的连接、关系或关联的情况下，不存在任何这种连接元素并不意味着暗示不可能存在连接、关系或关联。换句话说，未在附图中示出元素之间的一些连接、关系或关联性，以免模糊本公开。另外，为了便于图示，使用单个连接元素来表示元素之间的多个连接、关系或关联性。例如，在连接元素表示信号、数据或指令的通信的情况下，本领域的技术人员应理解，这样的元素表示为了影响通信而可能需要的一条或多条信号路径。Further, in the drawings, where connecting elements such as solid or dashed lines or arrows are used to illustrate a connection, relationship or association between two or more other schematic elements, there are no such connecting elements It is not meant to imply that a connection, relationship or association cannot exist. In other words, some connections, relationships or associations between elements are not shown in the figures so as not to obscure the disclosure. Additionally, for ease of illustration, a single connection element is used to represent multiple connections, relationships or associations between elements. For example, where a connected element represents the communication of signals, data or instructions, those skilled in the art will understand that such an element represents one or more signal paths that may be required to effect the communication.

图1是根据一些实施例的机器学习辅助的空间噪声估计和抑制系统的框图。Figure 1 is a block diagram of a machine learning assisted spatial noise estimation and suppression system according to some embodiments.

图2是图示了根据一些实施例的基于电平模型的噪声抑制增益计算的图。FIG. 2 is a diagram illustrating level model based noise suppression gain calculations according to some embodiments.

图3是图示了根据一些实施例的基于方向性模型的噪声抑制增益计算的图。FIG. 3 is a graph illustrating noise suppression gain calculation based on a directional model, according to some embodiments.

图4A和图4B示出了根据一些实施例的用于使用方向性和机器学习处理语音通信中的噪声估计和抑制的流程图。4A and 4B illustrate flowcharts for processing noise estimation and suppression in voice communications using directionality and machine learning, according to some embodiments.

图5是根据一些实施例的用于实施参考图1至图4描述的特征和过程的系统的框图。FIG. 5 is a block diagram of a system for implementing the features and processes described with reference to FIGS. 1-4 , according to some embodiments.

各附图中使用的相同附图标记指示相似的元素。The same reference numbers used in the various figures indicate similar elements.

具体实施方式Detailed ways

在以下具体实施方式中，阐述了许多具体细节以提供对所描述的各种实施例的全面理解。对于本领域普通技术人员而言将明显的是，可以在没有这些具体细节的情况下实践所描述的各种实施方式。在其他实例中，并未详细描述熟知方法、过程、部件以及电路以免不必要地模糊实施例的各方面。下文描述了若干特征，每个特征可以彼此独立使用或者与其他特征的任何组合一起使用。In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the various embodiments described. It will be apparent to those skilled in the art that the various described embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. Several features are described below, each of which may be used independently of the other or in any combination with the other features.

术语解释Terminology Explanation

如本文所使用的，术语“包括”及其变体应被理解为意思是“包括但不限于”的开放式术语。除非上下文另外明确指出，否则术语“或”应被理解为“和/或”。术语“基于”应被理解为“至少部分地基于”。术语“一个示例实施方式”和“示例实施方式”应被理解为“至少一个示例实施方式”。术语“另一个实施方式”应被理解为“至少一个其他实施方式”。术语“(被)确定”应被理解为获得、接收、计算、估算、估计、预测或得到。另外，在以下描述和权利要求中，除非另外定义，否则本文所使用的所有技术和科学术语具有与本公开所属领域的普通技术人员通常理解的含义相同的含义。As used herein, the term "comprising" and variations thereof should be understood as open-ended terms meaning "including but not limited to". The term "or" should be read as "and/or" unless the context clearly dictates otherwise. The term "based on" should be understood as "based at least in part on". The terms "one example embodiment" and "example embodiment" should be read as "at least one example embodiment". The term "another embodiment" should be understood as "at least one other embodiment". The term "(to) determine" shall be understood as obtaining, receiving, calculating, estimating, estimating, predicting or obtaining. Also, in the following description and claims, unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

概述overview

传统的噪声抑制解决方案使用两个或更多个麦克风来捕获背景噪声，其中，一个麦克风更靠近用户的嘴部而另一个麦克风离得更远。将来自这两个麦克风的信号相减，以去除两个信号所共有的背景噪声。然而，该技术对于具有单个麦克风的边缘设备或者在移动电话(用户在说话时摇晃或转动电话)的情况下不起作用。其他噪声抑制算法尝试连续查找音频信号中的噪声模式并通过逐帧或逐块地处理音频来适应所述噪声模式，其中，每个块可以包括两个或更多个帧。这些现有的自适应算法在某些用例下效果尚好，但是不能扩展到多种多样的背景噪声。Traditional noise suppression solutions capture background noise using two or more microphones, one closer to the user's mouth and one further away. The signals from these two microphones are subtracted to remove background noise common to both signals. However, this technique doesn't work on edge devices with a single microphone, or in the case of mobile phones (where the user shakes or turns the phone while speaking). Other noise suppression algorithms try to continuously find noise patterns in the audio signal and adapt to them by processing the audio frame by frame or block by block, where each block may consist of two or more frames. These existing adaptive algorithms work well for some use cases, but do not scale to a wide variety of background noises.

最近，已经使用深度神经网络来抑制语音通信中的噪声。然而，这些解决方案需要相当的计算能力并且在实时通信系统上难以实施。所公开的实施例使用方向性和机器学习(例如，深度神经网络)的组合来为语音通信应用提供低成本高质量的噪声估计和抑制。Recently, deep neural networks have been used to suppress noise in speech communication. However, these solutions require considerable computing power and are difficult to implement on real-time communication systems. The disclosed embodiments use a combination of directionality and machine learning (eg, deep neural networks) to provide low-cost high-quality noise estimation and suppression for voice communication applications.

图1是根据一些实施例的机器学习辅助的空间噪声估计和抑制系统100的框图。系统100包括滤波器组101、分带单元(banding unit)102、机器学习分类器103(例如，如深度神经网络(deep neural network,DNN)的神经网络)、方向检测单元104、语音/噪声方向性模型105、噪声电平模型106、噪声抑制增益单元107、乘法单元108以及逆滤波器组109。在一些实施例中，滤波器组101(例如，短时傅里叶变换(short-time Fourier transform,STFT))接收时域输入音频信号并将该时域输入音频信号转换为子带，每个子带具有不同的频谱(例如，时间/频率片)。每个块/帧的子带被输入到分带单元102中，所述分带单元根据心理声学模型将块/帧的子带组合为频带并输出分带功率110(例如，用dB表示的分带功率)。由滤波器组101输出的子带还被输入到方向检测单元104中，所述方向检测单元生成并输出麦克风协方差向量112。在一些实施例中，每个频带都有一个协方差向量，并且在分带单元102、方向检测单元104中使用共同的分带矩阵。在一些实施方式中，分带单元102和方向检测单元104是产生每个频带的频带功率(以分贝为单位)以及协方差向量的单个单元。FIG. 1 is a block diagram of a machine learning assisted spatial noise estimation and suppression system 100 in accordance with some embodiments. The system 100 includes a filter bank 101, a banding unit (banding unit) 102, a machine learning classifier 103 (for example, a neural network such as a deep neural network (DNN)), a direction detection unit 104, a speech/noise direction Sex model 105, noise level model 106, noise suppression gain unit 107, multiplication unit 108 and inverse filter bank 109. In some embodiments, a filter bank 101 (eg, a short-time Fourier transform (STFT)) receives a time domain input audio signal and converts the time domain input audio signal into subbands, each subband The bands have different frequency spectra (eg, time/frequency slices). The sub-bands of each block/frame are input into the band-dividing unit 102, which combines the sub-bands of the block/frame into frequency bands according to a psychoacoustic model and outputs the band-divided power 110 (e.g. with power). The subbands output by the filter bank 101 are also input into a direction detection unit 104 which generates and outputs a microphone covariance vector 112 . In some embodiments, each frequency band has a covariance vector, and a common band-splitting matrix is used in the band-splitting unit 102 and the direction detection unit 104 . In some embodiments, the band splitting unit 102 and the direction detection unit 104 are a single unit that generates the band power (in decibels) and the covariance vector for each band.

分带功率110被输入到机器学习分类器103中。在实施例中，机器学习分类器103是预训练的神经网络分类器，所述分类器估计并输出多个类的概率111，所述多个类包括但不限于针对每个块/帧和每个频带的语音类、平稳噪声类、非平稳噪声类以及混响类。在一些实施例中，使用平稳噪声类来驱动语音/噪声方向性模型105和噪声电平模型106。The subband power 110 is input into the machine learning classifier 103 . In an embodiment, the machine learning classifier 103 is a pre-trained neural network classifier that estimates and outputs probabilities 111 for a number of classes including, but not limited to, for each block/frame and each Speech class, stationary noise class, non-stationary noise class and reverberation class in each frequency band. In some embodiments, the speech/noise directionality model 105 and the noise level model 106 are driven using a stationary noise class.

在以下公开内容中，假设机器学习分类器估计并输出语音和平稳噪声的概率。在机器学习分类器输出混响概率的实施例中，使用语音、平稳噪声、非平稳噪声以及混响的估计概率来将混响与噪声分开。在一些实施例中，被分类为混响的任何时频片(tile)既不添加到语音/噪声方向性模型105中也不添加到噪声电平模型106中。In the following disclosure, it is assumed that a machine learning classifier estimates and outputs the probabilities of speech and stationary noise. In embodiments where the machine learning classifier outputs reverberation probabilities, estimated probabilities for speech, stationary noise, non-stationary noise, and reverberation are used to separate reverberation from noise. In some embodiments, any time-frequency tiles classified as reverberation are neither added to the speech/noise directionality model 105 nor to the noise level model 106 .

概率111与麦克风协方差向量112一起被输入到语音/噪声方向性模型105中。概率111还与分带功率110一起被输入到语音/噪声电平模型106中。针对每个频带和每个块/帧，语音/噪声方向性模型105基于麦克风协方差向量112估计并输出每个频带的语音和噪声的相应均值和/或协方差114。针对每个频带、针对每个块/帧，噪声电平模型106估计并输出噪声功率的均值和方差113。在一些实施例中，噪声电平模型106还输出语音的均值和方差。The probabilities 111 are input into the speech/noise directionality model 105 together with the microphone covariance vector 112 . The probabilities 111 are also input into the speech/noise level model 106 along with the subband powers 110 . For each frequency band and each block/frame, the speech/noise directivity model 105 estimates and outputs the respective mean and/or covariance 114 of speech and noise for each frequency band based on the microphone covariance vector 112 . For each frequency band, for each block/frame, the noise level model 106 estimates and outputs the mean and variance 113 of the noise power. In some embodiments, the noise level model 106 also outputs the mean and variance of the speech.

噪声抑制增益单元107分别接收语音/噪声方向性模型105的输出114和噪声电平模型106的输出113、分带功率110以及麦克风协方差向量112。噪声抑制增益单元107计算并输出抑制增益，乘法单元108使用所述抑制增益来缩放由滤波器组101输出的子带，以抑制每个子带中的噪声。然后，逆滤波器组109将子带转换为时域输出音频信号。下文将进一步详细地描述系统100的每个部件。在一些实施例中，机器学习分类器103的输出直接用作噪声增益或用于计算噪声抑制增益。The noise suppression gain unit 107 receives the output 114 of the speech/noise directivity model 105 and the output 113 of the noise level model 106 , the subband power 110 and the microphone covariance vector 112 , respectively. The noise suppression gain unit 107 calculates and outputs a suppression gain, which is used by the multiplication unit 108 to scale the subbands output by the filter bank 101 to suppress noise in each subband. Then, the inverse filter bank 109 converts the subbands into time domain output audio signals. Each component of system 100 will be described in further detail below. In some embodiments, the output of the machine learning classifier 103 is directly used as a noise gain or used to calculate a noise suppression gain.

机器学习分类器machine learning classifier

如上所述，机器学习分类器103将分带功率110作为输入并提供语音和噪声的概率111作为输出，其由下式给出：As mentioned above, the machine learning classifier 103 takes as input the subband power 110 and provides as output the probabilities 111 for speech and noise, which are given by:

{p_s(t，f)，p_n(t，f)}， [1]{p _s (t, f), p _n (t, f)}, [1]

其中，t是块/帧号，并且f是频带索引，下标“s”表示语音，并且下标“n”表示噪声。在一些实施例中，基于存在一个不到处走动的相关说话者并且噪声场因此相对平稳的假设，噪声是平稳噪声。如果违背了这些假设，则由105和106执行的建模可能是不合适的。在一些实施例中，机器学习分类器103的输出还可以包括非平稳噪声的概率和混响的概率、以及可以由噪声抑制增益单元107直接使用的噪声抑制增益。where t is the block/frame number, and f is the frequency band index, subscript "s" denotes speech, and subscript "n" denotes noise. In some embodiments, the noise is stationary noise based on the assumption that there is one related speaker that is not moving around and that the noise field is therefore relatively stationary. If these assumptions are violated, the modeling performed by 105 and 106 may be inappropriate. In some embodiments, the output of the machine learning classifier 103 may also include the probability of non-stationary noise and the probability of reverberation, and the noise suppression gain which may be directly used by the noise suppression gain unit 107 .

在图示的示例中，机器学习分类器103包括神经网络(NN)。示例NN 103的输入包括当前帧的61个分带功率，并且输入线性密集层将所述分带功率映射到256个特征。所述256个特征通过一组GRU层(每个GRU层包含256个隐藏单元)，并且最后通过具有某种非线性内核的密集层。NN 103输出每个类和每个频带的相对权重，使用softmax函数将所述相对权重转换为概率。使用混合有噪声的已标记语音并且使用交叉熵作为代价函数来训练NN 103。在实施例中，使用Adam优化器来训练神经网络。机器学习分类器的其他示例包括但不限于k最近邻、支持向量机或决策树。In the illustrated example, the machine learning classifier 103 includes a neural network (NN). The input to the example NN 103 includes 61 subband powers of the current frame, and the input linear dense layer maps the subband powers to 256 features. The 256 features are passed through a set of GRU layers (each GRU layer contains 256 hidden units) and finally a dense layer with some kind of non-linear kernel. The NN 103 outputs relative weights for each class and each frequency band, which are converted into probabilities using a softmax function. The NN 103 is trained using labeled speech mixed with noise and using cross-entropy as a cost function. In an embodiment, the neural network is trained using the Adam optimizer. Other examples of machine learning classifiers include, but are not limited to, k-nearest neighbors, support vector machines, or decision trees.

在实施例中，可以如下来训练NN 103：In an embodiment, the NN 103 may be trained as follows:

1.获得一组近距离麦克风录制的干净语音。例如，语音技术中心的语音克隆工具包(Voice Cloning Toolkit,VCTK)；1. Obtain a clean voice recorded by a set of close-range microphones. For example, the Voice Cloning Toolkit (Voice Cloning Toolkit, VCTK) of the Voice Technology Center;

2.获得一组噪声数据。例如，可在http://research.google.com/audioset/上获得的AudioSet数据；以及2. Obtain a set of noise data. For example, AudioSet data available at http://research.google.com/audioset/; and

3.在训练过程开始之前，提取语音和噪声数据的特征。例如，以例如50Hz的速率计算出的、在例如50Hz到8000Hz之间间隔分布的频带的61个频带能量(以dB为单位)。3. Before the training process starts, features of speech and noise data are extracted. For example, 61 band energies (in dB) for frequency bands spaced between eg 50 Hz to 8000 Hz calculated at a rate of eg 50 Hz.

4.作为特征提取过程的一部分：4. As part of the feature extraction process:

a.确定每个语音向量的语音功率的度量。例如，使语音文件通过语音活动检测器(voice activity detector,VAD)，并且针对通过VAD的所有语音计算均值A加权功率(例如，以dBFS为单位)；以及a. Determine a measure of speech power for each speech vector. For example, passing a voice file through a voice activity detector (VAD), and computing the mean A-weighted power (e.g., in dBFS) for all voices passing through the VAD; and

b.确定每个噪声向量的噪声功率的度量。例如，在整个向量上计算均值A加权功率(例如，以dBFS为单位)。b. Determine a measure of noise power for each noise vector. For example, compute the mean A-weighted power (eg, in dBFS) over the entire vector.

5.针对每个训练时期的每个语音向量：5. For each speech vector at each training epoch:

a.从噪声数据集中选择随机噪声向量，将语音与所述随机噪声向量相关联；a. selecting random noise vectors from the noise data set, associating speech with said random noise vectors;

b.抽取随机SNR，以所述随机SNR来混合语音和噪声。例如，从具有例如20dB SNR的均值以及例如10dB SNR的标准偏差的正态分布中进行抽取；b. Draw a random SNR with which to mix speech and noise. For example, drawing from a normal distribution with a mean of eg 20dB SNR and a standard deviation of eg 10dB SNR;

c.根据增益＝SNR–S+N，基于所选的SNR、预定语音功率(S)以及预定噪声功率(N)来确定增益(以dB为单位)，以所述增益将噪声与语音混合；c. Determine the gain (in dB) at which to mix noise with speech based on the selected SNR, predetermined speech power (S) and predetermined noise power (N) according to gain=SNR−S+N;

d.将混响应用于语音，记录在哪个时频片中已添加显著的混响；d. Apply reverberation to speech, record in which time-frequency slices significant reverberation has been added;

e.使用所提供的增益将噪声与经混响的语音混合。由于所述特征以dB为标度，因此可以通过取每个时频片中经混响的语音功率和应用增益后的噪声功率的max()来近似“混合”。在这样处理时，记录在哪个时频片中噪声最终在经混响的语音占主导。在训练期间要呈现给NN的最终训练向量由每帧的61个频带能量以及基于在上述过程期间每个时频片中是噪声、语音还是混响占主导的真值类标签构成；以及e. Mix the noise with the reverberated speech using the gain provided. Since the feature is scaled in dB, "mixing" can be approximated by taking the max() of the reverberated speech power and the noise power after applying the gain in each time-frequency slice. When doing so, note in which time-frequency slice the noise ends up dominating the reverberated speech. The final training vector to be presented to the NN during training consists of the 61 frequency band energies per frame and ground-truth class labels based on whether noise, speech or reverberation dominates in each time-frequency slice during the above process; and

f.将训练向量显示给网络，并且针对在上述过程期间计算的真值使交叉熵损失下降。f. Show the training vectors to the network and drop the cross-entropy loss against the ground truth computed during the above process.

麦克风协方差矩阵Microphone covariance matrix

在一些实施例中，使用短时傅里叶变换(STFT)来实施滤波器组101。令X_t(k)为所有麦克风输入的块/帧t的STFT的第k仓数据(子带)。X_t(k)是长度为N的向量，其中，N是麦克风的数量，M是子带的数量，并且α是对过去和估计的协方差的贡献进行加权的加权因子。在实时音频处理中，频带f的麦克风协方差由下式计算：In some embodiments, filterbank 101 is implemented using a short-time Fourier transform (STFT). Let _Xt (k) be the k-th bin data (subband) of the STFT of block/frame t of all microphone inputs. _Xt (k) is a vector of length N, where N is the number of microphones, M is the number of subbands, and α is a weighting factor that weights the contribution of past and estimated covariances. In real-time audio processing, the microphone covariance for frequency band f is calculated by:

其中，in,

并且[3] and[3]

其中，B(f)是属于频带f的所有STFT仓(即，子带)的集合。where B(f) is the set of all STFT bins (ie, subbands) belonging to frequency band f.

要注意，等式[2]和[3]对于“矩形分带”成立，其中每个子带以增益1对刚好一个输出频带做出贡献。一般而言，存在描述输入子带k对于输出频带f的贡献程度的在范围[0,1]中的某个权重w_kf。对于一个子带k，所有权重w_kf在所有频带f上的总和必须为1。以此方式，可以执行任意形状的分带。例如，除了矩形分带之外，还可以使用线性、对数或梅尔频率的余弦或三角形形状的分带。Note that equations [2] and [3] hold for "rectangular partitioning", where each subband contributes with a gain of 1 to exactly one output band. In general, there is some weight w _kf in the range [0,1] that describes how much the input subband k contributes to the output frequency band f. For a subband k, all weights w _kf must sum to 1 over all frequency bands f. In this way, arbitrary shaped banding can be performed. For example, instead of rectangular banding, linear, logarithmic or cosine of Mel frequency or triangular shaped banding can also be used.

∑_k＝B(f)w_k，f＝1。 [5] _Σk=B(f) _{wk, f} =1. [5]

在实施例中，可以直接从滤波器组101来计算分带功率。例如，嵌入有分带方案的非均匀滤波器组。在这种实施例中，不使用抽取，并且块速率定义块帧t。In an embodiment, the sub-band powers may be calculated directly from the filter bank 101 . For example, a non-uniform filter bank embedded with a banding scheme. In such an embodiment, no decimation is used, and the block rate defines the block frame t.

为了简化符号，将归一化协方差矩阵重新排列成实向量112(v_t，f)，因为协方差矩阵是厄米特矩阵并且其对角线元素是实数。归一化协方差是协方差矩阵除以其迹，使得所述协方差矩阵仅表示方向，同时任何电平分量被去除。To simplify notation, the normalized covariance matrix is rearranged into a real vector 112(vt _,f ), since the covariance matrix is Hermitian and its diagonal elements are real numbers. Normalized covariance is the division of the covariance matrix by its trace such that the covariance matrix only represents direction, while any level components are removed.

如果该协方差矩阵的元素表示为c_m，n，则针对三麦克风系统的重新排列后的向量由下式给出，其中，m和n是协方差矩阵元素的索引：If the elements of this covariance matrix are denoted c _m,n , the rearranged vector for a three-microphone system is given by, where m and n are the indices of the covariance matrix elements:

在一些实施例中，对v_t，f进行归一化。具有更多或更少的麦克风的系统将具有更多或更少的元素，并且因此所述系统可扩展到具有任何数量的麦克风的系统。In some embodiments, v _t,f is normalized. A system with more or fewer microphones will have more or fewer elements, and thus the system is scalable to systems with any number of microphones.

语音/噪声方向性模型Speech/Noise Directivity Model

在一些实施例中，语音/噪声方向性模型105将协方差向量v_t，f(每个频带一个)以及由机器学习分类器103输出的概率111作为输入，并且估计v_t，f的均值和/或协方差矩阵。以下描述了用于估计v_t，f的均值和/或协方差矩阵的至少两个实施例：In some embodiments, the speech/noise directionality model 105 takes as input the covariance vector v _t,f (one for each frequency band) and the probabilities 111 output by the machine learning classifier ₁₀₃ and estimates the mean and /or covariance matrix. At least two embodiments for estimating the mean and/or covariance matrix of v _t,f are described below:

1.仅估计语音和噪声的均值，并且使用所估计的均值作为方向性模型：1. Estimate the mean of speech and noise only, and use the estimated mean as the directionality model:

μ_n(t，f)＝[1-wp_n(t，f)]·μ_n(t-1，f)+wp_n(t，f)·v_t，f，并且 [7]μ _n (t, f) = [1-wp _n (t, f)] · μ _n (t-1, f) + wp _n (t, f) · v _{t, f} , and [7]

μ_s(t，f)＝[1-wp_s(t，f)]·μ_s(t-1，f)+wp_s(t，f)·v_t，f。 [8]μ _s (t, f)=[1-wp _s (t, f)]·μ _s (t−1,f)+wp _s (t,f)·v _t,f . [8]

在等式[7]和[8]中，μ_n(t，f)和μ_s(t，f)分别是针对块/帧t和频带f的噪声和语音的方向性模型。它们是归一化麦克风协方差矩阵的均值。注意，等式[7]和[8]以及下文中使用的w是控制时间平均窗口的长度的加权因子，所述加权因子对于语音和噪声而言或对于均值和方差而言可以是不同的。In equations [7] and [8], μ _n (t, f) and μ _s (t, f) are the directionality models for noise and speech for block/frame t and frequency band f, respectively. They are the means of the normalized microphone covariance matrix. Note that w used in equations [7] and [8] and hereinafter is a weighting factor controlling the length of the temporal averaging window, which may be different for speech and noise or for mean and variance.

2.噪声和语音的均值和协方差向量定义空间模型，所述空间模型可以计算为：2. The mean and covariance vectors of noise and speech define a spatial model that can be calculated as:

μ_n(t，f)＝[1-wp_n(t，f)]·μ_n(t-1，f)+wp_n(t，f)·v_t，f， [9]μ _n (t, f) = [1-wp _n (t, f)] · μ _n (t-1, f) + wp _n (t, f) · v _{t, f} , [9]

μ_s(t，f)＝[1-wp_s(t，f)]·μ_s(t-1，f)+wp_s(t，f)·v_t，f， [10]μ _s (t, f) = [1-wp _s (t, f)] · μ _s (t-1, f) + wp _s (t, f) · v _{t, f} , [10]

噪声电平模型Noise Level Model

噪声电平模型106将分带功率110(L_t，f)(以dB为单位)作为输入，并且估计针对频带f和块/帧t的噪声的均值和方差：The noise level model 106 takes as input the subband power 110(Lt _,f ) in dB and estimates the mean and variance of the noise for frequency band f and block/frame t:

μ_L(t，f)＝[1-wp_n(t，f)]·μ_L(t-1，f)+wp_n(t，f)·L_t，f， [13]μ _L (t, f) = [1-wp _n (t, f)] μ _L (t-1, f) + wp _n (t, f) L _{t, f} , [13]

注意，在等式[7]至[14]中，在随机过程是遍历性的并且在一段时间内平稳的假设下使用时间平均来估计随机变量(电平或方向性)的均值和协方差。使用一阶低通滤波器模型来实现所述平均，其中，对于每个帧，按照语音或噪声的概率对低通滤波器的输入(均值或协方差)进行加权。Note that in equations [7] to [14], the mean and covariance of the random variable (level or directionality) are estimated using time averaging under the assumption that the random process is ergodic and stationary over time. The averaging is achieved using a first-order low-pass filter model, where, for each frame, the inputs to the low-pass filter (mean or covariance) are weighted by the probability of speech or noise.

基于电平模型的噪声抑制增益计算Calculation of Noise Suppression Gain Based on Level Model

图2是图示了根据一些实施例的基于噪声电平模型106的噪声抑制增益G_L(b，f)计算的图。纵轴是增益(以dB为单位)，并且横轴是电平(以dB为单位)。在图2中，G₀是最大抑制增益，并且增益斜坡的斜率β和k是调整参数：FIG. 2 is a graph illustrating the calculation of the noise suppression gain _GL (b,f) based on the noise level model 106 according to some embodiments. The vertical axis is gain (in dB), and the horizontal axis is level (in dB). In Figure 2, _G0 is the maximum suppression gain, and the slope β and k of the gain ramp are tuning parameters:

如果当前的信噪电平大于预定义的信噪比阈值(参见等式[19])，则噪声抑制增益为G_L，否则使用语音/噪声方向性模型105来计算G_S。所述系统使用如下的至少两种方法之一来计算针对频带f和块/帧t的语音概率：If the current signal-to-noise level is greater than a predefined signal-to-noise ratio threshold (see equation [19]), the noise suppression gain is _GL , otherwise the speech/noise directivity model 105 is used to calculate _Gs . The system calculates speech probabilities for frequency band f and block/frame t using one of at least two methods as follows:

1.使用归一化麦克风协方差向量v_t，f的均值和语音/噪声方向性模型105来计算语音的概率：1. Calculate the probability of speech using the normalized microphone covariance vector v _{t, the mean of f} and the speech/noise directionality model 105:

2.将均值和向量v_t，f用作方向性模型(假设向量v_t，f的元素是联合高斯的(jointGaussian))来计算语音的概率：2. Use the mean and the vector v _{t, f} as a directional model (assuming that the elements of the vector v _{t, f} are joint Gaussian) to calculate the probability of speech:

其中，是多变量联合高斯概率密度函数。in, is the multivariate joint Gaussian probability density function.

基于方向性模型的噪声抑制增益计算Calculation of Noise Suppression Gain Based on Directional Model

图3是图示了基于语音/噪声方向性模型105的噪声抑制增益计算的图。纵轴是增益(以dB为单位)，横轴是语音的概率，并且噪声抑制增益由下式给出：FIG. 3 is a graph illustrating noise suppression gain calculation based on the speech/noise directivity model 105 . The vertical axis is the gain (in dB), the horizontal axis is the probability of speech, and the noise suppression gain is given by:

其中，γ和k_s是调整参数。where γ and _ks are tuning parameters.

针对每个频带f和每个块/帧t，最终的抑制增益G(t，f)(例如通过噪声抑制增益单元107)如下计算：For each frequency band f and each block/frame t, the final suppression gain G(t,f) (e.g. by the noise suppression gain unit 107) is calculated as follows:

其中，SNR₀是预定义的信噪比阈值，并且是块/帧t的所估计的信噪比。where SNR ₀ is the predefined signal-to-noise ratio threshold, and is the estimated signal-to-noise ratio for block/frame t.

在一些实施例中，[19]中所估计的可以通过使用VAD(语音活动检测器)输出进行以下操作来获得：驱动自动增益控制(automatic gain control,AGC)部件(未示出)将语音信号调平至预定义的功率电平(以dB为单位)，并从预定义的功率电平中减去所估计的噪声电平μ_L(t，f)，并使用估计噪声所用的类似方法来估计语音电平，其中：In some embodiments, the estimated in [19] This can be obtained by using the VAD (Voice Activity Detector) output to drive an automatic gain control (AGC) component (not shown) to level the speech signal to a predefined power level (in dB units), and subtract the estimated noise level _μL (t,f) from the predefined power level, and estimate the speech level using a method similar to that used for estimating noise, where:

μ_S(t，f)＝[1-wp_s(t，f)]·μ_S(t-1，f)+wp_n(t，f)·L_t，f，并且 [20]μ _S (t, f) = [1-wp _s (t, f)] · μ _S (t-1, f) + wp _n (t, f) · L _{t, f} , and [20]

被计算为： is calculated as:

在乘法单元108中使用最终的抑制增益G(t,f)来缩放由滤波器组101输出的子带，以抑制每个子带中的噪声。例如，将增益G(t,f)应用于属于频带f的所有子带k：The subbands output by the filter bank 101 are scaled using the final suppression gain G(t,f) in a multiplication unit 108 to suppress noise in each subband. For example, applying the gain G(t,f) to all subbands k belonging to frequency band f:

Y_t(k)＝X_t(k)*G(t,f)对于k∈B(f)， [22] _Yt (k)= _Xt (k)*G(t,f) for k∈B(f), [22]

其中，Y_t(k)是乘法单元108的输出。where Y _t (k) is the output of the multiplication unit 108 .

然后，逆滤波器组109将乘法单元108的输出转换为时域输出音频信号，以得到经噪声抑制的输出信号。The inverse filter bank 109 then converts the output of the multiplication unit 108 into a time-domain output audio signal to obtain a noise-suppressed output signal.

示例过程example process

图4A和图4B是根据一些实施例的使用方向性和深度神经网络进行语音通信中的噪声估计和抑制的过程400的流程图。过程400可以使用图5所示的系统500来实施。4A and 4B are flowcharts of a process 400 for noise estimation and suppression in speech communications using directional and deep neural networks, according to some embodiments. Process 400 may be implemented using system 500 shown in FIG. 5 .

过程400开始于接收包括一定数量的块/帧的输入音频信号(401)。针对每个块/帧，过程400继续以下过程：将块/帧转换为子带(402)，其中，每个子带具有与其他子带不同的频谱；将子带组合为频带并确定每个频带中的功率(403)；并且基于子带确定麦克风协方差(404)。Process 400 begins by receiving an input audio signal comprising a certain number of blocks/frames (401). For each block/frame, the process 400 continues with the following process: converting the block/frame into subbands (402), where each subband has a different frequency spectrum than the other subbands; combining the subbands into frequency bands and determining each frequency band power in (403); and determine microphone covariance based on the subbands (404).

过程400继续以下过程：针对每个频带和每个块/帧，使用机器学习分类器(例如，神经网络)来估计语音和噪声的相应概率(405)；使用方向性模型基于麦克风协方差和概率来估计语音和噪声的一组均值或语音和噪声的一组均值和协方差(406)；使用电平模型基于概率和频带功率来估计噪声功率的均值和方差(407)；基于方向性模型的第一输出来计算第一噪声抑制增益(408)；基于电平模型的第二输出来确定第二噪声抑制增益(409)；并且基于输入音频信号的信噪比，选择第一噪声抑制增益或第二噪声抑制增益之一、或者第一噪声抑制增益与第二噪声抑制增益之和(410)。Process 400 continues with the following process: For each frequency band and each block/frame, use a machine learning classifier (e.g., a neural network) to estimate the corresponding probabilities of speech and noise (405); use a directional model based on microphone covariance and probability to estimate a set of mean values of speech and noise or a set of mean values and covariance (406) of speech and noise; use the level model to estimate the mean value and variance (407) of noise power based on probability and frequency band power; first output to calculate a first noise suppression gain (408); based on a second output of the level model to determine a second noise suppression gain (409); and based on the signal-to-noise ratio of the input audio signal, select the first noise suppression gain or One of the second noise suppression gains, or the sum of the first noise suppression gain and the second noise suppression gain (410).

过程400继续以下过程：通过用于每个频带的所选择的第一噪声抑制增益或第二噪声抑制增益来缩放所述频带中的每个子带(411)，并且将经缩放的子带转换为输出音频信号(412)。Process 400 continues by scaling each subband in each frequency band by the selected first noise suppression gain or second noise suppression gain for that band (411), and converting the scaled subbands to An audio signal is output (412).

示例系统架构Example system architecture

图5示出了根据实施例的用于实施参考图1至图5描述的特征和过程的示例系统的框图。系统500包括能够播放音频的任何设备，包括但不限于：智能电话、平板计算机、可穿戴计算机、车载计算机、游戏控制台、环绕系统、信息亭。FIG. 5 shows a block diagram of an example system for implementing the features and processes described with reference to FIGS. 1-5 , according to an embodiment. System 500 includes any device capable of playing audio, including but not limited to: smartphones, tablet computers, wearable computers, vehicle computers, game consoles, surround systems, kiosks.

如所示出的，系统500包括中央处理单元(CPU)501，所述中央处理单元能够根据存储在例如只读存储器(ROM)502中的程序或者从例如存储单元508加载到随机存取存储器(RAM)503的程序来执行各种进程。在RAM 503中，根据需要，还存储当CPU 501执行各种进程时所需的数据。CPU 501、ROM 502和RAM 503经由总线509相互连接。输入/输出(I/O)接口505也连接到总线504。As shown, system 500 includes a central processing unit (CPU) 501 capable of being loaded into random access memory ( RAM) 503 programs to execute various processes. In the RAM 503, data required when the CPU 501 executes various processes is also stored as necessary. The CPU 501 , ROM 502 , and RAM 503 are connected to each other via a bus 509 . An input/output (I/O) interface 505 is also connected to the bus 504 .

以下部件连接到I/O接口505：输入单元506，其可以包括键盘、鼠标等；输出单元507，其可以包括如液晶显示器(LCD)等显示器以及一个或多个扬声器；存储单元508，其包括硬盘或另一种合适的存储设备；以及通信单元509，其包括如网卡(例如，有线网卡或无线网卡)等网络接口卡。The following components are connected to the I/O interface 505: an input unit 506, which may include a keyboard, a mouse, etc.; an output unit 507, which may include a display such as a liquid crystal display (LCD) and one or more speakers; a storage unit 508, which includes a hard disk or another suitable storage device; and a communication unit 509 including a network interface card such as a network card (for example, a wired network card or a wireless network card).

在一些实施方式中，输入单元506包括位于不同位置(取决于主机设备)的一个或多个麦克风，所述一个或多个麦克风使得能够捕获各种格式(例如，单声道、立体声、空间、沉浸式和其他合适的格式)的音频信号。In some implementations, the input unit 506 includes one or more microphones located in different locations (depending on the host device) that enable the capture of various formats (e.g., mono, stereo, spatial, immersive and other suitable formats) audio signal.

在一些实施方式中，输出单元507包括具有各种数量的扬声器的系统。如图5所图示的，输出单元507(取决于主机设备的能力)可以呈现各种格式(例如，单声道、立体声、沉浸式、双耳和其他合适的格式)的音频信号。In some implementations, the output unit 507 includes a system with various numbers of speakers. As illustrated in FIG. 5 , output unit 507 may (depending on the capabilities of the host device) present audio signals in various formats (eg, mono, stereo, immersive, binaural, and other suitable formats).

通信单元509被配置成(例如，经由网络)与其他设备通信。根据需要，驱动器510也连接到I/O接口505。根据需要，如磁盘、光盘、磁光盘、闪存驱动器或其他合适的可移动介质等可移动介质511被安装在驱动器510上，使得从中读取的计算机程序被安装到存储单元508中。本领域技术人员将理解，尽管系统500被描述为包括上文所描述的部件，但是在实际应用中，可以添加、移除和/或替换这些部件中的一些部件，并且所有这些修改或变更都落入本公开的范围内。The communication unit 509 is configured to communicate with other devices (eg, via a network). A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, optical disk, magneto-optical disk, flash drive, or other suitable removable medium is mounted on drive 510 such that a computer program read therefrom is installed into storage unit 508 as desired. Those skilled in the art will appreciate that although the system 500 is described as including the components described above, in practice, some of these components may be added, removed and/or substituted, and all such modifications or changes are fall within the scope of this disclosure.

本文所描述的系统的各方面可以在适当的基于计算机的声音处理网络环境中实施，以便处理数字或数字化音频文件。自适应音频系统的部分可以包括一个或多个网络，所述网络包括任何期望数量的独立机器，所述机器包括用于缓冲和路由在计算机之间传输的数据的一个或多个路由器(未示出)。这种网络可以在各种不同的网络协议上构建，并且可以是因特网、广域网(WAN)、局域网(LAN)或其任何组合。Aspects of the systems described herein may be implemented in a suitable computer-based sound processing network environment for processing digital or digitized audio files. Portions of an adaptive audio system may include one or more networks comprising any desired number of independent machines including one or more routers (not shown) for buffering and routing data transmitted between computers. out). Such a network can be built on a variety of different network protocols and can be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.

根据本公开的示例实施例，上文所描述的过程可以实施为计算机软件程序或者在计算机可读存储介质上实施。例如，本公开的实施例包括计算机程序产品，所述计算机程序产品包括有形地体现在机器可读介质上的计算机程序，所述计算机程序包括用于执行方法的程序代码。在这种实施例中，计算机程序可以经由通信单元509从网络下载和安装，和/或从可移动介质511安装，如图5所示。According to example embodiments of the present disclosure, the processes described above may be implemented as a computer software program or on a computer-readable storage medium. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing a method. In such an embodiment, the computer program may be downloaded and installed from a network via the communication unit 509, and/or installed from a removable medium 511, as shown in FIG.

通常，本公开的各种示例实施例可以以硬件或专用电路(例如，控制电路)、软件、逻辑或其任何组合来实施。例如，上文所讨论的单元可以由控制电路(例如，与图5的其他部件组合的CPU)执行，因此，控制电路可以执行本公开中描述的动作。一些方面可以以硬件来实施，而其他方面可以以可以由控制器、微处理器或其他计算设备执行的固件或软件(例如，控制电路)来实施。尽管本公开的示例实施例的各个方面被图示和描述为框图、流程图或使用一些其他图形表示，但将理解的是，作为非限制性示例，本文所描述的框、装置、系统、技术或方法可以以硬件、软件、固件、专用电路或逻辑、通用硬件或控制器、或其他计算设备、或其某种组合来实施。In general, various example embodiments of the present disclosure may be implemented in hardware or special purpose circuits (eg, control circuits), software, logic or any combination thereof. For example, the units discussed above may be executed by control circuitry (eg, a CPU combined with other components of FIG. 5 ), and thus the control circuitry may perform the actions described in this disclosure. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software (eg, control circuitry) that may be executed by a controller, microprocessor or other computing device. Although various aspects of the example embodiments of the present disclosure have been illustrated and described as block diagrams, flowcharts, or using some other graphical representation, it will be understood that, by way of non-limiting example, the blocks, devices, systems, techniques described herein Or a method may be implemented in hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controllers, or other computing devices, or some combination thereof.

另外，流程图中所示的各个框可以被视为方法步骤、和/或由计算机程序代码的操作产生的操作、和/或被构造为执行相关联的(多个)功能的多个耦接逻辑电路元件。例如，本公开的实施例包括计算机程序产品，所述计算机程序产品包括有形地体现在机器可读介质上的计算机程序，所述计算机程序包含被配置为执行上文所描述的方法的程序代码。In addition, the various blocks shown in the flowcharts may be regarded as method steps, and/or operations resulting from operation of computer program code, and/or multiple couplings configured to perform the associated function(s) Logic circuit elements. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code configured to perform the methods described above.

在本公开的上下文中，机器可读介质可以是可以包含或存储用于由指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合使用的程序的任何有形介质。机器可读介质可以是机器可读信号介质或机器可读存储介质。机器可读介质可以是非暂态的并且可以包括但不限于电子、磁性、光学、电磁、红外或半导体系统、装置或设备、或前述各项的任何合适的组合。机器可读存储介质的更具体的示例将包括具有一条或多条导线的电连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或闪速存储器)、光纤、便携式致密盘只读存储器(CD-ROM)、光存储设备、磁存储设备、或者前述各项的任何合适的组合。In the context of the present disclosure, a machine-readable medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may be non-transitory and may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include electrical connections having one or more conductors, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the foregoing.

用于执行本公开的方法的计算机程序代码可以用一种或多种编程语言的任何组合来编写。这些计算机程序代码可以被提供给通用计算机、专用计算机或具有控制电路的其他可编程数据处理装置的处理器，使得程序代码在由计算机的处理器或其他可编程数据处理装置的处理器执行时，实施流程图和/或框图中指定的功能/操作。程序代码可以完全在计算机上执行，部分在计算机上执行，作为独立的软件包，部分在计算机上执行、并且部分在远程计算机上执行，或者完全在远程计算机或服务器上执行，或者分布在一个或多个远程计算机和/或服务器上。Computer program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. These computer program codes may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus with a control circuit, so that when the program code is executed by the processor of the computer or other programmable data processing apparatus, Implement the functions/operations specified in the flowchart and/or block diagrams. The program code may execute entirely on the computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on a remote computer or server, or distributed over one or on multiple remote computers and/or servers.

虽然本文档包含许多具体实施方式细节，但是这些细节不应被解释为对可能要求保护的事物的范围的限制，而是被解释为对可能特定于特定实施例的特征的描述。本说明书中在单独实施例的上下文中描述的特定特征还可以在单个实施例中以组合形式实施。相反，在单一实施例的上下文中描述的各种特征也可以被单独地或以任何适合的子组合的方式实施在多个实施例中。此外，尽管特征在上文可以被描述为以某些组合起作用并且甚至最初也是如此要求保护的，但是在一些情况下可以从组合中去除要求保护的组合的一个或多个特征，并且所要求保护的组合可以涉及子组合或子组合的变体。在附图中描绘的逻辑流程不需要所示出的特定顺序或者有序顺序来实现期望的结果。另外，可以从所描述的流程中提供其他步骤，或者可以从中删除步骤，并且可以向所描述的系统添加其他部件，或者从中去除其他部件。因此，其他实施方式在所附权利要求的范围内。While this document contains many implementation details, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as functioning in certain combinations and even initially claimed as such, in some cases one or more features of the claimed combination may be removed from the combination and the claimed A protected combination may involve a subcombination or a variation of a subcombination. The logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be deleted, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.

Claims

1. An audio processing method, comprising:

receiving, using at least one processor, frequency bands of a power spectrum of an input audio signal and a microphone covariance for each frequency band, wherein the microphone covariance is based on a configuration of microphones used to capture the input audio signal;

For each frequency band:

use machine learning classifiers to estimate the corresponding probabilities of speech and noise;

estimating a set of means for speech and noise, or a set of means and covariance for speech and noise, using a directional model based on the microphone covariance and the probability for the frequency band;

estimating a mean and covariance of noise power based on the probability and the power spectrum using a level model;

determining, using the at least one processor, a first noise suppression gain based on a first output of the directional model;

determining a second noise suppression gain based on a second output of the level model using the at least one processor;

Using the at least one processor to select one of the first noise suppression gain or the second noise suppression gain, or the combination of the first noise suppression gain and the second noise suppression gain, based on a signal-to-noise ratio of the input audio signal The sum of two noise suppression gains;

scaling a time-frequency representation of the input signal using the at least one processor by the selected first noise suppression gain or second noise suppression gain for the frequency band; and

The time-frequency representation is converted to an output audio signal using the at least one processor.

2. The method of claim 1, further comprising:

receiving an input audio signal comprising a certain number of blocks/frames using the at least one processor;

For each block/frame:

converting said blocks/frames into subbands, each subband having a different frequency spectrum than other subbands, using said at least one processor;

combining the subbands into frequency bands using the at least one processor; and

Part-band power is determined using the at least one processor.

3. The method of claim 1, wherein the machine learning classifier is a neural network.

4. The method of claim 1 or 2, wherein the microphone covariance is represented as a normalized vector.

5. The method of any one of the preceding claims 1 to 4, wherein determining the first noise suppression gain comprises:

calculating a probability of speech for the frequency band;

If the probability for speech in the frequency band is less than a threshold, setting the first noise suppression gain equal to a maximum suppression gain; and

The first noise suppression gain is set by gradually changing from the maximum suppression gain towards zero to increase the probability of speech.

6. The method of claim 5, wherein the probability of speech is calculated using the set of means and the covariance of speech and noise estimated by the directional model.

7. The method of claim 5, wherein the set of means and the covariance of speech and noise estimated by the directional model and a multivariate joint Gaussian density function are used to calculate the said probability.

8. The method of any one of the preceding claims 1 to 7, wherein determining the second noise suppression gain comprises:

if said band power is less than a first threshold, setting said second noise suppression gain equal to a maximum suppression gain;

The noise suppression gain is set by gradually changing from the maximum suppression gain towards zero to increase the probability of speech.

9. A method as claimed in any one of the preceding claims 1 to 8, wherein said estimating using said directional model uses time-frequency slices classified as speech and noise, but excludes time-frequency slices classified as reverberation time-frequency slice.

10. The method according to any one of the preceding claims 1 to 9, wherein, based on the microphone covariance of the frequency band and the probability of speech, a directional model or a level model is used to estimate the mean value of speech further include:

A time-averaged estimate of the mean of speech is computed using a first-order low-pass filter, wherein the mean of speech and the microphone covariance are inputs to the filter; The input to the filter is weighted.

11. The method according to any one of the preceding claims 1 to 10, wherein, based on the microphone covariance of the frequency band and the probability of noise, a directional model or a level model is used to estimate the mean value of noise further include:

A time-averaged estimate of the mean of noise is computed using a first-order low-pass filter, wherein the mean of noise and the microphone covariance vector are inputs to the filter; and the probability of passing noise contributes to the The input to the filter is weighted.

12. The method according to any one of the preceding claims 1 to 11, wherein the covariance of speech is estimated using a directional model or a level model based on the microphone covariance of the frequency band and the probability of speech Further includes:

Computing a time-averaged estimate of said covariance of speech using a first-order low-pass filter, wherein said covariance of speech and said microphone covariance vector are inputs to said filter; and said probabilities of passing speech are The input to the filter is weighted.

13. The method according to any one of the preceding claims 1 to 12, wherein the covariance of noise is estimated using a directional model or a level model based on the microphone covariance of the frequency band and the probability of noise Further includes:

Computing a time-averaged estimate of said covariance of noise using a first-order low-pass filter, wherein said covariance of noise and said microphone covariance vector are inputs to said filter; and said probability of passing noise The input to the filter is weighted.

14. A system comprising:

one or more computer processors; and

A non-transitory computer readable medium storing instructions which, when executed by the one or more computer processors, cause the one or more processors to perform any one of claims 1 to 13 operation.

15. A non-transitory computer readable medium storing instructions which, when executed by one or more computer processors, cause the one or more processors to perform any one of claims 1 to 13 described operation.