CN118974825A - Source separation combining spatial cues and source cues - Google Patents
Abstract
The present disclosure relates to an audio processing method and system for source separation. The method comprises obtaining an input audio signal (A) comprising at least two channels, and processing the input audio signal (A) with a spatial cue-based separation module (10) to obtain an intermediate audio signal (B). The spatial cue-based separation module (10) is configured to determine mixing parameters of the at least two channels of the input audio signal (A) and to modify the channels based on the mixing parameters to obtain the intermediate audio signal (B). The method further comprises processing the intermediate audio signal (B) with a source cue-based separation module (20) to generate an output audio signal (C), wherein the source cue-based separation module (20) is configured to implement a neural network trained to predict a noise-reduced output audio signal (C) given the intermediate audio signal (B).
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to U.S. Provisional Application No. 63/325,108, filed March 29, 2022, U.S. Provisional Application No. 63/417,273, filed October 18, 2022, and U.S. Provisional Application No. 63/482,949, filed February 2, 2023, each of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present invention relates to a method and an audio processing system for source separation based on spatial cues and source cues.
BACKGROUND ART
Source separation in audio processing relates to systems and methods for isolating a target audio source (e.g., speech or music) present in an original audio signal that comprises a mixture of the target audio source and additional audio content, such as stationary or non-stationary noise, background audio, or reverberation effects.
There are two main types of target separation processing: spatial cue-based separation, which exploits spatial cues (information describing how the target audio is mixed), and source cue-based separation, which exploits source cues (information describing what the target audio sounds like).
A simple example of spatial cue-based separation is the extraction of speech from the 5.1 soundtrack of a movie. The spatial cue used for this separation is that speech or dialogue is usually mixed to the center (C) channel, so a spatial separation system only needs to extract the center channel to obtain a spatially separated dialogue channel. Alternatively, spatial cue-based separation involves amplifying the center channel, or mixing the center channel with the other channels of the 5.1 presentation, to obtain a 5.1 presentation with higher dialogue intelligibility.
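As an illustrative sketch (not part of the claimed method), the center-channel extraction and dialogue boosting described above might look as follows; the channel ordering (L, R, C, LFE, Ls, Rs) and the 6 dB boost are assumptions made for this example:

```python
import numpy as np

def extract_dialogue(mix, center_index=2, boost_db=6.0):
    """mix: float array of shape (channels, samples) holding a 5.1
    presentation. center_index and boost_db are illustrative assumptions
    (a channel order of L, R, C, LFE, Ls, Rs is assumed here)."""
    dialogue = mix[center_index].copy()  # spatial cue: dialogue sits in C
    boosted = mix.copy()
    boosted[center_index] *= 10.0 ** (boost_db / 20.0)  # alternative: louder C
    return dialogue, boosted
```

The first return value corresponds to extracting the center channel; the second to the alternative of amplifying it within the 5.1 presentation.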
A simple example of source cue-based separation is a bandpass filter whose passband is adapted to match the expected frequency range of the target audio source. If the target audio source is speech, a bandpass filter with a passband of 500 Hz to 8 kHz can be used, since most of the spectral energy of human speech is expected to lie in this frequency range. More advanced source cue-based separation systems operate on audio signals represented in the time-frequency domain and employ a neural network trained to predict a gain for each time-frequency tile of the audio signal, where the gains suppress all audio content that does not belong to the target audio source.
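A minimal sketch of the bandpass example, implemented here as a windowed-sinc FIR filter (the filter length and Hamming window are illustrative choices, not taken from the text):

```python
import numpy as np

def speech_bandpass(x, fs=48000, lo=500.0, hi=8000.0, numtaps=511):
    """Crude source cue-based separator: keep only the band where the
    target source (speech) is expected to carry most of its energy."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    # ideal bandpass = ideal lowpass at hi minus ideal lowpass at lo
    h = (2 * hi / fs) * np.sinc(2 * hi * n / fs) \
        - (2 * lo / fs) * np.sinc(2 * lo * n / fs)
    h *= np.hamming(numtaps)  # taper to reduce sidelobe ripple
    return np.convolve(x, h, mode="same")
```

A 1 kHz tone (inside the passband) passes almost unchanged, while a 50 Hz tone (outside it) is strongly attenuated.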
SUMMARY OF THE INVENTION
A problem with the above solutions is that the source cue-based separation process completely ignores spatial cues, and the spatial cue-based separation process completely ignores source cues, meaning that not all available information is considered when performing target source separation. On the other hand, combining different source separation processes is not trivial; in many cases, combining two or more different target source separation processes degrades performance compared to using only a single target source separation process.
Accordingly, there is a need for an improved target separation method and system that overcomes at least some of the drawbacks mentioned above.
According to a first aspect of the present invention, there is provided an audio processing method for source separation. The method comprises obtaining an input audio signal comprising at least two channels, and processing the input audio signal with a spatial cue-based separation module to obtain an intermediate audio signal. The spatial cue-based separation module is configured to determine mixing parameters of the at least two channels of the input audio signal and to modify the at least two channels based on the mixing parameters to obtain the intermediate audio signal. The method further comprises processing the intermediate audio signal with a source cue-based separation module to generate an output audio signal, wherein the source cue-based separation module is configured to implement a neural network trained to predict a noise-reduced output audio signal given samples of the intermediate audio signal.
The noise that the source cue-based separation module is configured to remove is at least one of stationary noise (such as white noise), non-stationary noise (including time-varying noise such as traffic noise or wind noise), background audio content (e.g., speech from sources other than the target speaker), and reverberation.
In other words, the spatial cue-based separation module is configured to separate audio content based on how it is mixed, while the source cue-based separation module is configured to separate audio content based on how it sounds.
By first performing spatial cue-based source separation using the mixing parameters, and then performing neural network-based source cue-based separation, the overall performance of the source separation method is improved. In particular, since the neural network of the source cue-based separation can be trained specifically to operate on spatially separated audio sources, and the preceding spatial cue-based separation module achieves exactly this spatial separation, the performance of the source cue-based separation module is enhanced. In one example, the spatial cue-based separation module modifies the input audio signal to approach a center-panned mix, which approximates a single channel, while the source cue-based separation module is trained to suppress noise in center-panned audio signals.
In some implementations, the spatial cue-based separation module operates at a first time and/or frequency resolution, and the method further comprises providing, by the spatial cue-based separation module, metadata to the source cue-based separation module, wherein the metadata indicates the time and/or frequency resolution of the spatial cue-based separation module. The method further comprises generating, by the source cue-based separation module, the output audio signal based on the intermediate audio signal and the metadata.
For example, the time and/or frequency resolution of the source cue-based separation module is reduced to match the time and/or frequency resolution of the spatial cue-based separation module. In some examples, this is achieved by processing the output of the source cue-based separation module with a smoothing window and/or smoothing kernel. If the time and/or frequency metadata were not taken into account, the two separation modules would operate independently at different resolutions, which may lead to audible acoustic artifacts.
在一些实施方式中,基于源提示的分离模块预测源增益掩码,所述源增益掩码被应用于中间音频信号以抑制噪声。时间和/或频率分辨率元数据可以用于对所述增益掩码进行平滑,以形成平滑的增益掩码,所述平滑的增益掩码被应用于中间音频信号。其中,平滑程度(即,分辨率的降低)基于时间和/或频率元数据。In some embodiments, the separation module based on the source cue predicts a source gain mask, which is applied to the intermediate audio signal to suppress noise. Time and/or frequency resolution metadata can be used to smooth the gain mask to form a smoothed gain mask, which is applied to the intermediate audio signal. Wherein the degree of smoothing (i.e., the reduction in resolution) is based on the time and/or frequency metadata.
In some implementations, the spatial cue-based separation module determines the mixing parameters at a time and/or frequency resolution that is lower (coarser) than that of the source cue-based separation module, preferably at least two times lower, more preferably at least four times lower, even more preferably at least six times lower, and most preferably at least eight times lower.
That is, the time and/or frequency resolutions of the two modules may differ substantially, because the time and/or frequency resolution best suited for spatial cue-based separation differs substantially from the corresponding resolution used to perform source cue-based separation.
According to a second aspect of the present invention, there is provided a system for source separation, comprising a spatial cue-based separation module configured to obtain an input audio signal comprising at least two channels and to process the input audio signal to obtain an intermediate audio signal, wherein the spatial cue-based separation module is configured to determine mixing parameters of the at least two channels of the input audio signal and to modify the at least two channels based on the mixing parameters to obtain the intermediate audio signal. The system further comprises a source cue-based separation module configured to process the intermediate audio signal to generate an output audio signal by implementing a neural network trained to predict a noise-reduced output audio signal given samples of the intermediate audio signal.
The system according to the second aspect has the same or equivalent benefits as the method according to the first aspect. Any functionality described with respect to the method may have a corresponding feature in the system or device, and vice versa.
BRIEF DESCRIPTION OF THE DRAWINGS
Aspects of the invention will be described in more detail with reference to the accompanying drawings, which show currently preferred embodiments.
FIG. 1 is a block diagram of an audio processing system for source separation according to some implementations.
FIG. 2 is a block diagram illustrating an audio processing system implementing source separation and remixing of the input audio signal according to some implementations.
FIG. 3 is a flowchart describing an audio processing method for source separation according to some implementations.
FIG. 4 is a block diagram illustrating an audio processing system for source separation with a source cue-based separation module that predicts a source separation gain mask, according to some implementations.
FIG. 5 is a block diagram illustrating an audio processing system for source separation cooperating with a classifier unit and a gating unit, according to some implementations.
DETAILED DESCRIPTION
The systems and methods disclosed in this application may be implemented as software, firmware, hardware, or a combination thereof. In a hardware implementation, the division of tasks does not necessarily correspond to the division into physical units; rather, one physical component may have multiple functions, and one task may be carried out cooperatively by several physical components. The computer hardware may, for example, be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch, or bridge, or any machine capable of executing (sequentially or otherwise) instructions that specify actions to be taken by that machine. Further, this disclosure contemplates any collection of computer hardware that individually or jointly executes instructions to perform any one or more of the concepts discussed herein.
Some or all of the components may be implemented by one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions that, when executed by the one or more processors, carry out at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) specifying actions to be taken is included. Thus, one example is a typical processing system (i.e., computer hardware) comprising one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system may further include a memory subsystem comprising a hard drive, SSD, RAM, and/or ROM. A bus subsystem may be included for communication between the components. The software may reside in the memory subsystem and/or within the processor during its execution by the computer system.
The one or more processors may operate as standalone devices or may be connected, e.g., networked, to other processor(s). Such a network may be built on any of various network protocols and may be the Internet, a wide area network (WAN), a local area network (LAN), or any combination thereof.
The software may be distributed on computer-readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those skilled in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, physical (non-transitory) storage media in various forms, such as EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Further, it is well known to the skilled person that (transitory) communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.
FIG. 1 depicts a source separation audio processing system 1 for performing source separation based on both spatial cues and source cues. The audio processing system 1 obtains an input audio signal A, which is provided to a spatial cue-based separation module 10. The spatial cue-based separation module 10 processes the input audio signal A and outputs an intermediate audio signal B.
The input audio signal A comprises at least two audio channels. For example, the input audio signal A is a stereo or binaural audio signal with a left audio channel and a right audio channel. The spatial cue-based separation module 10 is configured to extract at least one mixing parameter of the input audio signal A and to modify the at least two audio channels based on the at least one mixing parameter to obtain the intermediate audio signal B.
A mixing parameter indicates a property of the mixture of the at least two audio channels. One or more mixing parameters may be determined for a single frequency band or for multiple frequency bands, and these parameters are updated regularly. For example, the audio signal is divided into multiple consecutive (optionally overlapping) blocks, and a mixing parameter is determined by aggregating fine-grained mixing parameters over at least one block frequency band. In some implementations, the mixing parameters indicate at least one of a panning distribution of the at least two channels and an inter-channel phase difference distribution (e.g., its mean or median) of the at least two audio channels within a block frequency band. A block comprises at least two frames, where each frame is in turn divided into multiple tiles covering narrow frequency bands, as described further below.
The processing performed by the spatial cue-based separation module 10 may entail adjusting the at least two audio channels, based on the detected mixing parameters, to approach a predetermined mix type. In some implementations, at least two different mixing parameters (e.g., both a panning distribution and an inter-channel phase difference distribution) are determined and used when adjusting the mix. An example of this is presented below, where four mixing parameters Θ-center, Θ-width, Φ-center, and Φ-width are determined and used to adjust the mix. The predetermined mix type is selected based on the capabilities of the subsequent source cue-based separation module 20. For example, the predetermined mix type may be an approximately center-panned mix and/or a mix with almost no inter-channel phase difference.
For example, the subsequent source cue-based separation module 20 may be configured to process a downmixed version of the intermediate audio signal B having at least two channels. In some implementations, the source cue-based separation module 20 first extracts a downmixed mid signal from the at least two channels of the intermediate audio signal B, analyzes the downmixed mid signal to determine mask gains that suppress noise in the downmixed mid signal, and applies the mask gains to the channels of the intermediate audio signal B. To this end, the already spatially separated intermediate audio signal B is center-panned and/or contains almost no inter-channel phase difference, and is therefore well suited for processing with this type of source cue-based separation module 20, since, for example, most of the desired content will be included in the downmixed mid signal. By comparison, if the intermediate audio signal B were not spatially separated, there would be a risk that relevant audio content is excluded from the downmix and is not properly considered by the neural network of the source cue-based separation module 20.
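One possible shape of such a module, sketched under stated assumptions: the mid downmix uses the common (L + R)/√2 convention, and the trained network is stubbed out by a caller-supplied `predict_mask` function, since the actual network is not specified in the text:

```python
import numpy as np

def mid_downmix_separate(b_left, b_right, predict_mask):
    """b_left/b_right: complex STFT tiles of the intermediate signal B,
    shape [frames, bins]. predict_mask: stand-in for the trained network,
    mapping mid-signal magnitudes to per-tile gains in [0, 1]."""
    mid = (b_left + b_right) / np.sqrt(2)  # downmixed mid signal (assumed convention)
    gains = predict_mask(np.abs(mid))      # source cue-based gain mask
    # the same gains are applied to both channels of B
    return b_left * gains, b_right * gains
```

With a center-panned B, the desired content lands fully in `mid`, which is what makes this downmix-based analysis effective.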
The spatial cue-based separation module 10 may operate in a transform domain (such as the short-time Fourier transform (STFT) domain or the quadrature mirror filter bank (QMF) domain) or in the time domain (such as the waveform domain). In either case, the input audio signal A comprises at least two audio channels, such as the left channel L and the right channel R of a stereo audio signal. However, the audio channels are not necessarily a left channel L and a right channel R; they may, for example, be the left channel L and the center channel C of a 5.1 presentation, the center channel C and the right channel R of a 5.1 presentation, or any selection of two audio channels of an arbitrary presentation. Moreover, by "the input audio signal comprises at least two audio channels", input here refers to any audio input with multiple signals, not only signals conventionally referred to as "channels". For example, the signals of the input audio signal may comprise surround audio channels, multi-track signals, higher-order Ambisonics signals, object audio signals, and/or immersive audio signals. The input audio signal A may be divided into multiple consecutive time-domain frames, where each frame is further divided into multiple tiles, each covering a narrow frequency band, giving a fine-grained tile representation. The tiles are sometimes referred to as time-frequency tiles and, as an example, each tile covers an individual STFT frequency bin. Thus, each tile represents a limited duration of the audio signal in a predetermined narrow frequency band. Compared to a block comprising all tiles of at least two consecutive frames, each fine-grained time-frequency tile represents a very short duration and/or a very narrow frequency band (e.g., roughly one or more orders of magnitude shorter and/or narrower).
The frequency band covered by a tile is typically quite narrow, e.g., on the order of 10 Hz, and the duration covered by each tile or frame is also quite short, e.g., on the order of 20 ms. A block (comprising at least two consecutive frames) covers a longer duration (e.g., 10 consecutive frames), and it is also envisaged that a block may be divided into block frequency bands, where a block frequency band is wider than the band covered by an individual tile. For example, blocks may be implemented with block frequency bands of, e.g., 400 Hz to 800 Hz, 800 Hz to 1600 Hz, 1600 Hz to 3200 Hz, etc., which are much wider than the narrow bands covered by each tile.
In a first example of the operation of the spatial cue-based separation module 10, the module first detects fine-grained mixing parameters for each tile (e.g., STFT tile) of the input audio signal A. Secondly, the spatial cue-based separation module 10 determines the distribution(s) of the fine-grained mixing parameters over multiple tiles and modifies the channels based on the distribution(s) of the fine-grained mixing parameters. In describing this and other examples, it will be assumed that the audio channels are the left channel L and the right channel R; however, the same processing may be applied to any pair of audio channels as mentioned above, e.g., an LC (left and center) pair, an RC (right and center) pair, or an Ls-Rs (left surround and right surround) pair.
For each fine-grained tile, the detected panning mixing parameter Θ of the left audio channel L and the right audio channel R can be determined as

Θ = arctan(|R| / |L|), (Equation 1)

where Θ ranges from 0 (indicating a fully left-panned audio signal) to π/2 (indicating a fully right-panned audio signal), with Θ = π/4 indicating a center-panned audio signal.
Similarly, for each fine-grained tile, the detected inter-channel phase difference mixing parameter Φ of the left audio channel L and the right audio channel R can be determined as

Φ = ∠(L · R*), (Equation 2)

where R* denotes the complex conjugate of R and ∠(·) the phase angle, and where Φ ranges from −π to π, with Φ = 0 indicating no inter-channel phase difference between the left audio channel L and the right audio channel R.
Furthermore, the detected signal amplitude mixing parameter (expressed in decibels) can be determined for each fine-grained tile as

U_dB = 10 log10(|L|² + |R|²). (Equation 3)
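A sketch of evaluating the three tile-specific mixing parameters from complex STFT values L and R, using the standard panning-angle and conjugate-product phase-difference forms consistent with the ranges stated above (the small `eps` guard is an implementation assumption):

```python
import numpy as np

def tile_mixing_params(l_tile, r_tile, eps=1e-12):
    """l_tile, r_tile: complex STFT values of one tile (or arrays of tiles).
    Returns (theta, phi, u_db) for Equations 1-3."""
    # Equation 1: 0 = fully left, pi/4 = center, pi/2 = fully right
    theta = np.arctan2(np.abs(r_tile), np.abs(l_tile))
    # Equation 2: inter-channel phase difference in (-pi, pi], 0 = in phase
    phi = np.angle(l_tile * np.conj(r_tile))
    # Equation 3: tile signal amplitude in dB (eps avoids log of zero)
    u_db = 10 * np.log10(np.abs(l_tile) ** 2 + np.abs(r_tile) ** 2 + eps)
    return theta, phi, u_db
```

`arctan2` is used instead of a plain `arctan(|R|/|L|)` so that a silent left channel (|L| = 0) maps cleanly to Θ = π/2.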
The spatial cue-based separation module 10 may detect one or more of the tile-specific mixing parameters of Equations 1, 2, and 3 above and adjust the audio channels to approach, e.g., a center-panned audio signal with no inter-channel phase difference. However, since each tile typically covers a very short duration (e.g., 1 ms to 30 ms, such as 20 ms) and a very narrow frequency range, the tile-specific mixing parameters can vary rapidly over time and/or frequency. To this end, the tile-specific panning values and inter-channel phase differences of multiple tiles are combined, optionally weighted by the tile-specific amplitudes U_dB, to form a panning distribution and/or an inter-channel phase difference distribution over multiple tiles. These distributions can then be updated and used to adjust the channels at regular intervals (much longer than an individual tile or frame) so as to approach the predetermined mix type.
For example, the tiles of multiple frames (e.g., 5 or 10 frames) are aggregated into an audio signal block, where the block comprises between 200 and 300 ms of audio signal content and may be divided into relatively coarse (e.g., octave or half-octave) block frequency bands. In some implementations, a mean panning value, referred to as Θ-center, is determined over all tiles in a block frequency band, together with an associated panning distribution parameter, referred to as Θ-width, indicating the symmetric deviation from Θ-center that captures a predetermined proportion of the total signal energy (e.g., 40% of the energy). Similarly, a mean inter-channel phase difference, referred to as Φ-center, is determined over all tiles in a block frequency band, and an associated inter-channel phase distribution parameter, referred to as Φ-width, is determined for each block frequency band, indicating the symmetric deviation from Φ-center that captures a predetermined proportion of the total signal energy (e.g., 40% of the energy).
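One literal reading of the Θ-center / Θ-width statistics, sketched for the tiles of a single block frequency band (the energy-weighted mean and the symmetric-interval search are illustrative interpretations of the description, not a confirmed implementation):

```python
import numpy as np

def panning_stats(theta, energy, energy_fraction=0.4):
    """theta: per-tile panning angles within a block frequency band.
    energy: per-tile linear energies used as weights.
    Returns (theta_center, theta_width) as interpreted from the text."""
    theta = np.asarray(theta, dtype=float)
    energy = np.asarray(energy, dtype=float)
    # energy-weighted mean panning over the block frequency band
    theta_center = np.sum(theta * energy) / np.sum(energy)
    # grow a symmetric interval around theta_center until it captures
    # the requested fraction of the total energy
    dev = np.abs(theta - theta_center)
    order = np.argsort(dev)
    cum = np.cumsum(energy[order]) / np.sum(energy)
    k = int(np.searchsorted(cum, energy_fraction))
    theta_width = dev[order][min(k, len(theta) - 1)]
    return theta_center, theta_width
```

The Φ-center / Φ-width statistics would follow the same pattern, applied to the per-tile phase differences.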
The modification of the left audio channel L and the right audio channel R may then include adjusting the panning and/or inter-channel phase difference of each block frequency band so as to move Θ-center and/or Φ-center to a predetermined position, for example Θ = 0 and Φ = 0 for a predetermined center-panned mix without inter-channel phase difference. Additionally or alternatively, the modification of the left and right audio channels may entail "squeezing" the corresponding distributions, reducing Θ-width and/or Φ-width to a predetermined width or by a predetermined factor.
In an exemplary embodiment, the spatial-cue-based separation module 10 operates in the STFT domain at a sampling rate of 48 kHz, with frames of 4096 samples, a frame step of 1024 samples (i.e., 75% overlap), and a Hanning window or the square root of a Hanning window. The mixing parameters are determined for each block frequency band, where one block comprises 10 frames (1 current frame, 4 lookahead frames, and 5 lookback frames) and the block step is 5 frames. That is, approximately 277 ms of buffered content in total is considered when determining the mixing parameters in each block frequency band. With 75% overlap between tiles (frames) and a block step of 5 frames, the mixing parameters can be updated once every 5 × 1024 samples (or approximately every 107 ms at a 48 kHz sampling rate), which determines the temporal resolution of the spatial-cue-based separation module 10. In addition, interpolating at least one mixing parameter between blocks is envisaged. For example, the mixing parameters are interpolated once per frame, meaning they are updated every 1024 samples (or approximately every 20 ms at a 48 kHz sampling rate).
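The timing figures in this embodiment follow directly from the frame and block parameters; a brief check (variable names are illustrative):

```python
FS = 48_000            # sampling rate (Hz)
FRAME_LEN = 4096       # samples per STFT frame
FRAME_STEP = 1024      # hop size, i.e. 75% overlap
FRAMES_PER_BLOCK = 10  # 1 current + 4 lookahead + 5 lookback frames
BLOCK_STEP = 5         # block step, in frames

# Signal span covered by one block of 10 overlapping frames:
block_samples = (FRAMES_PER_BLOCK - 1) * FRAME_STEP + FRAME_LEN
block_ms = 1000 * block_samples / FS              # ~277 ms of buffered content

# Mixing-parameter update interval (one block step):
update_ms = 1000 * BLOCK_STEP * FRAME_STEP / FS   # ~107 ms

# With per-frame interpolation, updates occur every frame step:
interp_ms = 1000 * FRAME_STEP / FS                # ~21 ms (the text rounds to ~20 ms)
```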
The frequency resolution of the spatial-cue-based separation module 10 is determined by the number and bandwidth of the block frequency bands into which each block is divided. In an exemplary embodiment, the spatial-cue-based separation module 10 operates on quasi-octave block frequency bands with band edges at 0 Hz, 400 Hz, 800 Hz, 1600 Hz, 3200 Hz, 6400 Hz, 13200 Hz, and 24000 Hz, forming seven frequency bands of different bandwidths, ranging from a 400 Hz bandwidth covering the 0–400 Hz band to a 10800 Hz bandwidth covering the 13200–24000 Hz band.
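A minimal sketch of mapping STFT bins to these quasi-octave block frequency bands (the function name and the half-open band convention are assumptions):

```python
import numpy as np

BAND_EDGES_HZ = [0, 400, 800, 1600, 3200, 6400, 13200, 24000]  # seven bands

def bin_to_band(bin_index, fft_size=4096, fs=48_000):
    """Return the index (0..6) of the block frequency band containing the
    given STFT bin, using half-open bands [edge_i, edge_{i+1})."""
    freq = bin_index * fs / fft_size
    band = np.searchsorted(BAND_EDGES_HZ, freq, side="right") - 1
    return int(min(band, len(BAND_EDGES_HZ) - 2))  # clamp Nyquist into the last band
```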
The above time and/or frequency resolutions of the spatial-cue-based separation module 10 are merely exemplary, and other alternatives are envisaged. For example, fewer or more frames may be combined when forming blocks, and/or the time step/overlap of the tiles (frames) and blocks may vary. In general, however, the spatial-cue-based separation module 10 benefits from operating at a relatively low time/frequency resolution compared with the subsequent source-cue-based separation module 20. For example, the time and/or frequency resolution at which the spatial-cue-based separation module 10 determines the mixing parameters is coarser than the time and/or frequency resolution of the source separation module, such as at least two times, at least four times, at least six times, at least eight times, or at least ten times coarser.
For example, the processing of the spatial-cue-based separation module 10 can be as described in Master, Aaron S., et al., "Dialog Enhancement via Spatio-Level Filtering and Classification", AES Convention Paper 10427.
As a second example of the operation of the spatial-cue-based separation module 10, the panning and/or inter-channel phase difference mixing parameters determined from the plurality of detected fine-grained mixing parameters can be used as a target panning parameter Θ and/or a target phase difference parameter Φ, as described in U.S. Provisional Application No. 63/318,226, "TARGET MID-SIDE SIGNALS FOR AUDIO APPLICATIONS", filed on March 9, 2022, which is incorporated herein by reference in its entirety. Here, the mixing parameters can be used to extract a center-panned target mid audio channel M and a target side audio channel S from the input left audio channel L and the input right audio channel R, as follows
(Equations 4 and 5 are rendered as images in the original document.)
where the target mid audio signal M is intended to capture any dominant audio source in each frequency band. A center-panned intermediate audio signal with left audio channel Lint and right audio channel Rint can then be obtained from the target mid audio channel M and the target side audio channel S as follows
Lint = M + S (Equation 6)
Rint = M − S (Equation 7)
where the dominant audio source of the input audio signal A has been shifted to a center pan, reducing the inter-channel phase difference. Thus, extracting the target mid audio signal M and reconstructing the center-panned pair of left audio channel Lint and right audio channel Rint is another exemplary way in which spatial source separation can be achieved based on one or more mixing parameters.
In some embodiments, the target side audio signal S is ignored (e.g., set to zero) in Equations 6 and 7 when determining the center-panned intermediate audio signal channels Lint, Rint. Since many target audio sources will be fully captured by the target mid audio signal M, the target side audio signal S will mainly contain unwanted audio signal components, meaning that it can be ignored.
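Equations 6 and 7, with the optional zeroing of the side signal, can be sketched as follows (the `keep_side` flag is an illustrative assumption):

```python
import numpy as np

def midside_to_center_panned(M, S, keep_side=True):
    """Recombine the target mid/side channels into a center-panned stereo
    pair per Equations 6 and 7; optionally drop the side signal entirely."""
    M = np.asarray(M, dtype=float)
    S = np.asarray(S, dtype=float) if keep_side else np.zeros_like(M)
    L_int = M + S   # Equation 6
    R_int = M - S   # Equation 7
    return L_int, R_int
```

With `keep_side=False`, both output channels reduce to the target mid signal M, matching the case where S is set to zero.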
In the above, different examples of the operation of the spatial-cue-based separation module 10 have been presented. As these examples illustrate, the spatial-cue-based separation module 10 performs a detection operation and an extraction operation. The detection operation comprises determining at least one detected mixing parameter at a fine-grained time-frequency resolution (e.g., determining at least one detected mixing parameter for each tile), whereas the extraction operation involves smoothing the at least one detected fine-grained mixing parameter over time and/or frequency (e.g., aggregating the fine-grained mixing parameters over the block frequency bands) to obtain a comparatively coarser mixing parameter. The time and/or frequency resolution of the spatial-cue-based separation module 10 is based on the coarser time and/or frequency resolution of the extraction operation. The final adjustment of the mix is then made using the coarser at least one mixing parameter. That is, the detected fine-grained mixing parameters are not used directly to control the mix, as this could introduce significant acoustic artifacts due to rapid adjustments of the mix (e.g., for every STFT tile).
The spatial-cue-based separation module 10 outputs the resulting intermediate audio signal B, which comprises audio content in a spatial mix that is easier for the source-cue-based separation module 20 to process (e.g., a center-panned audio signal with almost no inter-channel phase difference).
The source-cue-based separation module 20 comprises a neural network trained to predict a noise-reduced output audio signal C given samples of the intermediate audio signal B. The neural network has been trained, for example, to recognize target audio content (e.g., speech or music) and amplify it, and/or to recognize undesired audio content (e.g., stationary or non-stationary noise) and attenuate it. To this end, the neural network may comprise multiple neural network layers and may, for example, be a recurrent neural network.
For example, the neural network in the source-cue-based separation module 20 is of a U-Net type architecture, where the input to the neural network is band energies and the output is real-valued band gains. This type of U-Net architecture is sometimes referred to as U-NetFB. Given a stereo intermediate audio signal B, U-NetFB first downmixes the audio signal before predicting a gain mask based on the downmixed audio signal, whereby the resulting gain mask is applied to both audio channels of the intermediate audio signal B.
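The downmix-then-shared-mask scheme can be sketched as follows, with `predict_gains` standing in for the trained U-NetFB network (an assumption; the real network maps band energies to real-valued band gains):

```python
import numpy as np

def apply_shared_mask(left, right, predict_gains):
    """Downmix the stereo signal, predict one real-valued gain mask from
    the downmix, and apply that single mask to BOTH channels."""
    left, right = np.asarray(left), np.asarray(right)
    downmix = 0.5 * (left + right)
    band_energy = np.abs(downmix) ** 2
    gains = predict_gains(band_energy)   # stand-in for the network forward pass
    return gains * left, gains * right
```

With a unity stand-in mask, the signal passes through unchanged; a real network would predict gains near zero for noise-dominated bands.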
As another example, the neural network in the source-cue-based separation module 20 is an aggregated multi-scale convolutional neural network with multiple parallel convolution paths, each convolution path comprising one or more convolutional layers. With such a neural network, an aggregated output is formed by aggregating the outputs of the parallel convolution paths, whereby an output gain mask is generated based on the aggregated output. This type of neural network is described in more detail in, for example, "METHOD AND APPARATUS FOR SPEECH SOURCE SEPARATION BASED ON A CONVOLUTIONAL NEURAL NETWORK", filed as a PCT application and published as WO/2020/232180, which is incorporated herein by reference in its entirety.
As also shown in FIG. 1, time and/or frequency metadata D is provided to the source-cue-based separation module 20. The time and/or frequency metadata D indicates at least one of the time resolution and the frequency resolution at which the spatial-cue-based separation module 10 operates. For example, the time and frequency resolutions of the spatial-cue-based separation module 10 are the block step in the time domain and the bandwidth of one block frequency band in the frequency domain. As another example, the time and frequency resolutions of the spatial-cue-based separation module 10 are the frame step in the time domain and the bandwidth of one tile in the frequency domain. That is, the time and/or frequency metadata D indicates at least one of: (i) the block step in the time domain and/or the bandwidth of one block frequency band (e.g., a quasi-octave band) in the frequency domain; or (ii) the frame step in the time domain and/or the bandwidth of one tile in the frequency domain. The time and/or frequency metadata D may be obtained from an external source (e.g., user-specified or accessed from a database), or may be provided by the spatial-cue-based separation module 10 to the source-cue-based separation module 20.
The source-cue-based separation module 20 processes the intermediate audio signal B based on the time and/or frequency metadata D. In some embodiments, the spatial-cue-based separation module 10 operates at a much lower (i.e., much coarser) time and/or frequency resolution than that of the source-cue-based separation module 20. For example, the spatial-cue-based separation module 10 operates on quasi-octave block frequency bands with a bandwidth of at least 400 Hz, and the mixing parameters are updated approximately every 100 ms (per block) or 20 ms (with interpolation). The source-cue-based separation module 20, however, may operate on individual tiles (e.g., individual STFT tiles), with a time resolution of a few milliseconds (e.g., 20 ms) and a frequency resolution on the order of 10 Hz.
By providing the time and/or frequency metadata D to the source-cue-based separation module 20, the module can then be configured to (i) use its default or typical time and/or frequency resolution, (ii) use the same time and/or frequency resolution as the spatial-cue-based separation module 10, or (iii) use a time and/or frequency resolution different from both (i) and (ii). As an example of alternative (iii), the source-cue-based separation module 20 can be instructed to use a lower/coarser time and/or frequency resolution rather than a finer one, even if both the spatial-cue-based separation module 10 and the source-cue-based separation module 20 would normally operate at a finer time and/or frequency resolution.
Using the time and/or frequency metadata D, the source-cue-based separation module 20 can operate in a mode that is better suited (in terms of separation performance and mitigation of acoustic artifacts) to being combined with the spatial-cue-based separation module 10, and this mode may differ from its typical operation without the spatial-cue-based separation module 10. The time and/or frequency metadata D may specify a more appropriate time and/or frequency resolution granularity at which the source-cue-based separation module 20 should operate.
In some embodiments, the time and/or frequency metadata D indicates that the source-cue-based separation module 20 should operate and/or apply smoothing at a time and/or frequency resolution equal to or lower/coarser than that of the spatial-cue-based separation module 10. For example, the source-cue-based separation module 20 operates at the same frequency resolution as that used in the spatial-cue-based separation module 10 (e.g., equal to the block frequency bands), and at a time resolution between one and ten times coarser/lower than that of the spatial-cue-based separation module 10 (e.g., between one and ten times the duration of a block).
Since directly changing the resolution of the source-cue-based separation module 20 from its default value may increase the residual noise left in the output audio signal C, in some embodiments the source-cue-based separation module 20 may operate at its default time and/or frequency resolution (which may be finer than that of the spatial-cue-based separation module 10). In such embodiments, the time and/or frequency metadata D indicates a smoothing in time and/or frequency to be applied directly to the output audio signal C or to the predicted source gain mask G. For example, the smoothing can be configured to establish the same frequency resolution as that used in the spatial-cue-based separation module 10, and a time resolution between one and ten times coarser/lower than that of the spatial-cue-based separation module 10.
FIG. 2 shows that the intermediate audio signal B is optionally mixed with the input audio signal A in an intermediate mixing module 30a to generate a mixed intermediate audio signal B'. The mixed intermediate audio signal B' is then provided to the source-cue-based separation module 20, which processes the mixed intermediate audio signal B'. The intermediate mixing module 30a generates the mixed intermediate audio signal B' as a weighted linear combination of the input audio signal A and the intermediate audio signal B output by the spatial-cue-based separation module 10. For example, the intermediate mixing unit 30a mixes the intermediate audio signal B with the input audio signal A at a mixing ratio that enhances the intermediate audio signal B by at least 15 dB relative to the input audio signal A. In some embodiments, the input audio signal A is not mixed with the intermediate audio signal B, which can be achieved by omitting the intermediate mixing unit 30a entirely, or by setting a mixing ratio that enhances the intermediate audio signal B by ∞ dB relative to the input audio signal A.
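The weighted linear combination with a dB enhancement can be sketched as follows; normalizing the two weights to sum to one is an assumption, as the text only specifies the relative enhancement:

```python
import numpy as np

def remix(enhanced, other, boost_db):
    """Weighted linear combination in which `enhanced` (e.g. the intermediate
    signal B) is boosted by `boost_db` relative to `other` (e.g. the input A).
    boost_db = inf reduces to passing `enhanced` through unchanged."""
    enhanced = np.asarray(enhanced, dtype=float)
    other = np.asarray(other, dtype=float)
    if np.isinf(boost_db):
        return enhanced
    w = 10.0 ** (boost_db / 20.0)              # amplitude ratio for the boost
    return (w * enhanced + other) / (w + 1.0)  # weights normalised to sum to 1
```

The same routine describes the output mixing module 30b, with the output audio signal C as the enhanced signal and a 20 dB boost.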
By reintroducing the input audio signal A into the intermediate audio signal B through mixing, some unprocessed content of the input audio signal A is reintroduced into the intermediate audio signal B, which may improve the performance of the subsequent source-cue-based separation module 20 and/or provide an output audio signal C of higher perceptual quality. In general, remixing can mask acoustic artifacts that the preceding separation modules 10, 20 may have introduced. For example, the neural network of the source-cue-based separation module 20 may be trained on training data containing mixtures of desired audio content (e.g., speech) and noise, whereby the neural network has learned to suppress the noise and/or amplify the desired audio content. However, the spatial-cue-based separation module 10 may introduce acoustic artifacts that are not present in the training data, which could degrade the performance of the source-cue-based separation module 20. By remixing in the input audio signal A, these artifacts are masked, making the mixed intermediate audio signal B' more similar to the training data used to train the neural network of the source-cue-based separation module 20. These and other problems are thus avoided by remixing the input audio signal A with the intermediate audio signal B. On the other hand, the remixing still leaves the intermediate audio signal B enhanced relative to the input audio signal A, so the source-cue-based separation module 20 is still presented with a spatially separated (mixed) intermediate audio signal B'.
Similarly, an output mixing module 30b is optionally provided for mixing the output audio signal C from the source-cue-based separation module 20 with at least one of the input audio signal A and the intermediate audio signal B to generate a mixed output audio signal C'. The mixed output audio signal C' is generated as a weighted linear combination of the output audio signal C and at least one of the intermediate audio signal B and the input audio signal A. For example, the output mixing module 30b mixes the output audio signal C at a mixing ratio that enhances the output audio signal C by 20 dB relative to the intermediate audio signal B and/or the input audio signal A, respectively.
Mixing the input audio signal A and/or the intermediate audio signal B into the output audio signal C can help improve the perceptual quality of the final mixed output audio signal C'. In some cases, the processing performed by the separation modules 10, 20 may introduce acoustic artifacts. Remixing the input audio signal A and/or the intermediate audio signal B with the output audio signal C overcomes this problem, since these artifacts are at least partially masked in the mixed output audio signal C'. In addition, remixing the input audio signal A and/or the intermediate audio signal B with the output audio signal C means that even if either of the source separation modules 10, 20 suppresses part of the desired audio content, that content will still be present to a limited extent in the mixed output audio signal C'.
With further reference to the flowchart of FIG. 3, the audio processing method for source separation will now be described in more detail. At step S1, an input audio signal A is obtained and provided to the spatial-cue-based separation module 10, which processes the input audio signal A at step S2 to obtain an intermediate audio signal B. The intermediate audio signal B is optionally provided to the intermediate mixing module 30a, which mixes the intermediate audio signal B with the input audio signal A at step S3 to obtain a mixed intermediate audio signal B'. At step S4, time/frequency metadata D indicating the time and/or frequency resolution used by the spatial-cue-based separation module 10 is provided to the source-cue-based separation module 20, and at step S5 the time/frequency metadata D is used by the source separation module 20 in processing the mixed intermediate audio signal B' to form an output audio signal C. The output audio signal C is optionally provided to the mixing module 30b, which mixes the output audio signal C with at least one of the input audio signal A and the intermediate audio signal B at step S6 to obtain a mixed output audio signal C'.
In the above, an audio processing method for source separation has been described. It should be understood that the method can be performed with or without any of the mixing steps of the mixing modules 30a, 30b. In other words, the mixing steps S3, S6 are entirely optional and independent of each other, meaning that neither, either, or both of the mixing steps may be used while the remaining steps stay the same. Similarly, the use of the time/frequency metadata D during step S4 is optional, and embodiments with processing based on the time/frequency metadata D as well as embodiments with processing not based on the time/frequency metadata D are envisaged.
It is further envisaged that processing the input audio signal A with the spatial-cue-based separation module 10 may further comprise transforming the input audio signal A into the domain in which the spatial-cue-based separation module 10 operates. For example, the input audio signal A is originally in the time domain, whereby it is transformed into the STFT domain or the QMF domain before being ingested by the spatial-cue-based separation module 10. Additionally or alternatively, the intermediate audio signal B is inverse-transformed before being provided to the subsequent intermediate mixing unit 30a or the source-cue-based separation module 20. Similarly, processing the intermediate audio signal B with the source-cue-based separation module may comprise transforming and inverse-transforming the intermediate audio signal B and the output audio signal C.
In some embodiments, the source-cue-based separation module 20 comprises a source-cue-based gain mask extractor 21 and a gain mask applicator 22, as shown in FIG. 4. As described above, the source-cue-based separation module 20 comprises a neural network trained to generate a noise-reduced output audio signal C. This can be realized by a neural network trained to predict a source gain mask G, implemented as the source-cue-based gain mask extractor 21. The source-cue-based gain mask extractor 21 outputs the source gain mask G to the gain mask applicator 22, which applies the source gain mask G to the intermediate audio signal B (mixed intermediate audio signal B') to form the noise-reduced output audio signal C. Applying the source gain mask G may comprise multiplying the source gain mask G with a corresponding time-frequency-domain representation of the intermediate audio signal B (mixed intermediate audio signal B').
The gain mask G is a predicted set of gains, with one gain for each tile of the audio signal. For example, the neural network may be trained to predict a fine-grained gain mask with one gain per STFT bin per frame. The predicted gains suppress the noise present in the audio signal while leaving the target audio content (e.g., speech and/or music). As an example, the intermediate audio signal B (mixed intermediate audio signal B') is divided into a plurality of consecutive frames, where each frame is further divided into N tiles covering respective frequency bands, with N ≥ 2. The neural network of the source-cue-based gain mask extractor 21 is then trained to predict N gains per frame (one gain per tile) so as to suppress the noise in the intermediate audio signal B (mixed intermediate audio signal B'). If the target audio content (e.g., speech and/or music) is concentrated in a first tile i, that first tile i may be associated with a high gain, whereas a different second tile containing mainly noise is associated with a different, lower gain that attenuates the second tile.
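Per-tile application of such a gain mask amounts to an element-wise multiply in the time-frequency domain; a minimal sketch (the shapes and gain values are illustrative):

```python
import numpy as np

def apply_gain_mask(tf_signal, gain_mask):
    """Apply a predicted gain mask G, one gain per tile (frame, band),
    to the time-frequency representation of the signal."""
    tf_signal = np.asarray(tf_signal, dtype=float)
    gain_mask = np.asarray(gain_mask, dtype=float)
    assert tf_signal.shape == gain_mask.shape  # one gain per tile
    return gain_mask * tf_signal

# Two frames, N = 3 tiles each: target content in tile 0, noise in tile 2.
frames = np.array([[1.0, 0.2, 0.8],
                   [1.0, 0.1, 0.9]])
mask = np.array([[1.0, 0.5, 0.0],   # keep tile 0, attenuate tile 1, mute tile 2
                 [1.0, 0.5, 0.0]])
denoised = apply_gain_mask(frames, mask)
```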
The gain mask applicator 22 may further be configured to take the time and/or frequency metadata D into account when applying the source gain mask G. For example, if the source-cue-based separation module 20 should operate at a time and/or frequency resolution other than its default or typical one, the gain mask applicator 22 may be configured to smooth the source gain mask G before applying it to the intermediate audio signal B (mixed intermediate audio signal B'), and/or to smooth the resulting audio signal after the source gain mask G has been applied. The smoothing lowers the time and/or frequency resolution (makes it coarser), for example to achieve a resolution matching (or below) that of the spatial-cue-based separation module 10. That is, the source-cue-based gain mask extractor 21 may operate at a fine/high resolution compared with the spatial-cue-based separation module 10, while the gain mask applicator 22 smooths the source-cue-based gain mask G before application, such that the overall time and/or frequency resolution of the source-cue-based separation module 20 is lower than/coarser than, or equal to, that of the spatial-cue-based separation module 10.
In an exemplary embodiment, the spatial-cue-based separation module 10 operates on five to ten octave-wide block frequency bands, whereas the source-cue-based gain mask extractor 21 operates on individual tiles. The source gain mask applicator 22 may then smooth the fine-grained gain mask G predicted by the source-cue-based gain mask extractor 21 to match the frequency resolution of the spatial-cue-based separation module 10. Different techniques can be used to apply the smoothing, such as convolution with a smoothing window (in one dimension) or a kernel (in two dimensions). As described above, the frequency resolution of the spatial-cue-based separation module 10 may be equal to the bandwidth of the block frequency bands, which is a much lower resolution than the bandwidth of an individual tile.
In some implementations, smoothing is performed using convolutional smoothing windows that move only along the time dimension and span the same frequency bands as the banded frequency bands of the spatial-cue-based separation module 10. The duration of the smoothing window is between one and ten times the stride length used in the spatial-cue-based separation module 10. The smoothing window may be a Hamming window.
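The time-only smoothing described above can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the function name, the `band_edges` layout (tile index ranges per banded frequency band), and the window normalization are assumptions made for the example.

```python
import numpy as np

def smooth_gain_mask(mask, band_edges, window_len):
    """Smooth a fine-grained source gain mask along the time axis only.

    mask: 2-D array of shape (num_freq_tiles, num_time_frames).
    band_edges: list of (start, stop) tile-index pairs, one per banded
        frequency band of the spatial-cue-based module (hypothetical layout).
    window_len: smoothing-window length in frames; per the text this would
        be one to ten times the stride of the spatial-cue-based module.
    """
    window = np.hamming(window_len)
    window /= window.sum()  # normalize so gain values keep their scale
    smoothed = np.empty_like(mask)
    for start, stop in band_edges:
        # Average the tiles in the band so the smoothed mask spans the same
        # frequency band as the spatial-cue-based module, then convolve the
        # resulting per-band gain trajectory along time with the window.
        band_gain = mask[start:stop].mean(axis=0)
        band_gain = np.convolve(band_gain, window, mode="same")
        smoothed[start:stop] = band_gain  # broadcast over the band's tiles
    return smoothed
```

After this step the mask varies only per band and per (smoothed) frame, i.e. its effective resolution is no finer than that of the spatial-cue-based module.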
In some implementations, the spatial-cue-based separation module 10 may analogously be implemented as a spatial-cue-based gain mask extractor that determines or predicts a spatial gain mask, wherein the spatial gain mask is provided to a spatial gain mask applicator that applies it to the input audio signal A to form the intermediate audio signal B. That is, modifying the at least two channels of the input audio signal A may comprise determining and applying a spatial gain mask.
In some implementations, both the spatial-cue-based separation module 10 and the source-cue-based separation module 20 make use of a gain mask predicted by each respective module, with the spatial-cue-based separation module 10 providing the intermediate audio signal B to the source-cue-based separation module 20. In some such implementations, the gain masks predicted by the two modules 10, 20 are provided to a gain mask combiner and applicator, which combines the two gain masks into an aggregate gain mask and then applies the aggregate gain mask to the input audio signal A to form the output audio signal C. Optionally, the gain mask combiner and applicator also smoothes the resulting combined gain mask.
The gain masks predicted by the modules 10, 20 need not share the same time and/or frequency resolution; different techniques such as interpolation, data replication, or pooling can be used to match the resolutions of the gain masks. The gain masks can also be combined in many ways. For example, combining the gain masks of the modules 10, 20 may comprise one of: multiplication, taking the minimum, taking the maximum, taking the median, taking the mean, or any linear combination of the gain masks.
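A gain mask combiner along these lines could look as follows. This sketch uses data replication (one of the techniques named above) to match resolutions, and assumes for simplicity that the fine mask's tile count is an integer multiple of the coarse mask's band count; interpolation or pooling would work where that assumption does not hold. The function and parameter names are illustrative.

```python
import numpy as np

def combine_gain_masks(spatial_mask, source_mask, mode="multiply"):
    """Combine a coarse spatial gain mask with a fine source gain mask.

    spatial_mask: shape (num_bands, num_frames).
    source_mask: shape (num_tiles, num_frames), where num_tiles is an
        integer multiple of num_bands (simplifying assumption).
    mode: one of the combination rules named in the text.
    """
    factor = source_mask.shape[0] // spatial_mask.shape[0]
    # Match resolutions by data replication: expand each band's gain
    # over all of the fine tiles it covers.
    spatial_up = np.repeat(spatial_mask, factor, axis=0)
    combiners = {
        "multiply": lambda a, b: a * b,
        "min": np.minimum,
        "max": np.maximum,
        "mean": lambda a, b: 0.5 * (a + b),  # a linear combination
    }
    return combiners[mode](spatial_up, source_mask)
```

The resulting aggregate mask has the fine mask's resolution and would be applied to the input audio signal A to form the output audio signal C.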
It is also contemplated that the two modules 10 and 20 may run in parallel, with each module predicting a gain mask based on the input audio signal A and providing its respective gain mask to a gain mask combiner and applicator, which combines the two gain masks into an aggregate gain mask and then applies it to the input audio signal A to form the output audio signal C. In such an implementation there is no intermediate audio signal B. Optionally, the output audio signal C is mixed with the input audio signal A to form a mixed output audio signal C'.
FIG. 5 depicts a block diagram of an audio processing system in which the source separation audio processing system 1 is used together with a classifier 50 and a gating unit 60 to form a gated output audio signal CG. The classifier 50 operates on at least one of the input audio signal A, the intermediate audio signal B (mixed intermediate audio signal B'), and the output audio signal C (mixed output audio signal C'), and determines a probability metric indicating the likelihood that the obtained audio signal includes target audio content. The probability metric may be a value where a smaller value indicates a lower likelihood and a larger value indicates a higher likelihood. For example, the probability metric is a value between 0 and 1, where a value closer to 0 indicates a lower likelihood that the audio signal includes the target audio content and a value closer to 1 indicates a higher likelihood. The target audio content may, for example, be speech or music. In some implementations, the classifier 50 comprises a neural network trained to predict, given samples of the input audio signal, the intermediate audio signal, and/or the output audio signal, a probability metric indicating the likelihood that the audio signal includes the target audio content.
For example, the neural network is a residual neural network (ResNet) trained to predict the probability metric given a time-frequency representation of at least one of the input audio signal A, the intermediate audio signal B (mixed intermediate audio signal B'), and the output audio signal C (mixed output audio signal C'). The time-frequency representation comprises a plurality of consecutive frames, each divided into a plurality of tiles. As another example, the classifier 50 comprises a feature extractor that extracts one or more features from the time-frequency representation and provides the at least one feature to a multilayer perceptron (MLP) neural network or a reduced ResNet trained to predict the probability metric. The feature extraction may be specified manually, or it is envisaged that the feature extraction is performed by a trained feature extraction neural network.
Since the intermediate audio signal B (mixed intermediate audio signal B') and the output audio signal C (mixed output audio signal C') are processed versions of the input audio signal A in which the target audio content has been separated, any of these audio signals may be provided as input to the classifier 50 to improve the accuracy of the likelihood prediction and/or to enable the use of a simpler classifier 50. For example, the classifier 50 may find it easier to accurately determine the probability metric when the audio signal has already been separated using spatial cues (and optionally also source cues) than when the input audio signal A has not undergone any separation processing by the separation modules 10, 20.
On the other hand, each separation module 10, 20 may introduce latency due to processing the audio signal with a predetermined number of look-ahead and look-back samples. Thus, while providing the output audio signal C (mixed output audio signal C') to the classifier 50 enables a simpler classifier 50 (e.g., a less complex neural network with fewer layers and learnable parameters) and/or improved classification accuracy, it also introduces a greater signal processing delay.
The probability metric is provided to the gating unit 60, which controls the gain of the output audio signal C (mixed output audio signal C') based on the likelihood, so as to form the gated output audio signal CG. For example, if the probability metric determined by the classifier exceeds a predetermined threshold, the gating unit 60 applies a high gain; otherwise the gating unit applies a low gain. In some implementations, the high gain is unity gain (0 dB), while the low gain effectively mutes the audio signal (e.g., -25 dB, -100 dB, or -∞ dB). In this way, the output audio signal C, C' becomes a gated output audio signal CG that isolates the target audio content. For example, the gated output audio signal CG includes only speech and is effectively muted at time instances where there is no speech.
In some implementations, the gating unit 60 is configured to smooth the applied gain by enforcing a finite transition time from low gain to high gain, or from high gain to low gain. With a finite transition time, the switching of the gating unit 60 becomes less noticeable and less intrusive. For example, the transition from low gain (e.g., -25 dB) to high gain (e.g., 0 dB) takes approximately 180 ms, while the transition from high gain to low gain takes approximately 800 ms, wherein, when no target audio content is present, the output audio signal C is further suppressed by being fully muted (-100 dB or -∞ dB) once the high-to-low transition has completed. Alternatively, for a faster onset, the low-to-high transition time may be set shorter than approximately 180 ms, such as approximately 1 ms.
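The thresholded gating with asymmetric transition times can be sketched as a per-frame gain ramp. This is an illustrative reading of the behavior described above, not the patented implementation: the linear ramp shape, the function and parameter names, and the frame-based control rate are assumptions (the final full mute after the high-to-low transition is omitted for brevity).

```python
import numpy as np

def gated_gains(probabilities, frame_ms, threshold=0.5,
                high_db=0.0, low_db=-25.0,
                attack_ms=180.0, release_ms=800.0):
    """Per-frame gain (in dB) applied by a gating unit.

    The target gain is high_db when the probability metric exceeds the
    threshold and low_db otherwise; the applied gain ramps linearly toward
    the target, taking attack_ms to rise and release_ms to fall.
    """
    span = high_db - low_db
    up_step = span * frame_ms / attack_ms      # dB per frame while rising
    down_step = span * frame_ms / release_ms   # dB per frame while falling
    gain = low_db  # start muted until target content is detected
    gains = []
    for p in probabilities:
        target = high_db if p > threshold else low_db
        if gain < target:
            gain = min(gain + up_step, target)
        elif gain > target:
            gain = max(gain - down_step, target)
        gains.append(gain)
    return np.array(gains)
```

With the example values above (90 ms frames), the gain reaches 0 dB two frames after speech is detected, and decays much more slowly once speech stops, matching the fast-attack/slow-release behavior described in the text.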
In this way, the gated output audio signal CG enhances the target audio content (e.g., speech) in two ways: first, the target audio content is separated using spatial cues and source cues so that it is clearer and more intelligible when present in the audio signal; second, the output audio signal C is muted when the target audio content is absent from the input audio signal A.
Unless specifically stated otherwise, as is apparent from the following discussion, it is appreciated that throughout this disclosure, discussions utilizing terms such as "processing," "computing," "calculating," "determining," "analyzing," or the like refer to the actions and/or processes of computer hardware or a computing system, or similar electronic computing devices, that manipulate and/or transform data represented as physical (e.g., electronic) quantities into other data similarly represented as physical quantities.
It should be appreciated that in the above description of exemplary embodiments of the invention, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of this invention. Furthermore, while some embodiments described herein include some features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as a method, or a combination of elements of a method, that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or element of a method forms a means for carrying out the method or element of the method. It should be noted that when the method includes several elements (e.g., several steps), no ordering of such elements is implied unless specifically stated. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by that element for the purpose of carrying out embodiments of the invention. In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures, and techniques have not been shown in detail in order not to obscure an understanding of this description.
Those skilled in the art will recognize that the invention is by no means limited to the embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims. For example, the mixing modules 30a, 30b in FIG. 2 may be omitted, or only one or both of them may be used, regardless of whether each separation module operates using a gain mask and/or whether a gain mask combiner and applicator is present.
Claims (36)
Applications Claiming Priority (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US63/325,108 | 2022-03-29 | | |
| US63/417,273 | 2022-10-18 | | |
| US202363482949P | 2023-02-02 | 2023-02-02 | |
| US63/482,949 | 2023-02-02 | | |
| PCT/US2023/015507 (WO2023192039A1) | 2022-03-29 | 2023-03-17 | Source separation combining spatial and source cues |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118974825A (en) | 2024-11-15 |
Family ID: 93393188
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202380031124.3A (CN118974825A, pending) | Source separation combining spatial cues and source cues | 2023-03-17 | |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN118974825A (en) |

- 2023-03-17: CN application CN202380031124.3A filed (publication CN118974825A); status active, pending.
Similar Documents
- US9881635B2: Method and system for scaling ducking of speech-relevant channels in multi-channel audio
- CN107004427B: Signal processing device for enhancing speech component in multi-channel audio signal
- JP6242489B2: System and method for mitigating temporal artifacts for transient signals in a decorrelator
- KR101790641B1: Hybrid waveform-coded and parametric-coded speech enhancement
- JP7201721B2: Method and apparatus for adaptive control of correlation separation filters
- JP2012524304A: Method and apparatus for adjusting channel delay parameters of multi-channel signals
- JP6301368B2: Apparatus and method for generating a frequency enhancement signal using enhancement signal shaping
- CN118974825A: Source separation combining spatial cues and source cues
- JP2025507119A: Method and audio processing system for wind noise suppression
- US20250191604A1: Source separation combining spatial and source cues
- US20250182774A1: Multichannel and multi-stream source separation via multi-pair processing
- CN118974824A: Multi-channel and multi-stream source separation via multi-pair processing
- EP4348643B1: Dynamic range adjustment of spatial audio objects
- US20240161762A1: Full-band audio signal reconstruction enabled by output from a machine learning model
- CN118922884A: Method and audio processing system for wind noise suppression
- CN119631426A: Acoustic image enhancement for stereo audio
- WO2025058991A1: Method and system for stereo source elimination
- WO2023172852A1: Target mid-side signals for audio applications
- HK1222470B: Hybrid waveform-coded and parametric-coded speech enhancement
- HK1175881B: Method and system for scaling ducking of speech-relevant channels in multi-channel audio
- HK1175881A: Method and system for scaling ducking of speech-relevant channels in multi-channel audio
Legal Events
Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |