CN118692498A

CN118692498A - Multi-channel audio and video signal processing method and device

Info

Publication number: CN118692498A
Application number: CN202411161729.7A
Authority: CN
Inventors: 王群; 田克平; 孟然; 柴华; 姜安; 朱海涛
Original assignee: Beijing Smarter Eye Technology Co Ltd
Current assignee: Beijing Smarter Eye Technology Co Ltd
Priority date: 2024-08-23
Filing date: 2024-08-23
Publication date: 2024-09-24
Anticipated expiration: 2044-08-23
Also published as: CN118692498B

Abstract

The invention provides a multi-channel audio and video signal processing method and device. The method comprises: calculating the audio signal energy and voice activity detection information of all input channels; determining a plurality of target channels according to the audio signal energy results and the voice activity detection results; when the target channels meet the quantity requirement, performing audio enhancement on the target channels and audio attenuation on the remaining channels; and mixing and outputting the audio of all input channels. The method and device address the noisy sound that arises when many participants have their microphones on at the same time during audio and video calls, audio and video teaching, or audio and video conferences. Through reasonable audio channel selection and audio enhancement, users can more easily obtain the information they need, the listening experience is improved, online communication becomes smoother, and users' time is saved.

Description

Multi-channel audio and video signal processing method and device

Technical Field

The present invention relates to the technical field of audio and video processing, and in particular to a multi-channel audio and video signal processing method and device.

Background Art

In multi-party audio and video calls, audio and video teaching, and audio and video conferences, a common scenario is online interaction among many participants. When dozens of audio channels have their microphones on at the same time, however, the resulting sound is very cluttered, and the variety of device types and speaking environments makes the noise worse. In practice, a listener usually needs to follow only a few speakers at any given moment and can ignore most other users, or the users of the various channels take turns coming on stage to speak.

When multiple audio streams are present at once, how to select and enhance multi-channel audio signals so as to highlight the target audio of the current speakers, weaken interference from other audio, and guarantee the reception quality of the target audio has become an urgent problem for those skilled in the art.

Summary of the Invention

The present invention provides a multi-channel audio and video signal processing method and device. By selecting target audio channels and enhancing their audio, it can significantly improve the user's listening experience, make online communication more concise and smooth, and reduce the user's communication cost.

The present invention provides a multi-channel audio and video signal processing method, the method comprising:

calculating the audio signal energy and voice activity detection information of all input channels;

determining a plurality of target channels according to the audio signal energy results and the voice activity detection results;

when the target channels meet the quantity requirement, performing audio enhancement on the target channels and audio attenuation on the remaining channels;

mixing the audio of all input channels and outputting it.

In some embodiments, calculating the audio signal energy of all input channels specifically includes:

calculating the maximum energy value among the audio samples within a single frame of the current channel;

calculating the average of the per-frame maximum energy values over all frames within a preset duration for the current channel;

performing quantization and shaping on the average of the maximum energy values to obtain the audio signal energy of the current channel;

taking each input channel in turn as the current channel to obtain the audio signal energy of all input channels.

In some embodiments, calculating the voice activity detection information of all input channels specifically includes:

resampling the audio signal of the current channel to a preset sampling frequency, dividing it into multiple frequency bands according to the Mel cepstral frequency scale, and calculating the energy in each band;

using a Gaussian mixture model to calculate the Gaussian probability density functions of speech and of noise for each sub-band;

calculating the log-likelihood ratio of speech to noise, and summing the log-likelihood ratios over all sub-bands;

labeling the voice activity of the current channel by comparing the sum of the log-likelihood ratios over all sub-bands with a preset threshold;

taking each input channel in turn as the current channel to obtain the voice activity detection information of all input channels.

In some embodiments, determining a plurality of target channels according to the audio signal energy results and the voice activity detection results specifically includes:

selecting, according to the voice activity detection results, the channels labeled as containing speech as candidate channels;

sorting all candidate channels by audio signal energy, and selecting a preset number of channels with the largest audio signal energy as the target channels.

In some embodiments, performing audio enhancement on the target channels and audio attenuation on the remaining channels when the target channels meet the quantity requirement specifically includes:

when the number of target channels is greater than or equal to a preset number, performing audio enhancement on all the target channels;

when the number of target channels is less than the preset number, selecting supplementary channels from the remaining channels as additional target channels, and performing audio enhancement on all the target channels;

wherein the number of supplementary channels equals the preset number minus the number of target channels.

In some embodiments, selecting supplementary channels from the remaining channels as additional target channels specifically includes:

selecting, according to the voice activity detection results, the channels labeled as containing speech as supplementary candidates;

sorting all supplementary candidates by audio signal energy, and selecting at least one channel with the largest audio signal energy as an additional target channel;

when supplementary candidates have the same audio signal energy value, selecting the channel with the higher signal-to-noise ratio as the additional target channel.

The present invention also provides a multi-channel audio and video signal processing device, the device comprising:

a parameter calculation unit, configured to calculate the audio signal energy and voice activity detection information of all input channels;

a channel selection unit, configured to determine a plurality of target channels according to the audio signal energy results and the voice activity detection results;

an audio processing unit, configured to perform audio enhancement on the target channels and audio attenuation on the remaining channels when the target channels meet the quantity requirement;

a result output unit, configured to mix the audio of all input channels and output it.

The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described above when executing the program.

The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program implements the method described above when executed by a processor.

The present invention also provides a computer program product comprising a computer program, wherein the computer program implements the method described above when executed by a processor.

The multi-channel audio and video signal processing method and device provided by the present invention calculate the audio signal energy and voice activity detection information of all input channels and determine a plurality of target channels according to the audio signal energy results and the voice activity detection results. When the target channels meet the quantity requirement, the target channels are enhanced, the remaining channels are attenuated, and the audio of all input channels is mixed and output. The method and device address the noisy sound that arises when many participants have their microphones on at the same time during multi-party audio and video calls, teaching, or conferences. Through reasonable audio channel selection and audio enhancement, users can more easily obtain the information they need, the listening experience is improved, online communication becomes smoother, and users' time is saved.

Brief Description of the Drawings

In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is the first flow chart of the multi-channel audio and video signal processing method provided by the present invention;

FIG. 2 is the second flow chart of the multi-channel audio and video signal processing method provided by the present invention;

FIG. 3 is the third flow chart of the multi-channel audio and video signal processing method provided by the present invention;

FIG. 4 is the fourth flow chart of the multi-channel audio and video signal processing method provided by the present invention;

FIG. 5 is the fifth flow chart of the multi-channel audio and video signal processing method provided by the present invention;

FIG. 6 is the sixth flow chart of the multi-channel audio and video signal processing method provided by the present invention;

FIG. 7 is a diagram showing the effect of audio enhancement and compression/limiting;

FIG. 8 is a diagram showing the effect of compression/limiting after mixing;

FIG. 9 is a structural block diagram of the multi-channel audio and video signal processing device provided by the present invention;

FIG. 10 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Description

To make the purpose, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

For ease of understanding, FIG. 1 shows an online application scenario of the multi-channel audio and video signal processing method provided by the present invention. In this scenario, the video window area of the client interface shows the multiple audio and video streams received by the current user (that is, all channels), and the on-stage user area shows the audio and video streams of users currently on stage and ready to speak (that is, the target channels). In this embodiment, the number of windows in the on-stage user area falls within a set range. The purpose of the present invention is to select 6 of the video window channels for audio enhancement and to attenuate the audio of the remaining video window channels, so that the information that needs attention is easier to obtain.

The overall technical solution of the multi-channel audio and video signal processing method provided by the present invention is shown in FIG. 2. First, the channels that need audio decoding are selected from the multiple input audio signals according to the energy information and the voice activity detection information; next, the channels that need audio enhancement are selected; audio enhancement and audio attenuation are then applied to the selected channels; finally, the audio of all channels is mixed and output.

In a specific implementation, as shown in FIG. 3, the multi-channel audio and video signal processing method provided by the present invention includes the following steps:

S310: calculating the audio signal energy and voice activity detection information of all input channels;

S320: determining a plurality of target channels according to the audio signal energy results and the voice activity detection results;

S330: when the target channels meet the quantity requirement, performing audio enhancement on the target channels and audio attenuation on the remaining channels;

S340: mixing the audio of all input channels and outputting it.

In step S310, calculating the audio signal energy of all input channels, as shown in FIG. 4, specifically includes the following steps:

S410: calculating the maximum energy value among the audio samples within a single frame of the current channel;

S420: calculating the average of the per-frame maximum energy values over all frames within a preset duration for the current channel;

S430: performing quantization and shaping on the average of the maximum energy values to obtain the audio signal energy of the current channel;

S440: taking each input channel in turn as the current channel to obtain the audio signal energy of all input channels.

Specifically, in one usage scenario, the audio energy of every channel is calculated so that it can be reported to the backend, which forwards it to the receiving clients to guide their channel selection strategy. Audio energy must not be reported too often, otherwise the data volume becomes too large and puts heavy pressure on the backend. In this embodiment, energy is reported once every 2 seconds, the volume statistics window is also 2 seconds, and each audio frame is 20 milliseconds long, so a 2-second window contains 100 audio frames.

In this embodiment, the method calculates volume as follows:

First, the maximum energy of the audio samples within one frame is calculated:

$$x_{\max} = \max_{0 \le i < L} \lvert x(i) \rvert$$

where $x_{\max}$ is the maximum over the samples of the current audio frame, $x(i)$ is the $i$-th audio sample, whose values lie in a bounded range, and $L$ is the length of the audio frame.

Then the per-frame maxima of the audio frames within 2 seconds are averaged:

$$\bar{x}_{\max} = \frac{1}{100} \sum_{k=1}^{100} x_{\max}(k)$$

where $\bar{x}_{\max}$ is the average of the per-frame maxima within 2 seconds, and $x_{\max}(k)$ is the maximum over the samples of the $k$-th audio frame.

Finally, the volume is calculated:

$$V = Q\left(\bar{x}_{\max}\right)$$

where $V$ is the quantized and shaped maximum volume within 2 seconds, taking values in a bounded range, $Q(\cdot)$ denotes the quantization and shaping operation, and $\bar{x}_{\max}$ is the average of the per-frame maxima within 2 seconds.
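The three volume steps can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes 16-bit PCM integer samples, takes the per-frame maximum over absolute sample values, and uses a hypothetical `quantize_volume` that maps the mean peak onto an integer scale in decibels. The quantization mapping, its `levels` and `floor_db` parameters, and all function names are assumptions introduced for illustration.

```python
import math

FRAME_MS = 20                               # frame length stated in the embodiment
WINDOW_MS = 2000                            # statistics/reporting window stated in the embodiment
FRAMES_PER_WINDOW = WINDOW_MS // FRAME_MS   # 100 frames per window


def frame_peak(frame):
    """Largest absolute sample value in one frame (assumes integer PCM samples)."""
    return max(abs(s) for s in frame)


def mean_peak(frames):
    """Average of the per-frame peaks over the statistics window."""
    peaks = [frame_peak(f) for f in frames]
    return sum(peaks) / len(peaks)


def quantize_volume(avg_peak, full_scale=32768, levels=100, floor_db=-60.0):
    """Hypothetical quantization: map the mean peak to an integer in 0..levels.

    The level is expressed in dB relative to full scale, clamped to
    [floor_db, 0], and rescaled linearly. This mapping is an assumption.
    """
    if avg_peak <= 0:
        return 0
    db = 20.0 * math.log10(avg_peak / full_scale)
    db = max(floor_db, min(0.0, db))
    return round((db - floor_db) / -floor_db * levels)


def channel_volume(frames):
    """Volume of one channel for one statistics window."""
    return quantize_volume(mean_peak(frames))
```

Calling `channel_volume` once per channel per 2-second window gives one small integer per channel, which keeps the amount of reported data low.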

In step S310, calculating the voice activity detection information of all input channels, as shown in FIG. 5, specifically includes the following steps:

S510: resampling the audio signal of the current channel to a preset sampling frequency, dividing it into multiple frequency bands according to the Mel cepstral frequency scale, and calculating the energy in each band;

S520: using a Gaussian mixture model to calculate the Gaussian probability density functions of speech and of noise for each sub-band;

S530: calculating the log-likelihood ratio of speech to noise, and summing the log-likelihood ratios over all sub-bands;

S540: labeling the voice activity of the current channel by comparing the sum of the log-likelihood ratios over all sub-bands with a preset threshold;

S550: taking each input channel in turn as the current channel to obtain the voice activity detection information of all input channels.

Continuing with the usage scenario above, voice activity detection is performed on the audio signals of all channels. The purpose of voice activity detection (VAD) is to distinguish speech from non-speech. During audio transmission, VAD can guide the sending of audio data: when the signal is judged to be speech it is sent, and when it is judged to be non-speech sending stops, which saves a large amount of bandwidth and traffic.

This embodiment performs voice activity detection with a statistical model, which specifically includes the following steps:

First, the audio signal is resampled to a sampling frequency of 8 kHz, for which the corresponding cutoff frequency is 4 kHz. The 8 kHz signal is then divided into multiple frequency bands according to the Mel cepstral frequency scale, and the energy in each band is calculated. This embodiment uses the logarithmic energy, calculated as follows:

$$E(n) = \log\left(\sum_{i=0}^{L_n - 1} x_n^2(i)\right), \quad n = 1, \dots, N$$

where $n$ is the sub-band index, $i$ is the sample index within the sub-band, $L_n$ is the length of sub-band $n$, $N$ is the number of sub-bands, $E(n)$ is the logarithmic energy of sub-band $n$, and $x_n(i)$ is the $i$-th sample within sub-band $n$.

A Gaussian mixture model is then used to calculate the Gaussian probability density functions of speech and of noise for each of the $N$ sub-bands. The Gaussian probability density function of speech is calculated as follows:

$$p_{\text{speech}}(n) = \sum_{k=1}^{K} w^{\text{speech}}_{n,k} \, \frac{1}{\sqrt{2\pi \sigma^{2,\text{speech}}_{n,k}}} \exp\left(-\frac{\left(E(n) - \mu^{\text{speech}}_{n,k}\right)^2}{2\sigma^{2,\text{speech}}_{n,k}}\right), \quad n = 1, \dots, N$$

where $p_{\text{speech}}(n)$ is the Gaussian probability density function of speech, $N$ is the number of sub-bands, $K$ is the order of the Gaussian mixture model, $E(n)$ is the logarithmic energy of sub-band $n$, and $\mu^{\text{speech}}_{n,k}$, $\sigma^{2,\text{speech}}_{n,k}$, and $w^{\text{speech}}_{n,k}$ are the mean, variance, and mixing weight of the Gaussian mixture model of speech.

The Gaussian probability density function of noise is calculated as follows:

$$p_{\text{noise}}(n) = \sum_{k=1}^{K} w^{\text{noise}}_{n,k} \, \frac{1}{\sqrt{2\pi \sigma^{2,\text{noise}}_{n,k}}} \exp\left(-\frac{\left(E(n) - \mu^{\text{noise}}_{n,k}\right)^2}{2\sigma^{2,\text{noise}}_{n,k}}\right), \quad n = 1, \dots, N$$

where $p_{\text{noise}}(n)$ is the Gaussian probability density function of noise, $N$ is the number of sub-bands, $K$ is the order of the Gaussian mixture model, $E(n)$ is the logarithmic energy of sub-band $n$, and $\mu^{\text{noise}}_{n,k}$, $\sigma^{2,\text{noise}}_{n,k}$, and $w^{\text{noise}}_{n,k}$ are the mean, variance, and mixing weight of the Gaussian mixture model of noise.

The log-likelihood ratio of speech to noise is then calculated and summed over all sub-bands:

$$\Lambda = \sum_{n=1}^{N} \log \frac{p_{\text{speech}}(n)}{p_{\text{noise}}(n)}$$

where $\Lambda$ is the sum of the log-likelihood ratios over all sub-bands, $N$ is the number of sub-bands, $p_{\text{speech}}(n)$ is the Gaussian probability density function of speech, and $p_{\text{noise}}(n)$ is the Gaussian probability density function of noise.

The VAD result is decided against a threshold, with speech labeled 1 and non-speech labeled 0:

$$\mathrm{VAD} = \begin{cases} 1, & \Lambda > T \\ 0, & \Lambda \le T \end{cases}$$

where $\mathrm{VAD}$ is the VAD result, $\Lambda$ is the sum of the log-likelihood ratios over all sub-bands, and $T$ is the VAD decision threshold.

Finally, the means, variances, and mixing weights of the Gaussian mixture models are updated for the next calculation.
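The decision stage of this statistical VAD can be sketched as below. This is a hedged illustration only: it takes the per-sub-band logarithmic energies as given, represents each Gaussian mixture as a list of `(weight, mean, variance)` tuples, and omits the model update step. The function names, the `tiny` floor that guards the logarithm, and the choice of a strict `>` comparison against the threshold are assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, components):
    """Mixture density; components is a list of (weight, mean, variance)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def vad_decision(log_energies, speech_gmms, noise_gmms, threshold):
    """Return (flag, llr_sum): flag is 1 for speech and 0 for non-speech.

    log_energies[n] is the log energy of sub-band n; speech_gmms[n] and
    noise_gmms[n] are the mixtures for that sub-band.
    """
    tiny = 1e-300  # assumed floor so that log() never receives zero
    llr_sum = 0.0
    for e, speech, noise in zip(log_energies, speech_gmms, noise_gmms):
        p_speech = max(gmm_pdf(e, speech), tiny)
        p_noise = max(gmm_pdf(e, noise), tiny)
        llr_sum += math.log(p_speech) - math.log(p_noise)
    return (1 if llr_sum > threshold else 0), llr_sum
```

For example, with illustrative single-component models whose speech mean is well above the noise mean, a frame whose sub-band energies sit near the speech mean yields a positive log-likelihood ratio sum and is labeled 1, while a frame near the noise mean is labeled 0.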

In step S320, determining a plurality of target channels according to the audio signal energy results and the voice activity detection results specifically includes:

Specifically, regarding the selection of decoding channels, this embodiment uses the Opus codec for audio encoding and decoding. Audio decoding is performed on the client and accounts for a relatively large share of the performance cost of the whole system. Because the system is multi-channel, a device cannot sustain the load of decoding dozens of encoded audio streams at the same time. Decoding too many channels is also very time-consuming and delays the processing of subsequent modules, so decoding must be selective.

To accommodate the performance of low-end devices, experiments and calculations showed that decoding 20 encoded audio streams simultaneously is feasible on essentially all low-end devices on the market.

The principles for selecting decoding channels are:

select the channels that contain speech according to the VAD information from step S310;

sort all channels according to the energy information from step S310, and select decoding channels in descending order of energy.
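The two selection principles above can be sketched as a filter followed by a sort. This is an illustrative sketch: the `ChannelInfo` record, its field names, and `select_decode_channels` are hypothetical, and only the limit of 20 decoded streams comes from the embodiment.

```python
from dataclasses import dataclass

MAX_DECODE_CHANNELS = 20  # simultaneous decode limit stated in the embodiment


@dataclass
class ChannelInfo:
    channel_id: str
    vad: int      # 1 = speech, 0 = non-speech
    energy: int   # quantized volume reported for the channel


def select_decode_channels(channels, limit=MAX_DECODE_CHANNELS):
    """Keep channels flagged as speech, then take the loudest `limit` of them."""
    voiced = [c for c in channels if c.vad == 1]
    voiced.sort(key=lambda c: c.energy, reverse=True)
    return [c.channel_id for c in voiced[:limit]]
```

A loud channel that VAD marks as non-speech is never decoded under this sketch, which matches the order of the two principles: speech first, energy second.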

In step S330, performing audio enhancement on the target channels and audio attenuation on the remaining channels when the target channels meet the quantity requirement specifically includes:

The signal-to-noise ratio of each channel's audio signal is output by the noise reduction module at the encoding end. It reflects the ratio of signal strength to background noise strength, and this embodiment uses it to guide the channel selection strategy of the receiving end. It is calculated as follows:

$$\mathrm{SNR} = \frac{P_s}{P_n}$$

where $\mathrm{SNR}$ is the ratio of signal to noise, $P_s$ is the power of the signal, and $P_n$ is the power of the noise.

$P_s$ is calculated as follows:

$$P_s = \frac{1}{L} \sum_{i=0}^{L-1} s^2(i)$$

where $s(i)$ is the value of the $i$-th signal sample and $L$ is the frame length of the audio frame.

$P_n$ is calculated as follows:

$$P_n = \frac{1}{L} \sum_{i=0}^{L-1} v^2(i)$$

where $P_n$ is the power of the noise, $v(i)$ is the value of the $i$-th noise sample, and $L$ is the frame length of the audio frame.
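As a small worked sketch of the signal-to-noise calculation, the code below computes the mean power of a signal frame and of a noise frame and takes their ratio. The function names are hypothetical, and the decibel variant `snr_db` is an added convenience that the text does not state; because the logarithm is monotonic, both forms rank channels identically.

```python
import math


def mean_power(samples):
    """Mean of the squared sample values over one frame."""
    return sum(s * s for s in samples) / len(samples)


def snr(signal, noise):
    """Ratio of signal power to noise power; infinite when the noise power is zero."""
    p_noise = mean_power(noise)
    if p_noise == 0:
        return float("inf")
    return mean_power(signal) / p_noise


def snr_db(signal, noise):
    """Same ratio expressed in decibels (an assumed convenience form)."""
    ratio = snr(signal, noise)
    if ratio == 0:
        return float("-inf")
    return 10.0 * math.log10(ratio)
```

With a signal frame of power 4 and a noise frame of power 1, `snr` returns 4.0, which is roughly 6 dB.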

Regarding the selection of enhanced channels, enhanced and non-enhanced channels are defined as follows: according to product requirements, the sound of some channels must be amplified and the sound of other channels reduced, so that the audio of the required channels is easier to follow. Because dozens of channels may have their microphones on at the same time, rules are needed to decide which channels' audio is enhanced and which channels' audio is attenuated.

The process of selecting audio enhancement channels is shown in FIG. 6:

First, check whether more than 6 of the N audio channels belong to on-stage users. If there are more than 6 on-stage users, the enhanced channels are selected from among them; if there are fewer than 6, the on-stage users are selected for enhancement first, and further channels are then selected from users who are not on stage, so that the total number of channels selected for enhancement is 6;

The selection principle is to choose channels whose VAD is 1 (that is, channels containing speech) first, and then to choose in descending order of energy; when the energies differ only slightly, the choice is made by signal-to-noise ratio, giving priority to channels with a high signal-to-noise ratio;

Then there is the matter of switching the enhanced-channel selection, since other users who want to speak can also be switched into the enhanced channels. The switching rule is: once a channel enters the enhancement queue, it is not reselected for 2 seconds; a channel that has not spoken for more than 4 seconds enters the selection queue and waits for the enhanced channels to be reselected.
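The selection rules above can be sketched as follows. This is one possible reading, not the definitive procedure: the `Channel` record and function names are hypothetical, "energies differ only slightly" is modeled by grouping energies into buckets of width `energy_bucket` (an assumed parameter), and only speech-bearing channels are used to fill the remaining slots from off-stage users. The 2-second hold and 4-second silence rules for switching are not modeled here.

```python
from dataclasses import dataclass

ENHANCE_SLOTS = 6  # number of enhanced channels in the embodiment


@dataclass
class Channel:
    channel_id: str
    on_stage: bool
    vad: int       # 1 = speech, 0 = non-speech
    energy: int    # quantized volume
    snr: float     # signal-to-noise ratio reported by the sender


def _rank(channels, energy_bucket):
    """Order by speech flag, then by coarse energy bucket, then by SNR (all descending)."""
    return sorted(
        channels,
        key=lambda c: (-c.vad, -(c.energy // energy_bucket), -c.snr),
    )


def select_enhanced(channels, slots=ENHANCE_SLOTS, energy_bucket=5):
    """Pick on-stage channels first, then fill any free slots from off-stage speakers."""
    on_stage = _rank([c for c in channels if c.on_stage], energy_bucket)
    chosen = on_stage[:slots]
    if len(chosen) < slots:
        off_stage = _rank(
            [c for c in channels if not c.on_stage and c.vad == 1],
            energy_bucket,
        )
        chosen += off_stage[: slots - len(chosen)]
    return [c.channel_id for c in chosen]
```

Bucketing is only one way to express "close in energy"; a pairwise tolerance comparison would also work but does not define a consistent sort order, which is why a bucket is used in this sketch.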

Finally, the channel enhancement strategy of the channel enhancement and mixing module is delivered to the client as a configuration from the server backend. The enhancement parameters are 1 dB, 3 dB, and 5 dB; the attenuation parameters are -3 dB, -5 dB, and -10 dB. The gain of every audio channel that needs attenuation is reduced, and the gain of every audio channel that needs enhancement is raised. When raising the gain, note that the increase may push the volume above the maximum limit of 0 dB. The method therefore applies compression/limiting to the part of the boosted signal above -10 dB to prevent clipping, while the part below -10 dB receives the fixed gain boost, as shown in FIG. 7, where the horizontal axis is the energy of the audio signal before compression/limiting and the vertical axis is the energy after compression/limiting. The volume after multi-channel mixing may also exceed the maximum limit of 0 dB, so the method applies compression/limiting to the part of the mixed signal above -10 dB while leaving the gain of the part below -10 dB unchanged, as shown in FIG. 8, where the horizontal axis is the energy of the audio signal before compression/limiting and the vertical axis is the energy after compression/limiting.
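A minimal sketch of the gain, compression/limiting, and mixing stage follows. It is an assumption-laden illustration rather than the described module: samples are normalized floats where 1.0 corresponds to 0 dB full scale, the limiter is a static per-sample curve with no attack or release smoothing, and levels between the -10 dB knee and the worst-case input level are mapped linearly (in dB) onto the range from -10 dB to 0 dB. All function names and the `max_in_db` parameter are hypothetical.

```python
import math

THRESHOLD_DB = -10.0  # knee stated in the embodiment
CEILING_DB = 0.0      # maximum level stated in the embodiment


def db_to_lin(db):
    """Convert a level in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def compress_db(level_db, max_in_db):
    """Identity below the knee; above it, squeeze [knee, max_in_db] into [knee, ceiling]."""
    if level_db <= THRESHOLD_DB or max_in_db <= CEILING_DB:
        return min(level_db, CEILING_DB)
    slope = (CEILING_DB - THRESHOLD_DB) / (max_in_db - THRESHOLD_DB)
    return min(CEILING_DB, THRESHOLD_DB + (level_db - THRESHOLD_DB) * slope)


def limit_sample(x, max_in_db):
    """Apply the static curve to one normalized sample, preserving its sign."""
    if x == 0.0:
        return 0.0
    in_db = 20.0 * math.log10(abs(x))
    return math.copysign(db_to_lin(compress_db(in_db, max_in_db)), x)


def apply_gain(samples, gain_db):
    """Boost or attenuate a channel; boosted channels pass through the limiter."""
    g = db_to_lin(gain_db)
    headroom = max(gain_db, 0.0)  # worst-case level after gain, in dB
    return [limit_sample(s * g, headroom) for s in samples]


def mix(channels):
    """Sum channels sample by sample, then limit the sum to the ceiling."""
    n = len(channels)
    max_in_db = 20.0 * math.log10(n) if n > 1 else 0.0
    length = min(len(c) for c in channels)
    return [limit_sample(sum(c[i] for c in channels), max_in_db) for i in range(length)]
```

In use, each enhanced channel would be passed through `apply_gain` with one of the configured boosts (1, 3, or 5 dB), each attenuated channel with one of the configured cuts (-3, -5, or -10 dB), and the results handed to `mix`. A production limiter would normally smooth its gain over time to avoid distortion; the static curve here only illustrates the shape of the input/output relation.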

In the specific implementation above, the multi-channel audio and video signal processing method provided by the present invention calculates the audio signal energy and voice activity detection information of all input channels and determines a plurality of target channels according to the audio signal energy results and the voice activity detection results. When the target channels meet the quantity requirement, the target channels are enhanced, the remaining channels are attenuated, and the audio of all input channels is mixed and output. The method addresses the noisy sound that arises when many participants have their microphones on at the same time during multi-party audio and video calls, teaching, or conferences. Through reasonable audio channel selection and audio enhancement, users can more easily obtain the information they need, the listening experience is improved, online communication becomes smoother, and users' time is saved.

In addition to the above method, the present invention also provides a multi-channel audio and video signal processing device. As shown in FIG. 9, the device comprises:

a parameter calculation unit 910, configured to calculate the audio signal energy and voice activity detection information of all input channels;

a channel selection unit 920, configured to determine a plurality of target channels according to the calculation result of the audio signal energy and the calculation result of the voice activity detection;

an audio processing unit 930, configured to perform audio enhancement on the target channels and audio attenuation on the remaining channels when the target channels meet the quantity requirement; and

a result output unit 940, configured to mix and output the audio of all input channels.
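As a rough illustration of how the four units could be chained, the following sketch maps each unit to one function. Everything beyond the unit roles is an assumption made for the example: the peak-amplitude energy measure, the fixed voice threshold, the choice of 3 dB and -5 dB from the configurable values, the hard clip standing in for compression, and all function names. The quantity-requirement check is omitted for brevity.

```python
from typing import Dict, List, Tuple

ENHANCE_DB = 3.0     # one of the enhancement values listed in the description
ATTENUATE_DB = -5.0  # one of the attenuation values listed in the description


def db_to_linear(gain_db: float) -> float:
    return 10.0 ** (gain_db / 20.0)


def calc_params(channels: Dict[str, List[float]]) -> Dict[str, Tuple[float, bool]]:
    """Unit 910: return {channel: (energy, has_voice)} using stand-in measures."""
    params = {}
    for cid, samples in channels.items():
        peak = max((abs(s) for s in samples), default=0.0)
        params[cid] = (peak, peak > 0.05)  # hypothetical voice threshold
    return params


def select_targets(params: Dict[str, Tuple[float, bool]], preset_number: int) -> List[str]:
    """Unit 920: keep voiced channels, ranked by energy, up to preset_number."""
    voiced = [cid for cid, (_, has_voice) in params.items() if has_voice]
    voiced.sort(key=lambda cid: params[cid][0], reverse=True)
    return voiced[:preset_number]


def process(channels: Dict[str, List[float]], targets: List[str]) -> Dict[str, List[float]]:
    """Unit 930: enhance target channels and attenuate the rest."""
    out = {}
    for cid, samples in channels.items():
        gain = db_to_linear(ENHANCE_DB if cid in targets else ATTENUATE_DB)
        out[cid] = [s * gain for s in samples]
    return out


def mix(channels: Dict[str, List[float]]) -> List[float]:
    """Unit 940: sum all channels; the hard clip is a stand-in for compression."""
    length = max((len(s) for s in channels.values()), default=0)
    mixed = [0.0] * length
    for samples in channels.values():
        for i, s in enumerate(samples):
            mixed[i] += s
    return [max(-1.0, min(1.0, s)) for s in mixed]


def run_pipeline(channels: Dict[str, List[float]], preset_number: int = 2):
    params = calc_params(channels)
    targets = select_targets(params, preset_number)
    return targets, mix(process(channels, targets))
```

Note that all input channels, including the attenuated ones, still contribute to the mix, which matches the role of the result output unit 940.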

marking the voice activity of the current channel based on the magnitude relationship between a preset threshold and the sum of the log-likelihood ratios of all sub-bands;

In some embodiments, determining a plurality of target channels according to the calculation result of the audio signal energy and the calculation result of the voice activity detection specifically comprises:

In the above specific implementation, the multi-channel audio and video signal processing device provided by the present invention calculates the audio signal energy and voice activity detection information of all input channels and determines a plurality of target channels from those two results. When the target channels meet the quantity requirement, audio enhancement is applied to the target channels and audio attenuation to the remaining channels, and the audio of all input channels is then mixed and output. The device addresses the noisy sound that results when many participants have their microphones on at the same time during multi-person audio and video calls, audio and video teaching, or audio and video conferences. Through reasonable audio channel selection and audio enhancement, users can more easily obtain the information they need, which improves the listening experience, makes online communication smoother, and saves users' time.

FIG. 10 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 10, the electronic device may include a processor 1010, a communications interface 1020, a memory 1030, and a communication bus 1040, where the processor 1010, the communications interface 1020, and the memory 1030 communicate with one another through the communication bus 1040. The processor 1010 may call logic instructions in the memory 1030 to execute the above method.

In addition, the logic instructions in the memory 1030 may be implemented as software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

In another aspect, the present invention also provides a computer program product comprising a computer program that may be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer is able to perform the above method.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.

From the description of the above implementations, those skilled in the art can clearly understand that each implementation can be realized by software plus a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the above technical solution, in essence, or the part of it that contributes over the prior art, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A multichannel-based audio/video signal processing method, the method comprising:

calculating the audio signal energy and voice activity detection information of all input channels;

determining a plurality of target channels according to the calculation result of the audio signal energy and the calculation result of the voice activity detection;

when the target channels meet the quantity requirement, performing audio enhancement on the target channels and audio attenuation on the remaining channels; and

mixing and outputting the audio of all input channels.

2. The multi-channel audio/video signal processing method according to claim 1, wherein calculating the audio signal energy of all input channels comprises:

calculating the maximum energy value of the audio sampling points in a single frame of the current channel;

calculating the average of the maximum energy values of all frames within a preset duration of the current channel;

performing a quantization shaping calculation on the average of the maximum energy values to obtain the audio signal energy of the current channel; and

taking each input channel in turn as the current channel to obtain the audio signal energy of all the input channels.
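By way of non-limiting illustration only, the energy calculation recited in claim 2 might be sketched as below. The claim does not define the energy measure or the quantization shaping step, so the use of absolute sample amplitude, the conversion to an integer dB value, the -96 dB floor, and all names are assumptions of this sketch.

```python
import math
from typing import Sequence

FLOOR_DB = -96  # hypothetical lower bound for the quantized value


def frame_peak(frame: Sequence[float]) -> float:
    """Maximum over the sampling points of one frame (absolute amplitude assumed)."""
    return max((abs(s) for s in frame), default=0.0)


def channel_energy(frames: Sequence[Sequence[float]]) -> int:
    """Average the per-frame maxima over the window, then quantize to an integer.

    `frames` holds the frames of one channel within the preset duration,
    with samples normalized to [-1.0, 1.0].
    """
    if not frames:
        return FLOOR_DB
    mean_peak = sum(frame_peak(f) for f in frames) / len(frames)
    if mean_peak <= 0.0:
        return FLOOR_DB
    level_db = 20.0 * math.log10(mean_peak)
    # Rounding and clamping stand in for the "quantization shaping" step.
    return max(FLOOR_DB, min(0, round(level_db)))


def all_channel_energies(channels):
    """Take each input channel in turn as the current channel."""
    return {cid: channel_energy(frames) for cid, frames in channels.items()}
```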

3. The multi-channel audio/video signal processing method according to claim 1, wherein calculating the audio signal energy and voice activity detection information of all input channels comprises:

resampling the audio signal of the current channel to a preset sampling frequency, dividing the audio signal into a plurality of frequency bands according to the Mel cepstrum frequency, and calculating energy information in each frequency band;

calculating the Gaussian probability density functions of voice and noise for each sub-band using a Gaussian mixture model;

calculating the log-likelihood ratio of voice to noise, and summing the log-likelihood ratios of all the sub-bands;

marking the voice activity of the current channel based on the magnitude relationship between a preset threshold and the sum of the log-likelihood ratios of all sub-bands; and

taking each input channel in turn as the current channel to obtain the voice activity detection information of all the input channels.
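By way of non-limiting illustration only, the likelihood-ratio decision recited in claim 3 might be sketched as below. The sketch starts from sub-band energies that are assumed to have been computed already, so resampling and band splitting are not shown. The mixture parameters, the threshold value, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Component = Tuple[float, float, float]  # (weight, mean, standard deviation)
Mixture = List[Component]

_TINY = 1e-300  # guards log() against a zero density


def gmm_pdf(x: float, mixture: Mixture) -> float:
    """Probability density of x under a one-dimensional Gaussian mixture."""
    total = 0.0
    for weight, mean, std in mixture:
        norm = std * math.sqrt(2.0 * math.pi)
        total += weight * math.exp(-((x - mean) ** 2) / (2.0 * std * std)) / norm
    return total


def log_likelihood_ratio(x: float, speech: Mixture, noise: Mixture) -> float:
    """Log of the voice density over the noise density for one sub-band."""
    return math.log(max(gmm_pdf(x, speech), _TINY)) - math.log(max(gmm_pdf(x, noise), _TINY))


def has_voice(
    band_energies: Sequence[float],
    speech_models: Sequence[Mixture],
    noise_models: Sequence[Mixture],
    threshold: float = 0.0,
) -> bool:
    """Sum the per-band log-likelihood ratios and compare with the threshold."""
    llr_sum = sum(
        log_likelihood_ratio(x, s, n)
        for x, s, n in zip(band_energies, speech_models, noise_models)
    )
    return llr_sum > threshold
```

With a hypothetical voice model centred at 10 and a noise model centred at 2 in each band, band energies near 10 yield a positive sum and the channel is marked as having voice, while energies near 2 yield a negative sum.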

4. The multi-channel-based audio/video signal processing method according to claim 1, wherein determining a plurality of target channels based on the calculation result of the audio signal energy and the calculation result of the voice activity detection, specifically comprises:

selecting a channel marked as having voice as an alternative channel according to the calculation result of voice activity detection;

and sorting all the alternative channels by audio signal energy according to the calculation result of the audio signal energy, and selecting a preset number of channels with larger audio signal energy as the target channels.

5. The multi-channel-based audio/video signal processing method according to claim 4, wherein performing audio enhancement on the target channels and audio attenuation on the remaining channels when the target channels meet the quantity requirement specifically comprises:

when the number of the target channels is greater than or equal to the preset number, performing audio enhancement on all the target channels;

when the number of the target channels is smaller than the preset number, selecting candidate channels from the remaining channels as complementary target channels, and performing audio enhancement on all the target channels;

wherein the number of the candidate channels is the difference between the preset number and the number of the target channels.

6. The multi-channel-based audio/video signal processing method according to claim 5, wherein selecting candidate channels from the remaining channels as complementary target channels specifically comprises:

selecting a channel marked as having voice as a candidate alternative channel according to the calculation result of the voice activity detection;

sorting all candidate alternative channels by audio signal energy according to the calculation result of the audio signal energy, and selecting at least one channel with larger audio signal energy as the complementary target channel; and

when the audio signal energy values of the candidate channels are the same, selecting the channel with the higher signal-to-noise ratio as the complementary target channel.
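By way of non-limiting illustration only, the selection and top-up logic of claims 4 to 6 might be sketched as below. The `ChannelInfo` record and the function name are hypothetical. Claim 6 also filters the candidates by their voice mark; because this sketch reuses a single set of voice marks for both passes, that filter would leave nothing to pick from, so the top-up here ranks the remaining channels by energy and signal-to-noise ratio only. That simplification is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ChannelInfo:
    channel_id: str
    energy: int       # quantized audio signal energy
    has_voice: bool   # voice activity detection result
    snr_db: float     # signal-to-noise ratio, used only to break energy ties


def select_targets(channels: Sequence[ChannelInfo], preset_number: int) -> List[str]:
    """Pick up to preset_number target channels, topping up if too few have voice."""

    def rank(c: ChannelInfo):
        # Higher energy first; equal energy falls back to higher SNR.
        return (c.energy, c.snr_db)

    voiced = sorted((c for c in channels if c.has_voice), key=rank, reverse=True)
    targets = voiced[:preset_number]

    shortfall = preset_number - len(targets)
    if shortfall > 0:
        rest = sorted((c for c in channels if c not in targets), key=rank, reverse=True)
        targets = targets + rest[:shortfall]
    return [c.channel_id for c in targets]
```

For example, with two voiced channels and a preset number of three, the shortfall is one, and among remaining channels with equal energy the one with the higher signal-to-noise ratio is added.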

7. A multi-channel-based audio/video signal processing apparatus, the apparatus comprising:

a parameter calculation unit for calculating the audio signal energy and voice activity detection information of all input channels;

a channel selection unit for determining a plurality of target channels according to the calculation result of the audio signal energy and the calculation result of the voice activity detection;

an audio processing unit for performing audio enhancement on the target channels and audio attenuation on the remaining channels when the target channels meet the quantity requirement; and

a result output unit for mixing and outputting the audio of all the input channels.

8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 6.

9. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 6.

10. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 6.