CN101414463B

CN101414463B - Audio mixing coding method, device and system

Info

Publication number: CN101414463B
Application number: CN2007101813767A
Authority: CN
Inventors: 张清; 苗磊; 李伟; 许剑峰; 许丽净; 杜正中; 胡晨; 杨毅; 齐峰岩
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2007-10-19
Filing date: 2007-10-19
Publication date: 2011-08-10
Anticipated expiration: 2027-10-19
Also published as: CN101414463A

Abstract

The invention discloses a terminal-side coding method, comprising: setting a mixing flag for sound information according to a mixing policy, and encoding the sound information according to the mixing flag to obtain core coded data; if the mixing flag indicates that mixing is required, calculating dynamic side information, and generating and outputting an audio coded stream containing the mixing flag, the core coded data and the dynamic side information; and if the mixing flag indicates that mixing is not required, generating and outputting an audio coded stream containing the mixing flag and the core coded data. The invention also discloses a corresponding network-side mixing coding method, and a device and a system for mixing coding. The scheme can solve the problems of signal overflow and error introduction during mixing without reducing coding efficiency.

Description

Audio mixing coding method, device and system

Technical field

The present invention relates to the technical field of multimedia communication, and in particular to an audio mixing coding method, device and system.

Background

Real-time multimedia communication services, such as multimedia conferencing systems, are increasingly widely used to meet growing service demands, so the technologies underlying multimedia conferencing systems are very important.

In multimedia conferencing, audio interaction is the most basic element. In a centralized conference, each terminal establishes a unicast connection with a Multi-point Controlling Unit (MCU), and sends audio streams to and receives audio streams from the MCU in real time. The inputs of the MCU are therefore audio streams encoded with various coding schemes, and its output is the audio stream produced by mixing according to a synthesis policy.

Figure 1 is a schematic diagram of a multimedia conference system, in which the dashed box can be regarded as an MCU. The audio streams input from terminal position 1, terminal position 2, and so on are decoded separately; the decoded streams are mixed in the mixing unit, and the mixed streams are then encoded separately and output to the corresponding terminals. In the system shown in Figure 1, M terminals participate in mixing. At a given time t, each terminal sends audio data to the MCU; the MCU first decodes the audio data, calculates mixing parameters for each signal, and finally mixes the decoded signals. The usual mixing algorithm sums the decoded data of all channels, encodes the sum, and transmits it to each terminal.

This time-domain superposition scheme often introduces noise. The audio signal that each terminal sends to the MCU lies within a range [min, max], where min is the lower limit and max is the upper limit. When all channels are summed directly, the result is likely to fall outside [min, max]: because digital audio has quantization upper and lower limits, the summation can overflow. The usual remedy is to detect overflow and then saturate, that is, a result above the upper limit is set to the upper limit and a result below the lower limit is set to the lower limit. This operation destroys the original time-domain characteristics of the speech signal and thereby introduces noise, which is why pops and speech discontinuities occur in some systems.
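The overflow problem described above can be illustrated with a minimal Python sketch. It assumes 16-bit signed PCM samples as the range [min, max]; the patent does not fix a sample format, and the function name `mix_saturating` is hypothetical.

```python
# Illustrative sketch only. The 16-bit PCM range is an assumption;
# the patent text only speaks of a generic range [min, max].
PCM_MIN, PCM_MAX = -32768, 32767

def mix_saturating(streams):
    """Sum equal-length sample lists and clamp each sum to [PCM_MIN, PCM_MAX].

    Returns the mixed samples and the number of samples that had to be clamped.
    """
    mixed = []
    clipped = 0
    for samples in zip(*streams):
        total = sum(samples)
        if total > PCM_MAX or total < PCM_MIN:
            clipped += 1  # overflow detected; saturation distorts this sample
        mixed.append(max(PCM_MIN, min(PCM_MAX, total)))
    return mixed, clipped

# Three terminals, three samples each.
streams = [[20000, -15000, 100], [20000, -15000, 200], [1000, -5000, 300]]
mixed, clipped = mix_saturating(streams)
```

In this example two of the three summed samples fall outside the range and are clamped, which is the distortion the background section attributes to plain time-domain summation.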

As the number of terminals participating in mixing increases, overflow becomes more frequent, so this kind of time-domain superposition scheme has an upper limit on the number of terminals, and that limit is low. Experiments show that in many cases, with as few as 4 terminals participating in mixing, the result contains so much noise and so many interruptions that the speech can no longer be understood.

Summary of the invention

In view of this, an embodiment of the present invention proposes an audio mixing coding method that can overcome the noise problem of time-domain mixing coding in the prior art. The method comprises the following steps:

Setting a mixing flag for sound information according to a mixing policy, encoding the sound information according to the mixing flag, and using the encoded result as core coded data;

If the mixing flag indicates that mixing is required, calculating dynamic side information, and generating and outputting an audio coded stream containing the mixing flag, the core coded data and the dynamic side information; if the mixing flag indicates that mixing is not required, generating and outputting an audio coded stream containing the mixing flag and the core coded data;

At the network side, receiving the audio coded streams from the terminals and determining, from the mixing flag in each stream, whether that stream needs to be mixed; from the M' audio coded streams that need to be mixed, selecting N audio coded streams according to their dynamic side information, mixing the core coded data of the selected N streams, and outputting the mixed audio coded stream, where N is less than or equal to M'.

An embodiment of the present invention also proposes a terminal-side coding method, comprising the following steps:

Setting a mixing flag for sound information according to a mixing policy, and encoding the sound information according to the mixing flag to obtain core coded data;

If the mixing flag indicates that mixing is required, calculating dynamic side information, and generating and outputting an audio coded stream containing the mixing flag, the core coded data and the dynamic side information; if the mixing flag indicates that mixing is not required, the terminal generates and outputs an audio coded stream containing the mixing flag and the core coded data.

An embodiment of the present invention also proposes a network-side audio mixing coding method, comprising the following steps:

Receiving M audio coded streams and determining, from the mixing flag in each stream, whether that stream needs to be mixed; from the M' audio coded streams that need to be mixed, selecting N audio coded streams according to their dynamic side information, mixing the core coded data of the selected N streams, and outputting the mixed audio coded stream, where M, M' and N are all positive integers, N is less than or equal to M', and M' is less than or equal to M.

An embodiment of the present invention proposes a multimedia conference system, including M terminals and a multipoint control unit, characterized in that:

The terminal is configured to set a mixing flag for the collected sound information according to a local mixing policy, encode the sound information according to the mixing flag, and use the encoded result as core coded data; and, depending on the flag, either to generate and output an audio coded stream containing the core coded data, a mixing flag indicating that mixing is required, and dynamic side information, or to generate and output an audio coded stream containing the core coded data and a mixing flag indicating that mixing is not required;

The multipoint control unit is configured to receive the audio coded streams from the terminals and determine, from the value of the mixing flag in each stream, whether that stream needs to be mixed; from the M' audio streams that need to be mixed, to select N audio streams according to their dynamic side information, mix the core coded data of the selected N streams, and output the mixed audio coded stream, where M, M' and N are all positive integers, N is less than or equal to M', and M' is less than or equal to M.

An embodiment of the present invention proposes a multimedia conference terminal, comprising:

A sound collection module, configured to collect sound information;

A mixing policy module, configured to set a mixing flag for the sound information collected by the sound collection module according to a preset mixing policy;

A core coding module, configured to encode the sound information and output core coded data;

A framing module, configured to calculate dynamic side information according to the mixing flag set by the mixing policy module and, according to the value of the mixing flag, to generate either an audio coded data frame containing the core coded data, the mixing flag and the dynamic side information, or an audio coded data frame containing the core coded data and the mixing flag;

An output module, configured to output the audio coded data frames generated by the framing module as an audio coded stream.

An embodiment of the present invention proposes a multipoint control unit, comprising:

A selection unit, configured to receive audio coded streams from M terminals, determine from the value of the mixing flag in each stream whether that stream needs to be mixed, and, from the M' audio coded streams that need to be mixed, select N audio coded streams according to their dynamic side information;

A mixing unit, configured to mix the core coded data in the N audio coded streams selected by the selection unit to obtain M' mixed audio coded streams;

A sending unit, configured to send the audio coded streams from the mixing unit to the corresponding destination terminals.

As the above technical solutions show, the terminal side marks a mixing flag in the coded stream and adds the corresponding dynamic side information, and the network side uses the mixing flag and the dynamic side information to select which audio coded streams to mix. This can solve the noise problem in mixing coding.

Description of the drawings

Figure 1 is a schematic diagram of a multimedia conference system in the prior art;

Figure 2 is a schematic diagram of a multimedia conference system according to an embodiment of the present invention;

Figure 3 is a structural diagram of a coded data frame in the audio coded stream output by the terminal encoder unit according to an embodiment of the present invention;

Figure 4 is a flowchart of terminal-side coding according to an embodiment of the present invention;

Figure 5 is a flowchart of MCU-side mixing coding according to an embodiment of the present invention;

Figure 6 is a block diagram of a multimedia conference terminal proposed by an embodiment of the present invention;

Figure 7 is a block diagram of a multipoint control unit proposed by an embodiment of the present invention.

Detailed description

The embodiments of the present invention propose a mixing coding method based on a mixing flag. In addition to the core coded stream that carries the speech, the data stream output by a terminal includes a mixing flag and dynamic side information, where the dynamic side information carries the information needed for mixing coding. If the mixing flag is set to indicate that mixing is required, the dynamic side information is included; if the flag is set to indicate that mixing is not required, the dynamic side information is omitted. The MCU uses the mixing flag to select the core coded streams that need to be mixed and mixes them.

To make the purpose, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

Figure 2 shows a schematic diagram of a multimedia conference system according to an embodiment of the present invention. The system includes M terminals, namely terminal 1, terminal 2, ..., terminal M, and one MCU.

Taking terminal 1 as an example, the terminal includes an encoder unit 201, which encodes the sound collected by the terminal's sound collection device, such as a microphone, to generate a core coded stream carrying the sound information. The encoder unit 201 also sets the mixing flag according to a locally configured mixing policy. The mixing policy determines whether the coded sound output by this terminal needs to be mixed, and different policies can be configured as needed. For example, terminals can be assigned different priorities so that audio streams from higher-priority terminals are mixed preferentially; alternatively, a sound energy threshold can be set so that a terminal's audio stream is mixed when the sound energy it collects exceeds the threshold. Multiple mixing policies can be used at the same time.
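A mixing policy of this kind might be sketched as follows. The function name `set_mixing_flag`, the default thresholds, and the choice to combine the two example policies with a logical OR are all assumptions made for illustration; the patent only says that priority-based and energy-based policies exist and can be used together.

```python
def set_mixing_flag(terminal_priority, sound_energy, *,
                    priority_threshold=1, energy_threshold=1000.0):
    """Return 1 if this terminal's stream should be mixed, else 0.

    Two example policies are combined (assumption: either one is enough):
      - priority policy: terminals at or above priority_threshold are mixed
      - energy policy:   frames louder than energy_threshold are mixed
    """
    high_priority = terminal_priority >= priority_threshold
    loud_enough = sound_energy > energy_threshold
    return 1 if (high_priority or loud_enough) else 0

flag_priority = set_mixing_flag(2, 10.0)     # high priority, quiet frame
flag_energy = set_mixing_flag(0, 5000.0)     # low priority, loud frame
flag_none = set_mixing_flag(0, 10.0)         # low priority, quiet frame
```

A deployment that wants a stricter rule could require both conditions instead; the text leaves that combination open.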

If the mixing flag indicates that mixing is required, the encoder unit 201 also generates dynamic side information and writes it into the audio stream; if the mixing flag indicates that mixing is not required, the audio stream output by the encoder unit 201 contains only the core coded data and the mixing flag.

Figure 3 shows the structure of a coded data frame in the audio coded stream output by the terminal encoder unit according to an embodiment of the present invention. Let the total length of a data frame be n bits. When the mixing flag indicates that mixing is required, the frame is as shown in the upper part of Figure 3: a t-bit mixing flag, m bits of dynamic side information, and n-m-t bits of core coded data. The mixing flag is placed in the frame header so that the MCU can identify it easily. When the mixing flag indicates that mixing is not required, the frame is as shown in the lower part of Figure 3: a t-bit mixing flag and n-t bits of core coded data.

For G.711 narrowband enhancement layer (Low Band Enhance, LBE) coding, the quantities in Figure 3 can take the following values: t = 1, n = 80, m = 9.

The side information includes the frame energy (Frame Energy) and the voicing score (Voicing score). If the side information is 9 bits long, 6 bits carry the quantized frame energy and 3 bits carry the quantized voicing score.
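The two frame layouts can be sketched in Python using the example values t = 1, n = 80, m = 9. Frames are modelled as bit strings for readability. The function name `pack_frame`, the order of the two side-information fields (energy before voicing score), and the use of 1 to mean that mixing is required are assumptions; the text only states that the flag sits at the head of the frame.

```python
N_BITS = 80        # total frame length n
T_BITS = 1         # mixing flag length t
ENERGY_BITS = 6    # quantized frame energy
VOICING_BITS = 3   # quantized voicing score
M_BITS = ENERGY_BITS + VOICING_BITS  # side information length m = 9

def pack_frame(mix_flag, core_bits, energy_q=0, voicing_q=0):
    """Build one frame as a string of '0'/'1' characters.

    mix_flag = 1: flag | energy | voicing | core data (n - m - t = 70 bits)
    mix_flag = 0: flag | core data (n - t = 79 bits)
    """
    flag = int(mix_flag)
    if flag not in (0, 1):
        raise ValueError("mix_flag must be 0 or 1")
    side = ""
    if flag:
        if not 0 <= energy_q < 2 ** ENERGY_BITS:
            raise ValueError("energy_q does not fit in 6 bits")
        if not 0 <= voicing_q < 2 ** VOICING_BITS:
            raise ValueError("voicing_q does not fit in 3 bits")
        side = format(energy_q, f"0{ENERGY_BITS}b") + format(voicing_q, f"0{VOICING_BITS}b")
    expected = N_BITS - T_BITS - len(side)
    if len(core_bits) != expected:
        raise ValueError(f"core data must be {expected} bits")
    return str(flag) + side + core_bits

mixed_frame = pack_frame(1, "0" * 70, energy_q=63, voicing_q=5)
plain_frame = pack_frame(0, "1" * 79)
```

Both frames are 80 bits long; the frame without side information carries 9 more bits of core coded data.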

The frame energy is calculated by formula (1):

$Frame\_Energy = \frac{\sum_{i=0}^{Frame\_Length-1} S^{2}(i)}{Frame\_Length} \qquad (1)$

Frame_Length is the frame length, S(i) is the low-band signal output by the Quadrature Mirror Filter (QMF), and i is the index of the sample within the frame.
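As a minimal sketch, the frame energy is the mean of the squared samples of one frame. The function name `frame_energy` is hypothetical, and the input is assumed to be the low-band samples S(i) of a single frame after the QMF; quantization to 6 bits is not shown because the text does not describe the quantizer.

```python
def frame_energy(samples):
    """Mean of squared samples over one frame, as in formula (1)."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    return sum(s * s for s in samples) / len(samples)

energy = frame_energy([1, -2, 3, -4])  # (1 + 4 + 9 + 16) / 4
```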

The voicing score is calculated by formula (2):

$Voicing\_score = \frac{Zero\_Crossing\_Rate}{Scale\_Factor} \qquad (2)$

Here, the zero-crossing rate (Zero_Crossing_Rate) is the number of times the time-domain waveform crosses zero within 10 ms, and the scale factor (Scale_Factor) is a preset scaling constant with a value in [0, 1].
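A sketch of the voicing score follows. It assumes the input holds the samples of one 10 ms window, counts a zero crossing whenever consecutive samples differ in sign (treating 0 as non-negative), and rejects a scale factor of 0 to avoid division by zero. These conventions and the function names are assumptions, since the text does not define them.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive samples (0 counts as non-negative)."""
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count

def voicing_score(samples, scale_factor):
    """Zero-crossing count of the window divided by the preset scale factor."""
    if not 0 < scale_factor <= 1:
        raise ValueError("scale_factor must be in (0, 1]")
    return zero_crossings(samples) / scale_factor

window = [1, -1, 2, 3, -4]
crossings = zero_crossings(window)   # sign changes: 1->-1, -1->2, 3->-4
score = voicing_score(window, 0.5)
```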

Depending on the actual situation, the dynamic side information can also be set to other quantities that can serve as a basis for the mixing decision, for example the result of voice activity detection (VAD).

After the audio stream output by a terminal is sent to the MCU, it is first input to the selection unit 202. The selection unit 202 first identifies the mixing flag in the received audio coded stream and determines from its value whether that stream needs to be mixed. If not, the selection unit 202 outputs the stream to the corresponding destination terminal. From all M' (M' less than or equal to M) audio coded streams that need to be mixed, the selection unit 202 selects N (N less than or equal to M') streams according to their dynamic side information and sends them to the corresponding decoders. After decoding, the streams are sent to the mixing unit 203 and mixed to obtain M' mixed audio streams, which are then each encoded by an encoder and sent to the corresponding terminals.

The terminal-side coding process of the embodiment of the present invention is shown in Figure 4 and includes the following steps:

Step 401: Set the mixing flag for the collected sound information according to the local mixing policy, then encode the sound information and use the encoded result as core coded data.

Step 402: If the mixing flag is set to indicate that mixing is required, calculate the dynamic side information. The frame energy and the voicing score can be calculated according to formula (1) and formula (2) above and used as the dynamic side information.

Step 403: Generate and output the audio coded stream. Generating the audio coded stream specifically includes: if the mixing flag is set (mixing required), generating an audio coded data frame that includes the mixing flag, the core coded data and the dynamic side information; if the mixing flag is not set (mixing not required), generating an audio coded data frame that includes the mixing flag and the core coded data. The mixing flag is placed at the very beginning of the data frame and is preferably 1 bit long.

The MCU-side mixing coding process of the embodiment of the present invention is shown in Figure 5 and includes the following steps:

Step 501: The MCU receives an audio coded stream from a terminal and determines, from the value of the mixing flag in it, whether the stream needs to be mixed. If so, step 503 is executed; otherwise, step 502 is executed.

Step 502: Send the audio coded stream directly to the corresponding destination terminal, and end the processing of this stream.

Step 503: For the audio coded streams received at the same time from M' terminals whose mixing flags all indicate that mixing is required, the MCU selects N audio coded streams according to the dynamic side information in these streams and discards the remaining M'-N streams, where N is less than or equal to M'.

For example, the selection can be based on the energy value in the side information: a stream whose energy is greater than a threshold T is mixed, and a stream whose energy is below T is not.

Step 504: Decode the core coded data of each of the selected N audio coded streams, and mix the decoded data to obtain M' mixed audio streams.

Step 505: Encode each of the M' mixed audio streams, and send the M' encoded mixed audio coded streams to the M' destination terminals respectively.
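The routing and selection part of this flow (reading the flag, forwarding unmixed streams, and choosing N of the M' candidates by energy) can be sketched as below. Decoding, mixing and re-encoding are left out. The sketch assumes 80-bit frames given as bit strings with a 1-bit flag followed by a 6-bit quantized energy, a flag value of 1 meaning that mixing is required, and an optional cap `max_streams` on N; the function names and these layout details are assumptions for illustration.

```python
ENERGY_BITS = 6  # assumed width and position of the quantized frame energy

def parse_header(frame_bits):
    """Return (mix_flag, quantized_energy or None) for one frame bit string."""
    mix_flag = int(frame_bits[0])
    if not mix_flag:
        return 0, None
    return 1, int(frame_bits[1:1 + ENERGY_BITS], 2)

def mcu_select(frames, threshold, max_streams=None):
    """frames maps terminal id -> frame bit string.

    Returns (forwarded, selected, discarded) lists of terminal ids.
    """
    forwarded, candidates = [], {}
    for tid, bits in frames.items():
        flag, energy = parse_header(bits)
        if flag:
            candidates[tid] = energy        # one of the M' streams to consider
        else:
            forwarded.append(tid)           # pass through unmixed
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    selected = [tid for tid in ranked if candidates[tid] > threshold]
    if max_streams is not None:
        selected = selected[:max_streams]   # keep at most N streams
    discarded = [tid for tid in ranked if tid not in selected]
    return forwarded, selected, discarded

frames = {
    "t1": "1" + "101000" + "000" + "0" * 70,  # energy 40
    "t2": "0" + "0" * 79,                     # no mixing requested
    "t3": "1" + "000101" + "000" + "0" * 70,  # energy 5
    "t4": "1" + "011111" + "000" + "0" * 70,  # energy 31
}
forwarded, selected, discarded = mcu_select(frames, threshold=20)
```

Because only the flag and side information are read before selection, the MCU in this sketch never has to decode the discarded streams, which is in line with the reduced MCU complexity that the description claims.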

Figure 6 shows a multimedia conference terminal proposed by an embodiment of the present invention, comprising:

A sound collection module 601, configured to collect sound information;

A mixing policy module 602, configured to set a mixing flag for the sound information collected by the sound collection module 601 according to a preset mixing policy;

A core coding module 603, configured to encode the sound information and output core coded data. If the mixing policy module 602 sets the mixing flag to indicate that mixing is not required, the core coding module 603 does not need to reserve bits for the dynamic side information when encoding; if the flag is set to indicate that mixing is required, the core coding module 603 must take the bit allocation of the dynamic side information into account. For example, if a coded data frame has n bits in total, the mixing flag has t bits, and the dynamic side information has m bits, then the core coded data produced by the core coding module 603 is n-t bits long when no bits are reserved for dynamic side information, and n-m-t bits long when they are;

A framing module 604, configured to calculate dynamic side information according to the mixing flag set by the mixing policy module 602 and, according to the value of the mixing flag, to generate either an audio data frame containing the core coded data, the mixing flag and the dynamic side information, or an audio data frame containing the core coded data and the mixing flag;

An output module 605, configured to output the audio data frames generated by the framing module 604 as an audio coded stream.

Figure 7 shows a multipoint control unit proposed by an embodiment of the present invention, comprising:

A selection unit 701, configured to receive audio coded streams from M terminals, determine from the value of the mixing flag in each stream whether that stream needs to be mixed, and, from the M' audio coded streams that need to be mixed, select N audio coded streams according to their dynamic side information;

A mixing unit 702, configured to mix the core coded data in the N audio coded streams selected by the selection unit to obtain M' mixed audio streams;

A sending unit 703, configured to send the audio streams from the mixing unit to the corresponding destination terminals.

The selection unit 701 sends the audio coded streams that do not need to be mixed to the sending unit 703, and the sending unit 703 sends these streams from the selection unit to the corresponding destination terminals.

The multipoint control unit further comprises: a decoder 704, configured to decode the core coded data in the audio coded streams selected by the selection unit 701 and send the decoded data to the mixing unit 702;

An encoder 705, configured to encode the mixed audio streams from the mixing unit 702 and send the resulting audio coded streams to the sending unit 703.

The scheme of the embodiments marks a mixing flag in the coded stream, adds the corresponding dynamic side information, and allocates the side-information bits dynamically according to the mixing flag. The MCU uses the mixing flag and the dynamic side information to select the audio coded streams that need to be mixed, which can solve the problems of signal overflow and of errors introduced when mixing large signals, and reduces the computational complexity of the MCU. When no mixing is performed, all of the frame's bits apart from the flag are available to the core coder, which improves core coding quality. The scheme can be used in a mixing system and can also be applied to the codecs of common codec systems, which helps to achieve intelligent control of the coded stream and enhances the interactivity of the MCU.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An audio mixing coding method, characterized by comprising the following steps:

setting a mixing flag for sound information according to a mixing policy, encoding the sound information according to the mixing flag, and using the encoded result as core coded data;

if the mixing flag indicates that mixing is required, calculating dynamic side information, and generating and outputting an audio coded stream containing the mixing flag, the core coded data and the dynamic side information; if the mixing flag indicates that mixing is not required, generating and outputting an audio coded stream containing the mixing flag and the core coded data;

Network side receives the stream of audio codes of self terminal, judge whether that according to wherein audio mixing flag information needs carry out audio mixing to this stream of audio codes and handle, needs are carried out M ' the road stream of audio codes that audio mixing is handled, select N road stream of audio codes according to dynamic side information wherein, the core encoder data of selected N road stream of audio codes are carried out audio mixing to be handled, and the stream of audio codes behind the output audio mixing, wherein N is smaller or equal to M '.
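The network-side step of claim 1 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `AudioStream` type, the `select_streams` function, and the rule "keep the N streams with the largest frame energy" are all assumptions, since the claim only states that N streams are selected according to the dynamic side information.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AudioStream:
    """Hypothetical container for one received encoded audio stream."""
    terminal_id: int
    mix_flag: bool                 # mixing flag carried in the stream
    core_data: bytes               # core encoded data (opaque here)
    frame_energy: Optional[float]  # dynamic side information, present only if mix_flag is set


def select_streams(
    streams: List[AudioStream], n: int
) -> Tuple[List[AudioStream], List[AudioStream]]:
    """Split M received streams into (N streams to mix, streams to forward unmixed)."""
    passthrough = [s for s in streams if not s.mix_flag]
    candidates = [s for s in streams if s.mix_flag]  # the M' streams that need mixing
    # Assumed selection rule: rank the candidates by frame energy, largest first.
    ranked = sorted(candidates, key=lambda s: s.frame_energy or 0.0, reverse=True)
    return ranked[:n], passthrough
```

Because the ranking uses only side information, the MCU in this sketch does not need to decode any stream before deciding which ones to mix.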

2. The method according to claim 1, characterized in that the dynamic side information comprises frame energy, a voicing score, and/or voice activity detection.

3. The method according to claim 2, characterized in that calculating the dynamic side information comprises calculating the frame energy according to the formula

wherein Frame_Energy denotes the frame energy, S(i) is the low-band signal output by the quadrature mirror filter (QMF), and i is the index of the sample within the frame.
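The formula of claim 3 is not reproduced in this text. As a hedged illustration only, the sketch below assumes that the frame energy is the sum of the squared low-band samples S(i) over one frame; the function name `frame_energy` and this definition are assumptions and may differ from the patented formula (for example by a normalization or a logarithm).

```python
from typing import Sequence


def frame_energy(low_band: Sequence[float]) -> float:
    """Assumed definition: sum of squared samples S(i) of the QMF low band in one frame."""
    return float(sum(x * x for x in low_band))
```

For a frame with samples 1, -2, and 3, this assumed definition gives 1 + 4 + 9 = 14.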

4. The method according to claim 2, characterized in that calculating the dynamic side information comprises calculating the voicing score according to the formula

wherein Voicing_score denotes the voicing score; Zero_Crossing_Rate denotes the number of zero crossings of the time-domain waveform of the sound information within a predetermined time; and Scale_Factor is a preset reduction constant with a value in [0, 1].
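The formula that combines Zero_Crossing_Rate and Scale_Factor into Voicing_score is not reproduced in this text, so it is not sketched here. The input Zero_Crossing_Rate, however, is defined in the claim as a count of zero crossings of the time-domain waveform, and that count can be illustrated. The function name and the treatment of samples that are exactly zero (they are skipped and do not count as a sign change on their own) are assumptions.

```python
from typing import Sequence


def zero_crossing_count(samples: Sequence[float]) -> int:
    """Count sign changes of the time-domain waveform within the given window."""
    count = 0
    prev_sign = 0  # sign of the most recent non-zero sample
    for x in samples:
        sign = (x > 0) - (x < 0)
        if sign == 0:
            continue  # assumption: exact zeros do not change the tracked sign
        if prev_sign != 0 and sign != prev_sign:
            count += 1
        prev_sign = sign
    return count
```

A strongly voiced frame tends to have few zero crossings and a noise-like frame many, which is why a zero-crossing count is a plausible input to a voicing measure.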

5. The method according to claim 1, characterized in that, if the determination made according to the mixing flag is that the encoded audio stream does not need mixing, the method further comprises: outputting the encoded audio stream to the destination terminal.

6. The method according to any one of claims 1 to 5, characterized in that mixing the core encoded data of the selected N encoded audio streams and outputting the mixed encoded audio stream comprises: decoding the core encoded data of each of the selected N streams; mixing the N decoded signals to obtain M' mixed audio streams; encoding each of the M' mixed audio streams; and sending the M' encoded mixed audio streams to the M' destination terminals respectively.
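The mixing step of claim 6 can be sketched on already decoded frames. The claim does not say how the M' mixes differ from one another; the sketch assumes the common conferencing practice of leaving a destination's own signal out of the mix it receives, and it saturates the sum to the 16-bit PCM range. Both choices, along with the function name `mix_for_destinations`, are assumptions.

```python
from typing import Dict, List

PCM_MAX = 32767
PCM_MIN = -32768


def mix_for_destinations(
    decoded: Dict[int, List[int]], destinations: List[int]
) -> Dict[int, List[int]]:
    """Build one mixed frame per destination from the N selected decoded frames.

    `decoded` maps a source terminal id to its decoded 16-bit PCM frame;
    all frames are assumed to have the same length.
    """
    frame_len = len(next(iter(decoded.values()))) if decoded else 0
    mixes: Dict[int, List[int]] = {}
    for dest in destinations:
        mixed = [0] * frame_len
        for src, frame in decoded.items():
            if src == dest:
                continue  # assumption: a terminal does not receive its own signal
            for i, sample in enumerate(frame):
                mixed[i] += sample
        # Assumed overflow handling: saturate to the 16-bit range.
        mixes[dest] = [max(PCM_MIN, min(PCM_MAX, v)) for v in mixed]
    return mixes
```

Each resulting frame would then be passed to an encoder and sent to its destination terminal, as the claim describes.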

7. A terminal-side encoding method, characterized by comprising the following steps:

setting a mixing flag for sound information according to a mixing strategy, and encoding the sound information according to the mixing flag to obtain core encoded data; and

if the mixing flag indicates that mixing is needed, calculating dynamic side information, and generating and outputting an encoded audio stream comprising the mixing flag, the core encoded data, and the dynamic side information; if the mixing flag indicates that mixing is not needed, generating and outputting, by the terminal, an encoded audio stream comprising the mixing flag and the core encoded data.
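The two stream variants of claim 7 can be illustrated with a toy frame layout. The layout below (one flag byte, then a 4-byte big-endian float carrying the frame energy when mixing is needed, then the core encoded data) is purely hypothetical; the patent text shown here does not specify field sizes or order, and the core encoded data is treated as an opaque byte string.

```python
import struct
from typing import Optional, Tuple


def build_frame(core_data: bytes, needs_mixing: bool, frame_energy: float = 0.0) -> bytes:
    """Pack one frame: flag byte, optional side information, core encoded data."""
    header = bytes([1 if needs_mixing else 0])
    if needs_mixing:
        side_info = struct.pack(">f", frame_energy)  # hypothetical 4-byte side information
        return header + side_info + core_data
    # No side information is sent, so those bytes remain available for core data.
    return header + core_data


def parse_frame(frame: bytes) -> Tuple[bool, Optional[float], bytes]:
    """Inverse of build_frame: return (needs_mixing, frame_energy or None, core_data)."""
    needs_mixing = frame[0] == 1
    if needs_mixing:
        (energy,) = struct.unpack(">f", frame[1:5])
        return needs_mixing, energy, frame[5:]
    return needs_mixing, None, frame[1:]
```

In this toy layout a frame that does not need mixing is four bytes shorter for the same core data, which mirrors the idea that bits not spent on side information can go to core coding.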

8. A network-side audio mixing and encoding method, characterized by comprising the following steps:

receiving M encoded audio streams; determining from the mixing flag in each stream whether that stream needs mixing; from the M' encoded audio streams that need mixing, selecting N streams according to their dynamic side information; mixing the core encoded data of the selected N streams; and outputting the mixed encoded audio stream, wherein M, M', and N are positive integers, N is less than or equal to M', and M' is less than or equal to M.

9. A multimedia conference system comprising M terminals and a multipoint control unit, characterized in that:

each terminal is configured to set a mixing flag for the collected sound information according to a local mixing strategy, and to encode the sound information according to the mixing flag to obtain core encoded data; and, according to the mixing flag set under the local mixing strategy, to generate and output an encoded audio stream comprising the core encoded data, a mixing flag indicating that mixing is needed, and dynamic side information, or to generate and output an encoded audio stream comprising the core encoded data and a mixing flag indicating that mixing is not needed; and

the multipoint control unit is configured to receive the encoded audio streams from the terminals; to determine from the value of the mixing flag in each stream whether that stream needs mixing; from the M' encoded audio streams that need mixing, to select N streams according to their dynamic side information; to mix the core encoded data of the selected N streams; and to output the mixed encoded audio stream, wherein M, M', and N are positive integers, N is less than or equal to M', and M' is less than or equal to M.

10. A multimedia conference terminal, characterized by comprising:

a sound collection module, configured to collect sound information;

a mixing policy module, configured to set a mixing flag for the sound information collected by the sound collection module according to a preset mixing strategy;

a core encoding module, configured to encode the sound information and output core encoded data;

a framing module, configured to calculate dynamic side information according to the mixing flag set by the mixing policy module and, according to the value of the mixing flag, to generate an encoded audio data frame comprising the core encoded data, the mixing flag, and the dynamic side information, or an encoded audio data frame comprising the core encoded data and the mixing flag; and

an output module, configured to output the encoded audio data frames generated by the framing module as an encoded audio stream.

11. A multipoint control unit, characterized by comprising:

a selection unit, configured to receive encoded audio streams from M terminals; to determine from the value of the mixing flag in each stream whether that stream needs mixing; and, from the M' encoded audio streams that need mixing, to select N streams according to their dynamic side information;

a mixing unit, configured to mix the core encoded data of the N encoded audio streams selected by the selection unit to obtain M' mixed audio streams; and

a sending unit, configured to send the audio streams from the mixing unit to the corresponding destination terminals.

12. The multipoint control unit according to claim 11, characterized in that the selection unit sends the encoded audio streams that do not need mixing to the sending unit, and the sending unit sends the encoded audio streams from the selection unit to the corresponding destination terminals.

13. The multipoint control unit according to claim 11 or 12, characterized in that the multipoint control unit further comprises: a decoder, configured to decode the core encoded data of the encoded audio streams selected by the selection unit and to send the decoded data to the mixing unit; and

an encoder, configured to encode the mixed audio streams from the mixing unit and to send the resulting encoded audio streams to the sending unit.