
CN100556151C - Video terminal and audio code stream processing method - Google Patents

Video terminal and audio code stream processing method

Info

Publication number
CN100556151C
Authority
CN
China
Prior art keywords
lip
speaker
sound
sound source
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006100646565A
Other languages
Chinese (zh)
Other versions
CN1997161A (en)
Inventor
詹五洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FUGUE ACOUSTICS TECHNOLOGY Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CNB2006100646565A
Publication of CN1997161A
Application granted
Publication of CN100556151C
Expired - Fee Related
Anticipated expiration

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Embodiments of the invention disclose an audio code stream processing method: a compressed video stream is decoded to obtain images containing a sound source, and the position of the sound source in the images is detected; the corresponding compressed audio stream is decoded to obtain voice information; the voice information is then processed according to the position information of the sound source, so that the direction of the replayed sound matches the position of the sound source. In this way, the receiving end does not need to rely on sound source position information provided by the transmitting end, and the position of the sound source is still matched to the direction of the replayed sound. Embodiments of the invention also disclose a video terminal.

Description

Video terminal and audio code stream processing method

Technical Field

The present invention relates to communication technology, and in particular to a video terminal and an audio code stream processing method.

Background Art

With the popularization of broadband, video communication plays an increasingly important role in social life, and the era of video-based communication has begun. However, television screens keep getting larger, and some video communication systems use projectors or video walls for display, so a participant may move across a wide area of the picture. In current multimedia communication systems, the sound does not change with the speaker's position; that is, the sound carries no directional information, which makes video communication lack realism.

The prior art discloses a method for solving this problem: a strip-shaped device containing multiple microphones, multiple loudspeakers, and a camera is placed on top of the television. After the sound signals collected by the microphones are processed, a voice signal and the speaker's direction relative to the strip-shaped device are obtained. The sending end of the video communication system transmits the voice signal and the speaker direction information over the network to the receiving end, and the receiving end selects one or more loudspeakers for playback according to the received direction information, so that the speaker's direction is reproduced at the receiving end.

In this solution, the speaker direction collected at the sending end is measured relative to the strip-shaped device rather than to the camera lens. When the camera is turned, a speaker who is directly in front of the strip-shaped device may appear at the edge of the picture, or even outside it, while the collected sound direction still points straight ahead. As a result, the speaker's position in the picture does not match the collected sound direction.

In addition, the sending end must transmit the direction information to the receiving end over the network. If the sending end and the receiving end are devices from different manufacturers, interoperability problems arise; that is, the receiving end may be unable to process the sending end's direction information correctly.

Summary of the Invention

Embodiments of the present invention provide a video terminal and an audio code stream processing method, so that the sending end does not need to transmit sound source position information over the network and the replayed sound can still be accurately matched to the position of the sound source.

An audio code stream processing method comprises:

decoding a compressed video stream to obtain images containing a sound source;

if the sound source is a speaker, detecting the position of the speaker's lips in the images according to lip features;

detecting a lip movement position according to the detected lip position;

if a lip movement position has been detected in the previous frame obtained by decoding the compressed video stream, checking in the current frame whether lips are present near the lip movement position of the previous frame; if not, detecting the lip movement position over the whole image; if so, further judging whether the lips are moving, and if they are moving, taking the position of the moving lips as the lip movement position;

detecting position information of the sound source according to the detected lip movement position;

decoding the compressed audio stream corresponding to the compressed video stream to obtain voice information; and

processing the voice information according to the position information of the sound source, so that the direction of the replayed sound matches the position information of the sound source.

A video terminal comprises:

a video decoding module, configured to decode a received compressed video stream and output the decoded images;

an audio decoding module, configured to decode the compressed audio stream corresponding to the received compressed video stream and output the decoded voice information;

a sound source position detection module, configured to receive the images sent by the video decoding module, extract lip features of a speaker serving as the sound source, and detect the position of the speaker's lips in the images according to the lip features;

detect a lip movement position according to the detected lip position;

if a lip movement position has been detected in the previous frame obtained by decoding the compressed video stream, check in the current frame whether lips are present near the lip movement position of the previous frame; if not, detect the lip movement position over the whole image; if so, further judge whether the lips are moving, and if they are moving, take the position of the moving lips as the lip movement position;

and detect the sound source position information according to the detected lip movement position; and

a sound orientation processing module, configured to receive the voice information sent by the audio decoding module and the sound source position information sent by the sound source position detection module, and match the direction of the sound to the position of the sound source.

By detecting the position information of the sound source in the image and processing the replayed sound accordingly, embodiments of the present invention make the direction of the sound replayed through the loudspeakers match the position of the sound source in the image; at the same time, the receiving terminal does not have to rely on the sending terminal to provide sound source position information.

Brief Description of the Drawings

Fig. 1 is a flowchart of the method according to an embodiment of the present invention;

Fig. 2 shows an application scenario of an embodiment of the present invention;

Fig. 3 is a flowchart of lip movement detection in an embodiment of the present invention;

Fig. 4 is a structural diagram of a video terminal in an embodiment of the present invention.

Detailed Description of the Embodiments

An embodiment of the present invention provides an audio code stream processing method. As shown in Fig. 1, the method consists of the following steps:

decoding a compressed video stream to obtain images containing a sound source, and detecting the position information of the sound source in the images;

decoding the compressed audio stream corresponding to the compressed video stream to obtain voice information; and

processing the voice information according to the position information of the sound source, so that the direction of the replayed sound matches the position of the sound source.

In order to make the object, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.

A video conference is used below as an application scenario of an embodiment of the present invention to describe the invention in detail. This application scenario does not limit the present invention.

Fig. 2 is a schematic diagram of a video communication system. In Fig. 2, 10 is the sending-end site, 11 is the receiving-end site, and 12 is the communication network, which may be an IP network, a PSTN network, a wireless network, or the like. At site 10, 101 is a camera, 102 is a video communication terminal, 103 is a television, 104 is a participant, and 105 and 106 are loudspeakers. The microphone may be built into terminal 102, or it may be a separate external unit connected to terminal 102 through a transmission line. At site 11, 111 is a camera, 112 is a video communication terminal, 113 is a television, 104a is the image of participant 104, and 115 and 116 are loudspeakers. The microphone may be built into terminal 112, or it may be a separate external unit connected to terminal 112 through a transmission line. After the camera 101 at the sending-end site 10 captures images, it sends them to terminal 102; terminal 102 encodes the images and transmits them over network 12 to terminal 112, which decodes the received image code stream and sends the decoded images to television 113 for display. After the microphone at site 10 captures the sound signal, it passes the signal to terminal 102; terminal 102 performs audio encoding and transmits the encoded audio code stream over network 12 to terminal 112, which decodes the received audio code stream and sends it to loudspeakers 115 and 116 for playback.

At site 11 in Fig. 2, in order to give the sound a sense of presence, the sound replayed by loudspeakers 115 and 116 needs to match the position of the speaker 104a.

The method of the present invention is described below using a video conference as an example, with the person speaking in the conference as the sound source:

Step 1: decode the compressed video stream transmitted from the sending end to obtain the images of the sending end, and then detect the position information of the speaker in the images.

Decoding the compressed video stream yields a sequence of image frames; the images in the frame sequence are then analyzed to obtain the position information of the speaker.

There are many ways to detect the position of the speaker. For example, image recognition techniques can use certain characteristics of the speaker as features to detect the speaker's position in the image; usable features include the face, the eyes, the lips, and so on. Below, the speaker's lips are taken as the feature to illustrate how the speaker's position information is determined by detecting the position of the speaker's lip movement.

Please refer to the lip movement detection flow in Fig. 3.

S11: detect the lip movement position in the current frame; if there is lip movement in the current frame, go to step S12; otherwise go to step S14;

S12: further judge whether there are multiple lip movement positions; if there are, select one of them, or compute the center of the multiple lip movement positions and take this center as the lip movement position, then go to step S13; otherwise go directly to step S13;

S13: output the lip movement position;

S14: do not output a lip movement position.

The lip movement position is the position of the speaker's lips. Existing detection methods can be used to find it. A simple and effective approach is based on lip color; the search for lip color can be carried out in the YIQ or YUV color space. For example, in the YIQ space, statistics and experiments give the following optimal thresholds for the lip-color components: Y ∈ [80, 220], I ∈ [12, 78], Q ∈ [7, 25]. With these thresholds the lip position can be found fairly easily. Searching by lip color alone inevitably produces some false detections, so after the lip position has been found by color it can be further verified using the skin color around the lips. Skin color also falls within a relatively concentrated threshold range, and these thresholds can be used to judge whether the color around the lips is skin color; if it is, the lip position has been judged correctly, otherwise it has not. Other features, such as eye features, can also be used.
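For illustration only (this is not part of the patent text), the color-threshold test described above might be sketched in Python as follows. The standard RGB-to-YIQ conversion matrix and the assumption that the quoted I and Q thresholds refer to values computed from 8-bit RGB are assumptions made here; the threshold numbers themselves are the ones quoted above.

```python
import numpy as np

# Lip-color thresholds quoted in the description (YIQ space).
Y_RANGE = (80, 220)
I_RANGE = (12, 78)
Q_RANGE = (7, 25)

# Standard NTSC RGB -> YIQ matrix; the exact I/Q scaling used in the patent
# is not specified, so this scaling is an assumption.
RGB2YIQ = np.array([[0.299,  0.587,  0.114],
                    [0.596, -0.274, -0.322],
                    [0.211, -0.523,  0.312]])

def lip_color_mask(rgb_image: np.ndarray) -> np.ndarray:
    """Boolean mask of pixels whose YIQ values fall inside the lip-color ranges."""
    yiq = rgb_image.astype(np.float64) @ RGB2YIQ.T
    y, i, q = yiq[..., 0], yiq[..., 1], yiq[..., 2]
    return ((Y_RANGE[0] <= y) & (y <= Y_RANGE[1]) &
            (I_RANGE[0] <= i) & (i <= I_RANGE[1]) &
            (Q_RANGE[0] <= q) & (q <= Q_RANGE[1]))

def lip_candidate_center(rgb_image: np.ndarray):
    """Centroid (x, y) of the lip-colored pixels, or None if no pixel matches."""
    mask = lip_color_mask(rgb_image)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())
```

A skin-color check around the returned centroid, as suggested in the description, would further reduce false detections; that step is omitted from this sketch.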

After the lip position has been determined, it is also necessary to judge whether the lips are moving, which can easily be done from the size of the lips at the same position in several consecutive frames and how quickly that size changes. Because the lip movement position is continuous over time, the whole image does not need to be searched in every frame. Specifically, if a lip movement position was detected in the previous frame, the current frame first checks whether lips are present near that position; if not, the whole image is searched for the lip movement position; if so, it is further judged whether the lips are moving, and if they are, the position of the moving lips is taken as the lip movement position. Otherwise, a predetermined number of frames is set and the lip movement position is kept unchanged for that many frames after the current frame; if the lips still have not moved after the predetermined number of frames, the search for the lip movement position over the whole image starts again. This method greatly reduces the amount of computation and ensures the continuity of the sound direction.
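One possible reading of this frame-to-frame strategy is sketched below, again purely as an illustration rather than the patent's own code. The three injected callables stand in for the color-threshold search and the frame-difference motion test, and the default of 10 hold frames is an assumption; the patent only speaks of a "predetermined number of frames".

```python
from typing import Callable, Optional, Tuple

Position = Tuple[float, float]
LocalSearch = Callable[[object, Position], Optional[Position]]   # search near a previous position
GlobalSearch = Callable[[object], Optional[Position]]            # search the whole image
MotionTest = Callable[[object, Position], bool]                  # are the lips at this position moving?

class LipMovementTracker:
    """Sketch of the search strategy described above."""

    def __init__(self, local_search: LocalSearch, global_search: GlobalSearch,
                 motion_test: MotionTest, hold_frames: int = 10):
        self.local_search = local_search
        self.global_search = global_search
        self.motion_test = motion_test
        self.hold_frames = hold_frames              # the "predetermined number of frames"
        self.position: Optional[Position] = None    # last confirmed lip movement position
        self.frames_without_motion = 0

    def update(self, frame) -> Optional[Position]:
        # Look near the previous lip movement position first; fall back to a full-image search.
        lips = self.local_search(frame, self.position) if self.position is not None else None
        if lips is None:
            lips = self.global_search(frame)

        if lips is not None and self.motion_test(frame, lips):
            self.position = lips                    # moving lips define the new lip movement position
            self.frames_without_motion = 0
        else:
            self.frames_without_motion += 1
            if self.frames_without_motion > self.hold_frames:
                self.position = None                # restart the full-image search on the next frame
        return self.position
```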

In video communication, and especially in video conferencing, there may be several participants at the same site. In that case, because someone yawns, whispers to a neighbor, and so on, several lip movement positions may be detected, and a suitable one must be chosen from among them. As described above, if the previous frame produced a lip movement position, the current frame only searches near that position, so multiple lip movement positions can only arise when the whole image is being searched. There are several strategies for choosing one lip movement position from several: for example, select a front-facing lip position and filter out the side-facing ones, or select a lip position near the middle of the picture and filter out those at the edges. Sometimes several people at the site may be speaking at the same time; if none of the above strategies yields a suitable lip movement position, the center of these speakers' lip movement positions can be computed and output as the lip movement position.
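An assumed sketch of the "pick one near the middle, otherwise fall back to the center of all candidates" idea follows; the 25% center band is an arbitrary tolerance chosen for illustration, not a value from the patent.

```python
from typing import List, Tuple

Position = Tuple[float, float]

def choose_lip_position(candidates: List[Position], frame_width: float) -> Position:
    """Select one lip movement position from several detected candidates."""
    if len(candidates) == 1:
        return candidates[0]
    center_x = frame_width / 2.0
    # Stand-in for "filter out positions at the edge of the picture":
    # prefer the candidate closest to the middle if it is reasonably central.
    nearest = min(candidates, key=lambda p: abs(p[0] - center_x))
    if abs(nearest[0] - center_x) < 0.25 * frame_width:
        return nearest
    # Otherwise output the center of all candidate positions.
    xs = [p[0] for p in candidates]
    ys = [p[1] for p in candidates]
    return (sum(xs) / len(xs), sum(ys) / len(ys))
```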

Step 2: decode the compressed audio stream sent by the sending end to obtain voice information.

The decoding of the compressed audio stream and of the compressed video stream described in Step 1 and Step 2 can be performed at the same time or separately; there is no required order.

Step 3: process the received voice information according to the speaker's position information, so that the direction of the speaker's voice matches the speaker's position.

Processing the voice information according to the speaker's position can be done with existing methods; an example is given below. For the application scenario of Fig. 2, if playback uses two loudspeakers placed on the left and right sides of the television, one sound processing scheme is to adjust the amplitudes of the left and right channels so that the horizontal direction of the sound matches the speaker's position in the picture, i.e. the speaker's position and the sound direction are matched. The adjustment can be described by the following two formulas:

D = (g1 - g2) / (g1 + g2)

C = g1*g1 + g2*g2

In the two formulas above, C is a fixed value, g1 is the amplitude gain of the left channel, g2 is the amplitude gain of the right channel, and D is the speaker's relative horizontal position on the picture, computed from the lip movement position. Let D' be the distance of the lip movement position from the vertical centerline of the picture (positive when the lip movement position is in the left half of the picture, negative in the right half), and let W be the horizontal width of the television picture; then D is computed as:

D = D' / (W / 2)
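The two constraints above determine the channel gains uniquely. For illustration only (not part of the patent), a closed-form solution in Python might look as follows; the constant C = 1.0 and the example picture width are arbitrary choices made here.

```python
import math

def channel_gains(lip_offset: float, picture_width: float, c: float = 1.0):
    """Solve D = (g1 - g2)/(g1 + g2) and C = g1^2 + g2^2 for the left/right
    channel gains, given the lip movement offset D' from the picture centerline
    (positive = left half) and the picture width W."""
    d = lip_offset / (picture_width / 2.0)      # D = D' / (W/2), in [-1, 1]
    d = max(-1.0, min(1.0, d))                  # clamp in case the lips sit at the very edge
    s = math.sqrt(2.0 * c / (1.0 + d * d))      # s = g1 + g2, derived from the two constraints
    g1 = s * (1.0 + d) / 2.0                    # left-channel gain
    g2 = s * (1.0 - d) / 2.0                    # right-channel gain
    return g1, g2

# Example: a speaker a quarter of the way from the center toward the left edge
# of a 1920-pixel-wide picture.
g_left, g_right = channel_gains(lip_offset=240.0, picture_width=1920.0)
```

Scaling the left and right channel samples by g_left and g_right then pulls the phantom image of the voice toward the detected lip position while keeping the total power constant.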

The sound can also be processed according to the sound source position using HRTFs (Head-Related Transfer Functions). Techniques for virtualizing a sound source with HRTFs have been disclosed in the existing technical literature and are not described in detail in the present invention.

In the method provided by the embodiments of the present invention, the speaker's position information is detected and obtained at the place where the sound is replayed, so the receiving terminal does not have to rely on the sending terminal to provide it; after the position information is obtained, the replayed voice information is processed according to it, so that the replayed sound is accurately matched to the speaker's position in the image.

It should be noted that the audio code stream processing method provided by the present invention is not limited to processing audio code streams received from a sending end; it is equally applicable to processing locally stored video and audio code streams.

An embodiment of the present invention also provides a video terminal. As shown in Fig. 4, the video communication terminal contains modules for video decoding, audio decoding, sound source position detection, and sound orientation processing. After the compressed video stream is decoded by the video decoding module, the output is sent both to the television for display and to the sound source position detection module. The sound source position detection module receives the images output by the video decoding module, analyzes them, and extracts the features of the sound source to obtain its position information, which it outputs to the sound orientation processing module. After the compressed audio stream is decoded by the audio decoding module, the output is sent to the sound orientation processing module. The sound orientation processing module processes the received audio according to the sound source position information, so that the direction of the processed sound is consistent with the position of the sound source, and produces left and right audio outputs that are fed to the left and right loudspeakers for playback. For a better playback effect, the video communication terminal can be connected to three or more external loudspeakers, in which case the sound orientation processing module outputs three or more audio streams accordingly.
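For illustration only, the data flow of Fig. 4 might be wired together as in the following Python sketch; the module interfaces (decode, detect, render, show) are assumptions, since the patent does not specify them.

```python
class VideoTerminal:
    """Illustrative wiring of the four modules described above. The decoder,
    detector, orientation-processor and display objects are injected so the
    data flow of Fig. 4 is visible in one place."""

    def __init__(self, video_decoder, audio_decoder, source_detector,
                 orientation_processor, display):
        self.video_decoder = video_decoder
        self.audio_decoder = audio_decoder
        self.source_detector = source_detector
        self.orientation_processor = orientation_processor
        self.display = display

    def process(self, video_bitstream, audio_bitstream):
        # Video path: decode once, feed both the display and the position detector.
        frame = self.video_decoder.decode(video_bitstream)
        self.display.show(frame)
        source_position = self.source_detector.detect(frame)

        # Audio path: decode, then pan/spatialize according to the detected position.
        voice = self.audio_decoder.decode(audio_bitstream)
        left, right = self.orientation_processor.render(voice, source_position)
        return left, right
```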

The purpose of the sound source position detection module in the video terminal is to analyze the images output by the video decoding module and obtain the position information of the sound source in them. Therefore, when the sound source is a speaker, position detection in the video terminal can be implemented by extracting the speaker's lip features, or by detecting the speaker's face or other features, as long as the module can detect the speaker's position in the images output by the video decoding module.

If the speaker's position is detected using the speaker's lips as the feature, the sound source position detection module comprises:

a first receiving module, configured to receive the images containing the speaker sent by the video decoding module;

a feature extraction module, configured to extract the lip features of the speaker in the images received by the first receiving module; and

a position detection module, configured to determine the position of the speaker according to the lip features of the speaker extracted by the feature extraction module.

The lip movement detection method described above can be used to detect the lip movement position.

The sound orientation processing module comprises:

a second receiving module, configured to receive the voice information sent by the audio decoding module and the speaker's position information sent by the position detection module; and

a matching module, configured to match the direction of the replayed sound to the speaker's position according to the voice information received by the second receiving module and the speaker's position information.

In summary, the above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, and so on made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. An audio code stream processing method, characterized by comprising: decoding a compressed video stream to obtain images containing a sound source; if the sound source is a speaker, detecting the position of the speaker's lips in the images according to lip features; detecting a lip movement position according to the detected lip position; if a lip movement position has been detected in the previous frame obtained by decoding the compressed video stream, checking in the current frame whether lips are present near the lip movement position of the previous frame, and if not, detecting the lip movement position over the whole image, and if so, further judging whether the lips are moving, and if they are moving, taking the position of the moving lips as the lip movement position; detecting position information of the sound source according to the detected lip movement position; decoding the compressed audio stream corresponding to the compressed video stream to obtain voice information; and processing the voice information according to the position information of the sound source, so that the direction of the replayed sound matches the position information of the sound source.

2. The method according to claim 1, characterized in that the lip movement position is detected according to the size of the lips at the same position in several consecutive frames and how quickly that size changes.

3. The method according to claim 1, characterized in that, when the voice is replayed through at least two loudspeakers, processing the voice information according to the position information of the sound source specifically comprises: adjusting the amplitudes of the left and right channels of the loudspeakers so that the horizontal direction of the sound matches the speaker's position.

4. The method according to claim 1, characterized in that detecting the position information of the sound source in the images further comprises: when there are multiple lip movement positions in the images, computing the center of the multiple lip movement positions and outputting this center as the speaker's position.

5. The method according to claim 1, characterized in that the lip features include the color of the lips.

6. The method according to claim 5, characterized in that, after the lip position is determined according to the lip color, it is further judged whether the color around the lips is skin color.

7. The method according to claim 5 or 6, wherein, after the lip position is detected, it is further judged whether the lips are moving; if they are moving, the position of the moving lips is taken as the lip movement position; otherwise, a predetermined number of frames is set and the lip movement position is kept unchanged for that many frames after the current frame, and if the lips still have not moved after the predetermined number of frames, the search for the lip movement position over the whole image is restarted.

8. A video terminal, characterized by comprising: a video decoding module, configured to decode a received compressed video stream and output the decoded images; an audio decoding module, configured to decode the compressed audio stream corresponding to the received compressed video stream and output the decoded voice information; a sound source position detection module, configured to receive the images sent by the video decoding module, extract lip features of a speaker serving as the sound source, detect the position of the speaker's lips in the images according to the lip features, detect a lip movement position according to the detected lip position, and, if a lip movement position has been detected in the previous frame obtained by decoding the compressed video stream, check in the current frame whether lips are present near the lip movement position of the previous frame, and if not, detect the lip movement position over the whole image, and if so, further judge whether the lips are moving, and if they are moving, take the position of the moving lips as the lip movement position, and detect the sound source position information according to the detected lip movement position; and a sound orientation processing module, configured to receive the voice information sent by the audio decoding module and the sound source position information sent by the sound source position detection module, and match the direction of the sound to the position of the sound source.

9. The video terminal according to claim 8, characterized in that the sound source position detection module comprises: a first receiving module, configured to receive the images containing the speaker sent by the video decoding module; a feature extraction module, configured to extract the lip features of the speaker in the images received by the first receiving module; and a position detection module, configured to determine the position of the speaker according to the lip features extracted by the feature extraction module.

10. The video terminal according to claim 8, characterized in that the sound orientation processing module comprises: a second receiving module, configured to receive the voice information sent by the audio decoding module and the speaker's position information sent by the position detection module; and a matching module, configured to match the direction of the replayed sound to the speaker's position according to the voice information received by the second receiving module and the speaker's position information.

CNB2006100646565A 2006-12-30 2006-12-30 A kind of video terminal and a kind of audio code stream processing method Expired - Fee Related CN100556151C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100646565A CN100556151C (en) 2006-12-30 2006-12-30 A kind of video terminal and a kind of audio code stream processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100646565A CN100556151C (en) 2006-12-30 2006-12-30 A kind of video terminal and a kind of audio code stream processing method

Publications (2)

Publication Number Publication Date
CN1997161A CN1997161A (en) 2007-07-11
CN100556151C true CN100556151C (en) 2009-10-28

Family

ID=38252055

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100646565A Expired - Fee Related CN100556151C (en) 2006-12-30 2006-12-30 A kind of video terminal and a kind of audio code stream processing method

Country Status (1)

Country Link
CN (1) CN100556151C (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101132516B (en) 2007-09-28 2010-07-28 华为终端有限公司 Method, system for video communication and device used for the same
US8676581B2 (en) * 2010-01-22 2014-03-18 Microsoft Corporation Speech recognition analysis via identification information
CN102170552A (en) * 2010-02-25 2011-08-31 株式会社理光 Video conference system and processing method used therein
CN102186049B (en) * 2011-04-22 2013-03-20 华为终端有限公司 Conference terminal audio signal processing method, conference terminal and video conference system
US9350944B2 (en) * 2012-08-24 2016-05-24 Qualcomm Incorporated Connecting to an onscreen entity
CN104735582B (en) * 2013-12-20 2018-09-07 华为技术有限公司 A kind of audio signal processing method, device and equipment
CN107404682B (en) 2017-08-10 2019-11-05 京东方科技集团股份有限公司 A kind of intelligent earphone
FR3074584A1 (en) * 2017-12-05 2019-06-07 Orange PROCESSING DATA OF A VIDEO SEQUENCE FOR A ZOOM ON A SPEAKER DETECTED IN THE SEQUENCE
CN109413563B (en) * 2018-10-25 2020-07-10 Oppo广东移动通信有限公司 Video sound effect processing method and related product
CN112135226B (en) * 2020-08-11 2022-06-10 广东声音科技有限公司 Y-axis audio reproduction method and Y-axis audio reproduction system
CN113794830A (en) * 2021-08-04 2021-12-14 深圳市沃特沃德信息有限公司 Target track calibration method and device based on video and audio and computer equipment
CN114422935B (en) * 2022-03-16 2022-09-23 荣耀终端有限公司 Audio processing method, terminal and computer readable storage medium
CN115002401B (en) * 2022-08-03 2023-02-10 广州迈聆信息科技有限公司 Information processing method, electronic equipment, conference system and medium
CN117793607A (en) * 2022-09-28 2024-03-29 华为技术有限公司 Playing control method and device

Also Published As

Publication number Publication date
CN1997161A (en) 2007-07-11

Similar Documents

Publication Publication Date Title
CN100556151C (en) A kind of video terminal and a kind of audio code stream processing method
US8115799B2 (en) Method and apparatus for obtaining acoustic source location information and a multimedia communication system
US9641585B2 (en) Automated video editing based on activity in video conference
US8705778B2 (en) Method and apparatus for generating and playing audio signals, and system for processing audio signals
US8730295B2 (en) Audio processing for video conferencing
CN109413359B (en) Camera tracking method, device and equipment
CN103237203B (en) Audio and video synchronization method and system based on mobile terminal
US8064754B2 (en) Method and communication apparatus for reproducing a moving picture, and use in a videoconference system
CN102186049B (en) Conference terminal audio signal processing method, conference terminal and video conference system
US20120327176A1 (en) Video Call Privacy Control
EP2852157B1 (en) Video processing method, apparatus, and system
CN114827517A (en) Projection video conference system and video projection method
WO2015070558A1 (en) Video shooting control method and device
CN103096020B (en) video conference system, video conference device and method thereof
US20080273116A1 (en) Method of Receiving a Multimedia Signal Comprising Audio and Video Frames
US20120169828A1 (en) Video telephony method and apparatus of mobile terminal
CN105959614A (en) Method and system for processing video conference
CN102067595B (en) Audio-visual communication apparatus and communication method of same
US8373740B2 (en) Method and apparatus for video conferencing in mobile terminal
CN102202206A (en) Communication device
US20190306462A1 (en) Image processing apparatus, videoconference system, image processing method, and recording medium
EP3884461B1 (en) Selective distortion or deformation correction in images from a camera with a wide angle lens
US9118803B2 (en) Video conferencing system
CN114793287B (en) A method for monitoring and broadcasting audio and video content based on dual-direction broadcasting
TW202228446A (en) Sound source tracking system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20171218

Address after: No. 2, F District, Enping City, Jiangmen, Guangdong

Patentee after: FUGUE ACOUSTICS TECHNOLOGY CO., LTD.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: Huawei Technologies Co., Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20091028

Termination date: 20181230
