CN118413802A - Spatial audio rendering method and device
- Publication number: CN118413802A
- Application number: CN202310108164.5A
- Authority
- CN
- China
- Prior art keywords
- head movement
- audio
- movement position
- rendering
- historical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
- H04S7/303—Tracking of listener position or orientation
- H04S7/304—For headphones
- H04S7/305—Electronic adaptation of stereophonic audio signals to reverberation of the listening space
- H04S7/306—For headphones
- H04S2420/00—Techniques used stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/01—Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
Abstract
The present application provides a spatial audio rendering method and device. The method includes: obtaining delay information of an audio content provider; obtaining a predicted head movement position based on the delay information; and sending first rendering information, which includes the predicted head movement position, to the audio content provider. The present application can compensate for the delay so that the rendering effect matches the user's head position exactly, achieving a realistic, immersive listening experience.
Description
Technical Field
The present application relates to audio processing technology, and in particular to a spatial audio rendering method and device.
Background
In real three-dimensional space, sound has two perceptual characteristics: a sense of direction and a sense of space. Spatial audio technology generally refers to technology that reproduces an immersive audio experience simulating these two characteristics on head-worn playback devices such as headphones. A person's judgment of the direction of a sound is mainly affected by the interaural time difference, the interaural level difference, the body filtering effect learned while growing up, and head movement. The first three of these factors can be jointly expressed as Head Related Transfer Functions (HRTFs); head movement in any direction greatly helps us localize a sound source. An indoor sound field consists of the direct sound, early reflections (Early Reflection, ER), and late reverberation (Late Reverb, LR); a person's sense of the space around a sound is built mainly from the ER and LR.
Using spatial audio technology, sound sources in different formats, such as stereo, multi-channel surround sound, and specially produced multi-object sources, can be rendered into binaural signals, providing a realistic, immersive listening experience on headphones. Head tracking can currently be performed with sensors such as the gyroscope in a pair of headphones, and the corresponding rotation or displacement can then be reflected in the spatial rendering of the sound source.
However, spatial audio rendering usually requires substantial computing power, while head tracking requires low latency. If spatial audio rendering is placed on the audio content provider (for example, a mobile phone, tablet, or PC), the computing-power requirement can be met, but higher latency is introduced; especially in head-movement scenarios, this degrades the overall listening experience.
Summary of the Invention
The present application provides a spatial audio rendering method and device that compensate for the delay, so that the rendering effect matches the user's head position exactly, achieving a realistic, immersive listening experience.
In a first aspect, the present application provides a spatial audio rendering method, including: obtaining delay information of an audio content provider; obtaining a predicted head movement position based on the delay information; and sending first rendering information to the audio content provider, the first rendering information including the predicted head movement position.
The present application predicts the user's head movement to obtain a predicted head movement position and sends it to the audio content provider, so that the provider can render audio frames based on the predicted position; the rendered frames then deliver the listening effect corresponding to that position. When a rendered frame reaches the audio signal playback end, the prediction horizon offsets the delay introduced by rendering and link transmission, so that by the time the frame is played it matches the user's actual head position, achieving the realistic, immersive listening experience described above.
The delay information indicates the delay situation of the audio content provider and can take the following forms:
Case 1: the delay information includes a delay value sent by the audio content provider.
After determining the delay caused by its own transmission link, rendering processing, and so on, the audio content provider can encapsulate that delay in the first information sent to the audio signal playback end. The playback end can then extract the provider's delay directly from the first information it receives.
Case 2: the delay information includes a second historical head movement position sent by the audio content provider.
The audio signal playback end (for example, a pair of headphones) can periodically detect, through sensors such as a gyroscope or gravity sensor, the current position of the head of the user wearing it (referred to herein as the measured head movement position), and send each measured position to the audio content provider.
After receiving a measured head movement position, the audio content provider does not act on it immediately; it continues to process audio frames along its established pipeline. When processing is complete and the first information is about to be sent to the playback end, the measured head movement position is carried in the first information. Because time has passed by then, that measured position has become a historical head movement position (referred to herein as the second historical head movement position). It should be understood that, numerically, the measured head movement position and the second historical head movement position are identical.
Based on this, the audio signal playback end can take the difference between the time at which it receives the second historical head movement position and the time at which it sent the corresponding measured head movement position (that is, the time difference between receiving and sending the same head position), thereby indirectly obtaining the delay of the audio content provider.
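As an illustration of this second case, here is a minimal sketch of how the playback end could recover the provider's delay by matching the echoed head position against its own send log; the class and method names are hypothetical, not taken from the patent:

```python
import time

class DelayEstimator:
    """Recovers the provider-side delay by matching an echoed head
    position against the time at which it was originally sent."""

    def __init__(self):
        self._sent = {}  # head position (as a tuple) -> send timestamp

    def on_send(self, head_position):
        # Record when this measured head position was sent to the
        # audio content provider.
        self._sent[tuple(head_position)] = time.monotonic()

    def on_receive(self, echoed_position):
        # The provider echoes the position back unchanged (the "second
        # historical head movement position"), so an exact lookup of
        # the original send time is possible.
        t_sent = self._sent.pop(tuple(echoed_position), None)
        if t_sent is None:
            return None  # no matching record for this sample
        return time.monotonic() - t_sent  # provider delay in seconds
```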
Case 3: the delay information includes a second historical head movement position and a first uncertainty coefficient sent by the audio content provider. For the second historical head movement position, refer to the description of Case 2 above, which is not repeated here.
The first uncertainty coefficient is an important parameter of the Kalman filter and helps improve the accuracy of the prediction results.
To let the user enjoy a realistic, immersive listening experience through the audio signal playback end (for example, headphones), spatial audio technology can be used to render sound sources in different formats, such as stereo, multi-channel surround sound, and specially produced multi-object sources, into binaural signals. In the real world, when our head rotates or moves, the absolute position of the sound source itself does not change, but the direction of the source relative to the head does. For example, if a guitar is playing in front of you and you turn to the right, the guitar's sound moves, relatively, to your left. As another example, with a guitar on the left of a stage and a saxophone on the right, when you move to the side of the stage the sounds of the guitar and saxophone merge and come from the same direction. The realistic, immersive listening experience referred to here is exactly the experience of listening to the guitar or the stage performance just described: as the position of the user's head changes (including but not limited to up/down/left/right displacement and head rotation), sound from the same audio source is rendered into different listening effects in the user's ears. This kind of rendering is rendering with the head-movement effect: head tracking is performed through the sensors in the headphones, and the corresponding rotation or displacement is then reflected in the spatial rendering of the sound source.
At the same time, although the audio content provider can supply the computing power needed by the rendering algorithm with the head-movement effect described above, its own characteristics introduce higher latency, which scrambles important spatial information such as direction. For example, if an object's sound image is designed to sit directly ahead in the listening space and the user turns their head, the sound image first swings to directly in front of the user's face and only then returns to the front of the space, with an obvious delayed, "damped" feel that harms the listening experience.
To solve the problem caused by this delay, the audio signal playback end can predict the user's head movement position and send the predicted position to the audio content provider, so that the provider can render audio frames based on it; the rendered frames then deliver the listening effect corresponding to the predicted position. When a rendered frame reaches the playback end, the prediction horizon offsets the delay introduced by rendering and link transmission, so that by the time the frame is played it matches the user's actual head position, achieving the realistic, immersive listening experience described above.
In the present application, the audio signal playback end can first obtain a first historical head movement position corresponding to the delay information, then obtain the measured head movement position through its sensors, and then obtain the predicted head movement position from the first historical head movement position and the measured head movement position.
Corresponding to the three forms of delay information above, the audio signal playback end can obtain the first historical head movement position in three ways, namely:
1. When the delay information includes a delay value sent by the audio content provider (Case 1 above), the first historical head movement position corresponding to the delay is extracted from a cache in which multiple delay-to-historical-position correspondences are pre-stored.
As described above, the audio signal playback end can periodically detect, through its sensors, the measured head movement position of the user wearing it. For later use, the playback end can store the correspondence between each measured head movement position and its detection time. When the delay value is obtained, the measured head position of that earlier moment (referred to herein as the first historical head movement position) can be found by looking it up in this correspondence.
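A minimal sketch of such a cache, assuming poses are timestamped on arrival and the lookup picks the stored pose closest to the given delay in the past; the names and buffer size are illustrative:

```python
import bisect
import time
from collections import deque

class HeadPoseCache:
    """Buffer of (timestamp, head position) pairs so that the pose
    measured a given delay ago can be looked up later."""

    def __init__(self, maxlen=512):
        self._samples = deque(maxlen=maxlen)  # kept ordered by time

    def record(self, head_position):
        self._samples.append((time.monotonic(), head_position))

    def pose_at_delay(self, delay_s):
        """Return the stored pose closest to `delay_s` seconds ago,
        i.e. the first historical head movement position."""
        if not self._samples:
            return None
        target = time.monotonic() - delay_s
        times = [t for t, _ in self._samples]
        i = bisect.bisect_left(times, target)
        i = min(i, len(times) - 1)
        return self._samples[i][1]
```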
2. When the delay information includes the second historical head movement position sent by the audio content provider (Case 2 above), the second historical head movement position is used as the first historical head movement position.
As described for Case 2 above, the audio content provider encapsulates the second historical head movement position directly in the first information sent to the audio signal playback end. The playback end thus directly obtains the measured head position of the moment corresponding to the delay (the second historical head movement position), and can therefore directly take it as the first historical head movement position.
3. When the delay information includes the second historical head movement position and the first uncertainty coefficient sent by the audio content provider (Case 3 above), the second historical head movement position is used as the first historical head movement position.
As in item 2 above, in the third form of delay information the audio signal playback end can likewise directly take the second historical head movement position as the first historical head movement position.
The audio signal playback end can obtain the measured head movement position through its sensors; this is the most recently detected head position of the user.
In the present application, obtaining the predicted head movement position from the first historical head movement position and the measured head movement position can use either of the following two algorithms:
First algorithm: building on items 1 and 2 above, the audio signal playback end can obtain the difference between the first historical head movement position and the measured head movement position; obtain the head movement rate, derived from the N previously measured head positions, N > 1; and obtain the predicted head movement position from the difference and the head movement rate.
The difference between the first historical head movement position and the measured head movement position can be the distance between them. For example, in three-dimensional space a head position can be represented with xyz coordinates, so the difference can be the distance between the coordinate values of the two positions. Head positions can also be represented in other ways, which the present application does not specifically limit.
As described above, the audio signal playback end can periodically detect the measured head position of the user wearing it through its sensors, so it can derive the head movement rate from the N previously measured head positions, where the N positions can be the N most recent ones counting back from the current measured position.
From the difference and the head movement rate, the audio signal playback end can predict the head movement change, using means such as cubic spline interpolation, and then add that change to the current measured head position to obtain the predicted head movement position.
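A minimal sketch of this first algorithm, using a per-axis cubic spline fitted to the N most recent samples and evaluated one delay ahead; the use of scipy, the function name, and the sample spacing are illustrative assumptions, not taken from the patent:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def predict_head_position(timestamps, positions, delay_s):
    """Extrapolate the head pose `delay_s` seconds past the newest
    sample.

    timestamps: shape (N,), seconds, strictly increasing, N > 1
    positions:  shape (N, 3), xyz head positions
    """
    spline = CubicSpline(timestamps, positions, axis=0)
    # CubicSpline extrapolates beyond the last knot by default, which
    # is what turns interpolation into a short-horizon prediction.
    return spline(timestamps[-1] + delay_s)

# Example: four samples 10 ms apart, predicted 30 ms ahead.
t = np.array([0.00, 0.01, 0.02, 0.03])
p = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0],
              [1.1, 0.0, 0.0], [1.8, 0.0, 0.0]])
print(predict_head_position(t, p, 0.03))
```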
Second algorithm: building on item 3 above, the audio signal playback end obtains the head movement rate, derived from the N previously measured head positions, N > 1; obtains a second uncertainty coefficient from the head movement rate and the first uncertainty coefficient; and inputs the second uncertainty coefficient, the first historical head movement position, and the measured current head movement position into a Kalman filter to obtain the predicted head movement position.
The head movement rate can be obtained as described above.
Combining the head movement rate with the first uncertainty coefficient yields the second uncertainty coefficient. The second uncertainty coefficient, the first historical head movement position, and the measured current head movement position are input into a Kalman filter. In the Kalman filter, the Kalman gain can be determined from the second uncertainty coefficient and the estimate uncertainty of the previous iteration, and taken as the latest gain. The Kalman filter then outputs the predicted head movement position.
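A minimal per-axis sketch of this second algorithm, assuming a scalar Kalman step in which the second uncertainty coefficient acts as the process noise; the variable names and the scalar simplification are assumptions, not the patent's exact formulation:

```python
def kalman_step(x_est, p_est, rate, z_meas, q2, r_meas, dt):
    """One scalar Kalman iteration for a single coordinate axis.

    x_est:  previous position estimate
    p_est:  estimate uncertainty from the previous iteration
    rate:   head movement rate derived from the last N samples
    z_meas: newly measured head position
    q2:     second uncertainty coefficient (process noise)
    r_meas: sensor measurement noise
    dt:     prediction horizon, i.e. the provider delay
    """
    # Predict: advance the estimate by the observed movement rate.
    x_pred = x_est + rate * dt
    p_pred = p_est + q2

    # Update: the Kalman gain weighs prediction against measurement
    # according to their relative uncertainties.
    k = p_pred / (p_pred + r_meas)
    x_new = x_pred + k * (z_meas - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new
```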
It should be noted that, besides the two methods above, the present application can also use other algorithms to obtain the predicted head movement position from the first historical head movement position and the measured head movement position; this is not specifically limited.
The audio signal playback end sends the first rendering information, which includes the predicted head movement position, to the audio content provider, so that the provider can render audio frames with the head-movement effect according to the predicted position. The rendered frames deliver the listening effect corresponding to the predicted position. When a rendered frame reaches the playback end, the prediction horizon offsets the delay introduced by rendering and link transmission, so that by the time the frame is played it matches the user's actual head position, achieving the realistic, immersive listening experience described above.
In a possible implementation, when the first rendering mode is adopted, the audio content provider obtains the current frame. When the audio format of the audio source is a non-stereo format, the provider obtains the predicted head movement position and renders the current frame according to it to obtain a first binaural signal. When the audio format of the audio source is a stereo format, the provider does not perform spatial rendering on the source. The provider sends first information, which includes a first audio stream, the audio format of the source, and the delay information, to the audio signal playback end. When the audio format indicates a stereo format, the playback end obtains the measured head movement position through its sensors and renders the current frame according to that position to obtain a second binaural signal. The playback end plays audio according to a target binaural signal, which includes the first binaural signal or the second binaural signal.
The first rendering mode can be a non-low-latency link mode, in which the audio content provider renders audio frames with the head-movement effect based on the predicted head movement position. The user can choose whether to select the first rendering mode by operating an application (APP) installed on the provider. Correspondingly, the present application also includes a second rendering mode, in which the provider renders audio frames without the head-movement effect; this process is described below. Likewise, the user can choose whether to select the second rendering mode through the APP installed on the provider. For the user's operations, refer to the embodiments below.
The current frame can be one of the frames of the audio source, typically the audio frame that the audio content provider is currently processing.
The audio format of the audio source is either a stereo format or a non-stereo format, where the non-stereo format can include but is not limited to multi-channel, multi-object formats such as 5.1, 7.1, and 3DA.
In the present application, the audio content provider can perform high-compute rendering on audio in multi-channel, multi-object (non-stereo) formats such as 5.1, 7.1, and 3DA, making full use of its computing-power advantage. Accordingly, when the audio format of the source is a non-stereo format, the provider renders the audio frames with the head-movement effect and therefore needs to obtain the predicted head movement position.
The audio content provider can obtain the predicted head movement position from rendering information previously received from the audio signal playback end; optionally, this can be the most recently received rendering information, which improves the accuracy of the predicted position. For how the predicted position is obtained, refer to the embodiment shown in FIG. 3, which is not repeated here.
In the present application, the audio content provider can render the direct sound part, the early reflection (ER) part, and the late reverberation (LR) part of the current frame with the head-movement effect according to the predicted head movement position, to obtain the first binaural signal.
Optionally, the direct sound with the head-movement effect and the reverberation with the head-movement effect (obtained by mixing the ER and LR) are each rendered through audio effects processors and then mixed; the mixed output is rendered through audio effects processors again and delivered in the form of the first binaural signal. The audio effects processors include but are not limited to equalizers, dynamic compressors, and low-frequency enhancers.
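A structural sketch of this render-and-mix chain; the effect processors are passed in as callables and the stem shapes are assumptions, since the patent names the effect types but not their parameters:

```python
import numpy as np

def apply_effects(signal, effects):
    # Chain of audio effect processors (e.g., equalizer, dynamic
    # compressor, low-frequency enhancer) applied in sequence.
    for fx in effects:
        signal = fx(signal)
    return signal

def render_first_binaural(direct, er, lr, effects=()):
    """direct, er, lr: (num_samples, 2) binaural stems, each already
    rendered with the head-movement effect at the predicted pose."""
    reverb = er + lr                        # mix ER and LR into reverb
    direct = apply_effects(direct, effects)
    reverb = apply_effects(reverb, effects)
    mixed = direct + reverb                 # mix the two branches
    return apply_effects(mixed, effects)    # final effects pass

# Example with 1 s of silent 48 kHz stems and no effects configured.
stems = [np.zeros((48000, 2)) for _ in range(3)]
out = render_first_binaural(*stems)
```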
In the present application, the audio content provider does not render stereo-format audio sources; rendering with the head-movement effect is left to the audio signal playback end, so the provider's compute is reserved for high-compute rendering of multi-channel, multi-object (non-stereo) audio such as 5.1, 7.1, and 3DA, making full use of its computing-power advantage. Accordingly, when the audio format of the source is a stereo format, the provider simply encapsulates the audio source itself in the first audio stream.
As described above, when the audio format indicates a non-stereo format, the first audio stream includes the first binaural signal rendered by the audio content provider; when the audio format indicates a stereo format, the first audio stream includes the audio source.
A stereo-format audio source is rendered with the head-movement position by the audio signal playback end. Because the playback end's IMU data transmission link has low latency, it can obtain the current head position (referred to herein as the measured head movement position) through its sensors in real time.
The audio signal playback end can render the direct sound part of the current frame with the head-movement effect according to the measured head movement position, and render the ER and LR of the current frame without the head-movement effect, to obtain the second binaural signal.
Optionally, the direct sound with the head-movement effect and the reverberation without the head-movement effect (obtained by mixing the ER and LR) are each rendered through audio effects processors and then mixed; the mixed output is rendered through audio effects processors again and mixed once more to obtain the second binaural signal. The audio effects processors include but are not limited to equalizers, dynamic compressors, and low-frequency enhancers.
After the above steps, the audio signal playback end obtains the target binaural signal, which can be the first binaural signal (for a non-stereo source) or the second binaural signal (for a stereo source).
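The format-based split of work between the two ends in the first rendering mode could be sketched as follows; `render_with_head_motion` is a stub standing in for the actual spatial renderer, and all names are hypothetical:

```python
def render_with_head_motion(frame, pose):
    # Stub for the actual spatial renderer (direct/ER/LR + HRTF).
    return frame

def provider_process_frame(frame, audio_format, predicted_pose):
    """Audio content provider side, first rendering mode."""
    if audio_format != "stereo":
        # Non-stereo (e.g., 5.1/7.1/3DA): high-compute rendering with
        # the head-movement effect at the predicted head position.
        return render_with_head_motion(frame, predicted_pose)
    return frame  # stereo: pass through; the playback end renders it

def playback_process_frame(stream, audio_format, measured_pose):
    """Audio signal playback end, first rendering mode."""
    if audio_format == "stereo":
        # Low-compute local rendering at the freshly measured pose,
        # yielding the second binaural signal.
        return render_with_head_motion(stream, measured_pose)
    return stream  # already the first binaural signal
```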
In a possible implementation, when the second rendering mode is adopted, the audio content provider obtains the current frame; renders it without the head-movement effect to obtain a second binaural signal (distinct from the second binaural signal above; to distinguish it, it is referred to below as the third binaural signal); and sends second information, which includes a second audio stream containing the third binaural signal, to the audio signal playback end. The playback end obtains the measured head movement position through its sensors and renders the current frame (that is, the audio frame after the provider's rendering without the head-movement effect) according to the measured position to obtain a fourth binaural signal. The playback end plays audio according to the target binaural signal, which includes the fourth binaural signal.
In the second rendering mode, the audio content provider renders audio frames without the head-movement effect, which still makes full use of its computing-power advantage; omitting the head-movement effect shortens the rendering delay, and the low-compute rendering with the head-movement effect is instead performed by the audio signal playback end. Compared with the first rendering mode, the latency is lower, but the rendering quality of the head-movement effect is worse.
In a second aspect, the present application provides a spatial audio rendering device, including: an acquisition module for obtaining delay information of an audio content provider; a prediction module for obtaining a predicted head movement position based on the delay information; and a sending module for sending first rendering information, which includes the predicted head movement position, to the audio content provider.
In a possible implementation, the prediction module is specifically configured to obtain a first historical head movement position corresponding to the delay information; obtain the measured head movement position through a sensor; and obtain the predicted head movement position from the first historical head movement position and the measured head movement position.
In a possible implementation, the delay information includes a delay value sent by the audio content provider; the prediction module is specifically configured to extract the first historical head movement position corresponding to the delay from a cache in which multiple delay-to-historical-position correspondences are pre-stored.
In a possible implementation, the delay information includes a second historical head movement position sent by the audio content provider, the second historical head movement position being the measured head movement position at the time second rendering information was sent, the second rendering information being sent earlier than the first rendering information; the prediction module is specifically configured to use the second historical head movement position as the first historical head movement position.
In a possible implementation, the prediction module is specifically configured to obtain the difference between the first historical head movement position and the measured head movement position; obtain the head movement rate, derived from the N previously measured head positions, N > 1; and obtain the predicted head movement position from the difference and the head movement rate.
In a possible implementation, the delay information includes a second historical head movement position and a first uncertainty coefficient sent by the audio content provider, the second historical head movement position being the measured head movement position at the time second rendering information was sent, the first uncertainty coefficient coming from the second rendering information, the second rendering information being sent earlier than the first rendering information; the prediction module is specifically configured to use the second historical head movement position as the first historical head movement position.
In a possible implementation, the prediction module is specifically configured to obtain the head movement rate, derived from the N previously measured head positions, N > 1; obtain a second uncertainty coefficient from the head movement rate and the first uncertainty coefficient; and input the second uncertainty coefficient, the first historical head movement position, and the measured current head movement position into a Kalman filter to obtain the predicted head movement position.
In a possible implementation, the device further includes: a receiving module for receiving first information sent by the audio content provider, the first information including an audio format, a first audio stream, and the delay information, where the audio format indicates whether the audio source is in a stereo format or a non-stereo format; when the audio format indicates the non-stereo format, the first audio stream includes a first binaural signal rendered by the audio content provider; when the audio format indicates the stereo format, the first audio stream includes the audio source; a rendering module for obtaining, when the audio format indicates the stereo format, the measured head movement position through a sensor, and rendering the current frame (one of the frames of the audio source) according to the measured position to obtain a second binaural signal; and a playback module for playing audio according to a target binaural signal, the target binaural signal including the first binaural signal or the second binaural signal.
In a possible implementation, the rendering module is specifically configured to render the direct sound part of the current frame with the head-movement effect according to the measured head movement position, and render the early reflections (ER) and late reverberation (LR) of the current frame without the head-movement effect, to obtain the second binaural signal.
In a third aspect, the present application provides a spatial audio rendering device, including: an acquisition module for obtaining, when a first rendering mode is adopted, a current frame, the current frame being one of the frames of an audio source, and for obtaining a predicted head movement position when the audio format of the audio source is a non-stereo format; a rendering module for rendering the current frame according to the predicted head movement position to obtain a first binaural signal; and a sending module for sending first information to an audio signal playback end, the first information including a first audio stream, the audio format of the audio source, and delay information, the first audio stream including the first binaural signal.
In a possible implementation, the rendering module is specifically configured to render the direct sound part, the early reflection (ER) part, and the late reverberation (LR) part of the current frame with the head-movement effect according to the predicted head movement position, to obtain the first binaural signal.
In a possible implementation, the acquisition module is specifically configured to obtain the predicted head movement position from rendering information sent by the audio signal playback end.
In a possible implementation, the delay information includes a delay value.
In a possible implementation, the rendering information further includes the measured head movement position at the time the audio signal playback end sent the rendering information; correspondingly, the delay information includes that measured head movement position.
In a possible implementation, the rendering information further includes the measured head movement position at the time the audio signal playback end sent the rendering information and a first uncertainty coefficient; correspondingly, the delay information includes the measured head movement position and the first uncertainty coefficient.
In a possible implementation, when the audio format of the audio source is a stereo format, the first audio stream includes the audio source.
In a possible implementation, the acquisition module is further configured to obtain the current frame when a second rendering mode is adopted; the rendering module is further configured to render the current frame without the head-movement effect to obtain a second binaural signal; and the sending module is further configured to send second information to the audio signal playback end, the second information including a second audio stream, the second audio stream including the second binaural signal.
In a fourth aspect, the present application provides an audio signal playback device, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the method performed by the audio signal playback end in any one of the first aspect.
In a fifth aspect, the present application provides an audio content providing device, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the method performed by the audio content provider in any one of the first aspect.
In a sixth aspect, the present application provides a computer-readable storage medium including a computer program that, when executed on a computer, causes the computer to perform the method of any one of the first aspect.
In a seventh aspect, the present application provides a computer program product including computer program code that, when run on a computer, causes the computer to perform the method of any one of the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an application scenario of the present application;
FIG. 2 is a schematic structural diagram of the audio content provider 200 of the present application;
FIG. 3 is a flowchart of a process 300 of the spatial audio rendering method of the present application;
FIG. 4 is a flowchart of a process 400 of the spatial audio rendering method of the present application;
FIG. 5 is a schematic diagram of the overall flow of the spatial audio rendering method of the present application;
FIG. 6 is a schematic diagram of a specific flow of the spatial audio rendering method of the present application;
FIG. 7 is a schematic flowchart of a head movement position prediction algorithm;
FIG. 8a and FIG. 8b are schematic flowcharts of a head movement position prediction algorithm;
FIG. 9a and FIG. 9b are schematic flowcharts of a head movement position prediction algorithm;
FIG. 10 is a schematic diagram of the framework flow of the present application;
FIG. 11 is a schematic flowchart of the low-latency mode of the present application;
FIG. 12 is a schematic flowchart of the non-low-latency mode of the present application;
FIG. 13 is a schematic structural diagram of an example of a spatial audio rendering device 1300 of the present application;
FIG. 14 is a schematic structural diagram of an example of a spatial audio rendering device 1400 of the present application.
Detailed Description
To make the purpose, technical solution, and advantages of the present application clearer, the technical solution of the present application is described clearly and completely below with reference to the accompanying drawings. The described embodiments are clearly only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The terms "first", "second", and the like in the specification, claims, and drawings of the present application are used only to distinguish what they describe and are not to be understood as indicating or implying relative importance or order. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion, for example, inclusion of a series of steps or units. A method, system, product, or device is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
It should be understood that in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or similar expressions means any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
In real three-dimensional space, sound has two perceptual characteristics: a sense of direction and a sense of space. Spatial audio technology generally refers to technology that reproduces an immersive audio experience simulating these two characteristics on head-worn playback devices such as headphones. A person's judgment of the direction of a sound is mainly affected by the interaural time difference, the interaural level difference, the body filtering effect learned while growing up, and head movement. The first three of these factors can be jointly expressed as Head Related Transfer Functions (HRTFs); head movement in any direction greatly helps us localize a sound source. An indoor sound field consists of the direct sound, early reflections (Early Reflection, ER), and late reverberation (Late Reverb, LR); a person's sense of the space around a sound is built mainly from the ER and LR.
Using spatial audio technology, sound sources in different formats, such as stereo, multi-channel surround sound, and specially produced multi-object sources, can be rendered into binaural signals, providing a realistic, immersive listening experience on headphones.
When our head rotates or moves, the absolute position of the sound source itself does not change, but the direction of the source relative to the head does. For example, if a guitar is playing in front of you and you turn to the right, the guitar's sound moves, relatively, to your left. As another example, with a guitar on the left of a stage and a saxophone on the right, when you move to the side of the stage the sounds of the guitar and saxophone merge and come from the same direction. Head tracking can currently be performed with sensors such as the gyroscope in a pair of headphones, and the corresponding rotation or displacement can then be reflected in the spatial rendering of the sound source.
However, spatial audio rendering usually requires substantial computing power, and head tracking requires low latency. If spatial audio rendering is placed on the audio content provider (for example, a mobile phone, tablet, or PC), the computing-power requirement can be met, but higher latency is introduced. Especially in head-movement scenarios, for audio sources in multi-channel, multi-object formats this scrambles important spatial information such as direction, lowering the overall rendering quality of the spatial audio and degrading the overall listening experience.
To solve the above technical problems, the present application provides a spatial audio rendering method and device; the embodiments below describe the technical solution of the present application.
FIG. 1 is a schematic diagram of an application scenario of the present application. As shown in FIG. 1, the scenario includes an audio content provider and an audio signal playback end; interconnection between them includes but is not limited to Bluetooth. The audio content provider can include but is not limited to mobile phones, tablets, laptops, and desktop computers, which provide higher computing power but may have higher latency. The audio signal playback end can include but is not limited to true wireless stereo (TWS) earbuds, wireless over-ear headphones, and wireless neckband headphones, which have lower latency but provide less computing power; in addition, the playback device is equipped with sensors such as a gyroscope to collect the user's head movement information.
In the present application, the rendering algorithm consists of two parts. One part is deployed on the audio content provider and performs high-compute rendering of multi-channel, multi-object audio such as 5.1, 7.1, and 3DA, making full use of the provider's computing-power advantage. The other part is deployed on the audio signal playback end and performs low-compute rendering of stereo audio; its inertial measurement unit (IMU) data transmission link has low latency, and it can also run independently, so it does not depend on any particular audio content provider and can satisfy users connecting to any kind of provider device.
On this basis, the present application further provides a head movement position prediction algorithm that predicts the user's future head position; the predicted position assists the rendering algorithm on the audio content provider and can reduce the impact of the provider's latency.
It should be noted that the application scenario shown in FIG. 1 is an example and should not be construed as limiting the present application in any way; the present application also does not specifically limit the application scenarios of the spatial audio rendering method.
FIG. 2 is a schematic structural diagram of the audio content provider 200 of the present application. It should be understood that the audio content provider 200 shown in FIG. 2 is only an example; it may have more or fewer components than shown, may combine two or more components, or may have a different component configuration. The various components shown in FIG. 2 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application-specific integrated circuits.
The audio content provider 200 may include: a processor 210, an external memory interface 220, an internal memory 221, a universal serial bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 270A, a receiver 270B, a microphone 270C, a headset jack 270D, a sensor module 280, keys 290, a motor 291, an indicator 292, a camera 293, a display screen 294, a subscriber identification module (SIM) card interface 295, and the like. The sensor module 280 may include a pressure sensor 280A, a gyroscope sensor 280B, a barometric pressure sensor 280C, a magnetic sensor 280D, an acceleration sensor 280E, a distance sensor 280F, a proximity light sensor 280G, a fingerprint sensor 280H, a temperature sensor 280J, a touch sensor 280K, an ambient light sensor 280L, a bone conduction sensor 280M, and the like.
The processor 210 may include one or more processing units. For example, the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). Different processing units may be independent devices or may be integrated into one or more processors.
The controller may be the nerve center and command center of the audio content provider 200. The controller may generate operation control signals according to instruction operation codes and timing signals, controlling instruction fetching and execution.
A memory may also be provided in the processor 210 for storing instructions and data. In some embodiments, the memory in the processor 210 is a cache. This memory may hold instructions or data that the processor 210 has just used or uses cyclically. If the processor 210 needs the instructions or data again, they can be called directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 210, and thus improves system efficiency.
In some embodiments, the processor 210 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
The I2C interface is a bidirectional synchronous serial bus comprising a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 210 may contain multiple groups of I2C buses and may be coupled to the touch sensor 280K, a charger, a flash, the camera 293, and the like through different I2C bus interfaces. For example, the processor 210 may be coupled to the touch sensor 280K through an I2C interface so that the two communicate over the I2C bus, implementing the touch function of the audio content provider 200.
The I2S interface may be used for audio communication. In some embodiments, the processor 210 may contain multiple groups of I2S buses and may be coupled to the audio module 270 via an I2S bus to enable communication between them. In some embodiments, the audio module 270 may pass audio signals to the wireless communication module 260 via the I2S interface, implementing the function of answering calls through a Bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing, and encoding analog signals. In some embodiments, the audio module 270 and the wireless communication module 260 may be coupled via a PCM bus interface. In some embodiments, the audio module 270 may also pass audio signals to the wireless communication module 260 via the PCM interface, implementing the function of answering calls through a Bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communication. The bus may be bidirectional and converts the data to be transmitted between serial and parallel communication. In some embodiments, the UART interface is typically used to connect the processor 210 with the wireless communication module 260. For example, the processor 210 communicates with the Bluetooth module in the wireless communication module 260 through the UART interface to implement the Bluetooth function. In some embodiments, the audio module 270 may pass audio signals to the wireless communication module 260 via the UART interface, implementing the function of playing music through a Bluetooth headset.
The MIPI interface may be used to connect the processor 210 with peripheral devices such as the display screen 294 and the camera 293. MIPI interfaces include the camera serial interface (CSI), the display serial interface (DSI), and so on. In some embodiments, the processor 210 and the camera 293 communicate through a CSI interface to implement the shooting function of the audio content provider 200, and the processor 210 and the display screen 294 communicate through a DSI interface to implement its display function.
The GPIO interface may be configured by software, either as a control signal or as a data signal. In some embodiments, the GPIO interface may be used to connect the processor 210 with the camera 293, the display screen 294, the wireless communication module 260, the audio module 270, the sensor module 280, and so on. The GPIO interface may also be configured as an I2C, I2S, UART, or MIPI interface, etc.
The USB interface 230 is an interface conforming to the USB standard specification and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 230 may be used to connect a charger to charge the audio content provider 200, to transfer data between the audio content provider 200 and peripheral devices, or to connect headphones and play audio through them. The interface may also be used to connect other user devices, such as AR devices.
It can be understood that the interface connection relationships between the modules illustrated in the embodiments of the present application are only schematic and do not constitute a structural limitation on the audio content provider 200. In other embodiments of the present application, the audio content provider 200 may also adopt interface connection methods different from those in the above embodiments, or a combination of multiple interface connection methods.
The charging management module 240 is configured to receive charging input from a charger, which may be a wireless or wired charger. In some wired charging embodiments, the charging management module 240 may receive charging input from a wired charger through the USB interface 230. In some wireless charging embodiments, it may receive wireless charging input through a wireless charging coil of the audio content provider 200. While charging the battery 242, the charging management module 240 may also supply power to the user device through the power management module 241.
The power management module 241 is configured to connect the battery 242 and the charging management module 240 with the processor 210. It receives input from the battery 242 and/or the charging management module 240 and supplies power to the processor 210, the internal memory 221, the external memory, the display screen 294, the camera 293, the wireless communication module 260, and so on. The power management module 241 may also be used to monitor parameters such as battery capacity, battery cycle count, and battery health (leakage, impedance). In some other embodiments, the power management module 241 may be provided in the processor 210; in still others, the power management module 241 and the charging management module 240 may be provided in the same device.
The wireless communication function of the audio content provider 200 may be implemented through the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, the modem processor, the baseband processor, and so on.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the audio content provider 200 may be used to cover one or more communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization; for example, the antenna 1 may be multiplexed as a diversity antenna for a wireless local area network. In other embodiments, the antennas may be used in combination with a tuning switch.
The mobile communication module 250 may provide wireless communication solutions applied to the audio content provider 200, including 2G/3G/4G/5G. The mobile communication module 250 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), and the like. It may receive electromagnetic waves via the antenna 1, filter and amplify the received waves, and pass them to the modem processor for demodulation. It may also amplify signals modulated by the modem processor and radiate them as electromagnetic waves via the antenna 1. In some embodiments, at least some functional modules of the mobile communication module 250 may be provided in the processor 210, or in the same device as at least some modules of the processor 210.
The modem processor may include a modulator and a demodulator. The modulator modulates the low-frequency baseband signal to be transmitted into a medium- or high-frequency signal; the demodulator demodulates received electromagnetic wave signals into low-frequency baseband signals and passes them to the baseband processor. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor, which outputs sound through audio devices (not limited to the speaker 270A and the receiver 270B) or displays images or video through the display screen 294. In some embodiments, the modem processor may be an independent device; in others, it may be independent of the processor 210 and provided in the same device as the mobile communication module 250 or other functional modules.
The wireless communication module 260 may provide wireless communication solutions applied to the audio content provider 200, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite systems (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR). The wireless communication module 260 may be one or more devices integrating at least one communication processing module. It receives electromagnetic waves via the antenna 2, frequency-modulates and filters the signal, and sends the processed signal to the processor 210. It may also receive signals to be transmitted from the processor 210, frequency-modulate and amplify them, and radiate them as electromagnetic waves via the antenna 2.
In some embodiments, the antenna 1 of the audio content provider 200 is coupled to the mobile communication module 250 and the antenna 2 is coupled to the wireless communication module 260, so that the audio content provider 200 can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include the global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
The audio content provider 200 implements the display function through the GPU, the display screen 294, the application processor, and so on. The GPU is a microprocessor for image processing that connects the display screen 294 and the application processor and performs mathematical and geometric calculations for graphics rendering. The processor 210 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 294 is used to display images, videos, and the like, and includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, quantum dot light-emitting diodes (QLED), and so on. In some embodiments, the audio content provider 200 may include 1 or N display screens 294, where N is a positive integer greater than 1.
The audio content provider 200 can implement the shooting function through the ISP, the camera 293, the video codec, the GPU, the display screen 294, the application processor, and so on.
The ISP is used to process the data fed back by the camera 293. For example, when taking a photo, the shutter opens, light is transmitted through the lens onto the camera's photosensitive element, the optical signal is converted into an electrical signal, and the photosensitive element passes the electrical signal to the ISP, which converts it into an image visible to the naked eye. The ISP can also apply algorithmic optimization to the noise, brightness, and skin tone of the image, and can optimize parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be provided in the camera 293.
The camera 293 is used to capture still images or video. An object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. It converts the optical signal into an electrical signal and passes it to the ISP, which converts it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing, and the DSP converts it into an image signal in a standard format such as RGB or YUV. In some embodiments, the audio content provider 200 may include 1 or N cameras 293, where N is a positive integer greater than 1.
The digital signal processor is used to process digital signals; in addition to digital image signals, it can process other digital signals. For example, when the audio content provider 200 performs frequency point selection, the digital signal processor is used to perform a Fourier transform or the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The audio content provider 200 may support one or more video codecs, so that it can play or record video in multiple coding formats, for example, moving picture experts group (MPEG) 1, MPEG-2, MPEG-3, and MPEG-4.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transmission pattern between neurons in the human brain, it processes input information quickly and can also continuously self-learn. Applications such as intelligent cognition of the audio content provider 200, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
The external memory interface 220 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the audio content provider 200. The external memory card communicates with the processor 210 through the external memory interface 220 to implement data storage functions, for example saving files such as music and videos on the external memory card.
The internal memory 221 may be used to store computer-executable program code, which includes instructions. By running the instructions stored in the internal memory 221, the processor 210 executes the various functional applications and data processing of the audio content provider 200. The internal memory 221 may include a program storage area and a data storage area. The program storage area may store the operating system and the applications required by at least one function (such as a sound playback function or an image playback function). The data storage area may store data created during use of the audio content provider 200 (such as audio data and a phone book). In addition, the internal memory 221 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or universal flash storage (UFS).
The audio content provider 200 can implement audio functions, such as music playback and recording, through the audio module 270, the speaker 270A, the receiver 270B, the microphone 270C, the headset jack 270D, the application processor, and so on.
The audio module 270 is used to convert digital audio information into an analog audio signal for output, and to convert analog audio input into a digital audio signal. It may also be used to encode and decode audio signals. In some embodiments, the audio module 270, or some of its functional modules, may be provided in the processor 210.
The speaker 270A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. The audio content provider 200 can play music or conduct hands-free calls through the speaker 270A.
The receiver 270B, also called an "earpiece", is used to convert audio electrical signals into sound signals. When the audio content provider 200 answers a call or a voice message, the voice can be heard by holding the receiver 270B close to the ear.
The microphone 270C, also called a "mic" or "mouthpiece", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak close to the microphone 270C to input the sound signal. The audio content provider 200 may be provided with at least one microphone 270C. In other embodiments, two microphones 270C may be provided, implementing noise reduction in addition to collecting sound; in still other embodiments, three, four, or more microphones 270C may be provided to collect sound, reduce noise, identify sound sources, implement directional recording, and so on.
The headset jack 270D is used to connect wired headphones. It may be the USB interface 230, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The pressure sensor 280A is used to sense pressure signals and can convert them into electrical signals. In some embodiments, the pressure sensor 280A may be provided on the display screen 294. There are many types of pressure sensors 280A, such as resistive, inductive, and capacitive pressure sensors. A capacitive pressure sensor may comprise at least two parallel plates of conductive material; when a force acts on it, the capacitance between the electrodes changes, and the audio content provider 200 determines the pressure intensity from the change in capacitance. When a touch operation acts on the display screen 294, the audio content provider 200 detects the intensity of the touch operation through the pressure sensor 280A and may also calculate the touch position from its detection signal. In some embodiments, touch operations acting on the same position but with different intensities may correspond to different instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the short message application icon, an instruction to view the short message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the same icon, an instruction to create a new short message is executed.
The gyroscope sensor 280B may be used to determine the motion attitude of the audio content provider 200. In some embodiments, the angular velocities of the audio content provider 200 around three axes (i.e., the x, y, and z axes) may be determined through the gyroscope sensor 280B. The gyroscope sensor 280B may be used for image stabilization: when the shutter is pressed, it detects the angle by which the audio content provider 200 shakes, calculates the distance the lens module needs to compensate, and lets the lens counteract the shake through reverse motion. The gyroscope sensor 280B may also be used for navigation and motion-sensing game scenarios.
The barometric pressure sensor 280C is used to measure air pressure. In some embodiments, the audio content provider 200 calculates altitude from the pressure value measured by the barometric pressure sensor 280C to assist positioning and navigation.
The magnetic sensor 280D includes a Hall sensor. The audio content provider 200 may use the magnetic sensor 280D to detect the opening and closing of a flip holster. In some embodiments, when the audio content provider 200 is a flip phone, it can detect the opening and closing of the flip cover through the magnetic sensor 280D, and then set features such as automatic unlocking on flip-open according to the detected open or closed state of the holster or cover.
The acceleration sensor 280E can detect the magnitude of the acceleration of the audio content provider 200 in all directions (generally three axes). When the audio content provider 200 is stationary, the magnitude and direction of gravity can be detected. It may also be used to recognize the attitude of the user device and is applied to landscape/portrait switching, pedometers, and similar applications.
The distance sensor 280F is used to measure distance. The audio content provider 200 may measure distance by infrared or laser. In some embodiments, in a shooting scene, the audio content provider 200 may use the distance sensor 280F to measure distance for fast focusing.
The proximity light sensor 280G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The LED may be an infrared LED. The audio content provider 200 emits infrared light outward through the LED and uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the audio content provider 200; when insufficient reflected light is detected, it can be determined that there is none. The audio content provider 200 may use the proximity light sensor 280G to detect that the user is holding it close to the ear during a call, so as to automatically turn off the screen and save power. The proximity light sensor 280G may also be used for automatic unlocking and screen locking in holster mode and pocket mode.
The ambient light sensor 280L is used to sense the ambient light brightness. The audio content provider 200 may adaptively adjust the brightness of the display screen 294 according to the perceived brightness. The ambient light sensor 280L may also be used to automatically adjust the white balance when taking photos, and may cooperate with the proximity light sensor 280G to detect whether the audio content provider 200 is in a pocket, to prevent accidental touches.
The fingerprint sensor 280H is used to collect fingerprints. The audio content provider 200 may use the collected fingerprint characteristics to implement fingerprint unlocking, application-lock access, fingerprint photography, fingerprint call answering, and so on.
The temperature sensor 280J is used to detect temperature. In some embodiments, the audio content provider 200 executes a temperature processing strategy using the temperature detected by the temperature sensor 280J. For example, when the reported temperature exceeds a threshold, the audio content provider 200 reduces the performance of a processor located near the temperature sensor 280J to lower power consumption and implement thermal protection. In other embodiments, when the temperature is below another threshold, the audio content provider 200 heats the battery 242 to avoid an abnormal shutdown caused by low temperature. In still other embodiments, when the temperature is below yet another threshold, the audio content provider 200 boosts the output voltage of the battery 242 to avoid an abnormal shutdown caused by low temperature.
The touch sensor 280K is also called a "touch panel". It may be provided on the display screen 294, the two together forming a touchscreen. The touch sensor 280K is used to detect touch operations acting on or near it and may pass the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display screen 294. In other embodiments, the touch sensor 280K may be provided on a surface of the audio content provider 200 at a position different from that of the display screen 294.
The bone conduction sensor 280M can acquire vibration signals. In some embodiments, the bone conduction sensor 280M can acquire the vibration signal of the vibrating bone mass of the human vocal part. It may also contact the human pulse and receive the blood pressure beat signal. In some embodiments, the bone conduction sensor 280M may be provided in a headset, forming a bone conduction headset. The audio module 270 may parse out a speech signal from the vibration signal of the vibrating bone mass acquired by the bone conduction sensor 280M to implement a voice function; the application processor may parse heart rate information from the blood pressure beat signal acquired by the bone conduction sensor 280M to implement a heart rate detection function.
The keys 290 include a power key, volume keys, and so on. The keys 290 may be mechanical keys or touch keys. The audio content provider 200 may receive key input and generate key signal input related to its user settings and function control.
The motor 291 can generate vibration alerts. It may be used for incoming-call vibration alerts as well as touch vibration feedback. For example, touch operations acting on different applications (such as photographing and audio playback) may correspond to different vibration feedback effects, as may touch operations acting on different regions of the display screen 294. Different application scenarios (for example, time reminders, receiving messages, alarm clocks, and games) may also correspond to different vibration feedback effects, and the touch vibration feedback effect may further support customization.
The indicator 292 may be an indicator light and may be used to indicate the charging status and battery level changes, as well as messages, missed calls, notifications, and so on.
The SIM card interface 295 is used to connect a SIM card. A SIM card may be brought into contact with or separated from the audio content provider 200 by inserting it into or removing it from the SIM card interface 295. The audio content provider 200 may support 1 or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 295 may support Nano SIM cards, Micro SIM cards, standard SIM cards, and so on. Multiple cards may be inserted into the same SIM card interface 295 at the same time; the cards may be of the same or different types. The SIM card interface 295 may also be compatible with different types of SIM cards and with external memory cards. The audio content provider 200 interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the audio content provider 200 uses an eSIM, i.e., an embedded SIM card, which may be embedded in the audio content provider 200 and cannot be separated from it.
It can be understood that the structure illustrated in the embodiments of the present invention does not constitute a specific limitation on the master control device. In other embodiments of the present application, the master control device may include more or fewer components than shown in the figure, may combine certain components, may split certain components, or may arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Based on the above embodiments, FIG. 3 is a flowchart of a process 300 of the spatial audio rendering method of the present application. As shown in FIG. 3, the process 300 may be applied to the application scenario shown in FIG. 1 and is executed by the audio signal playback end to obtain a predicted head movement position for use by the audio content provider when rendering the audio. The process 300 is described as a series of steps or operations; it should be understood that the process 300 may be executed in various orders and/or occur simultaneously and is not limited to the execution order shown in FIG. 3.
The process 300 includes the following steps:
Step 301: Obtain the delay information of the audio content provider.
The delay information indicates the delay situation of the audio content provider and may cover the following cases.
First case: the delay information includes a delay value sent by the audio content provider.
After determining the delay caused by its own transmission link, rendering processing, and so on, the audio content provider may encapsulate this delay in the first information sent to the audio signal playback end. The audio signal playback end can then extract the audio content provider's delay directly from the first information.
Second case: the delay information includes a second historical head movement position sent by the audio content provider.
The audio signal playback end (for example, a pair of earphones) may periodically detect, through sensors (for example, a gyroscope or a gravity sensor), the current position of the head of the user wearing it (referred to herein as the measured head movement position), and send the measured head movement position to the audio content provider.
After receiving the measured head movement position, the audio content provider does not process it immediately, but continues to process audio frames according to the established flow. When processing is complete and the first information is to be sent to the audio signal playback end, the measured head movement position is carried in the first information. Because some time has passed by then, the measured head movement position has become a historical head movement position (referred to herein as the second historical head movement position). It should be understood that, mathematically, the measured head movement position and the second historical head movement position are identical.
On this basis, the audio signal playback end can compute the time difference between the time it receives the second historical head movement position and the time it sent the measured head movement position (i.e., the difference between the receive time and the send time of the same head movement position), thereby indirectly obtaining the delay of the audio content provider.
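As an illustration of this indirect measurement, the following is a minimal Python sketch. The sequence-number matching, the monotonic clock, and all names here are illustrative assumptions; the patent only requires pairing the send time and receive time of the same head movement position.

    import time

    class DelayEstimator:
        """Estimates the content provider's delay by matching an echoed
        historical head position against the send time of that position."""

        def __init__(self):
            self._send_times = {}  # hypothetical sequence number -> send time (s)

        def on_position_sent(self, seq, head_position):
            # Record when this measured head position left the playback end.
            self._send_times[seq] = time.monotonic()

        def on_position_echoed(self, seq):
            # The provider carried the same position back in the first
            # information; the elapsed time is the provider-side delay.
            sent = self._send_times.pop(seq, None)
            if sent is None:
                return None
            return time.monotonic() - sent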
Third case: the delay information includes the second historical head movement position and a first uncertainty coefficient sent by the audio content provider. For the second historical head movement position, refer to the description of the second case above; it is not repeated here.
The first uncertainty coefficient is an important parameter of the Kalman filter and helps improve the accuracy of the prediction result.
This case is related to the algorithm for obtaining the predicted head movement position in step 302; for details, see the description below.
Step 302: Obtain the predicted head movement position according to the delay information.
As described above, in order to let the user enjoy a real, immersive listening experience through the audio signal playback end (for example, earphones), spatial audio technology can be used to render sound sources of different formats, such as stereo, multi-channel surround sound, and specially produced multi-object sources, into binaural signals. In the real world, when our head rotates or moves, the absolute position of a sound source does not change, but the direction of the source relative to the head does. For example, if a guitar is playing in front of you and you turn to the right, the guitar's sound moves, relatively, to your left. As another example, with a guitar on the left of a stage and a saxophone on the right, when you move to the side of the stage the two sounds merge and come from the same direction. The real, immersive listening experience is exactly the listening experience of the guitar or the stage performance described above: as the position of the user's head changes (including but not limited to up/down and left/right displacement, head rotation, and so on), sound from the same audio source is rendered into different listening effects at the user's ears. This is the rendering method with a head-tracking effect: the sensors in the earphones perform head tracking, and the corresponding rotation or displacement is reflected in the spatial rendering of the sound source.
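To make the head-tracking idea concrete, here is a minimal sketch, an illustration rather than the patent's method, of how a single yaw rotation of the head changes the head-relative source direction used for binaural rendering. A full renderer would handle all three rotation axes, head displacement, and HRTF filtering.

    def source_direction_in_head_frame(source_azimuth_deg, head_yaw_deg):
        """World-frame source azimuth -> head-relative azimuth.
        Turning the head right (positive yaw) moves a frontal source
        to the listener's left, matching the guitar example above."""
        relative = source_azimuth_deg - head_yaw_deg
        # Normalize to (-180, 180] degrees; negative values are to the left.
        return (relative + 180.0) % 360.0 - 180.0

    # A guitar straight ahead (0 deg); the listener turns 90 deg right.
    print(source_direction_in_head_frame(0.0, 90.0))  # -90.0: sound now from the left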
At the same time, although the audio content provider can supply the high computing power needed for the head-tracked rendering algorithm described above, the characteristics of the audio content provider introduce a high delay, which scrambles important spatial information such as direction. For example, if an object's sound image is designed to be directly in front in space and the user's head turns, the sound image first moves to directly in front of the user's face and only then returns to directly in front in space; there is a noticeable delayed "damping" feel that degrades the listening experience.
To solve the problem caused by this delay, the audio signal playback end can predict the user's head movement position and then send the predicted head movement position to the audio content provider, so that the audio content provider can render audio frames based on it. The rendered audio frames thus obtained provide the listening effect corresponding to the predicted head movement position. When the rendered audio frames arrive at the audio signal playback end, the prediction horizon offsets the delay caused by rendering, link transmission, and so on, so that when the frames are played they exactly match the user's head movement position, achieving the real, immersive listening experience described above.
In the present application, the audio signal playback end may first obtain the first historical head movement position corresponding to the delay information, then obtain the measured head movement position through the sensors, and then obtain the predicted head movement position according to the first historical head movement position and the measured head movement position.
Corresponding to the three cases of the delay information in step 301, the audio signal playback end may obtain the first historical head movement position in three ways:
1. When the delay information includes the delay value sent by the audio content provider (the first case above), extract from a cache the first historical head movement position corresponding to the delay; the cache stores multiple sets of correspondences between delays and historical head movement positions in advance.
As described above, the audio signal playback end may periodically detect, through its sensors, the measured head movement position of the user wearing it. For later use, the audio signal playback end may store the correspondence between each measured head movement position and its detection time. When the above delay is obtained, the measured head movement position of that earlier moment (referred to herein as the first historical head movement position) can be obtained by querying this correspondence, as in the sketch below.
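A minimal sketch of such a timestamped cache follows; the ring-buffer size and the nearest-timestamp lookup policy are illustrative assumptions rather than details fixed by the patent.

    import time
    from collections import deque

    class HeadPositionCache:
        """Ring buffer of (timestamp, position) samples from the IMU."""

        def __init__(self, max_samples=512):
            self._samples = deque(maxlen=max_samples)

        def store(self, position):
            # Called each time the sensor reports a measured head position.
            self._samples.append((time.monotonic(), position))

        def lookup(self, delay_seconds):
            # Return the sample closest to (now - delay), i.e. the first
            # historical head movement position matching the reported delay.
            if not self._samples:
                return None
            target = time.monotonic() - delay_seconds
            return min(self._samples, key=lambda s: abs(s[0] - target))[1]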
2. When the delay information includes the second historical head movement position sent by the audio content provider (the second case above), use the second historical head movement position as the first historical head movement position.
Referring to the description of the second case above, the audio content provider encapsulates the second historical head movement position directly in the first information and sends it to the audio signal playback end. The audio signal playback end thus directly obtains the measured head movement position of the moment corresponding to the delay (the second historical head movement position), and can therefore directly determine the second historical head movement position as the first historical head movement position.
3. When the delay information includes the second historical head movement position and the first uncertainty coefficient sent by the audio content provider (the third case above), use the second historical head movement position as the first historical head movement position.
Referring to the description in item 2 above, in the third case of the delay information the audio signal playback end may likewise directly determine the second historical head movement position as the first historical head movement position.
The audio signal playback end may obtain the measured head movement position through its sensors; the measured head movement position is the most recently detected head movement position of the user.
In the present application, obtaining the predicted head movement position according to the first historical head movement position and the measured head movement position may use either of the following two algorithms.
First algorithm: on the basis of items 1 and 2 above, the audio signal playback end may obtain the difference between the first historical head movement position and the measured head movement position; obtain the head movement change rate, which is derived from the N previously obtained measured head movement positions, N > 1; and obtain the predicted head movement position according to the difference and the head movement change rate (see the sketch after this description).
The difference between the first historical head movement position and the measured head movement position may refer to the distance between the two. For example, in a three-dimensional world the head movement position may be expressed in xyz coordinates, in which case the difference may be the distance between the coordinate values of the first historical head movement position and the measured head movement position. The head movement position may also be expressed in other ways, which the present application does not specifically limit.
As described above, the audio signal playback end may periodically detect the measured head movement position of the user through its sensors, so it can derive the head movement change rate from the N previously obtained measured head movement positions, where the N positions may be the N most recent ones counting back from the current measured head movement position.
According to the difference and the head movement change rate, the audio signal playback end may predict a head movement change value, using means such as cubic spline interpolation, and then add this change value to the current measured head movement position to obtain the predicted head movement position.
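The following is a minimal sketch of this first algorithm under simplifying assumptions: a single yaw angle instead of a full 3-D position, uniformly spaced samples, the delay taken directly as the prediction horizon, and SciPy's CubicSpline standing in for "means such as cubic spline interpolation". It is an illustration, not the patent's exact formulation.

    import numpy as np
    from scipy.interpolate import CubicSpline

    def predict_head_yaw(history_yaw_deg, sample_period_s, delay_s):
        """Extrapolate the head yaw 'delay_s' seconds past the newest
        sample by fitting a cubic spline to the last N measured samples."""
        n = len(history_yaw_deg)            # N > 1 measured positions
        t = np.arange(n) * sample_period_s  # uniform sample times
        spline = CubicSpline(t, history_yaw_deg, extrapolate=True)
        current = history_yaw_deg[-1]
        # Head movement change value over the prediction horizon,
        # added to the current measured position.
        change = float(spline(t[-1] + delay_s)) - current
        return current + change

    # Head turning at roughly 30 deg/s, sampled every 100 ms;
    # predict 80 ms ahead to offset the provider-side delay.
    samples = [0.0, 3.0, 6.1, 9.0, 12.2]         # last N yaw samples, deg
    print(predict_head_yaw(samples, 0.1, 0.08))  # roughly 14-15 deg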
Second, building on item 3 above, the audio signal playback end obtains the head movement change rate, derived from the N previously obtained measured head movement positions, N > 1; obtains a second uncertainty coefficient from the change rate and the first uncertainty coefficient; and feeds the second uncertainty coefficient, the first historical head movement position and the current measured head movement position into a Kalman filter to obtain the predicted head movement position.
The head movement change rate is obtained as described above.
Combining the change rate with the first uncertainty coefficient yields the second uncertainty coefficient. The second uncertainty coefficient, the first historical head movement position and the current measured head movement position are input into the Kalman filter, which determines the Kalman gain from the second uncertainty coefficient and the previous iteration's estimate uncertainty, adopts that gain as the latest gain, and outputs the predicted head movement position. A minimal scalar sketch follows.
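A minimal scalar sketch of this second algorithm, under two labeled assumptions: the gain uses the standard scalar form K = P / (P + R) (the application only states that the gain is derived from the second uncertainty coefficient and the previous estimate uncertainty), and the second uncertainty coefficient is modeled as growing with the motion rate (faster head motion means less trust in the stale historical pose).

```python
def kalman_predict(historical, measured, rate, r1, p_prev):
    """historical: first historical head pose; measured: current measured
    pose; rate: head movement change rate; r1: first uncertainty
    coefficient; p_prev: estimate uncertainty from the previous iteration."""
    r2 = r1 * (1.0 + abs(rate))      # second uncertainty coefficient (assumed model)
    k = p_prev / (p_prev + r2)       # Kalman gain from R_{n+1} and previous uncertainty
    predicted = historical + k * (measured - historical)  # X_{n+1} = X_n + K*(Y_n - X_n)
    p_next = (1.0 - k) * p_prev      # updated estimate uncertainty P_{n+1}
    return predicted, p_next
```

The same update appears again in Embodiment 3 below; here it runs per prediction cycle, with p_next fed back as p_prev on the next call.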
Note that besides these two methods, other algorithms may be used to obtain the predicted head movement position from the first historical head movement position and the measured head movement position; this application does not specifically limit them.
Step 303: send first rendering information to the audio content provider, the first rendering information including the predicted head movement position.
The audio signal playback end sends the first rendering information, containing the predicted head movement position obtained in step 302, so that the audio content provider can render the audio frame with a head-tracking effect based on that position. The resulting rendered audio frame provides the listening effect corresponding to the predicted head movement position. When the rendered frame arrives back at the playback end, the predicted lead time offsets the latency introduced by rendering, link transmission and so on; when the frame is played, it matches the user's actual head position, achieving the realistic, immersive listening experience described above.
In summary, by predicting the user's head movement position and sending the prediction to the audio content provider, this application lets the provider render audio frames against the predicted position; the prediction lead time cancels the rendering and link latency, so playback matches the user's actual head position and delivers the realistic, immersive listening experience described above.
FIG. 4 is a flowchart of a process 400 of the spatial audio rendering method of this application. As shown in FIG. 4, process 400 may be applied to the communication system of FIG. 1 and is executed jointly by the audio content provider and the audio signal playback end: the audio source is rendered and then played to the user on the playback end. Process 400 is described as a series of steps or operations, but the steps may be executed in various orders and/or concurrently, and are not limited to the order shown in FIG. 4.
Process 400 includes the following steps:
Step 401: when the first rendering mode is used, the audio content provider obtains the current frame.
The first rendering mode may be a non-low-latency link mode, in which the audio content provider renders audio frames with a head-tracking effect based on the predicted head movement position. The user can choose whether to enable the first rendering mode by operating an application (APP) installed on the audio content provider. Correspondingly, this application also provides a second rendering mode, in which the audio content provider renders audio frames without the head-tracking effect; that process is described below. Likewise, the user can enable the second rendering mode through the APP. The user operations are illustrated in the embodiments below.
The current frame may be one frame of the audio source, typically the audio frame the audio content provider is currently processing.
Step 402: when the audio format of the audio source is a non-stereo format, the audio content provider obtains the predicted head movement position.
The audio format of the audio source is either stereo or non-stereo, where non-stereo includes, without limitation, multi-channel and multi-object formats such as 5.1, 7.1 and 3DA.
In this application, the audio content provider can apply compute-intensive rendering to audio in multi-channel, multi-object (non-stereo) formats such as 5.1, 7.1 and 3DA, making full use of its computing-power advantage. Accordingly, when the audio source is non-stereo, the provider renders the audio frame with the head-tracking effect and therefore needs the predicted head movement position.
The provider can take the predicted head movement position from rendering information previously received from the audio signal playback end; optionally, the most recently received rendering information is used, which improves the accuracy of the prediction. The prediction itself is obtained as in the embodiment of FIG. 3 and is not repeated here.
Step 403: the audio content provider renders the current frame according to the predicted head movement position to obtain a first binaural signal.
In this application, the provider may render the direct-sound part, the early-reflection (ER) part and the late-reverberation (LR) part of the current frame, each with the head-tracking effect, according to the predicted head movement position, to obtain the first binaural signal.
Optionally, the head-tracked direct sound and the head-tracked reverberation (obtained by mixing ER and LR) are each rendered through audio effectors and then mixed; the mixed output is rendered through audio effectors once more and delivered as the first binaural signal. The audio effectors include, without limitation, equalizers, dynamic-range compressors and low-frequency enhancers. A structural sketch of this chain follows.
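The sketch below mirrors the optional chain just described: render each stem with the predicted pose, sum ER and LR into the reverberation path, run each path through the effector chain, mix, and run the mix through the effectors once more. The callables and names are placeholders; the application does not specify an API.

```python
def render_first_binaural(direct, er, lr, pose, hrtf_render, fx_chain):
    """direct/er/lr: stems of the current frame; pose: predicted head
    movement position; hrtf_render: callable(stem, pose) -> binaural pair;
    fx_chain: callable(signal) -> signal (EQ, compressor, bass boost, ...)."""
    d = hrtf_render(direct, pose)                            # head-tracked direct sound
    reverb = hrtf_render(er, pose) + hrtf_render(lr, pose)   # head-tracked ER + LR mix
    mixed = fx_chain(d) + fx_chain(reverb)                   # Mixer1: sum the effected paths
    return fx_chain(mixed)                                   # final pass -> first binaural signal
```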
Under the step-403 branch, the first audio stream may include the first binaural signal.
Step 404: when the audio format of the audio source is a stereo format, the audio content provider does not spatially render the audio source.
In this application, the audio content provider does not render stereo audio sources; head-tracked rendering of these is left to the audio signal playback end, so the provider's compute can be reserved for the compute-intensive rendering of multi-channel, multi-object (non-stereo) formats such as 5.1, 7.1 and 3DA, making full use of its computing-power advantage. Accordingly, when the audio source is stereo, the provider simply encapsulates the audio source in the first audio stream.
Step 405: the audio content provider sends first information to the audio signal playback end, the first information including the first audio stream, the audio format of the audio source, and the delay information.
As described above, when the audio format indicates a non-stereo format, the first audio stream includes the first binaural signal rendered by the audio content provider; when the audio format indicates a stereo format, the first audio stream includes the audio source.
For the delay information, see the description of the embodiment of FIG. 3, not repeated here.
Step 406: when the audio format indicates a stereo format, the audio signal playback end obtains the measured head movement position through a sensor.
Stereo audio sources are rendered with the head-tracking effect by the playback end. Because the playback end's IMU data transmission link has low latency, it can obtain the current head movement position (referred to herein as the measured head movement position) from its sensor in real time.
Step 407: the audio signal playback end renders the current frame according to the measured head movement position to obtain a second binaural signal.
The playback end may render the direct-sound part of the current frame with the head-tracking effect according to the measured head movement position, and render the frame's ER and LR without the head-tracking effect, to obtain the second binaural signal.
Optionally, the head-tracked direct sound and the non-head-tracked reverberation (obtained by mixing ER and LR) are each rendered through audio effectors and then mixed; the mixed output is rendered through audio effectors and mixed again to obtain the second binaural signal. The audio effectors include, without limitation, equalizers, dynamic-range compressors and low-frequency enhancers.
Step 408: the audio signal playback end plays audio according to a target binaural signal, the target binaural signal including the first binaural signal or the second binaural signal.
After the above steps the playback end holds the target binaural signal, which may be the first binaural signal (non-stereo source) or the second binaural signal (stereo source).
In a possible implementation, when the second rendering mode is used, the audio content provider obtains the current frame and renders it without the head-tracking effect to obtain another binaural signal (distinct from the second binaural signal above; called the third binaural signal below), then sends second information to the playback end, the second information including a second audio stream that carries the third binaural signal. The playback end obtains the measured head movement position through its sensor and renders the current frame (i.e., the frame already rendered by the provider without the head-tracking effect) according to the measured position to obtain a fourth binaural signal. The playback end then plays audio according to the target binaural signal, here the fourth binaural signal.
In the second rendering mode the provider renders frames without the head-tracking effect, which still exploits its computing power while shortening the rendering latency; the low-compute head-tracked rendering is done by the playback end instead. Compared with the first rendering mode the latency is lower, but the head-tracking rendering quality is poorer.
The technical solution of this application is described in detail below through several specific embodiments. In the embodiments below, the audio content provider may be a mobile phone and the audio signal playback end may be an earphone. "Audio source" may also be called the sound source or input signal; "rendering" may also mean (binaural) spatial audio rendering; "head movement position" may also mean IMU data; the rendering algorithm deployed on the phone may be called the first rendering part of the binaural spatial audio rendering algorithm, and the one deployed on the earphone the second rendering part; "audio format" may also mean the format flag; and "delay" may also mean the dynamic delay of the spatial audio link.
Embodiment 1
FIG. 5 is an overall flow diagram of the spatial audio rendering method of this application. As shown in FIG. 5, this is a spatial audio processing scenario with joint phone-and-earphone rendering in the first rendering mode. The joint rendering strategy of this embodiment can be designed as follows: the phone makes the decision; a stereo source is sent to the earphone for binaural spatial audio rendering, while a non-stereo source is rendered on the phone and then sent to the earphone, which outputs it directly to the speakers. A sketch of this dispatch follows.
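This sketch shows the phone-side dispatch just described. The packet layout, the "stereo" flag value and the callable are assumptions for illustration only.

```python
def build_first_information(frame, audio_format, predicted_pose, link_delay_ms, render):
    """render: callable(frame, pose) -> first binaural signal (phone-side)."""
    if audio_format == "stereo":
        payload = frame                          # left untouched; the earphone renders it
    else:
        payload = render(frame, predicted_pose)  # 5.1/7.1/3DA rendered on the phone
    # First information: audio stream + format flag + link delay (step 405 / S4).
    return {"audio": payload, "format": audio_format, "delay_ms": link_delay_ms}
```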
FIG. 6 is a detailed flow diagram of the spatial audio rendering method of this application. As shown in FIG. 6, the flow of this embodiment includes:
S1: the phone receives the input signal and detects its format; meanwhile the earphone delivers predicted IMU data to the first rendering part of the phone's binaural spatial audio rendering algorithm.
The first rendering part comprises direct-sound (Direct) rendering, early-reflection (ER) rendering and late-reverberation (LR) rendering. The direct-sound part and the reverberation part mixed from ER and LR are head-tracked using the received IMU data. The head-tracked direct sound and the head-tracked reverberation each pass through audio effectors and enter the first mixing module (Mixer1). The output of Mixer1 passes through audio effectors once more and is delivered to Bluetooth in two-channel form. The audio effectors include, without limitation, equalizers, dynamic-range compressors and low-frequency enhancers.
S2: if the input signal format is stereo, the phone passes the input signal and the format flag straight to the first mixing module;
S3: if the input signal format is not stereo, the phone renders the input signal with the first part of the binaural spatial audio rendering algorithm using the predicted IMU data, and passes the rendered audio stream and the format flag to the first mixing module;
S4: the audio stream from the first mixing module, the format flag and the dynamic delay of the phone's spatial audio link are packed and transmitted to the earphone over Bluetooth or the like;
S5: the earphone stores the measured IMU data; according to the delay reported by the phone it selects the corresponding historical measured IMU data and, together with the current measured IMU data, passes them to the prediction module, while the current measured IMU data also goes to the second rendering part of the binaural spatial audio rendering algorithm. The prediction module generates predicted IMU data from the historical and current measurements and, again per S1, delivers the predicted IMU data to the phone.
FIG. 7 is a flow diagram of the head-movement-position prediction algorithm. As shown in FIG. 7, the flow includes the following steps (a history-buffer sketch follows this list):
S5.1: the earphone stores historical IMU data in a cache; according to the phone delay it receives, it fetches the corresponding historical IMU data from the cache, computes the difference between the measured IMU data and the historical IMU data, and passes the difference to the prediction unit;
S5.2: compute the change rate of the past N measured earphone IMU samples (e.g., yaw, pitch and roll angles, or quaternions) and pass the change coefficient to the prediction unit;
S5.3: from the change rate and the difference computed in S5.1, predict the head movement change value, for example by cubic spline interpolation;
S5.4: add the head movement change value to the measured earphone IMU data to generate the predicted head movement position;
S5.5: deliver the predicted head movement position to the phone.
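The sketch below illustrates the S5.1 history cache: the earphone timestamps each sensed IMU sample, and on receiving the phone-reported delay it looks up the sample closest to (now − delay), i.e. the pose the phone-side render was based on. The buffer capacity and field layout are assumptions.

```python
import time
from collections import deque

class ImuHistory:
    """Ring buffer of timestamped IMU samples (illustrative)."""
    def __init__(self, max_samples=512):
        self.buf = deque(maxlen=max_samples)   # (timestamp, imu_sample) pairs

    def push(self, sample):
        self.buf.append((time.monotonic(), sample))

    def lookup(self, delay_s):
        """Return the stored sample closest to (now - delay_s)."""
        if not self.buf:
            return None
        target = time.monotonic() - delay_s
        return min(self.buf, key=lambda entry: abs(entry[0] - target))[1]
```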
S6: the earphone detects the source format; if it is stereo, the earphone renders the input signal with the second part of the binaural spatial audio rendering algorithm using its measured IMU data, and passes the rendered audio stream to the second mixing module;
The second rendering part comprises direct-sound (Direct) rendering and basic low-compute reverberation rendering. The direct-sound part is head-tracked using the received IMU data. The head-tracked direct sound and the non-head-tracked basic reverberation are each rendered through audio effectors and then mixed. The overall mixed output passes through audio effectors once more and enters the second mixing module (Mixer2). The audio effectors include, without limitation, equalizers, dynamic-range compressors and low-frequency enhancers.
S7: if the source format is not stereo, the earphone passes the input signal straight to the second mixing module;
S8: the output of the second mixing module is sent to the earphone speakers for playback.
Embodiment 1 exploits the phone's high computing power and the earphone's low latency, and the earphone can also run independently. To counter the long latency of the phone link, an end-to-end latency prediction algorithm is designed, substantially reducing head-tracking latency (an expected reduction of more than 50%).
Embodiment 2
Compared with Embodiment 1, the differences in Embodiment 2 mainly concern the Bluetooth payloads in S1 and S4 above, and the head-movement-position prediction algorithm in S5.
In this embodiment, S1 becomes: the phone receives the input signal and detects its format, while the earphone delivers the earphone IMU data and the predicted IMU data to the first rendering part of the phone's binaural spatial audio rendering algorithm.
S4 becomes: the IMU data, audio stream and format flag of the first mixing module are packed and transmitted to the earphone over Bluetooth or the like.
S5 becomes: the earphone passes the historical IMU data reported by the phone to the prediction module, and passes the measured IMU data both to the prediction module and to the second rendering part of the binaural spatial audio rendering algorithm; the prediction module generates predicted IMU data from the phone-reported historical IMU data and the measured IMU data and, per S1, delivers the measured and predicted IMU data to the phone.
FIGS. 8a and 8b are flow diagrams of the head-movement-position prediction algorithm. As shown in FIGS. 8a and 8b, the flow includes the following steps:
S5.1: compute the difference between the measured IMU data and the historical IMU data reported by the phone, and pass the difference to the prediction algorithm;
S5.2: compute the change rate of the past N measured earphone IMU samples (e.g., yaw, pitch, roll, or quaternions) and pass the change coefficient to the prediction unit;
S5.3: from the change rate and the difference computed in S5.1, predict the head movement change value, for example by cubic spline interpolation;
S5.4: add the head movement change value to the measured earphone IMU data to generate the predicted head movement position;
S5.5: deliver the measured head movement position and the predicted head movement position to the phone.
Embodiment 3
Compared with Embodiment 1, the differences in Embodiment 3 mainly concern the Bluetooth payloads in S1 and S4, and the prediction algorithm in S5.
In Embodiment 3, S1 becomes: the phone receives the input signal and detects its format, while the earphone delivers the uncertainty coefficient Rn and the predicted IMU data to the first rendering part of the phone's binaural spatial audio rendering algorithm.
S4 becomes: the IMU data, uncertainty coefficient Rn, audio stream and format flag of the first mixing module are packed and transmitted to the earphone over Bluetooth or the like.
S5 becomes: the earphone passes the phone-reported IMU data and uncertainty coefficient Rn to the prediction module, and passes the measured IMU data both to the prediction module and to the second rendering part of the binaural spatial audio rendering algorithm; the prediction module generates the predicted IMU value from the phone-reported IMU data, the uncertainty coefficient Rn and the current earphone IMU data and, per S1, delivers the measured and predicted IMU data to the phone.
FIGS. 9a and 9b are flow diagrams of the head-movement-position prediction algorithm. As shown in FIGS. 9a and 9b, the flow includes the following steps:
S5.1: compute the change rate of the past N measured earphone IMU samples (e.g., yaw, pitch, roll, or quaternions); combine it with the historical measurement uncertainty coefficient Rn to determine the measurement uncertainty coefficient Rn+1, and feed Rn+1 to the Kalman filter, together with the historical head movement position reported by the phone and the measured head movement position from the earphone sensor;
S5.2: determine the Kalman gain Kn+1 from the uncertainty coefficient Rn+1 and the estimate uncertainty Pn carried over from the previous iteration, and send the gain to the state-update module;
S5.3: update the predicted head movement position as X_{n+1} = X_n + K_{n+1}·(Y_n − X_n) and output it, and update the estimate uncertainty as P_{n+1} = (1 − K_{n+1})·P_n, where Y_n is the current actual head movement position and X_n is the historical head movement position. A one-iteration numeric illustration follows this list.
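The worked iteration below uses assumed numbers and the standard scalar gain form K = P / (P + R); the gain formula itself is an assumption, since the text only says the gain is determined from Rn+1 and the previous estimate uncertainty.

```python
# One iteration of S5.2/S5.3 with assumed numbers: previous uncertainty
# P_n = 0.04, measurement uncertainty R_{n+1} = 0.01, historical pose
# X_n = 10.0 deg yaw, current measured pose Y_n = 12.0 deg yaw.
p_n, r_n1 = 0.04, 0.01
k_n1 = p_n / (p_n + r_n1)           # gain: 0.04 / 0.05 = 0.8
x_n1 = 10.0 + k_n1 * (12.0 - 10.0)  # X_{n+1} = 10 + 0.8 * 2 = 11.6 deg
p_next = (1.0 - k_n1) * p_n         # P_{n+1} = 0.2 * 0.04 = 0.008
print(k_n1, x_n1, p_next)           # 0.8 11.6 0.008
```

A small P relative to R pulls the gain toward 0 (trust the historical estimate); here P dominates, so the output leans toward the fresh measurement.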
Embodiment 4
FIG. 10 is a schematic flow diagram of the framework of this application. As shown in FIG. 10, this embodiment provides a scheme compatible with both non-low-latency (first rendering mode) and low-latency (second rendering mode) joint phone-and-earphone spatial audio rendering. In the second rendering mode the earphone no longer performs reverberation rendering and cannot perform spatial audio rendering independently of the phone, but its compute requirement is small and its structure simple.
S1: a low-latency mode is added to the user interface (UI) of an APP that supports spatial audio, and a low-latency-mode flag is issued according to the user's selection;
S2: in low-latency mode, the phone preprocesses all types of audio signals, including downmixing and reverberation processing;
S3: the result is transmitted to the earphone, which performs the direct-sound rendering of the binaural spatial audio rendering algorithm and outputs to the earphone speakers;
S4: in non-low-latency mode, the phone performs head-tracked rendering, covering direct-sound processing for non-stereo sources and reverberation processing for signals of all formats, and then transmits the signal to the earphone;
S5: the earphone checks the input format; stereo sources receive direct-sound rendering and are then output to the earphone speakers, while non-stereo sources are output to the speakers directly.
In this embodiment, the low-latency-mode flow is shown in FIG. 11 (a schematic flow diagram of the low-latency mode of this application). After the user selects low-latency mode, the phone preprocesses the received stereo, 5.1/7.1, 3DA and other sources. Preprocessing includes downmixing multi-channel sources such as 5.1/7.1 and multi-object sources such as 3DA into a unified stereo format (a downmix sketch follows). The result is fed to the first rendering part of the binaural spatial audio rendering algorithm for basic, non-head-tracked reverberation rendering and audio-effector processing. The reverberation is mixed with the effector-processed downmix input and transmitted to the earphone over Bluetooth.
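The sketch below shows a conventional 5.1 → stereo downmix for the preprocessing step. The ITU-style −3 dB coefficients and the dropped LFE channel are assumptions; the application does not fix the downmix matrix.

```python
import numpy as np

def downmix_51_to_stereo(L, R, C, LFE, Ls, Rs):
    """Each argument is one channel of the 5.1 frame as a numpy array."""
    g = 1.0 / np.sqrt(2.0)            # -3 dB for center and surrounds (assumed)
    left = L + g * C + g * Ls
    right = R + g * C + g * Rs        # LFE is commonly omitted in a 2.0 downmix
    return np.stack([left, right], axis=0)
```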
The earphone receives the reverberant stereo from the phone and, combined with the IMU head movement data from its sensor, performs direct-sound rendering with the second part of the binaural spatial audio rendering algorithm. This second part comprises a direct-sound localization rendering module and an audio-effector module.
In this embodiment, the non-low-latency-mode flow is shown in FIG. 12 (a schematic flow diagram of the non-low-latency mode of this application). After receiving input sources of various formats, the phone uses the first part of the binaural spatial audio rendering algorithm to perform direct-sound rendering of non-stereo sources and reverberation rendering of sources of all formats. Both the direct-sound and reverberation rendering here include the head-tracking effect, and the reverberation is implemented by rendering ER and LR separately and summing them. The required predicted head movement position can be obtained with the prediction algorithms of Embodiments 1 to 3 above to optimize head-tracking latency; FIG. 12 illustrates this with the prediction algorithm of Embodiment 2.
The phone processes the direct-sound rendering result (or the original stereo input) and the reverberation rendering result through the corresponding audio effectors, mixes them into stereo format, and transmits the result to the earphone over Bluetooth.
The earphone checks the original input format: non-stereo formats are output straight to the speakers, while stereo formats receive direct-sound rendering and audio-effector processing via the second part of the binaural spatial audio rendering algorithm and the IMU head movement data from the earphone sensor, and are then output to the speakers for playback.
This embodiment adds a UI design that lets the joint rendering scheme be selected according to the user's priorities between head-tracking latency and spatial audio rendering quality.
Joint phone-and-earphone rendering: the rendering algorithm is split between phone and earphone. The phone renders the multi-channel/multi-object part (5.1/7.1/3DA and the like) and the reverberation effect, exploiting its computing power; the earphone performs direct-sound rendering of stereo, exploiting the short, low-latency IMU data transmission link.
Head-movement-position prediction: to counter the phone's large head-tracking delay, a head-tracking latency prediction algorithm is designed. On the earphone, it takes the phone link delay, the earphone delay and the current IMU value and generates predicted IMU data. The measured earphone IMU data goes to the second part of the binaural spatial audio rendering algorithm, and the predicted IMU data to the first part. This scheme predicts the head movement position and can cut the impact of link latency by more than half.
FIG. 13 is an exemplary structural diagram of a spatial audio rendering apparatus 1300 of this application. As shown in FIG. 13, the apparatus 1300 of this embodiment may be applied to the audio signal playback end. The apparatus 1300 may include an acquisition module 1301, a prediction module 1302, a sending module 1303, a receiving module 1304, a rendering module 1305 and a playback module 1306, where:
the acquisition module 1301 is configured to obtain the delay information of the audio content provider; the prediction module 1302 is configured to obtain the predicted head movement position from the delay information; and the sending module 1303 is configured to send the first rendering information, including the predicted head movement position, to the audio content provider.
In a possible implementation, the prediction module 1302 is specifically configured to obtain the first historical head movement position corresponding to the delay information; obtain the measured head movement position through a sensor; and obtain the predicted head movement position from the first historical head movement position and the measured head movement position.
In a possible implementation, the delay information includes the delay sent by the audio content provider; the prediction module 1302 is specifically configured to extract, from a cache, the first historical head movement position corresponding to the delay, the cache pre-storing multiple delay-to-historical-head-movement-position mappings.
In a possible implementation, the delay information includes the second historical head movement position sent by the audio content provider, the second historical head movement position being the measured head movement position at the time the second rendering information was sent, the second rendering information having been sent earlier than the first rendering information; the prediction module 1302 is specifically configured to take the second historical head movement position as the first historical head movement position.
In a possible implementation, the prediction module 1302 is specifically configured to obtain the difference between the first historical head movement position and the measured head movement position; obtain the head movement change rate, derived from the N previously obtained measured head movement positions, N > 1; and obtain the predicted head movement position from the difference and the change rate.
In a possible implementation, the delay information includes the second historical head movement position and the first uncertainty coefficient sent by the audio content provider, the second historical head movement position being the measured head movement position at the time the second rendering information was sent, the first uncertainty coefficient coming from the second rendering information, which was sent earlier than the first rendering information; the prediction module 1302 is specifically configured to take the second historical head movement position as the first historical head movement position.
In a possible implementation, the prediction module 1302 is specifically configured to obtain the head movement change rate, derived from the N previously obtained measured head movement positions, N > 1; obtain the second uncertainty coefficient from the change rate and the first uncertainty coefficient; and input the second uncertainty coefficient, the first historical head movement position and the current measured head movement position into a Kalman filter to obtain the predicted head movement position.
In a possible implementation, the receiving module 1304 is configured to receive first information sent by the audio content provider, the first information including the audio format, the first audio stream and the delay information, where the audio format indicates whether the audio source is in a stereo or non-stereo format; when the audio format indicates non-stereo, the first audio stream includes the first binaural signal rendered by the audio content provider; when it indicates stereo, the first audio stream includes the audio source. The rendering module 1305 is configured to, when the audio format indicates stereo, obtain the measured head movement position through a sensor and render the current frame (one frame of the audio source) according to the measured head movement position to obtain the second binaural signal. The playback module 1306 is configured to play audio according to the target binaural signal, which includes the first binaural signal or the second binaural signal.
In a possible implementation, the rendering module 1305 is specifically configured to render the direct-sound part of the current frame with the head-tracking effect according to the measured head movement position, and to render the early reflections (ER) and late reverberation (LR) of the current frame without the head-tracking effect, to obtain the second binaural signal.
The apparatus of this embodiment may be used to execute the technical solution performed by the audio signal playback end in the method embodiments of FIG. 3 or FIG. 4; its implementation principle and technical effects are similar and are not repeated here.
FIG. 14 is an exemplary structural diagram of a spatial audio rendering apparatus 1400 of this application. As shown in FIG. 14, the apparatus 1400 of this embodiment may be applied to the audio content provider. The apparatus 1400 may include an acquisition module 1401, a rendering module 1402 and a sending module 1403, where:
the acquisition module 1401 is configured to obtain the current frame (one frame of an audio source) when the first rendering mode is used, and to obtain the predicted head movement position when the audio format of the audio source is non-stereo; the rendering module 1402 is configured to render the current frame according to the predicted head movement position to obtain the first binaural signal; and the sending module 1403 is configured to send first information to the audio signal playback end, the first information including the first audio stream, the audio format of the audio source, and the delay information, the first audio stream including the first binaural signal.
In a possible implementation, the rendering module 1402 is specifically configured to render the direct-sound part, the early-reflection (ER) part and the late-reverberation (LR) part of the current frame, each with the head-tracking effect, according to the predicted head movement position, to obtain the first binaural signal.
In a possible implementation, the acquisition module 1401 is specifically configured to obtain the predicted head movement position from rendering information sent by the audio signal playback end.
In a possible implementation, the delay information includes the delay.
In a possible implementation, the rendering information further includes the measured head movement position at the time the playback end sent the rendering information; correspondingly, the delay information includes that measured head movement position.
In a possible implementation, the rendering information further includes the measured head movement position at the time the playback end sent the rendering information and the first uncertainty coefficient; correspondingly, the delay information includes the measured head movement position and the first uncertainty coefficient.
In a possible implementation, when the audio format of the audio source is stereo, the first audio stream includes the audio source.
In a possible implementation, the acquisition module 1401 is further configured to obtain the current frame when the second rendering mode is used; the rendering module 1402 is further configured to render the current frame without the head-tracking effect to obtain a second binaural signal; and the sending module 1403 is further configured to send second information to the audio signal playback end, the second information including a second audio stream that includes the second binaural signal.
The apparatus of this embodiment may be used to execute the technical solution performed by the audio content provider in the method embodiments of FIG. 3 or FIG. 4; its implementation principle and technical effects are similar and are not repeated here.
In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in a processor or by instructions in software form. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be embodied as being executed directly by a hardware coding processor, or by a combination of hardware and software modules in a coding processor. The software module may reside in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
The memory mentioned in the above embodiments may be volatile memory or non-volatile memory, or may include both. Non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM) or flash memory. Volatile memory may be random access memory (RAM), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double-data-rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct rambus RAM (DR RAM). The memory of the systems and methods described herein is intended to include, without limitation, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical functional division, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, may exist physically on their own, or two or more units may be integrated into one unit.
If implemented in the form of software functional units and sold or used as independent products, the functions may be stored in a computer-readable storage medium. On this understanding, the technical solution of this application, or the part contributing to the prior art, or part of the technical solution, may be embodied as a software product stored on a storage medium, including several instructions for causing a computer device (a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.