CN118802874A - Audio data transmission method, device, electronic device and readable storage medium

Info

Publication number: CN118802874A
Application number: CN202311676524.8A
Authority: CN
Inventors: 刘倍余; 胡苏�; 顾明; 饶明佺; 韩建
Original assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-10-18

Abstract

The application provides an audio data transmission method, an audio data transmission device, electronic equipment and a readable storage medium, wherein the method comprises the following steps: receiving a real-time transport protocol RTP data packet; the RTP data packet carries first audio data and first indication information, wherein the first indication information is used for indicating that the encoding mode of the first audio data is an audio-video encoding-decoding standard AVS encoding mode; and decoding the first audio data according to the first indication information. The application can realize real-time transmission of the audio data obtained by adopting the AVS coding mode.

Description

Audio data transmission method, device, electronic device and readable storage medium

Technical Field

The present application relates to the field of communication technology, and in particular to an audio data transmission method, device, electronic device and readable storage medium.

Background Art

The rapid development of real-time audio and video transmission has brought users greater convenience. For example, live streaming, video conferencing and remote control built on Web Real-Time Communications (WebRTC) provide convenient real-time audio and video communication scenarios. The Audio Video coding Standard (AVS) comprises four main technical standards (system, video, audio, and digital rights management) together with supporting standards such as conformance testing. The AVS encoding method offers high coding efficiency and low implementation complexity, and therefore has good application prospects. However, there is currently no solution for real-time transmission of audio data obtained using the AVS encoding method.

Summary of the Invention

The present application provides an audio data transmission method, device, electronic device and readable storage medium, which can solve the problem that there is currently no solution for real-time transmission of audio data obtained using the AVS encoding method.

An embodiment of the present application provides an audio data transmission method, which is applied to a receiving-side device, and the method includes:

receiving a Real-time Transport Protocol (RTP) data packet; wherein the RTP data packet carries first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is the Audio Video coding Standard (AVS) encoding method;

decoding the first audio data according to the first indication information.

Optionally, before receiving the RTP data packet, the method further includes:

determining encoding configuration information through a media negotiation process with a sending-side device; wherein the encoding configuration information includes encoding name information, and the encoding name information is used to indicate the AVS encoding method.

Optionally, the encoding configuration information further includes at least one of the following:

Encoder identification information;

Encoder configuration parameters;

Bit rate;

The encoder identification information includes at least one of the following:

First identification information, used to indicate the encoding category used to encode the audio data;

Second identification information, used to indicate whether or not the encoder includes a neural network model;

Third identification information, used to indicate the category of the neural network model in the encoder.

Optionally, the RTP data packet also carries at least one of the following:

Sequence number;

Timestamp; wherein the timestamp is the same in different RTP data packets carrying the first audio data of the same frame;

Marker bit; wherein, in an RTP data packet carrying the target audio data of each frame, the marker bit is a first value; and/or, in an RTP data packet carrying first audio data of each frame other than the target audio data, the marker bit is a second value; the target audio data includes the first and/or the last first audio data in each frame.

Optionally, decoding the first audio data according to the first indication information includes:

when the timestamps carried in multiple received RTP data packets are the same, combining the first audio data carried in the multiple RTP data packets to obtain combined audio data;

decoding the combined audio data according to the first indication information.

Optionally, decoding the first audio data according to the first indication information includes:

when the marker bit carried in a received first RTP data packet is the first value, determining at least one second RTP data packet; wherein the timestamp carried in each second RTP data packet is the same as the timestamp carried in the first RTP data packet;

combining the first audio data carried in multiple RTP data packets to obtain combined audio data; wherein the multiple RTP data packets include the first RTP data packet and the at least one second RTP data packet;

decoding the combined audio data according to the first indication information.

Optionally, when the marker bit carried in the received first RTP data packet is the first value, determining at least one second RTP data packet includes:

when the marker bit carried in the received first RTP data packet is the first value, determining a target RTP data packet; wherein the marker bit carried in the target RTP data packet is the first value, and the timestamp carried in the target RTP data packet is different from the timestamp carried in the first RTP data packet;

determining, as the second RTP data packet, an RTP data packet whose sequence number lies between a first sequence number and a second sequence number and whose timestamp is the same as the timestamp carried in the first RTP data packet; wherein the first sequence number is the sequence number carried in the first RTP data packet, and the second sequence number is the sequence number carried in the target RTP data packet.
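The selection rule above can be sketched as follows. This is a minimal illustration only, not the applicant's implementation: the `RtpPacket` class and the `select_second_packets` function are hypothetical names, the marker value 1 is assumed to be the "first value", and sequence-number wraparound is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class RtpPacket:
    seq: int        # RTP sequence number
    timestamp: int  # RTP timestamp
    marker: int     # marker bit (assumption: 1 plays the role of the "first value")
    payload: bytes  # one piece of AVS-encoded audio


def select_second_packets(received, first, target):
    """Return packets lying strictly between `first` and `target` by
    sequence number that share the timestamp of `first`.

    Assumption: sequence numbers do not wrap around within the window.
    """
    lo, hi = sorted((first.seq, target.seq))
    return [
        p for p in received
        if lo < p.seq < hi and p.timestamp == first.timestamp
    ]
```

Here `target` is the marked packet of a neighbouring frame (different timestamp), so every packet between the two marked packets that shares the timestamp of `first` belongs to the same frame as `first`.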

Optionally, combining the first audio data carried in the multiple RTP data packets to obtain combined audio data includes:

when the sequence numbers carried in the multiple RTP data packets are consecutive, combining the first audio data carried in the multiple RTP data packets in increasing or decreasing order of the sequence numbers to obtain the combined audio data.

Optionally, the audio data transmission method further includes at least one of the following:

when the sequence numbers carried in the multiple RTP data packets are not consecutive, discarding the multiple RTP data packets and resuming reception of RTP data packets;

when the marker bit carried in a received RTP data packet is the second value, continuing to receive RTP data packets until the marker bit carried in a received RTP data packet is the first value, or discarding the RTP data packets already received if a timer expires.
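The combining and discarding rules above can be sketched as follows. This is a minimal illustration under stated assumptions: `reassemble_frame` is a hypothetical helper, its input is a list of `(sequence_number, payload)` tuples that already share one timestamp, increasing sequence order is chosen, and sequence-number wraparound and the timer are not modelled.

```python
def reassemble_frame(packets):
    """Combine payloads in increasing sequence-number order.

    `packets` is a list of (sequence_number, payload) tuples belonging to
    one frame. Returns the combined bytes, or None when the sequence
    numbers are not consecutive (the caller then discards the packets).
    """
    ordered = sorted(packets, key=lambda item: item[0])
    for (prev_seq, _), (next_seq, _) in zip(ordered, ordered[1:]):
        if next_seq - prev_seq != 1:
            return None  # gap detected: a fragment is missing
    return b"".join(payload for _, payload in ordered)
```

Returning `None` rather than raising keeps the discard decision with the caller, which in the described scheme simply drops the fragments and waits for the next frame.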

Optionally, receiving the RTP data packet includes:

receiving the RTP data packet through a WebRTC media channel.

An embodiment of the present application provides an audio data transmission method, which is applied to a sending-side device, and the method includes:

obtaining first audio data; wherein the first audio data is obtained by encoding based on the AVS encoding method;

encapsulating the first audio data according to RTP to obtain an RTP data packet; wherein the RTP data packet carries the first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is the AVS encoding method;

sending the RTP data packet.

Optionally, before obtaining the first audio data, the method further includes:

determining encoding configuration information through a media negotiation process with a receiving-side device; wherein the encoding configuration information includes encoding name information, and the encoding name information is used to indicate the AVS encoding method.

Optionally, the encoding configuration information further includes at least one of the following:

Encoder identification information;

Encoder configuration parameters;

Bit rate;

Optionally, obtaining the first audio data includes:

obtaining initial audio data;

encoding the initial audio data using the AVS encoding method according to the encoding configuration information to obtain encoded audio data;

splitting the encoded audio data into multiple pieces of the first audio data.

Optionally, encapsulating the first audio data according to RTP to obtain an RTP data packet includes:

encapsulating each piece of the first audio data separately according to RTP to obtain multiple RTP data packets; wherein each RTP data packet carries one piece of the first audio data and the first indication information.
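The splitting and encapsulation steps can be sketched as follows. This is an illustrative sketch rather than the applicant's packetizer: `packetize` and its parameters are hypothetical, the 12-byte header follows the standard fixed RTP header layout (version 2, no padding, no extension, no CSRC entries), and setting the marker bit to 1 on the last fragment of a frame is one possible choice of "first value" and "target audio data".

```python
import struct


def packetize(frame, max_payload, payload_type, first_seq, timestamp, ssrc):
    """Split one encoded frame into RTP packets that share a timestamp.

    Assumptions: payload_type fits in 7 bits, the marker bit is 1 only on
    the last fragment, and the header carries no extension or CSRC list.
    """
    chunks = [frame[i:i + max_payload] for i in range(0, len(frame), max_payload)]
    packets = []
    for index, chunk in enumerate(chunks):
        marker = 1 if index == len(chunks) - 1 else 0
        header = struct.pack(
            "!BBHII",
            0x80,                           # V=2, P=0, X=0, CC=0
            (marker << 7) | payload_type,   # marker bit + payload type
            (first_seq + index) & 0xFFFF,   # 16-bit sequence number, wraps
            timestamp & 0xFFFFFFFF,         # same timestamp for every fragment
            ssrc,
        )
        packets.append(header + chunk)
    return packets
```

The payload type written into each header is the value agreed during media negotiation, which is how each packet carries the indication that its payload is AVS-encoded.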

Sequence number;

Optionally, sending the RTP data packet includes:

sending the RTP data packet through a WebRTC media channel.

An embodiment of the present application provides an audio data transmission device, which is applied to a receiving-side device, including:

a receiving module, configured to receive an RTP data packet; wherein the RTP data packet carries first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is the AVS encoding method;

a decoding module, configured to decode the first audio data according to the first indication information.

An embodiment of the present application provides an audio data transmission device, which is applied to a sending-side device, including:

an obtaining module, configured to obtain first audio data; wherein the first audio data is obtained by encoding based on the AVS encoding method;

an encapsulation module, configured to encapsulate the first audio data according to RTP to obtain an RTP data packet; wherein the RTP data packet carries the first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is the AVS encoding method;

a sending module, configured to send the RTP data packet.

An embodiment of the present application provides an electronic device, including: a processor, a memory, and a computer program stored in the memory and executable on the processor; wherein, when the processor executes the computer program, the steps of the audio data transmission method of the receiving-side device described above are implemented; or, when the processor executes the computer program, the steps of the audio data transmission method of the sending-side device described above are implemented.

An embodiment of the present application provides a readable storage medium on which a program is stored. When the program is executed by a processor, the steps of the audio data transmission method of the receiving-side device described above are implemented; or, when the program is executed by a processor, the steps of the audio data transmission method of the sending-side device described above are implemented.

In the embodiments of the present application, the first audio data obtained using the AVS encoding method is carried in an RTP data packet, and the RTP data packet indicates that the encoding method of the first audio data is the AVS encoding method. The receiving-side device can therefore decode the first audio data according to the first indication information, which achieves real-time transmission of AVS-encoded audio data and solves the problem that no such solution currently exists.

Brief Description of the Drawings

FIG. 1 is a flow chart of the audio transmission method of the receiving-side device according to an embodiment of the present application;

FIG. 2 is a schematic diagram of the RTP data packet header structure according to an embodiment of the present application;

FIG. 3 is a flow chart of the audio transmission method of the sending-side device according to an embodiment of the present application;

FIG. 4 is an overall flow chart of the audio transmission method according to an embodiment of the present application;

FIG. 5 is a block diagram of the audio transmission device of the receiving-side device according to an embodiment of the present application;

FIG. 6 is a block diagram of the audio transmission device of the sending-side device according to an embodiment of the present application;

FIG. 7 is a block diagram of the electronic device according to an embodiment of the present application.

Detailed Description

To make the technical problems to be solved, the technical solutions and the advantages of the present application clearer, a detailed description is given below in conjunction with the accompanying drawings and specific embodiments. In the following description, specific details such as particular configurations and components are provided only to aid a full understanding of the embodiments of the present application. It should therefore be clear to those skilled in the art that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. In addition, descriptions of known functions and structures are omitted for clarity and brevity.

It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic associated with the embodiment is included in at least one embodiment of the present application. Therefore, occurrences of "in one embodiment" or "in an embodiment" throughout the specification do not necessarily refer to the same embodiment. In addition, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In the various embodiments of the present application, it should be understood that the magnitude of the sequence numbers of the processes below does not imply their order of execution. The order of execution of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application. In addition, the terms "system" and "network" are often used interchangeably herein.

In the embodiments provided in the present application, it should be understood that "B corresponding to A" means that B is associated with A and that B can be determined according to A. It should also be understood, however, that determining B according to A does not mean determining B according to A alone; B can also be determined according to A and/or other information.

As shown in FIG. 1, an embodiment of the present application provides an audio data transmission method, which is applied to a receiving-side device. Optionally, the receiving-side device may be a device that supports receiving audio data, or that supports receiving and outputting audio data. The receiving-side device is not limited to receiving or outputting audio data only: it may also support receiving or outputting video data, or sending audio data and/or video data. The embodiments of the present application are not limited in this respect.

Specifically, the method includes the following steps:

Step 11: Receive an RTP data packet; wherein the RTP data packet carries first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is the AVS encoding method.

It should be noted that AVS comprises four main technical standards (system, video, audio, and digital rights management) together with supporting standards such as conformance testing. The AVS encoding method referred to in the embodiments of the present application mainly means an AVS-based audio encoding method, including but not limited to the audio encoding method of AVS1.0, AVS2.0, AVS3.0, or other versions of AVS. The embodiments of the present application are not limited in this respect.

Optionally, the first indication information may be carried in the header of the RTP data packet.

Step 12: Decode the first audio data according to the first indication information.

Optionally, the receiving-side device can decapsulate the RTP data packet and read the first indication information and the first audio data carried in it. Once the first indication information shows that the first audio data was encoded using the AVS encoding method, the device can decode the first audio data with the corresponding decoding method and then output the decoded audio data.
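Decapsulation on the receiving side can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: `parse_rtp`, `is_avs_audio` and `AVS_PAYLOAD_TYPE` are hypothetical names, the indication is assumed to be the payload type in the standard fixed RTP header, the value 111 is only an example of a negotiated dynamic payload type, and header extensions and padding are not handled.

```python
import struct

AVS_PAYLOAD_TYPE = 111  # hypothetical value agreed during media negotiation


def parse_rtp(packet):
    """Parse the fixed RTP header and return its fields plus the payload.

    Assumptions: no header extension and no padding are present.
    """
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if b0 >> 6 != 2:
        raise ValueError("unsupported RTP version")
    csrc_count = b0 & 0x0F
    header_len = 12 + 4 * csrc_count  # skip any CSRC identifiers
    return {
        "marker": b1 >> 7,
        "payload_type": b1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[header_len:],
    }


def is_avs_audio(fields):
    """True when the payload type matches the negotiated AVS payload type."""
    return fields["payload_type"] == AVS_PAYLOAD_TYPE
```

A receiver would pass the payload to an AVS audio decoder only when `is_avs_audio` returns true; the decoder itself is outside the scope of this sketch.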

In the above scheme, the first audio data obtained using the AVS encoding method is carried in an RTP data packet, and the RTP data packet indicates that the encoding method of the first audio data is the AVS encoding method. The receiving-side device can therefore decode the first audio data according to the first indication information, which achieves real-time transmission of AVS-encoded audio data and solves the problem that no such solution currently exists.

In this embodiment, the sending-side device and the receiving-side device can use a media negotiation process to agree on the encoding configuration of the audio data transmitted between them. For example, the media negotiation process can be based on the Session Description Protocol (SDP). SDP conveys media stream information in a media session: the two parties to a call (the sending-side device and the receiving-side device) can use SDP to negotiate whether capabilities such as encoding and decoding match, and SDP allows the recipient of a session description to participate in the session.

For example, when AVS codec negotiation is implemented based on SDP, the encoding name information can be carried as follows:

In the SDP media description, the media line ("m=") can describe information such as the media type, transport type, and payload type. The media name field indicates the media type. For example, when the media name corresponds to the Multipurpose Internet Mail Extensions (MIME) media type audio, that is, "m=audio", the information described from the m line onward relates to audio data.

In the SDP media format description, the payload type value in "a=rtpmap" corresponds to the payload type (PT) in the RTP packet header. The payload type is defined by the rtpmap field, and its attribute information is defined by the format parameters (fmtp) field that follows it. When describing information related to audio data, "a=rtpmap" can be used to indicate the AVS encoding method. For example, when the encoding name field of "a=rtpmap" gives the MIME subtype name "AV3A-AATF", the AVS encoding method is used. Specifically, a=rtpmap is described as follows:

a=rtpmap:<payload type> <encoding name>/<clock rate>[/<encoding parameters>]

For example, in "a=rtpmap:111 AV3A-AATF/44100/5" the encoding name field is AV3A-AATF, which indicates that the AVS encoding method is used.

Optionally, the clock rate field of "a=rtpmap" can also be used to indicate the clock frequency of the audio, in Hz. For example, in "a=rtpmap:111 AV3A-AATF/44100/5" the clock rate field indicates that the clock frequency of the AVS-encoded audio is 44100 Hz.
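Reading the fields of such an rtpmap attribute can be sketched as follows. This is a minimal sketch: `parse_rtpmap` is a hypothetical helper, it assumes the standard SDP layout in which a single space separates the payload type from the encoding name, and it performs no validation beyond that layout.

```python
def parse_rtpmap(line):
    """Split an 'a=rtpmap:' attribute into its fields.

    Expected layout:
    a=rtpmap:<payload type> <encoding name>/<clock rate>[/<encoding parameters>]
    """
    prefix = "a=rtpmap:"
    if not line.startswith(prefix):
        raise ValueError("not an rtpmap attribute")
    payload_type, _, rest = line[len(prefix):].partition(" ")
    parts = rest.split("/")
    if len(parts) < 2:
        raise ValueError("rtpmap needs an encoding name and a clock rate")
    return {
        "payload_type": int(payload_type),
        "encoding_name": parts[0],
        "clock_rate": int(parts[1]),
        "encoding_parameters": parts[2] if len(parts) > 2 else None,
    }


def uses_avs(rtpmap):
    """True when the encoding name is the AVS name used in this document."""
    return rtpmap["encoding_name"] == "AV3A-AATF"
```

Applied to the example line from the text, the helper yields payload type 111, encoding name AV3A-AATF, clock rate 44100 and encoding parameters "5".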

Optionally, the encoding configuration information further includes at least one of the following:

Encoder identification information; for example, the encoder identification information indicates the type of encoder used with the AVS encoding method (such as the encoding category used to encode the audio data, whether the encoder includes a neural network model, and, if it does, the category of that neural network model).

Encoder configuration parameters; for example, the encoder configuration parameters indicate the configuration of the encoder used with the AVS encoding method (such as the number of audio channels).

Bit rate; for example, the bit rate indicates the bit rate of the audio stream when the AVS encoding method is used.

In the SDP media format description, "a=fmtp" can describe information about the encoder, such as the encoder identification information (codec-nn-id), the encoder configuration parameters (config), and the bit rate (bitrate). Different audio formats can have different parameter configurations.
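Reading these parameters from an fmtp attribute can be sketched as follows. The source names the parameters (codec-nn-id, config, bitrate) but does not give the exact attribute syntax, so this sketch assumes the common SDP convention of semicolon-separated `key=value` pairs after the payload type; `parse_fmtp` is a hypothetical helper and the sample values in the usage note are made up for illustration.

```python
def parse_fmtp(line):
    """Split an 'a=fmtp:' attribute into (payload_type, parameters).

    Assumed layout: a=fmtp:<payload type> key=value;key=value;...
    """
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an fmtp attribute")
    payload_type, _, params = line[len(prefix):].partition(" ")
    result = {}
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        result[key.strip()] = value.strip()
    return int(payload_type), result
```

For instance, a line such as `a=fmtp:111 codec-nn-id=0x0201;bitrate=192` would yield payload type 111 with the two parameters as strings; interpreting them is left to the caller.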

Optionally, the encoder identification information includes at least one of the following:

First identification information, used to indicate the encoding category used to encode the audio data; for example, the encoding categories include but are not limited to general high-bitrate audio coding data, lossless audio coding data, and general full-bitrate audio coding data.

Second identification information, used to indicate whether or not the encoder includes a neural network model. For example, a 1-bit value of 0 can indicate that no neural network model is included and a value of 1 that one is included; alternatively, a value of 1 can indicate that no neural network model is included and a value of 0 that one is included. The embodiments of the present application are not limited in this respect.

Third identification information, used to indicate the category of the neural network model in the encoder. For example, when the second identification information indicates that the encoder includes a neural network model, the third identification information can indicate the category of that model. Alternatively, when the encoding configuration information includes the third identification information but not the second identification information, this means that the encoder includes a neural network model whose category is given by the third identification information (for example, a low-complexity neural network model).

For example, when AVS codec negotiation is implemented based on SDP, the above encoding configuration information can be carried as follows:

codec-nn-id field: two bytes, written in hexadecimal, can be used to describe the type of encoder.

The first byte (MSB) uses the audio coding identifier (audio_codec_id) to describe the encoding category of the audio media resource (that is, the encoding category used to encode the audio data). For example: audio_codec_id equal to 0 indicates general high-bitrate audio coding data; audio_codec_id equal to 1 indicates lossless audio coding data; audio_codec_id equal to 2 indicates general full-bitrate audio coding data; the remaining values of audio_codec_id can be reserved.

The second byte can be used to describe the neural network model (nn_type) in the encoder, including whether the encoder includes a neural network model and/or the category of the neural network model. For example, a value of 0 in the second byte indicates that the encoder does not include a neural network model; a value of 1 indicates that the encoder includes a neural network model and uses a low-complexity neural network model. The embodiments of the present application are not limited in this respect.

For example: for a general high-bitrate encoder with audio_codec_id 0 that does not include a neural network model, codec-nn-id = 0x0000. For a lossless audio encoder with audio_codec_id 1 that does not include a neural network model, codec-nn-id = 0x0100. For a general full-bitrate encoder with audio_codec_id 2 that uses a low-complexity neural network model (nn_type 1), codec-nn-id = 0x0201.
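The two-byte layout of codec-nn-id can be sketched as follows. This is a minimal sketch: the function names are hypothetical, and the textual form (a `0x` prefix followed by four hexadecimal digits) is taken from the examples in the text.

```python
def encode_codec_nn_id(audio_codec_id, nn_type):
    """Pack audio_codec_id (first byte, MSB) and nn_type (second byte)
    into the four-digit hexadecimal form used in the examples."""
    if not (0 <= audio_codec_id <= 0xFF and 0 <= nn_type <= 0xFF):
        raise ValueError("each field must fit in one byte")
    return f"0x{(audio_codec_id << 8) | nn_type:04X}"


def decode_codec_nn_id(text):
    """Return (audio_codec_id, nn_type) from a hexadecimal codec-nn-id."""
    value = int(text, 16)
    return value >> 8, value & 0xFF
```

With these helpers, the three cases in the text map to `encode_codec_nn_id(0, 0)`, `encode_codec_nn_id(1, 0)` and `encode_codec_nn_id(2, 1)`.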

Config字段：可以使用16进制的字符串描述编码器的配置参数。Config field: A hexadecimal string can be used to describe the configuration parameters of the encoder.

以AVS第6部分(AVSP6)中的CA3特定框架(CA3SpecificBox)数据结构为例，Config字段对应AVSP6中的CA3SpecificBox数据结构。其中CA3SpecificBox数据结构如下：Taking the CA3SpecificBox data structure in AVS Part 6 (AVSP6) as an example, the Config field corresponds to the CA3SpecificBox data structure in AVSP6. The CA3SpecificBox data structure is as follows:

In the above data structure: when audio_codec_id is 2, it corresponds to the AVS3 audio full-bitrate (GA) encoding configuration; when audio_codec_id is 0, it corresponds to the AVS3 audio high-bitrate (GH) encoding configuration; when audio_codec_id is 1, it corresponds to the AVS3 audio lossless (LL) encoding configuration.

bitrate字段：用于描述音频流的码率，表示总码率，单位为kb/s。以AVS3版本的第6部分(AVS3P6)中的AVS3音频全码率编码特定配置(Avs3AudioGASpecificConfig)数据结构为例，bitrate字段对应Avs3AudioGASpecificConfig中定义的总码率(total_bitrate)。其中，Avs3AudioGASpecificConfig数据结构如下：bitrate field: used to describe the bitrate of the audio stream, indicating the total bitrate in kb/s. Taking the AVS3 audio full bitrate encoding specific configuration (Avs3AudioGASpecificConfig) data structure in Part 6 (AVS3P6) of the AVS3 version as an example, the bitrate field corresponds to the total bitrate (total_bitrate) defined in Avs3AudioGASpecificConfig. Among them, the Avs3AudioGASpecificConfig data structure is as follows:

上述数据结构中：“……”表示其他配置参数，本申请实施例不做限定；“unsignedint(16)total_bitrate”用于描述采用AVS编码方式时，音频流的总码率。“unsigned int(2)resolution”用于描述分辨率。In the above data structure, "..." represents other configuration parameters, which are not limited in the present embodiment; "unsigned int (16) total_bitrate" is used to describe the total bit rate of the audio stream when the AVS encoding method is adopted. "unsigned int (2) resolution" is used to describe the resolution.

Optionally, when AVS codec negotiation is performed over SDP, the rate for the AVS encoding method also needs to be configured. The rate corresponds to the RTP timestamp clock and can be the same as the audio sampling rate; if it is not specified during negotiation, a default value (for example, 90000) can be used.
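To make the negotiated fields concrete, the sketch below reads them from a pair of SDP attribute lines. Only the field names (codec-nn-id, Config, bitrate), the "AV3A-AATF" identifier, and the default rate of 90000 come from the text. The use of "AV3A-AATF" as the `a=rtpmap` encoding name, the `a=fmtp` key=value syntax, the lowercase key `config`, the dynamic payload type 96, and all sample values are assumptions made for illustration; `parse_avs_audio_sdp` is a hypothetical helper.

```python
DEFAULT_CLOCK_RATE = 90000  # default named in the text when no rate is negotiated


def parse_avs_audio_sdp(lines):
    """Return payload type, clock rate and fmtp parameters for the AVS audio entry."""
    rtpmap, fmtp = {}, {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("a=rtpmap:"):
            pt, _, encoding = line[len("a=rtpmap:"):].partition(" ")
            rtpmap[int(pt)] = encoding.split("/")
        elif line.startswith("a=fmtp:"):
            pt, _, params = line[len("a=fmtp:"):].partition(" ")
            pairs = (item.strip().partition("=") for item in params.split(";"))
            fmtp[int(pt)] = {key.lower(): value for key, sep, value in pairs if sep}
    for pt, parts in rtpmap.items():
        if parts[0].upper() == "AV3A-AATF":  # assumed encoding name
            rate = int(parts[1]) if len(parts) > 1 else DEFAULT_CLOCK_RATE
            return {"payload_type": pt, "clock_rate": rate, "params": fmtp.get(pt, {})}
    return None  # no AVS audio entry was offered


# Hypothetical offer; "0A0B" is a placeholder, not a real CA3SpecificBox dump.
offer = [
    "a=rtpmap:96 AV3A-AATF/48000",
    "a=fmtp:96 codec-nn-id=0x0201;config=0A0B;bitrate=192",
]
print(parse_avs_audio_sdp(offer))
```

A receiver that finds no matching entry would fall back to whatever codecs the two sides do have in common.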

可选地，所述对所述第一音频数据进行解码处理，包括：Optionally, the decoding the first audio data includes:

根据所述第一指示信息和所述编码配置信息，对所述第一音频数据进行解码处理。The first audio data is decoded according to the first indication information and the encoding configuration information.

该实施例中，在发送侧设备与接收侧设备基于媒体协商过程确定了AVS编码方式的编码配置信息的情况下，如果接收侧设备接收到的RTP数据包中通过第一指示信息指示了其承载的第一音频数据的编码方式为AVS编码方式，那么接收侧设备即可以基于该AVS编码方式的编码配置信息，对该第一音频数据进行解码，即实现基于AVS编码方式获得的音频数据的实时传输。In this embodiment, when the sending side device and the receiving side device determine the encoding configuration information of the AVS encoding method based on the media negotiation process, if the RTP data packet received by the receiving side device indicates through the first indication information that the encoding method of the first audio data it carries is the AVS encoding method, then the receiving side device can decode the first audio data based on the encoding configuration information of the AVS encoding method, that is, realize real-time transmission of the audio data obtained based on the AVS encoding method.

Sequence number;

For example, when the sending side device sends multiple RTP data packets to the receiving side device, the sequence numbers of these packets are consecutive. When one frame of audio data corresponds to multiple first audio data (that is, the frame is divided into multiple first audio data), the sequence numbers carried in the RTP data packets of those first audio data are consecutive and follow the time order of the audio data itself, and the timestamps carried in those RTP data packets are identical. Correspondingly, the sequence numbers carried in the RTP data packets for the first audio data of different frames can also be consecutive in the time order of the frames (for example, the sequence numbers of RTP data packets within one RTP session are consecutive), while the timestamps carried for different frames differ. The receiving side device can therefore use the timestamp to identify audio data belonging to the same frame and use the sequence number to detect packet loss. In addition, setting the marker bit to the first value in the RTP data packet of the first and/or last first audio data of a frame ensures that the receiving side device can accurately identify the audio data belonging to the same frame.

如图2所示，给出了RTP包头(RTP Header)结构的示意图，具体描述如下：As shown in Figure 2, a schematic diagram of the RTP header structure is given, which is described in detail as follows:

V字段：用于指示RTP的版本，例如：V＝2用于指示RTP版本2。V field: used to indicate the version of RTP, for example: V=2 is used to indicate RTP version 2.

P field: 1 bit, indicates padding. If the P bit is set, the RTP data packet contains one or more padding bytes at its end; these padding bytes are not part of the payload.

X字段：为1比特，用于设置扩展比特。如果通过X字段设置扩展比特，则固定头后面跟随一个扩展头(即在RTP包头后跟有一个扩展包头)。X field: 1 bit, used to set the extension bit. If the extension bit is set through the X field, an extension header follows the fixed header (that is, an extension header follows the RTP header).

CC字段：为4比特，用于指贡献源(CSRC)计数，即跟在固定头后面的CSRC识别符的数目。CC field: 4 bits, used to indicate the contributing source (CSRC) count, that is, the number of CSRC identifiers following the fixed header.

M field: 1 bit, used to mark significant events in the bit stream; for example, the marker bit described above can be carried in the M field.

PT字段：为7比特，用于指示RTP数据包所承载的负载格式，比如可以在该PT字段中承载所述第一指示信息。PT field: 7 bits, used to indicate the payload format carried by the RTP data packet. For example, the first indication information can be carried in the PT field.

序列号(sequence number)：为16比特，用于指示RTP数据包的序列。Sequence number: 16 bits, used to indicate the sequence of the RTP data packet.

时间戳(timestamp)：为32比特，用于指示RTP数据包的采样时间，比如针对同一帧的不同RTP数据包中，其时间戳相同。Timestamp: 32 bits, used to indicate the sampling time of the RTP data packet. For example, different RTP data packets of the same frame have the same timestamp.

Synchronization source (SSRC) identifier field: 32 bits, used to identify the synchronization source. The value is chosen randomly and must be unique within an RTP session. For example, two synchronization sources participating in the same video conference cannot have the same SSRC identifier.

Contributing source (CSRC) identifier field: a list of 0 to 15 items (the count is given by the 4-bit CC field), each 32 bits, identifying all contributing sources of the payload carried by this RTP data packet.
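The header fields listed above can be read with a few bit operations. The sketch below parses the 12-byte fixed header and the CSRC list as laid out in RFC 3550; the `RtpHeader` type and `parse_rtp_header` function are names chosen for this illustration, and header extensions are not handled.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrcs: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode V, P, X, CC, M, PT, sequence number, timestamp, SSRC and CSRC list."""
    if len(packet) < 12:
        raise ValueError("an RTP fixed header is 12 bytes long")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F                      # CC: low 4 bits of the first byte
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("packet is shorter than its CSRC list")
    csrcs = struct.unpack(f"!{cc}I", packet[12:end])
    return RtpHeader(
        version=b0 >> 6,                # V: top 2 bits
        padding=bool(b0 & 0x20),        # P
        extension=bool(b0 & 0x10),      # X
        csrc_count=cc,
        marker=bool(b1 & 0x80),         # M: top bit of the second byte
        payload_type=b1 & 0x7F,         # PT: low 7 bits
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
    )
```

In the scheme described here, a receiver would look at `payload_type` for the first indication information and at `marker` for the frame-boundary flag.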

该实施例中，由于发送侧设备针对采用AVS编码方式的音频数据进行RTP封装时，针对同一帧音频数据划分成了多个第一音频数据(这里第一音频数据也可以理解为是一帧音频数据中的一个音频片段，其可以包括音频相关信息和/或音频本身的数据)，并在每个第一音频数据对应的RTP数据包中承载的时间戳设置相同数值，因此对于接收侧设备来说，可以根据时间戳识别到属于同一帧的第一音频数据。这样接收侧设备可以针对同一帧的多个第一音频数据进行组帧后，再进行解码处理，从而可以解码得到一帧音频数据并输出。In this embodiment, when the sending side device performs RTP encapsulation for the audio data using the AVS encoding method, the same frame of audio data is divided into multiple first audio data (here, the first audio data can also be understood as an audio segment in a frame of audio data, which can include audio-related information and/or audio data itself), and the timestamp carried in the RTP data packet corresponding to each first audio data is set to the same value, so for the receiving side device, the first audio data belonging to the same frame can be identified according to the timestamp. In this way, the receiving side device can group the multiple first audio data of the same frame into frames, and then perform decoding processing, so that a frame of audio data can be decoded and output.

将多个RTP数据包中承载的第一音频数据进行组合，得到组合后的音频数据；其中，所述多个RTP数据包包括：所述第一RTP数据包和所述至少一个第二RTP数据包(这里也即是将所述第一RTP数据包和所述至少一个第二RTP数据包中承载的第一音频数据进行组合，得到组合后的音频数据)；Combining the first audio data carried in a plurality of RTP data packets to obtain combined audio data; wherein the plurality of RTP data packets include: the first RTP data packet and the at least one second RTP data packet (that is, combining the first audio data carried in the first RTP data packet and the at least one second RTP data packet to obtain the combined audio data);

该实施例中，由于发送侧设备针对采用AVS编码方式的音频数据进行RTP封装时，针对同一帧音频数据划分成的多个第一音频数据中，第一个和/或最后一个第一音频数据对应的RTP数据包中承载的标记位为第一值，因此对于接收侧设备来说，可以基于该标记位判断是否完整接收到属于同一帧的多个第一音频数据。这样接收侧设备针对同一帧的多个第一音频数据进行组帧后，再进行解码处理，从而可以保证解码得到一帧完整的音频数据并输出。In this embodiment, when the sending side device performs RTP encapsulation for the audio data using the AVS encoding method, the marker bit carried in the RTP data packet corresponding to the first and/or last first audio data of the multiple first audio data divided into the same frame of audio data is the first value, so for the receiving side device, it can be determined based on the marker bit whether the multiple first audio data belonging to the same frame are completely received. In this way, the receiving side device performs framing for the multiple first audio data of the same frame and then performs decoding processing, so as to ensure that a complete frame of audio data is obtained through decoding and output.

例如：在发送侧设备针对采用AVS编码方式的音频数据进行RTP封装时，在同一帧音频数据划分成的多个第一音频数据中，针对最后一个第一音频数据对应的RTP数据包中承载的标记位为第一值，则接收侧设备在接收到承载的标记位为第一值的第一RTP数据包的情况下，即确定第一RTP数据包中承载的第一音频数据是其所在帧的最后一个第一音频数据(或者称为音频片段)。此时，接收侧设备可以按照该第一RTP数据包中承载的序列号递减的顺序向前追溯，直到追溯到承载的标记位为第一值的目标RTP数据包。当追溯到的目标RTP数据包承载的序列号连续且时间戳不同时，接收侧设备可以将追溯的这些RTP数据包(即承载的序列号处于第一序列号与第二序列号之间，且承载的时间戳与所述第一RTP数据包中承载的时间戳相同的RTP数据包)，确定为与第一RTP数据包中承载的第一音频数据属于同一帧的RTP数据包。For example: when the sending side device performs RTP encapsulation for audio data using the AVS encoding method, among the multiple first audio data divided into the same frame of audio data, the marker bit carried in the RTP data packet corresponding to the last first audio data is the first value, then when the receiving side device receives the first RTP data packet carrying the marker bit of the first value, it is determined that the first audio data carried in the first RTP data packet is the last first audio data (or audio fragment) of the frame in which it is located. At this time, the receiving side device can trace back in the descending order of the sequence number carried in the first RTP data packet until the target RTP data packet carrying the marker bit of the first value is traced back. When the sequence numbers carried by the traced target RTP data packet are continuous and the timestamps are different, the receiving side device can determine these traced RTP data packets (i.e., the RTP data packets carrying the sequence number between the first sequence number and the second sequence number, and the timestamp carried is the same as the timestamp carried in the first RTP data packet) as RTP data packets belonging to the same frame as the first audio data carried in the first RTP data packet.

可选地，所述音频数据传输方法还包括：Optionally, the audio data transmission method further includes:

该实施例中，在发送侧设备针对采用AVS编码方式的音频数据进行RTP封装时，在同一帧音频数据划分成的多个第一音频数据中，针对最后一个第一音频数据对应的RTP数据包中承载的标记位为第一值的情况下，如果接收侧设备接收到的RTP数据包中承载的标记位为第二值，则认为还没有接收到这一帧的最后一个第一音频数据，则继续接收所述RTP数据包，直到接收到的RTP数据包中承载的标记位为第一值，则认为接收到了这一帧的最后一个第一音频数据，从而可以进一步执行组帧处理。或者，如果在继续接收所述RTP数据包的过程中，在定时器超时时仍没有接收到承载的标记位为第一值的RTP数据包，则丢弃已接收到的RTP数据包，以保证传输性能避免宕机等问题。In this embodiment, when the sending side device performs RTP encapsulation for audio data using the AVS encoding method, among the multiple first audio data divided into the same frame of audio data, when the marker bit carried in the RTP data packet corresponding to the last first audio data is the first value, if the marker bit carried in the RTP data packet received by the receiving side device is the second value, it is considered that the last first audio data of this frame has not been received, and the RTP data packet is continued to be received until the marker bit carried in the received RTP data packet is the first value, then it is considered that the last first audio data of this frame has been received, so that the framing process can be further performed. Alternatively, if in the process of continuing to receive the RTP data packet, the RTP data packet carrying the marker bit of the first value is still not received when the timer times out, the received RTP data packet is discarded to ensure transmission performance and avoid problems such as downtime.
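The wait-or-discard behaviour in this embodiment can be sketched as a small collector. Assumptions are labeled here and in the code: the marker bit's "first value" is taken to be 1 (True), the 0.2 second timeout is arbitrary, the timer is checked lazily when the next packet arrives rather than by a background timer, and `FrameCollector` is a hypothetical name.

```python
class FrameCollector:
    """Buffer fragments until a marker packet arrives or the timer expires."""

    def __init__(self, timeout_s: float = 0.2):  # timeout value is an assumption
        self.timeout_s = timeout_s
        self.fragments = []
        self.started_at = None

    def push(self, payload: bytes, marker: bool, now: float):
        """Return the list of buffered fragments when the frame ends, else None."""
        expired = self.started_at is not None and now - self.started_at > self.timeout_s
        if expired:
            # Timer ran out before the marker packet arrived: drop what was received.
            self.fragments = []
            self.started_at = None
        if self.started_at is None:
            self.started_at = now
        self.fragments.append(payload)
        if not marker:
            return None  # marker bit has the second value: keep receiving
        frame, self.fragments, self.started_at = self.fragments, [], None
        return frame


collector = FrameCollector(timeout_s=0.2)
collector.push(b"a", marker=False, now=0.0)
print(collector.push(b"b", marker=True, now=0.1))
```

Passing `now` in explicitly keeps the sketch deterministic; a real receiver would read a monotonic clock and would also apply the sequence-number checks described elsewhere in this description.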

又例如：在发送侧设备针对采用AVS编码方式的音频数据进行RTP封装时，在同一帧音频数据划分成的多个第一音频数据中，针对第一个第一音频数据对应的RTP数据包中承载的标记位为第一值，则接收侧设备在接收到承载的标记位为第一值的第一RTP数据包的情况下，即确定第一RTP数据包中承载的第一音频数据是其所在帧的第一个第一音频数据(或者称为音频片段)。此时，接收侧设备可以按照该第一RTP数据包中承载的序列号递增的顺序继续执行收包操作，直到接收到承载的标记位为第一值的目标RTP数据包。当接收到的目标RTP数据包承载的序列号连续且时间戳不同时，接收侧设备可以将接收的这些RTP数据包(即承载的序列号处于第一序列号与第二序列号之间，且承载的时间戳与所述第一RTP数据包中承载的时间戳相同的RTP数据包)，确定为与第一RTP数据包中承载的第一音频数据属于同一帧的RTP数据包。For another example: when the sending side device performs RTP encapsulation for audio data using the AVS encoding method, among the multiple first audio data divided into the same frame of audio data, the marker bit carried in the RTP data packet corresponding to the first first audio data is the first value, then the receiving side device, when receiving the first RTP data packet carrying the marker bit of the first value, determines that the first audio data carried in the first RTP data packet is the first first audio data (or audio fragment) of the frame in which it is located. At this time, the receiving side device can continue to perform the packet receiving operation in the order of increasing sequence numbers carried in the first RTP data packet until the target RTP data packet carrying the marker bit of the first value is received. When the sequence numbers carried by the received target RTP data packets are continuous and the timestamps are different, the receiving side device can determine the received RTP data packets (i.e., the RTP data packets carrying the sequence numbers between the first sequence number and the second sequence number, and the RTP data packets carrying the same timestamp as the timestamp carried in the first RTP data packet) as RTP data packets belonging to the same frame as the first audio data carried in the first RTP data packet.

As a further example: when the sending side device performs RTP encapsulation of audio data obtained with the AVS encoding method, and the marker bit is set to the first value in the RTP data packets of both the first and the last first audio data of a frame, then on receiving a first RTP data packet whose marker bit is the first value, the receiving side device can determine whether that packet carries the first or the last first audio data (audio fragment) of its frame, according to whether the marker bit in the RTP data packet with the adjacent sequence number is the first value or the second value. Taking the case where the first RTP data packet is determined to carry the first first audio data of its frame: the receiving side device can continue receiving packets in increasing order of sequence number, starting from the sequence number of the first RTP data packet, until it receives a target RTP data packet whose marker bit is the first value. If the sequence numbers up to the target RTP data packet are consecutive and the timestamps are the same, it determines that all the first audio data of this frame have been received. The receiving side device can then treat the RTP data packets with consecutive sequence numbers received from the first RTP data packet onward as belonging to the same frame.

该实施例中，针对承载的时间戳相同的多个RTP数据包，可以确定其属于同一帧音频数据。如果多个RTP数据包中承载的序列号连续，则确定其承载的第一音频数据是在时间上连续的，从而可以将所述多个RTP数据包中的第一音频数据进行组帧，得到组帧后的音频数据。这样，接收侧设备针对同一帧的多个第一音频数据进行组帧后，再进行解码处理，从而可以保证准确地解码得到一帧完整的音频数据并输出。In this embodiment, for multiple RTP data packets carrying the same timestamp, it can be determined that they belong to the same frame of audio data. If the sequence numbers carried in the multiple RTP data packets are continuous, it is determined that the first audio data they carry is continuous in time, so that the first audio data in the multiple RTP data packets can be framed to obtain the framed audio data. In this way, the receiving side device frames the multiple first audio data of the same frame and then performs decoding processing, so that it can ensure that a complete frame of audio data is accurately decoded and output.

在所述多个RTP数据包中承载的序列号不连续的情况下，丢弃所述多个RTP数据包，并重新接收所述RTP数据包。In the case that the sequence numbers carried in the multiple RTP data packets are not continuous, the multiple RTP data packets are discarded and the RTP data packets are received again.

该实施例中，针对承载的时间戳相同的多个RTP数据包，如果确定所述多个RTP数据包中承载的序列号不连续，则认为这一帧的音频数据传输可能存在丢包情况，此时可以丢弃所述多个RTP数据包，并重新接收所述RTP数据包，以保证完整接收这一帧的音频数据。In this embodiment, for multiple RTP data packets carrying the same timestamp, if it is determined that the sequence numbers carried in the multiple RTP data packets are discontinuous, it is considered that there may be packet loss in the transmission of this frame of audio data. At this time, the multiple RTP data packets can be discarded and the RTP data packets can be received again to ensure that the audio data of this frame is fully received.
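The timestamp grouping and continuity check from the two embodiments above can be sketched as follows. Packets are modeled as `(sequence_number, timestamp, payload)` tuples already sorted in sequence order; that representation, the 16-bit wraparound handling, and the function names are assumptions for this illustration.

```python
def is_contiguous(sequence_numbers):
    """True if consecutive 16-bit RTP sequence numbers differ by exactly 1 (mod 65536)."""
    return all(
        (nxt - cur) & 0xFFFF == 1
        for cur, nxt in zip(sequence_numbers, sequence_numbers[1:])
    )


def split_by_continuity(packets):
    """Group packets by timestamp; keep gap-free groups, report the rest as dropped."""
    groups = {}
    for seq, timestamp, payload in packets:
        groups.setdefault(timestamp, []).append((seq, payload))
    kept, dropped = {}, []
    for timestamp, items in groups.items():
        if is_contiguous([seq for seq, _ in items]):
            kept[timestamp] = b"".join(payload for _, payload in items)
        else:
            dropped.append(timestamp)  # possible packet loss: discard and receive again
    return kept, dropped


packets = [(7, 100, b"a"), (8, 100, b"b"), (9, 200, b"c"), (11, 200, b"d")]
print(split_by_continuity(packets))
```

Continuity alone cannot tell whether the tail of a frame is still missing, which is why the description combines it with the marker bit.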

通过网页实时通信WebRTC媒体通道接收所述RTP数据包。The RTP data packet is received through a web real-time communication WebRTC media channel.

In WebRTC technology, two kinds of channels can carry audio and video data: the audio/video media channel and the data channel (dataChannel). With existing WebRTC technology, when the audio and video data includes video encoded with the H.264 highly compressed digital video codec standard, that video is transmitted over the WebRTC audio/video media channel, whereas audio data obtained with the AVS encoding method has to be transmitted over the WebRTC data channel.

When audio data obtained with the AVS encoding method is transmitted over the WebRTC data channel, end-to-end transmission is possible, but the data channel only passes data through transparently. It does not support audio processing capabilities such as audio jitter buffering, audio/video synchronization, forward error correction (FEC), echo cancellation, and noise suppression. Therefore, when AVS-encoded audio data and H.264-encoded video data need to be transmitted at the same time, this transmission method may cause stuttering audio playback, loss of audio/video synchronization, and noticeable noise.

该实施例中，在发送侧设备针对采用AVS编码方式的音频数据划分成多个第一音频数据，并单独进行RTP封装后，该封装得到的RTP数据包可以在WebRTC媒体通道上进行传输，这样当采用AVS编码方式获得音视频数据需要同时传输时，可以实现在WebRTC媒体通道上同时传输音视频数据，以保证音视频播放同步，减少音频播放的卡顿、噪声等。In this embodiment, the sending side device divides the audio data encoded in the AVS mode into multiple first audio data and performs RTP encapsulation separately. The RTP data packet obtained by the encapsulation can be transmitted on the WebRTC media channel. In this way, when the audio and video data obtained in the AVS encoding mode need to be transmitted simultaneously, the audio and video data can be transmitted simultaneously on the WebRTC media channel to ensure the synchronization of audio and video playback and reduce the jamming and noise of audio playback.

如图3所示，本申请实施例提供一种音频数据传输方法，应用于发送侧设备，可选地，所述接收侧设备可以是支持音频数据接收或者支持音频数据接收并输出的设备，当然本申请实施例中不限于接收侧设备仅支持音频数据的接收或输出等，还可以支持视频数据的接收或输出，或者还可以支持音频数据和/或视频数据的发送等，本申请实施例不以此为限。As shown in Figure 3, an embodiment of the present application provides an audio data transmission method, which is applied to a sending side device. Optionally, the receiving side device can be a device that supports audio data reception or supports audio data reception and output. Of course, the embodiment of the present application is not limited to the receiving side device only supporting the reception or output of audio data, etc., but can also support the reception or output of video data, or can also support the sending of audio data and/or video data, etc. The embodiment of the present application is not limited to this.

步骤31：获得第一音频数据；其中，所述第一音频数据是基于AVS编码方式进行编码获得的；Step 31: Obtain first audio data; wherein the first audio data is obtained by encoding based on the AVS encoding method;

步骤32：根据RTP对所述第一音频数据进行封装，得到RTP数据包；其中，所述RTP数据包承载所述第一音频数据以及第一指示信息，所述第一指示信息用于指示所述第一音频数据的编码方式为所述AVS编码方式；Step 32: Encapsulate the first audio data according to RTP to obtain an RTP data packet; wherein the RTP data packet carries the first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is the AVS encoding method;

步骤33：发送所述RTP数据包。Step 33: Send the RTP data packet.

编码器标识信息；Encoder identification information;

编码器配置参数；Encoder configuration parameters;

码率；Bit rate;

获取初始音频数据；Get initial audio data;

可选地，所述根据实时传输协议RTP对所述第一音频数据进行封装，得到RTP数据包，包括：Optionally, encapsulating the first audio data according to a real-time transport protocol RTP to obtain an RTP data packet includes:

Sequence number;

需要说明的是，本申请实施例中发送侧设备的音频传输方法与上述接收侧设备的音频传输方法是基于同一发明构思的，其实施例之间可以互相参见，这里不再赘述。It should be noted that the audio transmission method of the sending side device in the embodiment of the present application and the audio transmission method of the receiving side device mentioned above are based on the same inventive concept, and their embodiments can refer to each other and will not be repeated here.

以下针对发送侧设备的RTP数据包封装过程进行说明：The following is an explanation of the RTP data packet encapsulation process of the sending device:

For the sending side device, once the media negotiation with the receiving side device determines that the AVS encoding method is to be used, the device obtains the initial audio data (for example, audio data collected by the sending side device itself, or audio data collected by an audio capture device and passed through to the sending side device) and encodes it with the AVS encoding method to obtain the encoded audio data. Because the encoded audio data produced by the AVS encoding method is usually larger than the Maximum Transmission Unit (MTU), it needs to be split when it is encapsulated based on RTP. For example, the encoded audio data can be split in sequence with a preset step length from its start position toward its end position to obtain multiple first audio data; alternatively, it can be split in sequence with a preset step length from its end position toward its start position. The preset step length can be set so that each encapsulated RTP data packet does not exceed the MTU; the embodiments of the present application do not specifically limit it.

In this way, the sending side device splits the audio data obtained with the AVS encoding method and encapsulates it based on RTP, which enables real-time transmission of that audio data, including transmission over the WebRTC media channel. When audio and video data obtained with the AVS encoding method need to be transmitted at the same time, both can be sent over the WebRTC media channel, which keeps audio and video playback synchronized and reduces audio stuttering and noise.

以下结合发送侧设备和接收侧设备的交互过程，对本申请实施例的音频传输方法进行说明：The following describes the audio transmission method of the embodiment of the present application in combination with the interaction process between the sending side device and the receiving side device:

As shown in FIG. 4, an overall flow chart of an audio transmission method is provided. The specific process includes:

步骤41：音频推流端(即发送侧设备)采集音频数据；Step 41: The audio streaming end (i.e., the sending side device) collects audio data;

步骤42：音频推流端编码音频数据；Step 42: The audio streaming end encodes the audio data;

Step 43: The audio streaming end determines whether the encoded audio data was obtained using the AVS encoding method (in other words, whether its data format is an AVS-based audio encoding format, for example by checking whether it carries the "AV3A-AATF" identifier; see the embodiments above, which are not repeated here);

如果是，则执行步骤44；如果不是，则可以基于WebRTC技术进行音频数据传输；If yes, execute step 44; if no, audio data transmission can be performed based on WebRTC technology;

Step 44: The audio streaming end splits the encoded audio data into multiple RTP data packets according to the splitting (packetization) rule, labels these packets with consecutive sequence numbers, and stores them in a sending queue. Specifically, splitting according to the rule includes:

441. For audio data obtained using the AVS encoding method, start executing the splitting logic;

442. Split the encoded audio data so that the total size of each encapsulated RTP data packet is less than or equal to the MTU size. For example, with an Internet Protocol (IP) layer MTU of 1500 bytes, each encapsulated RTP data packet must not exceed 1500 bytes in total (the specific value can be adjusted as needed according to routing rules); the embodiments of the present application are not limited thereto;

443. The sequence numbers of the resulting RTP data packets increase in order, and the timestamps of these RTP data packets are identical;

444. In the last RTP data packet, the M field of the RTP header is set to 1, indicating that it is the last RTP data packet of the audio frame;
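Sub-steps 441 to 444 can be sketched as a single packetization function. This is an illustration under stated assumptions, not the application's implementation: the payload type 96, the 28 bytes reserved for IPv4 and UDP headers, and the name `packetize_frame` are hypothetical; only the 1500-byte MTU, the shared timestamp, the increasing sequence numbers, and M=1 on the last packet come from the text.

```python
import struct

RTP_HEADER_SIZE = 12  # fixed header without CSRC entries or extensions


def packetize_frame(frame, timestamp, first_seq, ssrc,
                    payload_type=96, mtu=1500, lower_layer_overhead=28):
    """Split one encoded audio frame into RTP packets that fit within the MTU."""
    max_payload = mtu - lower_layer_overhead - RTP_HEADER_SIZE
    if max_payload <= 0:
        raise ValueError("MTU is too small to carry any payload")
    if not frame:
        raise ValueError("frame must not be empty")
    # 442: cut the frame from start to end in fixed steps.
    chunks = [frame[i:i + max_payload] for i in range(0, len(frame), max_payload)]
    packets = []
    for index, chunk in enumerate(chunks):
        # 444: M=1 only on the last packet of the frame.
        marker = 0x80 if index == len(chunks) - 1 else 0x00
        header = struct.pack(
            "!BBHII",
            0x80,                           # V=2, P=0, X=0, CC=0
            marker | payload_type,
            (first_seq + index) & 0xFFFF,   # 443: sequence numbers increase, 16-bit wrap
            timestamp & 0xFFFFFFFF,         # 443: same timestamp for the whole frame
            ssrc,
        )
        packets.append(header + chunk)
    return packets


frame = bytes(range(256)) * 15  # 3840 bytes of stand-in "encoded audio"
packets = packetize_frame(frame, timestamp=48000, first_seq=65535, ssrc=0x1234)
print(len(packets), [len(p) for p in packets])
```

The next frame would start at `first_seq + len(packets)` with a new timestamp, which is what lets the receiver separate frames again.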

Step 45: The audio streaming end sends the RTP data packets produced by the splitting to the network through the WebRTC media channel; if no splitting was performed, audio data transmission is carried out based on WebRTC technology;

步骤46：音频接收端(即接收侧设备)通过WebRTC媒体通道接收RTP数据包；Step 46: The audio receiving end (i.e., the receiving device) receives the RTP data packet through the WebRTC media channel;

步骤47：音频接收端将接收到的RTP数据包根据RTP包序列号进行排序；Step 47: The audio receiving end sorts the received RTP data packets according to the RTP packet sequence numbers;

步骤48：音频接收端根据媒体类型判断RTP数据包是否需要组帧(比如：RTPHeader中的PT字段指示AV3A-AATF时，确定需要组帧)；Step 48: The audio receiving end determines whether the RTP data packet needs to be framed according to the media type (for example, when the PT field in the RTP Header indicates AV3A-AATF, it is determined that framing is required);

如果需要，则执行步骤49，如果不需要，则执行步骤410；If necessary, execute step 49, if not necessary, execute step 410;

Step 49: The audio receiving end performs the framing operation that corresponds to the splitting rule in step 44, and then passes the reassembled audio frame to the decoder. Specifically, the framing operation includes:

491、针对采用AVS编码方式获得的音频数据，开始执行组帧逻辑；491. For the audio data obtained by using the AVS encoding method, start executing the framing logic;

492、判断RTP数据包的时间戳是否一致；如果一致，则认为这些RTP数据包属于同一帧；如果不一致，则认为这些RTP数据包不属于同一帧；492. Determine whether the timestamps of the RTP data packets are consistent; if they are consistent, it is considered that these RTP data packets belong to the same frame; if they are inconsistent, it is considered that these RTP data packets do not belong to the same frame;

493、判断时间戳相同的RTP数据包中，是否存在RTP Header中M字段的数值为1的RTP数据包；如果不存在，则继续收包，直到超时丢弃；如果存在，则认为已接收到该音频帧的最后一个RTP数据包；493. Determine whether there is an RTP data packet with a value of 1 in the M field in the RTP Header among the RTP data packets with the same timestamp; if not, continue to receive packets until they are discarded due to timeout; if so, it is considered that the last RTP data packet of the audio frame has been received;

494、从RTP Header中M字段的数值为1的RTP数据包中承载的序列号开始向前追溯，如果向前追溯的RTP数据包与该RTP数据包的时间戳一致、且序列号完全连续，且可追溯到上一个音频帧的最后一个RTP数据包(时间戳不同，且M字段的数值为1)，则认为该音频帧的所有RTP数据包均已成功接收，则基于这些RTP数据包进行组帧，即完成组帧操作。否则，重新开始执行收包操作。494. Start tracing back from the sequence number carried in the RTP data packet whose M field value in the RTP Header is 1. If the RTP data packet traced back is consistent with the timestamp of the RTP data packet, and the sequence number is completely continuous, and can be traced back to the last RTP data packet of the previous audio frame (the timestamp is different, and the value of the M field is 1), then it is considered that all RTP data packets of the audio frame have been successfully received, and framing is performed based on these RTP data packets, that is, the framing operation is completed. Otherwise, restart the packet receiving operation.
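The framing logic in sub-steps 492 to 494 can be sketched as a backward trace from the marker packet. The `Packet` type, the dictionary keyed by sequence number, and the name `assemble_frame` are assumptions for this illustration; the rule itself (same timestamp, consecutive sequence numbers, stop at the previous frame's marker packet) follows the text.

```python
from typing import Dict, NamedTuple, Optional


class Packet(NamedTuple):
    seq: int
    timestamp: int
    marker: bool
    payload: bytes


def assemble_frame(buffer: Dict[int, Packet], end_seq: int) -> Optional[bytes]:
    """Trace back from the M=1 packet at end_seq; return the frame or None if incomplete."""
    end = buffer.get(end_seq)
    if end is None or not end.marker:
        return None  # 493: the last packet of the frame has not arrived yet
    fragments = [end.payload]
    seq = (end_seq - 1) & 0xFFFF
    for _ in range(len(buffer) - 1):
        pkt = buffer.get(seq)
        if pkt is None:
            return None  # 494: gap in the sequence numbers, frame is incomplete
        if pkt.timestamp != end.timestamp:
            # 494: a different timestamp must be the M=1 tail of the previous frame.
            return b"".join(reversed(fragments)) if pkt.marker else None
        fragments.append(pkt.payload)  # 492: same timestamp, same frame
        seq = (seq - 1) & 0xFFFF
    return None  # previous frame's tail is not in the buffer


buffer = {
    9: Packet(9, 1000, True, b"Z"),    # last packet of the previous frame
    10: Packet(10, 2000, False, b"A"),
    11: Packet(11, 2000, False, b"B"),
    12: Packet(12, 2000, True, b"C"),
}
print(assemble_frame(buffer, 12))
```

Because the rule needs the previous frame's marker packet as a boundary, the very first frame of a session needs separate handling, which the text does not specify and this sketch leaves out.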

Step 410: The audio receiving end decodes the audio data obtained by framing; or, if no framing operation is needed, directly decodes the received audio data;

Step 411: The audio receiving end plays the decoded audio data.

In the embodiments of the present application, extending the WebRTC technology makes it possible to transmit audio encoded with the AVS encoding method. It can also address problems of current WebRTC-based transmission such as stuttering, audio and video being out of sync, and high noise, and can significantly improve the transmission quality of audio encoded with the AVS encoding method.

As shown in FIG. 5, an embodiment of the present application provides an audio data transmission device 500, applied to a receiving-side device, including:

A receiving module 510, configured to receive a Real-time Transport Protocol (RTP) data packet; wherein the RTP data packet carries first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is the Audio Video coding Standard (AVS) encoding method;

A decoding module 520, configured to decode the first audio data according to the first indication information.
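As an illustration of how a receiving module and a decoding module might cooperate, the following Python sketch parses the fixed RTP header and checks whether a packet carries AVS audio. It rests on assumptions that this passage does not state: the first indication information is assumed to be the RTP payload type, the value `AVS_PAYLOAD_TYPE = 120` is a hypothetical dynamic payload type, and the names `parse_rtp` and `is_avs` are illustrative.

```python
import struct

AVS_PAYLOAD_TYPE = 120  # hypothetical dynamic payload type agreed during negotiation


def parse_rtp(packet: bytes) -> dict:
    """Parse the fixed RTP header and return its fields plus the payload."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if b0 >> 6 != 2:
        raise ValueError("unsupported RTP version")
    offset = 12 + 4 * (b0 & 0x0F)  # skip the CSRC list
    if b0 & 0x10:  # header extension present
        if len(packet) < offset + 4:
            raise ValueError("truncated header extension")
        (ext_words,) = struct.unpack("!H", packet[offset + 2:offset + 4])
        offset += 4 + 4 * ext_words
    end = len(packet)
    if b0 & 0x20:  # padding present: the last octet holds the padding length
        end -= packet[-1]
    if offset > end:
        raise ValueError("malformed RTP packet")
    return {
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[offset:end],
    }


def is_avs(fields: dict) -> bool:
    """Return True if the packet should be handed to the AVS audio decoder."""
    return fields["payload_type"] == AVS_PAYLOAD_TYPE
```

A receiver built this way would route only packets for which `is_avs` is true to the AVS decoder and treat the remaining payload types according to its other negotiated codecs.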

Optionally, the audio data transmission device 500 further includes:

A determination module, configured to determine encoding configuration information through a media negotiation process with the sending-side device; wherein the encoding configuration information includes encoding name information, and the encoding name information is used to indicate the AVS encoding method.

Encoder identification information;

Encoder configuration parameters;

Bit rate;

Sequence number;

Optionally, the decoding module 520 includes:

A first combining unit, configured to combine the first audio data carried in multiple received RTP data packets to obtain combined audio data when the timestamps carried in the multiple RTP data packets are the same;

A first decoding unit, configured to decode the combined audio data according to the first indication information.

A determination unit, configured to determine at least one second RTP data packet when the marker bit carried in a received first RTP data packet is a first value; wherein the timestamp carried in each second RTP data packet is the same as the timestamp carried in the first RTP data packet;

A second combining unit, configured to combine the first audio data carried in multiple RTP data packets to obtain combined audio data; wherein the multiple RTP data packets include the first RTP data packet and the at least one second RTP data packet;

A second decoding unit, configured to decode the combined audio data according to the first indication information.

Optionally, the determination unit is further configured to:

Optionally, the first combining unit or the second combining unit is further configured to:

Optionally, the audio data transmission device 500 further includes at least one of the following:

A first processing module, configured to discard the multiple RTP data packets and receive RTP data packets again when the sequence numbers carried in the multiple RTP data packets are discontinuous;

A second processing module, configured to, when the marker bit carried in a received RTP data packet is a second value, continue receiving RTP data packets until the marker bit carried in a received RTP data packet is the first value, or to discard the RTP data packets already received when a timer expires.
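The two processing modules amount to a small decision rule, which the following Python sketch illustrates. The function names, the returned action strings, and the default timeout of 0.2 seconds are assumptions made for the example; the source does not specify a timer value. Sequence numbers are treated as 16-bit values that wrap around, as in standard RTP.

```python
from typing import Iterable


def is_contiguous(seqs: Iterable[int]) -> bool:
    """Return True if the 16-bit sequence numbers form one unbroken run."""
    present = {s & 0xFFFF for s in seqs}
    # A run start is a number whose predecessor (modulo 2**16) is absent.
    # Exactly one run start means there is a single run with no gaps.
    starts = [s for s in present if ((s - 1) & 0xFFFF) not in present]
    return len(starts) == 1


def next_action(seqs: Iterable[int], marker_seen: bool,
                elapsed_s: float, timeout_s: float = 0.2) -> str:
    """Decide what to do with the packets buffered for the current frame."""
    if not marker_seen:
        # Second processing module: keep receiving until the marker bit has the
        # first value, or discard the buffered packets once the timer expires.
        return "discard" if elapsed_s >= timeout_s else "wait"
    # First processing module: discard the packets if sequence numbers have gaps.
    return "assemble" if is_contiguous(seqs) else "discard"
```

Counting run starts rather than sorting keeps the continuity check correct when the sequence number wraps from 65535 to 0 in the middle of a frame.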

Optionally, the receiving module 510 includes:

A receiving unit, configured to receive the RTP data packet through a Web Real-Time Communications (WebRTC) media channel.

It should be noted that the device in this embodiment of the present application can implement each embodiment of the audio transmission method of the receiving-side device described above and achieve the same technical effects. To avoid repetition, details are not repeated here.

As shown in FIG. 6, an embodiment of the present application provides an audio data transmission device 600, applied to a sending-side device, including:

An obtaining module 610, configured to obtain first audio data; wherein the first audio data is obtained by encoding based on the Audio Video coding Standard (AVS) encoding method;

An encapsulation module 620, configured to encapsulate the first audio data according to the Real-time Transport Protocol (RTP) to obtain an RTP data packet; wherein the RTP data packet carries the first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is the AVS encoding method;

A sending module 630, configured to send the RTP data packet.

Optionally, the audio data transmission device 600 further includes:

A determination module, configured to determine encoding configuration information through a media negotiation process with the receiving-side device; wherein the encoding configuration information includes encoding name information, and the encoding name information is used to indicate the AVS encoding method.

Encoder identification information;

Encoder configuration parameters;

Bit rate;
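For illustration only, the following Python sketch shows how such encoding configuration information might be read from an SDP description exchanged during media negotiation. The `a=rtpmap` and `a=fmtp` attribute syntax is standard SDP, but everything specific to AVS in the example is an assumption: the encoding name `AVS3`, the payload type 120, and the parameter names `codec-category`, `nn-model`, `nn-model-type` and `bitrate` are hypothetical and are not taken from the source.

```python
from typing import Dict


def parse_audio_codecs(sdp: str) -> Dict[int, dict]:
    """Collect rtpmap and fmtp attributes per payload type from an SDP body."""
    codecs: Dict[int, dict] = {}
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("a=rtpmap:"):
            pt, desc = line[len("a=rtpmap:"):].split(" ", 1)
            name, clock, *channels = desc.split("/")
            codecs.setdefault(int(pt), {}).update(
                name=name,                      # encoding name information
                clock_rate=int(clock),
                channels=int(channels[0]) if channels else 1,
            )
        elif line.startswith("a=fmtp:"):
            pt, params = line[len("a=fmtp:"):].split(" ", 1)
            codecs.setdefault(int(pt), {})["params"] = dict(
                kv.strip().split("=", 1) for kv in params.split(";") if "=" in kv
            )
    return codecs


# Hypothetical offer fragment; none of the AVS-specific values are from the source.
EXAMPLE_SDP = """\
m=audio 9 UDP/TLS/RTP/SAVPF 120
a=rtpmap:120 AVS3/48000/2
a=fmtp:120 codec-category=1;nn-model=1;nn-model-type=0;bitrate=64000
"""
```

Calling `parse_audio_codecs(EXAMPLE_SDP)` yields one entry keyed by payload type 120, whose `name` field plays the role of the encoding name information and whose `params` field holds the hypothetical encoder identification and bit rate values.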

Optionally, the obtaining module 610 includes:

An acquisition unit, configured to acquire initial audio data;

An encoding unit, configured to encode the initial audio data using the AVS encoding method according to the encoding configuration information, to obtain encoded audio data;

A splitting unit, configured to split the encoded audio data into multiple pieces of the first audio data.

Optionally, the encapsulation module 620 includes:

An encapsulation unit, configured to encapsulate each piece of the first audio data separately according to RTP to obtain multiple RTP data packets; wherein each RTP data packet carries one piece of the first audio data and the first indication information.
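The splitting and encapsulation described above can be sketched in Python as follows. This is a hedged illustration rather than the encapsulation of the embodiment: the function name `packetize`, the 1200-byte payload limit, and the choice to set the marker bit only on the last fragment of a frame are assumptions, and the payload type is assumed to serve as the first indication information.

```python
import struct
from typing import List


def packetize(frame: bytes, *, payload_type: int, first_seq: int,
              timestamp: int, ssrc: int, max_payload: int = 1200) -> List[bytes]:
    """Split one encoded audio frame into RTP packets with a shared timestamp."""
    chunks = [frame[i:i + max_payload]
              for i in range(0, len(frame), max_payload)] or [b""]
    packets = []
    for i, chunk in enumerate(chunks):
        marker = 0x80 if i == len(chunks) - 1 else 0x00  # M = 1 on the last fragment
        header = struct.pack(
            "!BBHII",
            0x80,                               # version 2, no padding/extension/CSRC
            marker | (payload_type & 0x7F),     # marker bit and payload type
            (first_seq + i) & 0xFFFF,           # consecutive 16-bit sequence numbers
            timestamp & 0xFFFFFFFF,             # same timestamp for every fragment
            ssrc & 0xFFFFFFFF,
        )
        packets.append(header + chunk)
    return packets
```

Because every fragment of a frame shares one timestamp and the sequence numbers are consecutive, a receiver can group the packets by timestamp and detect the end of the frame from the marker bit.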

Sequence number;

Optionally, the sending module 630 includes:

A sending unit, configured to send the RTP data packet through a Web Real-Time Communications (WebRTC) media channel.

It should be noted that the device in this embodiment of the present application can implement each embodiment of the audio transmission method of the sending-side device described above and achieve the same technical effects. To avoid repetition, details are not repeated here.

As shown in FIG. 7, an embodiment of the present application further provides an electronic device, including: a processor 71; and a memory 73 connected to the processor 71 through a bus interface 72. The memory 73 is used to store the programs and data used by the processor 71 when performing operations, and the processor 71 calls and executes the programs and data stored in the memory 73. A transceiver 74 is connected to the bus interface 72 and is used to receive and send data under the control of the processor 71. Optionally, when the electronic device serves as a receiving-side device, the processor 71 reads the program in the memory 73 to implement the steps of the audio transmission method of the receiving-side device described above; or, when the electronic device serves as a sending-side device, the processor 71 reads the program in the memory 73 to implement the steps of the audio transmission method of the sending-side device described above. In either case the same technical effects as in the audio transmission method embodiments can be achieved; to avoid repetition, details are not repeated here.

It should be noted that, in FIG. 7, the bus architecture may include any number of interconnected buses and bridges, which link together various circuits of one or more processors, represented by the processor 71, and of memory, represented by the memory 73. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators, and power management circuits; these are well known in the art and are therefore not described further here. The bus interface provides an interface. The transceiver 74 may consist of multiple elements, that is, it may include a transmitter and a receiver, and provides a unit for communicating with various other devices over a transmission medium. For different terminals, the user interface 75 may also be an interface capable of connecting required devices externally or internally; the connected devices include but are not limited to a keypad, a display, a speaker, a microphone, and a joystick. The processor 71 is responsible for managing the bus architecture and general processing, and the memory 73 may store data used by the processor 71 when performing operations.

Those skilled in the art will understand that all or some of the steps of the above embodiments may be carried out by hardware, or by a computer program that instructs the relevant hardware. The computer program includes instructions for executing some or all of the steps of the above methods, and it may be stored in a readable storage medium, which may be any form of storage medium.

In addition, a specific embodiment of the present application further provides a readable storage medium on which a program is stored. When executed by a processor, the program implements the steps of the audio transmission method of the receiving-side device described above, or the steps of the audio transmission method of the sending-side device described above, and can achieve the same technical effects as the audio transmission method embodiments. To avoid repetition, details are not repeated here.

In the several embodiments provided in the present application, it should be understood that the disclosed methods and devices may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute some of the steps of the sending and receiving methods described in the embodiments of the present application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above are preferred embodiments of the present application. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles described in the present application, and these improvements and refinements also fall within the scope of protection of the present application.

Claims

1. An audio data transmission method, characterized in that it is applied to a receiving-side device, the method comprising:

Receive a real-time transport protocol RTP data packet; wherein the RTP data packet carries first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is an audio and video codec standard AVS encoding method;

The first audio data is decoded according to the first indication information.

2. The audio data transmission method according to claim 1, characterized in that before receiving the real-time transport protocol RTP data packet, it also includes:

The encoding configuration information is determined through a media negotiation process with a sending-side device; wherein the encoding configuration information includes encoding name information, and the encoding name information is used to indicate the AVS encoding method.

3. The audio data transmission method according to claim 2, wherein the encoding configuration information further comprises at least one of the following:

Encoder identification information;

Encoder configuration parameters;

Bit rate;

The encoder identification information includes at least one of the following:

First identification information, used to indicate the encoding category for encoding the audio data;

The second identification information is used to indicate whether the encoder includes a neural network model or not;

The third identification information is used to indicate the category of the neural network model in the encoder.

4. The audio data transmission method according to claim 1, characterized in that the RTP data packet also carries at least one of the following:

Sequence number;

Timestamp; wherein, in different RTP data packets carrying the first audio data of the same frame, the timestamp is the same;

Marker bit; wherein, in the RTP data packet carrying the target audio data of each frame, the marker bit is a first value; and/or, in the RTP data packet carrying the first audio data other than the target audio data in each frame, the marker bit is a second value; the target audio data includes the first and/or last first audio data in each frame.

5. The audio data transmission method according to claim 4, characterized in that the decoding process of the first audio data according to the first indication information comprises:

When the timestamps carried in the received multiple RTP data packets are the same, combining the first audio data carried in the multiple RTP data packets to obtain combined audio data;

The combined audio data is decoded according to the first indication information.

6. The audio data transmission method according to claim 4, wherein the decoding process of the first audio data according to the first indication information comprises:

In the case where the marker bit carried in the received first RTP data packet is a first value, determining at least one second RTP data packet; wherein each of the second RTP data packets has the same timestamp as that carried in the first RTP data packet;

Combining the first audio data carried in a plurality of RTP data packets to obtain combined audio data; wherein the plurality of RTP data packets include: the first RTP data packet and the at least one second RTP data packet;

7. The audio data transmission method according to claim 6, characterized in that the step of determining at least one second RTP data packet when the marker bit carried in the received first RTP data packet is a first value comprises:

In the case where the marker bit carried in the received first RTP data packet is a first value, determining a target RTP data packet; wherein the marker bit carried in the target RTP data packet is the first value, and the timestamps carried in the target RTP data packet and the first RTP data packet are different;

An RTP data packet that carries a sequence number between a first sequence number and a second sequence number and carries a timestamp that is the same as the timestamp carried in the first RTP data packet is determined as the second RTP data packet; wherein the first sequence number is the sequence number carried in the first RTP data packet, and the second sequence number is the sequence number carried in the target RTP data packet.

8. The audio data transmission method according to any one of claims 5 to 7, characterized in that combining the first audio data carried in a plurality of RTP data packets to obtain the combined audio data comprises:

In the case where the sequence numbers carried in the multiple RTP data packets are continuous, the first audio data carried in the multiple RTP data packets are combined in sequence according to the increasing or decreasing order of the sequence numbers carried in the multiple RTP data packets to obtain combined audio data.

9. The audio data transmission method according to any one of claims 5 to 7, characterized in that it also includes at least one of the following:

In the case where the sequence numbers carried in the multiple RTP data packets are discontinuous, discarding the multiple RTP data packets and receiving the RTP data packets again;

When the marker bit carried in the received RTP data packet is the second value, continue to receive the RTP data packet until the marker bit carried in the received RTP data packet is the first value or the received RTP data packet is discarded when the timer times out.

10. The audio data transmission method according to claim 1, wherein the receiving of a real-time transport protocol (RTP) data packet comprises:

The RTP data packet is received through a web real-time communication WebRTC media channel.

11. An audio data transmission method, characterized in that it is applied to a sending side device, the method comprising:

Obtain first audio data; wherein the first audio data is obtained by encoding based on the audio and video codec standard AVS encoding method;

Encapsulating the first audio data according to the real-time transport protocol RTP to obtain an RTP data packet; wherein the RTP data packet carries the first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is the AVS encoding method;

The RTP data packet is sent.

12. The audio data transmission method according to claim 11, characterized in that before obtaining the first audio data, it also includes:

The encoding configuration information is determined through a media negotiation process with a receiving-side device; wherein the encoding configuration information includes encoding name information, and the encoding name information is used to indicate the AVS encoding method.

13. The audio data transmission method according to claim 12, wherein the encoding configuration information further comprises at least one of the following:

Encoder identification information;

Encoder configuration parameters;

Bit rate;

The encoder identification information includes at least one of the following:

14. The audio data transmission method according to claim 12 or 13, wherein obtaining the first audio data comprises:

Obtaining initial audio data;

According to the encoding configuration information, the initial audio data is encoded using the AVS encoding method to obtain encoded audio data;

The encoded audio data is divided into a plurality of the first audio data.

15. The audio data transmission method according to claim 14, characterized in that encapsulating the first audio data according to the real-time transport protocol RTP to obtain an RTP data packet comprises:

Each of the first audio data is encapsulated according to RTP to obtain multiple RTP data packets; wherein each RTP data packet carries one of the first audio data and the first indication information.

16. The audio data transmission method according to claim 15, characterized in that the RTP data packet also carries at least one of the following:

Sequence number;

17. The audio data transmission method according to claim 11, wherein sending the RTP data packet comprises:

The RTP data packet is sent through a Web Real-Time Communication (WebRTC) media channel.

18. An audio data transmission device, characterized in that it is applied to a receiving-side device, comprising:

A receiving module, configured to receive a real-time transport protocol RTP data packet; wherein the RTP data packet carries first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is an audio and video codec standard AVS encoding method;

A decoding module is used to decode the first audio data according to the first indication information.

19. An audio data transmission device, characterized in that it is applied to a sending-side device, comprising:

An acquisition module is used to obtain first audio data; wherein the first audio data is obtained by encoding based on the audio and video codec standard AVS encoding method;

An encapsulation module, used to encapsulate the first audio data according to the real-time transport protocol RTP to obtain an RTP data packet; wherein the RTP data packet carries the first audio data and first indication information, and the first indication information is used to indicate that the encoding method of the first audio data is the AVS encoding method;

A sending module is used to send the RTP data packet.

20. An electronic device, characterized in that it comprises: a processor, a memory, and a computer program stored in the memory and executable on the processor; wherein, when the electronic device is a receiving-side device, the processor implements the steps of the audio data transmission method as described in any one of claims 1 to 10 when executing the computer program; or when the electronic device is a sending-side device, the processor implements the steps of the audio data transmission method as described in any one of claims 11 to 17 when executing the computer program.

21. A readable storage medium, characterized in that a program is stored on the readable storage medium, and when the program is executed by a processor, the steps of the audio data transmission method according to any one of claims 1 to 17 are implemented.