CN115050377B - Audio transcoding method, device, audio transcoder, equipment and storage medium - Google Patents
- Publication number: CN115050377B (application number CN202111619099.XA)
- Authority
- CN
- China
- Prior art keywords
- audio
- signal
- audio stream
- parameter
- code rate
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
- G10L19/02 — Speech or audio signal analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/12 — Determination or coding of the excitation function or the long-term prediction parameters, the excitation function being a code excitation, e.g. in code-excited linear prediction [CELP] vocoders
- G10L19/173 — Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
Abstract
The application discloses an audio transcoding method and apparatus, an audio transcoder, a device, and a storage medium, belonging to the field of audio processing. According to the technical scheme provided by the embodiments of the application, transcoding an audio stream requires no full parameter extraction: entropy decoding alone yields the audio characteristic parameters and the excitation signal. Re-quantization is then applied directly to the excitation signal and the audio characteristic parameters, without the related processing of a full time-domain decode and re-encode. Finally, entropy coding of the re-quantized excitation signal and audio characteristic parameters produces a second audio stream with a lower code rate. Because entropy decoding and entropy encoding are computationally cheap, and the heavy time-domain signal processing is avoided, the computational load is greatly reduced, improving the overall speed and efficiency of audio transcoding while preserving sound quality.
Description
The present application claims priority to Chinese patent application No. 202110218868.9, entitled "Audio transcoding method, apparatus, audio transcoder, device, and storage medium," filed on February 26, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of audio processing, and in particular, to an audio transcoding method, apparatus, audio transcoder, device, and storage medium.
Background
With the development of network technology, more and more users conduct voice chat through social application programs.
In the related art, because different users have different network bandwidths, a social application needs to transcode the transmitted audio during a voice chat. For example, if one user's network bandwidth is low, the audio must be transcoded, that is, its code rate must be reduced, to ensure that the user can chat normally.
However, audio transcoding in the related art is computationally complex, making it slow and inefficient.
Disclosure of Invention
The embodiment of the application provides an audio transcoding method, an audio transcoding device, audio transcoders, equipment and a storage medium, which can improve the speed and efficiency of audio transcoding. The technical scheme is as follows:
In one aspect, there is provided an audio transcoding method, the method comprising:
performing entropy decoding on a first audio stream having a first code rate to obtain audio characteristic parameters and an excitation signal of the first audio stream, the excitation signal being a quantized speech signal;
acquiring a time-domain audio signal corresponding to the excitation signal based on the audio characteristic parameters and the excitation signal;
re-quantizing the excitation signal and the audio characteristic parameters based on the time-domain audio signal and a target transcoding code rate; and
performing entropy coding on the re-quantized audio characteristic parameters and the re-quantized excitation signal to obtain a second audio stream having a second code rate, the second code rate being lower than the first code rate.
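Read as a pipeline, the four steps above can be sketched in a few lines. The sketch below is illustrative only: the helper functions, the dict-shaped "stream," and the single fixed quantization step are assumptions standing in for the real range coder, LPC synthesis, and rate-controlled quantizer of a codec such as Opus.

```python
import numpy as np

def entropy_decode(stream):
    # Stand-in for range decoding: here the "stream" is already a dict.
    return stream["features"], stream["excitation"]

def synthesize_time_domain(features, excitation):
    # Stand-in for synthesis: filter the excitation with the feature
    # coefficients to approximate the time-domain audio signal.
    return np.convolve(excitation, features, mode="same")

def requantize(values, step):
    # Coarser step -> fewer distinct levels -> lower code rate.
    return np.round(values / step) * step

def entropy_encode(features, excitation):
    # Stand-in for range encoding of the re-quantized data.
    return {"features": features, "excitation": excitation}

def transcode(first_stream, step):
    features, excitation = entropy_decode(first_stream)         # step 1
    time_domain = synthesize_time_domain(features, excitation)  # step 2
    # In the claimed scheme, time_domain guides the choice of step;
    # here the step is simply passed in.
    q_feat = requantize(features, step)                         # step 3
    q_exc = requantize(excitation, step)
    return entropy_encode(q_feat, q_exc)                        # step 4
```

Note that only re-quantization and entropy recoding touch the compressed-domain data; no full decode-to-PCM-and-re-encode cycle appears in the loop.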
In one aspect, an audio transcoder is provided, the audio transcoder comprising an entropy decoding unit, a time-domain decoding unit, a quantization unit, and an entropy coding unit, wherein the entropy decoding unit is connected to the time-domain decoding unit and the quantization unit respectively, the time-domain decoding unit is connected to the quantization unit, and the quantization unit is connected to the entropy coding unit;
the entropy decoding unit is used for entropy decoding a first audio stream with a first code rate to obtain audio characteristic parameters and excitation signals of the first audio stream, wherein the excitation signals are quantized voice signals;
the time domain decoding unit is used for acquiring a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal;
the quantization unit is used for re-quantizing the excitation signal and the audio characteristic parameter based on the time domain audio signal and a target transcoding code rate;
The entropy coding unit is used for entropy coding the re-quantized audio characteristic parameter and the re-quantized excitation signal to obtain a second audio stream with a second code rate, and the second code rate is lower than the first code rate.
In a possible implementation manner, the quantization unit is configured to: determine, in any one iteration, a first candidate quantization parameter based on the target transcoding code rate; simulate the re-quantization of the excitation signal and the audio characteristic parameters based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameters; simulate the entropy coding of the first signal and the first parameter to obtain a simulated audio stream; and determine the first candidate quantization parameter as the first quantization parameter in response to the simulated audio stream meeting a first target condition and at least one of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, or the number of iterations meeting a second target condition.
In a possible implementation manner, the simulated audio stream meeting the first target condition refers to at least one of the following:
the code rate of the simulated audio stream is less than or equal to the target transcoding code rate;
the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.
In a possible implementation manner, at least one of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, or the number of iterations meeting a second target condition refers to at least one of the following:
the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
the difference between the target transcoding code rate and the code rate of the simulated audio stream is less than or equal to a difference threshold;
the number of iterations is equal to an iteration count threshold.
In a possible implementation manner, the quantization unit is configured to: simulate the discrete cosine transform of the excitation signal and of the audio characteristic parameters, respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio characteristic parameters; and divide the second signal and the second parameter by the first candidate quantization parameter, respectively, and then round, to obtain the first signal and the first parameter.
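The transform-divide-round procedure just described can be shown concretely. The unnormalized DCT-II written out below and the single uniform divisor are assumptions; the actual transform variant and quantizer design are codec-specific.

```python
import numpy as np

def dct_ii(x):
    # Unnormalized DCT-II, written out directly to avoid extra dependencies:
    # X[k] = sum_m cos(pi * k * (m + 0.5) / n) * x[m]
    n = len(x)
    grid = np.pi * np.arange(n)[:, None] * (np.arange(n)[None, :] + 0.5) / n
    return np.cos(grid) @ x

def requantize(values, quant_param):
    # Transform, divide by the candidate quantization parameter, then round.
    # A larger quant_param yields smaller integer symbols, hence fewer bits.
    return np.rint(dct_ii(values) / quant_param).astype(int)
```

A constant input concentrates all its energy in the DC coefficient, so the quantized symbols collapse to a single nonzero value, which is exactly what makes the entropy-coding step cheap afterwards.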
In a possible implementation, the quantization unit is further configured to: in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meeting the second target condition, take a second candidate quantization parameter determined based on the target transcoding code rate as the input of the next iteration.
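The iteration described above — simulate re-quantization and entropy coding, test the target conditions, otherwise feed a new candidate into the next round — can be sketched as follows. The logarithmic rate proxy and the 1.5x step growth are assumptions standing in for the embodiment's actual rate estimator and candidate-selection rule.

```python
import numpy as np

def estimate_rate(symbols):
    # Crude stand-in for the simulated entropy-coding cost:
    # larger-magnitude symbols are assumed to cost more bits.
    return float(np.sum(np.log2(1.0 + np.abs(symbols))))

def find_quant_param(values, target_rate, q=1.0, max_iters=32):
    for _ in range(max_iters):              # iteration-count cap
        simulated = np.rint(values / q)     # simulated re-quantization
        rate = estimate_rate(simulated)     # simulated entropy-coding cost
        if rate <= target_rate:             # first target condition met
            return q, rate
        q *= 1.5                            # candidate for the next iteration
    return q, rate                          # cap reached (second condition)
```

Because coarsening the quantization step monotonically shrinks the symbols, the loop is guaranteed to drive the simulated rate down toward the target.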
In a possible implementation, the entropy decoding unit is configured to: acquire the occurrence probabilities of a plurality of coding units in the first audio stream; decode the first audio stream based on the occurrence probabilities to obtain a plurality of decoding units respectively corresponding to the plurality of coding units; and combine the plurality of decoding units to obtain the audio characteristic parameters and the excitation signal of the first audio stream.
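The decode-then-combine idea can be illustrated with a toy prefix code (the embodiment's entropy coder would be a range/arithmetic coder; a prefix code is used here only because it is compact to show). The code table and the unit names are hypothetical.

```python
# Hypothetical code table: shorter codewords for more probable units.
# The unit names (gain, pitch, lsf, exc) and codes are illustrative only.
CODES = {"0": "gain", "10": "pitch", "110": "lsf", "111": "exc"}

def entropy_decode_bits(bits):
    # Walk the bitstring, emitting a decoding unit whenever the buffered
    # bits match a codeword; the units are then combined downstream into
    # the audio characteristic parameters and the excitation signal.
    units, buf = [], ""
    for b in bits:
        buf += b
        if buf in CODES:
            units.append(CODES[buf])
            buf = ""
    return units
```

Because the code is prefix-free, no lookahead is needed: each matched buffer is unambiguous.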
In a possible implementation, the entropy encoding unit is configured to:
acquire the occurrence probabilities of a plurality of coding units in the re-quantized audio characteristic parameters and the re-quantized excitation signal;
and encode the plurality of coding units based on the occurrence probabilities to obtain the second audio stream.
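A probability-driven encoder in the spirit of this step can be sketched with a Huffman code: frequent coding units get short codewords, so the output stream approaches the source entropy. This is a stand-in for the range coder actually used by codecs such as Opus, and the unit names are hypothetical.

```python
import heapq

def huffman_code(probs):
    # Build a prefix code in which high-probability units receive short
    # codewords; heap entries are (probability, symbol list, partial code).
    heap = [(p, [s], {s: ""}) for s, p in sorted(probs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, s1, c1 = heapq.heappop(heap)
        p2, s2, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, s1 + s2, merged))
    return heap[0][2]

def entropy_encode(units, probs):
    # Concatenate the codeword of each coding unit into the output stream.
    code = huffman_code(probs)
    return "".join(code[u] for u in units)
```

With probabilities {0.5, 0.25, 0.25}, the most probable unit gets a one-bit codeword and the others two bits, so a run dominated by the probable unit compresses well.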
In a possible implementation manner, the audio transcoder further comprises a forward error correction module connected to the entropy encoding unit and configured to perform forward error correction encoding on a subsequently received audio stream based on the second audio stream.
In one aspect, an audio transcoding apparatus is provided, the apparatus comprising:
the decoding module is used for entropy decoding a first audio stream with a first code rate to obtain audio characteristic parameters and excitation signals of the first audio stream, wherein the excitation signals are quantized voice signals;
The time domain audio signal acquisition module is used for acquiring a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal;
the quantization module is used for re-quantizing the excitation signal and the audio characteristic parameters based on the time domain audio signal and a target transcoding code rate;
And the encoding module is used for entropy encoding the re-quantized audio characteristic parameters and the re-quantized excitation signals to obtain a second audio stream with a second code rate, and the second code rate is lower than the first code rate.
In a possible implementation manner, the quantization module is configured to obtain, through at least one iteration, a first quantization parameter based on the target transcoding code rate, where the first quantization parameter is used to adjust the first code rate of the first audio stream to the target transcoding code rate; and to re-quantize the excitation signal and the audio characteristic parameters based on the time-domain audio signal and the first quantization parameter.
In a possible implementation manner, the quantization module is configured to: determine, in any one iteration, a first candidate quantization parameter based on the target transcoding code rate; simulate the re-quantization of the excitation signal and the audio characteristic parameters based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameters; simulate the entropy coding of the first signal and the first parameter to obtain a simulated audio stream; and determine the first candidate quantization parameter as the first quantization parameter in response to the simulated audio stream meeting a first target condition and at least one of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, or the number of iterations meeting a second target condition.
In a possible implementation manner, the simulated audio stream meeting the first target condition refers to at least one of the following:
the code rate of the simulated audio stream is less than or equal to the target transcoding code rate;
the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.
In a possible implementation manner, at least one of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, or the number of iterations meeting a second target condition refers to at least one of the following:
the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
the difference between the target transcoding code rate and the code rate of the simulated audio stream is less than or equal to a difference threshold;
the number of iterations is equal to an iteration count threshold.
In a possible implementation manner, the quantization module is configured to simulate the discrete cosine transform of the excitation signal and of the audio characteristic parameters, respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio characteristic parameters; and to divide the second signal and the second parameter by the first candidate quantization parameter, respectively, and then round, to obtain the first signal and the first parameter.
In a possible implementation manner, the quantization module is further configured to, in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meeting the second target condition, take a second candidate quantization parameter determined based on the target transcoding code rate as the input of the next iteration.
In a possible implementation manner, the decoding module is configured to acquire the occurrence probabilities of a plurality of coding units in the first audio stream; decode the first audio stream based on the occurrence probabilities to obtain a plurality of decoding units respectively corresponding to the plurality of coding units; and combine the plurality of decoding units to obtain the audio characteristic parameters and the excitation signal of the first audio stream.
In a possible implementation manner, the encoding module is configured to acquire the occurrence probabilities of a plurality of coding units in the re-quantized audio characteristic parameters and the re-quantized excitation signal; and encode the plurality of coding units based on the occurrence probabilities to obtain the second audio stream.
In a possible implementation manner, the apparatus further comprises a forward error correction module, configured to perform forward error correction encoding on a subsequently received audio stream based on the second audio stream.
In one aspect, a computer device is provided that includes one or more processors and one or more memories having at least one computer program stored therein, the computer program being loaded and executed by the one or more processors to implement the audio transcoding method.
In one aspect, a computer readable storage medium having at least one computer program stored therein is provided, the computer program being loaded and executed by a processor to implement the audio transcoding method.
In one aspect, a computer program product or computer program is provided, the computer program product or computer program comprising a program code stored in a computer readable storage medium, the program code being read from the computer readable storage medium by a processor of a computer device, the program code being executed by the processor, causing the computer device to perform the above-described audio transcoding method.
According to the technical scheme provided by the embodiments of the application, transcoding an audio stream requires no full parameter extraction: entropy decoding alone yields the audio characteristic parameters and the excitation signal. Re-quantization is then applied directly to the excitation signal and the audio characteristic parameters, without the related processing of a full time-domain decode and re-encode. Finally, entropy coding of the re-quantized excitation signal and audio characteristic parameters produces a second audio stream with a lower code rate. Because entropy decoding and entropy encoding are computationally cheap, and the heavy time-domain signal processing is avoided, the computational load is greatly reduced, improving the overall speed and efficiency of audio transcoding while preserving sound quality.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an encoder according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an implementation environment of an audio transcoding method according to an embodiment of the present application;
FIG. 3 is a flowchart of an audio transcoding method according to an embodiment of the present application;
FIG. 4 is a flowchart of an audio transcoding method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a decoder according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an audio transcoder according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a method for forward error correction coding according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an audio transcoding device according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used to distinguish between identical or similar items having substantially the same function; it should be understood that "first," "second," and "nth" imply no logical or chronological dependency, nor do they limit the number of items or the order of execution.
The term "at least one" in the present application means one or more, and the meaning of "a plurality of" means two or more.
Cloud Technology is a general term for the network, information, integration, management-platform, and application technologies applied under the cloud-computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems require large amounts of computing and storage resources, for example video websites, picture websites, and many portal sites. With the rapid development of the internet industry, every item may in the future carry its own identification mark, which must be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong back-end system support, which can only be realized through cloud computing.
Cloud computing is a computing model that distributes computing tasks across a resource pool formed by large numbers of computers, enabling various application systems to acquire computing power, storage space, and information services as needed. The network that provides the resources is called the "cloud." From the user's perspective, the resources in the cloud appear infinitely expandable; they can be acquired at any time, used on demand, expanded at any time, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (cloud platform for short, generally called an IaaS (Infrastructure as a Service) platform) is established, in which multiple types of virtual resources are deployed for external clients to select and use.
By logical function, a PaaS (Platform as a Service) layer can be deployed on the IaaS (Infrastructure as a Service) layer, and a SaaS (Software as a Service) layer can be deployed above the PaaS layer; SaaS can also be deployed directly on IaaS. PaaS is a platform on which software runs, such as a database or a web container. SaaS covers a wide variety of business software, such as web portals and SMS bulk senders. Generally, SaaS and PaaS are upper layers relative to IaaS.
Cloud conferencing is an efficient, convenient, low-cost form of conferencing based on cloud computing technology. Through a simple internet interface, users can quickly and efficiently share voice, data files, and video with teams and clients all over the world, while the cloud conference service provider handles the complex technologies in the conference, such as data transmission and processing.
At present, domestic cloud conferencing mainly focuses on service content in the SaaS (Software as a Service) mode, including service forms such as telephone, network, and video; video conferencing based on cloud computing is called cloud conferencing.
In the cloud conference era, data transmission, processing, and storage are all handled by the computing resources of video conference providers; users can hold efficient remote conferences without purchasing expensive hardware or installing cumbersome software.
The cloud conference system supports dynamic multi-server cluster deployment and provides multiple high-performance servers, greatly improving conference stability, security, and usability. In recent years, video conferencing has been welcomed by many users because it greatly improves communication efficiency, continuously reduces communication cost, and upgrades internal management; it has been widely used in transportation, finance, telecom operators, education, enterprises, and other fields. There is no doubt that, with cloud computing, video conferencing is even more attractive in its convenience, speed, and ease of use, which will surely stimulate a new wave of video conference applications.
Entropy coding: coding that, in accordance with the entropy principle, loses no information during the coding process; the information entropy is the average information content of the source.
Quantization: the process of approximating the continuous values of a signal (or a large number of possible discrete values) by a finite number of (or fewer) discrete values.
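The two definitions above can be made concrete in a few lines; the functions below are illustrative sketches, not the embodiment's implementations.

```python
import math

def source_entropy(probabilities):
    # Information entropy: the average information content, in bits,
    # of a source emitting symbols with the given probabilities.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def quantize_uniform(value, step):
    # Quantization: map a continuous value onto a finite grid of levels.
    return round(value / step) * step
```

An entropy coder cannot, on average, spend fewer bits per symbol than `source_entropy` of the symbol distribution, which is why re-quantizing to a coarser grid (fewer, more predictable levels) is what actually lowers the achievable code rate.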
In-band forward error correction: in-band forward error correction, also called forward error correction (Forward Error Correction, abbreviated FEC), is a method for increasing the reliability of data communication. In a one-way communication channel, once an error is found, the receiver has no way to request retransmission. FEC transmits redundant information along with the data, allowing the receiver to reconstruct the data when errors occur during transmission.
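The FEC idea just described — send redundancy with the data so the receiver can reconstruct without retransmission — can be shown with the simplest possible scheme, a single XOR parity packet protecting a group. Real in-band FEC in speech codecs instead embeds a low-rate re-encoding of earlier frames, so this is an analogy, not the embodiment's method.

```python
def add_fec(packets):
    # Append one XOR-parity packet: the redundancy travels in-band with
    # the data, so any single lost packet in the group can be rebuilt.
    parity = bytes(len(packets[0]))
    for p in packets:
        parity = bytes(a ^ b for a, b in zip(parity, p))
    return packets + [parity]

def recover(received):
    # XOR every packet that arrived (None marks the loss);
    # the result is the missing packet.
    size = len(next(p for p in received if p is not None))
    out = bytes(size)
    for p in received:
        if p is not None:
            out = bytes(a ^ b for a, b in zip(out, p))
    return out
```

The receiver never contacts the sender: the parity alone carries enough information to undo one loss, at the cost of one extra packet per group.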
Audio coding is classified into multi-rate coding (Multi-rate Coding) and scalable coding (Scalable Coding). A scalable code stream has the following characteristic: the low-rate code stream is a subset of the high-rate code stream, so only the low-rate core stream need be transmitted when the network is congested; a multi-rate code stream does not have this flexibility. In general, however, at the same code rate the decoding result of a multi-rate code stream is better than that of a scalable code stream.
Opus is one of the most widely used audio encoders. The Opus encoder is a multi-rate encoder and cannot, like a scalable encoder, generate a code stream from which layers can be stripped. FIG. 1 provides a schematic structural diagram of the Opus encoder. As can be seen from FIG. 1, when encoding audio the Opus encoder needs to perform voice activity detection (VAD, Voice Activity Detection), pitch analysis, noise shaping analysis, LTP (Long-Term Prediction), gain processing, LSF (Line Spectral Frequency) quantization, prediction, pre-filtering, noise shaping quantization, and range coding. When the audio needs to be transcoded, the Opus decoder must first decode the encoded audio, and the Opus encoder must then re-encode the decoded audio to change its code rate; because of the many steps involved, encoding with the Opus encoder has high complexity.
In the embodiment of the application, the computer device can be provided as a terminal or a server, and an implementation environment formed by the terminal and the server is described below.
Fig. 2 is a schematic diagram of an implementation environment of an audio transcoding method according to an embodiment of the present application, and referring to fig. 2, the implementation environment may include a terminal 210 and a server 240.
The terminal 210 is connected to the server 240 through a wireless or wired network. Optionally, the terminal 210 is a smartphone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, or the like, but is not limited thereto. A social application is installed and running on the terminal 210.
Optionally, the server 240 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), big data, and artificial intelligence platforms. In some embodiments, the server 240 may serve as the execution body of the audio transcoding method provided by the embodiments of the present application: the terminal 210 collects audio signals and sends them to the server 240, which transcodes them and sends the transcoded audio to other terminals.
Alternatively, the terminal 210 refers broadly to one of a plurality of terminals, and embodiments of the present application are illustrated with respect to terminal 210 only.
Those skilled in the art will recognize that the number of terminals may be greater or smaller. For example, the implementation environment may include only one terminal, or tens or hundreds of terminals, or more. The number of terminals and the device types are not limited in the embodiments of the present application.
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein.
Having described the implementation environment, the application scenarios are described below with reference to it. In the following description, the terminal is the terminal 210 and the server is the server 240 in the above implementation environment. The technical scheme can be applied to various social applications, for example an online conference application, an instant messaging application, or a live streaming application, which is not limited by the embodiments of the present application.
In an online conference application, there are often multiple terminals on which the application is installed, and the user of each terminal is a participant in the conference. The terminals are connected to the server through a network. During the conference, the server transcodes the voice signals uploaded by the terminals and sends the transcoded signals to the terminals for playback, realizing the online conference. Because the terminals may be in different network environments, the server can adopt the technical scheme provided by the embodiments of the present application: it converts the voice signals to different code rates according to the network bandwidths of the different terminals and sends streams of different code rates to different terminals, so that each terminal can participate normally. For a terminal with a large network bandwidth, the server transcodes the voice signal to a higher code rate; a higher code rate means higher voice quality, so the larger bandwidth is fully utilized and the conference quality improves. For a terminal with a small network bandwidth, the server transcodes the voice signal to a lower code rate; a lower code rate means less bandwidth is occupied, so the voice signal can be delivered in real time and the terminal can stay in the conference normally. In addition, networks fluctuate: the bandwidth available to the same terminal may be larger at one moment and smaller at another. The server can also adjust the transcoding code rate according to these fluctuations to keep the conference running normally.
In some embodiments, the online conference is also referred to as a cloud conference.
In the instant messaging application, a user can perform voice chat by installing the instant messaging application on the terminal. Taking two users performing voice chat through the instant messaging application as an example, the instant messaging application obtains the voice signals of the two users during the chat through their terminals and sends the voice signals to the server; the server sends the voice signals to the two terminals respectively, and the instant messaging application plays the voice signals through the terminals, so that the voice chat between the two users is realized. The network environments of the two chat parties may differ, that is, the network bandwidth of one party may be larger while that of the other party is smaller. In this case, the server can adopt the technical scheme provided by the embodiment of the present application to transcode the voice signals, converting them to suitable code rates before forwarding them to the two terminals, so that the two users can carry out the voice chat normally.
In the live broadcast application, the anchor end used by the anchor can acquire the anchor's live voice signals and send them to the live broadcast server; the live broadcast server sends the live voice signals to the audience ends used by different audiences, and after receiving the live voice signals, the audience ends play them, so that the audiences can hear the anchor's voice during the live broadcast. Because different audience ends may be in different network environments, the server can adopt the technical scheme provided by the embodiment of the present application to transcode the live voice signals according to the network environments of the different audience ends, that is, convert the live voice signals into different code rates according to the different network bandwidths of the audience ends and send the voice signals at those code rates to the corresponding audience ends, so that all audience ends can play the live audio normally. That is, for an audience end with a larger network bandwidth, the server can transcode the live voice signal to a higher code rate; a higher code rate means higher voice quality, so the larger bandwidth is fully utilized and the quality of the live broadcast is improved. For an audience end with a smaller network bandwidth, the server can transcode the live voice signal to a lower code rate; a lower code rate means smaller bandwidth occupation, so the live voice signal can be sent to the audience end in real time and the audience can watch the live broadcast normally. In addition, the network has fluctuations, that is, for the same audience end, the network bandwidth may be larger at one time and smaller at another.
The server can also adjust the transcoding code rate according to the fluctuation condition of the network bandwidth so as to ensure the normal running of live broadcast.
In addition to the above three application scenarios, the technical solution provided in the embodiment of the present application may be applied to other audio transmission scenarios, for example, in a broadcast television transmission scenario or in a satellite communication scenario, which is not limited in this embodiment of the present application.
Of course, the audio transcoding method provided by the embodiment of the application can be applied to a server as a cloud service, and also can be applied to a terminal, and the terminal carries out quick transcoding on the audio.
After the implementation environment and the application scenario of the embodiment of the present application are introduced, the following describes the technical solution provided by the embodiment of the present application, and in the following description process, taking a main body for executing an audio transcoding method as an example of a server, referring to fig. 3, the method includes:
301. The server performs entropy decoding on the first audio stream with the first code rate to obtain audio characteristic parameters and excitation signals of the first audio stream, wherein the excitation signals are quantized voice signals.
In some embodiments, the first audio stream is a high-rate audio stream, and the audio characteristic parameters include a signal gain, an LSF (Line Spectral Frequency) parameter, an LTP (Long-Term Prediction) parameter, a pitch delay, and the like. Quantization refers to the process of approximating the continuous values of a signal with a finite number of discrete values: the speech signal is a continuous signal, and the excitation signal obtained after quantization is a discrete signal, which is convenient for the server to process subsequently. In some embodiments, the high code rate refers to the code rate of the audio stream uploaded to the server by the terminal; in other embodiments, the high code rate may be a code rate higher than a certain code rate threshold, for example, if the code rate threshold is 1 Mbps, then a code rate higher than 1 Mbps is also referred to as a high code rate. Of course, the definition of the high code rate may be different in different coding standards, and the embodiment of the present application is not limited thereto.
302. The server acquires a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal.
In some embodiments, the excitation signal is a discrete signal, and the server is capable of reverting the excitation signal to a time-domain audio signal for subsequent audio transcoding based on the audio characteristic parameters.
303. The server re-quantizes the excitation signal and the audio feature parameters based on the time-domain audio signal and the target transcoding rate.
In some embodiments, the re-quantization may also be referred to as noise-shaping quantization (Noise Shaping Quantization, NSQ), i.e. a compression process, and the server re-quantizes the excitation signal and the audio feature parameters, i.e. re-compresses the excitation signal and the audio feature parameters.
304. The server performs entropy coding on the re-quantized audio characteristic parameters and the re-quantized excitation signals to obtain a second audio stream with a second code rate, wherein the second code rate is lower than the first code rate.
After the audio characteristic parameters and the excitation signal are re-quantized, they are re-compressed; entropy coding the re-quantized audio characteristic parameters and the re-quantized excitation signal then directly yields a second audio stream with a lower code rate.
According to the technical scheme provided by the embodiment of the present application, when the audio stream is transcoded, a complete parameter extraction process does not need to be executed; instead, entropy decoding is used to acquire the audio characteristic parameters and the excitation signal. The re-quantization is performed directly on the excitation signal and the audio characteristic parameters, and does not involve processing of the time-domain signal. Finally, entropy coding is performed on the excitation signal and the audio characteristic parameters to obtain a second audio stream with a smaller code rate. Because the computational cost of entropy decoding and entropy encoding is small, and no time-domain signal processing is required, the amount of computation can be greatly reduced, and the speed and efficiency of audio transcoding are improved on the whole on the premise of guaranteeing the sound quality.
The steps 301 to 304 are a simple introduction of the embodiment of the present application, and the technical solution provided by the embodiment of the present application will be more clearly described below with reference to fig. 4, and the method includes:
401. The server performs entropy decoding on the first audio stream with the first code rate to obtain audio characteristic parameters and excitation signals of the first audio stream, wherein the excitation signals are quantized voice signals.
In one possible implementation, a server obtains probabilities of occurrence of a plurality of coding units in a first audio stream. The server decodes the first audio stream based on the occurrence probability to obtain a plurality of decoding units respectively corresponding to the plurality of encoding units. The server combines the plurality of decoding units to obtain the audio characteristic parameters and the excitation signal of the first audio stream.
The above-described embodiment is one possible implementation of entropy decoding, and in order to more clearly describe the above-described embodiment, an entropy encoding method corresponding to the above-described embodiment will be described first.
For example, in order to simplify the process, it is assumed that the audio characteristic parameter and excitation signal of the first audio stream are "MNOOP", each letter being a coding unit, where the occurrence probabilities of "M", "N", "O", and "P" in "MNOOP" are 0.2, 0.2, 0.4, and 0.2, respectively, and the initial interval corresponding to "MNOOP" is [0, 100000]. The server divides the interval [0, 100000] into four subintervals according to the occurrence probabilities of "M", "N", "O" and "P": M: [0, 20000], N: [20000, 40000], O: [40000, 80000] and P: [80000, 100000], wherein the ratio between the lengths of the subintervals is the same as the ratio between the corresponding occurrence probabilities. Since in "MNOOP" the first letter is "M", the server selects the first subinterval M: [0, 20000] as the base interval for the subsequent entropy encoding. The server divides the interval M: [0, 20000] according to the occurrence probabilities of "M", "N", "O" and "P" into four subintervals: MM: [0, 4000], MN: [4000, 8000], MO: [8000, 16000] and MP: [16000, 20000]. Since in "MNOOP" the first two letters are "MN", the server selects the second subinterval MN: [4000, 8000] as the base interval for the subsequent entropy encoding. The server divides the interval MN: [4000, 8000] according to the occurrence probabilities of "M", "N", "O" and "P" into four subintervals: MNM: [4000, 4800], MNN: [4800, 5600], MNO: [5600, 7200] and MNP: [7200, 8000]. Since the first three letters in "MNOOP" are "MNO", the server selects the third subinterval MNO: [5600, 7200] as the base interval for the subsequent entropy encoding. The server divides the interval MNO: [5600, 7200] according to the occurrence probabilities of "M", "N", "O" and "P" into four subintervals: MNOM: [5600, 5920], MNON: [5920, 6240], MNOO: [6240, 6880] and MNOP: [6880, 7200].
Since in "MNOOP" the first four letters are "MNOO", the server takes the third subinterval MNOO: [6240, 6880] as the base interval for the subsequent entropy encoding. The server divides the interval MNOO: [6240, 6880] according to the occurrence probabilities of "M", "N", "O" and "P" into four subintervals: MNOOM: [6240, 6368], MNOON: [6368, 6496], MNOOO: [6496, 6752] and MNOOP: [6752, 6880], whereby the interval [6752, 6880] for entropy encoding "MNOOP" is obtained. The server can use any value in the interval [6752, 6880] to represent the encoding result of "MNOOP"; for example, 6800 represents "MNOOP", and in the above embodiment, 6800 is the first audio stream.
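The interval-subdivision procedure above can be sketched in Python. The function name is illustrative, and exact rational arithmetic is used so that the interval endpoints come out exactly as in the worked example:

```python
from fractions import Fraction as F

def interval_encode(message, probs, low=F(0), high=F(100000)):
    """Sketch of interval (arithmetic) entropy coding as in the example.

    probs maps each coding unit to its occurrence probability; the function
    narrows [low, high) unit by unit and returns the final interval. Any
    value inside that interval identifies the whole message.
    """
    # cumulative probability table: symbol -> (cum_low, cum_high)
    cum, acc = {}, F(0)
    for sym, p in probs.items():
        cum[sym] = (acc, acc + p)
        acc += p
    for sym in message:
        width = high - low
        lo_f, hi_f = cum[sym]
        low, high = low + lo_f * width, low + hi_f * width
    return low, high

probs = {"M": F(1, 5), "N": F(1, 5), "O": F(2, 5), "P": F(1, 5)}
low, high = interval_encode("MNOOP", probs)  # -> [6752, 6880]
```

Narrowing the interval five times with these probabilities reproduces the interval [6752, 6880] derived step by step above.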
The above embodiment will be described based on the entropy encoding.
Taking 6800 as the first audio stream as an example, the server obtains the occurrence probabilities of the plurality of coding units in the first audio stream, that is, the occurrence probabilities of "M", "N", "O", and "P" are 0.2, 0.2, 0.4, and 0.2, respectively. The server builds the same initial interval [0, 100000] as in the entropy coding process, and divides the interval [0, 100000] into four subintervals according to the occurrence probabilities of "M", "N", "O" and "P": M: [0, 20000], N: [20000, 40000], O: [40000, 80000] and P: [80000, 100000]. Since the first audio stream 6800 is in the first subinterval M: [0, 20000], the server uses this interval [0, 20000] as the base interval for the subsequent entropy decoding, and M as the first decoded unit. The server divides the interval M: [0, 20000] according to the occurrence probabilities of "M", "N", "O" and "P" into four subintervals: MM: [0, 4000], MN: [4000, 8000], MO: [8000, 16000] and MP: [16000, 20000]. Since the first audio stream 6800 is in the second subinterval MN: [4000, 8000], the server takes this subinterval [4000, 8000] as the base interval for the subsequent entropy decoding, and N as the second decoded unit. The server divides the interval MN: [4000, 8000] according to the occurrence probabilities of "M", "N", "O" and "P" into four subintervals: MNM: [4000, 4800], MNN: [4800, 5600], MNO: [5600, 7200] and MNP: [7200, 8000]. Since the first audio stream 6800 is in the third subinterval MNO: [5600, 7200], this subinterval [5600, 7200] is taken as the base interval for the subsequent entropy decoding, and O as the third decoded unit.
The server divides the interval MNO: [5600, 7200] according to the occurrence probabilities of "M", "N", "O" and "P" into four subintervals: MNOM: [5600, 5920], MNON: [5920, 6240], MNOO: [6240, 6880] and MNOP: [6880, 7200]. Since the first audio stream 6800 is in the third subinterval MNOO: [6240, 6880], the server takes this subinterval [6240, 6880] as the base interval for the subsequent entropy decoding, and O as the fourth decoded unit. The server divides the interval MNOO: [6240, 6880] according to the occurrence probabilities of "M", "N", "O" and "P" into four subintervals: MNOOM: [6240, 6368], MNOON: [6368, 6496], MNOOO: [6496, 6752] and MNOOP: [6752, 6880]. Because the first audio stream 6800 is in the fourth subinterval MNOOP: [6752, 6880], the server takes P as the fifth decoded unit. The server combines the five decoded units "M", "N", "O", "O" and "P" to obtain "MNOOP", i.e. the audio feature parameters and excitation signal of the first audio stream.
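The decoding side mirrors the encoder: at each step the server finds which subinterval contains the coded value, emits the corresponding unit, and narrows the interval. A minimal sketch (function name illustrative, exact rational arithmetic as before):

```python
from fractions import Fraction as F

def interval_decode(value, probs, n_units, low=F(0), high=F(100000)):
    """Decode n_units coding units from a single interval-coded value."""
    # cumulative probability table in a fixed symbol order
    cum, acc = [], F(0)
    for sym, p in probs.items():
        cum.append((sym, acc, acc + p))
        acc += p
    out = []
    for _ in range(n_units):
        width = high - low
        for sym, lo_f, hi_f in cum:
            lo2, hi2 = low + lo_f * width, low + hi_f * width
            if lo2 <= value < hi2:      # the value falls in this subinterval
                out.append(sym)
                low, high = lo2, hi2    # narrow to it and continue
                break
    return "".join(out)

probs = {"M": F(1, 5), "N": F(1, 5), "O": F(2, 5), "P": F(1, 5)}
decoded = interval_decode(F(6800), probs, 5)  # -> "MNOOP"
```

Feeding in the value 6800 with the same probability table recovers "MNOOP", matching the five decoding steps traced above.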
In order to more clearly describe the technical solution provided by the embodiments of the present application, the following describes the foregoing embodiments on the basis of entropy decoding in the foregoing examples.
In a possible implementation manner, referring to fig. 5, the server inputs the first audio stream into the interval decoder 501, and performs entropy decoding on the first audio stream, where the entropy decoding process is referred to the above example and will not be described herein. After the first audio stream is entropy decoded by the section decoder 501, an entropy decoded audio stream is obtained. The server inputs the entropy-decoded audio stream to the parameter decoder 502, and outputs a flag bit pulse, a signal gain, and an audio feature parameter through the parameter decoder 502. The server inputs the flag bit pulse and the signal gain to the excitation signal generator 503 to obtain an excitation signal.
402. The server acquires a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal.
In one possible implementation manner, the server processes the excitation signal based on the audio characteristic parameter to obtain a time domain audio signal corresponding to the excitation signal.
For example, referring to fig. 5, the server inputs the audio feature parameters and excitation signals to a frame reconstruction module 504, and the frame reconstructed audio signal is output by the frame reconstruction module 504. The server inputs the audio signal after frame reconstruction into a sampling rate conversion filter 505, and resamples and encodes the audio signal through the sampling rate conversion filter 505 to obtain a time domain audio signal corresponding to the excitation signal. Alternatively, if the frame-reconstructed audio signal is a stereo audio signal, the server can input the frame-reconstructed audio signal to the stereo separation module 506 to separate the frame-reconstructed audio signal into mono audio signals before inputting the frame-reconstructed audio signal to the sample rate conversion filter. The server inputs the mono audio signal to the sample rate conversion filter 505 for resampling encoding to obtain a time domain audio signal corresponding to the excitation signal.
The following describes a method for frame reconstruction of an excitation signal by a frame reconstruction module:
In one possible implementation, the audio characteristic parameters include signal gain, LSF (Line Spectral Frequency) coefficients, LTP (Long-Term Prediction) coefficients, a pitch delay, and the like. The frame reconstruction module comprises an LTP synthesis filter and an LPC (Linear Predictive Coding) synthesis filter. The server inputs the excitation signal, together with the pitch delay and LTP coefficients among the audio characteristic parameters, into the LTP synthesis filter, and the LTP synthesis filter performs a first frame reconstruction on the excitation signal to obtain a first filtered audio signal. The server inputs the first filtered audio signal, the LSF coefficients and the signal gain into the LPC synthesis filter, and the LPC synthesis filter performs a second frame reconstruction on the first filtered audio signal to obtain a second filtered audio signal. The server fuses the first filtered audio signal and the second filtered audio signal to obtain the frame-reconstructed audio signal.
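The two-stage synthesis described above can be sketched as follows. The direct-form recursions, filter order, and the specific coefficient values are illustrative assumptions, not the patent's exact filters:

```python
def ltp_synthesis(excitation, pitch_lag, ltp_coeff):
    # long-term (pitch) synthesis filter: y[n] = e[n] + b * y[n - lag]
    y = [0.0] * len(excitation)
    for n, e in enumerate(excitation):
        past = y[n - pitch_lag] if n >= pitch_lag else 0.0
        y[n] = e + ltp_coeff * past
    return y

def lpc_synthesis(signal, lpc_coeffs, gain):
    # short-term (LPC) synthesis filter: s[n] = g*x[n] + sum_k a_k * s[n-k]
    s = [0.0] * len(signal)
    for n, x in enumerate(signal):
        acc = gain * x
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc += a * s[n - k]
        s[n] = acc
    return s

# toy frame: a single excitation pulse passed through both stages
first_filtered = ltp_synthesis([1.0, 0.0, 0.0, 0.0, 0.0],
                               pitch_lag=2, ltp_coeff=0.5)
second_filtered = lpc_synthesis(first_filtered, lpc_coeffs=[0.9], gain=1.0)
```

With lag 2 and coefficient 0.5, the pulse echoes at every second sample with halved amplitude, which is the periodic (pitch) structure the LTP stage restores before the LPC stage adds the short-term spectral envelope.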
403. The server obtains a first quantization parameter through at least one iteration process based on the target transcoding rate, wherein the first quantization parameter is used for adjusting the first rate of the first audio stream to the target transcoding rate.
In one possible implementation, the server obtains the first quantization parameter via at least one iterative process. In any iterative process, the server determines a first alternative quantization parameter based on the target transcoding rate. The server simulates the re-quantization process of the excitation signal and the audio feature parameters based on the first alternative quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameters. The server simulates the entropy coding process of the first signal and the first parameter to obtain an analog audio stream. In response to the analog audio stream meeting a first target condition, and at least one of the time-domain audio signal and the first signal, the target transcoding rate and the rate of the analog audio stream, and the number of iterations meeting a second target condition, the server determines the first alternative quantization parameter as the first quantization parameter.
In the above embodiment, the processing procedure includes four parts. That is, the server first determines an alternative quantization parameter and re-quantizes the excitation signal and the audio feature parameters according to it, obtaining the first signal and the first parameter. The server then simulates the entropy encoding of the first signal and the first parameter to obtain the analog audio stream. Next, the server judges whether the analog audio stream meets the requirements, which are based on the first target condition and the second target condition. When the first target condition and the second target condition are satisfied at the same time, the server can end the iteration and output the first quantization parameter; when either of the two conditions is not satisfied, the server iterates again.
In order to more clearly describe the above embodiment, the above embodiment will be described below in four parts.
The first part is a description of a manner in which the server determines the first alternative quantization parameter based on the target transcoding rate.
The target transcoding rate can be determined by the server according to practical situations, for example, according to network bandwidth, so that the target transcoding rate is matched with the network bandwidth.
In some embodiments, the first alternative quantization parameter represents a quantization step, and the larger the quantization step, the larger the compression ratio and the smaller the amount of quantized data. The smaller the quantization step size, the smaller the compression ratio, and the larger the amount of quantized data. In some embodiments, the target transcoding rate is lower than the first rate of the first audio stream, then during the audio transcoding process, i.e., a rate reduction process. In this process, the server can generate a first alternative quantization parameter based on the target transcoding rate, and after re-quantizing the excitation signal and the audio feature parameter with the first alternative quantization parameter, an audio stream with a lower bitrate can be obtained, where the bitrate of the audio stream is close to the target transcoding rate.
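The step-size/compression trade-off can be seen empirically: a larger quantization step collapses the signal onto fewer levels, so the empirical entropy (the bits per symbol an entropy coder would need) drops. A toy demonstration with an assumed ramp signal:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(values):
    # empirical Shannon entropy of the quantized values
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

signal = [x * 0.37 for x in range(64)]   # toy "signal" (assumption)
coarse = [round(v / 8) for v in signal]  # large step -> few levels, more compression
fine = [round(v / 1) for v in signal]    # small step -> many levels, more data
```

Here the coarse quantization needs fewer bits per symbol than the fine one, which is exactly why lowering the target transcoding rate pushes the server toward a larger quantization step.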
And a second part, which is used for simulating the re-quantization process of the excitation signal and the audio characteristic parameters based on the first alternative quantization parameters by the server, so as to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameters.
The simulation means that the server does not re-quantize the excitation signal and the audio feature parameters, but performs a re-quantization simulation based on the first alternative quantization parameters, so as to determine the first quantization parameters used in the actual quantization process.
In one possible implementation manner, the server simulates a discrete cosine transform process of the excitation signal and a discrete cosine transform process of the audio characteristic parameter respectively to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio characteristic parameter. And the server divides the second signal and the second parameter with the first alternative quantization parameter respectively and then performs rounding to obtain the first signal and the first parameter.
Taking re-quantization of the excitation signal as an example, in the simulation process the server performs a discrete cosine transform on the excitation signal to obtain the second signal. The server then re-quantizes the second signal with the quantization step corresponding to the first alternative quantization parameter, that is, divides the second signal by the quantization step represented by the first alternative quantization parameter and rounds, to obtain the first signal.
For example, assume the excitation signal is a matrix of sample values f(i) (the specific matrix is omitted here). The server can perform a discrete cosine transform on the excitation signal, that is, transform the excitation signal by the following formula (1) to obtain the second signal:

F(u) = Σ_{i=0}^{N−1} f(i)·cos[(2i+1)uπ/(2N)]  (1)

Wherein F(u) is the second signal, u is the generalized frequency variable, u = 0, 1, 2, …, N−1, f(i) is the excitation signal, N is the number of values in the excitation signal, and i is the index of a value in the excitation signal.
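The transform described here is an ordinary one-dimensional discrete cosine transform; a direct (unoptimized, unnormalized) implementation for illustration:

```python
import math

def dct(f):
    # direct DCT: F(u) = sum_i f(i) * cos((2i+1) * u * pi / (2N))
    N = len(f)
    return [sum(f[i] * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                for i in range(N))
            for u in range(N)]

coeffs = dct([1.0, 1.0, 1.0, 1.0])  # constant input: only F(0) is non-zero
```

A constant input concentrates all energy in the zero-frequency coefficient, which is the decorrelating property that makes the transformed signal easier to quantize and entropy-code.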
For convenience of explanation, take a second signal whose values include 195 and a quantization step size of 28 as an example. In some embodiments, the server can re-quantize the second signal by the following formula (2) to obtain the first signal.
Q(m)=round(m/S+0.5) (2)
Where Q () is a quantization function, m is a value in the second signal, round () is a rounded rounding function, and S is a quantization step size.
Taking the value 195 in the second signal as an example, the server can substitute 195 into formula (2), i.e., Q(195) = round(195/28 + 0.5) = round(7.464) = 7, so 7 is the result of quantizing 195. After the server applies formula (2) to each value of the second signal, the first signal can be obtained.
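Formula (2) in code (note that Python's built-in round uses banker's rounding at exact .5 ties; for the values in this example it matches ordinary rounding):

```python
def quantize(m, step):
    # formula (2): Q(m) = round(m / S + 0.5)
    return round(m / step + 0.5)

q = quantize(195, 28)  # round(195/28 + 0.5) = round(7.464...) = 7
```

Dividing by a larger step S maps more input values onto the same integer, which is how the re-quantization trades quality for a lower code rate.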
And the third part is used for simulating the entropy coding process of the first signal and the first parameter by the server and obtaining a mode of simulating the audio stream.
To describe an example of simulating entropy encoding of the first signal, suppose the server divides the first signal into four vectors: (7, −1, 0, 0)T, (0, −1, 0, 0)T, (0, 0, 0, 0)T and (0, 0, 0, 0)T. The server marks the vector (7, −1, 0, 0)T as A, the vector (0, −1, 0, 0)T as B, and the vector (0, 0, 0, 0)T as C. The first signal can thus be simplified as (ABCC). In the first signal (ABCC), the occurrence probabilities of the coding units "A", "B" and "C" are 0.25, 0.25 and 0.5, respectively, and the server generates an initial interval [0, 100000]. The server divides the initial interval [0, 100000] into three subintervals according to the occurrence probabilities of the coding units "A", "B" and "C": A: [0, 25000], B: [25000, 50000], and C: [50000, 100000]. Since the first letter in the first signal (ABCC) is "A", the server selects the first subinterval A: [0, 25000] as the base interval for the subsequent entropy encoding. The server divides the interval A: [0, 25000] according to the occurrence probabilities of the coding units "A", "B" and "C" into three subintervals: AA: [0, 6250], AB: [6250, 12500], and AC: [12500, 25000]. Since the second letter in the first signal (ABCC) is "B", the server selects the second subinterval AB: [6250, 12500] as the base interval for the subsequent entropy encoding. The server divides the interval AB: [6250, 12500] according to the occurrence probabilities of the coding units "A", "B" and "C" into three subintervals: ABA: [6250, 7812.5], ABB: [7812.5, 9375], and ABC: [9375, 12500]. Since the third letter in the first signal (ABCC) is "C", the server selects the third subinterval ABC: [9375, 12500] as the base interval for the subsequent entropy encoding.
The server divides the interval ABC: [9375, 12500] according to the occurrence probabilities of the coding units "A", "B" and "C" into three subintervals: ABCA: [9375, 10156.25], ABCB: [10156.25, 10937.5] and ABCC: [10937.5, 12500], whereby the interval in which the first signal (ABCC) is entropy coded is ABCC: [10937.5, 12500]. The server can use any value in the interval ABCC: [10937.5, 12500] to represent the first signal (ABCC), for example 12000.
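The same interval-subdivision sketch reproduces these endpoints (function name illustrative; exact rational arithmetic so 10937.5 comes out exactly):

```python
from fractions import Fraction as F

def interval_encode(message, probs, low=F(0), high=F(100000)):
    # narrow [low, high) one coding unit at a time, proportionally to probs
    cum, acc = {}, F(0)
    for sym, p in probs.items():
        cum[sym] = (acc, acc + p)
        acc += p
    for sym in message:
        width = high - low
        lo_f, hi_f = cum[sym]
        low, high = low + lo_f * width, low + hi_f * width
    return low, high

probs = {"A": F(1, 4), "B": F(1, 4), "C": F(1, 2)}
low, high = interval_encode("ABCC", probs)  # -> [10937.5, 12500]
```

Four narrowing steps with probabilities 0.25/0.25/0.5 land on the interval [10937.5, 12500] derived above, from which any value (such as 12000) can stand for the analog audio stream of (ABCC).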
If the entropy encoding process of the first signal and the first parameter is simulated to obtain a section [100, 130], the server can represent the simulated audio stream by using any value in the section [100, 130], for example, 120.
And a fourth section for explaining the first target condition and the second target condition.
In one possible implementation, the compliance of the analog audio stream with the first target condition refers to at least one of:
The code rate of the analog audio stream is less than or equal to the target transcoding code rate, and the audio stream quality parameter of the analog audio stream is greater than or equal to the quality parameter threshold. The audio stream quality parameters include signal-to-noise ratio, PESQ (Perceptual Evaluation of Speech Quality), POLQA (Perceptual Objective Listening Quality Analysis), and the like. The quality parameter threshold is set according to practical situations, for example according to the quality requirement of a voice call: when the quality requirement of the voice call is higher, the quality parameter threshold can be set higher, and when the quality requirement is lower, the threshold can be set lower.
In one possible implementation, that at least one of the time-domain audio signal and the first signal, the target transcoding rate and the rate of the analog audio stream, and the number of iterations meets the second target condition refers to at least one of the following:
The similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold; the difference between the target transcoding rate and the rate of the analog audio stream is less than or equal to a difference threshold; or the number of iterations is equal to the iteration number threshold. That is, in the iteration process, the similarity between the time-domain audio signal and the first signal is a first factor affecting iteration termination, the difference between the target transcoding rate and the rate of the analog audio stream is a second factor, and the number of iterations is a third factor; the server determines when to end the iteration through these three factors. In some embodiments, if the iteration number threshold is 3 and the current iteration number is 3, then even if the similarity between the time-domain audio signal and the first signal obtained by iteration is smaller than the similarity threshold, and the difference between the target transcoding rate and the rate of the analog audio stream is larger than the difference threshold, the iteration still ends because the number of iterations has reached the threshold. By limiting the second target condition, the server can acquire the first quantization parameter with fewer iterations, so that transcoding can be completed at a higher speed in a real-time voice call scene.
Under the limitation of the second target condition, the server does not perform a complete iterative process, which in some embodiments is a noise shaping quantization (NSQ) loop iteration. The limitation of the second target condition may also be called a greedy algorithm, and adopting the greedy algorithm greatly improves the speed of audio transcoding for the following reasons. First, since the first audio stream is the optimal quantization result at the high code rate, the server can directly search for alternative quantization parameters near the quantization parameters of the first audio stream. Second, when the excitation signal and the time-domain audio signal are compared, the number of iterations can be greatly reduced according to the three factors above. Of course, in more aggressive cases, for example when only 1 iteration is performed, the decoder may even be omitted and the audio transcoding performed directly, which is not limited by the embodiment of the present application.
In addition, in the iterative process, in response to the analog audio stream failing to meet the first target condition, or none of the similarity between the time-domain audio signal and the first signal, the difference between the target transcoding rate and the code rate of the analog audio stream, and the number of iterations meeting the second target condition, the server takes a second alternative quantization parameter determined based on the target transcoding rate as the input for the next iterative process. That is, when the iteration number threshold is greater than 1, if either the first target condition or the second target condition is not met, the server can redetermine a second alternative quantization parameter based on the target transcoding rate and perform the next iteration based on it.
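The exit logic of this iteration can be sketched with a toy rate model. The rate model, the doubling search, and the threshold values are assumptions for illustration, not the patent's NSQ loop:

```python
def simulated_rate(step):
    # toy model (assumption): bitrate falls as the quantization step grows
    return 64000.0 / step

def find_quantization_step(target_rate, max_iters=3):
    """Greedy search for a step whose simulated rate meets the target.

    Mirrors the two exits described above: the rate condition (first
    target condition) or the iteration-count cap (second target condition).
    """
    step = 1.0
    for it in range(1, max_iters + 1):
        rate = simulated_rate(step)
        if rate <= target_rate or it == max_iters:
            return step, rate
        step *= 2.0  # alternative quantization parameter for the next iteration

step, rate = find_quantization_step(target_rate=24000)
```

Capping the loop at a small iteration count is what keeps the search cheap enough for real-time voice calls, at the cost of possibly settling for a slightly suboptimal step.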
404. The server re-quantizes the excitation signal and the audio feature parameters based on the time domain audio signal and the first quantization parameter.
In one possible implementation manner, the server performs discrete cosine transform on the excitation signal and the audio feature parameter respectively to obtain a third signal corresponding to the excitation signal and a third parameter corresponding to the audio feature parameter. The server divides the third signal and the third parameter by the first quantization parameter respectively, and then rounds the results to obtain a re-quantized excitation signal and a re-quantized audio characteristic parameter. This embodiment and the second part of step 503 belong to the same inventive concept; for the implementation process, refer to the above description, which is not repeated here.
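The transform-divide-round step above can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the codec's actual implementation: a plain DCT-II stands in for the discrete cosine transform, and `q` stands in for the first quantization parameter.

```python
import numpy as np

def dct_ii(x):
    """Plain (unnormalized) DCT-II, as a stand-in for the transform above."""
    n = len(x)
    k = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * m / n))
                     for m in range(n)])

def requantize(excitation, features, q):
    """Transform each input, divide by the quantization parameter q, then round."""
    third_signal = dct_ii(np.asarray(excitation, dtype=float))
    third_param = dct_ii(np.asarray(features, dtype=float))
    return np.round(third_signal / q), np.round(third_param / q)
```

A larger `q` maps more transform coefficients to the same rounded value, which is what lowers the code rate at the cost of quantization error.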
405. The server performs entropy coding on the re-quantized audio characteristic parameters and the re-quantized excitation signals to obtain a second audio stream with a second code rate, wherein the second code rate is lower than the first code rate.
In one possible implementation, the server obtains the re-quantized audio feature parameters and the probability of occurrence of a plurality of coding units in the re-quantized excitation signal. The server encodes the plurality of encoding units based on the occurrence probability to obtain a second audio stream.
For example, for simplicity of description, it is assumed that the re-quantized audio feature parameter and the re-quantized excitation signal are "DEFFG", each letter being a coding unit, where the probabilities of occurrence of "D", "E", "F", and "G" in "DEFFG" are 0.2, 0.2, 0.4, and 0.2, respectively, and the initial interval corresponding to "DEFFG" is [0, 100000]. The server divides the interval [0, 100000] into four subintervals according to the probabilities of occurrence of "D", "E", "F" and "G": D: [0, 20000], E: [20000, 40000], F: [40000, 80000] and G: [80000, 100000], where the ratio between the lengths of the subintervals is the same as the ratio of the corresponding occurrence probabilities. Since the first letter in "DEFFG" is "D", the server selects the first subinterval D: [0, 20000] as the base interval for the subsequent entropy encoding. The server divides the interval D: [0, 20000] according to the probabilities of occurrence of "D", "E", "F" and "G" into four subintervals: DD: [0, 4000], DE: [4000, 8000], DF: [8000, 16000] and DG: [16000, 20000]. Since the first two letters in "DEFFG" are "DE", the server selects the second subinterval DE: [4000, 8000] as the base interval for the subsequent entropy encoding. The server divides the interval DE: [4000, 8000] according to the probabilities of occurrence of "D", "E", "F" and "G" into four subintervals: DED: [4000, 4800], DEE: [4800, 5600], DEF: [5600, 7200] and DEG: [7200, 8000]. Since the first three letters in "DEFFG" are "DEF", the server takes the third subinterval DEF: [5600, 7200] as the base interval for the subsequent entropy encoding. The server divides the interval DEF: [5600, 7200] according to the probabilities of occurrence of "D", "E", "F" and "G" into four subintervals: DEFD: [5600, 5920], DEFE: [5920, 6240], DEFF: [6240, 6880] and DEFG: [6880, 7200].
Since the first four letters in "DEFFG" are "DEFF", the server takes the third subinterval DEFF: [6240, 6880] as the base interval for the subsequent entropy encoding. The server divides the interval DEFF: [6240, 6880] according to the probabilities of occurrence of "D", "E", "F" and "G" into four subintervals: DEFFD: [6240, 6368], DEFFE: [6368, 6496], DEFFF: [6496, 6752] and DEFFG: [6752, 6880], whereby the interval [6752, 6880] for entropy encoding "DEFFG" is obtained. The server can use any value in the interval [6752, 6880] to represent the encoding result of "DEFFG"; for example, 6800 denotes "DEFFG", and in the above example 6800 is the second audio stream.
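The interval narrowing in this example can be reproduced with a short sketch. This is an illustration of the arithmetic-style entropy coding walked through above, not the codec's actual implementation; the probabilities are those assumed for "DEFFG".

```python
def narrow_interval(message, probs, low=0, high=100000):
    """Repeatedly subdivide [low, high) by symbol probability and keep the
    subinterval of the next symbol, as in the "DEFFG" example above."""
    for ch in message:
        span = high - low
        start = low
        for sym, p in probs.items():  # subintervals in fixed symbol order
            width = int(span * p)
            if sym == ch:
                low, high = start, start + width
                break
            start += width
    return low, high

probs = {"D": 0.2, "E": 0.2, "F": 0.4, "G": 0.2}
print(narrow_interval("DEFFG", probs))  # (6752, 6880)
```

Any value inside the final interval, such as 6800, uniquely identifies the whole message given the same probability table.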
Optionally, after step 505, the audio transcoding method provided by the embodiment of the present application can also be combined with other audio processing methods to improve the quality of audio transcoding. For example, the audio transcoding method provided by the embodiment of the application can be combined with a Forward Error Correction (FEC) encoding method. During the transmission of an audio stream, errors and jitter may occur, which may degrade the quality of the audio transmission. On this basis, the audio may be encoded by means of forward error correction, the essence of which is to add redundant information to the audio so that, even if errors occur, they can be corrected; the redundant information is information related to the N frames preceding the current audio frame, where N is a positive integer.
In one possible implementation, the server performs forward error correction encoding on a subsequently received audio stream based on the second audio stream.
For example, assuming that a segment of an audio stream is one audio frame, the second audio stream is denoted as the T-1 frame, and the audio stream received from the terminal is denoted as the T frame, then when encoding the T frame the server can encode the T-1 frame, that is, the second audio stream, as redundant information in the forward error correction encoding of the T frame, so as to obtain an encoded FEC code stream, where T is a positive integer. Because the code rate of the T-1 frame has been reduced by the audio transcoding method provided by the embodiment of the application, the overall code rate of the encoded FEC code stream is also reduced, thereby improving the network robustness, that is, the ability to resist network fluctuation, during transmission of the audio stream on the premise of ensuring the audio quality.
In other possible embodiments, referring to fig. 6, if the server is currently encoding the T frame, then for the T-1 frame and the T-2 frame, the server may adjust the code rates of the T-1 frame and the T-2 frame by using the audio transcoding method provided by the embodiment of the present application to reduce their code rates, and encode the adjusted T-1 frame, the adjusted T-2 frame, and the T frame by using the in-band forward error correction method to obtain an encoded FEC code stream. Since the code rates of the T-1 frame and the T-2 frame are reduced, the overall code rate of the encoded FEC code stream is also reduced, thereby improving the network robustness when the audio stream is transmitted on the premise of ensuring the audio quality.
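The in-band FEC idea above can be sketched as follows. All names here (`pack_with_inband_fec`, `transcode_down`, the payload layout) are hypothetical illustrations, not the actual codec API: each frame T is sent together with reduced-bitrate copies of the T-1 and T-2 frames, so a lost packet can be approximated from a neighbor's redundancy.

```python
def pack_with_inband_fec(frames, transcode_down):
    """Build per-frame payloads: frame T plus reduced-rate copies of T-1 and T-2.

    frames: list of encoded frames (e.g. byte strings) in transmission order.
    transcode_down: callable that lowers a frame's code rate, standing in for
    the audio transcoding method described above.
    """
    payloads = []
    for t, frame in enumerate(frames):
        # Redundant information covers the previous two frames, when they exist.
        redundancy = [transcode_down(frames[k]) for k in (t - 1, t - 2) if k >= 0]
        payloads.append({"frame": frame, "redundancy": redundancy})
    return payloads
```

Because the redundant copies are carried at a lower code rate than the originals, the per-packet overhead stays small, which is the point of transcoding before FEC packing.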
According to the technical scheme provided by the embodiment of the application, when the audio stream is transcoded, a complete parameter extraction process does not need to be executed; entropy decoding is adopted to acquire the audio characteristic parameters and the excitation signal, that is, a more aggressive greedy algorithm is adopted. The re-quantization is performed only on the excitation signal and the audio characteristic parameters and does not involve processing of the time-domain signal. Finally, entropy coding is performed on the excitation signal and the audio characteristic parameters to obtain a second audio stream with a lower code rate. The complexity of entropy decoding and entropy encoding is almost negligible, so their operation amount is small; since the time-domain signal does not need to be processed, the operation amount can be greatly reduced, and the speed and efficiency of audio transcoding are improved overall on the premise of ensuring the audio quality.
In addition, an embodiment of the present application further provides an audio transcoder, where the audio transcoder has a structure shown in fig. 7, and the audio transcoder includes: the device comprises an entropy decoding unit 701, a time domain decoding unit 702, a quantization unit 703 and an entropy coding unit 704, wherein the entropy decoding unit 701 is connected with the time domain decoding unit 702 and the quantization unit 703 respectively, the time domain decoding unit 702 is connected with the quantization unit 703, and the quantization unit 703 is connected with the entropy coding unit 704. In some embodiments, the audio transcoder provided by the embodiments of the present application is also referred to as a downstream transcoder.
The entropy decoding unit 701 is configured to entropy decode a first audio stream with a first code rate to obtain an audio feature parameter and an excitation signal of the first audio stream, where the excitation signal is a quantized speech signal.
The time domain decoding unit 702 is configured to obtain a time domain audio signal corresponding to the excitation signal based on the audio feature parameter and the excitation signal.
A quantization unit 703 for re-quantizing the excitation signal and the audio feature parameters based on the time-domain audio signal and the target transcoding rate. In some embodiments, quantization unit 703 is also referred to as a fast noise shaping quantization unit.
The entropy encoding unit 704 is configured to perform entropy encoding on the re-quantized audio feature parameter and the re-quantized excitation signal, to obtain a second audio stream with a second bitrate, where the second bitrate is lower than the first bitrate.
In some embodiments, during the transcoding process, the entropy decoding unit 701 can send the audio feature parameters and the excitation signal to the time-domain decoding unit 702 and the quantization unit 703, respectively, and the time-domain decoding unit 702 can obtain the audio feature parameters and the excitation signal from the entropy decoding unit, and obtain the time-domain audio signal corresponding to the excitation signal based on the audio feature parameters and the excitation signal. The time-domain decoding unit 702 can transmit the time-domain audio signal to the quantization unit 703. The quantization unit 703 is capable of receiving the target transcoding rate, the audio feature parameters, the excitation signal, and the time-domain audio signal, and re-quantizing the excitation signal and the audio feature parameters. The quantization unit 703 can send the re-quantized audio feature parameter and the re-quantized excitation signal to the entropy encoding unit 704, and the re-quantized audio feature parameter and the re-quantized excitation signal are entropy encoded by the entropy encoding unit 704, thereby obtaining a second audio stream of a second bitrate.
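The data flow among the four units can be sketched as a simple pipeline. The class and the injected callables below are hypothetical illustrations of how the units in fig. 7 connect; they are not the actual codec implementation.

```python
class AudioTranscoder:
    """Sketch of fig. 7: entropy decode -> time-domain decode -> quantize -> entropy encode."""

    def __init__(self, entropy_decode, time_domain_decode, quantize, entropy_encode):
        self.entropy_decode = entropy_decode            # unit 701
        self.time_domain_decode = time_domain_decode    # unit 702
        self.quantize = quantize                        # unit 703 (fast NSQ)
        self.entropy_encode = entropy_encode            # unit 704

    def transcode(self, first_audio_stream, target_rate):
        # Unit 701 feeds both unit 702 and unit 703.
        features, excitation = self.entropy_decode(first_audio_stream)
        time_domain = self.time_domain_decode(features, excitation)
        rq_excitation, rq_features = self.quantize(excitation, features,
                                                   time_domain, target_rate)
        return self.entropy_encode(rq_features, rq_excitation)
```

The point of the sketch is that only the quantization step consumes the time-domain signal; the entropy stages never touch it, which is why their cost is negligible.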
In a possible implementation manner, the quantization unit is configured to obtain, through at least one iteration process, a first quantization parameter based on the target transcoding rate, where the first quantization parameter is used to adjust the first bitrate of the first audio stream to the target transcoding rate. The excitation signal and the audio feature parameters are re-quantized based on the time-domain audio signal and the first quantization parameter.
In a possible implementation, the quantization unit is configured to determine the first alternative quantization parameter based on the target transcoding rate during any iteration. And simulating a re-quantization process of the excitation signal and the audio characteristic parameters based on the first alternative quantization parameters to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameters. And simulating the entropy coding process of the first signal and the first parameter to obtain a simulated audio stream. In response to the analog audio stream meeting a first target condition and at least one of the time-domain audio signal and the first signal, the target transcoding rate and the rate of the analog audio stream, the number of iterations meeting a second target condition, determining a first alternative quantization parameter as the first quantization parameter.
In one possible implementation, the compliance of the analog audio stream with the first target condition refers to at least one of:
The code rate of the analog audio stream is less than or equal to the target transcoding code rate.
The audio stream quality parameter of the analog audio stream is greater than or equal to the quality parameter threshold.
In one possible implementation, the time domain audio signal and the first signal, the target transcoding rate and at least one of the rate of the analog audio stream, the number of iterated times, meet the second target condition means that:
the similarity between the time domain audio signal and the first signal is greater than or equal to a similarity threshold.
The difference between the target transcoding rate and the rate of the analog audio stream is less than or equal to a difference threshold.
The number of iterations is equal to the iteration number threshold.
In a possible implementation, the quantization unit is configured to:
and simulating the discrete cosine transform process of the excitation signal and the discrete cosine transform process of the audio characteristic parameter respectively to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio characteristic parameter.
And dividing the second signal and the second parameter by the first alternative quantization parameter respectively, and rounding to obtain a first signal and a first parameter.
In a possible implementation, the quantization unit is further configured to: in response to the analog audio stream not meeting the first target condition, or the time-domain audio signal, the first signal, the target transcoding code rate, the code rate of the analog audio stream, and the number of iterations not meeting the second target condition, take a second alternative quantization parameter determined based on the target transcoding code rate as the input of the next iteration process.
In one possible implementation, the entropy decoding unit is configured to: the probability of occurrence of a plurality of coding units in a first audio stream is obtained. The first audio stream is decoded based on the occurrence probability, and a plurality of decoding units corresponding to the plurality of encoding units are obtained. And combining the plurality of decoding units to obtain the audio characteristic parameters and the excitation signals of the first audio stream.
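Decoding based on occurrence probabilities inverts the interval narrowing shown in the entropy-encoding example earlier: given the encoded value and the same probability table, the decoder repeatedly finds which subinterval contains the value. A hypothetical sketch, assuming a known message length and the "DEFFG" probabilities:

```python
def decode_value(value, probs, length, low=0, high=100000):
    """Recover `length` symbols by locating `value` in successive subintervals."""
    out = []
    for _ in range(length):
        span = high - low
        start = low
        for sym, p in probs.items():  # subintervals in fixed symbol order
            width = int(span * p)
            if start <= value < start + width:
                out.append(sym)              # this subinterval's symbol is next
                low, high = start, start + width
                break
            start += width
    return "".join(out)
```

For instance, decoding 6800 with the probabilities 0.2, 0.2, 0.4, 0.2 for "D", "E", "F", "G" recovers "DEFFG", matching the encoding example.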
In one possible implementation, the entropy encoding unit is configured to:
And acquiring the re-quantized audio characteristic parameters and the occurrence probability of a plurality of coding units in the re-quantized excitation signal.
And encoding the plurality of encoding units based on the occurrence probability to obtain a second audio stream.
In a possible embodiment, the audio transcoder further comprises a forward error correction unit, which is coupled to the entropy encoding unit and is configured to perform forward error correction encoding on a subsequently received audio stream based on the second audio stream.
It should be noted that: in the audio transcoder provided in the above embodiment, only the division of the above functional units is used for illustration, and in practical application, the above functional allocation may be performed by different functional units according to needs, that is, the internal structure of the audio transcoder is divided into different functional units, so as to perform all or part of the functions described above. In addition, the embodiments of the method for audio transcoding and the audio transcoder provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the embodiments of the method are detailed in the method embodiments, which are not repeated herein.
According to the technical scheme provided by the embodiment of the application, when the audio stream is transcoded, a complete parameter extraction process does not need to be executed; entropy decoding is adopted to acquire the audio characteristic parameters and the excitation signal. The re-quantization is performed only on the excitation signal and the audio characteristic parameters and does not involve processing of the time-domain signal. Finally, entropy coding is performed on the excitation signal and the audio characteristic parameters to obtain a second audio stream with a lower code rate. Because the operation amount of entropy decoding and entropy encoding is small, and the time-domain signal does not need to be processed, the operation amount can be greatly reduced, and the speed and efficiency of audio transcoding are improved overall on the premise of ensuring the sound quality.
Fig. 8 is a schematic structural diagram of an audio transcoding device according to an embodiment of the present application, referring to fig. 8, the device includes: a decoding module 801, a time domain audio signal acquisition module 802, a quantization module 803, and an encoding module 804.
The decoding module 801 is configured to entropy decode a first audio stream at a first code rate to obtain an audio feature parameter and an excitation signal of the first audio stream, where the excitation signal is a quantized speech signal.
The time-domain audio signal obtaining module 802 is configured to obtain a time-domain audio signal corresponding to the excitation signal based on the audio feature parameter and the excitation signal.
A quantization module 803, configured to re-quantize the excitation signal and the audio feature parameter based on the time domain audio signal and the target transcoding rate.
The encoding module 804 is configured to entropy encode the re-quantized audio feature parameter and the re-quantized excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
In one possible implementation, the quantization module is configured to obtain, through at least one iteration process, a first quantization parameter based on the target transcoding rate, where the first quantization parameter is used to adjust the first bitrate of the first audio stream to the target transcoding rate. The excitation signal and the audio feature parameters are re-quantized based on the time-domain audio signal and the first quantization parameter.
In one possible implementation, the quantization module is configured to determine, during any iteration, a first alternative quantization parameter based on the target transcoding rate. And simulating a re-quantization process of the excitation signal and the audio characteristic parameters based on the first alternative quantization parameters to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameters. And simulating the entropy coding process of the first signal and the first parameter to obtain a simulated audio stream. In response to the analog audio stream meeting a first target condition and at least one of the time-domain audio signal and the first signal, the target transcoding rate and the rate of the analog audio stream, the number of iterations meeting a second target condition, determining a first alternative quantization parameter as the first quantization parameter.
In one possible implementation, the compliance of the analog audio stream with the first target condition refers to at least one of:
The code rate of the analog audio stream is less than or equal to the target transcoding code rate.
The audio stream quality parameter of the analog audio stream is greater than or equal to the quality parameter threshold.
In one possible implementation, the time domain audio signal and the first signal, the target transcoding rate and at least one of the rate of the analog audio stream, the number of iterated times, meet the second target condition means that:
the similarity between the time domain audio signal and the first signal is greater than or equal to a similarity threshold.
The difference between the target transcoding rate and the rate of the analog audio stream is less than or equal to a difference threshold.
The number of iterations is equal to the iteration number threshold.
In one possible implementation, the quantization module is configured to simulate a discrete cosine transform process of the excitation signal and a discrete cosine transform process of the audio feature parameter, respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio feature parameter.
And dividing the second signal and the second parameter by the first alternative quantization parameter respectively, and rounding to obtain a first signal and a first parameter.
In one possible implementation manner, the quantization module is further configured to, in response to the analog audio stream not meeting the first target condition, or the time-domain audio signal, the first signal, the target transcoding code rate, the code rate of the analog audio stream, and the number of iterations not meeting the second target condition, take as input of the next iteration process a second alternative quantization parameter determined based on the target transcoding code rate.
In a possible implementation manner, the decoding module is configured to obtain occurrence probabilities of a plurality of coding units in the first audio stream. The first audio stream is decoded based on the occurrence probability, and a plurality of decoding units corresponding to the plurality of encoding units are obtained. And combining the plurality of decoding units to obtain the audio characteristic parameters and the excitation signals of the first audio stream.
In one possible embodiment, the coding module is configured to obtain the re-quantized audio feature parameter and the occurrence probability of the plurality of coding units in the re-quantized excitation signal. And encoding the plurality of encoding units based on the occurrence probability to obtain a second audio stream.
In a possible implementation manner, the apparatus further comprises a forward error correction module, configured to perform forward error correction encoding on a subsequently received audio stream based on the second audio stream.
It should be noted that: in the audio transcoding device provided in the above embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the audio transcoding device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the audio transcoding device provided in the above embodiment and the method embodiment of audio transcoding belong to the same concept, and the specific implementation process is detailed in the method embodiment, which is not repeated here.
According to the technical scheme provided by the embodiment of the application, when the audio stream is transcoded, a complete parameter extraction process does not need to be executed; entropy decoding is adopted to acquire the audio characteristic parameters and the excitation signal. The re-quantization is performed only on the excitation signal and the audio characteristic parameters and does not involve processing of the time-domain signal. Finally, entropy coding is performed on the excitation signal and the audio characteristic parameters to obtain a second audio stream with a lower code rate. Because the operation amount of entropy decoding and entropy encoding is small, and the time-domain signal does not need to be processed, the operation amount can be greatly reduced, and the speed and efficiency of audio transcoding are improved overall on the premise of ensuring the sound quality.
The embodiment of the application provides a computer device, which is used for executing the method, and can be realized as a terminal or a server, and the structure of the terminal is described below:
Fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 900 may be: a smart phone, a tablet computer, a notebook computer, or a desktop computer. Terminal 900 may also be referred to by other names such as user device, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 900 includes: one or more processors 901 and one or more memories 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 901 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). Processor 901 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 901 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one computer program for execution by processor 901 to implement the audio transcoding method provided by the method embodiments of the present application.
Those skilled in the art will appreciate that the structure shown in fig. 9 is not limiting and that more or fewer components than shown may be included or certain components may be combined or a different arrangement of components may be employed.
The computer device may also be implemented as a server, and the following describes the structure of the server:
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application, where the server 1000 may have a relatively large difference due to configuration or performance, and may include one or more processors (Central Processing Units, CPU) 1001 and one or more memories 1002, where the one or more memories 1002 store at least one computer program, and the at least one computer program is loaded and executed by the one or more processors 1001 to implement the methods provided in the foregoing method embodiments. Of course, the server 1000 may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer-readable storage medium, for example, a memory including a computer program, is also provided; the computer program is executable by a processor to perform the audio transcoding method of the above embodiment. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which comprises a program code stored in a computer readable storage medium, which program code is read from the computer readable storage medium by a processor of a computer device, which program code is executed by the processor, such that the computer device performs the above-mentioned audio transcoding method.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the above storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements within the spirit and principles of the present application.
Claims (30)
1. A method of audio transcoding, the method comprising:
entropy decoding is carried out on a first audio stream with a first code rate, so that audio characteristic parameters and excitation signals of the first audio stream are obtained, and the excitation signals are quantized voice signals;
Acquiring a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal;
Based on a target transcoding code rate, acquiring a first quantization parameter through at least one iteration process, wherein the first quantization parameter is used for adjusting the first code rate of the first audio stream to the target transcoding code rate;
re-quantizing the excitation signal and the audio feature parameters based on the time-domain audio signal and the first quantization parameter;
And entropy coding the re-quantized audio characteristic parameters and the re-quantized excitation signals to obtain a second audio stream with a second code rate, wherein the second code rate is lower than the first code rate.
2. The method of claim 1, wherein the obtaining the first quantization parameter through at least one iterative process based on the target transcoding rate comprises:
in any one of the iterative processes, determining a first alternative quantization parameter based on the target transcoding rate;
simulating a re-quantization process of the excitation signal and the audio characteristic parameter based on the first alternative quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameter;
Simulating the entropy coding process of the first signal and the first parameter to obtain a simulated audio stream;
And determining the first alternative quantization parameter as the first quantization parameter in response to the analog audio stream meeting a first target condition and at least one of the time domain audio signal and the first signal, the target transcoding code rate and the code rate of the analog audio stream, and the number of iterations meeting a second target condition.
3. The method of claim 2, wherein the compliance of the analog audio stream with the first target condition is at least one of:
the code rate of the analog audio stream is smaller than or equal to the target transcoding code rate;
the audio stream quality parameter of the analog audio stream is greater than or equal to a quality parameter threshold.
4. The method of claim 2, wherein at least one of the time-domain audio signal and the first signal, the target transcoding bitrate and the bitrate of the simulated audio stream, or the number of iterations meeting the second target condition means that at least one of the following holds:
the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
the difference between the target transcoding bitrate and the bitrate of the simulated audio stream is less than or equal to a difference threshold;
the number of iterations is equal to an iteration count threshold.
5. The method of claim 2, wherein simulating re-quantization of the excitation signal and the audio feature parameters based on the first candidate quantization parameter to obtain the first signal corresponding to the excitation signal and the first parameter corresponding to the audio feature parameters comprises:
simulating a discrete cosine transform of the excitation signal and a discrete cosine transform of the audio feature parameters, respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio feature parameters;
and dividing the second signal and the second parameter by the first candidate quantization parameter, respectively, and rounding the results to obtain the first signal and the first parameter.
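Claims 2 and 5 describe simulating re-quantization: apply a discrete cosine transform to each input, divide by the candidate quantization parameter, and round. A minimal sketch of that arithmetic (the DCT-II variant and the function names are assumptions, since the claims do not fix them):

```python
import math

def dct_ii(values):
    # Naive DCT-II; the claims require a discrete cosine transform but
    # do not specify a variant, so this choice is an assumption.
    n = len(values)
    return [sum(x * math.cos(math.pi * (2 * k + 1) * m / (2 * n))
                for k, x in enumerate(values))
            for m in range(n)]

def simulate_requantization(excitation, features, q):
    # Transform, divide by the candidate quantization parameter q, and
    # round, yielding the claimed "first signal" and "first parameter".
    second_signal = dct_ii(excitation)
    second_param = dct_ii(features)
    first_signal = [round(v / q) for v in second_signal]
    first_param = [round(v / q) for v in second_param]
    return first_signal, first_param
```

A coarser quantization parameter (larger `q`) maps more coefficients to zero, which is what lets the later entropy-coding step hit a lower bitrate.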
6. The method of claim 2, further comprising:
in response to the simulated audio stream not meeting the first target condition, or the time-domain audio signal and the first signal, the target transcoding bitrate and the bitrate of the simulated audio stream, and the number of iterations all failing to meet the second target condition, taking a second candidate quantization parameter determined based on the target transcoding bitrate as an input to the next iterative process.
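Claims 2 through 6 together describe an iterative search for the quantization parameter: simulate, test the target conditions, and either accept the candidate or derive a new one for the next iteration. A hedged sketch of that control flow, where `simulate` is a hypothetical callback returning the simulated stream's bitrate for a candidate, and the concrete update rule and thresholds are assumptions of this sketch, not taken from the patent:

```python
def search_quantization_parameter(initial_q, simulate, target_bitrate,
                                  diff_fraction=0.05, max_iterations=20):
    # Iteratively refine the candidate quantization parameter q until the
    # simulated bitrate meets the first condition (<= target) and a second
    # condition (close enough to the target, or iteration budget spent).
    q = initial_q
    for iteration in range(1, max_iterations + 1):
        bitrate = simulate(q)
        meets_first = bitrate <= target_bitrate
        close_enough = abs(target_bitrate - bitrate) <= diff_fraction * target_bitrate
        if meets_first and (close_enough or iteration == max_iterations):
            return q
        # Otherwise derive the next candidate from the target bitrate:
        # coarser quantization (larger q) while the stream is still too big.
        q = q * (bitrate / target_bitrate)
    return q
```

For example, with a toy model where bitrate is inversely proportional to `q` (`simulate = lambda q: 64.0 / q`) and a 16 kbps target, the search settles on `q = 4.0` in two iterations.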
7. The method of claim 1, wherein entropy decoding the first audio stream at the first bitrate to obtain the audio feature parameters and the excitation signal of the first audio stream comprises:
obtaining the occurrence probabilities of a plurality of coding units in the first audio stream;
decoding the first audio stream based on the occurrence probabilities to obtain a plurality of decoding units corresponding to the plurality of coding units, respectively;
and combining the plurality of decoding units to obtain the audio feature parameters and the excitation signal of the first audio stream.
8. The method of claim 1, wherein entropy encoding the re-quantized audio feature parameters and the re-quantized excitation signal to obtain the second audio stream at the second bitrate comprises:
obtaining the occurrence probabilities of a plurality of coding units in the re-quantized audio feature parameters and the re-quantized excitation signal;
and encoding the plurality of coding units based on the occurrence probabilities to obtain the second audio stream.
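Claims 7 and 8 both rest on per-unit occurrence probabilities. An ideal entropy coder spends about -log2(p) bits on a unit of probability p, so the size of a simulated or re-encoded stream can be estimated from those probabilities alone. A sketch (the function name and the idealized-coder simplification are assumptions; a real codec would use an arithmetic or range coder):

```python
import math
from collections import Counter

def estimate_entropy_coded_bits(units):
    # Empirical occurrence probability of each coding unit, then the
    # ideal entropy-coded size: sum of -log2(p) over the whole stream.
    counts = Counter(units)
    total = len(units)
    probs = {u: c / total for u, c in counts.items()}
    return sum(-math.log2(probs[u]) for u in units)
```

A skewed distribution (many zero-valued units after coarse re-quantization) costs fewer bits than a uniform one of the same length, which is why re-quantizing before entropy coding lowers the bitrate.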
9. The method of claim 1, wherein after entropy encoding the re-quantized audio feature parameters and the re-quantized excitation signal to obtain the second audio stream at the second bitrate, the method further comprises:
performing forward error correction encoding on a subsequently received audio stream based on the second audio stream.
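The claims do not specify the forward error correction scheme. As an illustration only, a single XOR-parity packet over a group of equal-length audio packets lets the receiver rebuild any one lost packet (a hypothetical minimal FEC, not the patent's method):

```python
def xor_fec_packet(packets):
    # XOR all packets byte-wise into one parity packet. If exactly one
    # packet of the group is lost, XOR-ing the survivors with the parity
    # reproduces it, because XOR is its own inverse.
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)
```

Usage: send the parity alongside the group; on loss of one packet, call `xor_fec_packet` over the remaining packets plus the parity to recover it.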
10. An audio transcoder, comprising: an entropy decoding unit, a time-domain decoding unit, a quantization unit, and an entropy encoding unit, wherein the entropy decoding unit is connected to the time-domain decoding unit and the quantization unit, respectively, the time-domain decoding unit is connected to the quantization unit, and the quantization unit is connected to the entropy encoding unit;
the entropy decoding unit is configured to entropy decode a first audio stream at a first bitrate to obtain audio feature parameters and an excitation signal of the first audio stream, wherein the excitation signal is a quantized speech signal;
the time-domain decoding unit is configured to obtain a time-domain audio signal corresponding to the excitation signal based on the audio feature parameters and the excitation signal;
the quantization unit is configured to obtain a first quantization parameter through at least one iterative process based on a target transcoding bitrate, wherein the first quantization parameter is used to adjust the first bitrate of the first audio stream to the target transcoding bitrate, and to re-quantize the excitation signal and the audio feature parameters based on the time-domain audio signal and the first quantization parameter;
and the entropy encoding unit is configured to entropy encode the re-quantized audio feature parameters and the re-quantized excitation signal to obtain a second audio stream at a second bitrate, wherein the second bitrate is lower than the first bitrate.
11. The audio transcoder of claim 10, wherein the quantization unit is configured to: determine, in any one of the iterative processes, a first candidate quantization parameter based on the target transcoding bitrate; simulate re-quantization of the excitation signal and the audio feature parameters based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameters; simulate entropy encoding of the first signal and the first parameter to obtain a simulated audio stream; and determine the first candidate quantization parameter as the first quantization parameter in response to the simulated audio stream meeting a first target condition and at least one of the time-domain audio signal and the first signal, the target transcoding bitrate and the bitrate of the simulated audio stream, or the number of iterations meeting a second target condition.
12. The audio transcoder of claim 11, wherein the simulated audio stream meets the first target condition when at least one of the following holds:
the bitrate of the simulated audio stream is less than or equal to the target transcoding bitrate;
the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.
13. The audio transcoder of claim 11, wherein at least one of the time-domain audio signal and the first signal, the target transcoding bitrate and the bitrate of the simulated audio stream, or the number of iterations meeting the second target condition means that at least one of the following holds:
the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
the difference between the target transcoding bitrate and the bitrate of the simulated audio stream is less than or equal to a difference threshold;
the number of iterations is equal to an iteration count threshold.
14. The audio transcoder of claim 11, wherein the quantization unit is configured to: simulate a discrete cosine transform of the excitation signal and a discrete cosine transform of the audio feature parameters, respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio feature parameters; and divide the second signal and the second parameter by the first candidate quantization parameter, respectively, and round the results to obtain the first signal and the first parameter.
15. The audio transcoder of claim 11, wherein the quantization unit is further configured to: in response to the simulated audio stream not meeting the first target condition, or the time-domain audio signal and the first signal, the target transcoding bitrate and the bitrate of the simulated audio stream, and the number of iterations all failing to meet the second target condition, take a second candidate quantization parameter determined based on the target transcoding bitrate as an input to the next iterative process.
16. The audio transcoder of claim 10, wherein the entropy decoding unit is configured to: obtain the occurrence probabilities of a plurality of coding units in the first audio stream; decode the first audio stream based on the occurrence probabilities to obtain a plurality of decoding units corresponding to the plurality of coding units, respectively; and combine the plurality of decoding units to obtain the audio feature parameters and the excitation signal of the first audio stream.
17. The audio transcoder of claim 10, wherein the entropy encoding unit is configured to:
obtain the occurrence probabilities of a plurality of coding units in the re-quantized audio feature parameters and the re-quantized excitation signal;
and encode the plurality of coding units based on the occurrence probabilities to obtain the second audio stream.
18. The audio transcoder of claim 10, further comprising a forward error correction module connected to the entropy encoding unit and configured to perform forward error correction encoding on a subsequently received audio stream based on the second audio stream.
19. An audio transcoding apparatus, the apparatus comprising:
a decoding module configured to entropy decode a first audio stream at a first bitrate to obtain audio feature parameters and an excitation signal of the first audio stream, wherein the excitation signal is a quantized speech signal;
a time-domain audio signal acquisition module configured to obtain a time-domain audio signal corresponding to the excitation signal based on the audio feature parameters and the excitation signal;
a quantization module configured to obtain a first quantization parameter through at least one iterative process based on a target transcoding bitrate, wherein the first quantization parameter is used to adjust the first bitrate of the first audio stream to the target transcoding bitrate, and to re-quantize the excitation signal and the audio feature parameters based on the time-domain audio signal and the first quantization parameter;
and an encoding module configured to entropy encode the re-quantized audio feature parameters and the re-quantized excitation signal to obtain a second audio stream at a second bitrate, wherein the second bitrate is lower than the first bitrate.
20. The audio transcoding apparatus of claim 19, wherein the quantization module is configured to: determine, in any one of the iterative processes, a first candidate quantization parameter based on the target transcoding bitrate; simulate re-quantization of the excitation signal and the audio feature parameters based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameters; simulate entropy encoding of the first signal and the first parameter to obtain a simulated audio stream; and determine the first candidate quantization parameter as the first quantization parameter in response to the simulated audio stream meeting a first target condition and at least one of the time-domain audio signal and the first signal, the target transcoding bitrate and the bitrate of the simulated audio stream, or the number of iterations meeting a second target condition.
21. The audio transcoding apparatus of claim 20, wherein the simulated audio stream meets the first target condition when at least one of the following holds:
the bitrate of the simulated audio stream is less than or equal to the target transcoding bitrate;
the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.
22. The audio transcoding apparatus of claim 20, wherein at least one of the time-domain audio signal and the first signal, the target transcoding bitrate and the bitrate of the simulated audio stream, or the number of iterations meeting the second target condition means that at least one of the following holds:
the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
the difference between the target transcoding bitrate and the bitrate of the simulated audio stream is less than or equal to a difference threshold;
the number of iterations is equal to an iteration count threshold.
23. The audio transcoding apparatus of claim 20, wherein the quantization module is configured to simulate a discrete cosine transform of the excitation signal and a discrete cosine transform of the audio feature parameters, respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio feature parameters; and to divide the second signal and the second parameter by the first candidate quantization parameter, respectively, and round the results to obtain the first signal and the first parameter.
24. The audio transcoding apparatus of claim 20, wherein the quantization module is further configured to, in response to the simulated audio stream not meeting the first target condition, or the time-domain audio signal and the first signal, the target transcoding bitrate and the bitrate of the simulated audio stream, and the number of iterations all failing to meet the second target condition, take a second candidate quantization parameter determined based on the target transcoding bitrate as an input to the next iterative process.
25. The audio transcoding apparatus of claim 19, wherein the decoding module is configured to obtain the occurrence probabilities of a plurality of coding units in the first audio stream; decode the first audio stream based on the occurrence probabilities to obtain a plurality of decoding units corresponding to the plurality of coding units, respectively; and combine the plurality of decoding units to obtain the audio feature parameters and the excitation signal of the first audio stream.
26. The audio transcoding apparatus of claim 19, wherein the encoding module is configured to obtain the occurrence probabilities of a plurality of coding units in the re-quantized audio feature parameters and the re-quantized excitation signal; and encode the plurality of coding units based on the occurrence probabilities to obtain the second audio stream.
27. The audio transcoding apparatus of claim 19, further comprising a forward error correction module configured to perform forward error correction encoding on a subsequently received audio stream based on the second audio stream.
28. A computer device, comprising one or more processors and one or more memories, wherein the one or more memories store at least one computer program that is loaded and executed by the one or more processors to implement the audio transcoding method of any one of claims 1 to 9.
29. A computer-readable storage medium storing at least one computer program that is loaded and executed by a processor to implement the audio transcoding method of any one of claims 1 to 9.
30. A computer program product comprising program code stored in a computer-readable storage medium, wherein a processor of a computer device reads the program code from the computer-readable storage medium and executes it, so that the computer device implements the audio transcoding method of any one of claims 1 to 9.
Priority Applications (2)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| PCT/CN2022/076144 (WO2022179406A1) | 2021-02-26 | 2022-02-14 | Audio transcoding method and apparatus, audio transcoder, device, and storage medium |
| US18/046,708 (US20230075562A1) | 2021-02-26 | 2022-10-14 | Audio Transcoding Method and Apparatus, Audio Transcoder, Device, and Storage Medium |
Applications Claiming Priority (2)

| Application Number | Priority Date |
| --- | --- |
| CN2021102188689 | 2021-02-26 |
| CN202110218868 | 2021-02-26 |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN115050377A | 2022-09-13 |
| CN115050377B | 2024-09-27 |
Family: ID=83156899

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202111619099.XA (granted as CN115050377B, Active) | Audio transcoding method, device, audio transcoder, equipment and storage medium | 2021-02-26 | 2021-12-27 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN115050377B |
Families Citing this family (1)

| Publication Number | Priority Date | Publication Date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN118800252A * | 2023-04-13 | 2024-10-18 | Huawei Technologies Co., Ltd. | Scene audio coding method and electronic device |
Citations (1)

| Publication Number | Priority Date | Publication Date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN1669071A * | 2002-05-22 | 2005-09-14 | NEC Corporation | Method and device for code conversion between audio encoding/decoding methods and storage medium thereof |
Family Cites Families (8)

| Publication Number | Priority Date | Publication Date | Assignee | Title |
| --- | --- | --- | --- | --- |
| DE69923555T2 * | 1998-05-27 | 2006-02-16 | Microsoft Corp., Redmond | Method and apparatus for entropy coding of quantized transform coefficients of a signal |
| CN101086845B * | 2006-06-08 | 2011-06-01 | 北京天籁传音数字技术有限公司 | Sound coding device and method and sound decoding device and method |
| KR101411759B1 * | 2009-10-20 | 2014-06-25 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Audio signal encoder, audio signal decoder, method for encoding or decoding an audio signal using an aliasing-cancellation |
| CN102436819B * | 2011-10-25 | 2013-02-13 | 杭州微纳科技有限公司 | Wireless audio compression and decompression methods, audio coder and audio decoder |
| WO2013185857A1 * | 2012-06-14 | 2013-12-19 | Telefonaktiebolaget L M Ericsson (Publ) | Method and arrangement for scalable low-complexity coding/decoding |
| CN104392725A * | 2014-12-02 | 2015-03-04 | 中科开元信息技术(北京)有限公司 | Method and device for hybrid coding/decoding of multi-channel lossless audio |
| US10332534B2 * | 2016-01-07 | 2019-06-25 | Microsoft Technology Licensing, LLC | Encoding an audio stream |
| CN108231083A * | 2018-01-16 | 2018-06-29 | 重庆邮电大学 | Method for improving the coding efficiency of a SILK-based speech coder |
Legal Events

| Code | Title | Description |
| --- | --- | --- |
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40073688 |
| GR01 | Patent grant | |