CN114333861B

CN114333861B - Audio processing method, device, storage medium, equipment and product

Info

Publication number: CN114333861B
Application number: CN202111371005.1A
Authority: CN
Inventors: 梁俊斌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-11-18
Filing date: 2021-11-18
Publication date: 2025-07-11
Anticipated expiration: 2041-11-18
Also published as: CN114333861A

Abstract

The present application discloses an audio processing method, device, storage medium, equipment and product, which relates to the field of Internet technology. The present application can be applied to the fields of map vehicle networking, blockchain, artificial intelligence, etc. The method includes: receiving the spectrum characteristic parameters of the high-frequency signal and the low-frequency coded data of the low-frequency signal, wherein the high-frequency signal and the low-frequency signal belong to the target audio signal; decoding the low-frequency coded data to generate a decoded low-frequency signal; performing a prediction network matching process based on the spectrum characteristic parameters to obtain an audio prediction network matched with the spectrum characteristic parameters; performing an audio prediction process based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal; and generating an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal. The present application can reduce the transmission bandwidth of audio data and ensure the audio playback effect.

Description

Audio processing method, device, storage medium, equipment and product

Technical Field

The application relates to the technical field of artificial intelligence, in particular to an audio processing method, an audio processing device, a storage medium, equipment and a product.

Background

Audio processing generally centers on audio encoding and decoding. In a typical pipeline, an acquisition end collects a sound signal, encodes and compresses the corresponding audio signal, and sends the compressed audio signal to a receiving end, which decodes it and plays the sound.

At present, an acquisition end in the related art can reduce the code rate in some way before transmitting an audio signal to the receiving end, so as to reduce the transmission bandwidth. However, existing approaches reduce the code rate only to a limited extent, so the reduction in transmission bandwidth is small. In addition, as a consequence of the code-rate reduction, the generation of the audio output signal at the receiving end is often uncontrollable, so the audio playback effect is poor.

Disclosure of Invention

The embodiment of the application provides an audio processing scheme which can effectively reduce the transmission bandwidth of audio data and ensure the audio playing effect.

In order to solve the technical problems, the embodiment of the application provides the following technical scheme:

According to one embodiment of the application, an audio processing method comprises: receiving spectral feature parameters of a high-frequency signal and low-frequency encoded data of a low-frequency signal, wherein the high-frequency signal and the low-frequency signal belong to a target audio signal; decoding the low-frequency encoded data to generate a decoded low-frequency signal; performing prediction network matching processing based on the spectral feature parameters to obtain an audio prediction network matched with the spectral feature parameters; performing audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal; and generating an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal.

According to one embodiment of the application, an audio processing device comprises a receiving module, a decoding module, a matching module, a prediction module and an output module, wherein the receiving module is used for receiving spectral feature parameters of a high-frequency signal and low-frequency encoded data of a low-frequency signal, the high-frequency signal and the low-frequency signal belonging to a target audio signal; the decoding module is used for decoding the low-frequency encoded data to generate a decoded low-frequency signal; the matching module is used for performing prediction network matching processing based on the spectral feature parameters to obtain an audio prediction network matched with the spectral feature parameters; the prediction module is used for performing audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal; and the output module is used for generating an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal.

In some embodiments of the present application, the spectral feature parameters include a spectral envelope type, the matching module includes an information obtaining unit configured to obtain network information of at least one preset audio prediction network, where each network information corresponds to a preset spectral envelope type, a network matching unit configured to determine network information corresponding to a preset spectral envelope type matched by the spectral envelope type, to obtain target network information, and a network determining unit configured to determine a preset audio prediction network corresponding to the target network information as the audio prediction network matched by the spectral feature parameters.

In some embodiments of the application, the prediction module comprises an extraction processing unit, an information prediction unit and a signal generation unit, wherein the extraction processing unit is used for carrying out frequency spectrum characteristic extraction processing on the decoded low-frequency signal to obtain low-frequency spectrum information, the information prediction unit is used for carrying out audio prediction processing on the basis of the low-frequency spectrum information by adopting the audio prediction network to obtain predicted frequency spectrum information, and the signal generation unit is used for generating a predicted high-frequency signal corresponding to the high-frequency signal on the basis of the predicted frequency spectrum information.

In some embodiments of the present application, the extraction processing unit is configured to perform modified discrete cosine transform processing on the decoded low-frequency signal to obtain the low-frequency spectrum information, and the signal generating unit is configured to perform modified discrete cosine inverse transform processing on the predicted spectrum information to generate a predicted high-frequency signal corresponding to the high-frequency signal.
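
The transform pair named above can be illustrated with a minimal sketch. The text does not specify a frame length, window, or normalization, so the sine window, the 50% overlap, the direct matrix form, and the function names `mdct`, `imdct` and `roundtrip` are all assumptions made for illustration. In the scheme described, the audio prediction network would sit between the forward and inverse transforms; here the coefficients are passed through unchanged to show that the pair reconstructs the signal.

```python
import numpy as np

def mdct_basis(n_bins):
    # Cosine basis of shape (n_bins, 2 * n_bins) for a frame of 2 * n_bins samples.
    n = np.arange(2 * n_bins)[None, :]
    k = np.arange(n_bins)[:, None]
    return np.cos(np.pi / n_bins * (n + 0.5 + n_bins / 2) * (k + 0.5))

def sine_window(n_bins):
    # Sine window (assumed); it satisfies w[n]^2 + w[n + n_bins]^2 = 1.
    return np.sin(np.pi * (np.arange(2 * n_bins) + 0.5) / (2 * n_bins))

def mdct(frame):
    # Forward transform: 2 * n_bins windowed samples -> n_bins coefficients.
    n_bins = len(frame) // 2
    return mdct_basis(n_bins) @ (sine_window(n_bins) * frame)

def imdct(coeffs):
    # Inverse transform: n_bins coefficients -> 2 * n_bins windowed, aliased samples.
    n_bins = len(coeffs)
    return (2.0 / n_bins) * sine_window(n_bins) * (mdct_basis(n_bins).T @ coeffs)

def roundtrip(signal, n_bins):
    # Transform overlapping frames (hop = n_bins) and overlap-add the inverses.
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - 2 * n_bins + 1, n_bins):
        frame = signal[start:start + 2 * n_bins]
        out[start:start + 2 * n_bins] += imdct(mdct(frame))
    return out
```

Only samples covered by two overlapping frames are reconstructed exactly; the first and last half-frames would need padding in a real codec.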

In some embodiments of the present application, the output module is configured to perform quadrature mirror image synthesis filtering processing on the predicted high frequency signal and the decoded low frequency signal to generate the audio output signal.

According to one embodiment of the application, an audio processing method comprises: decomposing a target audio signal to generate a high-frequency signal and a low-frequency signal; performing feature extraction processing on the high-frequency signal to obtain spectral feature parameters corresponding to the high-frequency signal; performing audio encoding processing on the low-frequency signal to generate low-frequency encoded data corresponding to the low-frequency signal; and sending the spectral feature parameters and the low-frequency encoded data to a receiving end, so that the receiving end determines an audio prediction network matched with the spectral feature parameters and generates an audio output signal based on the audio prediction network and a decoded low-frequency signal obtained by decoding the low-frequency encoded data.

According to one embodiment of the application, an audio processing device comprises a decomposition module, an extraction module, an encoding module and a transmission module, wherein the decomposition module is used for decomposing a target audio signal to generate a high-frequency signal and a low-frequency signal; the extraction module is used for performing feature extraction processing on the high-frequency signal to obtain spectral feature parameters corresponding to the high-frequency signal; the encoding module is used for performing audio encoding processing on the low-frequency signal to generate low-frequency encoded data corresponding to the low-frequency signal; and the transmission module is used for sending the spectral feature parameters and the low-frequency encoded data to a receiving end, so that the receiving end determines an audio prediction network matched with the spectral feature parameters and generates an audio output signal based on the audio prediction network and a decoded low-frequency signal obtained by decoding the low-frequency encoded data.

In some embodiments of the application, the extraction module comprises a frequency domain conversion unit, a power spectrum value calculation unit and a frequency spectrum characteristic parameter acquisition unit, wherein the frequency domain conversion unit is used for carrying out frequency domain conversion processing on the high-frequency signal to obtain a frequency domain signal, the power spectrum value calculation unit is used for calculating the power spectrum value of each frequency point in the frequency domain signal, and the frequency spectrum characteristic parameter acquisition unit is used for carrying out characteristic extraction processing based on the power spectrum value of each frequency point to obtain a frequency spectrum characteristic parameter describing the frequency spectrum distribution characteristic of the high-frequency signal.

In some embodiments of the present application, the spectrum characteristic parameter obtaining unit includes: an element calculating subunit, configured to calculate an average value of the power spectrum values of the frequency points and to determine the maximum power spectrum value among the power spectrum values of the frequency points; a difference processing subunit, configured to subtract the average value from the maximum power spectrum value to obtain a first difference value; and a spectrum characteristic parameter determining subunit, configured to determine the spectrum characteristic parameter corresponding to the high-frequency signal according to the first difference value.

In some embodiments of the present application, the spectral feature parameter includes a spectral envelope type, and the spectral feature parameter determining subunit is configured to determine that the spectral envelope type corresponding to the high-frequency signal is a first type if the first difference value is smaller than a first predetermined threshold value and the maximum power spectrum value is smaller than a second predetermined threshold value, and determine that the spectral envelope type corresponding to the high-frequency signal is a second type if the first difference value is smaller than the first predetermined threshold value and the maximum power spectrum value is larger than the second predetermined threshold value.
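
The power-spectrum statistics and the two threshold tests described above can be sketched as follows. This is only an illustration under stated assumptions: the frequency domain conversion is taken to be a real FFT, the power spectrum is kept on a linear scale, the threshold values are supplied by the caller, and the names `power_spectrum` and `classify_flat` are hypothetical. The text does not say what happens when a value equals a threshold; the sketch assigns equality on the second threshold to the second type and treats equality on the first threshold as "not flat".

```python
import numpy as np

def power_spectrum(high_band):
    # Frequency domain conversion (real FFT assumed), then power per frequency point.
    spectrum = np.fft.rfft(high_band)
    return np.abs(spectrum) ** 2

def classify_flat(power, first_threshold, second_threshold):
    # First difference value: maximum power spectrum value minus the average value.
    mean_power = power.mean()
    max_power = power.max()
    first_difference = max_power - mean_power
    if first_difference >= first_threshold:
        return None  # not a flat envelope; decided by a different rule
    if max_power < second_threshold:
        return "first_type"
    return "second_type"
```

For example, a unit impulse has a perfectly flat spectrum, so it falls into one of the two flat types depending on its energy, whereas a pure tone produces a large first difference and is left to the non-flat rule.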

In some embodiments of the present application, the spectral feature parameter includes a spectral envelope type, the spectral feature parameter determining subunit is configured to normalize a power spectrum value of each frequency point to obtain a normalized value corresponding to each frequency point if the first difference value is greater than a first predetermined threshold value, obtain at least one preset target value, each target value corresponds to a preset spectral envelope type, calculate a mean square error value between the normalized value corresponding to each frequency point and each target value, and determine a preset spectral envelope type corresponding to a target value corresponding to the smallest mean square error value as the spectral envelope type of the high-frequency signal.

In some embodiments of the present application, the spectrum characteristic parameter determining subunit is configured to perform a difference processing on the power spectrum value of each frequency point and the average value to obtain a second difference value corresponding to each frequency point, calculate a square value of the second difference value corresponding to each frequency point, calculate an average value of the square values to obtain a normalized score, and divide the second difference value corresponding to each frequency point by the normalized score to obtain a normalized value corresponding to each frequency point.
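
The normalization and the minimum mean square error selection described above can be sketched as follows. The sketch assumes that each preset target value is a vector with one entry per frequency point, and that targets are built by normalizing prototype shapes in the same way; the text states neither. The names `normalize_power` and `match_envelope` and the example templates are hypothetical.

```python
import numpy as np

def normalize_power(power):
    # Second difference value: each power spectrum value minus the average value.
    second_difference = power - power.mean()
    # "Normalized score": the average of the squared second differences.
    normalized_score = np.mean(second_difference ** 2)
    return second_difference / normalized_score

def match_envelope(power, templates):
    # Pick the preset type whose target has the smallest mean square error.
    normalized = normalize_power(power)
    errors = {
        name: float(np.mean((normalized - target) ** 2))
        for name, target in templates.items()
    }
    return min(errors, key=errors.get)

# Hypothetical targets for two envelope shapes over nine frequency points.
templates = {
    "rising": normalize_power(np.arange(1.0, 10.0)),
    "falling": normalize_power(np.arange(9.0, 0.0, -1.0)),
}
```

Because the divisor as described is the mean of the squares rather than its square root, the normalized values still depend on the overall scale of the power spectrum; the sketch follows the text as written on this point. The division is safe in this branch, since a first difference above the threshold implies the values are not all equal.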

In some embodiments of the present application, the decomposition module is configured to perform quadrature mirror image decomposition filtering processing on the target audio signal to generate the high frequency signal and the low frequency signal.
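
As a minimal sketch of a two-band quadrature mirror filter bank, the following uses the two-tap Haar filter pair in polyphase form. The filter choice is an assumption: the text names quadrature mirror filtering but does not give the filters, and a practical codec would normally use much longer filters for sharper band separation. The function names are hypothetical, and the input is assumed to have an even number of samples.

```python
import numpy as np

def qmf_analysis(signal):
    # Split into low and high bands, each at half the input sample rate.
    even, odd = signal[0::2], signal[1::2]
    low = (even + odd) / np.sqrt(2.0)    # low-pass branch: [1, 1] / sqrt(2)
    high = (even - odd) / np.sqrt(2.0)   # high-pass branch: [1, -1] / sqrt(2)
    return low, high

def qmf_synthesis(low, high):
    # Recombine the two half-rate bands into one full-rate signal.
    out = np.empty(2 * len(low))
    out[0::2] = (low + high) / np.sqrt(2.0)
    out[1::2] = (low - high) / np.sqrt(2.0)
    return out
```

With this filter pair, synthesis exactly inverts analysis, so any error in the final output would come only from the low-band codec and the high-band prediction.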

According to another embodiment of the application, a computer readable storage medium has stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform the method according to the embodiment of the application.

According to another embodiment of the application, an electronic device comprises a memory storing a computer program and a processor reading the computer program stored in the memory to perform the method according to the embodiment of the application.

According to another embodiment of the application, a computer program product or computer program includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computer device to perform the methods provided in the various alternative implementations described in the embodiments of the present application.

In the embodiment of the application, spectral feature parameters of a high-frequency signal and low-frequency encoded data of a low-frequency signal are received, the high-frequency signal and the low-frequency signal being generated by decomposing a target audio signal; the low-frequency encoded data are decoded to generate a decoded low-frequency signal; prediction network matching processing is performed based on the spectral feature parameters to obtain an audio prediction network matched with the spectral feature parameters; audio prediction processing is performed based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal; and an audio output signal corresponding to the target audio signal is generated according to the predicted high-frequency signal and the decoded low-frequency signal.

In this way, the spectrum distribution characteristics of the high-frequency signal in the target audio signal can be described by spectral feature parameters with a very small data size, and only the spectral feature parameters and the low-frequency encoded data of the low-frequency signal need to be transmitted, which effectively reduces the transmission bandwidth. At the same time, a matched audio prediction network is selected based on the spectral feature parameters to restore the high-frequency signal and generate the predicted high-frequency signal. Because the general spectrum distribution characteristics are described by this small amount of data, the error between the predicted high-frequency signal and the original high-frequency signal is controllable, and so is the generation of the audio output signal. As a result, the overall coding rate of the audio processing is effectively reduced while the ability to restore the high-frequency signal remains strong, so the transmission bandwidth of the audio data is reduced and the audio playback effect is ensured.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 shows a schematic diagram of a system to which embodiments of the application may be applied.

Fig. 2 shows a flow chart of an audio processing method according to an embodiment of the application.

Fig. 3 shows a schematic diagram of spectral envelope types according to an embodiment of the application.

Fig. 4 shows a flow chart of an audio processing method according to another embodiment of the application.

Fig. 5 shows a flow chart of an audio processing procedure in a scenario.

Fig. 6 shows another flow chart of an audio processing procedure in one scenario.

Fig. 7 shows a block diagram of an audio processing device according to an embodiment of the application.

Fig. 8 shows a block diagram of an audio processing device according to another embodiment of the application.

Fig. 9 shows a block diagram of an electronic device according to an embodiment of the application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.

In the description that follows, specific embodiments of the application are described with reference to steps and symbolic representations of operations performed by one or more computers, unless otherwise indicated. These steps and operations, which are at times referred to as being computer-executed, include the manipulation by the computer's processing unit of electronic signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the memory system of the computer, which reconfigures or otherwise alters the operation of the computer in ways well known to those skilled in the art. The data structures in which the data are maintained are physical locations of the memory that have particular properties defined by the format of the data. However, although the principles of the application are described in the foregoing context, this is not meant to be limiting, and those skilled in the art will appreciate that various of the steps and operations described below may also be implemented in hardware.

Fig. 1 shows a schematic diagram of a system 100 in which embodiments of the application may be applied. As shown in fig. 1, the system 100 may include a terminal 101, a terminal 102, and a server 103.

The terminal 101 and the terminal 102 may be any device, including, but not limited to, cell phones, computers, intelligent voice interaction devices, smart home appliances, vehicle terminals, VR/AR devices, smart watches, and the like.

The server 103 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, and the like. In one implementation of the present example, the server 103 is a cloud server.

In some examples, the terminal 101, the terminal 102, and the server 103 may be nodes in a blockchain network, which may improve the security of audio processing.

In one embodiment of the present example, the terminal 101 may receive spectral feature parameters of a high-frequency signal and low-frequency encoded data of a low-frequency signal, where the high-frequency signal and the low-frequency signal belong to a target audio signal; decode the low-frequency encoded data to generate a decoded low-frequency signal; perform prediction network matching processing based on the spectral feature parameters to obtain an audio prediction network matched with the spectral feature parameters; perform audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal; and generate an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal.

In one example, the spectral feature parameters of the high-frequency signal and the low-frequency encoded data of the low-frequency signal may be sent directly by the terminal 102, acting as the acquisition end, to the terminal 101, acting as the receiving end. In another example, they may be sent by the terminal 102 to the terminal 101 through the server 103. In yet another example, they may be sent by a processing unit A, acting as the acquisition end within the terminal 101, to a processing unit B, acting as the receiving end within the terminal 101.

In one embodiment of the present example, the terminal 102 may decompose a target audio signal to generate a high-frequency signal and a low-frequency signal; perform feature extraction processing on the high-frequency signal to obtain spectral feature parameters corresponding to the high-frequency signal; perform audio encoding processing on the low-frequency signal to generate low-frequency encoded data corresponding to the low-frequency signal; and send the spectral feature parameters and the low-frequency encoded data to a receiving end, so that the receiving end determines an audio prediction network matched with the spectral feature parameters and generates an audio output signal based on the audio prediction network and a decoded low-frequency signal obtained by decoding the low-frequency encoded data.

In one example, the receiving end may be the terminal 101, and the terminal 102 may send the spectral feature parameters and the low-frequency encoded data to the terminal 101 either directly or through the server 103. In another example, the receiving end may be a processing unit C in the terminal 102, and another processing unit D in the terminal 102, acting as the acquisition end, may send the spectral feature parameters and the low-frequency encoded data to the processing unit C.

Fig. 2 schematically shows a flow chart of an audio processing method according to an embodiment of the application. The audio processing method may be executed by any receiving end. The receiving end decodes the received audio-related data to generate an audio output signal, which is used to play sound. The receiving end may be, for example, the terminal 101 or the terminal 102 shown in fig. 1.

As shown in fig. 2, the audio processing method may include steps S210 to S250.

Step S210: receive spectral feature parameters of a high-frequency signal and low-frequency encoded data of a low-frequency signal, where the high-frequency signal and the low-frequency signal belong to a target audio signal.

Step S220: decode the low-frequency encoded data to generate a decoded low-frequency signal.

Step S230: perform prediction network matching processing based on the spectral feature parameters to obtain an audio prediction network matched with the spectral feature parameters.

Step S240: perform audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal.

Step S250: generate an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal.
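
The order of steps S210 to S250 can be sketched as a single function. Everything here is a hypothetical illustration: the low-band decoder, the prediction networks, and the band synthesizer are passed in as plain callables, and the name `process_received_frame` and its parameters do not come from the text.

```python
def process_received_frame(envelope_type, low_band_data,
                           decode_low_band, networks, synthesize):
    # S210: envelope_type and low_band_data are the two received inputs.
    # S220: decode the low-frequency encoded data.
    decoded_low = decode_low_band(low_band_data)
    # S230: select the prediction network matched with the received parameter.
    network = networks[envelope_type]
    # S240: predict the high-frequency signal from the decoded low band.
    predicted_high = network(decoded_low)
    # S250: combine both bands into the audio output signal.
    return synthesize(decoded_low, predicted_high)
```

Passing the components in as callables keeps the control flow visible without committing to any particular codec or network implementation.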

The specific procedure of each step performed when audio processing is performed in the embodiment shown in fig. 2 is described below.

In step S210, a spectral feature parameter of a high-frequency signal and low-frequency encoded data of a low-frequency signal generated by decomposing a target audio signal are received.

The target audio signal may be a digital sound signal generated by the acquisition end through analog-to-digital conversion of the acquired sound signal, the target audio signal may be decomposed into a high-frequency signal and a low-frequency signal, the high-frequency signal may be a part of the target audio signal above a predetermined frequency, and the low-frequency signal may be a part of the target audio signal below the predetermined frequency.

The acquisition end can extract the spectrum distribution characteristics of the high-frequency signal and generate spectral feature parameters describing them. A spectral feature parameter may be a number, an identifier, or the like, and its data size can be kept very small. For example, if a spectral distribution feature is described by the value "1", only 3 bits are needed to transmit it. The low-frequency signal may be encoded using a conventional speech encoder (for example CELP, SILK, or AAC) to generate the low-frequency encoded data.

After the acquisition end generates the spectral feature parameters of the high-frequency signal and the low-frequency encoded data of the low-frequency signal, it can send them to the receiving end. When transmitting, the acquisition end can combine the spectral feature parameters and the low-frequency encoded data into one encoded code stream and send it to the receiving end; the code rate of this code stream can be very low.
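
One way to combine the two items into a single code stream is sketched below. The layout is entirely hypothetical, since the text does not define a frame format: the parameter is assumed to take one of eight values, it is stored in the low 3 bits of a one-byte header (the upper 5 bits are left unused), and the low-frequency encoded data follows unchanged.

```python
def pack_frame(envelope_type: int, low_band_data: bytes) -> bytes:
    # Header byte carries the 3-bit parameter; the payload follows as-is.
    if not 0 <= envelope_type < 8:
        raise ValueError("envelope_type must fit in 3 bits")
    return bytes([envelope_type]) + low_band_data

def unpack_frame(frame: bytes):
    # Mask off the unused upper bits of the header and split out the payload.
    return frame[0] & 0b111, frame[1:]
```

A production format would more likely pack the 3 bits alongside other header fields instead of spending a whole byte on them.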

In step S220, the low-frequency encoded data is subjected to decoding processing to generate a decoded low-frequency signal.

The receiving end can decode the low-frequency encoded data with a conventional speech decoder; the result is the decoded low-frequency signal.

In step S230, a prediction network matching process is performed based on the spectral feature parameters, so as to obtain an audio prediction network matched with the spectral feature parameters.

The audio prediction network, i.e. the prediction network for predicting the high-frequency signal, may be a deep learning network. Performing the prediction network matching process based on the spectral feature parameters means determining the audio prediction network that matches the spectral feature parameters; it is understood that each spectral feature parameter may correspond to one audio prediction network.

The audio prediction network is used to predict the high-frequency signal from the low-frequency signal recovered from the low-frequency encoded data, thereby mapping audio from low frequency to high frequency. This mapping is diverse: the same low-frequency content can correspond to different high-frequency content. Using the spectral feature parameters of the high-frequency signal to match a corresponding audio prediction network limits the diversity of the samples that each network has to cover. That is, a plurality of preset audio prediction networks are trained, each on its own training samples. A training sample may include an input sample, namely signal feature information (such as spectrum information) of the low-frequency signal, and an expected output, namely signal feature information (such as spectrum information) of the high-frequency signal. The audio prediction network matched with the spectral feature parameters can then be determined from the plurality of preset audio prediction networks.

Each network corresponds to its own training samples and is trained independently. For example, preset audio prediction network A may be trained on the training samples corresponding to spectral feature parameter 1, and preset audio prediction network B may be trained on the training samples corresponding to spectral feature parameter 2. In that case, if the received spectral feature parameter is 1, the matched audio prediction network is preset audio prediction network A.
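
The per-network training sets described above amount to partitioning the training pairs by spectral feature parameter. A minimal sketch follows, assuming each sample is a tuple of (parameter, low-frequency features, high-frequency features); the tuple layout and the name `group_samples_by_parameter` are hypothetical, and the training itself is out of scope here.

```python
from collections import defaultdict

def group_samples_by_parameter(samples):
    # Build one (input, expected output) list per spectral feature parameter,
    # so that each preset network is trained only on its own group.
    groups = defaultdict(list)
    for parameter, low_features, high_features in samples:
        groups[parameter].append((low_features, high_features))
    return dict(groups)
```

Each resulting group would then be used to train exactly one preset audio prediction network, independently of the others.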

In one embodiment, the spectrum characteristic parameters comprise spectrum envelope types, and the step S230 of performing prediction network matching processing based on the spectrum characteristic parameters to obtain an audio prediction network matched with the spectrum characteristic parameters comprises the steps of obtaining network information of at least one preset audio prediction network, wherein each network information corresponds to one preset spectrum envelope type, determining network information corresponding to the preset spectrum envelope types matched with the spectrum envelope types to obtain target network information, and determining the preset audio prediction network corresponding to the target network information as the audio prediction network matched with the spectrum characteristic parameters.

The spectral feature parameter is information describing a spectral distribution feature, in this embodiment, the spectral feature parameter is a spectral envelope type, and the spectral distribution feature is a spectral envelope feature, where the spectral envelope type is information describing a spectral envelope feature, and the spectral envelope feature may represent a spectral value variation trend in a signal spectrum, that is, each spectral value variation trend corresponds to one spectral envelope type.

The network information may be an identification of a preset audio prediction network, i.e. a pre-trained audio prediction network. Each network information corresponds to a preset spectrum envelope type, that is, each preset audio prediction network corresponds to a preset spectrum envelope type.

In this embodiment, each network has its own training samples and is trained independently. For example, preset audio prediction network A may be trained on the training samples corresponding to spectral envelope type 1, and preset audio prediction network B may be trained on the training samples corresponding to spectral envelope type 2. In that case, if the received spectral envelope type is 1, the audio prediction network matched with the received spectral envelope type is preset audio prediction network A.

The preset spectral envelope type that matches the received spectral envelope type is determined first. For example, if the received spectral envelope type is 1, the matching preset spectral envelope type is 1. If the network information corresponding to the matched preset spectral envelope type 1 is X, the target network information is X, and the audio prediction network matched with the spectral feature parameter is the preset audio prediction network indicated by X.
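The matching described above amounts to a two-stage lookup: envelope type to network information, then network information to network. The following is a minimal sketch of that idea, not the patented implementation. The identifiers `net_0` to `net_7`, the registry dictionaries, and the stub networks (which simply return their input) are all hypothetical stand-ins for pre-trained audio prediction networks.

```python
from typing import Callable, Dict, List, Sequence

# A "network" here is any callable mapping low-band features to high-band features.
Predictor = Callable[[Sequence[float]], List[float]]


def make_stub_network(name: str) -> Predictor:
    """Build a placeholder for a pre-trained network (hypothetical)."""
    def predict(low_band_features: Sequence[float]) -> List[float]:
        # Placeholder behaviour only: echo the input unchanged.
        return list(low_band_features)
    predict.__name__ = name
    return predict


# Network information (an identifier) for each preset spectral envelope type 0..7.
NETWORK_INFO_BY_ENVELOPE_TYPE: Dict[int, str] = {t: f"net_{t}" for t in range(8)}

# Registry from network information to the preset audio prediction network.
NETWORKS: Dict[str, Predictor] = {
    info: make_stub_network(info) for info in NETWORK_INFO_BY_ENVELOPE_TYPE.values()
}


def match_prediction_network(envelope_type: int) -> Predictor:
    """Return the preset network whose envelope type matches the received one."""
    if envelope_type not in NETWORK_INFO_BY_ENVELOPE_TYPE:
        raise ValueError(f"unknown spectral envelope type: {envelope_type}")
    target_info = NETWORK_INFO_BY_ENVELOPE_TYPE[envelope_type]  # target network information
    return NETWORKS[target_info]
```

For a received spectral envelope type of 1, `match_prediction_network(1)` returns the network registered under `net_1`.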

Matching the prediction network by spectral envelope type can effectively improve the accuracy of the high-frequency signal predicted by the audio prediction network.

In one embodiment, referring to fig. 3, there are 8 preset spectral envelope types: type "0" is a low-energy flat type, type "1" is a high-energy flat type, type "2" is an energy convex type, type "3" is an energy concave type, type "4" is an energy gradually rising type, type "5" is an energy gradually decreasing type, type "6" is a step type with high energy first and low energy after, and type "7" is a step type with low energy first and high energy after. The applicant found that, with this division into types, matching the prediction network by spectral envelope type further improves the prediction accuracy of the audio prediction network.

It will be appreciated that in other embodiments the spectral feature parameter may include other feature information describing features of the spectrum, such as the proportion of power spectrum values exceeding a predetermined threshold. In that case, step S230 of performing prediction network matching processing based on the spectral feature parameter to obtain an audio prediction network matched with the spectral feature parameter may include: obtaining network information of at least one preset audio prediction network, each piece of network information corresponding to one preset spectral envelope type; determining the network information corresponding to the preset spectral envelope type that the other feature information matches, to obtain target network information; and determining the preset audio prediction network corresponding to the target network information as the audio prediction network matched with the spectral feature parameter.

In step S240, audio prediction processing is performed based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal.

When prediction is performed with the audio prediction network matched by the spectral feature parameter, the error between the predicted high-frequency signal and the original high-frequency signal is controllable, because the general spectral distribution feature is described with a very small amount of data. Compared with using a single unified prediction network, the matched audio prediction network predicts the high-frequency signal more accurately and avoids a predicted high-frequency signal that regresses toward an average ("middle-of-the-road") result, which improves the accuracy of the predicted high-frequency signal.

Signal feature information (e.g., spectral information) of the decoded low-frequency signal may be input to the audio prediction network, which predicts and outputs signal feature information (e.g., spectral information) of the high-frequency signal. The output signal feature information is then used to restore the high-frequency signal, generating the predicted high-frequency signal.

In one embodiment, step S240 of performing audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal includes: performing spectral feature extraction processing on the decoded low-frequency signal to obtain low-frequency spectral information; performing audio prediction processing based on the low-frequency spectral information using the audio prediction network to obtain predicted spectral information; and generating the predicted high-frequency signal corresponding to the high-frequency signal based on the predicted spectral information.

The low-frequency spectral information is input to the audio prediction network, which performs audio prediction processing and outputs predicted spectral information. Audio restoration based on the predicted spectral information then yields the predicted high-frequency signal corresponding to the high-frequency signal.

In one embodiment, performing spectral feature extraction processing on the decoded low-frequency signal to obtain low-frequency spectral information includes performing a modified discrete cosine transform on the decoded low-frequency signal to obtain the low-frequency spectral information, and generating the predicted high-frequency signal corresponding to the high-frequency signal based on the predicted spectral information includes performing an inverse modified discrete cosine transform on the predicted spectral information to generate the predicted high-frequency signal corresponding to the high-frequency signal.

In this embodiment, the decoded low-frequency signal may be processed by a modified discrete cosine transform (MDCT, Modified Discrete Cosine Transform) to obtain the low-frequency spectral information. The predicted spectral information output by the audio prediction network may then be processed by an inverse modified discrete cosine transform (IMDCT, Inverse Modified Discrete Cosine Transform) to generate the predicted high-frequency signal. The applicant has found that this manner of prediction further improves the accuracy of the predicted audio signal.
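The MDCT, prediction, IMDCT chain can be sketched as follows. This is an illustrative sketch only: the frame size `n_bins`, the sine window, the 50% overlap-add scheme, and the `network` callable are assumptions not stated in the description, and the network is treated as an opaque function from low-band MDCT frames to predicted high-band MDCT frames.

```python
import numpy as np


def _mdct_basis(n_bins: int) -> np.ndarray:
    """Cosine basis of shape (n_bins, 2 * n_bins) shared by MDCT and IMDCT."""
    n = np.arange(2 * n_bins)
    k = np.arange(n_bins)
    return np.cos(np.pi / n_bins * np.outer(k + 0.5, n + 0.5 + n_bins / 2))


def _sine_window(n_bins: int) -> np.ndarray:
    """Sine window; satisfies w[n]^2 + w[n + n_bins]^2 = 1 (Princen-Bradley)."""
    return np.sin(np.pi * (np.arange(2 * n_bins) + 0.5) / (2 * n_bins))


def mdct(signal: np.ndarray, n_bins: int) -> np.ndarray:
    """Windowed MDCT over 50%-overlapping frames; returns (num_frames, n_bins).

    Trailing samples that do not fill a whole hop are ignored.
    """
    signal = np.asarray(signal, dtype=float)
    num_frames = len(signal) // n_bins - 1
    frames = np.stack(
        [signal[i * n_bins:(i + 2) * n_bins] for i in range(num_frames)]
    )
    return (frames * _sine_window(n_bins)) @ _mdct_basis(n_bins).T


def imdct(spectrum: np.ndarray, n_bins: int) -> np.ndarray:
    """Windowed IMDCT with overlap-add; inverts mdct() except at the two edges."""
    frames = (2.0 / n_bins) * (spectrum @ _mdct_basis(n_bins)) * _sine_window(n_bins)
    out = np.zeros((len(spectrum) + 1) * n_bins)
    for i, frame in enumerate(frames):
        out[i * n_bins:(i + 2) * n_bins] += frame
    return out


def predict_high_band(decoded_low: np.ndarray, network, n_bins: int = 64) -> np.ndarray:
    """MDCT of the decoded low band -> network prediction -> IMDCT."""
    low_spectrum = mdct(decoded_low, n_bins)      # low-frequency spectral information
    predicted_spectrum = network(low_spectrum)    # predicted spectral information
    return imdct(predicted_spectrum, n_bins)      # predicted high-frequency signal
```

With an identity function as `network`, the output reproduces the input everywhere except the first and last `n_bins` samples, which are covered by only one frame; this is a convenient sanity check on the transform pair before a real network is plugged in.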

In step S250, an audio output signal corresponding to the target audio signal is generated according to the predicted high frequency signal and the decoded low frequency signal.

The predicted high-frequency signal is obtained by prediction, and the decoded low-frequency signal is recovered from the original low-frequency signal by decoding. Synthesizing the predicted high-frequency signal and the decoded low-frequency signal generates the audio output signal corresponding to the original target audio signal.

In one embodiment, step S250 of generating an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal includes performing quadrature mirror synthesis filtering on the predicted high-frequency signal and the decoded low-frequency signal to generate the audio output signal. The synthesis filtering may be performed by a quadrature mirror filter (QMF, Quadrature Mirror Filter) bank to generate a full-band audio output signal corresponding to the target audio signal.

It will be appreciated that in other embodiments the audio output signal may be generated by synthesis-filtering the predicted high-frequency signal and the decoded low-frequency signal with other existing synthesis filters.

Fig. 4 schematically shows a flow chart of an audio processing method according to an embodiment of the application. The main body of execution of the audio processing method may be any terminal, for example, the terminal 101 or the terminal 102 shown in fig. 1.

As shown in fig. 4, the audio processing method may include steps S310 to S340.

Step S310: decomposing the target audio signal to generate a high-frequency signal and a low-frequency signal. Step S320: performing feature extraction processing on the high-frequency signal to obtain a spectral feature parameter corresponding to the high-frequency signal. Step S330: performing audio encoding processing on the low-frequency signal to generate low-frequency encoded data corresponding to the low-frequency signal. Step S340: transmitting the spectral feature parameter and the low-frequency encoded data to a receiving end, so that the receiving end determines an audio prediction network matched with the spectral feature parameter and generates an audio output signal based on the audio prediction network and the decoded low-frequency signal obtained by decoding the low-frequency encoded data.

In this way, the spectral distribution feature of the high-frequency signal in the target audio signal is described by a spectral feature parameter with a very small data size, so only the spectral feature parameter and the low-frequency encoded data of the low-frequency signal need to be transmitted, which effectively reduces the transmission bandwidth. Meanwhile, the matched audio prediction network is selected based on the spectral feature parameter to restore the high-frequency signal. Because the general spectral distribution feature is described with a very small amount of data, the error between the predicted high-frequency signal and the original high-frequency signal is controllable, and so is the generation of the audio output signal. As a result, the overall coding rate of the audio processing is effectively reduced while the ability to restore the high-frequency signal remains strong, so the transmission bandwidth of the audio data is reduced and the audio playback effect is ensured.

The specific procedure of each step performed during audio processing in the embodiment shown in fig. 4 is described below.

In step S310, the target audio signal is decomposed to generate a high-frequency signal and a low-frequency signal.

The target audio signal may be decomposed into a high-frequency signal and a low-frequency signal by, for example, a band-pass filter (BPF, Band-Pass Filter) bank or a quadrature mirror filter (QMF, Quadrature Mirror Filter) bank.

In one embodiment, step S310 of decomposing the target audio signal to generate a high-frequency signal and a low-frequency signal includes performing quadrature mirror decomposition filtering on the target audio signal to generate the high-frequency signal and the low-frequency signal. The decomposition filtering may be performed by a quadrature mirror filter (QMF, Quadrature Mirror Filter) bank.
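As a rough illustration of QMF decomposition at the acquisition end and the matching synthesis at the receiving end, the sketch below uses the two-tap (Haar) filter pair, the shortest pair satisfying the mirror relation H1(z) = H0(-z). The filter choice and the function names are assumptions made for brevity; a practical codec would normally use much longer filters with sharper band separation.

```python
import numpy as np


def qmf_analysis(signal):
    """Split a signal into half-rate low and high bands with a 2-tap QMF pair."""
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:
        raise ValueError("signal length must be even")
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)    # low-pass branch, downsampled by 2
    high = (even - odd) / np.sqrt(2.0)   # mirrored high-pass branch, downsampled by 2
    return low, high


def qmf_synthesis(low, high):
    """Recombine half-rate low and high bands into a full-band signal."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    if low.shape != high.shape:
        raise ValueError("low and high bands must have the same length")
    out = np.empty(2 * len(low))
    out[0::2] = (low + high) / np.sqrt(2.0)
    out[1::2] = (low - high) / np.sqrt(2.0)
    return out
```

In the scheme described here, the `high` band produced by `qmf_analysis` is only used for feature extraction at the acquisition end, and the `high` argument of `qmf_synthesis` at the receiving end is the predicted high-frequency signal rather than the original one.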

In step S320, feature extraction processing is performed on the high-frequency signal to obtain a spectral feature parameter corresponding to the high-frequency signal.

The acquisition end can extract the spectrum distribution characteristics of the high-frequency signal and generate spectrum characteristic parameters describing the spectrum distribution characteristics according to the extracted spectrum distribution characteristics. The spectral distribution characteristics may include, among other things, spectral envelope characteristics, proportions of power spectrum values in the spectrum exceeding a predetermined threshold, and the like. In one embodiment of the present example, the spectral distribution feature is a spectral envelope feature, and the spectral feature parameter is a spectral envelope type.

The spectral feature parameter may be a number, an identifier, or the like, and its data size can be kept very small. For example, a spectral distribution feature may be described by the value "1", in which case only 3 bits are needed to describe that spectral distribution feature.

In one embodiment, step S320 of performing feature extraction processing on the high-frequency signal to obtain a spectral feature parameter corresponding to the high-frequency signal includes: performing frequency domain conversion processing on the high-frequency signal to obtain a frequency domain signal; calculating the power spectrum value of each frequency point in the frequency domain signal; and performing feature extraction processing based on the power spectrum values of the frequency points to obtain a spectral feature parameter describing the spectral distribution feature of the high-frequency signal.

The high-frequency signal is a time-domain signal. A frequency domain signal can be obtained from it by frequency domain conversion processing (such as a Fourier transform), and the power spectrum value of each frequency point can be extracted from the spectrum (i.e., the spectrogram) of the frequency domain signal. Feature extraction based on the power spectrum values of the frequency points allows the spectral distribution feature of the high-frequency signal to be analyzed accurately, and a spectral feature parameter describing that feature to be generated. In one example, the power spectrum value of each frequency point may be the logarithm of the power spectrum at that frequency point.
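A minimal sketch of this step, assuming a real FFT over a single frame and a base-10 logarithm (the description fixes neither the transform size nor the logarithm base, and the small `eps` guard is an addition to avoid taking the logarithm of zero):

```python
import numpy as np


def log_power_spectrum(frame, eps: float = 1e-12) -> np.ndarray:
    """Return the log power spectrum value x(i) for each frequency point of a frame."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.rfft(frame)        # frequency domain conversion
    power = np.abs(spectrum) ** 2        # power spectrum per frequency point
    return np.log10(power + eps)         # logarithmic power spectrum value
```

The frequency points N1 to N2 used for feature extraction would then be a slice of the returned array covering the band of interest.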

In one embodiment, performing feature extraction processing based on the power spectrum values of the frequency points to obtain a spectral feature parameter describing the spectral distribution feature of the high-frequency signal includes: calculating the average value of the power spectrum values of the frequency points; determining the maximum power spectrum value among the power spectrum values of the frequency points; subtracting the average value from the maximum power spectrum value to obtain a first difference; and determining the spectral feature parameter corresponding to the high-frequency signal according to the first difference.

For example, let the power spectrum value of each frequency point be x(i), i ∈ [N1, N2], where i is the index of the frequency point. The average value xavg is the mean of all x(i), the maximum power spectrum value xmax is the largest of all x(i), and subtracting the average value xavg from the maximum power spectrum value xmax gives the first difference. The magnitude of the first difference accurately reflects the spectral distribution feature, in particular the spectral envelope feature.
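Written out, with d1 used here as an assumed symbol for the first difference (the description does not name it):

$$x_{avg} = \frac{1}{N2 - N1 + 1} \sum_{i=N1}^{N2} x(i), \qquad x_{max} = \max_{i \in [N1, N2]} x(i), \qquad d_1 = x_{max} - x_{avg}$$

As a small worked case, for four frequency points with x = (1, 1, 1, 5), the average is 2, the maximum is 5, and the first difference is 3. A flat spectrum such as x = (2, 2, 2, 2) gives a first difference of 0.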

In some other implementations, when feature extraction processing is performed based on the power spectrum values of the frequency points to obtain a spectral feature parameter describing the spectral distribution feature of the high-frequency signal, the proportion of power spectrum values in the spectrum that exceed a predetermined threshold may be calculated, and the corresponding spectral feature parameter determined according to that proportion.

In one embodiment, the spectral feature parameter includes a spectral envelope type, and determining the spectral feature parameter corresponding to the high-frequency signal according to the first difference includes: determining that the spectral envelope type corresponding to the high-frequency signal is a first type if the first difference is smaller than a first predetermined threshold and the maximum power spectrum value is smaller than a second predetermined threshold; and determining that the spectral envelope type corresponding to the high-frequency signal is a second type if the first difference is smaller than the first predetermined threshold and the maximum power spectrum value is larger than the second predetermined threshold. The first type and the second type each describe a corresponding spectral envelope feature.

In one implementation of this embodiment, referring to fig. 3, the preset spectral envelope types include type "0", a low-energy flat type, and type "1", a high-energy flat type. If the first difference is smaller than the first predetermined threshold C1 and the maximum power spectrum value is smaller than the second predetermined threshold C2, the spectral envelope type corresponding to the high-frequency signal is determined to be the first type, i.e., the low-energy flat type "0". If the first difference is smaller than the first predetermined threshold C1 and the maximum power spectrum value is larger than the second predetermined threshold C2, the spectral envelope type corresponding to the high-frequency signal is determined to be the second type, i.e., the high-energy flat type "1".
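The threshold test for the two flat types can be sketched as below. The function name and the `None` return for "not a flat type" are illustrative choices, and the values of C1 and C2 are left to the caller because the description does not give them. The boundary cases (first difference equal to C1, maximum equal to C2) are not specified in the description, so this sketch leaves them unclassified.

```python
from typing import Optional, Sequence


def classify_flat_envelope(
    power_values: Sequence[float], c1: float, c2: float
) -> Optional[int]:
    """Return 0 (low-energy flat), 1 (high-energy flat), or None if not flat."""
    x_avg = sum(power_values) / len(power_values)
    x_max = max(power_values)
    first_difference = x_max - x_avg
    if first_difference >= c1:
        return None      # peak stands out from the average: not a flat envelope
    if x_max < c2:
        return 0         # first type: low-energy flat
    if x_max > c2:
        return 1         # second type: high-energy flat
    return None          # x_max == c2 is not covered by the description
```

A `None` result means the envelope should instead be compared against the target values of types "2" to "7".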

In one embodiment, the spectral feature parameter includes a spectral envelope type, and determining the spectral feature parameter corresponding to the high-frequency signal according to the first difference includes: if the first difference is larger than the first predetermined threshold, normalizing the power spectrum value of each frequency point to obtain a normalized value corresponding to each frequency point; obtaining at least one preset target value, each target value corresponding to one preset spectral envelope type; calculating the mean square error value between the normalized values corresponding to the frequency points and each target value; and determining the preset spectral envelope type corresponding to the target value with the smallest mean square error value as the spectral envelope type of the high-frequency signal.

In one implementation of this embodiment, referring to fig. 3, the preset spectral envelope types include type "2", an energy convex type; type "3", an energy concave type; type "4", an energy gradually rising type; type "5", an energy gradually decreasing type; type "6", a step type with high energy first and low energy after; and type "7", a step type with low energy first and high energy after.

Each target value corresponds to one preset spectral envelope type and is a preset value. In one example, the target value may be z(i), i ∈ [N1, N2], where i is the index of the frequency point and z(i) is less than 1. Taking type "2" as an example, if N2-N1+1 is equal to 9, the target value z(i) corresponding to type "2" may be set to 000111000, i.e., one value per frequency point.

The mean square error value between the normalized values corresponding to the frequency points and each target value is calculated, and the preset spectral envelope type closest to the spectral distribution feature of the high-frequency signal is determined from the smallest mean square error value. For example, if the mean square error value between the target value of type "2" and the normalized values of the frequency points is smaller than those of types "3" to "7", the energy convex type "2" is determined to be the spectral envelope type of the high-frequency signal.
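The minimum mean square error selection can be sketched as follows. Only the type "2" target value 000111000 comes from the description; the targets for types "3" and "4" in `EXAMPLE_TARGETS` are hypothetical shapes chosen to illustrate a concave and a rising envelope over 9 frequency points, and the function name is likewise an assumption.

```python
from typing import Dict, Sequence, Tuple

import numpy as np

EXAMPLE_TARGETS: Dict[int, Sequence[float]] = {
    2: [0, 0, 0, 1, 1, 1, 0, 0, 0],              # energy convex (from the description)
    3: [1, 1, 1, 0, 0, 0, 1, 1, 1],              # energy concave (hypothetical)
    4: list(np.linspace(0.0, 1.0, 9)),           # gradually rising (hypothetical)
}


def match_envelope_type(
    normalized_values: Sequence[float],
    targets: Dict[int, Sequence[float]] = EXAMPLE_TARGETS,
) -> Tuple[int, float]:
    """Return (envelope type, mse) for the target value closest to y(i)."""
    y = np.asarray(normalized_values, dtype=float)
    best_type, best_mse = -1, float("inf")
    for envelope_type, target in targets.items():
        z = np.asarray(target, dtype=float)
        if z.shape != y.shape:
            raise ValueError("target and normalized values must have equal length")
        mse = float(np.mean((y - z) ** 2))   # mean square error against this target
        if mse < best_mse:
            best_type, best_mse = envelope_type, mse
    return best_type, best_mse
```

Keeping the targets in a dictionary keyed by envelope type makes it straightforward to add the remaining types once their target values are defined.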

In one embodiment, normalizing the power spectrum value of each frequency point to obtain a normalized value corresponding to each frequency point includes: subtracting the average value from the power spectrum value of each frequency point to obtain a second difference corresponding to each frequency point; calculating the square of the second difference corresponding to each frequency point; calculating the average of the squares to obtain a normalized score; and dividing the second difference corresponding to each frequency point by the normalized score to obtain the normalized value corresponding to each frequency point.

In this embodiment, the normalization may be performed using the formulas

$$\mathrm{std} = \frac{1}{N2 - N1 + 1} \sum_{i=N1}^{N2} \left( x(i) - x_{avg} \right)^2$$

and

$$y(i) = \frac{x(i) - x_{avg}}{\mathrm{std}}$$

to obtain the normalized value y(i) corresponding to each frequency point i, where N2-N1+1 is the total number of frequency points, N1 to N2 is the range of frequency point indices, xavg is the average value, x(i) is the power spectrum value of frequency point i, x(i)-xavg is the second difference corresponding to each frequency point, and std is the normalized score (the average of the squares).
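A minimal sketch of this normalization follows. It takes the description literally: the normalized score `std` is the mean of the squared second differences, not its square root, despite the name. The function name and the error raised for a perfectly flat input are assumptions.

```python
import numpy as np


def normalize_power_values(power_values) -> np.ndarray:
    """Return y(i) = (x(i) - xavg) / std, with std the mean of the squared differences."""
    x = np.asarray(power_values, dtype=float)
    second_difference = x - x.mean()            # x(i) - xavg for each frequency point
    std = np.mean(second_difference ** 2)       # normalized score (mean of squares)
    if std == 0.0:
        raise ValueError("all power spectrum values are equal; nothing to normalize")
    return second_difference / std
```

For x = (0, 4), the average is 2, the second differences are (-2, 2), the normalized score is 4, and the normalized values are (-0.5, 0.5). The zero-score case does not arise in the flow described here, because normalization is only applied when the first difference exceeds the first predetermined threshold.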

In step S330, the low-frequency signal is subjected to audio encoding processing, and low-frequency encoded data corresponding to the low-frequency signal is generated.

The low frequency signal may be encoded using a conventional speech encoder (which may be CELP, SILK, AAC or the like) to generate low frequency encoded data.

In step S340, the spectral feature parameter and the low-frequency encoded data are transmitted to the receiving end, so that the receiving end determines an audio prediction network matched with the spectral feature parameter and generates an audio output signal based on the audio prediction network and the decoded low-frequency signal obtained by decoding the low-frequency encoded data.

When the spectral feature parameter and the low-frequency encoded data are sent to the receiving end, they may be combined into a single encoded code stream, which can be very small, and sent to the receiving end. The receiving end may determine the audio prediction network matched with the spectral feature parameter based on the steps in the embodiment shown in fig. 2, and generate an audio output signal based on the audio prediction network and the decoded low-frequency signal obtained by decoding the low-frequency encoded data.
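One way to combine the two parts into a code stream is sketched below. The layout is entirely hypothetical, since the description does not define a bitstream format: a single header byte whose low 3 bits carry the spectral envelope type (8 types fit in 3 bits), followed by the low-frequency encoded data as an opaque payload.

```python
from typing import Tuple


def pack_code_stream(envelope_type: int, low_band_payload: bytes) -> bytes:
    """Prefix the low-frequency encoded data with a 3-bit envelope type (hypothetical layout)."""
    if not 0 <= envelope_type <= 7:
        raise ValueError("envelope type must fit in 3 bits (0..7)")
    header = envelope_type & 0b111          # upper 5 bits of the header byte are unused
    return bytes([header]) + bytes(low_band_payload)


def parse_code_stream(stream: bytes) -> Tuple[int, bytes]:
    """Recover (envelope type, low-frequency encoded data) at the receiving end."""
    if len(stream) < 1:
        raise ValueError("code stream is empty")
    return stream[0] & 0b111, bytes(stream[1:])
```

A real format would likely pack the 3 bits alongside other per-frame fields rather than spend a whole byte on them; the byte-aligned header is used here only to keep the example short.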

The method described in the above embodiments is described in further detail below with reference to an example application scenario. The related terms have the same meaning as in the foregoing embodiments, to which reference may be made. The flow of audio processing in this application scenario, in which the foregoing embodiments of the present application are applied, is shown in fig. 5 and fig. 6.

First, referring to fig. 5, the encoding process in the audio processing process is performed at the acquisition end, and the process may include steps S410 to S450.

In step S410, a target audio signal is input. Specifically, the acquisition end may acquire a target audio signal generated by analog-to-digital conversion of a sound signal.

In step S420, QMF decomposition is performed. Specifically, the target audio signal is decomposed to generate a high-frequency signal and a low-frequency signal, which includes performing quadrature mirror decomposition filtering on the target audio signal to generate the high-frequency signal and the low-frequency signal. The decomposition filtering may be performed by a quadrature mirror filter (QMF, Quadrature Mirror Filter) bank.

In step S430, spectral feature extraction is performed. Specifically, feature extraction processing is performed on the high-frequency signal to obtain the spectral feature parameter corresponding to the high-frequency signal.

Performing feature extraction processing on the high-frequency signal to obtain the spectral feature parameter corresponding to the high-frequency signal includes: performing frequency domain conversion processing on the high-frequency signal to obtain a frequency domain signal; calculating the power spectrum value of each frequency point in the frequency domain signal; and performing feature extraction processing based on the power spectrum values of the frequency points to obtain the spectral feature parameter describing the spectral distribution feature of the high-frequency signal.

Performing feature extraction processing based on the power spectrum values of the frequency points includes: calculating the average value of the power spectrum values of the frequency points; determining the maximum power spectrum value among the power spectrum values of the frequency points; subtracting the average value from the maximum power spectrum value to obtain a first difference; and determining the spectral feature parameter corresponding to the high-frequency signal according to the first difference.

The spectral feature parameter includes a spectral envelope type. Specifically, referring to fig. 3, the preset spectral envelope types include type "0", a low-energy flat type; type "1", a high-energy flat type; type "2", an energy convex type; type "3", an energy concave type; type "4", an energy gradually rising type; type "5", an energy gradually decreasing type; type "6", a step type with high energy first and low energy after; and type "7", a step type with low energy first and high energy after.

Type "0" is the low-energy flat type and type "1" is the high-energy flat type. Determining the spectral feature parameter corresponding to the high-frequency signal according to the first difference includes: determining that the spectral envelope type corresponding to the high-frequency signal is a first type if the first difference is smaller than a first predetermined threshold and the maximum power spectrum value is smaller than a second predetermined threshold; and determining that the spectral envelope type is a second type if the first difference is smaller than the first predetermined threshold and the maximum power spectrum value is larger than the second predetermined threshold. That is, if the first difference is smaller than the first predetermined threshold C1 and the maximum power spectrum value is smaller than the second predetermined threshold C2, the spectral envelope type is the first type, the low-energy flat type "0"; if the first difference is smaller than C1 and the maximum power spectrum value is larger than C2, the spectral envelope type is the second type, the high-energy flat type "1".

For types "2" to "7", determining the spectral feature parameter corresponding to the high-frequency signal according to the first difference includes: if the first difference is larger than the first predetermined threshold, normalizing the power spectrum value of each frequency point to obtain a normalized value corresponding to each frequency point; obtaining at least one preset target value, each target value corresponding to one preset spectral envelope type; calculating the mean square error value between the normalized values corresponding to the frequency points and each target value; and determining the preset spectral envelope type corresponding to the target value with the smallest mean square error value as the spectral envelope type of the high-frequency signal.

Each target value corresponds to one preset spectral envelope type and is a preset value. The target value may be z(i), i ∈ [N1, N2], where i is the index of the frequency point and z(i) is less than 1. Taking type "2" as an example, if N2-N1+1 is equal to 9, the target value z(i) corresponding to type "2" may be set to 000111000. The mean square error value between the normalized values corresponding to the frequency points and each target value is calculated, and the preset spectral envelope type closest to the spectral distribution feature of the high-frequency signal is determined from the smallest mean square error value. For example, if the mean square error value between the target value of type "2" and the normalized values of the frequency points is smaller than those of types "3" to "7", the energy convex type "2" is determined to be the spectral envelope type of the high-frequency signal.

Normalizing the power spectrum value of each frequency point to obtain a normalized value corresponding to each frequency point includes: subtracting the average value from the power spectrum value of each frequency point to obtain a second difference corresponding to each frequency point; calculating the square of the second difference corresponding to each frequency point; calculating the average of the squares to obtain a normalized score; and dividing the second difference corresponding to each frequency point by the normalized score to obtain the normalized value corresponding to each frequency point. In this embodiment, the normalization may be performed using the formulas

$$\mathrm{std} = \frac{1}{N2 - N1 + 1} \sum_{i=N1}^{N2} \left( x(i) - x_{avg} \right)^2$$

and

$$y(i) = \frac{x(i) - x_{avg}}{\mathrm{std}}$$

to obtain the normalized value y(i) corresponding to each frequency point i, where N2-N1+1 is the total number of frequency points, N1 to N2 is the range of frequency point indices, xavg is the average value, x(i) is the power spectrum value of frequency point i, x(i)-xavg is the second difference corresponding to each frequency point, and std is the normalized score (the average of the squares).

In step S440, low-frequency speech encoding is performed. Specifically, the low-frequency signal is subjected to audio encoding processing to generate low-frequency encoded data corresponding to the low-frequency signal. The low-frequency signal may be encoded using a conventional speech encoder (such as CELP, SILK, AAC, or the like) to generate the low-frequency encoded data.

In step S450, the data is output. Specifically, the spectral feature parameter and the low-frequency encoded data are transmitted to the receiving end. The spectral feature parameter and the low-frequency encoded data may be packaged together into an encoded code stream that is sent to the receiving end.

Further, referring to fig. 6, the decoding process in the audio processing is performed at the receiving end, and may include steps S510 to S550.

In step S510, a code stream is input. Specifically, the code stream sent by the acquisition end is received, where the code stream includes the spectral feature parameter of the high-frequency signal and the low-frequency encoded data of the low-frequency signal. That is, the spectral feature parameter of the high-frequency signal and the low-frequency encoded data of the low-frequency signal are received, the high-frequency signal and the low-frequency signal having been generated by decomposing the target audio signal.

In step S520, the code stream is parsed. Specifically, the received code stream is parsed to obtain the spectral feature parameter of the high-frequency signal and the low-frequency encoded data of the low-frequency signal contained in the code stream.

In step S530, low-frequency speech decoding is performed. Specifically, the low-frequency encoded data is decoded to generate a decoded low-frequency signal. The receiving end may decode the low-frequency encoded data with a conventional speech decoder; the decoder output is the decoded low-frequency signal.

In step S540, network matching, specifically, performing prediction network matching processing based on the spectrum feature parameters, to obtain an audio prediction network with the spectrum feature parameters matched.

The spectral feature parameters comprise a spectral envelope type. Performing prediction network matching processing based on the spectral feature parameters to obtain the audio prediction network matched with the spectral feature parameters comprises: acquiring network information of at least one preset audio prediction network, each piece of network information corresponding to one preset spectral envelope type; determining the network information corresponding to the preset spectral envelope type that matches the received spectral envelope type, to obtain target network information; and determining the preset audio prediction network corresponding to the target network information as the audio prediction network matched with the spectral feature parameters.
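The matching step amounts to a lookup from spectral envelope type to a preset network. A minimal sketch follows; the `NetworkInfo` record, the registry contents and the `match_network` name are hypothetical, and the `predict` callables are trivial stand-ins for trained deep learning networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass(frozen=True)
class NetworkInfo:
    """Hypothetical network information for one preset audio prediction network."""
    envelope_type: int                                   # preset spectral envelope type
    name: str                                            # illustrative label only
    predict: Callable[[Sequence[float]], List[float]]    # stand-in for a trained network

def match_network(envelope_type: int, registry: Sequence[NetworkInfo]) -> NetworkInfo:
    """Return the network information whose preset type matches the received type."""
    for info in registry:
        if info.envelope_type == envelope_type:
            return info                                  # target network information
    raise KeyError(f"no preset network for envelope type {envelope_type}")

# Illustrative registry with two preset envelope types (values are made up).
REGISTRY = [
    NetworkInfo(1, "flat-envelope", lambda spec: [0.5 * v for v in spec]),
    NetworkInfo(2, "tilted-envelope", lambda spec: [0.25 * v for v in spec]),
]
```

Calling `match_network(2, REGISTRY)` selects the second entry, whose `predict` callable would then be applied to the low-frequency spectrum information.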

In step S550, a prediction process, specifically, an audio prediction process is performed based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal.

Step S550, performing audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal, may comprise: step S551, performing spectral feature extraction processing on the decoded low-frequency signal to obtain low-frequency spectrum information; step S552, performing audio prediction processing based on the low-frequency spectrum information by using the audio prediction network to obtain predicted spectrum information; and step S553, generating a predicted high-frequency signal corresponding to the high-frequency signal based on the predicted spectrum information.

Performing spectral feature extraction processing on the decoded low-frequency signal to obtain low-frequency spectrum information comprises performing modified discrete cosine transform processing on the decoded low-frequency signal to obtain the low-frequency spectrum information. Generating a predicted high-frequency signal corresponding to the high-frequency signal based on the predicted spectrum information comprises performing inverse modified discrete cosine transform processing on the predicted spectrum information to generate the predicted high-frequency signal corresponding to the high-frequency signal.

The decoded low frequency signal may be subjected to modified discrete cosine transform processing by a modified discrete cosine transformer (MDCT, Modified Discrete Cosine Transform) to obtain the low frequency spectrum information. Then, the predicted spectrum information output by the audio prediction network may be subjected to inverse modified discrete cosine transform processing by an inverse modified discrete cosine transformer (IMDCT, Inverse Modified Discrete Cosine Transform) to generate the predicted high frequency signal.
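The MDCT/IMDCT pair used in steps S551 and S553 can be illustrated with a direct matrix-based sketch. The frame length, the sine window, the 50% overlap and all function names below are assumptions chosen for illustration, since the application does not fix them. The round trip leaves out the prediction network, which in the described method would map the low-band coefficients to predicted high-band coefficients between the two transforms.

```python
import numpy as np

def _basis(n_bins):
    # Cosine basis of shape (n_bins, 2 * n_bins) for a frame of 2 * n_bins samples.
    n = np.arange(2 * n_bins)[None, :]
    k = np.arange(n_bins)[:, None]
    return np.cos(np.pi / n_bins * (n + 0.5 + n_bins / 2) * (k + 0.5))

def _sine_window(n_bins):
    # Sine window; its squared halves sum to one (Princen-Bradley condition).
    return np.sin(np.pi * (np.arange(2 * n_bins) + 0.5) / (2 * n_bins))

def mdct(frame):
    """Windowed MDCT: 2N time samples -> N spectral coefficients."""
    frame = np.asarray(frame, dtype=float)
    n_bins = len(frame) // 2
    return _basis(n_bins) @ (_sine_window(n_bins) * frame)

def imdct(coeffs):
    """Windowed IMDCT: N coefficients -> 2N time-aliased samples."""
    coeffs = np.asarray(coeffs, dtype=float)
    n_bins = len(coeffs)
    return _sine_window(n_bins) * (_basis(n_bins).T @ coeffs) * (2.0 / n_bins)

def overlap_add_roundtrip(signal, n_bins):
    """Transform 50%-overlapping frames and overlap-add the inverse transforms."""
    signal = np.asarray(signal, dtype=float)
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - 2 * n_bins + 1, n_bins):
        frame = signal[start:start + 2 * n_bins]
        out[start:start + 2 * n_bins] += imdct(mdct(frame))
    return out
```

Each frame on its own comes back time-aliased; the aliasing cancels only after overlap-add with the neighbouring frames, so the first and last `n_bins` samples of the signal, which are covered by a single frame, are not reconstructed in this sketch.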

In step S560, QMF synthesis, specifically, generating an audio output signal corresponding to the target audio signal from the predicted high frequency signal and the decoded low frequency signal. Generating the audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal comprises performing quadrature mirror synthesis filtering processing on the predicted high-frequency signal and the decoded low-frequency signal to generate the audio output signal. The quadrature mirror synthesis filtering processing may be performed on the predicted high frequency signal and the decoded low frequency signal through a quadrature mirror filter (QMF, Quadrature Mirror Filter) to generate a full-band audio output signal corresponding to the target audio signal.

In this way, the method achieves at least the following. The acquisition end describes the spectrum distribution characteristics of the high-frequency signal in the target audio signal through spectral feature parameters of very small data size, so only the spectral feature parameters and the low-frequency encoded data of the low-frequency signal need to be transmitted, which effectively reduces the transmission bandwidth. Meanwhile, the receiving end selects the matched audio prediction network based on the spectral feature parameters to restore the high-frequency signal. Because the general spectrum distribution characteristics are described by that small amount of data, the error between the predicted high-frequency signal and the original high-frequency signal is controllable, and so is the generated audio output signal. The overall coding rate is therefore effectively reduced in the audio processing process while a strong capability of restoring the high-frequency signal is retained, so that the transmission bandwidth of the audio data is effectively reduced and the audio playing effect is ensured.
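The band split at the acquisition end and the QMF synthesis at the receiving end can be illustrated with the simplest two-channel quadrature mirror filter bank, built from two-tap Haar filters. The filter choice and the function names are assumptions; the application does not specify filter coefficients, and a practical codec would use much longer filters for sharper band separation.

```python
import numpy as np

def qmf_analysis(signal):
    """Split a signal into half-rate low and high bands with two-tap Haar filters."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    even, odd = signal[0::2], signal[1::2]
    low = (even + odd) / np.sqrt(2.0)    # low-pass branch, decimated by 2
    high = (even - odd) / np.sqrt(2.0)   # high-pass (mirror) branch, decimated by 2
    return low, high

def qmf_synthesis(low, high):
    """Recombine the two half-rate bands into a full-band signal."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    out = np.empty(2 * len(low))
    out[0::2] = (low + high) / np.sqrt(2.0)
    out[1::2] = (low - high) / np.sqrt(2.0)
    return out
```

When the synthesis receives the exact analysis outputs, the original signal is recovered. In the described method the high band passed to the synthesis is the predicted high-frequency signal, so the output approximates the target audio signal rather than reproducing it exactly.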

In order to facilitate better implementation of the audio processing method provided by the embodiment of the application, the embodiment of the application also provides an audio processing device based on the audio processing method. Where the meaning of the terms is the same as in the above-described audio processing method, specific implementation details may be referred to in the description of the method embodiments. Fig. 7 shows a block diagram of an audio processing device according to an embodiment of the application. Fig. 8 shows a block diagram of an audio processing device according to another embodiment of the application.

As shown in fig. 7, the audio processing apparatus 600 may include a receiving module 610, a decoding module 620, a matching module 630, a predicting module 640, and an output module 650, where the audio processing apparatus 600 may be applied to a device corresponding to a receiving end of audio.

The receiving module 610 may be configured to receive spectral feature parameters of a high-frequency signal and low-frequency encoded data of a low-frequency signal, where the high-frequency signal and the low-frequency signal belong to a target audio signal. The decoding module 620 may be configured to perform decoding processing on the low-frequency encoded data to generate a decoded low-frequency signal. The matching module 630 may be configured to perform prediction network matching processing based on the spectral feature parameters to obtain an audio prediction network matched with the spectral feature parameters. The prediction module 640 may be configured to perform audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal. The output module 650 may be configured to generate an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal.

In some embodiments of the present application, the spectral feature parameters include a spectral envelope type, the matching module 630 includes an information obtaining unit configured to obtain network information of at least one preset audio prediction network, where each network information corresponds to a preset spectral envelope type, a network matching unit configured to determine network information corresponding to a preset spectral envelope type matched by the spectral envelope type, to obtain target network information, and a network determining unit configured to determine a preset audio prediction network corresponding to the target network information as the audio prediction network matched by the spectral feature parameters.

In some embodiments of the present application, the prediction module 640 includes an extraction processing unit configured to perform spectral feature extraction processing on the decoded low-frequency signal to obtain low-frequency spectrum information, an information prediction unit configured to perform audio prediction processing based on the low-frequency spectrum information by using the audio prediction network to obtain predicted spectrum information, and a signal generation unit configured to generate a predicted high-frequency signal corresponding to the high-frequency signal based on the predicted spectrum information.

In some embodiments of the present application, the output module 650 is configured to perform quadrature mirror synthesis filtering on the predicted high frequency signal and the decoded low frequency signal to generate the audio output signal.

In this way, based on the audio processing apparatus 600, for the target audio signal, the spectrum distribution characteristics of the high-frequency signal therein can be described by spectral feature parameters of very small data size, so only the spectral feature parameters and the low-frequency encoded data of the low-frequency signal need to be received, which effectively reduces the transmission bandwidth; meanwhile, the matched audio prediction network is selected based on the spectral feature parameters to restore the high-frequency signal, so as to generate the predicted high-frequency signal.

As shown in fig. 8, the audio processing apparatus 700 may include a decomposition module 710, an extraction module 720, an encoding module 730, and a delivery module 740, where the audio processing apparatus 700 may be applied to a device corresponding to an audio capturing end.

The decomposition module 710 may be configured to decompose a target audio signal to generate a high-frequency signal and a low-frequency signal. The extraction module 720 may be configured to perform feature extraction processing on the high-frequency signal to obtain a spectral feature parameter corresponding to the high-frequency signal. The encoding module 730 may be configured to perform audio encoding processing on the low-frequency signal to generate low-frequency encoded data corresponding to the low-frequency signal. The delivery module 740 may be configured to send the spectral feature parameter and the low-frequency encoded data to a receiving end, so that the receiving end determines an audio prediction network that matches the spectral feature parameter, and generates an audio output signal based on the audio prediction network and a decoded low-frequency signal obtained by decoding the low-frequency encoded data.

In some embodiments of the present application, the extraction module 720 includes a frequency domain conversion unit configured to perform a frequency domain conversion process on the high-frequency signal to obtain a frequency domain signal, a power spectrum value calculation unit configured to calculate a power spectrum value of each frequency point in the frequency domain signal, and a spectrum feature parameter acquisition unit configured to perform a feature extraction process based on the power spectrum value of each frequency point to obtain a spectrum feature parameter describing a spectrum distribution feature of the high-frequency signal.

In some embodiments of the present application, the decomposing module 710 is configured to perform quadrature mirror image decomposition filtering processing on the target audio signal to generate the high frequency signal and the low frequency signal.

In this way, based on the audio processing apparatus 700, for the target audio signal, the spectrum distribution characteristics of the high-frequency signal therein can be described by spectral feature parameters of very small data size, so only the spectral feature parameters and the low-frequency encoded data of the low-frequency signal need to be transmitted, which effectively reduces the transmission bandwidth; meanwhile, the receiving end can select the matched audio prediction network based on the spectral feature parameters to restore the high-frequency signal.

It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.

In addition, the embodiment of the present application further provides an electronic device, which may be a terminal or a server, as shown in fig. 9, which shows a schematic structural diagram of the electronic device according to the embodiment of the present application, specifically:

The electronic device may include a processor 801 with one or more processing cores, a memory 802 comprising one or more computer-readable storage media, a power supply 803, an input unit 804, and other components. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 9 does not limit the electronic device, which may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components. Wherein:

The processor 801 is a control center of the electronic device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 802, and calling data stored in the memory 802, thereby controlling the electronic device as a whole. Optionally, the processor 801 may include one or more processing cores, and preferably the processor 801 may integrate an application processor that primarily processes operating systems, user pages, applications, etc., with a modem processor that primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 801.

The memory 802 may be used to store software programs and modules, and the processor 801 executes various functional applications and data processing by executing the software programs and modules stored in the memory 802. The memory 802 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), etc., and a storage data area that may store data created according to the use of the computer device, etc. In addition, memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 802 may also include a memory controller to provide the processor 801 with access to the memory 802.

The electronic device further comprises a power supply 803 for powering the various components, preferably the power supply 803 can be logically coupled to the processor 801 via a power management system such that functions such as managing charging, discharging, and power consumption are performed by the power management system. The power supply 803 may also include one or more of any components, such as a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The electronic device may further comprise an input unit 804, which input unit 804 may be used for receiving input digital or character information and for generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.

Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 801 in the electronic device loads executable files corresponding to the processes of one or more computer programs into the memory 802 according to the following instructions, and the processor 801 executes the computer programs stored in the memory 802, so as to implement the functions in the foregoing embodiments of the present application.

For example, the processor 801 may perform the steps of: receiving spectral feature parameters of a high-frequency signal and low-frequency encoded data of a low-frequency signal, where the high-frequency signal and the low-frequency signal belong to a target audio signal; performing decoding processing on the low-frequency encoded data to generate a decoded low-frequency signal; performing prediction network matching processing based on the spectral feature parameters to obtain an audio prediction network matched with the spectral feature parameters; performing audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal; and generating an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal.

The processor 801 may also perform the steps of: performing decomposition processing on a target audio signal to generate a high-frequency signal and a low-frequency signal; performing feature extraction processing on the high-frequency signal to obtain a spectral feature parameter corresponding to the high-frequency signal; performing audio encoding processing on the low-frequency signal to generate low-frequency encoded data corresponding to the low-frequency signal; and transmitting the spectral feature parameter and the low-frequency encoded data to a receiving end, so that the receiving end determines an audio prediction network matched with the spectral feature parameter, and generates an audio output signal based on the audio prediction network and a decoded low-frequency signal obtained by decoding the low-frequency encoded data.

It will be appreciated by those of ordinary skill in the art that all or part of the steps of the various methods of the above embodiments may be performed by a computer program, or by computer program control related hardware, which may be stored in a computer readable storage medium and loaded and executed by a processor.

To this end, embodiments of the present application also provide a computer readable storage medium having stored therein a computer program that can be loaded by a processor to perform the steps of any of the methods provided by the embodiments of the present application.

The computer readable storage medium may include, among others, read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic or optical disks, and the like.

Since the computer program stored in the computer readable storage medium may execute the steps of any one of the methods provided in the embodiments of the present application, the beneficial effects that can be achieved by the methods provided in the embodiments of the present application may be achieved, which are detailed in the previous embodiments and are not described herein.

According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computer device to perform the methods provided in the various alternative implementations of the application described above.

Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.

It will be understood that the application is not limited to the embodiments which have been described above and shown in the drawings, but that various modifications and changes can be made without departing from the scope thereof.

Claims

1. An audio processing method, comprising:

Receiving spectrum characteristic parameters of a high-frequency signal and low-frequency coding data of a low-frequency signal, wherein the high-frequency signal and the low-frequency signal belong to a target audio signal, and the spectrum characteristic parameters are information describing spectrum distribution characteristics of the high-frequency signal and include a spectrum envelope type;

Decoding the low-frequency encoded data to generate a decoded low-frequency signal;

Performing prediction network matching processing based on the spectral feature parameters to obtain an audio prediction network matched with the spectral feature parameters, wherein the audio prediction network matched with the spectral feature parameters is a deep learning network trained based on training samples corresponding to the spectral feature parameters;

Performing audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal;

generating an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal;

The step of performing prediction network matching processing based on the spectrum feature parameters to obtain an audio prediction network matching the spectrum feature parameters includes:

Acquiring network information of at least one preset audio prediction network, each piece of the network information corresponding to a preset spectrum envelope type;

Determining network information corresponding to a preset spectrum envelope type that matches the spectrum envelope type, to obtain target network information;

The preset audio prediction network corresponding to the target network information is determined as the audio prediction network that matches the frequency spectrum feature parameters.

2. The method according to claim 1, characterized in that the audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal comprises:

Performing spectrum feature extraction processing on the decoded low-frequency signal to obtain low-frequency spectrum information;

Using the audio prediction network, performing audio prediction processing based on the low-frequency spectrum information to obtain predicted spectrum information;

A predicted high-frequency signal corresponding to the high-frequency signal is generated based on the predicted spectrum information.

3. The method according to claim 2, characterized in that the step of performing spectrum feature extraction processing on the decoded low-frequency signal to obtain low-frequency spectrum information comprises:

Performing modified discrete cosine transform processing on the decoded low-frequency signal to obtain the low-frequency spectrum information;

The generating a predicted high-frequency signal corresponding to the high-frequency signal based on the predicted spectrum information includes:

The predicted frequency spectrum information is processed by an inverse modified discrete cosine transform to generate a predicted high-frequency signal corresponding to the high-frequency signal.

4. The method according to any one of claims 1 to 3, characterized in that generating an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal comprises:

The predicted high-frequency signal and the decoded low-frequency signal are subjected to quadrature mirror synthesis filtering to generate the audio output signal.

5. An audio processing method, comprising:

Decomposing the target audio signal to generate a high-frequency signal and a low-frequency signal;

Performing feature extraction processing on the high-frequency signal to obtain a spectrum feature parameter corresponding to the high-frequency signal, wherein the spectrum feature parameter is information describing a spectrum distribution feature of the high-frequency signal and includes a spectrum envelope type;

Performing audio coding processing on the low-frequency signal to generate low-frequency coding data corresponding to the low-frequency signal;

The spectral feature parameters and the low-frequency coded data are sent to a receiving end, so that the receiving end determines an audio prediction network that matches the spectral feature parameters, and generates an audio output signal based on a decoded low-frequency signal obtained by decoding the audio prediction network and the low-frequency coded data, wherein the audio prediction network that matches the spectral feature parameters is a deep learning network trained based on training samples corresponding to the spectral feature parameters;

Wherein, the determining of the audio prediction network matching the spectrum feature parameters comprises:

6. The method according to claim 5, characterized in that the step of performing feature extraction processing on the high-frequency signal to obtain a frequency spectrum feature parameter corresponding to the high-frequency signal comprises:

Performing frequency domain conversion processing on the high frequency signal to obtain a frequency domain signal;

Calculate the power spectrum value of each frequency point in the frequency domain signal;

A feature extraction process is performed based on the power spectrum value of each frequency point to obtain a spectrum feature parameter describing the spectrum distribution feature of the high-frequency signal.

7. The method according to claim 6, characterized in that the feature extraction process based on the power spectrum value of each frequency point to obtain spectrum feature parameters describing the spectrum distribution characteristics of the high-frequency signal comprises:

Calculating the average value of the power spectrum values of each of the frequency points, and determining the maximum power spectrum value among the power spectrum values of each of the frequency points;

Performing a difference processing on the maximum power spectrum value and the average value to obtain a first difference value;

A frequency spectrum characteristic parameter corresponding to the high-frequency signal is determined according to the first difference.

8. The method according to claim 7, wherein determining the frequency spectrum characteristic parameter corresponding to the high-frequency signal according to the first difference comprises:

If the first difference is less than a first predetermined threshold value and the maximum power spectrum value is less than a second predetermined threshold value, determining that the spectrum envelope type corresponding to the high-frequency signal is a first type;

If the first difference is smaller than a first predetermined threshold and the maximum power spectrum value is larger than a second predetermined threshold, it is determined that the spectrum envelope type corresponding to the high-frequency signal is a second type.

9. The method according to claim 7, wherein determining the frequency spectrum characteristic parameter corresponding to the high-frequency signal according to the first difference comprises:

If the first difference is greater than a first predetermined threshold, normalizing the power spectrum value of each frequency point to obtain a normalized value corresponding to each frequency point;

Acquire at least one preset target value, each of the target values corresponding to a preset spectrum envelope type;

Calculate the mean square error between the normalized value corresponding to each frequency point and each target value;

The preset spectrum envelope type corresponding to the target value corresponding to the minimum mean square error value is determined as the spectrum envelope type of the high-frequency signal.

10. The method according to claim 9, characterized in that the normalizing the power spectrum value of each frequency point to obtain the normalized value corresponding to each frequency point comprises:

Performing difference processing on the power spectrum value of each frequency point and the average value respectively to obtain a second difference value corresponding to each frequency point;

Calculating the square value of the second difference corresponding to each of the frequency points, and calculating the average value of the square values to obtain a normalized score;

The second difference values corresponding to the frequency points are divided by the normalized scores to obtain normalized values corresponding to the frequency points.

11. The method according to any one of claims 5 to 10, characterized in that the step of decomposing the target audio signal to generate a high-frequency signal and a low-frequency signal comprises:

The target audio signal is subjected to quadrature mirror decomposition filtering processing to generate the high-frequency signal and the low-frequency signal.

12. An audio processing device, comprising:

A receiving module, used for receiving spectrum characteristic parameters of a high-frequency signal and low-frequency coding data of a low-frequency signal, wherein the high-frequency signal and the low-frequency signal belong to a target audio signal, and the spectrum characteristic parameters are information describing the spectrum distribution characteristics of the high-frequency signal and include a spectrum envelope type;

A decoding module, used for decoding the low-frequency encoded data to generate a decoded low-frequency signal;

A matching module, configured to perform a prediction network matching process based on the spectral feature parameters to obtain an audio prediction network matched with the spectral feature parameters, wherein the audio prediction network matched with the spectral feature parameters is a deep learning network trained based on training samples corresponding to the spectral feature parameters;

A prediction module, configured to perform audio prediction processing based on the audio prediction network and the decoded low-frequency signal to generate a predicted high-frequency signal corresponding to the high-frequency signal;

An output module, configured to generate an audio output signal corresponding to the target audio signal according to the predicted high-frequency signal and the decoded low-frequency signal;

13. An audio processing device, comprising:

A decomposition module is used to decompose the target audio signal to generate a high-frequency signal and a low-frequency signal;

An extraction module, used for performing feature extraction processing on the high-frequency signal to obtain a spectrum feature parameter corresponding to the high-frequency signal, wherein the spectrum feature parameter is information describing the spectrum distribution characteristics of the high-frequency signal and includes a spectrum envelope type;

An encoding module, used for performing audio encoding processing on the low-frequency signal to generate low-frequency encoding data corresponding to the low-frequency signal;

A delivery module, configured to send the spectral feature parameters and the low-frequency coded data to a receiving end, so that the receiving end determines an audio prediction network that matches the spectral feature parameters, and generates an audio output signal based on a decoded low-frequency signal obtained by decoding the audio prediction network and the low-frequency coded data, wherein the audio prediction network that matches the spectral feature parameters is a deep learning network trained based on training samples corresponding to the spectral feature parameters;

14. A computer-readable storage medium, characterized in that a computer program is stored thereon, and when the computer program is executed by a processor of a computer, the computer is caused to execute the method according to any one of claims 1 to 4 and 5 to 11.

15. An electronic device, comprising: a memory storing a computer program; and a processor reading the computer program stored in the memory to execute the method according to any one of claims 1 to 4 and 5 to 11.

16. A computer program product, characterized in that the computer program product comprises a computer program, and when the computer program is executed by a processor, the method of any one of claims 1 to 4 and 5 to 11 is implemented.