CN114627882A - Audio processing method, electronic device and computer readable storage medium

Info

Publication number: CN114627882A
Application number: CN202210379551.8A
Authority: CN
Inventors: 张斌; 林慧镔
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date: 2022-04-12
Filing date: 2022-04-12
Publication date: 2022-06-14
Anticipated expiration: 2042-04-12
Also published as: CN114627882B

Abstract

The embodiments of the present application disclose an audio processing method, an electronic device, and a computer-readable storage medium. The method includes: in response to an audio enhancement enabling instruction for target audio, acquiring a first audio to be played from the target audio, the target audio being audio in the playing state; inputting the first audio into a pre-trained audio enhancement model to obtain a second audio output by the audio enhancement model; processing the magnitude of the second audio according to the high-band magnitude of the first audio and correcting the phase of the second audio according to the low-band phase of the first audio, so as to process the second audio into a third audio, the third audio being super-quality audio; and replacing playback of the target audio with playback of the third audio. With the embodiments of the present application, the spectrum of the audio being played can be extended in real time, improving the audio playback effect.

Description

Audio processing method, electronic device, and computer-readable storage medium

Technical field

The present application relates to the technical field of audio processing, and in particular to an audio processing method, an electronic device, and a computer-readable storage medium.

Background

With the development of terminal devices, a wide variety of applications keep emerging. When a terminal device plays songs, some old songs, or songs recorded with poor recording equipment, do not contain sufficiently rich audio signal components. Such songs cannot give users the best possible listening experience, which degrades the user experience.

Summary of the invention

Embodiments of the present application provide an audio processing method, an electronic device, and a computer-readable storage medium that can extend the spectrum of the audio being played in real time, so as to improve the audio playback effect.

In a first aspect, an embodiment of the present application discloses an audio processing method. The method is applied to a terminal device that includes an audio enhancement model, and the method includes:

in response to an audio enhancement enabling instruction for target audio, acquiring a first audio to be played from the target audio, the target audio being audio in the playing state;

inputting the first audio into the pre-trained audio enhancement model to obtain a second audio output by the audio enhancement model;

processing the magnitude of the second audio according to the high-band magnitude of the first audio, and correcting the phase of the second audio according to the low-band phase of the first audio, so as to process the second audio into a third audio, the third audio being super-quality audio; and

replacing playback of the target audio with playback of the third audio.

In a second aspect, an embodiment of the present application discloses an audio processing apparatus, the apparatus including:

an acquisition module, configured to acquire, in response to an audio enhancement enabling instruction for target audio, a first audio to be played from the target audio, the target audio being audio in the playing state;

a processing module, configured to input the first audio into a pre-trained audio enhancement model to obtain a second audio output by the audio enhancement model;

the processing module being further configured to process the magnitude of the second audio according to the high-band magnitude of the first audio and to correct the phase of the second audio according to the low-band phase of the first audio, so as to process the second audio into a third audio, the third audio being super-quality audio; and

a playback module, configured to replace playback of the target audio with playback of the third audio.

In a third aspect, an embodiment of the present invention provides an electronic device. The electronic device includes a memory and a processor; the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the audio processing method provided in the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer storage medium. The computer storage medium stores computer program instructions which, when executed by a processor, are used to perform the audio processing method provided in the first aspect.

In a fifth aspect, an embodiment of the present invention provides a computer program product or a computer program. The computer program product includes a computer program stored in a computer storage medium; when a processor of a computer device reads the computer program from the computer storage medium, the processor performs the audio processing method provided in the first aspect.

In the embodiments of the present invention, the terminal device receives and responds to an audio enhancement enabling instruction input for target audio in the playing state. According to that instruction, it acquires a first audio to be played from the target audio and inputs the first audio into the pre-trained audio enhancement model to obtain a second audio output by the model. It then processes the magnitude of the second audio according to the high-band magnitude of the first audio, and corrects the phase of the second audio according to the low-band phase of the first audio, so as to process the second audio into a super-quality third audio, and replaces playback of the target audio with playback of the third audio. The terminal device can therefore apply audio enhancement to the target audio in real time while it is being played, obtain the super-quality third audio, and improve the user's listening experience.
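The magnitude and phase post-processing is specified in more detail later in the description. As a minimal sketch of one possible reading of the phase-correction step only, the following Python (NumPy) function keeps the magnitude of the second audio and, below a cutoff frequency, substitutes the phase of the first audio. The function name `correct_low_band_phase`, the `cutoff_hz` parameter, and the use of a single whole-frame FFT are assumptions made for illustration; the magnitude processing is not shown.

```python
import numpy as np


def correct_low_band_phase(first, second, sample_rate, cutoff_hz):
    """Hypothetical helper: reuse the first audio's low-band phase in the second audio.

    `first` and `second` are equal-length 1-D frames (original and model output).
    Bins below `cutoff_hz` keep the second audio's magnitude but take the first
    audio's phase; bins at or above `cutoff_hz` are left unchanged.
    """
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    if first.shape != second.shape:
        raise ValueError("frames must have the same length")

    spec_first = np.fft.rfft(first)
    spec_second = np.fft.rfft(second)
    freqs = np.fft.rfftfreq(first.shape[0], d=1.0 / sample_rate)

    low = freqs < cutoff_hz  # boolean mask selecting the assumed low band
    corrected = spec_second.copy()
    corrected[low] = np.abs(spec_second[low]) * np.exp(1j * np.angle(spec_first[low]))

    # Back to the time domain with the original frame length.
    return np.fft.irfft(corrected, n=first.shape[0])
```

In a real-time player this would presumably run per short-time frame rather than on a whole signal; the single-frame form is used here only to keep the sketch short.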

Description of drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1A is a schematic diagram of time-domain spectrum extension provided by an embodiment of the present application;

FIG. 1B is a schematic diagram of frequency-domain spectrum extension provided by an embodiment of the present application;

FIG. 1C is a schematic diagram of a system architecture provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a playback interface of target audio provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of another playback interface of target audio provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of yet another playback interface of target audio provided by an embodiment of the present application;

FIG. 6A is a schematic diagram of an audio setting interface corresponding to target audio provided by an embodiment of the present application;

FIG. 6B is a schematic diagram of a first audio provided by an embodiment of the present application;

FIG. 6C is a schematic diagram of another first audio provided by an embodiment of the present application;

FIG. 6D is a schematic diagram of yet another first audio provided by an embodiment of the present application;

FIG. 6E is a schematic diagram of still another first audio provided by an embodiment of the present application;

FIG. 7A is a schematic flowchart of post-processing of the high-frequency magnitude provided by an embodiment of the present application;

FIG. 7B is a schematic flowchart of an improved Griffin-Lim algorithm provided by an embodiment of the present application;

FIG. 8A is a schematic diagram showing the spectrogram of the first audio and the spectrogram of the third audio provided by an embodiment of the present application;

FIG. 8B is a schematic diagram showing the before-and-after comparison spectrum of audio segment 1 provided by an embodiment of the present application;

FIG. 8C is another schematic diagram showing the before-and-after comparison spectrum of audio segment 1 provided by an embodiment of the present application;

FIG. 9 is a schematic flowchart of another method for generating an audio enhancement model provided by an embodiment of the present application;

FIG. 10 is a schematic diagram of the internal structure of an encoder-decoder architecture provided by an embodiment of the present application;

FIG. 11 is a schematic flowchart of training with a generative adversarial network provided by an embodiment of the present application;

FIG. 12 is a schematic diagram of low-band and high-band frequency bins provided by an embodiment of the present application;

FIG. 13 is a schematic flowchart of prediction with a generative adversarial network model provided by an embodiment of the present application;

FIG. 14 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present application;

FIG. 15 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed description

For ease of understanding, the terms used in this application are introduced first.

1. Super quality (SQ) sound quality

SQ sound quality can also be called lossless sound quality. If audio whose source is of compact disc (CD) quality or better is denoted audio A, then audio A can be of SQ quality. Alternatively, if the audio obtained by passing audio A through a lossless encoder is denoted audio B, then audio B can be of SQ quality. Alternatively, if audio A or audio B can still be stored in a lossless format after being encoded by a relatively high-quality lossy encoder, the encoded audio can be denoted audio C, and audio C can be of SQ quality.

2. High quality (HQ) sound quality

HQ sound quality refers to audio in the standard mp3 format (bit rate 320 kbps) that has not been up-converted from a low bit rate. The main technical criteria for judging HQ sound quality are: (1) the highest spectral cutoff reaches 20 kHz or above; (2) if there is attenuation around 20 kHz, the proportion of the spectrum above 20 kHz must exceed 25%; (3) if there is attenuation around 18 kHz, the proportion of the spectrum above 18 kHz must exceed 45%. The spectrum of the standard mp3 format (320 kbps) can reach 20 kHz.

3. Low quality (LQ) sound quality

If audio sampled at a relatively low sampling rate (for example, below 28 kHz) is denoted audio D, then audio D can be of LQ quality. Alternatively, if audio that has been encoded by a relatively low bit rate encoder (for example, the LAME 64bits version 3.99.5 encoder at a bit rate of 80 kbps or below) and whose spectral height does not reach 14 kHz is denoted audio E, then audio E can be of LQ quality.
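The text does not say how spectral height is measured. One plausible approach, sketched below in Python with NumPy, is to take the highest frequency whose level stays within a chosen margin of the spectral peak. The function names, the Hann window, and the `floor_db` margin of -60 dB are assumptions made for illustration; the check also ignores the encoder condition stated for audio E and uses only the 28 kHz sampling-rate and 14 kHz spectral-height thresholds.

```python
import numpy as np


def spectral_height_hz(samples, sample_rate, floor_db=-60.0):
    """Highest frequency whose magnitude is within `floor_db` of the spectral peak."""
    samples = np.asarray(samples, dtype=float)
    window = np.hanning(len(samples))
    magnitude = np.abs(np.fft.rfft(samples * window))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)

    peak = magnitude.max()
    if peak == 0.0:
        return 0.0  # silent input has no spectral content

    level_db = 20.0 * np.log10(np.maximum(magnitude, 1e-12) / peak)
    active = np.nonzero(level_db >= floor_db)[0]
    return float(freqs[active[-1]])


def looks_low_quality(samples, sample_rate):
    """Simplified LQ check: low sampling rate, or spectral height under 14 kHz."""
    return sample_rate < 28_000 or spectral_height_hz(samples, sample_rate) < 14_000.0
```

A production check would more likely average the spectrum over many frames, since a single frame of a quiet passage can understate the bandwidth of the whole track.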

4. Non-super-quality sound quality

Non-super-quality sound quality can also be called non-lossless sound quality, and can be HQ or LQ sound quality. The main technical criteria for judging non-lossless sound quality can be summarized as: (1) the file was compressed with a lossy encoder; or (2) the file was compressed with a lossless encoder, or is uncompressed, but can be identified as having been lossy-compressed and then re-saved in a lossless format, that is, it shows clear traces of lossy compression. These traces can mainly be described in terms of spectral height and density: the spectrum extends above 20 kHz (exclusive) and a certain proportion of it, for example more than 10%, carries effective energy.

5. Music bandwidth extension

Music bandwidth extension technology can also be called music super-resolution technology. In the time domain, as shown in FIG. 1A, deep neural network (DNN) techniques can be used for time-domain interpolation to introduce high-frequency detail. In the frequency domain, as shown in FIG. 1B, DNN techniques can be used for restoration, reconstructing the high-frequency components that were lost.

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.

To enhance the rhythm and somatosensory feedback of music and improve the user experience, an embodiment of the present application provides an audio processing method. For a better understanding of this method, the system architecture to which it applies is introduced first.

FIG. 1C is a schematic diagram of a system architecture provided by an embodiment of the present application; the audio processing method proposed in the present application can be performed through this system architecture. The system architecture includes a terminal device 101 and a server 102, which communicate over a wired or wireless connection. The terminal device 101 may be provided with an audio enhancement model and use it to process the audio being played, so as to obtain super-quality audio.

The terminal device 101 is a device with a wireless communication function, such as a smartphone, a tablet computer, a smart wearable device, or a personal computer, on which an application client such as audio playback software can run. In some embodiments of the present application, the terminal device may also be an apparatus with a transceiving function, such as a chip system. The chip system may include a chip and may also include other discrete components, which is not limited in the embodiments of the present application.

The terminal device 101 may acquire data such as video, audio, and text from the server 102. The server 102 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The database in the server 102 may be a local database of the server 102 or a cloud database that the server 102 can access, which is not limited in this application. The system architecture above is described with one terminal device 101 and one server 102 as an example; the number of terminal devices and the number of servers do not limit this application.

The audio processing method provided by the embodiments of the present application is described in further detail below.

FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of the present application. The method can be performed by a terminal device (such as the terminal device 101 shown in FIG. 1C), by a server, or jointly by a server and a terminal. In the joint case, for example, the server obtains the target audio played by the terminal, enhances the sound quality of the first audio in the target audio to obtain a second audio, determines a super-quality third audio from the first audio and the second audio, and returns the third audio to the terminal for playback. For ease of understanding, the embodiment of FIG. 2 is described with the method performed by a terminal device. The audio processing method may include at least the following steps S201 to S204:

S201. In response to an audio enhancement enabling instruction for target audio, acquire a first audio to be played from the target audio.

The target audio is audio in the playing state. It may be non-super-quality audio, that is, non-lossless audio, such as audio of HQ or LQ sound quality. The target audio may be a song, a recording, and so on, which is not limited in this application. The target audio being in the playing state can be understood as the terminal device currently playing the target audio, for example through a music application or a playback module (such as a radio or a recorder) of the terminal device. This application uses a music application on the terminal device playing the target audio as the example.

The audio enhancement enabling instruction can be used to start the audio processing flow for the target audio, that is, to perform steps S201 to S204 for the target audio. The user may input the audio enhancement enabling instruction for the target audio in the music application, and the terminal device receives it accordingly.

The terminal device may receive the audio enhancement enabling instruction input for the target audio on either of the following two interfaces. Specifically:

Interface 1: the playback interface of the target audio. That is, the terminal device may receive the audio enhancement enabling instruction input for the target audio on the playback interface of the target audio.

The playback interface of the target audio may be the interface shown when the music application on the terminal device plays the target audio; FIG. 3 shows such a playback interface. In the playback interface of FIG. 3, different sound qualities and the file size corresponding to each can be listed below the target audio, for example: standard quality 2.0M, HQ quality 7.1M, SQ quality 18.5M. An audio enhancement enable button, used to start the audio enhancement processing flow, may be provided below the sound quality options. Optionally, the user can choose among the three sound qualities so that the terminal device plays the target audio at the selected quality. Optionally, the user can also enable automatic selection so that the terminal device picks the currently suitable sound quality based on factors such as current network conditions. As shown in FIG. 3, before the audio enhancement button is turned on, the target audio may be played at HQ sound quality. Placing the sound quality options and the audio enhancement enable button below the target audio is only an example and does not limit this application.

In the playback interface of the target audio shown in FIG. 3, the user can click the audio enhancement enable button for the target audio to start the audio enhancement processing flow for it, that is, to perform steps S201 to S204.

Interface 2: the audio setting interface corresponding to the target audio. That is, the terminal device may receive the audio enhancement enabling instruction input for the target audio on the audio setting interface corresponding to the target audio.

When the playback interface of the target audio has no audio enhancement enable button, the terminal device can jump to the audio setting interface corresponding to the target audio in response to a click operation on the playback interface. As shown in FIG. 4, the playback interface of the target audio has no audio enhancement enable button, so the terminal device can jump to the corresponding audio setting interface after receiving a click operation on the audio enhancement button. Alternatively, as shown in FIG. 5, the playback interface of the target audio has no audio enhancement enable button, so the terminal device can jump to the corresponding audio setting interface after receiving an audio setting operation input for the target audio (for example, right-clicking and selecting the audio setting button from the menu). This is not limited in this application.

FIG. 6A shows the audio setting interface corresponding to the target audio. In this interface, the user can click the audio enhancement enable button to start the audio enhancement flow for the target audio.

While playing the target audio, the terminal device may receive the audio enhancement enabling instruction input by the user for that audio, and then extend the spectrum of the target audio, improving the audio quality and thus the user experience.

To distinguish it from the audio after sound quality enhancement, the audio before sound quality enhancement is called the first audio. The first audio is audio that the terminal device acquires from the target audio. As shown in FIG. 6B, the first audio may be a relatively short piece of audio acquired from the target audio (the first audio is acquired according to a preset duration). Optionally, as shown in FIG. 6C, the first audio may be a relatively long piece of audio acquired from the target audio (the first audio runs from the time point at which the audio enhancement enabling instruction is received to the end time point of the target audio). This is not limited in this application.

When the audio enhancement enabling instruction is received, the terminal device may take the time point at which it was received as the start time point and acquire a first audio of the preset duration from there. As shown in FIG. 6B or FIG. 6C, if 30 seconds of the target audio have already been played when the terminal device receives the instruction, the terminal device may take the 30th second as the start time point and begin acquiring the first audio from the 30th second.

In an optional implementation, the start time point for acquiring the first audio is the time point at which the audio enhancement enabling instruction is received, and the cutoff time point is the end time point of the target audio. In this case, if the preset duration used to acquire the first audio is shorter than the remaining playing time of the target audio, multiple first audios can be acquired from the target audio. In addition, because different audios can have different durations, the number of first audios that the terminal device acquires from different audios can also differ. For example, suppose the total duration of audio 1 is 3 minutes; if the duration of the first audio is 5 seconds, the terminal device can acquire at most 36 first audios from audio 1 (180 s / 5 s). Suppose the total duration of audio 2 is 1 minute; if the duration of the first audio is 5 seconds, the terminal device can acquire at most 12 first audios from audio 2 (60 s / 5 s).

Optionally, the duration of the first audio acquired by the terminal device can also differ between audios. For example, suppose the total duration of audio 3 is 3 minutes; if the duration of the first audio acquired for audio 3 is 5 seconds, at most 36 first audios can be acquired from audio 3. Suppose the total duration of audio 4 is 3 minutes; if the duration of the first audio acquired for audio 4 is 10 seconds, at most 18 first audios can be acquired from audio 4 (180 s / 10 s). This application does not limit the number of first audios acquired by the terminal device.
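The counts in these examples follow from dividing the remaining playing time by the preset duration. A minimal Python sketch of that arithmetic is given below; the function name `first_audio_windows` and the choice to drop a trailing remainder shorter than one full window are assumptions for illustration, not something the text specifies.

```python
def first_audio_windows(total_s, start_s, segment_s):
    """Return (start, end) offsets in seconds of consecutive first-audio windows.

    `total_s` is the target audio's duration, `start_s` the time point at which
    the enhancement instruction was received, and `segment_s` the preset duration.
    A trailing remainder shorter than `segment_s` is dropped (an assumption).
    """
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    if not 0 <= start_s <= total_s:
        raise ValueError("start_s must lie within the target audio")

    windows = []
    position = start_s
    while position + segment_s <= total_s:
        windows.append((position, position + segment_s))
        position += segment_s
    return windows


# Worked examples from the text: 3 minutes at 5 s, 1 minute at 5 s, 3 minutes at 10 s.
print(len(first_audio_windows(180, 0, 5)))   # 36
print(len(first_audio_windows(60, 0, 5)))    # 12
print(len(first_audio_windows(180, 0, 10)))  # 18
```

Starting at the 30th second of a 3-minute track with 5-second windows, the same function yields 30 windows.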

In one implementation, the first audio can include one or more audio segments. As shown in FIG. 6D, the first audio can include one audio segment; optionally, as shown in FIG. 6E, the first audio can also include multiple audio segments, which is not limited in the present application.

It should be noted that, when the first audio includes one audio segment, the number of first audios acquired by the terminal device can be the same as the number of audio segments. As shown in FIG. 6D, the total duration of the target audio is 3 minutes and the first audio (referred to, for example, as first audio 1) includes one audio segment. If the duration of that audio segment is 5 seconds, then when 30 seconds of the target audio have already been played, the terminal device can acquire 30 audio segments (that is, 30 first audios 1) from the target audio.

Optionally, when the first audio includes multiple audio segments, the number of first audios acquired by the terminal device can differ from the number of audio segments. As shown in FIG. 6E, the total duration of the target audio is 3 minutes, and the first audio (referred to, for example, as first audio 2) includes all audio data after the 30th second of the target audio, so first audio 2 includes multiple audio segments. If the duration of each audio segment is 5 seconds, then when 30 seconds of the target audio have already been played, the terminal device can acquire one first audio 2 from the target audio, and that first audio 2 includes 30 audio segments.

It should be noted that one audio segment can contain the minimum amount of audio data that the terminal device needs for audio enhancement processing, so that the terminal device can start audio enhancement processing as soon as it has acquired the segment. For example, if the minimum amount of audio data on which the terminal device can perform audio enhancement processing is 5 milliseconds of audio, one audio segment can include at least 5 milliseconds of audio. The 5-millisecond minimum is only an example and does not limit the present application.

The audio segment can also be referred to as audio data of unit duration; that is, the first audio can include one or more pieces of audio data of unit duration. Optionally, one audio segment can also be referred to as one batch of audio data. The embodiments of the present application use the audio segment as the example throughout, which does not limit the present application. Optionally, the audio segment can also be measured in frames, which is not limited in the present application.

After receiving the audio enhancement enable instruction, the terminal device can, according to that instruction, acquire from the target audio the one or more audio segments included in the first audio. The longer target audio being played is thereby divided into one or more shorter audio segments, so that audio enhancement processing can start as soon as one segment has been acquired and the processed segment becomes available sooner.
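The segmentation described here can be sketched as a generator that yields one segment at a time, so that processing can begin before the whole remainder of the track has been read. This is only an illustration: `iter_segments` is a hypothetical name, positions are expressed in samples, and a final segment shorter than `segment_len` is yielded as is (an assumption; the text does not say how a trailing remainder is handled).

```python
from typing import Iterator, List, Sequence


def iter_segments(samples: Sequence[float], start: int,
                  segment_len: int) -> Iterator[List[float]]:
    """Yield consecutive segments of `segment_len` samples from `start` on."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    for pos in range(start, len(samples), segment_len):
        # Each segment can be handed to the enhancement model immediately.
        yield list(samples[pos:pos + segment_len])


# Toy track with one "sample" per second: 180 s total, enhancement enabled
# at second 30, 5 s segments, as in the FIG. 6D / FIG. 6E examples.
track = [float(i) for i in range(180)]
segments = list(iter_segments(track, start=30, segment_len=5))
print(len(segments))
```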

S202. Input the first audio into a pre-trained audio enhancement model to obtain the second audio output by the audio enhancement model.

The second audio is the audio output by the audio enhancement model. It is the first audio after audio enhancement processing, and it can contain more signal components than the first audio. It should be noted that the second audio corresponds to the first audio: if the first audio includes one audio segment, the second audio also includes one audio segment; if the first audio includes multiple audio segments, the second audio also includes multiple audio segments.

In one implementation, the audio enhancement model is a model obtained by training on audio samples with a generative adversarial network (Generative Adversarial Networks, GAN), and the audio samples are ultra-high-quality audio. How the audio enhancement model is generated is described in detail in the embodiment shown in FIG. 9 and is not repeated here.

S203. Process the magnitude of the second audio according to the high-band magnitude of the first audio, and correct the phase of the second audio according to the low-band phase of the first audio, so as to process the second audio into a third audio; the third audio is ultra-high-quality audio.

The third audio is the audio obtained after high-frequency magnitude post-processing and phase correction based on the first audio and the second audio. In other words, the third audio is obtained by optimizing the magnitude and phase of the second audio output by the audio enhancement model.

In one implementation, high-frequency magnitude post-processing is performed on the magnitude of the second audio based on the high-band magnitude of the first audio and the high-band magnitude of the second audio, to obtain the full-band magnitude of the second audio; phase correction is performed on the phase of the second audio according to the low-band phase of the first audio, to obtain the full-band phase of the second audio; and the second audio is processed based on the full-band magnitude and the full-band phase to determine the third audio.

For the high-frequency magnitude post-processing flow, see the schematic diagram in FIG. 7A. Specifically, the terminal device can compare the high-band magnitude of the first audio with the high-band magnitude of the second audio to determine which has the greater energy at the high-frequency bins, and use the higher-energy one as the post-processed high-band magnitude. For example, if the energy of the high-band magnitude of the first audio is higher than that of the second audio, the high-band magnitude of the first audio is used as the post-processed high-band magnitude. With this post-processing, the original high-frequency bins are preserved as far as possible whenever they carry more energy, so the characteristics of the original audio are not damaged. Optionally, the high-frequency magnitude post-processing flow is similar to the magnitude post-processing flow used for images and is not repeated here.
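The comparison described above can be illustrated for a single spectral frame. The sketch below is one possible reading, not the flow of FIG. 7A itself: it assumes a per-bin comparison of squared magnitude (modulus), a hypothetical cutoff index `cutoff_bin` separating low band from high band, and that the low band is taken unchanged from the second audio.

```python
from typing import List


def postprocess_high_band(first_mag: List[float], second_mag: List[float],
                          cutoff_bin: int) -> List[float]:
    """Build a full-band magnitude, keeping the higher-energy high-band bins."""
    if len(first_mag) != len(second_mag):
        raise ValueError("magnitude frames must have the same number of bins")
    # Assumption: below the cutoff, use the model output (second audio) as is.
    full_band = list(second_mag[:cutoff_bin])
    for orig, enhanced in zip(first_mag[cutoff_bin:], second_mag[cutoff_bin:]):
        # Energy of a bin is its squared magnitude; keep whichever is larger.
        full_band.append(orig if orig * orig > enhanced * enhanced else enhanced)
    return full_band


first = [1.0, 0.8, 0.5, 0.0]   # original frame: bin 2 still has energy
second = [0.9, 0.7, 0.3, 0.2]  # enhanced frame: bin 3 newly generated
print(postprocess_high_band(first, second, cutoff_bin=2))
```

In this toy frame the original bin 2 is kept because it carries more energy than the generated one, while bin 3, which the original lacks, comes from the model output.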

In one implementation, the low-band phase of the first audio is mirrored to obtain a mirrored phase; a speech signal reconstruction algorithm is run on the mirrored phase to obtain a computed phase; and phase correction is performed on the phase of the second audio according to the computed phase, to obtain the full-band phase of the second audio.

The speech signal reconstruction algorithm can be an improved Griffin-Lim algorithm, whose specific flow can be as shown in FIG. 7B. Specifically, in this embodiment of the present application, the low-band phase of the first audio is mirrored and the mirrored phase is used as the initial phase of the improved Griffin-Lim algorithm, which makes the computed phase more accurate.
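The text does not spell out how the low-band phase is mirrored. One plausible reading, sketched below for a single frame, reflects the low-band phase about the band edge to fill the missing high-band bins; the result would then serve as the initial phase for the Griffin-Lim iterations, which are not shown. The function name `mirror_low_band_phase` and the reflection rule are assumptions.

```python
from typing import List


def mirror_low_band_phase(low_phase: List[float], n_bins: int) -> List[float]:
    """Extend a low-band phase frame to `n_bins` by reflecting it at the edge."""
    k = len(low_phase)
    if not k <= n_bins <= 2 * k:
        raise ValueError("expected len(low_phase) <= n_bins <= 2 * len(low_phase)")
    # High-band bin k + i receives the phase of low-band bin k - 1 - i.
    mirrored = [low_phase[k - 1 - i] for i in range(n_bins - k)]
    return list(low_phase) + mirrored


print(mirror_low_band_phase([0.1, 0.2, 0.3], n_bins=5))
```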

It should be noted that the number of iterations of the phase correction for the second audio can be set manually. More iterations give a more accurate result, but to reduce the amount of computation on the terminal device the number can generally be set to 1 or 2, which is not limited in the present application. Optionally, for the other steps of the improved Griffin-Lim algorithm, refer to the corresponding steps of the Griffin-Lim algorithm, which are not repeated here.

The terminal device can determine the third audio from the second audio, the full-band magnitude obtained by high-frequency magnitude post-processing, and the full-band phase obtained by phase correction with the improved Griffin-Lim algorithm. Specifically, the terminal device can combine the second audio, the full-band magnitude, and the full-band phase based on Euler's formula and recover the time-domain signal through the inverse short-time Fourier transform (Inverse Short-time Fourier Transform, ISTFT), which is not limited in the present application. Optimizing the magnitude and phase of the second audio makes the details in the resulting third audio more accurate, which improves the user's listening experience.
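Combining a magnitude and a phase with Euler's formula means forming the complex spectrum value m·e^{jφ} = m(cos φ + j sin φ) for each bin. A minimal sketch for one frame follows; `combine_mag_phase` is a hypothetical helper, and the subsequent ISTFT step is not shown.

```python
import cmath
import math
from typing import List


def combine_mag_phase(mag: List[float], phase: List[float]) -> List[complex]:
    """Return one complex spectrum frame from per-bin magnitude and phase."""
    if len(mag) != len(phase):
        raise ValueError("magnitude and phase must have the same length")
    # cmath.rect(r, phi) computes r * (cos(phi) + 1j * sin(phi)).
    return [cmath.rect(m, p) for m, p in zip(mag, phase)]


frame = combine_mag_phase([2.0, 1.0], [0.0, math.pi / 2])
print(frame)
```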

It should be noted that, after obtaining the third audio, the terminal device can cache it so that it can later be fetched and played. Optionally, the terminal device can cache the third audio in its own memory or in other storage space, which is not limited in the present application.

In one implementation, the terminal device can display the spectra of the first audio and the third audio on the audio settings interface.

It should be noted that, to simplify the description, the following text uses "the third audio is ultra-high-quality audio obtained after audio enhancement processing" as the example, where "audio enhancement processing" can mean "audio enhancement processing together with magnitude and phase optimization". This does not limit the present application.

As described above, the first audio is the audio before audio enhancement processing, and the third audio is the ultra-high-quality audio after audio enhancement processing. As shown in FIG. 8A, after the enhancement button is turned on in the audio settings interface, the terminal device can also display in that interface the spectrogram of the audio before enhancement (the first audio) and the spectrogram of the audio after enhancement (the third audio), to show the user a before-and-after comparison of enabling audio enhancement. The positions of the two spectrograms in FIG. 8A are only an example and do not limit the present application.

Optionally, the before-and-after comparison of audio enhancement can be illustrative. For example, the spectrum can extend to about 12 kHz before audio enhancement is enabled and to about 22 kHz after it is enabled. It can be understood that the illustrations shown for different audios can be the same or different; they show the user the approximate effect and do not represent the actual spectra of the first audio and the third audio.

Optionally, the before-and-after comparison of audio enhancement can be generated from the actual audio enhancement processing. For example, the spectrogram before enhancement can be the spectrum of the first audio being played, and the spectrogram after enhancement can be the spectrum of the third audio obtained by audio enhancement processing. It can be understood that the before-and-after comparisons for different audios can differ, so as to show the user the spectra of the first audio and the third audio during actual processing.

In one implementation, for the target audio segment that the audio enhancement model is currently processing, the terminal device can display a first spectrum and a second spectrum of that segment. The first spectrum is the spectrum of the target audio segment before it is input into the audio enhancement model, the second spectrum is the spectrum of the target audio segment after it has been input into the audio enhancement model, and the target audio segment is the segment currently being processed among the multiple audio segments.

The first spectrum is the spectrum of the target audio segment before audio enhancement processing (that is, before it is input into the audio enhancement model), and the second spectrum is the spectrum of the target audio segment after audio enhancement processing (that is, after it has been input into the audio enhancement model), which is not limited in the present application.

It should be noted that, when the first audio includes multiple audio segments, the before-and-after comparison can show the spectrum of the segment currently being processed. For example, as shown in FIG. 8B, assuming that the first audio includes audio segment 1, audio segment 2, and audio segment 3, and the terminal device is performing audio enhancement processing on audio segment 1 (the target audio segment), the comparison can show the before-and-after spectra for audio segment 1. Optionally, once the terminal device starts processing audio segment 2 (which then becomes the target audio segment), the comparison can show the before-and-after spectra for audio segment 2, which is not limited in the present application.

Optionally, assuming that the first audio includes audio segment 1, audio segment 2, and audio segment 3, the comparison can first show the before-and-after spectra of audio segment 1, then those of audio segment 2, and finally those of audio segment 3, so that the before-and-after spectra of all audio segments in the first audio are displayed. FIG. 8C shows, as an example, the before-and-after spectra of audio segment 1 displayed first.

S204. Replace playback of the target audio with playback of the third audio.

After acquiring the third audio, the terminal device can play it for the user. Optionally, once the first audio has been acquired, the terminal device can pause playback of the first audio and, as soon as the third audio is obtained, cache and play the third audio to improve the user's listening experience. The interval between pausing the first audio and playing the third audio is short; that is, the time spent processing the first audio in the audio enhancement model is short, and the model can quickly process the one or more audio segments in the first audio, which reduces the user's waiting time.

In one implementation, the terminal device deletes the cached third audio when playback of the third audio is complete.

As described above, the third audio can include one or more audio segments. When playing the third audio, the terminal device can play the audio segments in sequence and delete each segment from the cache once it has been played. For example, assuming that the third audio includes multiple audio segments, such as audio segment 4, audio segment 5, and audio segment 6, and audio segment 4 has finished playing, the terminal device can delete the cached audio segment 4. Optionally, the terminal device can instead delete the cached third audio after the whole third audio has been played, that is, after audio segment 4, audio segment 5, and audio segment 6 have all been played, which is not limited in the present application. Deleting played audio segments empties the cache and frees storage space for the audio segments that the terminal device enhances subsequently.

In one implementation, the terminal device deletes the cached third audio when it receives a playback progress bar drag instruction or an audio switching instruction input for the target audio.

The playback progress bar drag instruction is used to change the playback position of the audio, for example dragging the audio to 1 minute 30 seconds; the audio switching instruction is used to switch the audio being played, for example switching from audio 1, which is being played, to audio 3.

Because the terminal device performs audio enhancement processing in real time on the audio currently being played, the audio that needs to be enhanced changes when the terminal device receives a playback progress bar drag instruction or an audio switching instruction for the target audio. In other words, the audio that was previously processed and cached will no longer be played, so the terminal device can delete the cached audio to empty the cache and then use the cache to store the enhanced audio produced after the drag or switching instruction has been executed.
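The cache behaviour described in the preceding paragraphs (delete a segment once it has been played, and empty the cache on a seek or track switch) can be sketched as a small FIFO. The class `EnhancedAudioCache` and its method names are hypothetical and stand in for whatever storage the terminal device actually uses.

```python
from collections import deque
from typing import Deque, List, Optional

Segment = List[float]


class EnhancedAudioCache:
    """Hypothetical FIFO cache of enhanced (third-audio) segments."""

    def __init__(self) -> None:
        self._segments: Deque[Segment] = deque()

    def put(self, segment: Segment) -> None:
        """Cache one enhanced segment behind those already waiting."""
        self._segments.append(segment)

    def current(self) -> Optional[Segment]:
        """Return the segment to play next, or None if the cache is empty."""
        return self._segments[0] if self._segments else None

    def on_segment_finished(self) -> None:
        """Delete the segment that has just finished playing."""
        if self._segments:
            self._segments.popleft()

    def on_seek_or_switch(self) -> None:
        """Cached segments will never be played now, so drop all of them."""
        self._segments.clear()

    def __len__(self) -> int:
        return len(self._segments)


cache = EnhancedAudioCache()
for seg in ([0.4], [0.5], [0.6]):   # stand-ins for audio segments 4, 5, 6
    cache.put(seg)
cache.on_segment_finished()         # segment 4 played, then deleted
remaining_after_play = len(cache)
cache.on_seek_or_switch()           # progress bar dragged or track switched
remaining_after_seek = len(cache)
print(remaining_after_play, remaining_after_seek)
```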

In one implementation, the terminal device stops inputting audio from the target audio into the audio enhancement model when it receives an audio enhancement disable instruction input for the target audio.

The audio enhancement disable instruction is used to turn off the audio enhancement processing flow. When the terminal device receives this instruction for the target audio, it can stop audio enhancement processing for the target audio, that is, stop inputting audio from the target audio (such as the first audio) into the audio enhancement model, and thereby end the audio enhancement process.

It should be noted that, when the effect of audio enhancement processing on the target audio is unsatisfactory or the processing fails, the terminal device can receive the audio enhancement disable instruction, which reduces the energy consumption of the terminal device. It can be understood that, once the terminal device has received the audio enhancement disable instruction, the target audio is played at its original sound quality.

It should also be noted that the target audio can itself be ultra-high-quality audio, that is, lossless audio, for example SQ-quality audio. By performing audio enhancement processing on ultra-high-quality target audio, the terminal device can fill in the high-frequency content that the target audio lacks, so that the enhanced audio sounds fuller and richer in detail.

In this embodiment of the present application, the terminal device receives an audio enhancement enable instruction input for audio in the playing state (the target audio) and, according to that instruction, acquires the first audio from the target audio. It inputs the first audio into the audio enhancement model to obtain the audio output by the model (the second audio), determines from the first audio and the second audio the third audio with optimized magnitude and phase, and caches and plays the third audio. The terminal device can thus perform audio enhancement processing on audio while it is being played in real time and obtain ultra-high-quality audio, which improves the user's listening experience.

Refer to FIG. 9, which is a schematic flowchart of a method for generating an audio enhancement model provided by an embodiment of the present application. The method can be executed by a terminal device (either the terminal device that executes the embodiment shown in FIG. 2 or another terminal device) or by a server. Assuming that the embodiment shown in FIG. 2 is executed by terminal device 1 and the method for generating the audio enhancement model is executed by terminal device 2 or a server, the audio enhancement model generated by terminal device 2 or the server can be deployed on terminal device 1. For ease of understanding and distinction, the embodiment shown in FIG. 9 is described with terminal device 2 executing the method. The method for generating the audio enhancement model can include at least the following steps S901 to S902:

S901. Generate the initial architecture of the audio enhancement model.

The initial architecture of the audio enhancement model can include the one or more algorithms that the audio enhancement model needs to use. Because the audio enhancement model learns a mapping from the low band to the high band, an encoder-decoder network model can be used as its initial architecture.

Optionally, FIG. 10 is a schematic diagram of the internal structure of the encoder-decoder architecture. As shown in FIG. 10, the encoder-decoder architecture can include lightweight algorithms such as depthwise separable convolution (Depthwise Separable Convolution, DWconv2D), the split super-resolution block (SplitSRBlock), and sub-pixel convolution (SubPixel2D). The computation performed by each of these lightweight algorithms is briefly described below.

1. DWconv2D

DWconv2D is the combination of depthwise (Depthwise, DW) convolution and pointwise (Pointwise, PW) convolution. Depthwise convolution differs from regular convolution: in depthwise convolution each convolution kernel is responsible for one channel, and each channel is convolved by only one kernel, whereas in regular convolution every kernel operates on all input channels at once. Taking image processing as an example, for a 5 × 5 pixel, three-channel color input image (that is, an input of size 5 × 5 × 3), depthwise convolution operates within the two-dimensional plane, and the number of kernels equals the number of channels in the previous layer (channels and kernels correspond one to one). A three-channel image therefore yields 3 feature maps after the operation.

Pointwise convolution is very similar to regular convolution, except that its kernel has size 1 × 1 × M, where M is the depth of the previous layer. Pointwise convolution therefore forms weighted combinations, along the depth direction, of the feature maps from the previous step to generate new feature maps. In pointwise convolution, the number of output feature maps equals the number of filters.
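The two stages can be made concrete with a small pure-Python sketch that uses the 5 × 5 × 3 input from the example above. It is an illustration only: the function names are hypothetical, and valid padding, stride 1, and the absence of bias terms are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def depthwise_conv(channels: List[Matrix], kernels: List[Matrix]) -> List[Matrix]:
    """Convolve each channel with its own k x k kernel (valid padding, stride 1)."""
    if len(channels) != len(kernels):
        raise ValueError("depthwise convolution needs one kernel per channel")
    out = []
    for img, ker in zip(channels, kernels):
        k = len(ker)
        h, w = len(img) - k + 1, len(img[0]) - k + 1
        out.append([[sum(img[y + i][x + j] * ker[i][j]
                         for i in range(k) for j in range(k))
                     for x in range(w)] for y in range(h)])
    return out


def pointwise_conv(channels: List[Matrix], filters: List[List[float]]) -> List[Matrix]:
    """Mix channels with 1 x 1 x M filters; one output map per filter."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[[sum(wt * ch[y][x] for wt, ch in zip(filt, channels))
              for x in range(w)] for y in range(h)] for filt in filters]


image = [[[1.0] * 5 for _ in range(5)] for _ in range(3)]     # 3 channels, 5 x 5
kernels = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]   # one 3 x 3 per channel
dw = depthwise_conv(image, kernels)                 # 3 feature maps, each 3 x 3
pw = pointwise_conv(dw, [[1.0, 1.0, 1.0]] * 4)      # 4 filters -> 4 feature maps
print(len(dw), len(pw))

# Weight counts for 3 input channels, 4 output channels, 3 x 3 kernels:
regular_weights = 3 * 3 * 3 * 4            # 108
separable_weights = 3 * 3 * 3 + 3 * 4      # 27 depthwise + 12 pointwise = 39
```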

2. SplitSRBlock

SplitSR, a new end-to-end super-resolution system for mobile phones, has been proposed as a further optimization on top of DWconv2D. Its split convolution divides the input features along the depth channel at a certain (adjustable) ratio to reduce computation and memory consumption and thereby speed up inference. Specifically, the input features are first split along the depth channel: one part goes through the DWconv2D computation and the other part takes part in no computation (that is, its features are retained); the two parts are then concatenated along the depth channel. This reduces the amount of computation and also carries part of the features unchanged into the next layer, so that the decoding layers also receive more low-level features.
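The split, process, and concatenate pattern described here can be sketched independently of any deep learning framework. In the sketch below, `split_conv` is a hypothetical name, `conv` stands in for the DWconv2D computation, and rounding the channel count down with `int` is an assumption.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def split_conv(channels: List[T], ratio: float,
               conv: Callable[[List[T]], List[T]]) -> List[T]:
    """Apply `conv` to the first `ratio` share of channels; pass the rest through."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    n_active = int(len(channels) * ratio)
    processed = conv(channels[:n_active])   # the part that is convolved
    retained = channels[n_active:]          # the part kept without computation
    return processed + retained             # concatenate along the channel axis


# Toy example: 8 scalar "channels", a quarter of them processed.
result = split_conv([1, 2, 3, 4, 5, 6, 7, 8], 0.25, lambda xs: [x * 10 for x in xs])
print(result)
```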

3. SubPixel2D

SubPixel2D is an algorithm that combines upsampling and convolution operations. It is applied to low-resolution features to produce high-resolution features, and it reduces the risk of introducing too many artificial factors that arises when deconvolution is used for upsampling. SubPixel2D rearranges the channels of each pixel into an r*r region that corresponds to an r*r sub-block of the high-resolution image; that is, a feature map of size (r*r)*H*W can be rearranged into a high-resolution image of size 1*rH*rW. For a four-dimensional tensor, the feature size is rearranged from [B, H, W, r*r*C] to [B, rH, rW, C]. Although the algorithm is called sub-pixel convolution, the rearrangement itself does not actually involve a convolution operation.
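The rearrangement from [H, W, r*r*C] to [rH, rW, C] can be written out directly for a single example (the batch dimension B is omitted). The channel ordering used below, where depth index (i*r + j)*C + c maps to offset (i, j) inside the r*r block, is an assumption; the text does not specify the ordering.

```python
from typing import List

Tensor3 = List[List[List[float]]]


def subpixel_2d(x: Tensor3, r: int) -> Tensor3:
    """Rearrange a [H][W][r*r*C] feature map into [r*H][r*W][C]."""
    h, w, depth = len(x), len(x[0]), len(x[0][0])
    if depth % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c = depth // (r * r)
    out = [[[0.0] * c for _ in range(w * r)] for _ in range(h * r)]
    for row in range(h):
        for col in range(w):
            for i in range(r):
                for j in range(r):
                    for ch in range(c):
                        # Each low-resolution pixel fills one r x r block.
                        out[row * r + i][col * r + j][ch] = \
                            x[row][col][(i * r + j) * c + ch]
    return out


low_res = [[[1.0, 2.0, 3.0, 4.0]]]      # H = 1, W = 1, r*r*C = 4
high_res = subpixel_2d(low_res, r=2)    # H = 2, W = 2, C = 1
print(high_res)
```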

In this embodiment of the present application, using a streamlined encoder-decoder network model as the initial architecture of the audio enhancement model reduces the number of network layers and the number of input frames. The encoder-decoder network model uses special encoder structural units to reduce computation and model size, and a special decoder module for upsampling to reduce memory usage, so that terminal device 1 can run the audio enhancement model in real time while consuming very little central processing unit (Central Processing Unit, CPU) resource and memory.

S902: Train on audio samples with a generative adversarial network (GAN) to obtain the audio enhancement model. The audio samples may be ultra-high-quality audio.

It should be noted that the GAN training may involve a generator and a discriminator; through GAN training, the predicted audio produced by the generator becomes indistinguishable from real audio. The generator may use the lightweight encoder-decoder architecture mentioned in the above steps, and the discriminator may use a binary classification network structure (such as a VGG-like network).

FIG. 11 is a schematic flowchart of GAN training. Specifically, terminal device 2 may obtain a fourth audio from the audio sample, the fourth audio being the low-band audio of the audio sample, and input the fourth audio into the encoder-decoder architecture to obtain a fifth audio; the fifth audio may be the audio obtained through GAN training.

It should be noted that, on acquiring an audio sample, terminal device 2 may extract the short-time Fourier transform (STFT) features of the audio sample, take the magnitude of the STFT features, and then take the logarithm to obtain the log-magnitude of the audio sample, where fft_length = 2048 and hop_length = 256. The size of the log-magnitude may be expressed as [T, 1024], where T may be the length of a feature sequence (i.e., the first audio above) and depends on the duration of the audio sample. Optionally, if the log-magnitude is split into blocks with a fixed length of 32 frames, its size may be expressed as [X, 32, 1024], where X = T/32.
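The shape bookkeeping above can be sketched as follows. This is only an illustration: the frame count assumes no padding (only complete frames are counted), any remainder of fewer than 32 frames is dropped, and the small `eps` guarding the logarithm is an added assumption. The names `feature_shape` and `log_magnitude` are hypothetical.

```python
import math

FFT_LENGTH = 2048        # fft_length from the text
HOP_LENGTH = 256         # hop_length from the text
FRAMES_PER_BLOCK = 32    # fixed block length in frames
NUM_BINS = 1024          # bins per frame, as stated in the text


def log_magnitude(real, imag, eps=1e-8):
    """Log of the magnitude of one complex STFT coefficient (eps is an assumption)."""
    return math.log(math.hypot(real, imag) + eps)


def feature_shape(num_samples):
    """Return [X, 32, 1024] for a signal of num_samples samples.

    Assumes an unpadded STFT, so only complete frames are counted.
    """
    if num_samples < FFT_LENGTH:
        return (0, FRAMES_PER_BLOCK, NUM_BINS)
    t = 1 + (num_samples - FFT_LENGTH) // HOP_LENGTH  # number of frames T
    x = t // FRAMES_PER_BLOCK                         # number of 32-frame blocks X
    return (x, FRAMES_PER_BLOCK, NUM_BINS)


# Ten seconds of 44.1 kHz audio: T = 1715 frames, X = 53 blocks.
print(feature_shape(10 * 44100))  # (53, 32, 1024)
```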

Because the encoder-decoder architecture uses the lightweight DWconv2D as its convolution unit, a four-dimensional input and output can be constructed. For example, if the second dimension is expanded to 1 by default, the size of the log-magnitude of the audio sample may be expressed as [X, 1, 32, 464]; that is, the model input during training may be [X, 1, 32, 464], whose values are [batch size, fixed value 1, frame length, number of low-band bins]. The 464 low-band bins may correspond to a spectrum height of 10 kHz and may be calculated as 464 = 2048 * 10 kHz (spectrum height) / 44.1 kHz (sampling rate), rounded down. Correspondingly, the model output during training may be [X, 1, 32, 580], whose values are [batch size, fixed value 1, frame length, number of high-band bins]. The 580 high-band bins may be calculated as 580 = (1024 - 464) + 20, where 1024 may correspond to the full-band spectrum height of 22.05 kHz at a 44.1 kHz sampling rate, and 20 may be the number of bins in the overlap region; this number is not fixed and can be adjusted, and it serves to prevent abrupt changes during computation. FIG. 12 is a schematic diagram of the low-band and high-band bins.
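The bin arithmetic above can be checked with a few lines of Python. The constant and function names are illustrative, and rounding down to an integer bin count is an assumption consistent with the stated value of 464.

```python
FFT_LENGTH = 2048
SAMPLE_RATE_HZ = 44100
CUTOFF_HZ = 10000        # spectrum height of the low band
FULL_BAND_BINS = 1024    # bins covering 0 to 22.05 kHz
OVERLAP_BINS = 20        # adjustable overlap between the two bands


def low_band_bins(cutoff_hz=CUTOFF_HZ):
    """Number of bins below the cutoff, rounded down (assumed rounding)."""
    return FFT_LENGTH * cutoff_hz // SAMPLE_RATE_HZ


def high_band_bins(low_bins, overlap=OVERLAP_BINS):
    """Bins above the cutoff plus the overlap region."""
    return (FULL_BAND_BINS - low_bins) + overlap


low = low_band_bins()
high = high_band_bins(low)
print(low, high)  # 464 580
```

Changing `OVERLAP_BINS` changes only the output width, which matches the statement that the overlap is adjustable.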

Optionally, the loss functions of the Generator and the Discriminator in the GAN may be as follows:

[Formula images not reproduced: loss function of the discriminator and loss function of the generator]

Here, the first formula may represent the loss function of the discriminator. E may represent the cross-entropy loss function; x may represent an audio sample; D may represent the discriminator, and D(x) the discriminator's judgment of the audio sample; z may represent the input of the audio enhancement model, namely the low-band audio of the audio sample; G may represent the generator, and G(z) the full-band audio that the generator predicts from the low-band audio input to the audio enhancement model; D(G(z)) may represent the discriminator's judgment of the generated full-band audio. The second formula may represent the loss function of the generator. D (the discriminator) may be trained first and then G (the generator); the two compete with each other until convergence.

It should be noted that, in addition to the above loss functions, two further loss functions may be introduced to strengthen the prediction of high-band features from low-band features, so that the total L_G loss function may be:

[Formula image not reproduced: total generator loss L_G, including the L_LSD and L_l1pixcel terms weighted by λ₁ and λ₂]

Here, L_G may represent the total loss function; L_LSD may represent the Least Significant Difference (LSD) loss function, also called the L2 loss; L_l1pixcel may represent the L1 loss, also called the L1-norm loss. λ₁ and λ₂ may represent weight hyperparameters, which can generally be set to 10 and 0.1 respectively. L may represent time (the horizontal axis of the spectrogram) and K may represent the frequency bins (the vertical axis of the spectrogram). X^HR may represent the high band and X^HR(l, k) its value at a specific time-frequency position; X^SR may represent the generated high band and X^SR(l, k) its value at a specific time-frequency position. The L_LSD loss function obtains the error by squaring the difference between the target value (the high band) and the model output (the generated high band) and then taking the square root; the L_l1pixcel loss function obtains the error by taking the absolute value of the difference between the target value (the high band) and the model output (the generated high band).
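The two auxiliary losses described in words above can be sketched in pure Python. Because the formulas themselves are not available in this text, the normalisation used here is an assumption: the squared differences are averaged over the K bins of each frame before the square root, the result is averaged over the L frames, and the L1 term is averaged over all time-frequency points. The function names are hypothetical.

```python
import math


def lsd_loss(target, generated):
    """Squared difference, then square root, per frame (assumed averaging).

    target and generated are [L][K] lists: L frames, K high-band bins.
    """
    total = 0.0
    for target_frame, generated_frame in zip(target, generated):
        squared = sum((t - g) ** 2 for t, g in zip(target_frame, generated_frame))
        total += math.sqrt(squared / len(target_frame))
    return total / len(target)


def l1_pixel_loss(target, generated):
    """Mean absolute difference over all time-frequency points (assumed averaging)."""
    total = 0.0
    count = 0
    for target_frame, generated_frame in zip(target, generated):
        for t, g in zip(target_frame, generated_frame):
            total += abs(t - g)
            count += 1
    return total / count


high_band = [[1.0, 2.0], [3.0, 4.0]]   # target X^HR
predicted = [[1.0, 0.0], [3.0, 2.0]]   # generated X^SR
print(lsd_loss(high_band, predicted), l1_pixel_loss(high_band, predicted))
```

The square-root form penalises frames with a few large errors differently from the absolute-value form, which is one plausible reason for using both terms with separate weights.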

In the embodiments of the present application, the model is trained with a GAN rather than directly with the usual training approach for generative models (such as an autoencoder). This trains the model more thoroughly, so that it learns more high-band detail from the low band, generates a high band with more high-frequency detail, and restores the real high-frequency features more realistically.

Optionally, when terminal device 2 performs prediction with the trained audio enhancement model, it may post-process the magnitude and phase of the enhanced audio. Specifically, as shown in FIG. 13, audio with a 44.1 kHz sampling rate (such as the above audio sample) is input and its STFT features are computed to obtain the corresponding magnitude and phase. The logarithm of the magnitude is taken to obtain the log-magnitude, which is truncated at a spectrum height of 10 kHz (as for the fourth audio above) and fed into the trained generator and the high-frequency magnitude post-processing module to obtain a full-band magnitude. The generated full-band magnitude and the full-band phase obtained by mirroring the low-band phase are combined through Euler's formula and passed through the ISTFT to obtain a time-domain waveform at a 44.1 kHz sampling rate. This waveform is then fed into the improved Griffin-Lim algorithm to correct the phase, yielding the time-domain waveform after magnitude and phase post-processing (i.e., the predicted audio in FIG. 13).
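The step that combines a full-band magnitude with a mirrored phase through Euler's formula can be sketched for a single frame as follows. The mirroring rule used here (reflecting the low-band phase back and forth about the cutoff until all bins are filled) is a hypothetical stand-in, since this passage does not define the mirroring; the function names are also illustrative. The ISTFT and Griffin-Lim steps are not shown.

```python
import cmath
import math


def mirror_low_band_phase(low_phase, full_bins):
    """Extend a low-band phase to full_bins by repeated reflection (assumed rule)."""
    n = len(low_phase)
    full_phase = []
    for k in range(full_bins):
        pos = k % (2 * n)
        # First half of each period reads forwards, second half backwards.
        index = pos if pos < n else 2 * n - 1 - pos
        full_phase.append(low_phase[index])
    return full_phase


def to_complex_spectrum(log_magnitude, phase):
    """Euler's formula: exp(log|X|) * (cos(phi) + i*sin(phi)) for each bin."""
    return [cmath.rect(math.exp(m), p) for m, p in zip(log_magnitude, phase)]


phase = mirror_low_band_phase([0.1, 0.2, 0.3], full_bins=5)
spectrum = to_complex_spectrum([0.0] * 5, phase)
print(phase)           # [0.1, 0.2, 0.3, 0.3, 0.2]
print(abs(spectrum[0]))
```

The resulting complex frame is what an inverse STFT would consume; the mirrored phase is only an initial estimate, which is why a later phase-correction stage is described.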

It should be noted that, when prediction is performed with the trained audio enhancement model, the input audio sample may be ultra-high-quality audio (i.e., the audio with a 44.1 kHz sampling rate in FIG. 13). After audio enhancement processing (including the magnitude and phase optimization), the resulting audio is still ultra-high-quality predicted audio (i.e., audio with a 44.1 kHz sampling rate). The predicted audio fills in the high frequencies that the audio sample lacks in the frequency domain, so that the signal fluctuates faster in the time domain and the predicted audio sounds fuller and richer in detail.

Optionally, for the high-frequency magnitude post-processing flow and the phase correction flow used when prediction is performed with the audio enhancement model, reference may be made to the detailed description of the high-frequency magnitude post-processing and phase correction corresponding to S203 in the embodiment of FIG. 2, which is not repeated here.

In the embodiments of the present application, deploying lightweight algorithms when generating the initial architecture of the audio enhancement model reduces the amount of computation and the model size. Training on ultra-high-quality audio samples with a GAN yields an audio enhancement model that learns more high-band detail from the low band. Optimizing the magnitude and phase of the audio predicted by the audio enhancement model makes the prediction more accurate and thereby improves the audio playback effect.

Based on the above audio processing method, an embodiment of the present invention provides an audio processing apparatus. FIG. 14 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present invention. The audio processing apparatus 1400 may run the following units:

an acquiring unit 1401, configured to, in response to an audio enhancement enabling instruction for target audio, obtain the first audio to be played from the target audio, the target audio being audio in a playing state;

a processing unit 1402, configured to input the first audio into the pre-trained audio enhancement model to obtain the second audio output by the audio enhancement model;

the processing unit 1402 being further configured to process the magnitude of the second audio according to the high-band magnitude of the first audio and to correct the phase of the second audio according to the low-band phase of the first audio, so as to process the second audio into the third audio, the third audio being ultra-high-quality audio;

a playing unit 1403, configured to replace playing of the target audio with playing of the third audio.

In one embodiment, the audio processing apparatus further includes a determining unit 1404. The processing unit 1402 is further configured to perform high-frequency magnitude post-processing on the magnitude of the second audio, based on the high-band magnitude of the first audio and the high-band magnitude of the second audio, to obtain the full-band magnitude of the second audio. The processing unit 1402 is further configured to perform phase correction on the phase of the second audio according to the low-band phase of the first audio, to obtain the full-band phase of the second audio. The determining unit 1404 is configured to process the second audio based on the full-band magnitude and the full-band phase to determine the third audio.

In one embodiment, the processing unit 1402 is further configured to mirror the low-band phase of the first audio to obtain a mirrored phase; to apply a speech signal reconstruction algorithm to the mirrored phase to obtain a computed phase; and to perform phase correction on the phase of the second audio according to the computed phase, to obtain the full-band phase of the second audio.

In one embodiment, the processing unit 1402 is further configured to delete the cached third audio when a drag instruction or a switching instruction input for the target audio is received.

In one embodiment, the processing unit 1402 is further configured to delete the cached third audio when playback of the third audio is complete.

In one embodiment, the processing unit 1402 is further configured to stop inputting audio from the target audio into the audio enhancement model when an audio enhancement disabling instruction input for the target audio is received.
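The cache handling described in the three embodiments above can be sketched as a small state holder. The class and method names are hypothetical and only model the described behaviour: the cached third audio is cleared on a drag or switch instruction and at the end of playback, and no further audio is fed to the model once enhancement is disabled.

```python
class EnhancedPlaybackState:
    """Hypothetical sketch of the cache and on/off behaviour of the processing unit."""

    def __init__(self):
        self.enhancement_enabled = False
        self.cached_third_audio = []

    def on_enhancement_enabled(self):
        self.enhancement_enabled = True

    def on_enhancement_disabled(self):
        # Stop feeding target audio into the enhancement model.
        self.enhancement_enabled = False

    def on_segment_enhanced(self, segment):
        # Enhanced segments are cached only while enhancement is on.
        if self.enhancement_enabled:
            self.cached_third_audio.append(segment)

    def on_drag_or_switch(self):
        # The cached audio no longer matches the playback position or track.
        self.cached_third_audio.clear()

    def on_playback_finished(self):
        self.cached_third_audio.clear()


state = EnhancedPlaybackState()
state.on_enhancement_enabled()
state.on_segment_enhanced("segment-0")
state.on_drag_or_switch()
print(state.cached_third_audio)  # []
```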

In one embodiment, the audio processing apparatus further includes a communication unit 1405. The communication unit 1405 is configured to receive, on the playback interface of the target audio, the audio enhancement enabling instruction input for the target audio.

In one embodiment, the communication unit 1405 is further configured to receive, on the audio settings interface corresponding to the target audio, the audio enhancement enabling instruction input for the target audio.

In one embodiment, the audio processing apparatus further includes a display unit 1406. The display unit 1406 is configured to display, on the audio settings interface, the spectra of the first audio and the third audio.

In one embodiment, the display unit 1406 is further configured to display, for the target audio segment currently being processed by the audio enhancement model, a first spectrum and a second spectrum of the target audio segment. The first spectrum is the spectrum of the target audio segment before it is input into the audio enhancement model, the second spectrum is the spectrum of the target audio segment after it is input into the audio enhancement model, and the target audio segment is the segment currently being processed among a plurality of audio segments.

In one embodiment, the audio enhancement model is a model obtained by training on audio samples with a generative adversarial network, the audio samples being ultra-high-quality audio.

According to an embodiment of the present invention, the steps of the audio processing method shown in FIG. 2 may be performed by the units of the audio processing apparatus shown in FIG. 14. For example, step S201 in FIG. 2 may be performed by the acquiring unit 1401 of the audio processing apparatus 1400 shown in FIG. 14, step S202 may be performed by the processing unit 1402, step S204 may be performed by the playing unit 1403, and so on.

According to another embodiment of the present invention, the units of the audio processing apparatus shown in FIG. 14 may be combined, individually or together, into one or several other units, or one or more of them may be further split into multiple functionally smaller units; either arrangement achieves the same operations without affecting the technical effects of the embodiments of the present invention. The above units are divided according to logical function. In practical applications, the function of one unit may be implemented by multiple units, or the functions of multiple units may be implemented by one unit. In other embodiments of the present invention, the audio processing apparatus may also include other units; in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented by multiple units in cooperation.

According to another embodiment of the present invention, the audio processing apparatus shown in FIG. 14 may be constructed, and the audio processing method of the embodiments of the present invention implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in FIG. 2 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM), and a read-only memory (ROM). The computer program may be recorded on, for example, a computer storage medium, loaded into the computing device through the computer storage medium, and run therein.

In summary, on receiving an audio enhancement enabling instruction input for audio in a playing state (the target audio), the terminal device responds by obtaining the first audio to be played from the target audio, inputs the first audio into the pre-trained audio enhancement model to obtain the audio output by the model (the second audio), determines an ultra-high-quality third audio from the first audio and the second audio, and plays the third audio. The terminal device can thus perform audio enhancement on audio while it is being played in real time and obtain ultra-high-quality audio, improving the user's listening experience.

Based on the above embodiments of the audio processing method and the audio processing apparatus, an embodiment of the present invention further provides an electronic device, which may correspond to the aforementioned terminal device. FIG. 15 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. The electronic device 1500 may include at least a processor 1501, an input interface 1502, an output interface 1503, and a computer storage medium 1504, which may be connected by a bus or in other ways.

The computer storage medium 1504 may be stored in the memory 1505 of the electronic device 1500. The computer storage medium 1504 is configured to store a computer program that includes program instructions, and the processor 1501 is configured to execute the program instructions stored in the computer storage medium 1504. The processor 1501 (or central processing unit, CPU) is the computing core and control core of the electronic device; it is adapted to implement one or more instructions, and specifically to load and execute the following:

in response to an audio enhancement enabling instruction for target audio, obtaining the first audio to be played from the target audio, the target audio being audio in a playing state; inputting the first audio into the pre-trained audio enhancement model to obtain the second audio output by the audio enhancement model; processing the magnitude of the second audio according to the high-band magnitude of the first audio and correcting the phase of the second audio according to the low-band phase of the first audio, so as to process the second audio into the third audio, the third audio being ultra-high-quality audio; and replacing playing of the target audio with playing of the third audio.

In one embodiment, the processor 1501 is further configured to perform high-frequency magnitude post-processing on the magnitude of the second audio, based on the high-band magnitude of the first audio and the high-band magnitude of the second audio, to obtain the full-band magnitude of the second audio; to perform phase correction on the phase of the second audio according to the low-band phase of the first audio, to obtain the full-band phase of the second audio; and to process the second audio based on the full-band magnitude and the full-band phase to determine the third audio.

In one embodiment, the processor 1501 is further configured to mirror the low-band phase of the first audio to obtain a mirrored phase; to apply a speech signal reconstruction algorithm to the mirrored phase to obtain a computed phase; and to perform phase correction on the phase of the second audio according to the computed phase, to obtain the full-band phase of the second audio.

In one embodiment, the processor 1501 deletes the cached third audio when a drag instruction or a switching instruction input for the target audio is received.

In one embodiment, the processor 1501 deletes the cached third audio when playback of the third audio is complete.

In one embodiment, the processor 1501 stops inputting audio from the target audio into the audio enhancement model when an audio enhancement disabling instruction input for the target audio is received.

In one embodiment, the processor 1501 receives, on the playback interface of the target audio, the audio enhancement enabling instruction input for the target audio.

In one embodiment, the processor 1501 receives, on the audio settings interface corresponding to the target audio, the audio enhancement enabling instruction input for the target audio.

In one embodiment, the processor 1501 displays, on the audio settings interface, the spectra of the first audio and the third audio.

In one embodiment, the processor 1501 displays, for the target audio segment currently being processed by the audio enhancement model, a first spectrum and a second spectrum of the target audio segment. The first spectrum is the spectrum of the target audio segment before it is input into the audio enhancement model, the second spectrum is the spectrum of the target audio segment after it is input into the audio enhancement model, and the target audio segment is the segment currently being processed among a plurality of audio segments.

In one embodiment, the audio enhancement model is a model obtained by applying phase correction to a generative adversarial network model, the generative adversarial network model being a model obtained by training on audio samples with a generative adversarial network, and the audio samples being ultra-high-quality audio.

In summary, the electronic device receives an audio enhancement enabling instruction input for target audio, the target audio being audio in a playing state; in response to the instruction, obtains the first audio to be played from the target audio; inputs the first audio into the pre-trained audio enhancement model to obtain the second audio output by the model; determines an ultra-high-quality third audio from the first audio and the second audio; and plays the third audio. It should be understood that, by inputting the first audio into the audio enhancement model, the electronic device can obtain an ultra-high-quality third audio and thereby improve the user experience.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. The technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like, and specifically may be a processor in the computer device) to execute all or some of the steps of the methods of the embodiments of the present application. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).

Those of ordinary skill in the art will appreciate that the units and steps of the examples described in connection with the embodiments disclosed in this application can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer storage medium or transmitted through a computer storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present invention shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. An audio processing method, wherein the method is applied to a terminal device, and the method comprises:

in response to an audio enhancement enabling instruction for target audio, acquiring first audio to be played from the target audio, the target audio being audio in a playing state;

inputting the first audio into a pre-trained audio enhancement model to obtain second audio output by the audio enhancement model;

processing the magnitude (modulus) of the second audio according to the high-band magnitude of the first audio, and correcting the phase of the second audio according to the low-band phase of the first audio, so as to process the second audio into third audio, the third audio being ultra-high-quality audio; and

replacing playback of the target audio with playback of the third audio.

2. The method according to claim 1, wherein determining the third audio according to the first audio and the second audio comprises:

performing high-frequency magnitude post-processing on the magnitude of the second audio, based on the high-band magnitude of the first audio and the high-band magnitude of the second audio, to obtain the full-band magnitude of the second audio;

performing phase correction on the phase of the second audio according to the low-band phase of the first audio to obtain the full-band phase of the second audio; and

processing the second audio based on the full-band magnitude and the full-band phase to determine the third audio.

3. The method according to claim 2, wherein performing phase correction on the phase of the second audio according to the low-band phase of the first audio to obtain the full-band phase of the second audio comprises:

mirroring the low-band phase of the first audio to obtain a mirrored phase;

operating on the mirrored phase using a speech signal reconstruction algorithm to obtain a computed phase; and

performing phase correction on the phase of the second audio according to the computed phase to obtain the full-band phase of the second audio.

4. The method according to claim 2, wherein the audio enhancement model is a model obtained by training on audio samples through a generative adversarial network, and the audio samples are ultra-high-quality audio.

5. The method according to any one of claims 1-4, wherein the method further comprises:

deleting the buffered third audio when a drag instruction or a switching instruction input for the target audio is received, or when the third audio has finished playing.

6. The method according to any one of claims 1-4, wherein the receiving an audio enhancement enabling instruction for target audio input comprises:

On the playback interface of the target audio or the audio setting interface corresponding to the target audio, an audio enhancement enabling instruction input for the target audio is received.

7. The method according to claim 6, wherein the method further comprises:

displaying, on the audio setting interface, the frequency spectra of the first audio and the third audio.

8. The method according to claim 7, wherein the first audio comprises a plurality of audio segments, the second audio comprises enhanced audio segments respectively corresponding to the plurality of audio segments, and the audio enhancement model processes each of the audio segments in turn to obtain the enhanced audio segments;

the displaying of the frequency spectra of the first audio and the third audio comprises:

For the target audio segment currently being processed by the audio enhancement model, display the first frequency spectrum and the second frequency spectrum of the target audio segment; wherein the first frequency spectrum is the frequency spectrum of the target audio segment, and the second frequency spectrum is the frequency spectrum of the enhanced audio segment corresponding to the target audio segment.

9. An electronic device, comprising a processor and a memory, wherein the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1-8.

10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1-8.
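
For illustration only, the magnitude post-processing and phase correction recited in claims 1 to 3 might be sketched as follows for a single spectral frame. This is a minimal sketch under stated assumptions, not the applicant's implementation. The function name `postprocess_frame` and the parameter `cutoff_bin` (the bin separating the low band from the high band) are hypothetical. The sketch assumes that the high-band magnitude is the element-wise maximum of the two inputs, that the low band keeps the phase of the first audio, and that the high band is filled by reflecting the low-band phase. The speech signal reconstruction algorithm of claim 3 is omitted.

```python
import numpy as np


def postprocess_frame(first_spec, second_spec, cutoff_bin):
    """Combine one complex spectral frame of the first (original) audio
    with the matching frame of the second (model output) audio.

    Hypothetical helper: the combination rules below are assumptions
    made for illustration, not rules stated in the source.
    """
    first_spec = np.asarray(first_spec, dtype=complex)
    second_spec = np.asarray(second_spec, dtype=complex)
    if first_spec.ndim != 1 or first_spec.shape != second_spec.shape:
        raise ValueError("both frames must be 1-D and have the same length")
    n_bins = first_spec.shape[0]
    if not 0 < cutoff_bin < n_bins:
        raise ValueError("cutoff_bin must lie strictly inside the frame")

    # Magnitude: start from the model output, then post-process the high
    # band using both inputs (assumed rule: element-wise maximum).
    magnitude = np.abs(second_spec)
    magnitude[cutoff_bin:] = np.maximum(
        np.abs(first_spec[cutoff_bin:]), magnitude[cutoff_bin:]
    )

    # Phase: the low band takes the phase of the first audio.
    phase = np.angle(second_spec)
    low_phase = np.angle(first_spec[:cutoff_bin])
    phase[:cutoff_bin] = low_phase

    # Mirrored phase: reflect the low-band phase into the high band,
    # repeating the reflection if the high band is wider than the low band.
    n_high = n_bins - cutoff_bin
    repeats = -(-n_high // cutoff_bin)  # ceiling division
    phase[cutoff_bin:] = np.tile(low_phase[::-1], repeats)[:n_high]

    # Full-band magnitude and full-band phase give the third-audio frame.
    return magnitude * np.exp(1j * phase)


# Usage: a 4-bin frame whose first two bins form the low band.
first = np.array([1 + 0j, 1j, 0, 0])
second = np.array([2, 2, 3, -4], dtype=complex)
third = postprocess_frame(first, second, cutoff_bin=2)
```

In a real pipeline the frames would come from a short-time Fourier transform of each audio segment, and the resulting frames would be inverted back to a waveform before playback. Both steps are outside this sketch.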