CN109524016B

CN109524016B - Audio processing method and device, electronic equipment and storage medium

Info

Publication number: CN109524016B
Application number: CN201811243993.XA
Authority: CN
Inventors: 彭学杰; 刘佳泽; 王宇飞
Original assignee: Guangzhou Kugou Computer Technology Co Ltd
Current assignee: Guangzhou Kugou Computer Technology Co Ltd
Priority date: 2018-10-16
Filing date: 2018-10-16
Publication date: 2022-06-28
Anticipated expiration: 2038-10-16
Also published as: CN109524016A

Abstract

The invention discloses an audio processing method and device, electronic equipment and a storage medium, and belongs to the technical field of audio processing. The method comprises: filtering audio through a plurality of filters to obtain low-frequency audio, medium-frequency audio and first high-frequency audio, wherein the frequency of the medium-frequency audio is within the human voice frequency range, the maximum frequency of the low-frequency audio is less than or equal to the minimum frequency of the medium-frequency audio, and the minimum frequency of the first high-frequency audio is greater than or equal to the maximum frequency of the medium-frequency audio; delaying the low-frequency audio and the first high-frequency audio, respectively, by a preset duration; and performing audio synthesis on the medium-frequency audio, the delayed low-frequency audio and the delayed first high-frequency audio to obtain the target audio. The invention can enhance the clarity of the human voice in the audio without damaging the original balance of the audio or changing its original signal energy.

Description

Audio processing method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of audio processing technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.

Background

At present, people often cannot listen to audio in a quiet environment; most audio is listened to while exercising or riding in vehicles. In this case, noise or the accompaniment in the audio weakens or even masks the singer's vocal signal, so that the user cannot make out what the singer is singing, and the user experience suffers. To make the voice in the audio clearer, the audio needs to be processed accordingly.

Currently, an equalizer is generally used to enhance the human voice in audio. In one implementation, a Fourier transform is performed on the audio to be processed to obtain a first spectrum signal of the audio in the frequency domain; the equalizer then increases the energy of the medium-high frequency signal in the first spectrum signal to obtain a second spectrum signal; and an inverse Fourier transform is performed on the second spectrum signal to obtain the target audio with enhanced human voice. Because the frequency of the human voice generally lies in the medium-high frequency range, using the equalizer to increase the energy of the medium-high frequency signal increases the brightness of the human voice, so that the human voice is clearer in the audio.

However, using an equalizer to enhance the human voice destroys the original balance of the audio and increases the signal energy of the audio, so that clipping distortion may occur in severe cases. Excessive enhancement of the human voice region can also make the audio signal too strong, causing hearing fatigue and even damaging the user's hearing.

Disclosure of Invention

The embodiment of the invention provides an audio processing method, which can solve the problems that the original balance degree of audio is damaged and the energy of an audio signal is changed when an equalizer is used for enhancing human voice. The technical scheme is as follows:

in a first aspect, an audio processing method is provided, the method including:

filtering audio through a plurality of filters to obtain low-frequency audio, medium-frequency audio and first high-frequency audio, wherein the frequency of the medium-frequency audio is within the human voice frequency range, the maximum frequency of the low-frequency audio is less than or equal to the minimum frequency of the medium-frequency audio, and the minimum frequency of the first high-frequency audio is greater than or equal to the maximum frequency of the medium-frequency audio;

delaying the low-frequency audio and the first high-frequency audio for a preset time length respectively;

and performing audio synthesis on the intermediate-frequency audio, the delayed low-frequency audio and the delayed first high-frequency audio to obtain a target audio.

Optionally, the filtering, by using multiple filters, the audio to obtain a low-frequency audio, a middle-frequency audio, and a first high-frequency audio includes:

respectively filtering the audio through a first low-pass filter and a first high-pass filter to obtain a low-frequency audio and a second high-frequency audio, wherein the cut-off frequency of the first low-pass filter and the initial frequency of the first high-pass filter are first frequencies, and the first frequencies are greater than or equal to the minimum human voice frequency;

and respectively filtering the second high-frequency audio through a second low-pass filter and a second high-pass filter to obtain the intermediate-frequency audio and the first high-frequency audio, wherein the cut-off frequency of the second low-pass filter and the initial frequency of the second high-pass filter are second frequencies, and the second frequencies are greater than the first frequencies and less than or equal to the maximum human voice frequency.

Optionally, the delaying the low-frequency audio by a preset time includes:

constructing a first delayer and a second delayer;

inputting a target number of preset values into the first delayer, inputting the low-frequency audio into the first delayer, and outputting the delayed low-frequency audio through the first delayer, wherein the ratio of the target number to the audio sampling frequency of the audio is the preset duration;

inputting the target number of preset values into the second delayer, inputting the first high-frequency audio into the second delayer, and outputting the delayed first high-frequency audio through the second delayer.

Optionally, before the filtering the audio by using a plurality of filters, the method further includes:

acquiring the audio time length and the audio sampling frequency of the audio;

taking a product of the audio sampling frequency and the audio duration as a length of each of the plurality of filters.

Optionally, the plurality of filters are a combination of one or more of FIR (Finite Impulse Response) filters, FFT (Fast Fourier Transform) filters, and MDCT (Modified Discrete Cosine Transform) filters.

In a second aspect, there is provided an audio processing apparatus, the apparatus comprising:

the filtering processing module is used for filtering audio through a plurality of filters to obtain a low-frequency audio, a medium-frequency audio and a first high-frequency audio, wherein the frequency of the medium-frequency audio is within a human voice frequency range, the maximum frequency of the low-frequency audio is less than or equal to the minimum frequency of the medium-frequency audio, and the minimum frequency of the first high-frequency audio is greater than or equal to the maximum frequency of the medium-frequency audio;

the delay processing module is used for delaying the low-frequency audio and the first high-frequency audio, respectively, by a preset duration;

and the synthesis processing module is used for carrying out audio synthesis processing on the intermediate-frequency audio, the low-frequency audio after delay processing and the first high-frequency audio to obtain a target audio.

Optionally, the filtering processing module includes:

the first filtering processing unit is used for respectively filtering the audio through a first low-pass filter and a first high-pass filter to obtain a low-frequency audio and a second high-frequency audio, wherein the cut-off frequency of the first low-pass filter and the initial frequency of the first high-pass filter are first frequencies, and the first frequencies are greater than or equal to the minimum human voice frequency;

and the second filtering processing unit is used for respectively filtering the second high-frequency audio through a second low-pass filter and a second high-pass filter to obtain the intermediate-frequency audio and the first high-frequency audio, the cut-off frequency of the second low-pass filter and the initial frequency of the second high-pass filter are second frequencies, and the second frequencies are greater than the first frequencies and less than or equal to the maximum human voice frequency.

Optionally, the delay processing module includes:

a construction unit for constructing a first delayer and a second delayer;

a first delay processing unit, configured to input a target number of preset values into the first delayer, input the low-frequency audio into the first delayer, and output the delayed low-frequency audio through the first delayer, where the ratio of the target number to the audio sampling frequency of the audio is the preset duration;

and the second delay processing unit is used for inputting the target number of preset values into the second delayer, inputting the first high-frequency audio into the second delayer, and outputting the first high-frequency audio after delay processing through the second delayer.

Optionally, the apparatus further comprises:

the acquisition module is used for acquiring the audio time length and the audio sampling frequency of the audio;

a setting module for taking a product of the audio sampling frequency and the audio duration as a length of each of the plurality of filters.

Optionally, the plurality of filters are a combination of one or more of FIR filters, FFT filters and MDCT filters.

In a third aspect, an electronic device is provided that includes a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the steps of any one of the methods of the first aspect described above.

In a fourth aspect, a computer-readable storage medium is provided, on which instructions are stored, which when executed by a processor implement the audio processing method of the first aspect.

In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the audio processing method of the first aspect described above.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

in the embodiment of the invention, the audio is filtered by a plurality of filters to obtain the low-frequency audio, the intermediate-frequency audio and the first high-frequency audio, so that the intermediate-frequency audio in which the human voice is positioned in the audio is separated from the low-frequency audio and the first high-frequency audio in which the non-human voice is positioned. And then, delaying the low-frequency audio and the first high-frequency audio for a preset time length respectively, and then carrying out audio synthesis processing on the medium-frequency audio, the low-frequency audio subjected to delay processing and the first high-frequency audio, so as to obtain a target audio with clear human voice. The human voice audio and the non-human voice audio in the audio are separated, and then the non-human voice audio is synthesized with the human voice audio after being delayed, so that the human voice signal can firstly reach human ears, and the rest non-human voice signals can be delayed to reach, therefore, the auditory attention of a person can be concentrated in a human voice area, and the human voice sensed by the human ears is clearer.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a flow diagram illustrating a method of audio processing according to an exemplary embodiment;

FIG. 2 is a flow chart illustrating a method of audio processing according to another exemplary embodiment;

FIG. 3 is a schematic diagram illustrating the structure of an audio processing device according to an exemplary embodiment;

FIG. 4 is a block diagram of an electronic device 400 provided in accordance with an exemplary embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Before describing the audio processing method provided by the embodiment of the present invention in detail, the application scenario and the implementation environment related to the embodiment of the present invention are briefly described.

First, a brief introduction is made to an application scenario related to the embodiment of the present invention.

The audio processing method provided by the embodiment of the invention is applied to improving the clarity of the human voice in audio, so that the human voice heard by the listener is clearer. For example, the method may be applied to scenarios such as improving the clarity of a speaker's voice in recorded audio so that a listener can clearly hear the speaker, or improving the clarity of a singer's voice in song audio so that the listener can clearly hear the singer.

The related art method for enhancing human voice using an equalizer has the following disadvantages:

1. the original balance of the audio is damaged;

2. if the positive gain of the equalizer is used, the signal energy of the audio becomes large, and clipping distortion may occur in severe cases;

3. excessive gain applied to the human voice region can make the high-frequency signal too strong, which leads to hearing fatigue and even hearing impairment.

In the embodiment of the invention, in order not to destroy the original balance of the audio, a novel audio processing method is proposed based on psychological theory. That is, the human voice region and the non-human voice region of the audio are separated, and the non-human voice signal is then delayed, so that the human voice signal reaches the human ear first and the other signals arrive later. The listener's auditory attention is therefore concentrated on the human voice region, the human voice perceived by the ear is clearer and more pleasant, and the purpose of improving vocal clarity is achieved. The embodiment of the invention uses no equalizer at all and applies no gain to the signal strength of the human voice region; that is, it keeps the original balance of the audio and does not change the original energy.

In addition, the audio processing method provided by the embodiment of the invention can be applied to an audio processing device, the audio processing device can be an electronic device such as a terminal and a server, and the terminal can comprise a mobile phone, a tablet computer or a computer. Further, the embodiment of the present invention may also implement the audio processing method through audio processing software, for example, the terminal may install the audio processing software and process the audio according to the method provided by the embodiment of the present invention by running the audio processing software.

Fig. 1 is a flow diagram illustrating an audio processing method according to an exemplary embodiment, which may include the following steps:

Step 101: Filter the audio through a plurality of filters to obtain a low-frequency audio, a medium-frequency audio and a first high-frequency audio, wherein the frequency of the medium-frequency audio is within a human voice frequency range, the maximum frequency of the low-frequency audio is less than or equal to the minimum frequency of the medium-frequency audio, and the minimum frequency of the first high-frequency audio is greater than or equal to the maximum frequency of the medium-frequency audio.

Step 102: Delay the low-frequency audio and the first high-frequency audio, respectively, by a preset duration.

Step 103: Perform audio synthesis on the intermediate-frequency audio, the delayed low-frequency audio and the delayed first high-frequency audio to obtain the target audio.

In the embodiment of the invention, the audio is filtered through a plurality of filters to obtain the low-frequency audio, the medium-frequency audio and the first high-frequency audio, so that the medium-frequency audio in which human voice is positioned in the audio is separated from the low-frequency audio and the first high-frequency audio in which non-human voice is positioned. And then, respectively delaying the low-frequency audio and the first high-frequency audio for a preset time length, and then carrying out audio synthesis processing on the medium-frequency audio, the delayed low-frequency audio and the delayed first high-frequency audio to obtain a target audio with clear human voice. The human voice audio and the non-human voice audio in the audio are separated, and then the non-human voice audio is synthesized with the human voice audio after being delayed, so that the human voice signal can firstly reach human ears, and the rest non-human voice signals can be delayed to reach, therefore, the auditory attention of a person can be concentrated in a human voice area, and the human voice sensed by the human ears is clearer.
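Steps 101 to 103 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, a first-order low-pass stands in for the filters, and each band split is made complementary by subtraction (rather than by a separate high-pass filter) so that the three bands sum back to the input when the delay is zero. The 500 Hz and 3600 Hz band edges and the 25 ms default are the example values given elsewhere in this description.

```python
import math
from collections import deque


def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    """First-order low-pass (assumption: a stand-in for the described filters)."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y = (1.0 - a) * x + a * y
        out.append(y)
    return out


def split_three_bands(samples, f1_hz, f2_hz, sample_rate):
    """Two-stage split: low vs. the rest, then the rest into mid vs. high."""
    low = one_pole_lowpass(samples, f1_hz, sample_rate)
    rest = [x - l for x, l in zip(samples, low)]   # plays the role of the second high-frequency audio
    mid = one_pole_lowpass(rest, f2_hz, sample_rate)
    high = [r - m for r, m in zip(rest, mid)]      # plays the role of the first high-frequency audio
    return low, mid, high


def delay(samples, count, preset=0.0):
    """FIFO preloaded with `count` preset values; output length equals input length."""
    fifo = deque([preset] * count)
    out = []
    for x in samples:
        fifo.append(x)
        out.append(fifo.popleft())
    return out


def enhance_vocals(samples, sample_rate, delay_s=0.025, f1_hz=500.0, f2_hz=3600.0):
    """Delay the low and high bands, leave the mid band untouched, then sum."""
    low, mid, high = split_three_bands(samples, f1_hz, f2_hz, sample_rate)
    count = int(delay_s * sample_rate)
    low_d = delay(low, count)
    high_d = delay(high, count)
    return [m + l + h for m, l, h in zip(mid, low_d, high_d)]
```

No band is amplified in this sketch; only the arrival time of the low and high bands changes, which mirrors the stated goal of leaving the balance of the audio untouched.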

Optionally, the filtering the audio through multiple filters to obtain a low-frequency audio, a middle-frequency audio, and a first high-frequency audio includes:

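respectively filtering the audio through a first low-pass filter and a first high-pass filter to obtain the low-frequency audio and a second high-frequency audio, wherein the cut-off frequency of the first low-pass filter and the initial frequency of the first high-pass filter are first frequencies, and the first frequencies are greater than or equal to the minimum human voice frequency;

and respectively filtering the second high-frequency audio through a second low-pass filter and a second high-pass filter to obtain the intermediate-frequency audio and the first high-frequency audio, wherein the cut-off frequency of the second low-pass filter and the initial frequency of the second high-pass filter are second frequencies, and the second frequencies are greater than the first frequencies and less than or equal to the maximum human voice frequency.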

Optionally, the delaying the low-frequency audio by a preset time period includes:

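constructing a first delayer and a second delayer;

inputting a target number of preset values into the first delayer, inputting the low-frequency audio into the first delayer, and outputting the delayed low-frequency audio through the first delayer, wherein the ratio of the target number to the audio sampling frequency of the audio is the preset duration;

inputting the target number of preset values into the second delayer, inputting the first high-frequency audio into the second delayer, and outputting the delayed first high-frequency audio through the second delayer.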

Optionally, before the filtering the audio by using multiple filters, the method further includes:

acquiring the audio time length and the audio sampling frequency of the audio;

the product of the audio sampling frequency and the audio duration is taken as the length of each of the plurality of filters.

All the above optional technical solutions can be combined arbitrarily to form an optional embodiment of the present invention, which is not described in detail herein.

Fig. 2 is a flowchart illustrating an audio processing method according to another exemplary embodiment, which is described as an example of the audio processing method applied to a computer device in the present embodiment, and the audio processing method may include the following implementation steps:

Step 201: Filter the audio through a first low-pass filter and a first high-pass filter, respectively, to obtain a low-frequency audio and a second high-frequency audio, wherein the cut-off frequency of the first low-pass filter and the initial frequency of the first high-pass filter are first frequencies, and the first frequencies are greater than or equal to the minimum human voice frequency.

According to the audio processing method provided by the embodiment of the invention, the audio to be processed can be filtered through a plurality of filters to obtain the low-frequency audio, the intermediate-frequency audio and the first high-frequency audio. The frequency of the intermediate frequency audio is in the human voice frequency range, namely the intermediate frequency audio is equivalent to the human voice audio in the audio. The maximum frequency of the low-frequency audio is less than or equal to the minimum frequency of the medium-frequency audio, that is, the low-frequency audio corresponds to the non-human voice audio of which the frequency is less than the human voice audio. The minimum frequency of the first high-frequency audio is greater than or equal to the maximum frequency of the medium-frequency audio, that is, the first high-frequency audio is equivalent to the non-human voice audio of which the frequency is greater than the human voice audio. The human voice frequency and the non-human voice frequency in the audio frequency can be separated through a plurality of filters. Specifically, the operation of performing filter processing on the audio to be processed by a plurality of filters can be realized by this step 201 and the following step 202.

The audio is the audio to be processed, which may be a recording or a song; the embodiment of the present invention does not limit this. For example, the audio may be any song stored in the terminal, or any song in audio playing software. The maximum frequency of the low-frequency audio is greater than or equal to the minimum human voice frequency, and the frequency range of the second high-frequency audio includes the human voice frequency range and the frequency range above the maximum human voice frequency.

In the embodiment of the present invention, the audio is filtered by the first low-pass filter and the first high-pass filter, so that the low-frequency audio with a frequency lower than the human voice frequency can be separated from the audio, and the audio is separated into two audio ranges, that is, the low-frequency audio with a frequency lower than the human voice frequency and the second high-frequency audio with a frequency higher than the low-frequency audio.

It should be noted that the human voice frequency is generally between 150 Hz and 3600 Hz (which may be understood as unvoiced sound). Within this interval, the portion to which human hearing is sensitive, selected according to, for example, the loudness curve of human hearing, is generally between 500 Hz and 3600 Hz. Therefore, in the embodiment of the present invention, frequencies between 500 Hz and 3600 Hz may be used as the human voice frequency range.

Optionally, the first frequency may be 500 Hz. The audio is input into the first low-pass filter, whose filtering yields low-frequency audio with frequencies of 0 to 500 Hz; the audio is also input into the first high-pass filter, whose filtering yields second high-frequency audio with frequencies above 500 Hz. That is, when the first frequency is 500 Hz, the frequency range of the low-frequency audio is 0 to 500 Hz, and the frequency range of the second high-frequency audio is above 500 Hz.

Further, the plurality of filters may be constructed before the audio is filtered through the plurality of filters. For example, two low-pass filters and two high-pass filters, i.e., a first low-pass filter, a first high-pass filter, a second low-pass filter, and a second high-pass filter, are constructed. It should be noted that any one of the filters may adopt an IIR filter, or may also adopt a non-IIR filter, for example, one or a combination of more of an FIR filter, an FFT filter and an MDCT filter, which is not limited in this embodiment of the present invention.

In one possible implementation, when the first low-pass filter and the first high-pass filter are non-IIR filters, they may be obtained by setting the length, the start frequency, and the cut-off frequency of the filters. For example, the implementation process may include: acquiring the audio duration and the audio sampling frequency of the audio, and taking the product of the audio sampling frequency and the audio duration as the length of each of the first low-pass filter and the first high-pass filter.

For example, assuming that the audio sampling frequency of the audio is 44100 Hz and the audio duration is 1 second, the product of the audio duration and the audio sampling frequency is 44100, and the length of each of the first low-pass filter and the first high-pass filter is set to 44100. The length of a filter may also be referred to as the order of the filter; the longer the length, the better the performance of the filter.
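The length rule above is simple arithmetic and can be sketched as follows; the function name `filter_length` is hypothetical, and rounding to the nearest integer is an assumption for durations that do not give a whole number of samples.

```python
def filter_length(sample_rate_hz, duration_s):
    """Filter length = audio sampling frequency x audio duration (in samples)."""
    return int(round(sample_rate_hz * duration_s))


# The worked example from the text: 44100 Hz and 1 second of audio.
print(filter_length(44100, 1.0))
```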

In addition, it is necessary to set a cut-off frequency of the first low-pass filter and a start frequency of the first high-pass filter, for example, the cut-off frequency of the first low-pass filter and the start frequency of the first high-pass filter are both 500Hz, so that when the audio is filtered by the first low-pass filter and the first high-pass filter, the low-frequency audio and the second high-frequency audio in the audio can be separated.
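As an illustration of a low-pass/high-pass pair that shares one crossover frequency, the sketch below designs a windowed-sinc FIR low-pass and derives the matching high-pass by spectral inversion. The design method, the Hamming window, the function names and the short tap count used in the test are all assumptions chosen for readability; the description only fixes the length rule and the 500 Hz example frequency.

```python
import math


def lowpass_taps(cutoff_hz, sample_rate, num_taps):
    """Hamming-windowed sinc low-pass; num_taps must be odd and at least 3."""
    if num_taps < 3 or num_taps % 2 == 0:
        raise ValueError("num_taps must be odd and >= 3")
    fc = cutoff_hz / sample_rate          # normalized cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]      # normalize to unity gain at DC


def highpass_taps(cutoff_hz, sample_rate, num_taps):
    """Spectral inversion: a delayed unit impulse minus the low-pass taps."""
    taps = [-t for t in lowpass_taps(cutoff_hz, sample_rate, num_taps)]
    taps[(num_taps - 1) // 2] += 1.0
    return taps


def fir_filter(samples, taps):
    """Causal direct-form convolution; output has the same length as the input."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for j, t in enumerate(taps):
            if i - j >= 0:
                acc += t * samples[i - j]
        out.append(acc)
    return out
```

With this pairing the two outputs add up to the input delayed by (num_taps - 1) / 2 samples, so the split itself neither adds nor removes energy. A causal FIR filter of this kind introduces that fixed delay in every band it is applied to, which a real implementation would need to account for when aligning bands.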

It should be noted that the above process of constructing the first low-pass filter and the first high-pass filter is only an example. In another embodiment, the lengths of the first low-pass filter and the first high-pass filter may also be customized by a user according to implementation requirements, or set by default by the computer device, which is not limited in this embodiment of the present invention.

In addition, when the first low-pass filter and the first high-pass filter are IIR filters, such as biquad, Butterworth or elliptic filters, the concept of filter length does not apply; an IIR filter has only an order. The basic sections are generally 1st-order filters and 2nd-order biquad filters, and several such sections are typically cascaded to form higher-order filters, for example of order 3, 4 or 5. The lower the order, the worse the filter performance; the higher the order, the better. Commonly used orders for an IIR filter or cascade are 1, 2, 4 and 6.
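The cascading of second-order sections can be sketched as below. This is an assumption-laden illustration: the coefficient formulas are the widely used "Audio EQ Cookbook" low-pass biquad, the two Q values are those that make a pair of sections behave as a 4th-order Butterworth low-pass, and all function names are hypothetical. The description itself does not prescribe a particular IIR design.

```python
import math


def biquad_lowpass(cutoff_hz, sample_rate, q):
    """Normalized coefficients (b, a) of a 2nd-order low-pass section."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b = [(1.0 - cos_w0) / 2.0 / a0, (1.0 - cos_w0) / a0, (1.0 - cos_w0) / 2.0 / a0]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a


def run_biquad(samples, coeffs):
    """Direct form I difference equation for one second-order section."""
    b, a = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


# Section Q values for a 4th-order Butterworth response: 1 / (2 cos(pi/8)) and 1 / (2 cos(3 pi/8)).
BUTTERWORTH_4_Q = (0.54119610, 1.30656296)


def lowpass_4th_order(samples, cutoff_hz, sample_rate):
    """Two cascaded biquads, i.e. a 4th-order IIR low-pass."""
    for q in BUTTERWORTH_4_Q:
        samples = run_biquad(samples, biquad_lowpass(cutoff_hz, sample_rate, q))
    return samples
```

Each additional section steepens the roll-off above the cut-off frequency, which is the sense in which a higher order gives better separation between bands.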

Although the present invention places no limitation on the choice of filter, in general, if the audio is processed in blocks (that is, the audio is divided into small blocks, for example blocks of 100 milliseconds, and each block is then processed in turn according to the above process), FIR filters may be chosen as the first low-pass filter and the first high-pass filter. Conversely, if the audio is not processed in blocks, or each block is too long (for example, longer than 0.5 second), IIR filters may be used as the first low-pass filter and the first high-pass filter; further, a 4th-order IIR filter (for example, four cascaded 1st-order IIR filters or two cascaded biquad filters) may be used.
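The blocking step and the rule of thumb for choosing a filter type can be sketched as follows. The function names are hypothetical, and the 0.5 second threshold is the example value from the text rather than a fixed limit.

```python
def split_into_blocks(samples, sample_rate, block_s=0.1):
    """Divide the audio into consecutive blocks of about `block_s` seconds."""
    size = max(1, int(round(block_s * sample_rate)))
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def suggest_filter_kind(block_s=None, long_block_s=0.5):
    """Heuristic from the text: FIR for short blocks, IIR for no blocking or long blocks."""
    if block_s is None or block_s > long_block_s:
        return "IIR"
    return "FIR"


# One second of audio at 44100 Hz, cut into 100 ms blocks.
blocks = split_into_blocks(list(range(44100)), 44100, block_s=0.1)
print(len(blocks), len(blocks[0]), suggest_filter_kind(0.1))
```

When processing block by block, any filter or delayer state has to be carried over from one block to the next so that the block boundaries do not become audible.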

Step 202: Filter the second high-frequency audio through a second low-pass filter and a second high-pass filter, respectively, to obtain a medium-frequency audio and a first high-frequency audio, wherein the cut-off frequency of the second low-pass filter and the initial frequency of the second high-pass filter are second frequencies, and the second frequencies are greater than the first frequencies and less than or equal to the maximum human voice frequency.

Since the second high-frequency audio includes both the human voice audio and the high-frequency audio with a frequency greater than that of the human voice audio, after the second high-frequency audio is obtained, the second high-frequency audio can be further subjected to filtering processing by the two filters, so that the human voice frequency and the non-human voice frequency in the second high-frequency audio can be further separated.

Optionally, the second frequency may be 3600 Hz. The second high-frequency audio is input into the second low-pass filter, whose filtering yields medium-frequency audio with frequencies of 500 Hz to 3600 Hz; the second high-frequency audio is also input into the second high-pass filter, whose filtering yields first high-frequency audio with frequencies above 3600 Hz. That is, when the first frequency is 500 Hz and the second frequency is 3600 Hz, the frequency range of the medium-frequency audio is 500 Hz to 3600 Hz, and the frequency range of the first high-frequency audio is above 3600 Hz.

Further, before the second high-frequency audio is filtered by the second low-pass filter and the second high-pass filter, the second low-pass filter and the second high-pass filter may be constructed. Moreover, since all operations in the embodiment of the present invention are performed in the time domain, the input and output lengths of the audio samples in all steps are completely the same, and therefore, the lengths of the intermediate frequency audio and the first high frequency audio in this step are also completely the same, that is, the length is the number of PCM samples included in the audio.

In one possible implementation, when the second low-pass filter and the second high-pass filter are non-IIR filters, they may be obtained by setting the length, the start frequency, and the cut-off frequency of the filters. For example, the implementation process may include: acquiring the audio duration and the audio sampling frequency of the audio, and taking the product of the audio sampling frequency and the audio duration as the length of each of the second low-pass filter and the second high-pass filter.

For example, assuming that the audio sampling frequency of the audio is 44100 Hz and the audio duration is 1 second, the product of the audio duration and the audio sampling frequency is 44100, and the length of each of the second low-pass filter and the second high-pass filter is set to 44100. The length of a filter may also be referred to as the order of the filter; the longer the length, the better the performance of the filter.

In addition, it is also necessary to set the cut-off frequency of the second low-pass filter and the start frequency of the second high-pass filter. For example, both may be set to 3600 Hz, so that when the second high-frequency audio is filtered by the second low-pass filter and the second high-pass filter, the medium-frequency audio and the first high-frequency audio in the second high-frequency audio can be separated.

It should be noted that the above process of constructing the second low-pass filter and the second high-pass filter is only an example. In another embodiment, the lengths of the second low-pass filter and the second high-pass filter may also be customized by a user according to implementation requirements, or set by default by the computer device, which is not limited in this embodiment of the present invention.

In addition, when the second low-pass filter and the second high-pass filter are IIR filters, such as biquad filters, Butterworth filters, or elliptic filters, the filter has no concept of length; an IIR filter has only the concept of order. A basic IIR filter is generally of order 1, and a biquad filter is of order 2. Typically, multiple IIR filters are cascaded to form a higher-order filter, for example of order 3, 4, or 5. The lower the order, the worse the filter performance; the higher the order, the better. Commonly used orders for an IIR filter or cascade are 1, 2, 4, and 6.

Although the present invention places no limitation on the choice of filter, in general, if the audio is processed in blocks (that is, the audio is divided into small blocks, for example audio blocks of 100 milliseconds, and each audio block is then processed in turn according to the above process), FIR filters may be considered for the second low-pass filter and the second high-pass filter. Conversely, if the audio is not processed in blocks, or each audio block is too long (for example, more than 0.5 second), IIR filters may be selected as the second low-pass filter and the second high-pass filter; further, a 4th-order IIR filter (for example, four cascaded 1st-order IIR filters or two cascaded biquad filters) may be used.
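The 4th-order option mentioned above, two cascaded biquads, can be sketched as follows. This is only an illustration under stated assumptions: the coefficient formulas are the widely used second-order low-pass design with Q = 1/sqrt(2) (a Butterworth section), the 3600 Hz cutoff is borrowed from the earlier example, and the function names are hypothetical. The embodiment does not prescribe any particular IIR design.

```python
import math

def butterworth_biquad_lowpass(cutoff_hz, sample_rate_hz):
    """Second-order low-pass coefficients with Q = 1/sqrt(2), normalized so a[0] = 1."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate_hz
    q = 1.0 / math.sqrt(2.0)
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b = [(1.0 - cos_w0) / 2.0 / a0, (1.0 - cos_w0) / a0, (1.0 - cos_w0) / 2.0 / a0]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a

def biquad(signal, b, a):
    """Direct form I biquad; produces one output sample per input sample."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in signal:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

def cascade(signal, sections):
    """Run the signal through each (b, a) section in turn; two biquads give order 4."""
    for b, a in sections:
        signal = biquad(signal, b, a)
    return signal
```

Calling `cascade(samples, [section, section])` with the same section twice gives a 4th-order low-pass response. Because each biquad carries its state from sample to sample, the output length always equals the input length.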

Step 203: delaying the low-frequency audio and the first high-frequency audio by a preset duration, respectively.

The preset duration may be customized by a user according to actual needs, or may be set by default by the computer device, which is not limited in the embodiment of the present invention. For example, the preset duration may lie within the interval of 0 to 50 milliseconds. The smaller the preset duration, the weaker the enhancement of human voice definition, which is more suitable for a quiet room; the larger the preset duration, the stronger the enhancement of human voice definition, which is more suitable for a noisy environment. For example, the preset duration may be set to 25 milliseconds.

In one possible implementation, delaying the low-frequency audio and the first high-frequency audio by the preset duration, respectively, may include: constructing a first delayer and a second delayer; inputting a target number of preset values into the first delayer, then inputting the low-frequency audio into the first delayer, and outputting the delayed low-frequency audio through the first delayer; and inputting a target number of preset values into the second delayer, then inputting the first high-frequency audio into the second delayer, and outputting the delayed first high-frequency audio through the second delayer. The ratio of the target number to the audio sampling frequency of the audio is the preset duration.

The preset value may be set by a user in a user-defined manner according to actual requirements, or may be set by default by the computer device, which is not limited in the embodiment of the present invention. For example, the preset value may be "0", or may be "1".

The target number is generally determined by the audio sampling frequency and the preset duration; specifically, the target number is the product of the audio sampling frequency and the preset duration. For example, when the audio sampling frequency is 44100 Hz and the preset duration is 25 milliseconds, the target number may be determined to be 1103 (the product, 1102.5, rounded to a whole number of samples).
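The arithmetic behind this example can be written out as follows, where $f_s$ denotes the audio sampling frequency, $T$ the preset duration, and $N$ the target number. These symbols are introduced here for illustration only, and rounding half up to the nearest whole sample is an assumption inferred from the stated value of 1103.

$$N = \operatorname{round}(f_s \cdot T) = \operatorname{round}(44100\ \text{Hz} \times 0.025\ \text{s}) = \operatorname{round}(1102.5) = 1103$$

$$\frac{N}{f_s} = \frac{1103}{44100\ \text{Hz}} \approx 0.025011\ \text{s} \approx 25.011\ \text{ms}$$

The delay actually realized therefore differs from the nominal 25 milliseconds by about 0.011 milliseconds, that is, half a sample period.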

In some embodiments, the first delayer and the second delayer may each be a FIFO (first in, first out) buffer. Assume that the first delayer is FIFO1, the second delayer is FIFO2, the target number is 1103, and the preset value is 0. In this case, 1103 zeros are input into each of FIFO1 and FIFO2. Since the ratio of 1103 to 44100 is approximately 0.025 seconds, that is, 25 milliseconds, inputting the low-frequency audio into FIFO1 is equivalent to delaying the low-frequency audio by 25 milliseconds; similarly, inputting the first high-frequency audio into FIFO2 is equivalent to delaying the first high-frequency audio by 25 milliseconds. The electronic device can control FIFO1 to output audio of the same length as the low-frequency audio, yielding the delayed low-frequency audio, which can be denoted as LP_OUT, for example. Similarly, FIFO2 may be controlled to output audio of the same length as the first high-frequency audio, yielding the delayed first high-frequency audio, which may be denoted as HP_OUT, for example.
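The FIFO-based delay described above can be sketched in Python as follows. The function name `fifo_delay` and the use of `collections.deque` as the FIFO buffer are assumptions for illustration; the embodiment only requires that the buffer be primed with the target number of preset values and that the output have the same length as the input.

```python
from collections import deque

def fifo_delay(samples, target_count, preset_value=0.0):
    """Delay `samples` by `target_count` samples using a FIFO primed with preset values."""
    fifo = deque([preset_value] * target_count)   # prime the FIFO before any audio arrives
    delayed = []
    for s in samples:
        fifo.append(s)                   # push one input sample
        delayed.append(fifo.popleft())   # pop one output sample, keeping lengths equal
    return delayed
```

With `target_count=1103` at 44100 Hz, the first 1103 output samples are the preset value and the input follows from there. The last 1103 input samples remain in the FIFO and are not emitted, which is what keeps the output the same length as the input.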

It should be noted that, since all operations in the embodiment of the present invention are performed in the time domain, the input and output lengths of the audio samples in all steps are completely the same, and therefore, the lengths of the low-frequency audio, the first high-frequency audio, and the low-frequency audio and the first high-frequency audio after the delay processing in this step are also the same.

Step 204: performing audio synthesis processing on the intermediate-frequency audio, the delayed low-frequency audio, and the delayed first high-frequency audio to obtain a target audio.

In the audio synthesis process, the inverse process of step 201 described above can be used. For example, when an IIR filter or an FIR filter is used in step 201, the audio synthesis process is to linearly add the audios, i.e., the middle-frequency audio, and the delayed low-frequency audio and the first high-frequency audio. For another example, when the MDCT filter is selected in step 201, the audio synthesis process here is inverse MDCT, that is, IMDCT (inverse modified discrete cosine transform).

For example, assuming that the audio synthesis process is performed by linear addition, with the intermediate-frequency audio denoted as VOCAL, the delayed low-frequency audio denoted as LP_OUT, and the delayed first high-frequency audio denoted as HP_OUT, the final output target audio is OUT = VOCAL + LP_OUT + HP_OUT.
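A minimal sketch of the linear-addition case is given below. The function name `synthesize` and the explicit length check are assumptions added for illustration; the three arguments correspond to VOCAL, LP_OUT, and HP_OUT in the text.

```python
def synthesize(vocal, lp_out, hp_out):
    """Linear addition of the three bands: OUT = VOCAL + LP_OUT + HP_OUT."""
    if not (len(vocal) == len(lp_out) == len(hp_out)):
        raise ValueError("all bands must contain the same number of samples")
    return [v + l + h for v, l, h in zip(vocal, lp_out, hp_out)]
```

The length check reflects the statement that every step preserves the number of PCM samples, so the three bands can be added sample by sample.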

It should be noted that, if the audio is stereo audio or multi-channel audio, each channel of audio may be processed according to the audio processing method provided by the embodiment of the present invention. That is, the audio in step 201 may also be each channel audio in the target audio.
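Per-channel processing of stereo or multi-channel audio might look like the sketch below. The interleaved PCM layout, the function name `process_channels`, and the `process_mono` callback (standing in for steps 201 to 204 applied to one channel) are assumptions for illustration, and the callback is expected to preserve the number of samples.

```python
def process_channels(interleaved, num_channels, process_mono):
    """Split interleaved PCM into channels, process each independently, re-interleave."""
    channels = [interleaved[c::num_channels] for c in range(num_channels)]
    processed = [process_mono(ch) for ch in channels]
    out = [0.0] * len(interleaved)
    for c, ch in enumerate(processed):
        out[c::num_channels] = ch   # requires process_mono to keep the length unchanged
    return out
```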

Fig. 3 is a schematic diagram illustrating the structure of an audio processing apparatus according to an exemplary embodiment, which may be implemented by software, hardware, or a combination of both. The audio processing apparatus may include:

the filtering processing module 310 is configured to perform filtering processing on an audio through a plurality of filters to obtain a low-frequency audio, a medium-frequency audio, and a first high-frequency audio, where a frequency of the medium-frequency audio is within a human voice frequency range, a maximum frequency of the low-frequency audio is less than or equal to a minimum frequency of the medium-frequency audio, and a minimum frequency of the first high-frequency audio is greater than or equal to a maximum frequency of the medium-frequency audio;

a delay processing module 320, configured to delay the low-frequency audio and the first high-frequency audio by a preset time length, respectively;

and a synthesis processing module 330, configured to perform audio synthesis processing on the intermediate-frequency audio, the low-frequency audio after the delay processing, and the first high-frequency audio to obtain a target audio.

Optionally, the filtering processing module 310 includes:

the first filtering processing unit is used for respectively filtering the audio frequency through a first low-pass filter and a first high-pass filter to obtain a low-frequency audio frequency and a second high-frequency audio frequency, the cut-off frequency of the first low-pass filter and the initial frequency of the first high-pass filter are first frequencies, and the first frequencies are greater than or equal to the minimum human voice frequency;

Optionally, the delay processing module 320 includes:

a construction unit for constructing a first delayer and a second delayer;

a first delay processing unit, configured to input a target number of preset values into the first delayer, input the low-frequency audio into the first delayer, and output the delayed low-frequency audio through the first delayer, where the ratio of the target number to the audio sampling frequency of the audio is the preset duration;

Optionally, the apparatus further comprises:

It should be noted that: in the audio processing apparatus provided in the foregoing embodiment, when the audio processing method is implemented, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the audio processing apparatus and the audio processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.

Fig. 4 is a block diagram of an electronic device 400 provided in accordance with an example embodiment. The electronic device 400 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The electronic device 400 may also be referred to by other names, such as user equipment, portable electronic device, laptop electronic device, or desktop electronic device.

In general, the electronic device 400 includes: a processor 401 and a memory 402.

Processor 401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 401 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 401 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 402 is used to store at least one instruction for execution by processor 401 to implement the audio processing method provided by the method embodiments herein.

In some embodiments, the electronic device 400 may further optionally include: a peripheral interface 403 and at least one peripheral. The processor 401, memory 402 and peripheral interface 403 may be connected by buses or signal lines. Each peripheral may be connected to the peripheral interface 403 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 404, touch screen display 405, camera 406, audio circuitry 407, positioning components 408, and power supply 409.

The peripheral interface 403 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 401 and the memory 402. In some embodiments, processor 401, memory 402, and peripheral interface 403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 401, the memory 402 and the peripheral interface 403 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.

The radio frequency circuit 404 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 404 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 404 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 404 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 404 may communicate with other electronic devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 404 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.

The display screen 405 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch display screen, the display screen 405 also has the ability to capture touch signals on or over the surface of the display screen 405. The touch signal may be input to the processor 401 as a control signal for processing. In this case, the display screen 405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 405, disposed on the front panel of the electronic device 400; in other embodiments, there may be at least two display screens 405, respectively disposed on different surfaces of the electronic device 400 or in a folded design; in still other embodiments, the display screen 405 may be a flexible display screen disposed on a curved surface or a folded surface of the electronic device 400. Furthermore, the display screen 405 may be arranged in a non-rectangular irregular shape, that is, as an irregularly shaped screen. The display screen 405 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 406 is used to capture images or video. Optionally, the camera assembly 406 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the electronic device, and the rear camera is disposed on the rear surface of the electronic device. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 406 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

The audio circuit 407 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 401 for processing, or inputting the electric signals to the radio frequency circuit 404 for realizing voice communication. For stereo capture or noise reduction purposes, the microphones may be multiple and disposed at different locations of the electronic device 400. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 407 may also include a headphone jack.

The positioning component 408 is used to determine the current geographic location of the electronic device 400 for navigation or LBS (Location Based Service). The positioning component 408 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 409 is used to supply power to the various components in the electronic device 400. The power source 409 may be alternating current, direct current, disposable or rechargeable. When power source 409 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the electronic device 400 also includes one or more sensors 410. The one or more sensors 410 include, but are not limited to: acceleration sensor 411, gyro sensor 412, pressure sensor 413, fingerprint sensor 414, optical sensor 415, and proximity sensor 416.

The acceleration sensor 411 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the electronic apparatus 400. For example, the acceleration sensor 411 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 401 may control the touch display screen 405 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 411. The acceleration sensor 411 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 412 may detect a body direction and a rotation angle of the electronic device 400, and the gyro sensor 412 may acquire a 3D motion of the user on the electronic device 400 in cooperation with the acceleration sensor 411. From the data collected by the gyro sensor 412, the processor 401 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization while shooting, game control, and inertial navigation.

The pressure sensors 413 may be disposed on a side bezel of the electronic device 400 and/or on a lower layer of the touch display screen 405. When the pressure sensor 413 is arranged on the side frame of the electronic device 400, a holding signal of the user to the electronic device 400 can be detected, and the processor 401 performs left-right hand identification or shortcut operation according to the holding signal collected by the pressure sensor 413. When the pressure sensor 413 is disposed at the lower layer of the touch display screen 405, the processor 401 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 405. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 414 is used for collecting a fingerprint of the user, and the processor 401 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 414, or the fingerprint sensor 414 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 401 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 414 may be disposed on the front, back, or side of the electronic device 400. When a physical button or vendor Logo is provided on the electronic device 400, the fingerprint sensor 414 may be integrated with the physical button or vendor Logo.

The optical sensor 415 is used to collect the ambient light intensity. In one embodiment, the processor 401 may control the display brightness of the touch display screen 405 according to the ambient light intensity collected by the optical sensor 415. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 405 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 405 is decreased. In another embodiment, the processor 401 may also dynamically adjust the shooting parameters of the camera assembly 406 according to the ambient light intensity collected by the optical sensor 415.

The proximity sensor 416, also known as a distance sensor, is typically disposed on the front panel of the electronic device 400. The proximity sensor 416 is used to capture the distance between the user and the front of the electronic device 400. In one embodiment, when the proximity sensor 416 detects that the distance between the user and the front of the electronic device 400 gradually decreases, the processor 401 controls the touch display screen 405 to switch from the screen-on state to the screen-off state; when the proximity sensor 416 detects that the distance between the user and the front of the electronic device 400 gradually increases, the processor 401 controls the touch display screen 405 to switch from the screen-off state to the screen-on state.

Those skilled in the art will appreciate that the configuration shown in fig. 4 does not constitute a limitation of the electronic device 400, and may include more or fewer components than those shown, or combine certain components, or employ a different arrangement of components.

Embodiments of the present application further provide a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of a computer device, enable the computer device to perform the audio processing method provided in the embodiment shown in fig. 1 or fig. 2.

Embodiments of the present application further provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the audio processing method provided in the embodiment shown in fig. 1 or fig. 2.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method of audio processing, the method comprising:

filtering audio through a plurality of filters to obtain a low-frequency audio, a medium-frequency audio and a first high-frequency audio, wherein the frequency of the medium-frequency audio is within a human voice frequency range, the medium-frequency audio is a human voice audio in the audio, the maximum frequency of the low-frequency audio is less than or equal to the minimum frequency of the medium-frequency audio, the low-frequency audio is a non-human voice audio of which the frequency in the audio is less than the human voice audio, the minimum frequency of the first high-frequency audio is greater than or equal to the maximum frequency of the medium-frequency audio, and the first high-frequency audio is a non-human voice audio of which the frequency in the audio is greater than the human voice audio;

performing audio synthesis processing on the intermediate-frequency audio, the low-frequency audio after delay processing and the first high-frequency audio to obtain a target audio;

the delaying the low-frequency audio and the first high-frequency audio by a preset time period respectively includes:

constructing a first delayer and a second delayer; inputting a target number of preset values into the first delayer, inputting the low-frequency audio into the first delayer, and outputting the delayed low-frequency audio through the first delayer, wherein the ratio of the target number to the audio sampling frequency of the audio is the preset duration;

2. The method of claim 1, wherein the filtering the audio through a plurality of filters to obtain the low frequency audio, the mid frequency audio, and the first high frequency audio comprises:

respectively filtering the audio frequency through a first low-pass filter and a first high-pass filter to obtain a low-frequency audio frequency and a second high-frequency audio frequency, wherein the cut-off frequency of the first low-pass filter and the initial frequency of the first high-pass filter are first frequencies, and the first frequencies are greater than or equal to the minimum human voice frequency;

3. The method of any of claims 1-2, wherein prior to filtering the audio through the plurality of filters, further comprising:

acquiring the audio time length and the audio sampling frequency of the audio;

4. A method as claimed in claim 3, wherein the plurality of filters are a combination of one or more of finite impulse response (FIR) filters, discrete Fourier transform (FFT) filters, and modified discrete cosine transform (MDCT) filters.

5. An audio processing apparatus, characterized in that the apparatus comprises:

the filtering processing module is used for filtering audio through a plurality of filters to obtain a low-frequency audio, a medium-frequency audio and a first high-frequency audio, wherein the frequency of the medium-frequency audio is within a human voice frequency range, the medium-frequency audio is a human voice audio in the audio, the maximum frequency of the low-frequency audio is less than or equal to the minimum frequency of the medium-frequency audio, the low-frequency audio is a non-human voice audio of which the frequency in the audio is less than the human voice audio, the minimum frequency of the first high-frequency audio is greater than or equal to the maximum frequency of the medium-frequency audio, and the first high-frequency audio is a non-human voice audio of which the frequency in the audio is greater than the human voice audio;

the synthesis processing module is used for carrying out audio synthesis processing on the intermediate-frequency audio, the low-frequency audio after delay processing and the first high-frequency audio to obtain a target audio;

the delay processing module comprises:

a construction unit for constructing a first delayer and a second delayer;

a first delay processing unit, configured to input a target number of preset values into the first delayer, input the low-frequency audio into the first delayer, and output the delayed low-frequency audio through the first delayer, where the ratio of the target number to the audio sampling frequency of the audio is the preset duration;

6. The apparatus of claim 5, wherein the filter processing module comprises:

7. The apparatus of any of claims 5-6, further comprising:

a setting module to take a product of the audio sampling frequency and the audio duration as a length of each of the plurality of filters.

8. The apparatus of claim 7, wherein the plurality of filters are a combination of one or more of a finite impulse response (FIR) filter, a discrete Fourier transform (FFT) filter, and a modified discrete cosine transform (MDCT) filter.

9. An electronic device, wherein the electronic device comprises a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the steps of the method of any one of claims 1-4.

10. A computer readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of the method of any one of claims 1-4.