CN115101084A - Model training method, audio processing method, device, sound box, equipment and medium - Google Patents
- Publication number
- CN115101084A (application CN202210723242.8A)
- Authority
- CN
- China
- Prior art keywords
- amplitude spectrum
- audio
- processing model
- audio signal
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/04—Circuits for transducers, loudspeakers or microphones for correcting frequency response
Landscapes
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Circuit For Audible Band Transducer (AREA)
Description
Technical Field
The present disclosure relates to the field of audio processing, and more particularly to a model training method, an audio processing method, an apparatus, a speaker, a device, and a medium.
Background
With the growing adoption of neural networks, they are increasingly applied in the audio field, for example to audio denoising, audio dereverberation, and speech separation, and they often achieve better results than traditional algorithms. In the field of audio noise reduction, current systems typically only denoise the audio band up to 16 kHz. Although a small number of systems attempt full-band noise reduction, their performance on the band above 16 kHz is unsatisfactory: either the sound quality of the denoised audio degrades, or noise reduction in the high-frequency region is poor because the frequency resolution there is blurred.
Summary of the Invention
The present disclosure provides a model training method, an audio processing method, an apparatus, a speaker, a device, and a medium, so as to at least solve the above problems in the related art.
According to a first aspect of the embodiments of the present disclosure, a training method for an audio processing model is provided, including: acquiring a first training sample, where the first training sample includes the amplitude spectrum of a first noisy audio signal and the amplitude spectrum of a first original audio signal, the first noisy audio signal being obtained by mixing the first original audio signal with a noise signal; based on the amplitude spectrum of the first noisy audio signal, obtaining a first noisy amplitude spectrum for a first frequency band and a second noisy amplitude spectrum for a second frequency band; inputting the first noisy amplitude spectrum into a first processing model of the audio processing model to obtain a first estimated noise-reduced amplitude spectrum for the first frequency band, where the first processing model is pre-trained; inputting the first estimated noise-reduced amplitude spectrum and the second noisy amplitude spectrum into a second processing model of the audio processing model to obtain a second estimated noise-reduced amplitude spectrum for the second frequency band; obtaining a first loss based on the second estimated noise-reduced amplitude spectrum and a second original amplitude spectrum, where the second original amplitude spectrum is the portion of the amplitude spectrum of the first original audio signal corresponding to the second frequency band; and training the audio processing model by adjusting model parameters of the second processing model according to the first loss.
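The training-step data flow described above (band split, frozen first model, trainable second model, loss computed on the high band only) can be sketched as follows. This is an illustrative NumPy sketch: the bin counts, the MSE loss, and both stand-in models are assumptions for illustration, not the networks actually disclosed.

```python
import numpy as np

rng = np.random.default_rng(0)
F_LOW, F_HIGH, T = 161, 320, 100   # assumed bin counts for the two bands, and frame count

def first_model(low_noisy):
    # Stand-in for the pre-trained (frozen) first processing model.
    return np.clip(low_noisy - 0.1, 0.0, None)

def second_model(low_denoised, high_noisy):
    # Stand-in for the trainable second processing model; it conditions the
    # high-band estimate on the first model's low-band output.
    return high_noisy * (0.5 + 0.1 * np.tanh(low_denoised.mean()))

# First training sample: amplitude spectra of the noisy mixture and the original.
clean_mag = np.abs(rng.standard_normal((F_LOW + F_HIGH, T)))
noisy_mag = clean_mag + 0.3 * np.abs(rng.standard_normal((F_LOW + F_HIGH, T)))

low_noisy, high_noisy = noisy_mag[:F_LOW], noisy_mag[F_LOW:]   # band split
high_clean = clean_mag[F_LOW:]            # "second original amplitude spectrum"

low_est = first_model(low_noisy)          # first estimated noise-reduced spectrum
high_est = second_model(low_est, high_noisy)   # second estimated spectrum

first_loss = np.mean((high_est - high_clean) ** 2)  # "first loss" (assumed MSE)
# Only the second model's parameters would be updated from first_loss.
```

Because the first model is frozen, the gradient of `first_loss` flows only into the second model, which is exactly the guidance mechanism the disclosure describes.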
Optionally, the first frequency band and the second frequency band are obtained by dividing the full band of the amplitude spectrum of the first noisy audio signal.
Optionally, the first frequency band is 0-16 kHz and the second frequency band is 16-48 kHz.
Optionally, the first processing model is pre-trained as follows: acquiring a second training sample, where the second training sample includes the amplitude spectrum of a second noisy audio signal and the amplitude spectrum of a second original audio signal, the second noisy audio signal being obtained by mixing the second original audio signal with a noise signal, and the frequency band of the second original audio signal being the first frequency band; inputting the amplitude spectrum of the second noisy audio signal into the first processing model to obtain a third estimated noise-reduced amplitude spectrum for the first frequency band; obtaining a second loss based on the third estimated noise-reduced amplitude spectrum and the amplitude spectrum of the second original audio signal; and training the first processing model by adjusting its model parameters according to the second loss.
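The pre-training loop above (third estimated spectrum → second loss → parameter adjustment) can be illustrated with a minimal gradient-descent sketch in which the first processing model is reduced to a single learnable per-bin gain. The model form, the MSE loss, the learning rate, and the iteration count are all assumptions; the disclosure does not specify them.

```python
import numpy as np

rng = np.random.default_rng(1)
F, T = 161, 50                                   # assumed low-band bins x frames
clean = np.abs(rng.standard_normal((F, T)))      # second original amplitude spectrum
noisy = clean + 0.3 * np.abs(rng.standard_normal((F, T)))  # mixed with noise

gain = np.ones((F, 1))                           # the stand-in model's only parameter
lr, losses = 0.01, []
for _ in range(200):
    est = gain * noisy                           # third estimated noise-reduced spectrum
    losses.append(np.mean((est - clean) ** 2))   # "second loss" (assumed MSE)
    # Analytic gradient of the MSE with respect to the per-bin gain:
    grad = 2.0 * np.mean((est - clean) * noisy, axis=1, keepdims=True)
    gain -= lr * grad                            # adjust model parameters from the loss
```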
Optionally, the amplitude spectra of the second noisy audio signal and the second original audio signal are obtained as follows: performing a short-time Fourier transform on the second noisy audio signal and the second original audio signal respectively to obtain both signals in the time-frequency domain; and extracting amplitude spectra from the time-frequency-domain signals to obtain the amplitude spectrum of the second noisy audio signal and the amplitude spectrum of the second original audio signal.
Optionally, the amplitude spectra of the first noisy audio signal and the first original audio signal are obtained as follows: performing a short-time Fourier transform on the first noisy audio signal and the first original audio signal respectively to obtain both signals in the time-frequency domain; and extracting amplitude spectra from the time-frequency-domain signals to obtain the amplitude spectrum of the first noisy audio signal and the amplitude spectrum of the first original audio signal.
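The STFT-and-amplitude-extraction step can be sketched as follows. The frame length, hop size, and Hann window are assumed parameters, since the disclosure does not fix them.

```python
import numpy as np

def stft_amplitude(x, n_fft=1024, hop=256):
    """Return the amplitude (magnitude) spectrum and phase of a time-domain signal."""
    win = np.hanning(n_fft)
    frames = [np.fft.rfft(x[s:s + n_fft] * win)
              for s in range(0, len(x) - n_fft + 1, hop)]
    spec = np.array(frames).T               # (n_fft // 2 + 1) bins x n_frames
    return np.abs(spec), np.angle(spec)

# One second of a 440 Hz tone at 48 kHz, mixed with noise:
fs = 48_000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.default_rng(2).standard_normal(fs)
amp, phase = stft_amplitude(noisy)          # amplitude spectrum fed to the models
```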
According to a second aspect of the embodiments of the present disclosure, an audio processing method is provided, including: acquiring an audio signal to be processed; and performing audio processing on the audio signal to be processed using an audio processing model to obtain a processed audio signal, where the audio processing model is trained by the training method of any implementation of the first aspect.
Optionally, performing audio processing on the audio signal to be processed using the audio processing model to obtain the processed audio signal includes: based on the amplitude spectrum of the audio signal to be processed, obtaining a first amplitude spectrum for the first frequency band and a second amplitude spectrum for the second frequency band; inputting the first amplitude spectrum into the first processing model of the audio processing model to obtain a first estimated amplitude spectrum for the first frequency band; inputting the second amplitude spectrum and the first estimated amplitude spectrum into the second processing model of the audio processing model to obtain a second estimated amplitude spectrum for the second frequency band; combining the first estimated amplitude spectrum and the second estimated amplitude spectrum to obtain an estimated amplitude spectrum; and obtaining the processed audio signal based on the estimated amplitude spectrum.
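The two-stage inference flow above (split the spectrum, denoise the low band, condition the high band on the low-band result, then recombine) can be sketched as follows; the band sizes and both stand-in models are illustrative assumptions.

```python
import numpy as np

F_LOW, F_HIGH, T = 161, 320, 100    # assumed band sizes and frame count

def first_model(low):               # stand-in for the low-band processing model
    return np.clip(low - 0.1, 0.0, None)

def second_model(low_est, high):    # stand-in for the high-band processing model
    return 0.8 * high

full_amp = np.abs(np.random.default_rng(3).standard_normal((F_LOW + F_HIGH, T)))
low_amp, high_amp = full_amp[:F_LOW], full_amp[F_LOW:]   # band split

low_est = first_model(low_amp)                  # first estimated amplitude spectrum
high_est = second_model(low_est, high_amp)      # second estimated amplitude spectrum
est_amp = np.concatenate([low_est, high_est])   # combined estimated amplitude spectrum
```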
Optionally, the first frequency band and the second frequency band are obtained by dividing the full band of the amplitude spectrum of the audio signal to be processed.
Optionally, the first frequency band is 0-16 kHz and the second frequency band is 16-48 kHz.
Optionally, the amplitude spectrum of the audio signal to be processed is obtained by: performing a short-time Fourier transform on the audio signal to obtain the audio signal in the time-frequency domain; and extracting the amplitude spectrum from the time-frequency-domain audio signal.
Optionally, obtaining the processed audio signal based on the estimated amplitude spectrum includes: multiplying the estimated amplitude spectrum by the phase corresponding to the estimated amplitude spectrum; and performing an inverse short-time Fourier transform on the product to obtain the processed audio signal in the time domain.
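The reconstruction step (multiply the estimated amplitude spectrum by its phase, then apply the inverse STFT) can be sketched with a windowed overlap-add ISTFT. The frame length, hop, and Hann window are assumed; here the "estimated" amplitude is taken directly from the input's own spectrum, so the round trip simply recovers the signal.

```python
import numpy as np

N_FFT, HOP = 1024, 256          # assumed frame length and hop size
WIN = np.hanning(N_FFT)

def stft(x):
    starts = range(0, len(x) - N_FFT + 1, HOP)
    return np.array([np.fft.rfft(x[s:s + N_FFT] * WIN) for s in starts]).T

def istft(spec):
    """Inverse STFT via windowed overlap-add with pointwise window normalization."""
    n_frames = spec.shape[1]
    out = np.zeros(HOP * (n_frames - 1) + N_FFT)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        seg = slice(i * HOP, i * HOP + N_FFT)
        out[seg] += np.fft.irfft(spec[:, i], n=N_FFT) * WIN
        norm[seg] += WIN ** 2
    return out / np.maximum(norm, 1e-8)

fs = 48_000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
spec = stft(x)
est_amp = np.abs(spec)           # stand-in for the model's estimated amplitude spectrum
phase = np.angle(spec)           # phase corresponding to the estimated spectrum
y = istft(est_amp * np.exp(1j * phase))   # multiply, then inverse STFT
```

Dividing by the accumulated squared window makes the reconstruction exact wherever the frames overlap, without requiring the hop to satisfy a constant-overlap-add constraint.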
According to a third aspect of the embodiments of the present disclosure, a training apparatus for an audio processing model is provided, including: a first training sample acquisition unit configured to acquire a first training sample, where the first training sample includes the amplitude spectrum of a first noisy audio signal and the amplitude spectrum of a first original audio signal, the first noisy audio signal being obtained by mixing the first original audio signal with a noise signal; a noisy amplitude spectrum acquisition unit configured to obtain, based on the amplitude spectrum of the first noisy audio signal, a first noisy amplitude spectrum for a first frequency band and a second noisy amplitude spectrum for a second frequency band; a first estimated noise-reduced amplitude spectrum determination unit configured to input the first noisy amplitude spectrum into a first processing model of the audio processing model to obtain a first estimated noise-reduced amplitude spectrum for the first frequency band, where the first processing model is pre-trained; a second estimated noise-reduced amplitude spectrum determination unit configured to input the first estimated noise-reduced amplitude spectrum and the second noisy amplitude spectrum into a second processing model of the audio processing model to obtain a second estimated noise-reduced amplitude spectrum for the second frequency band; a first loss acquisition unit configured to obtain a first loss based on the second estimated noise-reduced amplitude spectrum and a second original amplitude spectrum, where the second original amplitude spectrum is the portion of the amplitude spectrum of the first original audio signal corresponding to the second frequency band; and a model parameter adjustment unit configured to train the audio processing model by adjusting model parameters of the second processing model according to the first loss.
Optionally, the first frequency band and the second frequency band are obtained by dividing the full band of the amplitude spectrum of the first noisy audio signal.
Optionally, the first frequency band is 0-16 kHz and the second frequency band is 16-48 kHz.
Optionally, the first processing model is pre-trained as follows: acquiring a second training sample, where the second training sample includes the amplitude spectrum of a second noisy audio signal and the amplitude spectrum of a second original audio signal, the second noisy audio signal being obtained by mixing the second original audio signal with a noise signal, and the frequency band of the second original audio signal being the first frequency band; inputting the amplitude spectrum of the second noisy audio signal into the first processing model to obtain a third estimated noise-reduced amplitude spectrum for the first frequency band; obtaining a second loss based on the third estimated noise-reduced amplitude spectrum and the amplitude spectrum of the second original audio signal; and training the first processing model by adjusting its model parameters according to the second loss.
Optionally, the amplitude spectra of the second noisy audio signal and the second original audio signal are obtained as follows: performing a short-time Fourier transform on the second noisy audio signal and the second original audio signal respectively to obtain both signals in the time-frequency domain; and extracting amplitude spectra from the time-frequency-domain signals to obtain the amplitude spectrum of the second noisy audio signal and the amplitude spectrum of the second original audio signal.
Optionally, the amplitude spectra of the first noisy audio signal and the first original audio signal are obtained as follows: performing a short-time Fourier transform on the first noisy audio signal and the first original audio signal respectively to obtain both signals in the time-frequency domain; and extracting amplitude spectra from the time-frequency-domain signals to obtain the amplitude spectrum of the first noisy audio signal and the amplitude spectrum of the first original audio signal.
According to a fourth aspect of the embodiments of the present disclosure, an audio processing apparatus is provided, including: an audio signal acquisition unit configured to acquire an audio signal to be processed; and an audio signal processing unit configured to perform audio processing on the audio signal to be processed using an audio processing model to obtain a processed audio signal, where the audio processing model is trained by the training method of any implementation of the first aspect.
Optionally, the audio signal processing unit may be configured to: obtain, based on the amplitude spectrum of the audio signal to be processed, a first amplitude spectrum for the first frequency band and a second amplitude spectrum for the second frequency band; input the first amplitude spectrum into the first processing model of the audio processing model to obtain a first estimated amplitude spectrum for the first frequency band; input the second amplitude spectrum and the first estimated amplitude spectrum into the second processing model of the audio processing model to obtain a second estimated amplitude spectrum for the second frequency band; combine the first estimated amplitude spectrum and the second estimated amplitude spectrum to obtain an estimated amplitude spectrum; and obtain the processed audio signal based on the estimated amplitude spectrum.
Optionally, the first frequency band and the second frequency band are obtained by dividing the full band of the amplitude spectrum of the audio signal to be processed.
Optionally, the first frequency band is 0-16 kHz and the second frequency band is 16-48 kHz.
Optionally, the audio processing apparatus further includes an audio signal amplitude spectrum acquisition unit, which may be configured to perform a short-time Fourier transform on the audio signal to obtain the audio signal in the time-frequency domain, and to extract the amplitude spectrum from the time-frequency-domain audio signal.
Optionally, the audio signal processing unit may be configured to: multiply the estimated amplitude spectrum by the phase corresponding to the estimated amplitude spectrum; and perform an inverse short-time Fourier transform on the product to obtain the processed audio signal in the time domain.
According to a fifth aspect of the embodiments of the present disclosure, a smart speaker is provided, including the audio processing apparatus according to the fourth aspect of the present disclosure.
According to a sixth aspect of the embodiments of the present disclosure, an electronic device is provided, including: at least one processor; and at least one memory storing computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to perform the training method for an audio processing model according to the first aspect of the present disclosure or the audio processing method according to the second aspect of the present disclosure.
According to a seventh aspect of the embodiments of the present disclosure, a computer-readable storage medium storing instructions is provided; when the instructions are executed by at least one processor, they cause the at least one processor to perform the training method for an audio processing model according to the first aspect of the present disclosure or the audio processing method according to the second aspect of the present disclosure.
According to an eighth aspect of the embodiments of the present disclosure, a computer program product is provided; the instructions in the computer program product are executable by a processor of a computer device to perform the training method for an audio processing model according to the first aspect of the present disclosure or the audio processing method according to the second aspect of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure offer at least the following beneficial effects:
According to the training method, apparatus, device, and medium for an audio processing model of the present disclosure, when training samples are scarce, the output of the pre-trained first processing model guides the training of the second processing model, so that the trained second processing model achieves a good noise reduction effect, thereby enhancing the robustness of the entire audio processing model and improving its noise reduction performance.
In addition, according to the audio processing method, apparatus, device, speaker, and medium of the present disclosure, performing audio processing with an audio processing model trained by the disclosed training method avoids the degradation of processed-audio sound quality caused by high model complexity and improves the noise reduction effect of the processed audio.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure; they do not unduly limit the present disclosure.
FIG. 1 is a schematic diagram of the overall system for training and applying an audio processing model according to an exemplary embodiment of the present disclosure.
FIG. 2 is a flowchart of a training method for an audio processing model according to an exemplary embodiment of the present disclosure.
FIG. 3 is a flowchart of an audio processing method according to an exemplary embodiment of the present disclosure.
FIG. 4 is a flowchart of audio processing using an audio processing model according to an exemplary embodiment of the present disclosure.
FIG. 5 is a block diagram of a training apparatus 500 for an audio processing model according to an exemplary embodiment of the present disclosure.
FIG. 6 is a block diagram of an audio processing apparatus 600 according to an exemplary embodiment of the present disclosure.
FIG. 7 is a block diagram of an audio signal processing unit 620 according to an exemplary embodiment of the present disclosure.
FIG. 8 is a block diagram of a smart speaker 800 according to an exemplary embodiment of the present disclosure.
FIG. 9 is a block diagram of an electronic device 900 according to an exemplary embodiment of the present disclosure.
Detailed Description
To help those of ordinary skill in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present disclosure are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. The implementations described in the following embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as recited in the appended claims.
It should be noted here that, in the present disclosure, "at least one of several items" covers three parallel cases: any one of the items, any combination of several of the items, and all of the items. For example, "including at least one of A and B" covers three cases: (1) including A; (2) including B; (3) including A and B. Likewise, "performing at least one of step one and step two" covers three cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
Generally, when a neural network is used for audio denoising, processing in the time-frequency domain tends to work better. Time-frequency-domain processing means transforming the original time-domain waveform into the time-frequency domain with a short-time Fourier transform (STFT), performing a series of operations there, and then transforming the time-frequency-domain signal back to the time domain with an inverse short-time Fourier transform (ISTFT) to obtain the processed waveform. When a neural network performs audio denoising in the time-frequency domain, its input features are typically the Fourier time-frequency spectrum, the mel spectrum, the subband spectrum, and so on. Since the mel spectrum and the subband spectrum are both derived from the Fourier time-frequency spectrum, the time-frequency spectrum carries the most complete time-frequency information; using it as the network's input feature preserves audio quality to the greatest extent and achieves a good noise reduction effect.
For a full-band noise reduction system, using the complete time-frequency spectrum as the training feature hurts network convergence, degrades the quality of the resulting audio, and incurs high complexity. Using psychoacoustically motivated features such as the mel spectrum or BFCC (Bark-scale cepstral coefficients) as training features effectively reduces overall complexity, but blurs the frequency resolution of the high-frequency region, leading to poor noise reduction there. Specifically, for full-band audio noise reduction, the related art mainly offers two solutions. The first performs full-band noise reduction with a 48 kHz model trained on mel-spectrum or BFCC features; because such a model's frequency resolution in the high-frequency region is poor, its noise reduction in the 16-48 kHz region is weak. The second performs full-band noise reduction with a 48 kHz model trained on the Fourier amplitude spectrum; its denoised sound quality in the region below 16 kHz is poor, with low PESQ (Perceptual Evaluation of Speech Quality) scores.
To preserve the denoised audio quality in the region below 16 kHz while improving the denoising effect over the 16-48 kHz high-frequency region, the present disclosure proposes a training method, apparatus, device, and medium for an audio processing model, together with an audio processing method, apparatus, sound box, device, and medium. Specifically, when training the audio processing model, if high-band audio samples (e.g., 48 kHz audio signals) are scarce relative to low-band audio samples (e.g., 16 kHz audio signals), or if the data types of the high-band and low-band samples do not match, the output of a pre-trained audio processing model (e.g., one handling audio below 16 kHz) guides the training of another processing model (e.g., one handling audio in the 16-48 kHz band), so that the other processing model achieves a good denoising effect after training, strengthening the robustness of the overall audio processing model. In addition, when the trained audio processing model processes audio, the magnitude spectrum of the audio signal to be processed is divided into a low-band magnitude spectrum (e.g., below 16 kHz) and a high-band magnitude spectrum (e.g., 16-48 kHz); the two magnitude spectra are fed into the two processing models respectively, and the processing result for the low-band magnitude spectrum guides the processing of the high-band magnitude spectrum. This avoids the degraded audio quality caused by high model complexity and improves the denoising effect in the high-frequency region. Hereinafter, the model training method, audio processing method, apparatus, sound box, device, and medium according to exemplary embodiments of the present disclosure are described in detail with reference to FIGS. 1 to 9.
FIG. 1 is a schematic diagram of the overall system for training and applying an audio processing model according to an exemplary embodiment of the present disclosure.
Referring to FIG. 1, the audio processing model 100 includes a first processing model 110 and a second processing model 120, where the first processing model 110 denoises audio in the low-frequency band (e.g., below 16 kHz) and the second processing model 120 denoises audio in the high-frequency band (e.g., 16-48 kHz). The audio processing model must be trained before use. According to an exemplary embodiment of the present disclosure, the first processing model 110 is trained first, and the second processing model 120 is then trained with the aid of the first processing model 110. Specifically, when training the first processing model 110, the training samples may be pre-processed. For example, a time-frequency transform (e.g., a short-time Fourier transform) is applied to a low-band (e.g., below 16 kHz) original audio signal and to the noisy audio signal obtained by mixing noise into that original signal, converting both from the time domain to the time-frequency domain and yielding the per-frame magnitude and phase information of each signal (both the original and the noisy signal). The magnitude spectrum of the noisy signal may serve as the training feature, and the magnitude spectrum of the original signal as the training label; the feature and label are fed into the first processing model 110 for training, producing the trained first processing model 110. When training the second processing model 120, the training samples are likewise pre-processed: for example, a time-frequency transform (e.g., a short-time Fourier transform) is applied to a full-band (e.g., 48 kHz) original audio signal and to its noisy mixture, yielding the per-frame magnitude and phase information of each signal. The high-band (e.g., 16-48 kHz) magnitude spectrum of the noisy signal may serve as the training feature and the high-band (e.g., 16-48 kHz) magnitude spectrum of the original signal as the training label. During training, the low-band (e.g., below 16 kHz) magnitude spectrum of the noisy signal is fed into the trained first processing model 110 to obtain a denoised result, and that result, together with the high-band magnitude spectrum of the noisy signal and the high-band magnitude spectrum of the original signal, is fed into the second processing model 120 for training, producing the trained second processing model 120.
After the trained audio processing model is obtained, a short-time Fourier transform may be applied to the audio signal to be processed, and the magnitude spectrum of that signal is extracted from the STFT result. The magnitude spectrum is divided over the full band into a low-band magnitude spectrum (e.g., below 16 kHz) and a high-band magnitude spectrum (e.g., 16-48 kHz), and the two magnitude spectra are fed into the first processing model 110 and the second processing model 120 respectively for denoising, yielding estimated magnitude spectra for the low band and the high band. The two estimated magnitude spectra are then combined, and an inverse short-time Fourier transform is applied to obtain the denoised audio signal.
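The inference flow just described (STFT → band split → two models → combine → ISTFT) composes as below. This is a hedged sketch: `low_model` and `high_model` are placeholder callables standing in for the trained first and second processing models, `cut` is the bin index of the 16 kHz split, and the STFT/ISTFT steps are assumed to happen outside this function.

```python
import numpy as np

def denoise_stft(noisy_stft, low_model, high_model, cut):
    """One inference pass over a complex STFT matrix (frames x bins)."""
    mag, phase = np.abs(noisy_stft), np.angle(noisy_stft)
    low_est = low_model(mag[:, :cut])                 # first processing model
    high_est = high_model(low_est, mag[:, cut:])      # guided by the low-band result
    full_est = np.concatenate([low_est, high_est], axis=1)
    return full_est * np.exp(1j * phase)              # complex spectrum for the ISTFT
```

With identity placeholder models, the function simply reassembles the input spectrum, which makes the data flow easy to verify.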
According to the above scheme, by dividing the magnitude spectrum of the audio signal in the time-frequency domain into a low-band magnitude spectrum and a high-band magnitude spectrum and processing the two separately with the first and second processing models of the audio processing model, the time-frequency spectrum can be retained as the neural network's input feature while the high overall complexity that this choice otherwise causes is reduced. This avoids degraded audio quality after denoising and improves the denoising effect in the high-frequency region.
In addition, during the training of the second processing model, the processing result for the low-band magnitude spectrum guides the processing of the high-band magnitude spectrum, so a good denoising effect is obtained even when high-band audio samples are scarce, strengthening the robustness of the overall audio processing model.
FIG. 2 is a flowchart illustrating a training method of an audio processing model according to an exemplary embodiment of the present disclosure. Here, the audio processing model may be a neural network model (e.g., a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc.); its input during training may be the magnitude spectrum of a noisy audio signal (a training sample), and its input during application may be the magnitude spectrum of an audio signal to be processed.
Referring to FIG. 2, in step 201, a first training sample may be acquired, where the first training sample includes the magnitude spectrum of a first noisy audio signal and the magnitude spectrum of a first original audio signal, the first noisy audio signal being obtained by mixing the first original audio signal with a noise signal. Here, the first original audio signal may be, without limitation, a clean audio signal or a reverberant audio signal (e.g., produced by convolving a clean audio signal with a room impulse response). The noise signal may be obtained by downloading from the Internet, recording in the field, or the like. Specifically, the first noisy audio signal may be generated by adding the first original audio signal and the noise signal in the time domain at a given signal-to-noise ratio. In addition, the first original audio signal and the first noisy audio signal may be full-band audio signals (e.g., with a 48 kHz band).
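Mixing the original signal with noise at a given signal-to-noise ratio, as described in this step, can be sketched as follows. This is a minimal numpy version; the power-based scaling convention is one common choice and is an assumption, not necessarily the patent's exact procedure.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add."""
    noise = np.resize(noise, clean.shape)        # loop/trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12        # guard against silent noise
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

The achieved SNR of the mixture matches the requested value up to the tiny regularisation term.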
According to an exemplary embodiment of the present disclosure, the magnitude spectrum of the first noisy audio signal and the magnitude spectrum of the first original audio signal may be obtained as follows. First, a short-time Fourier transform is applied to the first noisy audio signal and to the first original audio signal respectively, yielding their time-frequency-domain representations, from which the per-frame magnitude and phase information of both signals can be obtained. Then, magnitude spectra are extracted from the two time-frequency-domain signals, yielding the magnitude spectrum of the first noisy audio signal and the magnitude spectrum of the first original audio signal.
For example, for a 48 kHz full-band audio signal, if the first original audio signal x and the first noisy audio signal y of length T are x(t) and y(t) in the time domain, where t denotes time and 0 < t ≤ T, then after the short-time Fourier transform, x(t) and y(t) can be expressed in the time-frequency domain as the following formulas (1) and (2):
X48k(n,k) = STFT(x48k(t))    (1)
Y48k(n,k) = STFT(y48k(t))    (2)
where X48k(n,k) denotes the time-frequency-domain signal of the first original audio signal, Y48k(n,k) denotes the time-frequency-domain signal of the first noisy audio signal, x48k(t) denotes the time-domain signal of the first original audio signal, and y48k(t) denotes the time-domain signal of the first noisy audio signal; n is the frame index, 0 < n ≤ N, with N the total number of frames; k is the center-frequency index, 0 < k ≤ K48, with K48 the total number of frequency bands.
Magnitude spectra can be extracted from the time-frequency-domain signal X48k(n,k) of the first original audio signal and the time-frequency-domain signal Y48k(n,k) of the first noisy audio signal respectively, as shown in the following formulas (3) and (4):
MagX48k(n,k) = abs(X48k(n,k))    (3)
MagY48k(n,k) = abs(Y48k(n,k))    (4)
where MagX48k(n,k) denotes the magnitude spectrum of the first original audio signal, and MagY48k(n,k) denotes the magnitude spectrum of the first noisy audio signal.
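Formulas (3) and (4) amount to taking the complex modulus of each STFT bin. A compact numpy sketch, which also keeps the phase since the later reconstruction (formula (9)) reuses the noisy phase:

```python
import numpy as np

def extract_features(stft_matrix):
    """Split a complex STFT matrix (frames x bins) into magnitude and phase."""
    mag = np.abs(stft_matrix)      # MagY(n, k) = abs(Y(n, k)), formulas (3)/(4)
    phase = np.angle(stft_matrix)  # PhaY(n, k), radians in (-pi, pi]
    return mag, phase
```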
Referring back to FIG. 2, in step 202, a first noisy magnitude spectrum for a first frequency band and a second noisy magnitude spectrum for a second frequency band may be obtained from the magnitude spectrum of the first noisy audio signal. Here, the first and second frequency bands are obtained by dividing the full band of the magnitude spectrum of the first noisy audio signal; specifically, the first frequency band is 0-16 kHz and the second frequency band is 16-48 kHz. For example, for a 48 kHz full-band audio signal, the magnitude spectrum MagY48k(n,k) of the first noisy audio signal may be divided into a 0-16 kHz part (i.e., the first noisy magnitude spectrum) and a 16-48 kHz part (i.e., the second noisy magnitude spectrum), denoted MagY0-16k(n,k) and MagY16-48k(n,k), respectively.
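The band split in this step is a column slice of the magnitude matrix at the bin corresponding to 16 kHz. A minimal sketch, assuming a one-sided FFT layout (bin b covers frequency b·sr/n_fft); the sample rate and split frequency defaults follow the 48 kHz / 16 kHz example in the text:

```python
import numpy as np

def split_bands(mag, sr=48000, split_hz=16000):
    """Split a magnitude spectrum (frames x bins) at `split_hz`.

    The cut index is split_hz / (sr / n_fft) = split_hz * n_fft / sr.
    """
    n_fft = 2 * (mag.shape[1] - 1)          # one-sided bins = n_fft // 2 + 1
    cut = int(split_hz * n_fft / sr)
    return mag[:, :cut], mag[:, cut:]       # low band, high band
```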
In step 203, the first noisy magnitude spectrum may be input into a first processing model of the audio processing model to obtain a first estimated denoised magnitude spectrum for the first frequency band, the first processing model having been trained in advance. Here, the first processing model is, for example, a 16 kHz audio processing model; MagY0-16k(n,k) may be input into it to obtain the denoised MagY0-16k_p(n,k).
According to an exemplary embodiment of the present disclosure, the first processing model may be trained in advance as follows. First, a second training sample is acquired, where the second training sample includes the magnitude spectrum of a second noisy audio signal and the magnitude spectrum of a second original audio signal, the second noisy audio signal being obtained by mixing the second original audio signal with a noise signal, and the band of the second original audio signal being the first frequency band. Here, the first processing model may be a neural network model (e.g., a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc.). The second original audio signal may be, without limitation, a clean audio signal or a reverberant audio signal (e.g., produced by convolving a clean audio signal with a room impulse response). The noise signal may be obtained by downloading from the Internet, recording in the field, or the like. Specifically, the second noisy audio signal may be generated by adding the second original audio signal and the noise signal in the time domain at a given signal-to-noise ratio. In addition, the band of the second original audio signal is contained in the band of the first original audio signal (equivalently, the band of the second noisy audio signal is contained in the band of the first noisy audio signal); for example, when the band of the first original audio signal is 0-48 kHz, the band of the second original audio signal may be 0-16 kHz (i.e., the first frequency band).
According to an exemplary embodiment of the present disclosure, the magnitude spectrum of the second noisy audio signal and the magnitude spectrum of the second original audio signal may be obtained as follows: first, a short-time Fourier transform is applied to the second noisy audio signal and to the second original audio signal respectively, yielding their time-frequency-domain representations; then magnitude spectra are extracted from those representations, yielding the magnitude spectrum of the second noisy audio signal and the magnitude spectrum of the second original audio signal. Here, when applying the short-time Fourier transform to the second noisy audio signal and the second original audio signal, the same STFT length and frame-shift length may be chosen as were used for the first noisy audio signal and the first original audio signal.
For example, for a 16 kHz full-band audio signal, if the second original audio signal x and the second noisy audio signal y of length T are x(t) and y(t) in the time domain, where t denotes time and 0 < t ≤ T, then after the short-time Fourier transform, x(t) and y(t) can be expressed in the time-frequency domain as the following formulas (5) and (6):
X16k(n,k) = STFT(x16k(t))    (5)
Y16k(n,k) = STFT(y16k(t))    (6)
where X16k(n,k) denotes the time-frequency-domain signal of the second original audio signal, Y16k(n,k) denotes the time-frequency-domain signal of the second noisy audio signal, x16k(t) denotes the time-domain signal of the second original audio signal, and y16k(t) denotes the time-domain signal of the second noisy audio signal; n is the frame index, 0 < n ≤ N, with N the total number of frames; k is the center-frequency index, 0 < k ≤ K16, with K16 the total number of frequency bands.
Magnitude spectra can be extracted from the time-frequency-domain signal X16k(n,k) of the second original audio signal and the time-frequency-domain signal Y16k(n,k) of the second noisy audio signal respectively, as shown in the following formulas (7) and (8):
MagX16k(n,k) = abs(X16k(n,k))    (7)
MagY16k(n,k) = abs(Y16k(n,k))    (8)
where MagX16k(n,k) denotes the magnitude spectrum of the second original audio signal, and MagY16k(n,k) denotes the magnitude spectrum of the second noisy audio signal. Here, MagY16k(n,k) may be used as the input of the first processing model, and MagX16k(n,k) as its training label.
Then, the magnitude spectrum of the second noisy audio signal may be input into the first processing model to obtain a third estimated denoised magnitude spectrum for the first frequency band. Here, for example, MagY0-16k(n,k) is input into the 16 kHz audio processing model to obtain the estimated denoised MagY0-16k_p(n,k).
Afterwards, a second loss is obtained based on the third estimated denoised magnitude spectrum and the magnitude spectrum of the second original audio signal, and finally the first processing model is trained by adjusting its model parameters according to the second loss. The magnitude spectrum of the second original audio signal may serve as the training label; after the first processing model produces the third estimated denoised magnitude spectrum for the first frequency band, a pre-designed loss function is evaluated on these two quantities, and the model parameters of the first processing model are iteratively updated by back-propagation according to the determined loss. During the training of the first processing model, batches of audio training samples may be used to adjust (or update) its model parameters, iteratively minimizing the value of the loss function until the first processing model converges.
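The loss-driven parameter update described here can be illustrated with a deliberately tiny stand-in: the patent does not specify the network architecture or loss, so this sketch assumes a single per-bin gain as the "model" and a mean-squared error between the estimated and label magnitudes, trained by plain gradient descent on synthetic spectra.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the spectra: label (MagX16k) and feature (MagY16k).
n_bins, n_frames = 8, 200
clean = rng.random((n_frames, n_bins))
noisy = clean + 0.3 * rng.random((n_frames, n_bins))

w = np.ones(n_bins)                 # per-bin gain: the "model parameters"
lr = 0.5
initial_loss = np.mean((noisy - clean) ** 2)
for _ in range(500):
    est = noisy * w                                       # estimated denoised magnitude
    grad = 2 * np.mean((est - clean) * noisy, axis=0)     # d(MSE)/dw per bin
    w -= lr * grad                                        # gradient step on the loss
final_loss = np.mean((noisy * w - clean) ** 2)
```

A real implementation would replace the per-bin gain with a DNN/CNN/RNN and the hand-written gradient with automatic differentiation, but the minimize-loss-until-convergence loop has the same shape.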
Referring back to FIG. 2, in step 204, the first estimated denoised magnitude spectrum and the second noisy magnitude spectrum may be input into a second processing model of the audio processing model to obtain a second estimated denoised magnitude spectrum for the second frequency band. Here, the second processing model is, for example, a 16-48 kHz audio processing model; the denoised output MagY0-16k_p(n,k) of the first processing model for the first frequency band and the second noisy magnitude spectrum MagY16-48k(n,k) serve as the inputs of the 16-48 kHz audio processing model, and the second original magnitude spectrum MagX16-48k(n,k) serves as the training label for training the 16-48 kHz audio processing model.
In step 205, a first loss may be obtained based on the second estimated denoised magnitude spectrum and a second original magnitude spectrum, the second original magnitude spectrum being the part of the magnitude spectrum of the first original audio signal that covers the second frequency band (e.g., denoted MagX16-48k(n,k)).
In step 206, the audio processing model may be trained by adjusting the model parameters of the second processing model according to the first loss. Here, the second original magnitude spectrum may serve as the training label; after the second processing model produces the second estimated denoised magnitude spectrum for the second frequency band, a pre-designed loss function is evaluated on these two quantities, and the model parameters of the second processing model are iteratively updated by back-propagation according to the determined loss. During the training of the second processing model, batches of audio training samples may be used to adjust (or update) its model parameters, iteratively minimizing the value of the loss function until the second processing model converges.
According to the above scheme, because during training the first training samples (e.g., 48 kHz audio) may be significantly scarcer than the second training samples (e.g., 16 kHz audio), feeding the (more accurate) output of the trained first processing model into the second processing model guides the training of the second processing model. Therefore, even with relatively few first training samples, the second processing model obtained by training achieves a good denoising effect when processing audio, strengthening the robustness of the overall audio processing model.
FIG. 3 is a flowchart illustrating an audio processing method according to an exemplary embodiment of the present disclosure.
Referring to FIG. 3, in step 301, an audio signal to be processed may be acquired. Here, the audio signal to be processed is a full-band audio signal (e.g., with a 48 kHz band), and the processing consists of denoising the audio signal.
In step 302, the acquired audio signal to be processed may be processed with the audio processing model to obtain a processed audio signal, the audio processing model having been trained with the aforementioned training method of an audio processing model.
According to an exemplary embodiment of the present disclosure, for the process of performing audio processing on the acquired audio signal with the audio processing model, reference may be made to FIG. 4, which is a flowchart illustrating audio processing with an audio processing model according to an exemplary embodiment of the present disclosure.
Referring to FIG. 4, in step 401, a first magnitude spectrum for the first frequency band and a second magnitude spectrum for the second frequency band may be obtained from the magnitude spectrum of the audio signal to be processed.
According to an exemplary embodiment of the present disclosure, the first and second frequency bands are obtained by dividing the full band of the magnitude spectrum of the audio signal to be processed; specifically, the first frequency band is 0-16 kHz and the second frequency band is 16-48 kHz. In some embodiments, the magnitude spectrum of the audio signal to be processed may be obtained as follows: first, a short-time Fourier transform (STFT) is applied to the audio signal, yielding a time-frequency-domain audio signal; then the magnitude spectrum is extracted from that signal, yielding the magnitude spectrum of the audio signal. Specifically, the audio signal to be processed is a time-domain signal; to process it further, the short-time Fourier transform converts it from the time domain to the time-frequency domain, yielding a time-frequency spectrum that contains the time, frequency, magnitude, and phase information of the signal, from which the magnitudes corresponding to different frequencies can be extracted to form the magnitude spectrum. For the execution of the short-time Fourier transform, reference may be made to detailed descriptions in the related art, which are not repeated here. In some embodiments, a fast Fourier transform (FFT) may instead be applied to the audio signal to obtain the time-frequency-domain audio signal, without limitation.
In step 402, the first magnitude spectrum may be input into the first processing model of the audio processing model to obtain a first estimated magnitude spectrum for the first frequency band. Here, the audio processing model is a pre-trained neural network model (e.g., a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc.); its training process has been described above with reference to FIG. 2 and is not repeated here. For a 48 kHz full-band audio signal, the 0-16 kHz magnitude spectrum may be input into the first processing model to obtain a first estimated magnitude spectrum for the 0-16 kHz band, where the first estimated magnitude spectrum is the estimated denoised version of the 0-16 kHz magnitude spectrum.
In step 403, the second magnitude spectrum and the first estimated magnitude spectrum may be input into the second processing model of the audio processing model to obtain a second estimated magnitude spectrum for the second frequency band. For example, for a 48 kHz full-band audio signal, the 16-48 kHz magnitude spectrum and the estimated denoised 0-16 kHz magnitude spectrum output by the first processing model may be input into the second processing model to obtain the estimated denoised 16-48 kHz magnitude spectrum.
In step 404, the first estimated magnitude spectrum and the second estimated magnitude spectrum may be combined to obtain an estimated magnitude spectrum. Here, for a 48 kHz full-band audio signal, the result is the estimated denoised magnitude spectrum over the 48 kHz band.
In step 405, the processed audio signal may be obtained based on the estimated magnitude spectrum.
According to an exemplary embodiment of the present disclosure, to obtain a concrete and intuitive denoised full-band audio signal, the estimated magnitude spectrum may be multiplied by the phase corresponding to it, and an inverse short-time Fourier transform may then be applied to the product to obtain the processed audio signal in the time domain. Here, for a 48 kHz full-band audio signal, the denoised audio signal may be expressed, for example and without limitation, as:
X0(t) = ISTFT(MagY48k_p(n,k) * PhaY48k(n,k))    (9)
where X0(t) denotes the denoised audio signal; n denotes the frame index of the audio signal, 0 < n ≤ N (N being the total number of frames); k denotes the center-frequency index of the audio signal, 0 < k ≤ K48 (K48 being the total number of frequency bands); MagY48k_p(n,k) denotes the estimated magnitude spectrum at time-frequency point (n,k); and PhaY48k(n,k) denotes the phase at time-frequency point (n,k).
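The magnitude-times-phase product in formula (9) can be sketched as follows, representing the retained noisy phase PhaY48k(n,k) as unit-modulus complex numbers (a standard convention; the patent does not fix a representation):

```python
import numpy as np

def apply_noisy_phase(mag_est, noisy_stft):
    """Formula (9) before the ISTFT: pair the estimated magnitude
    MagY48k_p(n,k) with the phase of the noisy STFT, PhaY48k(n,k)."""
    return mag_est * np.exp(1j * np.angle(noisy_stft))
```

As a sanity check, pairing a spectrum's own magnitude with its own phase reproduces the original complex spectrum.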
In other embodiments, the processed audio signal may also be obtained directly from the first estimated magnitude spectrum and the second estimated magnitude spectrum, without limitation. Specifically, the first estimated magnitude spectrum may be multiplied by the phase corresponding to it to obtain a first product, and an inverse short-time Fourier transform of the first product yields a processed first audio signal; similarly, the same operations may be applied to the second estimated magnitude spectrum to obtain a processed second audio signal. Combining the first audio signal and the second audio signal yields the final desired processed audio signal.
FIG. 5 is a block diagram illustrating a training apparatus 500 for an audio processing model according to an exemplary embodiment of the present disclosure.
Referring to FIG. 5, the training apparatus 500 for an audio processing model according to an exemplary embodiment of the present disclosure may include a first training sample acquisition unit 501, a noisy magnitude spectrum acquisition unit 502, a first estimated denoised magnitude spectrum determination unit 503, a second estimated denoised magnitude spectrum determination unit 504, a first loss acquisition unit 505, and a model parameter adjustment unit 506.
The first training sample acquisition unit 501 may acquire a first training sample, where the first training sample includes the magnitude spectrum of a first noisy audio signal and the magnitude spectrum of a first original audio signal, the first noisy audio signal being obtained by mixing the first original audio signal with a noise signal. Here, the first original audio signal may be, without limitation, a clean audio signal or a reverberant audio signal (e.g., produced by convolving a clean audio signal with a room impulse response). The noise signal may be obtained by downloading from the Internet, recording in the field, or the like. Specifically, the first noisy audio signal may be generated by adding the first original audio signal and the noise signal in the time domain at a given signal-to-noise ratio.
According to an exemplary embodiment of the present disclosure, the magnitude spectrum of the first noisy audio signal and the magnitude spectrum of the first original audio signal may be obtained as follows. First, a short-time Fourier transform is applied to the first noisy audio signal and to the first original audio signal respectively, yielding their time-frequency-domain representations, from which the per-frame magnitude and phase information of both signals can be obtained. Then, magnitude spectra are extracted from the two time-frequency-domain signals, yielding the magnitude spectrum of the first noisy audio signal and the magnitude spectrum of the first original audio signal.
For example, for a 48 kHz full-band audio signal, if the first original audio signal x and the first noisy audio signal y of length T are x(t) and y(t) in the time domain, where t denotes time and 0 < t ≤ T, then after the short-time Fourier transform, x(t) and y(t) can be expressed in the time-frequency domain as formulas (1) and (2), respectively.
Magnitude spectra can be extracted from the time-frequency-domain signal X48k(n,k) of the first original audio signal and the time-frequency-domain signal Y48k(n,k) of the first noisy audio signal respectively, yielding the magnitude spectrum MagX48k(n,k) of the first original audio signal and the magnitude spectrum MagY48k(n,k) of the first noisy audio signal (as shown in formulas (3) and (4), respectively).
The noisy magnitude spectrum acquisition unit 502 may obtain a first noisy magnitude spectrum for the first frequency band and a second noisy magnitude spectrum for the second frequency band from the magnitude spectrum of the first noisy audio signal. Here, the first and second frequency bands are obtained by dividing the full band of the magnitude spectrum of the first noisy audio signal. Specifically, because the audio denoising schemes of the related art produce poor denoised quality in the region below 16 kHz or a poor denoising effect in the 16-48 kHz high-frequency region, a 48 kHz full-band audio signal may be divided into a first frequency band of 0-16 kHz and a second frequency band of 16-48 kHz. For example, the magnitude spectrum MagY48k(n,k) of the first noisy audio signal may be divided into a 0-16 kHz part (i.e., the first noisy magnitude spectrum) and a 16-48 kHz part (i.e., the second noisy magnitude spectrum), denoted MagY0-16k(n,k) and MagY16-48k(n,k), respectively.
The first estimated denoised magnitude spectrum determination unit 503 may input the first noisy magnitude spectrum into the first processing model of the audio processing model to obtain a first estimated denoised magnitude spectrum for the first frequency band, the first processing model having been trained in advance. Here, the first processing model is, for example, a 16 kHz audio processing model; MagY0-16k(n,k) may be input into it to obtain the denoised MagY0-16k_p(n,k). For the training process of the first processing model, reference may be made to the foregoing description of the flowchart of the training method of an audio processing model shown in FIG. 2, which is not repeated here.
The second estimated denoised magnitude spectrum determination unit 504 may input the first estimated denoised magnitude spectrum and the second noisy magnitude spectrum into the second processing model of the audio processing model to obtain a second estimated denoised magnitude spectrum for the second frequency band. Here, the second processing model is, for example, a 16-48 kHz audio processing model; the denoised output MagY0-16k_p(n,k) of the first processing model for the first frequency band and the second noisy magnitude spectrum MagY16-48k(n,k) serve as the inputs of the 16-48 kHz audio processing model, and the second original magnitude spectrum MagX16-48k(n,k) serves as the training label for training the 16-48 kHz audio processing model.
The first loss acquisition unit 505 may obtain a first loss based on the second estimated denoised magnitude spectrum and the second original magnitude spectrum, the second original magnitude spectrum being the part of the magnitude spectrum of the first original audio signal that covers the second frequency band (e.g., denoted MagX16-48k(n,k)).
The model parameter adjustment unit 506 may train the audio processing model by adjusting the model parameters of the second processing model according to the first loss. Here, the second original magnitude spectrum may serve as the training label; after the second processing model produces the second estimated denoised magnitude spectrum for the second frequency band, a pre-designed loss function is evaluated on these two quantities, and the model parameters of the second processing model are iteratively updated by back-propagation according to the determined loss. During the training of the second processing model, batches of audio training samples may be used to adjust (or update) its model parameters, iteratively minimizing the value of the loss function until the second processing model converges.
FIG. 6 is a block diagram illustrating an audio processing apparatus 600 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 6, the audio processing apparatus 600 according to an exemplary embodiment of the present disclosure may include an audio signal acquisition unit 610 and an audio signal processing unit 620.
The audio signal acquisition unit 610 may acquire an audio signal to be processed. Here, the audio signal to be processed is a 48 kHz full-band audio signal, and the processing consists of denoising the audio signal.
The audio signal processing unit 620 may process the acquired audio signal to be processed with the audio processing model to obtain a processed audio signal, the audio processing model having been trained with the aforementioned training method of an audio processing model.
According to an exemplary embodiment of the present disclosure, the audio signal processing unit 620 may obtain, from the magnitude spectrum of the audio signal to be processed, a first magnitude spectrum for the first frequency band and a second magnitude spectrum for the second frequency band; input the first magnitude spectrum into the first processing model of the audio processing model to obtain a first estimated magnitude spectrum for the first frequency band; input the second magnitude spectrum and the first estimated magnitude spectrum into the second processing model of the audio processing model to obtain a second estimated magnitude spectrum for the second frequency band; combine the first and second estimated magnitude spectra into an estimated magnitude spectrum; and obtain the processed audio signal based on the estimated magnitude spectrum.
According to an exemplary embodiment of the present disclosure, the audio signal processing unit 620 may further include a magnitude spectrum acquisition unit 621, a first estimated magnitude spectrum determination unit 622, a second estimated magnitude spectrum determination unit 623, an estimated magnitude spectrum determination unit 624, and an audio signal acquisition unit 625.
FIG. 7 is a block diagram illustrating the audio signal processing unit 620 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 7, the magnitude spectrum acquisition unit 621 may obtain, from the magnitude spectrum of the audio signal to be processed, a first magnitude spectrum for the first frequency band and a second magnitude spectrum for the second frequency band. The first and second frequency bands are obtained by dividing the full band of the magnitude spectrum of the audio signal to be processed. Specifically, because the audio denoising schemes of the related art produce poor denoised quality in the region below 16 kHz or a poor denoising effect in the 16-48 kHz high-frequency region, a 48 kHz full-band audio signal may be divided into a first frequency band of 0-16 kHz and a second frequency band of 16-48 kHz. In some embodiments, the audio processing apparatus 600 may further include an audio signal magnitude spectrum acquisition unit 603 (not shown in FIG. 6), which may apply a short-time Fourier transform (STFT) to the audio signal to be processed to obtain a time-frequency-domain audio signal, and then extract the magnitude spectrum from that signal to obtain the magnitude spectrum of the audio signal. Specifically, the audio signal to be processed is a time-domain signal; to process it further, the short-time Fourier transform converts it from the time domain to the time-frequency domain, yielding a time-frequency spectrum that contains the time, frequency, magnitude, and phase information of the signal, from which the magnitudes corresponding to different frequencies can be extracted to form the magnitude spectrum. For the execution of the short-time Fourier transform, reference may be made to detailed descriptions in the related art, which are not repeated here. In some embodiments, a fast Fourier transform (FFT) may instead be applied to the audio signal to obtain the time-frequency-domain audio signal, without limitation.
The first estimated magnitude spectrum determination unit 622 may input the first magnitude spectrum into the first processing model of the audio processing model to obtain a first estimated magnitude spectrum for the first frequency band. Here, the audio processing model is a pre-trained neural network model; for its training process, reference may be made to the foregoing description of the flowchart of the training method of an audio processing model shown in FIG. 2. For a 48 kHz full-band audio signal, the first estimated magnitude spectrum determination unit 622 may input the 0-16 kHz magnitude spectrum into the first processing model to obtain a first estimated magnitude spectrum for the 0-16 kHz band, where the first estimated magnitude spectrum is the estimated denoised version of the 0-16 kHz magnitude spectrum.
The second estimated magnitude spectrum determination unit 623 may input the second magnitude spectrum and the first estimated magnitude spectrum into the second processing model of the audio processing model to obtain a second estimated magnitude spectrum for the second frequency band. Here, for a 48 kHz full-band audio signal, the 16-48 kHz magnitude spectrum and the estimated denoised 0-16 kHz magnitude spectrum output by the first processing model may be input into the second processing model to obtain the estimated denoised 16-48 kHz magnitude spectrum.
The estimated magnitude spectrum determination unit 624 may combine the first estimated magnitude spectrum and the second estimated magnitude spectrum to obtain an estimated magnitude spectrum. Here, for a 48 kHz full-band audio signal, the result is the estimated denoised magnitude spectrum over the 48 kHz band.
The audio signal acquisition unit 625 may obtain the processed audio signal based on the estimated magnitude spectrum.
According to an exemplary embodiment of the present disclosure, to obtain a concrete and intuitive denoised full-band audio signal, the audio signal acquisition unit 625 may multiply the estimated magnitude spectrum by the phase corresponding to it and apply an inverse short-time Fourier transform to the product to obtain the processed audio signal in the time domain. Here, for a 48 kHz full-band audio signal, the denoised audio signal may be expressed, for example and without limitation, by formula (9).
FIG. 8 is a block diagram of a smart speaker 800 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 8, the smart speaker 800 according to an exemplary embodiment of the present disclosure includes the audio processing apparatus 600 according to the present disclosure. In practice, the smart speaker 800 may be applied, for example, to scenarios such as video conferencing. In a video conferencing scenario, the smart speaker 800 may include a signal acquisition module, a signal processing module, and a signal presentation module: the signal acquisition module collects audio signals from the environment; the signal processing module processes the collected audio signals (including, for example, denoising them with the audio processing method according to the exemplary embodiments of the present disclosure); and the signal presentation module plays back the audio signals processed by the signal processing module into the environment. The smart speaker 800 may of course also be applied to other scenarios, such as home environments, without limitation; its structure may differ across usage environments. To be clear, any smart speaker that performs audio denoising with the audio processing method according to the present disclosure falls within the scope intended to be protected by the present disclosure.
FIG. 9 is a block diagram illustrating an electronic device 900 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 9, the electronic device 900 includes at least one memory 901 and at least one processor 902. The at least one memory 901 stores a set of computer-executable instructions which, when executed by the at least one processor 902, performs the training method of the audio processing model or the audio processing method according to the exemplary embodiments of the present disclosure.
As an example, the electronic device 900 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above instruction set. Here, the electronic device 900 need not be a single electronic device; it may also be any collection of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The electronic device 900 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device interconnected locally or remotely (for example, via wireless transmission) through an interface.
In the electronic device 900, the processor 902 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor 902 may execute instructions or code stored in the memory 901, and the memory 901 may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transmission protocol.
The memory 901 may be integrated with the processor 902, for example, with RAM or flash memory arranged within an integrated-circuit microprocessor or the like. In addition, the memory 901 may include a standalone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 901 and the processor 902 may be operatively coupled, or may communicate with each other, for example, through an I/O port or a network connection, so that the processor 902 can read files stored in the memory.
In addition, the electronic device 900 may further include a video display (such as a liquid-crystal display) and a user interaction interface (such as a keyboard, a mouse, or a touch input device). All components of the electronic device 900 may be connected to one another via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the training method of the audio processing model or the audio processing method according to the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical-disc storage, hard disk drive (HDD), solid-state drive (SSD), card memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, hard disk, solid-state disk, and any other device configured to store, in a non-transitory manner, a computer program and any associated data, data files, and data structures and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium can run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed over networked computer systems so that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer program product, the instructions in which are executable by a processor of a computer device to perform the training method of the audio processing model or the audio processing method according to the exemplary embodiments of the present disclosure.
According to the training method, apparatus, device, and medium of the audio processing model of the present disclosure, when high-band audio samples (e.g., 48 kHz audio signals) are relatively scarce compared with low-band audio samples (e.g., 16 kHz audio signals), or when the data types of the high-band audio samples and the low-band audio samples do not match, the output of a pre-trained audio processing model (e.g., one for processing audio below 16 kHz) is used to guide the training of another processing model (e.g., one for processing audio in the 16-48 kHz band). This enables the other processing model to achieve a good noise-reduction effect after training, thereby enhancing the robustness of the overall audio processing model.
In addition, according to the audio processing method, apparatus, device, speaker, and medium of the present disclosure, the amplitude spectrum of the audio signal to be processed is divided into an amplitude spectrum of a low-band range (e.g., below 16 kHz) and an amplitude spectrum of a high-band range (e.g., 16-48 kHz); the two amplitude spectra are input into the two processing models respectively, and the audio-processing result for the low-band amplitude spectrum guides the audio processing of the high-band amplitude spectrum. This avoids the degradation of sound quality after noise reduction caused by high model complexity and improves the noise-reduction effect on high-frequency audio.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed by the present disclosure. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210723242.8A CN115101084A (en) | 2022-06-21 | 2022-06-21 | Model training method, audio processing method, device, sound box, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115101084A true CN115101084A (en) | 2022-09-23 |
Family
ID=83293308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210723242.8A Pending CN115101084A (en) | 2022-06-21 | 2022-06-21 | Model training method, audio processing method, device, sound box, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115101084A (en) |
2022-06-21: CN CN202210723242.8A patent/CN115101084A/en, active, Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111916103A (en) * | 2020-08-11 | 2020-11-10 | 南京拓灵智能科技有限公司 | Audio noise reduction method and device |
CN113299308A (en) * | 2020-09-18 | 2021-08-24 | 阿里巴巴集团控股有限公司 | Voice enhancement method and device, electronic equipment and storage medium |
CN114566180A (en) * | 2020-11-27 | 2022-05-31 | 北京搜狗科技发展有限公司 | Voice processing method and device for processing voice |
WO2022110802A1 (en) * | 2020-11-27 | 2022-06-02 | 北京搜狗科技发展有限公司 | Speech processing method and apparatus, and apparatus for processing speech |
US20220189497A1 (en) * | 2020-12-15 | 2022-06-16 | Google Llc | Bone conduction headphone speech enhancement systems and methods |
CN113921032A (en) * | 2021-10-11 | 2022-01-11 | 北京达佳互联信息技术有限公司 | Training method and device for audio processing model, and audio processing method and device |
Non-Patent Citations (2)
Title |
---|
GUOCHEN YU ET AL.: "Optimizing Shoulder to Shoulder: A Coordinated Sub-Band Fusion Model for Real-Time Full-Band Speech Enhancement", ARXIV:2203.16033V2, 15 June 2022 (2022-06-15), pages 1 - 5 * |
XU ZHANG ET AL.: "A two-step backward compatible fullband speech enhancement system", 2022 IEEE INTERNATIONAL CONFERENCE ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 27 April 2022 (2022-04-27), pages 1 - 5 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113241088B (en) | Training method and device of voice enhancement model and voice enhancement method and device | |
US9485597B2 (en) | System and method of processing a sound signal including transforming the sound signal into a frequency-chirp domain | |
WO2022012195A1 (en) | Audio signal processing method and related apparatus | |
CN110634499A (en) | Neural network for speech denoising with deep feature loss training | |
WO2021114733A1 (en) | Noise suppression method for processing at different frequency bands, and system thereof | |
CN112652290B (en) | Method for generating reverberation audio signal and training method of audio processing model | |
CN113555031B (en) | Training method and device of voice enhancement model, and voice enhancement method and device | |
CN112309426B (en) | Voice processing model training method and device and voice processing method and device | |
CN113593594B (en) | Training method and equipment for voice enhancement model and voice enhancement method and equipment | |
CN113284507B (en) | Training method and device for voice enhancement model and voice enhancement method and device | |
CN114121029B (en) | Speech enhancement model training method and device and speech enhancement method and device | |
JP2018506078A (en) | System and method for speech restoration | |
CN112712816B (en) | Training method and device for voice processing model and voice processing method and device | |
CN113345460B (en) | Audio signal processing method, device, device and storage medium | |
CN115223583A (en) | Voice enhancement method, device, equipment and medium | |
CN114038476B (en) | Audio signal processing method and device | |
Fan et al. | Specmnet: Spectrum mend network for monaural speech enhancement | |
JP6891144B2 (en) | Generation device, generation method and generation program | |
CN113990343B (en) | Training method and device of speech noise reduction model and speech noise reduction method and device | |
Yu et al. | A hybrid speech enhancement system with DNN based speech reconstruction and Kalman filtering | |
CN115101084A (en) | Model training method, audio processing method, device, sound box, equipment and medium | |
CN114242110B (en) | Model training method, audio processing method, device, equipment, medium and product | |
JP2019090930A (en) | Sound source enhancement device, sound source enhancement learning device, sound source enhancement method and program | |
CN114694683B (en) | Speech enhancement evaluation method, speech enhancement evaluation model training method and device | |
CN116129928B (en) | A near-end speech intelligibility enhancement method and system for broadcast communication scenarios |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||