CN114333867A - Audio data processing method and device, call method, audio processing chip, electronic device and computer readable storage medium

Info

Publication number: CN114333867A
Application number: CN202011073889.8A
Authority: CN
Inventors: 王子腾; 纳跃跃; 马骁; 田彪; 付强; 李韵; 刘章
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-10-09
Filing date: 2020-10-09
Publication date: 2022-04-12
Anticipated expiration: 2040-10-09
Also published as: CN114333867B

Abstract

The present application discloses an audio data processing method and apparatus, a calling method, an audio processing chip, an electronic device, and a computer-readable storage medium. The method includes: performing linear filtering processing on first audio data and second audio data to obtain linear echo data; determining linear output data according to the second audio data and the linear echo data; determining first state data and second state data according to the first audio data and the second audio data; and determining a weighting factor according to the first state data and the second state data, so as to perform weighted filtering processing on the linear output data and obtain third audio data to be sent to the first calling party. The present application can therefore perform weighted filtering, or adopt a corresponding suppression scheme, based on the current call state, so that residual echo suppression takes into account the component characteristics of the residual echo in different call states, which improves the residual echo suppression effect and effectively improves call quality.

Description

Audio data processing method and apparatus, calling method, audio processing chip, electronic device and computer readable storage medium

Technical Field

The present application relates to the technical field of audio data processing, and in particular to an audio data processing method and apparatus, a calling method, an audio processing chip, an electronic device, and a computer-readable storage medium.

Background

As audio call technology is applied in more and more scenarios, users' requirements for call quality keep rising. In an ordinary call, after one party speaks, the voice is collected by that party's calling device, transmitted to the other party, and played by the voice playback device of the other party's calling device, so that the other party can hear it. In this process, when the voice audio of one party is played by the voice playback device of the other party's calling device, an echo is produced in the space where the other party is located: the played voice audio is reflected by the surfaces of walls and objects in that space and, when the other party responds by voice, is picked up by the voice acquisition device of the other party's calling device and is thus transmitted back to the first party as if it were the other party's voice. As a result, a party hears, while speaking, its own voice after it has been transmitted to the other party and sent back. In other words, a call echo is produced, and such an echo seriously degrades the call experience.

In the prior art, the energy of the residual echo is usually estimated from information such as the delay estimation result of the audio data and the output of the linear filter, and the spectral gain of the signal after linear echo processing is adjusted accordingly. However, existing echo processing schemes consider only the residual energy of the echo in the audio data of the call. As the application scenarios of audio call technology become increasingly diverse, echoes exhibit different characteristics in different scenarios and environments, so suppressing echo against a uniform residual-energy benchmark can hardly meet users' requirements for call quality in varied and complex scenarios.

Summary of the Invention

Embodiments of the present application provide an audio data processing method and apparatus, a calling method, an audio processing chip, an electronic device, and a computer-readable storage medium, so as to overcome the poor residual echo cancellation of the prior art.

To achieve the above purpose, an embodiment of the present application provides an audio data processing method, including:

performing linear filtering processing on first audio data sent by a first calling party and second audio data collected by a second calling party to obtain linear echo data, wherein the first calling party and the second calling party are in the same call activity;

determining linear output data according to the second audio data collected by the second calling party and the linear echo data;

determining, according to the first audio data and the second audio data, first state data and second state data for identifying the state of the audio call between the first calling party and the second calling party, wherein the first state data is the average value of the correlation coefficients of the first audio data and the second audio data in the respective sub-bands, and the second state data is the average value of the ratios of the linear echo data to the second audio data in the respective sub-bands; and

determining, according to the first state data and the second state data, a weighting factor related to the call state, so as to perform weighted filtering processing on the linear output data and obtain third audio data to be sent to the first calling party.

An embodiment of the present application further provides an audio data processing method, including:

determining, according to the first audio data and the second audio data, first state data and second state data for identifying the state of the audio call between the first calling party and the second calling party, wherein the first state data is the average value of the correlation coefficients of the first audio data and the second audio data in the respective sub-bands, and the second state data is the average value of the ratios of the linear echo data to the second audio data in the respective sub-bands;

selecting, according to the first state data and the second state data, a signal reduction amplitude value corresponding to the first state data and the second state data; and

reducing the signal amplitude of the linear output audio data according to the signal reduction amplitude value, so as to obtain audio data to be sent to the first calling party.
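
The selection and reduction steps of this second method can be pictured as a table lookup followed by a gain. The sketch below is only a hypothetical illustration: the application gives no concrete thresholds or reduction values, so the table rows, the decibel figures, and the names `ATTENUATION_TABLE`, `select_attenuation_db`, and `reduce_amplitude` are all assumptions.

```python
# Hypothetical mapping table: (min first state, min second state, attenuation in dB).
# Rows are checked in order; the first row whose two thresholds are both met wins.
ATTENUATION_TABLE = [
    (0.8, 0.8, 30.0),  # strong correlation and echo ratio: output is mostly echo
    (0.4, 0.4, 12.0),  # moderate values: echo mixed with near-end speech
    (0.0, 0.0, 0.0),   # weak values: little echo, leave the signal alone
]


def select_attenuation_db(first_state, second_state):
    """Return the signal reduction amplitude value (in dB) for the given state data."""
    for min_first, min_second, attenuation_db in ATTENUATION_TABLE:
        if first_state >= min_first and second_state >= min_second:
            return attenuation_db
    return 0.0  # no row matched (e.g. negative correlation): do not attenuate


def reduce_amplitude(linear_output, attenuation_db):
    """Scale every sample of the linear output down by attenuation_db decibels."""
    gain = 10.0 ** (-attenuation_db / 20.0)
    return [gain * sample for sample in linear_output]
```

A table of this kind trades flexibility for predictability: each call state maps to a fixed, pre-tuned reduction instead of a continuously computed weight.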

An embodiment of the present application further provides a calling method, including:

receiving first audio data;

playing the first audio data;

performing audio collection processing to generate second audio data, wherein the second audio data includes at least audio data collected while the first audio data is being played;

performing linear filtering processing on the second audio data to obtain linear echo data;

determining linear output data according to the second audio data and the linear echo data;

determining, according to the first audio data and the second audio data, first state data and second state data for identifying the audio call state, wherein the first state data is the average value of the correlation coefficients of the first audio data and the second audio data in the respective sub-bands, and the second state data is the average value of the ratios of the linear echo data to the second audio data in the respective sub-bands;

determining, according to the first state data and the second state data, a weighting factor related to the call state, so as to perform weighted filtering processing on the linear output data and obtain third audio data; and

outputting the third audio data to the calling party engaged in the call.

An embodiment of the present application further provides an audio processing chip, including:

an audio receiving module, configured to receive first audio data;

an audio output module, configured to play the first audio data;

a sound pickup module, configured to perform audio collection processing to generate second audio data, wherein the second audio data includes at least audio data collected by the sound pickup module while the first audio data is being played;

a filtering module, configured to perform linear filtering processing on the second audio data to obtain linear echo data; and

a processing module, configured to determine linear output data according to the second audio data and the linear echo data; to determine, according to the first audio data and the second audio data, first state data and second state data for identifying the audio call state, wherein the first state data is the average value of the correlation coefficients of the first audio data and the second audio data in the respective sub-bands, and the second state data is the average value of the ratios of the linear echo data to the second audio data in the respective sub-bands; and to determine, according to the first state data and the second state data, a weighting factor related to the call state,

wherein the filtering module is further configured to perform weighted filtering processing on the linear output data to obtain third audio data, and

the audio output module is further configured to output the third audio data to the calling party engaged in the call.

An embodiment of the present application further provides an audio data processing apparatus, including:

a filtering module, configured to perform linear filtering processing on first audio data sent by a first calling party and second audio data collected by a second calling party to obtain linear echo data, wherein the first calling party and the second calling party are in the same call activity;

a linear output module, configured to determine linear output data according to the second audio data collected by the second calling party and the linear echo data;

a state determination module, configured to determine, according to the first audio data and the second audio data, first state data and second state data for identifying the state of the audio call between the first calling party and the second calling party, wherein the first state data is the average value of the correlation coefficients of the first audio data and the second audio data in the respective sub-bands, and the second state data is the average value of the ratios of the linear echo data to the second audio data in the respective sub-bands; and

a suppression module, configured to determine, according to the first state data and the second state data, a weighting factor related to the call state, so as to perform weighted filtering processing on the linear output data and obtain third audio data to be sent to the first calling party.

An embodiment of the present application further provides an electronic device, including:

a memory, configured to store a program; and

a processor, configured to run the program stored in the memory, wherein the program, when run, executes the audio data processing method provided by the embodiments of the present application.

An embodiment of the present application further provides a computer-readable storage medium storing a computer program executable by a processor, wherein the program, when executed by the processor, implements the audio data processing method provided by the embodiments of the present application.

The audio data processing method and apparatus, calling method, audio processing chip, electronic device, and computer-readable storage medium provided by the embodiments of the present application can determine, from the audio data sent by the first calling party and the data collected by the second calling party, first state data and second state data that identify the current audio call state, and can then apply targeted suppression to the linearly filtered microphone data using a weighting coefficient, or a corresponding suppression scheme, determined from the first state data and the second state data for the current audio call state. Weighted filtering, or the corresponding suppression scheme, is thus applied based on the current call state, so that residual echo suppression takes into account the component characteristics of the residual echo in different call states, which improves the residual echo suppression effect and effectively improves call quality.

The above is only an overview of the technical solution of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are set out below.

Brief Description of the Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the present application. The same reference symbols denote the same components throughout the drawings. In the drawings:

FIG. 1 is a schematic diagram of an application scenario of the audio data processing method provided by an embodiment of the present application;

FIG. 2 is a flowchart of one embodiment of the audio data processing method provided by the present application;

FIG. 3 is a flowchart of another embodiment of the audio data processing method provided by the present application;

FIG. 4 is a schematic structural diagram of an embodiment of the audio data processing apparatus provided by the present application;

FIG. 5 is a schematic structural diagram of an embodiment of the electronic device provided by the present application.

Detailed Description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope is fully conveyed to those skilled in the art.

Embodiment 1

The solutions provided by the embodiments of the present application can be applied to any communication system with audio processing capability, for example a communication device in which an audio processing module is installed. FIG. 1 is a schematic diagram of an application scenario of the audio data processing method provided by an embodiment of the present application. The scenario shown in FIG. 1 is only one example of the scenarios to which the technical solution of the present application can be applied.

With the development of audio technology, audio call technology is used in more and more scenarios, not only for everyday calls but also widely in business. With the recent rise of remote work in particular, more and more users communicate by video conference or audio conference, and their requirements for call quality keep rising accordingly.

In an ordinary two-party call, when one end of the call, for example the end opposite the local party, also called the far end, speaks, the voice is collected by the far-end calling device and transmitted to the local party, i.e. the near end, where it is played by the audio playback device of the near-end calling device, so that the near end can hear the voice audio from the far end.

In this process, when the voice audio from the far end is played by the voice playback device of the near-end calling device, an echo is produced in the space where the near end is located: the played voice audio propagates through that space and is picked up by the voice acquisition device of the near-end calling device, and is thus sent back to the far end as if it were near-end voice audio. In this situation the far end is usually still speaking, while the near end is actually listening and not speaking. Because the echo produced at the near end by the far-end audio is sent back to the far-end calling device, the far end's own returned voice is played while the far end is speaking, so the far end hears an echo of what it said a moment earlier. A call echo is thus produced at the far end, and it seriously degrades the call experience.

For this reason, the prior art has proposed estimating the energy of the residual echo from information such as the delay estimation result of the transmitted audio data and the output of the linear filter, and adjusting the spectral gain of the signal after linear echo processing accordingly, so as to suppress the residual echo in the call audio. Specifically, in the prior art, as shown for example in FIG. 1, the downlink data from the far end, which contains the audio data produced by the far end, is played at the near end by a playback device such as a loudspeaker of the near-end calling device and can then be picked up, along the echo path shown in FIG. 1, by an audio acquisition device such as a microphone of the near-end calling device. The collected data and the far-end downlink data are passed together through a delay estimation module, which delay-aligns the downlink data with the collected signal containing the echo of the far-end audio data and thereby yields a delay estimation result for the far-end downlink data at the near end. The delay-adjusted signal and the signal collected by the audio acquisition device are then input to a linear filter for linear echo estimation, and the energy of the residual echo in the near-end uplink data is finally estimated from the delay estimation result and the linear echo estimation result, so that the echo in the uplink data can be suppressed.

In actual use, however, the circumstances of the two parties vary widely. For example, one party may be using a headset while the other uses a loudspeaker, or both parties may be speaking at once, and the audio signals collected in these cases differ greatly. In the scenario shown in FIG. 1, for instance, when the near end uses a headset the residual echo picked up by the microphone is very small. If the echo energy is nevertheless estimated by delay estimation and linear filtering as in the prior art, the near-end voice may be mistaken for echo and wrongly suppressed, which impairs the far end's ability to hear the near end.

Therefore, as the application scenarios of audio call technology become increasingly diverse, echo components exhibit different characteristics in different scenarios and environments, so suppressing echo against a uniform residual-energy benchmark can hardly meet users' requirements for call quality in varied and complex scenarios.

To this end, FIG. 1 shows a scenario in which the call echo heard by the far end, i.e. the party opposite the local party, is cancelled. In the prior art, the first calling party at the far end transmits its voice data as downlink data, by wire or wirelessly, to the near-end calling device. In FIG. 1, for example, the downlink data x(t) is delivered to the loudspeaker and to the delay estimation module of the calling device of the second calling party at the near end. While the loudspeaker of the near-end calling device plays the far-end voice audio, the delay estimation module performs delay alignment on the downlink data x(t) and the audio data d(t) collected by the microphone of the near-end calling device to obtain a delay estimation result x(t'). The delay estimation result x(t') is then input, together with the audio data d(t) collected by the microphone, to an existing linear echo cancellation filter for linear echo estimation, yielding a linear echo estimation result y(t) and the corresponding linear output.
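
The delay alignment of x(t) against d(t) can be illustrated with a short sketch. The application does not say how the delay estimation module works, so the example assumes a plain time-domain cross-correlation search; the function names `estimate_delay` and `align` and the synthetic signals are hypothetical.

```python
import random


def estimate_delay(x, d, max_lag):
    """Return the lag in samples (0..max_lag) at which d best matches a delayed x."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Cross-correlation of d[n] with x[n - lag] over the overlapping samples.
        stop = min(len(d), len(x) + lag)
        score = sum(d[n] * x[n - lag] for n in range(lag, stop))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(x, lag):
    """Delay x by lag samples (zero-padded, same length), playing the role of x(t')."""
    return ([0.0] * lag + list(x))[:len(x)]


# Usage: a far-end signal that reaches the microphone 5 samples late at half amplitude.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(400)]
d = [0.5 * v for v in align(x, 5)]
lag = estimate_delay(x, d, max_lag=20)
x_aligned = align(x, lag)
```

Aligning first lets the linear filter that follows spend its taps on the shape of the echo path instead of on the bulk transmission delay.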

Unlike the prior art, in the present application a processing module may be provided in the calling device of the second calling party at the near end, or in an audio processing chip in that calling device. After the delay estimation result x(t') and the linear echo estimation result y(t) have been obtained as in the prior-art scheme, the processing module can further determine the current call state of the second calling party from the first audio data x(t) (the downlink data), the linear echo data y(t), and the second audio data d(t) collected by the microphone. For example, in this embodiment the processing module may determine the first state data from the first audio data x(t) transmitted by the first calling party and the audio data d(t) collected by the microphone of the near-end calling device, and determine the second state data from the second audio data d(t) of the second calling party and the linear echo data y(t).
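
How the two state values might be computed from x(t), d(t), and y(t) can be sketched as follows. This is a minimal sketch under stated assumptions: the signals are taken to be already split into sub-bands (one sample sequence per band), the correlation coefficient is taken to be the Pearson coefficient over time, and the ratio is taken to be an energy ratio. The application fixes none of these choices, and the function names are hypothetical.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a constant band carries no correlation information
    return cov / math.sqrt(var_a * var_b)


def first_state_data(x_bands, d_bands):
    """Average over sub-bands of the correlation between x and d."""
    coeffs = [correlation(xb, db) for xb, db in zip(x_bands, d_bands)]
    return sum(coeffs) / len(coeffs)


def second_state_data(y_bands, d_bands, eps=1e-12):
    """Average over sub-bands of the energy ratio of y to d (assumed ratio)."""
    ratios = [
        sum(v * v for v in yb) / (sum(v * v for v in db) + eps)
        for yb, db in zip(y_bands, d_bands)
    ]
    return sum(ratios) / len(ratios)


# Usage: two sub-bands in which the microphone signal is a pure scaled echo of x.
x_bands = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
d_bands = [[0.5 * v for v in band] for band in x_bands]
y_bands = d_bands  # a perfect linear echo estimate
s1 = first_state_data(x_bands, d_bands)
s2 = second_state_data(y_bands, d_bands)
```

In this pure-echo case both values come out close to 1; with no echo in d(t) they would both fall toward 0.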

For example, in the scheme shown in FIG. 1, the first state data may be the average value of the correlation coefficients of the first audio data and the second audio data in the respective sub-bands, and the second state data may be the average value of the ratios of the linear echo data to the second audio data in the respective sub-bands. Compared with the prior art, the embodiments of the present application can therefore further determine the current call state from the outputs already available in the prior art, determine a weighting factor related to that call state, and introduce the weighting factor into the prior-art echo suppression processing so that the call state of the second calling party is taken into account. Because the residual echo components that may be present in the linear output can be determined more precisely for each call state of the second calling party, the residual echo components in the linear output of the prior-art scheme can be suppressed further. Alternatively, in the embodiments of the present application, a mapping table between call states and gain adjustment schemes for the linear output may be built in advance from historical empirical data, so that once the first state data and the second state data have been determined, the pre-built mapping table can be queried directly to select the corresponding adjustment scheme or gain adjustment factor and apply residual echo processing to the linear output.
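
One way to picture a weighting factor derived from the two state values is sketched below. The application does not give the formula, so the rule used here (the product of the two clamped state values, applied as a broadband gain with a floor) is purely an assumption for illustration, and the names `clamp01`, `weight_factor`, and `weighted_filter` are hypothetical.

```python
def clamp01(value):
    """Limit a state value to the range [0, 1]."""
    return max(0.0, min(1.0, value))


def weight_factor(first_state, second_state):
    """Hypothetical rule: both state values high means the output is mostly echo."""
    return clamp01(first_state) * clamp01(second_state)


def weighted_filter(linear_output, weight, floor=0.05):
    """Attenuate the linear output more strongly as the weight factor grows."""
    gain = max(floor, 1.0 - weight)  # the floor avoids muting the signal completely
    return [gain * sample for sample in linear_output]
```

With this rule a headset call (both state values near 0) passes the linear output almost unchanged, while a loudspeaker call with a silent near end (both values near 1) is attenuated down to the floor.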

Therefore, according to the present application, first state data and second state data that identify the current audio call state can be determined from the audio data sent by the first calling party and the data collected by the second calling party, and targeted suppression can then be applied to the linearly filtered microphone data using a weighting coefficient, or a corresponding suppression scheme, determined from the first state data and the second state data for the current audio call state. Weighted filtering, or the corresponding suppression scheme, is thus applied based on the current call state, so that residual echo suppression takes into account the component characteristics of the residual echo in different call states, which improves the residual echo suppression effect and effectively improves call quality.

The foregoing embodiment explains the technical principle and an exemplary application framework of the embodiments of the present application. The specific technical solutions of the embodiments are described in further detail below through several embodiments.

Embodiment 2

FIG. 2 is a flowchart of one embodiment of the audio data processing method provided by the present application. The method may be executed by any Internet of Things terminal or device with audio processing capability, or by an apparatus or chip integrated in such a device. As shown in FIG. 2, the audio data processing method includes the following steps:

S201: Perform linear filtering processing on first audio data sent by a first calling party and second audio data collected by a second calling party to obtain linear echo data.

In this embodiment, when the first calling party, at the far end relative to the local party, speaks to the second calling party at the near end in the same call activity, the voice constitutes the first audio data. As shown in FIG. 1, on receiving the first audio data the local party, i.e. the second calling party, can play it, for example through the loudspeaker of the near-end calling device, and collect audio through the microphone to obtain the second audio data, so that linear filtering processing can be performed on the first audio data and the second audio data. For example, the received first audio data of the first calling party and the second audio data collected by the local calling device may be input together to, for example, a linear AEC (Acoustic Echo Cancellation) filtering module to obtain the linear echo data. That is, the linear echo data is separated out by the linear filter.

S202: Determine linear output data according to the second audio data captured by the second calling party and the linear echo data.

After the linear echo data is obtained in step S201, the linear output data can be determined. The linear echo data reflects the echo component, within the second audio data captured by the local communication device, that is related to the first audio data sent by the far-end first calling party. The linear output data is therefore determined from the second audio data captured by the second calling party at the local near end and the linear echo data obtained in S201.
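
Steps S201 and S202 can be illustrated with a minimal sketch. The text does not name a specific adaptive algorithm, so the sketch assumes a time-domain NLMS (normalized least mean squares) filter and assumes that the linear output is the microphone signal minus the estimated echo. The function name, tap count, step size, and test signal are all hypothetical.

```python
import random

def nlms_echo_canceller(far_end, mic, taps=4, step=0.5, eps=1e-8):
    """Return (linear_echo, linear_output) for two equal-length sample lists.

    Assumption: NLMS is used as the linear filter; the patent text only
    says "linear AEC filtering module".
    """
    weights = [0.0] * taps   # adaptive estimate of the echo path
    history = [0.0] * taps   # most recent far-end samples, newest first
    linear_echo, linear_output = [], []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        y = sum(w * h for w, h in zip(weights, history))  # S201: linear echo estimate
        e = d - y                                         # S202: linear output (assumed subtraction)
        norm = sum(h * h for h in history) + eps
        weights = [w + step * e * h / norm for w, h in zip(weights, history)]
        linear_echo.append(y)
        linear_output.append(e)
    return linear_echo, linear_output

# Hypothetical test signal: the microphone hears only a two-tap echo of the far end.
rng = random.Random(0)
far = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.6 * (far[n - 1] if n >= 1 else 0.0) - 0.3 * (far[n - 2] if n >= 2 else 0.0)
       for n in range(len(far))]
echo, out = nlms_echo_canceller(far, mic)
```

In this echo-only, noise-free toy case the linear output decays toward zero once the filter has converged. In a real call, near-end speech and nonlinear distortion leave a residual, which is what the later steps address.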

S203: According to the first audio data and the second audio data, determine first state data and second state data that identify the state of the audio call between the first calling party and the second calling party.

In this embodiment, the first audio data is speech data sent by the far-end first calling party, and the second audio data is data captured by, for example, the microphone of the communication device of the local second calling party. In general, the local near end may be in one of three call states: talking through a headset; playing the far-end party's speech through the loudspeaker while the near end is silent; and playing the far-end party's speech through the loudspeaker while the near end is also talking.

The extent to which the second audio data captured by the near-end communication device contains echo of the first audio data differs among these three states. When the near end listens to the far end or talks through a headset, the second audio data captured by the microphone of the near-end communication device contains almost none of the first audio data sent by the far end. When the near end plays the far-end party's speech through the loudspeaker and is silent, the first audio data propagating in the near-end space is captured by the near-end communication device and is therefore contained in the second audio data. Because the near end is silent in this state, the captured second audio data consists almost entirely of the first audio data. When the near end plays the far-end first audio data through the loudspeaker and is talking at the same time, both the far-end first audio data and the near-end speech propagate in the near-end space. The captured second audio data then contains both a component of the far-end first audio data, that is, the echo data, and the speech of the near end.

Therefore, in this embodiment, the near-end call state can be determined in step S203 based on the first audio data and the second audio data. That is, first state data and second state data identifying the near-end call state are determined. For example, the first state data can be obtained by averaging, over the frequency bands, the correlation coefficients of the first audio data and the second audio data in each sub-band. The second state data can be obtained by averaging, over the frequency bands, the ratios of the linear echo data separated by the linear filter to the second audio data in each sub-band. In other words, the first state data is the mean of the per-sub-band correlation coefficients of the first and second audio data, and the second state data is the mean of the per-sub-band ratios of the linear echo data to the second audio data. The first state data and second state data determined in this way distinguish the three call states of the second calling party well, so that subsequent processing can apply weighted filtering according to the call state.
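
The two state quantities can be sketched as follows. The text does not give their exact formulas, so this sketch assumes a magnitude correlation coefficient accumulated over a few frames for the first state data, and an energy ratio for the second. The function names and the sub-band layout (a list of frames, each a list of complex sub-band values) are hypothetical.

```python
import math

def band_averaged_coherence(x_frames, d_frames, eps=1e-12):
    """First state data (assumed form): mean over sub-bands of |corr(X, D)|."""
    bands = len(x_frames[0])
    total = 0.0
    for k in range(bands):
        cross = sum(x[k] * d[k].conjugate() for x, d in zip(x_frames, d_frames))
        px = sum(abs(x[k]) ** 2 for x in x_frames)
        pd = sum(abs(d[k]) ** 2 for d in d_frames)
        total += abs(cross) / math.sqrt(px * pd + eps)
    return total / bands

def band_averaged_energy_ratio(y_frames, d_frames, eps=1e-12):
    """Second state data (assumed form): mean over sub-bands of echo/mic energy."""
    bands = len(y_frames[0])
    total = 0.0
    for k in range(bands):
        py = sum(abs(y[k]) ** 2 for y in y_frames)
        pd = sum(abs(d[k]) ** 2 for d in d_frames)
        total += py / (pd + eps)
    return total / bands

# Toy data: three frames, two sub-bands. The microphone signal is a scaled
# copy of the far-end signal (loudspeaker playback, near end silent).
x_frames = [[1 + 1j, 2 + 0j], [0.5 - 1j, -1 + 1j], [-1 + 0j, 0 + 2j]]
d_echo = [[0.5 * v for v in frame] for frame in x_frames]
coh = band_averaged_coherence(x_frames, d_echo)
ydr_echo_only = band_averaged_energy_ratio(d_echo, d_echo)
ydr_headset = band_averaged_energy_ratio([[0j, 0j]] * 3, d_echo)
```

In the toy data both quantities are close to 1 for the echo-only case, while the energy ratio drops to 0 when the linear filter finds no echo, as in headset use.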

S204: According to the first state data and the second state data, determine a weighting factor related to the call state, and use it to perform weighted filtering on the linear output data to obtain third audio data to be sent to the first calling party.

Therefore, a weighting factor related to the current call state of the second calling party can be determined from the first state data and the second state data determined in step S203. Step S204 can then directly use the outputs of steps S201 and S202 and perform weighted filtering according to the current call state, that is, according to the weighting factor associated with the call state.

In other words, this embodiment introduces the weighting factor into prior-art echo suppression so that the call state of the second calling party is taken into account. Depending on that call state, the residual echo component that may be present in the linear output can be determined more definitely, and the linear output of a prior-art scheme can then be subjected to further residual echo suppression. Alternatively, a mapping table between call states and gain adjustment schemes for the linear output can be built from historical empirical data. Once the first state data and the second state data are determined, the pre-built mapping table can be queried directly to select the corresponding adjustment scheme or gain adjustment factor, which is then applied to the linear output for residual echo processing.
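
The mapping-table alternative can be sketched as a small lookup. The text gives neither thresholds nor table contents, so every number below is a hypothetical placeholder, as are the state labels and function names.

```python
def classify_call_state(coh_xd, ydr, coh_threshold=0.6, ydr_threshold=0.1):
    """Map the two state quantities to one of the three call states.

    The thresholds are illustrative assumptions, not values from the text.
    """
    if ydr < ydr_threshold:
        return "headset"        # almost no echo in the microphone signal
    if coh_xd >= coh_threshold:
        return "far_end_only"   # microphone signal is almost all echo
    return "double_talk"        # echo mixed with near-end speech

# Hypothetical pre-built mapping table: call state -> attenuation in dB.
ATTENUATION_DB = {"headset": 0.0, "far_end_only": 30.0, "double_talk": 6.0}

def gain_for_state(coh_xd, ydr):
    """Linear gain factor to apply to the linear output data."""
    state = classify_call_state(coh_xd, ydr)
    return 10 ** (-ATTENUATION_DB[state] / 20)
```

The design choice here is to suppress aggressively when only echo is present and gently during double talk, so that near-end speech is not distorted. A table learned from historical data would replace the fixed entries.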

Embodiment 3

FIG. 3 is a flowchart of another embodiment of the audio data processing method provided by the present application. The method may be executed by various communication terminals or devices with audio processing capability, or by an apparatus or chip integrated in such devices. As shown in FIG. 3, the audio data processing method includes the following steps:

S301: Perform voice activity detection on the first audio data to determine whether the first audio data contains speech.

In this embodiment, the modules used for audio processing in a communication terminal consume considerable power, and the two parties usually do not talk continuously during a call. Therefore, during a call the local near end can perform voice activity detection (VAD) on the first audio data received from the far end. The relevant audio processing modules are started only when the first audio data is determined to contain speech, and otherwise remain in a standby or sleep state. Only when the VAD in this step determines that the first audio data contains speech, that is, that the far-end first calling party is talking, is a wake-up signal sent to the relevant audio processing modules, or a notification signal sent to the central processing unit or controller of the communication device. The communication device can then wake the relevant audio processing modules to process the first audio data containing the far-end speech.
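
The gating role of step S301 can be sketched with a very simple energy-based detector. The text does not specify the VAD algorithm, so the energy threshold, frame size, and function names are assumptions for illustration only. Practical detectors are usually more elaborate.

```python
def frame_has_speech(frame, threshold=1e-3):
    """Crude VAD: mean-square energy above a fixed (hypothetical) threshold."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

def process_if_active(far_frames, process):
    """Run `process` only on frames flagged as speech; skip the rest.

    Skipped frames yield None, standing in for the audio modules
    staying in standby.
    """
    return [process(f) if frame_has_speech(f) else None for f in far_frames]

silent = [0.0] * 160
speech = [0.1, -0.1] * 80
results = process_if_active([silent, speech, silent], process=len)
```

Here `process=len` is only a stand-in for the downstream echo processing chain.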

S302: Perform delay alignment on the first audio data according to the second audio data.

If step S301 determines that the received first audio data contains far-end speech, then in step S302 the received first audio data can be delay-aligned against the second audio data captured by the near-end communication device, for example by a delay estimation module, so that the delay of the received first audio data is adjusted. In general, the local near end plays the first audio data through a playback device of the communication device, such as a loudspeaker, and captures the audio propagating in the space through a capture module, such as a microphone, to obtain the second audio data. The played sound must propagate through the space in which the near end is located before the near-end communication device can capture it. Therefore, even when the near end is silent, the far-end speech component contained in the captured second audio data lags the first audio data. It takes some time for the first audio data to be played by the loudspeaker, travel through the space, and be received by the microphone, so the two signals are not aligned on the timeline. Step S302 therefore aligns the first audio data and the captured second audio data on the timeline, which improves the accuracy and efficiency of the subsequent echo identification and comparison.
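
The delay alignment of step S302 can be sketched with a brute-force cross-correlation search. The text only mentions a "delay estimation module", so the search method, the maximum delay, and the function names are assumptions.

```python
import random

def estimate_delay(reference, captured, max_delay):
    """Return the lag (in samples) at which `captured` best matches `reference`."""
    best_delay, best_score = 0, float("-inf")
    for delay in range(max_delay + 1):
        # Correlate the reference with the captured signal shifted back by `delay`.
        score = sum(r * c for r, c in zip(reference, captured[delay:]))
        if score > best_score:
            best_delay, best_score = delay, score
    return best_delay

def delay_align(reference, delay):
    """Delay the reference by `delay` samples, keeping its length."""
    return [0.0] * delay + reference[:len(reference) - delay]

# Hypothetical test: the microphone hears the far-end signal 7 samples late.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(200)]
captured = [0.0] * 7 + [0.5 * r for r in reference[:-7]]
delay = estimate_delay(reference, captured, max_delay=20)
aligned = delay_align(reference, delay)
```

After alignment, `aligned[n]` lines up with the echo component in `captured[n]`, so the linear filter in the next step only has to model the room response and not the bulk playback delay.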

S303: Perform linear filtering on the first audio data sent by the first calling party and the second audio data captured by the second calling party to obtain linear echo data.

In this embodiment, the far-end party sends speech to the near end as the first audio data. As shown in FIG. 1, the local near end can play the first audio data on receipt, for example through the loudspeaker of the near-end call device, and capture sound through the microphone to obtain the second audio data, so that linear filtering can be performed on the first audio data and the second audio data. That is, the first audio data is speech data sent by the first calling party to the second calling party during the call, and the second audio data is audio data captured by the second calling party while the first audio data is being played. For example, the received first audio data and the second audio data captured by the local call device can be fed together into, for example, a linear AEC (Acoustic Echo Cancellation) filtering module to obtain the linear echo data. That is, the linear echo data is separated out by the linear filter.

S304: Determine linear output data according to the second audio data captured by the second calling party and the linear echo data.

After the linear echo data is obtained in step S303, the linear output data can be determined. The linear echo data reflects the echo component, within the second audio data captured by the local communication device, that is related to the first audio data sent by the far-end first calling party. The linear output data is therefore determined from the second audio data captured by the second calling party at the local near end and the linear echo data obtained in S303.

S305: Determine the correlation coefficient of the first audio data and the second audio data in each sub-band, and take the mean of the correlation coefficients as the first state data.

S306: Determine the ratio of the linear echo data to the second audio data in each sub-band, and take the mean of the ratios as the second state data.

The amount of first audio data, that is, echo data, contained in the second audio data captured by the near-end communication device differs among the three call states described above. When the near end listens to the far end or talks through a headset, the second audio data captured by the microphone of the near-end communication device contains almost none of the first audio data sent by the far end. When the near end plays the far-end party's speech through the loudspeaker and is silent, the first audio data propagating in the near-end space is captured by the near-end communication device and is therefore contained in the second audio data. Because the near end is silent in this state, the captured second audio data consists almost entirely of the first audio data. When the near end plays the far-end first audio data through the loudspeaker and is talking at the same time, both the far-end first audio data and the near-end speech propagate in the near-end space. The captured second audio data then contains both a component of the far-end first audio data, that is, the echo data, and the speech of the near end.

Therefore, in this embodiment, the near-end call state can be determined in steps S305 and S306 based on the first audio data and the second audio data. In step S305, the first state data Coh_XD is obtained by averaging, over the frequency bands, the correlation coefficients of the first audio data and the second audio data in each sub-band. In step S306, the second state data YDR is obtained by averaging, over the frequency bands, the ratios of the linear echo data separated by the linear filter to the second audio data in each sub-band. The first state data obtained in step S305 and the second state data obtained in step S306 can thus be used to identify the state of the audio call between the first calling party and the second calling party. They distinguish the three call states of the second calling party well, so that subsequent processing can apply weighted filtering according to the call state.

S307: Determine a trade-off factor for controlling speech distortion according to the first state data and the second state data.

In this embodiment, the trade-off factor is determined from the first state data obtained in step S305 and the second state data obtained in step S306. The following formula can be used to determine the trade-off factor.

where Φ_YY and Φ_EE are covariance matrices computed from signals of different frames. Therefore, in this embodiment, a trade-off factor that can be used to determine the residual echo component is obtained from the first state data and the second state data.

S308: Determine inter-frame Wiener filter coefficients according to the trade-off factor, the linear echo data, and the linear output data.

After the trade-off factor is obtained in step S307, the inter-frame Wiener filter coefficients can be determined in step S308 from that trade-off factor together with the linear echo data obtained in step S303 and the linear output data obtained in step S304. For example, the following formula can be used to determine the inter-frame Wiener filter coefficients.

where t is time, f is the sound frequency, and e₁ is a calculation parameter.

S309: Filter the linear output data according to the Wiener filter coefficients to obtain first output audio data.

In step S309, a Wiener filter with the coefficients determined in step S308 can be applied to the linear output data obtained in step S304 to obtain the first output audio data. This embodiment introduces the trade-off factor into prior-art echo suppression so that the call state of the second calling party is taken into account. Depending on that call state, the residual echo component that may be present in the linear output can be determined more definitely, and the linear output of a prior-art scheme can then be subjected to further residual echo suppression.
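
The role of a trade-off factor in a Wiener-type suppressor can be illustrated with a simplified stand-in. The inter-frame filter formula is not reproduced in this text, so the sketch below instead uses the textbook single-tap parametric Wiener gain per frequency bin. It is not the filter of this embodiment. The function names, the residual echo power input, and the flooring are assumptions.

```python
def parametric_wiener_gain(phi_ee, phi_rr, mu, eps=1e-12):
    """Textbook parametric Wiener gain for one frequency bin.

    phi_ee: power of the linear output in this bin
    phi_rr: estimated residual echo power in this bin (assumed to be available)
    mu:     trade-off factor; larger values suppress more echo at the
            cost of more speech distortion
    """
    phi_ss = max(phi_ee - phi_rr, 0.0)   # estimated near-end speech power
    return phi_ss / (phi_ss + mu * phi_rr + eps)

def apply_gains(linear_output_bins, gains):
    """Scale each bin of the linear output by its gain."""
    return [g * e for g, e in zip(gains, linear_output_bins)]

g_standard = parametric_wiener_gain(4.0, 1.0, mu=1.0)    # ordinary Wiener gain
g_aggressive = parametric_wiener_gain(4.0, 1.0, mu=3.0)  # stronger suppression
filtered = apply_gains([2 + 0j, 0 + 2j], [g_standard, g_aggressive])
```

With `mu = 1` the expression reduces to the ordinary Wiener gain. Tying `mu` to the call state, for example a larger value when only the far end is talking, is the kind of state-dependent weighting the text describes, though the actual mapping is not given here.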

Therefore, in this embodiment, the near-end call state is taken into account by introducing the result of judging the current near-end call state as a parameter when the filter coefficients are determined, which improves the echo suppression effect.

S310: Determine the band-average gain of the first output audio data.

S311: Select a signal amplitude reduction value corresponding to the band-average gain.

S312: Reduce the signal amplitude of the first output audio data according to the signal amplitude reduction value.

After the first output audio data is obtained in step S309, its band-average gain can be determined in step S310. In step S311, the corresponding signal amplitude reduction value is selected according to the band-average gain obtained in step S310. In step S312, the selected signal amplitude reduction value is used to reduce the signal amplitude of the first output audio data obtained in step S309, thereby further suppressing the residual echo contained in the first output audio data.
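
Steps S310 to S312 can be sketched as follows. The text does not define the band-average gain precisely, so the sketch assumes it is the mean, over sub-bands, of the magnitude ratio between the first output audio data and the linear output data. The reduction table and function names are hypothetical.

```python
def band_average_gain(output_bins, linear_output_bins, eps=1e-12):
    """S310 (assumed form): mean per-band magnitude ratio output / linear output."""
    gains = [abs(o) / (abs(e) + eps) for o, e in zip(output_bins, linear_output_bins)]
    return sum(gains) / len(gains)

# Hypothetical table: (upper bound on the average gain, reduction in dB).
# A low average gain suggests the frame was mostly echo, so reduce it further.
REDUCTION_TABLE = [(0.2, 20.0), (0.5, 10.0)]

def select_reduction_db(avg_gain):
    """S311: pick the reduction value for the given band-average gain."""
    for upper, reduction_db in REDUCTION_TABLE:
        if avg_gain < upper:
            return reduction_db
    return 0.0

def reduce_amplitude(output_bins, reduction_db):
    """S312: scale every bin by the selected reduction."""
    scale = 10 ** (-reduction_db / 20)
    return [scale * o for o in output_bins]

linear = [1 + 0j, 0 + 2j, -2 + 0j]
first_output = [0.1 + 0j, 0 + 0.2j, -0.2 + 0j]
avg = band_average_gain(first_output, linear)
reduction = select_reduction_db(avg)
reduced = reduce_amplitude(first_output, reduction)
```

In this toy frame the Wiener stage already attenuated every band to about one tenth, so the table selects the largest extra reduction.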

In addition, in this embodiment, a mapping table between call states and gain adjustment schemes for the linear output can be built from historical empirical data. Once the first state data and the second state data are determined, the pre-built mapping table can be queried directly to select the corresponding adjustment scheme or gain adjustment factor, which is then applied to the linear output for residual echo processing.

For example, instead of determining the trade-off factor, a signal amplitude reduction value corresponding to the first state data and the second state data can be selected, and the signal amplitude of the linear output data can be reduced according to that value. A gain adjustment scheme for removing residual echo can thus be determined quickly from historical empirical data.

Embodiment 4

FIG. 4 is a schematic structural diagram of an embodiment of the audio data processing apparatus provided by the present application, which can be used to perform the method steps shown in FIG. 2 and FIG. 3. As shown in FIG. 4, the audio data processing apparatus may include a filtering module 41, a linear output module 42, a state determination module 43, and a suppression module 44.

The filtering module 41 may be configured to perform linear filtering on the first audio data sent by the first calling party and the second audio data captured by the second calling party to obtain linear echo data.

The linear output module 42 is configured to determine linear output data according to the second audio data captured by the second calling party and the linear echo data.

After the filtering module 41 obtains the linear echo data, the linear output module 42 can determine the linear output data. The linear echo data reflects the echo component, within the second audio data captured by the local communication device, that is related to the first audio data sent by the far-end first calling party. The linear output module 42 therefore determines the linear output data from the second audio data captured by the second calling party at the local near end and the linear echo data obtained by the filtering module 41.

The state determination module 43 is configured to determine, according to the first audio data and the second audio data, first state data and second state data that identify the state of the audio call between the first calling party and the second calling party.

Therefore, in this embodiment, the state determination module 43 can determine the near-end call state based on the first audio data and the second audio data, that is, determine first state data and second state data identifying the near-end call state. For example, a first determining unit 431 of the state determination module 43 can obtain the first state data by averaging, over the frequency bands, the correlation coefficients of the first audio data and the second audio data in each sub-band. A second determining unit 432 of the state determination module 43 can obtain the second state data by averaging, over the frequency bands, the ratios of the linear echo data separated by the linear filter to the second audio data in each sub-band.

The suppression module 44 may be configured to determine, according to the first state data and the second state data, a weighting factor related to the call state, and to use it to perform weighted filtering on the linear output data to obtain third audio data to be sent to the first calling party.

Therefore, a weighting factor related to the current call state of the second calling party can be determined from the first state data and the second state data determined by the state determination module 43. The suppression module 44 can directly use the outputs of the filtering module 41 and the linear output module 42 and perform weighted filtering according to the current call state, that is, according to the weighting factor associated with the call state.

For example, in this embodiment, the suppression module 44 may include a third determining unit 441, a fourth determining unit 442, and a filtering unit 443.

The third determining unit 441 may be configured to determine a trade-off factor for controlling speech distortion according to the first state data and the second state data.

In this embodiment, the trade-off factor is determined from the first state data obtained by the first determining unit 431 and the second state data obtained by the second determining unit 432. The following formula can be used to determine the trade-off factor μ.

where Φ_YY and Φ_EE are covariance matrices determined from signals of different frames. Therefore, in this embodiment, a trade-off factor that can be used to determine the residual echo component is obtained from the first state data and the second state data.

The fourth determining unit 442 may be configured to determine inter-frame Wiener filter coefficients according to the trade-off factor, the linear echo data, and the linear output data.

After the third determining unit 441 obtains the trade-off factor, the fourth determining unit 442 can determine the inter-frame Wiener filter coefficients from that trade-off factor, the linear echo data obtained by the filtering module 41, and the linear output data obtained by the linear output module 42. For example, the following formula can be used to determine the inter-frame Wiener filter coefficients w(t,f).

The filtering unit 443 may be configured to filter the linear output data according to the Wiener filter coefficients to obtain first output audio data.

In this embodiment, the filtering unit 443 can apply a Wiener filter with the coefficients determined by the fourth determining unit 442 to the linear output data obtained by the linear output module 42 to obtain the first output audio data. The trade-off factor is thus introduced into prior-art echo suppression so that the call state of the second calling party is taken into account. Depending on that call state, the residual echo component that may be present in the linear output can be determined more definitely, and the linear output of a prior-art scheme can then be subjected to further residual echo suppression.

In addition, the audio processing apparatus of this embodiment may further include a gain determination module 47, a selection module 48, and a signal amplitude adjustment module 49.

增益确定模块47可以用于确定所述第一输出音频数据的频带平均增益；The gain determination module 47 can be used to determine the average gain of the frequency band of the first output audio data;

选择模块48可以用于选择与所频带平均增益对应的信号降低幅度值；The selection module 48 can be used to select the signal reduction amplitude value corresponding to the average gain of the frequency band;

The signal amplitude adjustment module 49 may be configured to reduce the signal amplitude of the first output audio data according to the signal reduction amplitude value.

After the filtering unit 443 obtains the first output audio data, the gain determination module 47 may determine the frequency band average gain of that data, the selection module 48 may select the signal reduction amplitude value corresponding to that gain, and the signal amplitude adjustment module 49 may then reduce the signal amplitude of the first output audio data by the selected value, thereby further suppressing the residual echo contained in the first output audio data.
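The three modules above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of this application: the band-average gain is assumed to be the mean magnitude ratio between the filtered output and the linear output, and the table `ATTENUATION_TABLE`, its thresholds, and its decibel values are hypothetical.

```python
# Hypothetical table: (upper bound on band-average gain, attenuation in dB).
# A low average gain suggests the band is dominated by residual echo.
ATTENUATION_TABLE = [(0.2, 18.0), (0.5, 9.0), (0.8, 3.0)]


def band_average_gain(filtered, linear_output, eps=1e-12):
    """Mean per-bin magnitude ratio of filtered output to linear output."""
    ratios = [abs(f) / (abs(e) + eps) for f, e in zip(filtered, linear_output)]
    return sum(ratios) / len(ratios) if ratios else 1.0


def select_attenuation_db(avg_gain, table=ATTENUATION_TABLE):
    """Pick the attenuation of the first table row whose bound exceeds avg_gain."""
    for upper_bound, attenuation_db in table:
        if avg_gain < upper_bound:
            return attenuation_db
    return 0.0  # high average gain: leave the signal untouched


def reduce_amplitude(filtered, attenuation_db):
    """Scale the bins down by the selected attenuation."""
    scale = 10.0 ** (-attenuation_db / 20.0)
    return [scale * f for f in filtered]
```

A caller would chain the three functions per frame, for example `reduce_amplitude(out, select_attenuation_db(band_average_gain(out, lin)))`.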

In addition, the suppression module 44 may further include a selection unit 444 and a signal amplitude adjustment unit 445.

The selection unit 444 may be configured to select a signal reduction amplitude value corresponding to the first state data and the second state data.

The signal amplitude adjustment unit 445 may be configured to reduce the signal amplitude of the linear output audio data according to the signal reduction amplitude value.

In addition, once the first state data and the second state data have been obtained in this embodiment of the present application, the selection unit 444 may look up the signal reduction amplitude value directly, for example in a preset table, according to the current near-end call state identified by the first state data and the second state data, and the signal amplitude adjustment unit 445 may use the value selected by the selection unit 444 to reduce the signal amplitude of the linear output audio obtained by the linear output module 42.
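A table lookup of this kind might look like the sketch below. The text does not state how the state data map to call states or which reduction values the table holds, so the three states, the thresholds in `classify`, and the values in `REDUCTION_DB` are hypothetical.

```python
from enum import Enum


class CallState(Enum):
    FAR_END_ONLY = "far_end_only"    # only the far end is speaking
    DOUBLE_TALK = "double_talk"      # both ends are speaking
    NEAR_END_ONLY = "near_end_only"  # only the near end is speaking


# Hypothetical preset table: signal reduction in dB for each call state.
REDUCTION_DB = {
    CallState.FAR_END_ONLY: 24.0,
    CallState.DOUBLE_TALK: 6.0,
    CallState.NEAR_END_ONLY: 0.0,
}


def classify(first_state, second_state, corr_hi=0.6, ratio_hi=0.7, corr_lo=0.2):
    """Map the two state values to a call state (thresholds are assumptions)."""
    if first_state >= corr_hi and second_state >= ratio_hi:
        return CallState.FAR_END_ONLY
    if first_state <= corr_lo:
        return CallState.NEAR_END_ONLY
    return CallState.DOUBLE_TALK


def suppress(linear_output, first_state, second_state):
    """Attenuate the linear output by the table value for the current state."""
    reduction_db = REDUCTION_DB[classify(first_state, second_state)]
    scale = 10.0 ** (-reduction_db / 20.0)
    return [scale * x for x in linear_output]
```

Compared with weighted Wiener filtering, this path trades suppression accuracy for very low computational cost, since each frame needs only a classification and one multiplication per sample.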

The voice detection module 45 may be configured to perform audio activity detection on the first audio data to determine whether the first audio data contains voice audio.

In this embodiment of the present application, the modules used for audio processing in a communication terminal consume considerable power, and the two parties usually do not talk continuously throughout a call. Therefore, while the near end, as the local party, is in a call, the voice detection module 45 may perform voice activity detection (VAD) on the first audio data received from the far end, so that the relevant audio processing modules are started only when the first audio data is determined to contain voice data and otherwise remain in a standby or sleep state. Only when the VAD processing determines that the first audio data contains voice data, that is, that the first communication party at the far end is speaking, is a wake-up signal sent to the relevant audio processing modules, or a notification signal sent to the central processing unit or controller of the communication device, so that the communication device can wake up the relevant audio processing modules to process the first audio data containing the far-end voice.
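The gating behavior described above can be illustrated with a simple energy-based detector. The text does not specify which VAD algorithm is used, so the energy threshold, the function names, and the `GatedProcessor` wrapper are assumptions chosen only to show the wake-on-speech pattern.

```python
import math


def frame_energy_db(frame, eps=1e-12):
    """Mean power of one frame of samples, in dB relative to full scale."""
    power = sum(x * x for x in frame) / max(len(frame), 1)
    return 10.0 * math.log10(power + eps)


def has_speech(frame, threshold_db=-40.0):
    """Energy-based stand-in for VAD; the threshold is hypothetical."""
    return frame_energy_db(frame) > threshold_db


class GatedProcessor:
    """Runs the echo-processing callback only while far-end speech is detected."""

    def __init__(self, process):
        self.process = process  # callable(far_frame, mic_frame) -> output frame
        self.awake = False

    def push(self, far_frame, mic_frame):
        if has_speech(far_frame):
            self.awake = True       # corresponds to sending the wake-up signal
            return self.process(far_frame, mic_frame)
        self.awake = False          # modules stay in standby
        return mic_frame            # no far-end speech, so nothing to cancel
```

A production detector would normally add hangover frames and a noise-floor estimate so that the processing modules do not toggle on every short pause.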

The delay alignment module 46 may be configured to perform delay alignment processing on the first audio data according to the second audio data.

When the voice detection module 45 determines that the received first audio data contains far-end audio data, the delay alignment module 46 may align the received first audio data with the second audio data collected by the near-end communication device, for example by means of a delay estimation module, and thereby adjust the delay of the received first audio data. In general, the near end, as the local party, plays the first audio data through a playback device of the communication device, such as a speaker, and collects the audio propagating in the space through a collection module, such as a microphone, to obtain the second audio data. Because the played audio must propagate through the space in which the near end is located before the near-end communication device can collect it, the far-end speech component contained in the second audio data lags the first audio data even when the near end is silent: it takes some time for the first audio data to be played by the speaker, travel through the space, and reach the microphone, so the two signals are not aligned on the timeline. The delay alignment module 46 of the near-end communication device can therefore align the first audio data and the second audio data on the timeline, which benefits the accuracy and efficiency of the subsequent echo identification and comparison processing.
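One common way to estimate such a delay is to search for the lag that maximizes the cross-correlation between the played signal and the captured signal. The sketch below assumes that technique. The text only mentions a delay estimation module without naming a method, and the function names and the brute-force search are illustrative.

```python
def estimate_delay(reference, captured, max_lag):
    """Return the lag (in samples) at which captured best matches reference.

    reference: samples of the first audio data (played signal).
    captured:  samples of the second audio data (microphone signal).
    max_lag:   largest delay to test; assumed to be known from the hardware.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        n = min(len(reference), len(captured) - lag)
        if n <= 0:
            break
        score = sum(reference[i] * captured[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_reference(reference, lag):
    """Delay the reference by lag samples so it lines up with the capture."""
    lag = max(0, min(lag, len(reference)))
    return [0.0] * lag + list(reference[: len(reference) - lag])
```

The direct search costs O(N x max_lag) per block; real-time systems usually compute the same correlation with an FFT or track the delay incrementally.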

Therefore, the audio data processing apparatus according to the embodiment of the present application can determine, from the audio data sent by the first calling party and the data collected by the second calling party, the first state data and the second state data that identify the current audio call state, and can then apply targeted suppression to the linearly filtered microphone data according to the weighting coefficient, or the corresponding suppression scheme, determined from the first state data and the second state data for the current audio call state. Because the weighted filtering or the suppression scheme is chosen according to the current call state, the residual echo suppression can take into account the composition of the residual echo in different call states, which improves the residual echo suppression effect and effectively improves call quality.
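The claims of this application define the first state data as the average, over sub-bands, of the correlation coefficient between the first and second audio data, and the second state data as the average, over sub-bands, of the ratio of the linear echo data to the second audio data. The sketch below shows one possible reading of those definitions. The normalized-correlation form, the use of power ratios clipped to 1, and the data layout (one list of complex values per sub-band, covering a short block of frames) are assumptions.

```python
import math


def normalized_correlation(a, b, eps=1e-12):
    """Magnitude of the normalized cross-correlation of two sub-band sequences."""
    num = abs(sum(x * y.conjugate() for x, y in zip(a, b)))
    den = math.sqrt(sum(abs(x) ** 2 for x in a) * sum(abs(y) ** 2 for y in b))
    return num / (den + eps)


def first_state(far_bands, mic_bands):
    """Average correlation between far-end and microphone data over sub-bands."""
    values = [normalized_correlation(f, m) for f, m in zip(far_bands, mic_bands)]
    return sum(values) / len(values)


def second_state(echo_bands, mic_bands, eps=1e-12):
    """Average echo-to-microphone power ratio over sub-bands, clipped to 1."""
    values = []
    for echo, mic in zip(echo_bands, mic_bands):
        p_echo = sum(abs(v) ** 2 for v in echo)
        p_mic = sum(abs(v) ** 2 for v in mic)
        values.append(min(p_echo / (p_mic + eps), 1.0))
    return sum(values) / len(values)
```

Under this reading, both values approach 1 when the microphone signal is almost entirely echo and fall toward 0 when near-end speech dominates, which is what lets them identify the call state.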

Embodiment 5

The internal functions and structure of the audio data processing apparatus are described above, and the apparatus can be implemented as an electronic device. FIG. 5 is a schematic structural diagram of an embodiment of an electronic device provided by the present application. As shown in FIG. 5, the electronic device includes a memory 51 and a processor 52.

The memory 51 is used to store a program. In addition to the program, the memory 51 may also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like.

The memory 51 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

The processor 52 is not limited to a central processing unit (CPU); it may also be a processing chip such as a graphics processing unit (GPU), a field programmable gate array (FPGA), an embedded neural network processor (NPU), or an artificial intelligence (AI) chip. The processor 52 is coupled to the memory 51 and executes the program stored in the memory 51; when the program runs, it executes the audio data processing methods of Embodiments 2 and 3 above.

Further, as shown in FIG. 5, the electronic device may also include other components such as a communication component 53, a power supply component 54, an audio component 55, and a display 56. FIG. 5 shows only some components schematically, which does not mean that the electronic device includes only the components shown in FIG. 5.

The communication component 53 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device can access a wireless network based on a communication standard, such as WiFi, 3G, 4G, or 5G, or a combination thereof. In one exemplary embodiment, the communication component 53 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 53 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The power supply component 54 provides power to the various components of the electronic device. The power supply component 54 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.

The audio component 55 is configured to output and/or input audio signals. For example, the audio component 55 includes a microphone (MIC) that is configured to receive external audio signals when the electronic device is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 51 or transmitted via the communication component 53. In some embodiments, the audio component 55 also includes a speaker for outputting audio signals.

The display 56 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation.

Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments; the storage medium includes various media that can store program code, such as ROM, RAM, a magnetic disk, or an optical disk.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for processing audio data, comprising:

performing linear filtering processing on first audio data sent by a first calling party and second audio data collected by a second calling party to obtain linear echo data, wherein the first calling party and the second calling party are in the same call activity;

determining linear output data according to the second audio data collected by the second calling party and the linear echo data;

determining, according to the first audio data and the second audio data, first state data and second state data for identifying the state of an audio call between the first calling party and the second calling party, wherein the first state data is the average value of the correlation coefficients of the first audio data and the second audio data in each sub-band, and the second state data is the average value of the ratio of the linear echo data to the second audio data in each sub-band;

determining, according to the first state data and the second state data, a weighting factor related to the call state, so as to perform weighted filtering processing on the linear output data to obtain third audio data to be sent to the first calling party.

2. The audio data processing method according to claim 1, wherein the first audio data is voice data sent by the first calling party to the second calling party in the call activity, and the second audio data is audio data collected by the second calling party when the first audio data is played.

3. The audio data processing method according to claim 1 or 2, wherein determining, according to the first state data and the second state data, the weighting factor related to the call state, so as to perform weighted filtering processing on the linear output data to obtain the third audio data to be sent to the first calling party, comprises:

determining a trade-off factor for controlling speech distortion according to the first state data and the second state data;

determining inter-frame Wiener filter coefficients according to the trade-off factor, the linear echo data, and the linear output data;

filtering the linear output data according to the Wiener filter coefficients to obtain first output audio data as the third audio data.

4. The audio data processing method according to claim 3, wherein the method further comprises:

determining a frequency band average gain of the first output audio data;

selecting a signal reduction amplitude value corresponding to the average gain of the frequency band;

performing a signal amplitude reduction operation on the first output audio data according to the signal reduction amplitude value.

5. The audio data processing method according to claim 1, wherein, before determining, according to the first state data and the second state data, the weighting factor related to the call state, so as to perform weighted filtering processing on the linear output data to obtain the third audio data to be sent to the first calling party, the method further comprises:

performing audio activity detection on the first audio data to determine whether the first audio data contains speech audio.

6. The audio data processing method according to claim 1, wherein, before determining, according to the first audio data and the second audio data, the first state data and the second state data for identifying the state of the audio call between the first calling party and the second calling party, the method further comprises:

performing delay alignment processing on the first audio data according to the second audio data.

7. An audio data processing method, comprising:

determining, according to the first audio data and the second audio data, first state data and second state data for identifying the state of an audio call between the first calling party and the second calling party, wherein the first state data is the average value of the correlation coefficients of the first audio data and the second audio data in each sub-band, and the second state data is the average value of the ratio of the linear echo data to the second audio data in each sub-band;

selecting, according to the first state data and the second state data, a signal reduction amplitude value corresponding to the first state data and the second state data;

performing a signal amplitude reduction operation on the linear output audio data according to the signal reduction amplitude value, so as to obtain audio data to be sent to the first calling party.

8. A calling method, comprising:

receiving first audio data;

playing the first audio data;

performing audio capture processing to generate second audio data, wherein the second audio data includes at least audio data collected when the first audio data is played;

performing linear filtering processing on the second audio data to obtain linear echo data;

determining linear output data according to the second audio data and the linear echo data;

determining, according to the first audio data and the second audio data, first state data and second state data for identifying the audio call state, wherein the first state data is the average value of the correlation coefficients of the first audio data and the second audio data in each sub-band, and the second state data is the average value of the ratio of the linear echo data to the second audio data in each sub-band;

determining, according to the first state data and the second state data, a weighting factor related to the call state, so as to perform weighted filtering processing on the linear output data to obtain third audio data;

outputting the third audio data to the calling party making the call.

9. An audio processing chip, comprising:

an audio receiving module for receiving the first audio data;

an audio output module for playing the first audio data;

a sound pickup module, configured to perform audio collection processing to generate second audio data, wherein the second audio data at least includes audio data collected by the sound pickup module when playing the first audio data;

a filtering module, configured to perform linear filtering processing on the second audio data to obtain linear echo data;

a processing module, configured to determine linear output data according to the second audio data and the linear echo data; determine, according to the first audio data and the second audio data, first state data and second state data for identifying the audio call state, wherein the first state data is the average value of the correlation coefficients of the first audio data and the second audio data in each sub-band, and the second state data is the average value of the ratio of the linear echo data to the second audio data in each sub-band; and determine a weighting factor related to the call state according to the first state data and the second state data,

wherein the filtering module is configured to perform weighted filtering processing on the linear output data to obtain third audio data, and

the audio output module is configured to output the third audio data to the calling party making the call.

10. An audio data processing device, comprising:

a filtering module, configured to perform linear filtering processing on first audio data sent by a first calling party and second audio data collected by a second calling party to obtain linear echo data, wherein the first calling party and the second calling party are in the same call activity;

a linear output module, configured to determine linear output data according to the second audio data collected by the second calling party and the linear echo data;

a state determination module, configured to determine, according to the first audio data and the second audio data, first state data and second state data for identifying the state of an audio call between the first calling party and the second calling party, wherein the first state data is the average value of the correlation coefficients of the first audio data and the second audio data in each sub-band, and the second state data is the average value of the ratio of the linear echo data to the second audio data in each sub-band;

a suppression module, configured to determine a weighting factor related to the call state according to the first state data and the second state data, so as to perform weighted filtering processing on the linear output data to obtain third audio data to be sent to the first calling party.

11. An electronic device comprising:

a memory for storing a program;

a processor, configured to run the program stored in the memory, wherein the method according to any one of claims 1 to 8 is executed when the program runs.

12. A computer-readable storage medium having stored thereon a computer program executable by a processor, wherein the program, when executed by the processor, implements the method of any one of claims 1 to 8.