CN115706756A - Abnormal echo delay identification method, device, terminal and storage medium

Info

Publication number: CN115706756A
Application number: CN202110936165.XA
Authority: CN
Inventors: 高毅; 罗程; 李斌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-08-16
Filing date: 2021-08-16
Publication date: 2023-02-17
Anticipated expiration: 2041-08-16
Also published as: CN115706756B

Abstract

The application discloses a method, device, terminal and storage medium for identifying abnormal echo delay, and relates to the technical field of communication. The method comprises: performing audio feature extraction on an input audio frame collected by a microphone to obtain a first audio feature; in response to a target delay being reached, determining a second audio feature from candidate audio features based on the first audio feature, wherein the candidate audio features are audio features corresponding to output audio frames, the output audio frames are used for loudspeaker playback, and the second audio feature matches the first audio feature; determining the echo delay of the output audio frame corresponding to the second audio feature; and, in response to the echo delay being less than the target delay, determining that an abnormal echo delay exists. This provides a way to detect negative delay during echo cancellation, which prevents the echo cancellation module from continuing to work with a wrong echo delay, or from being unable to cancel echo because the echo delay cannot be calculated, and can thereby improve the accuracy of echo cancellation.

Description

Abnormal echo delay identification method, device, terminal and storage medium

Technical field

Embodiments of the present application relate to the technical field of voice calls, and in particular to a method, device, terminal and storage medium for identifying abnormal echo delay.

Background

During a voice call on an audio terminal device, the sound played by the loudspeaker is easily picked up by the microphone, especially in hands-free mode, where playback is relatively loud. The sound played by the loudspeaker is then captured by the microphone and fed back to the far end, so the far-end talker hears their own voice; this echo seriously degrades call quality.

In the related art, audio terminal devices deploy a software or hardware echo cancellation module to cancel the echo captured by the microphone. Echo cancellation generally works as follows: the reference point signal and the receiving point signal are compared to estimate the echo delay between them; the reference point signal is delayed by that echo delay; the delayed reference point signal and the receiving point signal are then used together to calculate a transfer function. The transfer function is used to predict an echo replica, which is used to cancel the echo in the receiving point signal.

It follows from the process above that accurate echo delay estimation is a prerequisite for effective echo cancellation.

Summary of the invention

Embodiments of the present application provide a method, device, terminal and storage medium for identifying abnormal echo delay. The technical solution is as follows:

According to one aspect of the present application, a method for identifying abnormal echo delay is provided, the method comprising:

performing audio feature extraction on an input audio frame collected by a microphone to obtain a first audio feature;

in response to a target delay being reached, determining a second audio feature from candidate audio features based on the first audio feature, wherein the candidate audio features are audio features corresponding to output audio frames, the output audio frames are used for loudspeaker playback, and the second audio feature matches the first audio feature;

determining the echo delay of the output audio frame corresponding to the second audio feature;

in response to the echo delay being less than the target delay, determining that an abnormal echo delay exists.

According to another aspect of the present application, a device for identifying abnormal echo delay is provided, the device comprising:

a feature extraction module, configured to perform audio feature extraction on an input audio frame collected by a microphone to obtain a first audio feature;

a first determination module, configured to determine, in response to a target delay being reached, a second audio feature from candidate audio features based on the first audio feature, wherein the candidate audio features are audio features corresponding to output audio frames, the output audio frames are used for loudspeaker playback, and the second audio feature matches the first audio feature;

a second determination module, configured to determine the echo delay of the output audio frame corresponding to the second audio feature;

a third determination module, configured to determine, in response to the echo delay being less than the target delay, that an abnormal echo delay exists.

According to another aspect of the present application, a terminal is provided. The terminal includes a processor and a memory; the memory stores at least one program, and the at least one program is loaded and executed by the processor to implement the abnormal echo delay identification method described in the above aspect.

According to another aspect of the present application, a computer-readable storage medium is provided. The storage medium stores at least one program, and the at least one program is loaded and executed by a processor to implement the abnormal echo delay identification method described in the above aspect.

According to another aspect of the present application, a computer program product or computer program is provided. The computer program product or computer program includes computer instructions stored in a computer-readable storage medium. A processor of a terminal reads the computer instructions from the computer-readable storage medium and executes them, causing the terminal to perform the abnormal echo delay identification method provided in the optional implementations above.

The beneficial effects of the technical solutions provided by the embodiments of the present application include at least the following:

The embodiments of the present application start from the observation that, when negative delay occurs during echo cancellation, the output audio frame matching an input audio frame can only be found some time after the input audio frame is received at the receiving point. On that basis, a way of detecting negative delay during echo cancellation is proposed: after the first audio feature corresponding to the input audio frame is obtained, feature matching based on the first audio feature is performed only after waiting for the target delay, so that the echo delay can still be estimated under negative delay. Because matching is delayed, the calculated echo delay is the sum of the target delay and the transmission delay, so the sign of the transmission delay can be determined from the relationship between the echo delay and the target delay. Whether an abnormal echo delay (negative delay) exists can therefore be determined in time, which prevents the echo cancellation module from continuing to work with a wrong echo delay, or from being unable to cancel echo because the echo delay cannot be calculated, and can thereby improve the accuracy of echo cancellation.

Description of the drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 shows a schematic structural diagram of an echo cancellation system in the related art;

Fig. 2 shows a schematic structural diagram of an echo cancellation system according to an exemplary embodiment of the present application;

Fig. 3 shows a flowchart of the abnormal echo delay identification method provided by an exemplary embodiment of the present application;

Fig. 4 shows a schematic diagram of the process of determining the echo delay according to an exemplary embodiment of the present application;

Fig. 5 shows a flowchart of the abnormal echo delay identification method provided by another exemplary embodiment of the present application;

Fig. 6 shows a working diagram of the delay estimation process and the abnormal delay detection process according to an exemplary embodiment of the present application;

Fig. 7 shows a schematic diagram of the audio feature extraction process according to an exemplary embodiment of the present application;

Fig. 8 shows a schematic diagram of how the feature storage area implements the delay function according to an exemplary embodiment of the present application;

Fig. 9 shows a schematic diagram of the delayed feature matching process according to an exemplary embodiment of the present application;

Fig. 10 shows a schematic diagram of the feature matching process according to an exemplary embodiment of the present application;

Fig. 11 shows a schematic diagram of the delay estimation process and the abnormal delay detection process according to another exemplary embodiment of the present application;

Fig. 12 shows a schematic diagram of audio feature extraction in the delay estimation process according to an exemplary embodiment of the present application;

Fig. 13 shows a flowchart of the abnormal echo delay identification method provided by another exemplary embodiment of the present application;

Fig. 14 is a structural block diagram of the abnormal echo delay identification device provided by an exemplary embodiment of the present application;

Fig. 15 shows a structural block diagram of the terminal provided by an exemplary embodiment of the present application.

Detailed description

To make the purpose, technical solutions and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.

During a voice call on an audio terminal device, the sound played by the loudspeaker is easily picked up again by the device's microphone, especially in speakerphone mode, where playback is loud. The played sound is captured by the microphone and sent back to the far end, so the far-end talker hears their own voice; this echo seriously degrades call quality. Audio terminal devices (call devices) therefore generally include a software or hardware echo cancellation system, which cancels the echo in the sound captured by the microphone. Fig. 1 shows a schematic structural diagram of an echo cancellation system in the related art. During a voice call, the audio signal that the terminal receives from the far end passes through a reference point before being sent to the loudspeaker for playback; the audio signal collected at the reference point is generally called the reference point signal. The reference point signal is sent through the software and hardware playback logic to the loudspeaker, propagates through air or another medium into the microphone, and then passes through the software and hardware capture logic to reach the receiving point; the audio signal obtained at the receiving point is called the receiving point signal.

On its way to the receiving point, the reference point signal therefore incurs three delays: the software and hardware playback delay (the time from the reference point to the loudspeaker), the acoustic path delay (the time from the loudspeaker through air or another transmission medium to the microphone), and the software and hardware capture delay (the time from the microphone through the software and hardware capture logic to the receiving point).

As shown in Fig. 1, to implement echo cancellation, the audio terminal device (call device) is provided with a delay estimation module 101, a delay alignment module 102 and an echo cancellation module 103. The delay estimation module 101 compares the reference point signal with the receiving point signal to estimate the echo delay (Tde) from the reference point signal to the receiving point signal, and passes Tde to the delay alignment module 102. The delay alignment module 102 delays the reference point signal by the Tde estimated by the delay estimation module 101 and sends the delayed reference point signal to the echo cancellation module 103. The echo cancellation module 103 estimates the transfer function of the echo path from the delayed reference point signal and the receiving point signal. When a new reference point signal then arrives at the reference point, an echo replica of the echo corresponding to that signal can be predicted from the new reference point signal and the transfer function, and subtracting the echo replica from the receiving point signal cancels the echo in the receiving point signal.
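The align-then-cancel flow of Fig. 1 can be sketched in a few lines. The source does not name an adaptation algorithm, so a normalized least-mean-squares (NLMS) filter is used here purely as a stand-in for "estimating the transfer function"; the function names, the 4-tap filter, the step size, the 5-sample delay and the 0.6 echo gain are all illustrative assumptions.

```python
import random

def delay_signal(signal, delay_samples):
    """Shift a signal later in time by delay_samples, zero-padding the start."""
    return [0.0] * delay_samples + signal[:len(signal) - delay_samples]

def nlms_cancel(aligned_ref, received, taps=4, step=0.5, eps=1e-8):
    """Subtract an adaptively predicted echo replica from the receiving point signal.

    NLMS is an assumed stand-in; the source only says a transfer function is estimated.
    """
    weights = [0.0] * taps            # running estimate of the echo-path transfer function
    residual = []
    for n, sample in enumerate(received):
        # Most recent `taps` samples of the delay-aligned reference signal.
        window = [aligned_ref[n - k] if n >= k else 0.0 for k in range(taps)]
        replica = sum(w * x for w, x in zip(weights, window))   # predicted echo
        error = sample - replica                                # echo-cancelled output
        energy = sum(x * x for x in window) + eps
        weights = [w + step * error * x / energy for w, x in zip(weights, window)]
        residual.append(error)
    return residual, weights

# Synthetic demo: the "echo" is the reference attenuated to 0.6 and delayed by TDE samples.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
TDE = 5  # hypothetical echo delay in samples, as the delay estimation module would report
received = [0.6 * x for x in delay_signal(reference, TDE)]
residual, weights = nlms_cancel(delay_signal(reference, TDE), received)
```

With the correct delay the first filter weight settles near the echo gain and the residual decays towards zero. If `TDE` were wrong by more than the filter length, the aligned reference would no longer overlap the echo, which is why the delay estimate matters so much for the cancellation step.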

As the echo cancellation system of Fig. 1 shows, whether the delay estimation module 101 can accurately estimate the echo delay Tde affects the accuracy of echo cancellation. Delay estimation can be affected by how the operating system schedules the microphone capture thread and the loudspeaker playback thread, by application stability, and by other factors, so the echo delay from the reference point to the receiving point is not fixed. For example, when the thread that reads the receiving point signal stalls, the receiving point obtains newly captured receiving point signal ahead of time, and no matching audio feature can be found in the feature storage area that stores the audio features of the reference point signal. The echo delay Tde then cannot be estimated, or cannot be calculated accurately. This is the negative delay case, and it seriously degrades echo cancellation or makes it impossible.

Detecting effectively and promptly whether the echo cancellation system has an abnormal echo delay is therefore key to improving the accuracy of echo cancellation: it prevents the system from cancelling echo with a wrong delay, or from being unable to cancel echo at all, and allows effective measures to be taken in time to remove the abnormal delay. To address this problem, Fig. 2 shows a schematic structural diagram of an echo cancellation system according to an exemplary embodiment of the present application. The echo cancellation system mainly comprises an abnormality monitoring module 201, a delay estimation module 202, a delay alignment module 203 and an echo cancellation module 204. Compared with Fig. 1, the present application adds the abnormality monitoring module 201 to the echo cancellation system. The abnormality monitoring module 201 monitors for possible abnormal delay (negative delay) based on the reference point signal and the receiving point signal. When negative delay is detected, it can send a reset instruction to the delay estimation module, so that the delay estimation module 202 and the delay alignment module 203 can both reset their algorithms and clear their caches, and echo delay estimation can proceed normally in the subsequent echo cancellation process.

Please refer to Fig. 3, which shows a flowchart of the abnormal echo delay identification method provided by an exemplary embodiment of the present application. This embodiment is described using the example of the method being applied to a terminal. The method includes:

Step 301: perform audio feature extraction on an input audio frame collected by the microphone to obtain a first audio feature.

The input audio frame is collected by the microphone. During capture, the input audio signal (input audio frame) collected by the microphone is first stored in the buffer of the software and hardware capture logic, and an application thread is then invoked to read the input audio frame from that buffer.

As shown in Fig. 2, the reference point signal reaches the receiving point only after passing through the software and hardware playback logic, propagation through air or another medium, and the software and hardware capture logic. The same audio signal is therefore not identical at the reference point and at the receiving point, but the two versions should be the most similar to each other. In one possible implementation, audio feature matching is used to determine which audio frame of the reference point signal is most similar to the input audio frame: audio feature extraction is performed on the input audio frame to obtain the first audio feature, and the first audio feature is matched against the audio features corresponding to historical output audio frames (the reference point signal) to find the audio feature most similar to the first audio feature, which is then used for echo delay estimation.

Because the first audio feature is used for audio frame matching, a feature that is distinctive of the audio signal is generally chosen to improve matching accuracy. Illustratively, the first audio feature may be Mel-frequency cepstral coefficients (MFCC), Fourier coefficients, linear prediction coefficients (LPC), an audio feature based on spectral energy, and so on.
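As one illustration of a spectral-energy feature, the sketch below sums the power of a frame's DFT bins into a few frequency bands. The source lists spectral energy only as one option and does not fix the details, so the function name, the four equal-width bands and the 64-sample frame are assumptions; a naive DFT is used so that the example needs only the standard library.

```python
import cmath
import math

def band_energy_feature(frame, num_bands=4):
    """Return the spectral energy of one audio frame in num_bands equal-width bands.

    Uses a naive O(N^2) DFT over the bins below the Nyquist frequency.
    """
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(acc) ** 2)
    band_size = half // num_bands
    return [sum(power[b * band_size:(b + 1) * band_size]) for b in range(num_bands)]

# Illustrative frame: a pure tone that falls exactly on DFT bin 5 of a 64-sample frame.
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
feature = band_energy_feature(frame)
```

For this tone nearly all of the energy lands in the lowest band, so `feature` is dominated by its first element. A production system would use an FFT and probably a perceptual band layout; the point here is only that each frame is reduced to a short vector that can be compared cheaply.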

Step 302: in response to the target delay being reached, determine a second audio feature from candidate audio features based on the first audio feature, where the candidate audio features are audio features corresponding to output audio frames, the output audio frames are used for loudspeaker playback, and the second audio feature matches the first audio feature.

During echo cancellation, the application thread that reads input audio frames from the buffer (the buffer of the software and hardware capture logic) may stall. The buffer is first-in, first-out and has limited capacity. If the reading thread stalls while the writing thread keeps running, then once the buffer is full, newly captured input audio frames overwrite the older input audio frames stored in it. When the stall ends, the new input audio frames are read directly, so they reach the receiving point ahead of time. The stall also stops the candidate audio features of historical output audio frames from being written to the historical feature storage area, so no output audio frame matching the current input audio frame can be found there. As a result, the delay cannot be estimated or is estimated wrongly. This is the negative delay case: the audio feature of the output audio frame corresponding to the current input audio frame is written to the historical feature storage area only some time after the current input audio frame has been received. To detect negative delay accurately, that is, to be able to calculate the echo delay even when negative delay occurs, in one possible implementation the first audio feature extracted from the input audio frame is not matched immediately. Instead, feature matching based on the first audio feature is performed after waiting for the target delay, so that a matching candidate audio feature can be found.
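The overwrite behaviour of the capture buffer during a reader stall can be illustrated with a bounded FIFO. This is a toy model only: the capacity of 4 frames and the 10-frame stall are hypothetical numbers, and frames are represented by their indices.

```python
from collections import deque

CAPTURE_BUFFER_FRAMES = 4   # hypothetical capacity of the capture buffer
STALL_FRAMES = 10           # hypothetical number of frames written while the reader is stalled

# A deque with maxlen drops its oldest entry when a new one is appended to a full buffer,
# which models new input frames overwriting older ones.
capture_buffer = deque(maxlen=CAPTURE_BUFFER_FRAMES)

# The writer keeps running while the reader thread is stalled.
for frame_index in range(STALL_FRAMES):
    capture_buffer.append(frame_index)

# When the reader resumes, the oldest frame still available is not frame 0.
first_frame_read = capture_buffer[0]
frames_lost = first_frame_read   # frames 0 .. first_frame_read - 1 were overwritten
```

After the stall the reader jumps straight to frame 6, having skipped six frames. From the receiving point's perspective the input stream has leapt forward in time, while the reference-side feature history has not kept pace, which is the mismatch the text calls negative delay.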

The target delay can be set by the developer, or by operations staff according to actual conditions. Because the purpose of the target delay is to ensure that a second audio feature matching the first audio feature can still be found when negative delay occurs, the target delay must be greater than or equal to the maximum negative delay that can occur in the operating system. Illustratively, if the maximum negative delay is 100 ms, the target delay may be 120 ms. The embodiments of the present application do not limit the specific value of the target delay.
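Since matching operates on whole frames, the target delay is naturally expressed as a frame count. The sketch below reproduces the 100 ms / 120 ms example; the helper name, the explicit 20 ms margin and the 10 ms frame duration are assumptions, as the source gives only the two millisecond values.

```python
import math

def target_delay_frames(max_negative_delay_ms, margin_ms, frame_ms):
    """Smallest whole number of frames covering the maximum negative delay plus a margin."""
    return math.ceil((max_negative_delay_ms + margin_ms) / frame_ms)

FRAME_MS = 10                                   # assumed frame duration
frames = target_delay_frames(100, 20, FRAME_MS)
target_delay_ms = frames * FRAME_MS
```

Rounding up guarantees that the target delay never falls below the maximum negative delay when the frame duration does not divide it evenly.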

Optionally, a candidate audio feature is the audio feature corresponding to an output audio frame, and the output audio frame is obtained before it is sent to the loudspeaker for playback. That is, when the output audio signal passes the reference point, it is stored in a buffer. Each time the receiving point receives one input audio frame, one historical output audio frame is read from the buffer, audio feature extraction is performed on it, and the resulting candidate audio feature is stored in the historical feature storage area, so that a second audio feature matching the first audio feature can later be looked up there.

Step 303: determine the echo delay of the output audio frame corresponding to the second audio feature.

In one possible implementation, once a second audio feature matching the first audio feature has been determined from the candidate audio features, the input audio frame is taken to be the audio signal that the output audio frame corresponding to the second audio feature produces when it reaches the receiving point through the echo path. During the echo delay, the storage position of the second audio feature in the historical feature storage area shifts over time, so the echo delay of the output audio frame corresponding to the second audio feature can be determined from that storage position.

Illustratively, Fig. 4 shows a schematic diagram of the process of determining the echo delay according to an exemplary embodiment of the present application. The output audio frame Q at the reference point reaches the receiving point after a transmission delay Td, where it is obtained as the input audio frame Q'. Because of distortion and noise in the transmission channel, Q and Q' are different but similar signals. Audio feature extraction on the input audio frame Q' at the receiving point yields the first audio feature G. Suppose the output audio frame Q is the 80th audio frame; audio feature extraction on Q yields the second audio feature F(80), which is stored at the tail position of the historical feature storage area. During the transmission delay Td, the reference point acquires 31 new output audio frames, and the 31 candidate audio features calculated from them are put into the historical feature storage area in turn. Thus, by the time the receiving point obtains the first audio feature G corresponding to the input audio frame Q', the historical feature storage area has newly stored 31 candidate audio features, F(81) to F(111). Because of the target delay, feature matching between G and the stored candidate audio features is performed only after the target delay has elapsed, by which time a further 30 candidate audio features, F(112) to F(141), have been stored. The echo delay at that moment is therefore the length of the 61 output audio frames between F(80) and F(141), and can be calculated from the sampling interval of each output audio frame and the number of frames.
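The position-based calculation in the Fig. 4 example can be sketched as a nearest-feature search followed by a frame-count conversion. The source says only that the matching feature is the most similar one, so the squared Euclidean distance, the two-element feature vectors and the 10 ms frame duration below are assumptions chosen to keep the example small.

```python
def find_best_match(first_feature, history):
    """Index of the candidate feature closest to first_feature (squared Euclidean distance)."""
    def distance(candidate):
        return sum((a - b) ** 2 for a, b in zip(first_feature, candidate))
    return min(range(len(history)), key=lambda i: distance(history[i]))

def echo_delay_ms(match_index, newest_index, frame_ms):
    """Echo delay as the number of frames stored since the match, times the frame duration."""
    return (newest_index - match_index) * frame_ms

FRAME_MS = 10   # assumed duration of one output audio frame

# history[i] is a made-up feature for output frame i; frames 0..141 have been stored so far.
history = [[float(i), float(i % 7)] for i in range(142)]

# First audio feature G: a slightly distorted copy of F(80) = [80.0, 3.0].
first_feature = [80.2, 3.1]

match_index = find_best_match(first_feature, history)
delay_ms = echo_delay_ms(match_index, len(history) - 1, FRAME_MS)
```

The search picks frame 80, and with F(141) as the newest entry the echo delay is 61 frames, or 610 ms at the assumed frame duration.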

Step 304: in response to the echo delay being less than the target delay, determine that an abnormal echo delay exists.

In the present application, audio feature matching is performed only after waiting for the target delay once the input audio frame has been received at the receiving point. When the operating system is running normally (that is, there is no negative delay), the estimated echo delay should therefore be the sum of the transmission delay and the target delay, where the transmission delay is the delay from the reference point signal to the receiving point signal. Under normal conditions the transmission delay is positive, so the echo delay must be greater than the target delay. Conversely, if the echo delay is less than the target delay, the transmission delay is negative, that is, negative delay exists. In one possible implementation, when the echo delay is determined to be less than the target delay, the system is taken to have negative delay caused by abnormal thread invocation, that is, an abnormal echo delay exists. The delay estimation module then cannot estimate the transmission delay accurately, and the delay estimation algorithm needs to be reset and its cache cleared to remove the negative delay.
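The decision rule in step 304 reduces to one subtraction and one comparison. In the sketch below the function name and the two sample echo delays (150 ms and 90 ms) are hypothetical; the 120 ms target delay is the example value given for step 302.

```python
def check_echo_delay(echo_delay_ms, target_delay_ms):
    """Return (transmission_delay_ms, is_abnormal) for one echo delay estimate.

    The estimate is modelled as target delay + transmission delay, so an estimate
    below the target delay implies a negative transmission delay.
    """
    transmission_delay_ms = echo_delay_ms - target_delay_ms
    is_abnormal = echo_delay_ms < target_delay_ms
    return transmission_delay_ms, is_abnormal

TARGET_DELAY_MS = 120

normal = check_echo_delay(150, TARGET_DELAY_MS)     # transmission delay +30 ms
abnormal = check_echo_delay(90, TARGET_DELAY_MS)    # transmission delay -30 ms
```

In a full system the abnormal result would trigger the reset of the delay estimation algorithm and its cache described above.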

To sum up, in the embodiments of the present application, when a negative delay occurs during echo cancellation, the receiving point must wait for a period of time after receiving an input audio frame before the matching output audio frame can be found. Based on this characteristic, a way of detecting negative delay during echo cancellation is proposed: after the first audio feature corresponding to the input audio frame is obtained, feature matching based on the first audio feature is performed only after the target delay has elapsed, so that the echo delay can still be estimated when a negative delay exists. Because delayed feature matching makes the calculated echo delay equal to the sum of the target delay and the transmission delay, the sign of the transmission delay can be determined from the relationship between the echo delay and the target delay. Whether an abnormal echo delay (negative delay) exists can thus be determined in time, which prevents the echo cancellation module from continuing to cancel echo under a wrong echo delay, or from being unable to cancel echo because the echo delay cannot be calculated, and thereby improves the accuracy of echo cancellation.

After the receiving point obtains the first audio feature corresponding to an input audio frame, feature matching is not performed on it immediately, so as to avoid being unable to calculate the echo delay when a negative delay exists; feature matching is performed only after the target delay is reached. During the target delay, the receiving point continues to receive new input audio frames and to extract audio features from them. So that the historical first audio feature is still available for feature matching after the target delay, in a possible implementation a feature memory is added for storing the first audio features corresponding to input audio frames, and the delay of feature matching is implemented through this feature memory.

In an exemplary example, FIG. 5 shows a flowchart of an abnormal echo delay identification method provided by another exemplary embodiment of the present application. This embodiment is described by taking application of the method to a terminal as an example. The method includes:

Step 501, perform time-frequency conversion and frequency band division on the input audio frame collected by the microphone to determine M subbands.

It should be noted that, in the embodiments of the present application, an abnormality monitoring module (abnormal delay detection module) is added to the existing echo cancellation system, and both the abnormality monitoring module and the existing delay estimation module need to perform audio feature extraction. In one possible scenario, the audio features extracted by the abnormality monitoring module are the same as those extracted by the existing delay estimation module; optionally, the audio features extracted by the abnormality monitoring module may also differ from those of the existing delay estimation module.

When the audio features extracted by the abnormality monitoring module differ from those extracted by the existing delay estimation module, separate feature storage areas need to be allocated for them. Correspondingly, at least two feature storage areas need to be added for storing audio features: the first feature storage area stores the audio features corresponding to input audio frames, and the second feature storage area stores the audio features corresponding to output audio frames.

Exemplarily, FIG. 6 shows a working diagram of the delay estimation process and the abnormal delay detection process according to an exemplary embodiment of the present application. As can be seen from FIG. 6, the feature storage area used for the audio features extracted in the delay estimation process 601 differs from those used in the abnormal delay detection process 602: the delay estimation process 601 corresponds to feature storage area 1, whereas the abnormal delay detection process 602 corresponds to feature storage area 3 and feature storage area 4.

In the delay estimation process 601, feature extraction module 1 performs audio feature extraction on the reference-point signal (output audio frame) and stores the extracted candidate audio features in feature storage area 1. When the receiving-point signal is obtained, feature extraction module 2 performs audio feature extraction on the receiving-point signal (input audio frame) and sends the extracted first audio feature to feature matching module 1 for feature matching. Specifically, the matching second audio feature is searched for in feature storage area 1 based on the first audio feature, and the transmission delay Tde is then determined according to delay determination strategy 1.

In the abnormal delay detection process 602, feature extraction module 3 performs audio feature extraction on the reference-point signal (output audio frame) and stores the extracted candidate audio features in feature storage area 3, while feature extraction module 4 performs audio feature extraction on the receiving-point signal (input audio frame) and stores the extracted first audio feature in feature storage area 4. After the target delay, feature matching module 3 reads the first audio feature from feature storage area 4 and searches feature storage area 3 for the matching second audio feature, and the echo delay Tde3 is then calculated according to delay determination strategy 3. The abnormal delay detection stage judges whether the echo delay Tde3 is an abnormal echo delay; if so, each module in the delay estimation process is reset.

The audio features extracted in this embodiment are frequency-domain features. In a possible implementation, time-frequency conversion and frequency band division are first performed on the input audio frame to obtain the required M subbands, and the corresponding first audio feature is then extracted based on these M subbands.

Exemplarily, the value of M can be set by the developer. Because the first audio feature needs to be stored in the feature storage area, M can be chosen as a value that is convenient for subsequent storage and computation, so as to improve computational efficiency. Exemplarily, M is 36: 36 subbands yield an audio feature of 32 binary values, and 32 bits fit exactly in one 32-bit integer variable, which improves storage and computational efficiency.

Optionally, a short-time Fourier transform can be used to convert the input audio frame to the frequency domain to obtain its spectrum signal, and frequency band division is then performed on the spectrum signal. The band division can be linear, or it can follow the Mel scale according to psychoacoustic theory, yielding M subbands.

Optionally, in a voice call scenario, the sampling rate of the reference-point signal and the receiving-point signal is 16 kHz, 32 kHz or the like. Taking 32 kHz as an example, according to the Nyquist sampling theorem the effective bandwidth of speech sampled at 32 kHz is 16 kHz. The speech bandwidth needed for audio feature extraction does not have to be that high, because in practical voice communication systems the frequency range that reliably characterizes speech is roughly 300 Hz to 3 kHz, so an actual bandwidth of about 3 kHz is sufficient. Therefore, to reduce the amount of computation, the high-sampling-rate audio frames at the reference point or the receiving point are down-sampled to about 6 kHz during audio feature extraction (a 6 kHz sampling rate corresponds to an effective audio bandwidth of 3 kHz), and the subsequent feature extraction is then performed.

Exemplarily, FIG. 7 shows a schematic diagram of the audio feature extraction process according to an exemplary embodiment of the present application. The down-sampled input audio frame is converted to the frequency domain by a time-frequency conversion algorithm (for example, a short-time Fourier transform) to obtain the spectrum signal of the input audio frame; the spectrum signal then undergoes frequency band division, which divides the spectrum into 36 subbands (subband 1 to subband 36).
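The time-frequency conversion and linear band division described above can be sketched as follows. This is only an illustrative sketch using a naive discrete Fourier transform from the Python standard library; the function names, the 72-sample frame length, and the absence of windowing are assumptions, and a real implementation would typically use a windowed FFT.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0 .. len(frame) // 2 (inclusive)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def subband_energies(spectrum, num_bands=36):
    """Split the spectrum linearly into num_bands groups and sum each group."""
    if len(spectrum) < num_bands:
        raise ValueError("need at least one spectral bin per subband")
    # Integer edges; strictly increasing because len(spectrum) >= num_bands.
    edges = [b * len(spectrum) // num_bands for b in range(num_bands + 1)]
    return [sum(spectrum[edges[b]:edges[b + 1]]) for b in range(num_bands)]


# Hypothetical 72-sample frame holding a pure tone at DFT bin 10.
frame = [math.cos(2 * math.pi * 10 * t / 72) for t in range(72)]
energies = subband_energies(power_spectrum(frame))
print(len(energies), energies.index(max(energies)))  # 36 10
```

The printed index is 0-based, so index 10 corresponds to subband 11 in the 1-based numbering used in the text.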

Step 502, perform frequency-domain energy comparison on the subband energies of the M subbands to obtain N first frequency-domain feature scores, where N is a positive integer and M-N is a positive integer.

In a possible implementation, after time-frequency conversion and frequency band division are performed on the input audio frame to obtain the M subbands, the subband energy of each subband is calculated, and frequency-domain energy comparison is then performed on the subband energies to obtain N frequency-domain feature scores.

The specific value of N can be set by the developer. To improve storage and computational efficiency, N can be 32, because 32 bits fit exactly in a 32-bit integer variable; optionally, N can also be 16, 64 or another value.

Optionally, the process of calculating the frequency-domain feature scores may include the following steps; that is, step 502 may include step 502A and step 502B.

Step 502A, in response to the j-th subband energy being the maximum among the subband energies of the (j-i)-th to (j+i)-th subbands, determine the first score as the first frequency-domain feature score corresponding to the j-th subband. The j-th subband energy is the subband energy of the j-th subband, where i is a positive integer, j-i is a positive integer, and j+i is less than or equal to M.

In a possible implementation, when calculating a frequency-domain feature score, the current subband energy is compared with the two adjacent subband energies on each side of it; if the current subband energy is the maximum, the binary value 1 is output, and otherwise the binary value 0 is output. In other words, for the j-th subband, if the j-th subband energy is the maximum among the subband energies of the (j-i)-th to (j+i)-th subbands, the first score is determined as the first frequency-domain feature score corresponding to the j-th subband. Exemplarily, the first score may be 1.

The value of i can be set by the developer. Exemplarily, if i is 1, the current subband energy is compared with one adjacent subband energy on each side; if i is 2, the current subband energy is compared with two adjacent subband energies on each side.

Step 502B, in response to the j-th subband energy being less than the maximum among the subband energies of the (j-i)-th to (j+i)-th subbands, determine the second score as the first frequency-domain feature score corresponding to the j-th subband.

Conversely, if the j-th subband energy is less than the maximum among the subband energies of the (j-i)-th to (j+i)-th subbands, the second score is determined as the first frequency-domain feature score corresponding to the j-th subband. Exemplarily, the second score may be 0.

As shown in FIG. 7, after the M subbands are obtained, energy is calculated for each subband, giving 36 subband energies (subband energy 1 to subband energy 36). With i equal to 2, each comparator compares the current subband energy with the two adjacent subband energies on each side. For example, comparator 1 compares subband energy 3 with its two neighbors on each side: if subband energy 3 is the maximum among subband energies 1 to 5, comparator 1 sets the frequency-domain feature score corresponding to subband energy 3 to 1, that is, comparator 1 outputs G(n, 1); otherwise comparator 1 outputs G(n, 0). Likewise, when subband energy 4 is the maximum among subband energies 2 to 6, comparator 2 outputs G(n, 1). Here n denotes the n-th audio frame.
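The comparator bank of FIG. 7 can be sketched as a sliding local-maximum test whose outputs are packed into one integer. The sketch below is an illustration under assumptions: the function name `binary_feature`, the bit order (the first comparator output becomes the most significant bit), and the synthetic energies are not specified by the application.

```python
def binary_feature(energies, i=2):
    """Pack local-maximum flags of subband energies into one integer.

    Returns (feature, number_of_bits). With 36 energies and i = 2 this
    yields 32 bits, one per comparator.
    """
    feature = 0
    bits = 0
    # 0-based centre index; the text's subband j corresponds to index j - 1.
    for j in range(i, len(energies) - i):
        window = energies[j - i:j + i + 1]
        bit = 1 if energies[j] == max(window) else 0
        feature = (feature << 1) | bit  # first comparator ends up as the MSB
        bits += 1
    return feature, bits


# Synthetic strictly decreasing energies with a single peak at subband 3.
energies = [36 - k for k in range(36)]
energies[2] = 100
feature, bits = binary_feature(energies)
print(hex(feature), bits)  # 0x80000000 32
```

Only comparator 1 fires in this example, so only the most significant of the 32 bits is set.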

Step 503, determine the set of the N first frequency-domain feature scores as the first audio feature.

In a possible implementation, energy comparison is performed on the subband energies of the M subbands to obtain N first frequency-domain feature scores, and the set of these N first frequency-domain feature scores is the first audio feature corresponding to the input audio frame.

Exemplarily, if N is 32, the first audio feature is a set of 32 binary values.

It should be noted that the feature extraction process for output audio frames can refer to the feature extraction process for input audio frames, and is not repeated in this embodiment.

Step 504, store the first audio feature at the tail storage position of the first feature storage area. The first storage capacity of the first feature storage area is determined by the target delay, and the storage position of the first audio feature moves over time from the tail storage position toward the head storage position.

In a possible implementation, a first feature storage area is provided for storing the first audio features corresponding to input audio frames, and it is a first-in-first-out storage area. That is, a newly added audio feature is stored at the tail of the first feature storage area, and each time the first audio feature of a new input audio frame is added, the first feature storage area deletes the first audio feature of one input audio frame from its head.

So that the first feature storage area can delay feature matching by the target delay, in a possible implementation the first storage capacity of the first feature storage area is determined by the target delay. That is, feature matching is performed when the first audio feature has moved over time from the tail storage position of the first feature storage area to the head storage position, so that the first audio feature is matched exactly one target delay after it was stored.

Optionally, the first storage capacity is determined by the target delay and the sampling time interval between adjacent input audio frames. Exemplarily, if the target delay is 120 ms and each input audio frame is followed by the next one after 4 ms, the first feature storage area needs to delay the first audio feature by 30 frames before feature matching; correspondingly, the first feature storage area needs to store the first audio features of 31 input audio frames.

In a possible implementation, after the first audio feature corresponding to the input audio frame is extracted, feature matching is not performed on it directly; instead, the first audio feature is stored at the tail storage position of the first feature storage area. During the target delay, new first audio features are continuously added to the first feature storage area, and as they are added the storage position of the first audio feature moves from the tail storage position toward the head storage position. In other words, the storage position of the first audio feature moves over time from the tail toward the head, and when the first audio feature reaches the head storage position, the target delay is determined to have been reached.

Exemplarily, FIG. 8 shows a schematic diagram of how the feature storage area implements the delay function according to an exemplary embodiment of the present application. Before the first audio feature G(130) corresponding to the 130th input audio frame is written, the first feature storage area holds the first audio features of 31 input audio frames, G(99) to G(129). When the first audio feature G(130) of the 130th frame is obtained, G(130) is written to the tail storage position (tail) of the first feature storage area, and the first audio features stored in the first feature storage area are then G(100) to G(130). During the target delay, G(130) moves over time from the tail storage position toward the head storage position; when G(130) reaches the head storage position, the target delay is determined to have been reached, and the first audio features stored in the first feature storage area are then G(130) to G(160).
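The delay behavior of FIG. 8 can be reproduced with a fixed-capacity first-in-first-out buffer. The sketch below uses `collections.deque` with `maxlen`; the class name `DelayedFeatureBuffer` and its `push` method are hypothetical names chosen for illustration, and frame numbers stand in for the actual features.

```python
from collections import deque


class DelayedFeatureBuffer:
    """FIFO whose head element has waited exactly the target delay."""

    def __init__(self, target_delay_ms, frame_interval_ms):
        delay_frames = target_delay_ms // frame_interval_ms  # e.g. 120 // 4 = 30
        # One extra slot so the head is delay_frames behind the tail.
        self._slots = deque(maxlen=delay_frames + 1)

    def push(self, feature):
        """Store the newest feature at the tail.

        Returns the feature now at the head once the buffer is full,
        otherwise None (not enough history yet).
        """
        self._slots.append(feature)  # a full deque drops its head automatically
        if len(self._slots) == self._slots.maxlen:
            return self._slots[0]
        return None


buf = DelayedFeatureBuffer(target_delay_ms=120, frame_interval_ms=4)
heads = {n: buf.push(n) for n in range(99, 161)}
print(heads[130], heads[160])  # 100 130
```

After G(130) is written the head holds G(100), and G(130) itself reaches the head when G(160) is written, which matches the storage contents given for FIG. 8.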

Step 505, in response to the first audio feature moving to the head storage position of the first feature storage area, determine the second audio feature from the candidate audio features based on the first audio feature.

In a possible implementation, when the first audio feature has moved to the head storage position of the first feature storage area, the target delay is determined to have been reached, based on the relationship between the first storage capacity of the first feature storage area and the target delay, and the second audio feature can then be determined from the candidate audio features based on the first audio feature.

The process of determining the second audio feature from the candidate audio features based on the first audio feature (the feature matching process) may include step one and step two.

Step one: perform feature matching between the first audio feature and the candidate audio features to obtain at least one candidate matching score. A candidate matching score indicates the degree of matching between the first audio feature and a candidate audio feature, and the candidate matching score is negatively correlated with the degree of matching.

In a possible implementation, after the target delay has elapsed, feature matching is performed between the first audio feature and each candidate audio feature currently stored in the historical feature storage area (which stores the candidate audio features corresponding to output audio frames) to obtain at least one candidate matching score. The second audio feature that matches the first audio feature is then determined from the candidate audio features according to the candidate matching scores.

The number of candidate matching scores is determined by the storage capacity of the historical feature storage area. Exemplarily, if the historical feature storage area stores the candidate audio features of 75 output audio frames, the first audio feature is matched against each of these 75 candidate audio features, yielding 75 candidate matching scores.

Exemplarily, FIG. 10 shows a schematic diagram of the feature matching process according to an exemplary embodiment of the present application. When, after the target delay, the matching second audio feature is searched for among the candidate audio features based on the first audio feature G(80), the historical feature storage area holds 75 candidate audio features, F(67) to F(141). G(80) is matched against each of the 75 candidate audio features, yielding 75 candidate matching scores: S(1) to S(75).

As can be seen from the above embodiments, each audio feature contains N frequency-domain feature scores. Correspondingly, during feature matching a matching operation needs to be performed on each frequency-domain feature score. In an exemplary example, step one may further include step 1 and step 2 (that is, the process of determining a candidate matching score may include step 1 and step 2).

Step 1: perform a matching operation on the k-th first frequency-domain feature score and the k-th second frequency-domain feature score to obtain the k-th sub-matching score, where k is a positive integer less than or equal to N, and the sub-matching score is the degree of matching between the k-th frequency-domain feature scores of the first audio feature and of the candidate audio feature.

For feature matching to be possible, the candidate audio features and the first audio feature must be extracted with the same feature extraction method, and each candidate audio feature must contain the same number of frequency-domain feature scores as the first audio feature. Exemplarily, the first audio feature contains N first frequency-domain feature scores, and each candidate audio feature correspondingly contains N second frequency-domain feature scores, where N is a positive integer. Exemplarily, when N is 32, the first audio feature contains 32 binary values and each candidate audio feature also contains 32 binary values.

In a possible implementation, during the matching operation the N first frequency-domain feature scores are matched against the N second frequency-domain feature scores one by one. That is, a matching operation is performed on the k-th first frequency-domain feature score and the k-th second frequency-domain feature score to obtain the k-th sub-matching score, and the sum of the N sub-matching scores is then determined as the matching score between the first audio feature and the candidate audio feature.

Exemplarily, if N is 5, the first audio feature is G₁(n, 0), G₂(n, 1), G₃(n, 1), G₄(n, 0), G₅(n, 1), and the candidate audio feature is F₁(n, 1), F₂(n, 0), F₃(n, 0), F₄(n, 1), F₅(n, 0). When the matching operation is performed on the first audio feature and the candidate audio feature, G₁(n, 0) is matched with F₁(n, 1) to obtain the first sub-matching score, and G₂(n, 1) is matched with F₂(n, 0) to obtain the second sub-matching score; the third, fourth and fifth sub-matching scores are obtained in the same way, and the sum of the first to fifth sub-matching scores is determined as the candidate matching score.

Optionally, the matching operation may be an exclusive-OR (XOR) operation or another matching algorithm, which is not limited in the embodiments of the present application.

Step 2: determine the sum of the N sub-matching scores as the candidate matching score.

In a possible implementation, the sum of the N sub-matching scores is determined as the candidate matching score. Exemplarily, if the matching operation is XOR, then the more similar the first audio feature and the candidate audio feature are, the lower the corresponding candidate matching score is; that is, the candidate matching score is negatively correlated with the degree of matching.
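When each audio feature is packed into an integer and XOR is chosen as the matching operation, the candidate matching score is the number of differing bits. The sketch below illustrates this with the N = 5 values used in the example above; the function name `match_score` and the bit packing (first score as the most significant bit) are assumptions made for illustration.

```python
def match_score(first_feature, candidate_feature):
    """Sum of per-bit XOR results: 0 means identical, larger means less similar."""
    return bin(first_feature ^ candidate_feature).count("1")


g = 0b01101  # first audio feature: 0, 1, 1, 0, 1
f = 0b10010  # candidate audio feature: 1, 0, 0, 1, 0
print(match_score(g, f))  # 5: every one of the five positions differs
print(match_score(g, g))  # 0: identical features give the lowest score
```

The two printed values show the negative correlation: the complete mismatch scores 5 and the perfect match scores 0.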

Step two: determine the second audio feature from the candidate audio features based on the at least one candidate matching score.

The purpose of feature matching is to find, among the candidate audio features, the second audio feature that matches the first audio feature. Given the relationship between the degree of matching and the candidate matching score (the higher the degree of matching, the lower the candidate matching score), the candidate audio feature corresponding to the minimum candidate matching score can be determined as the second audio feature.
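Selecting the second audio feature then amounts to taking the candidate with the minimum score. The following sketch assumes XOR matching on integer-packed features; the function name `best_candidate` and the 4-bit sample values are hypothetical.

```python
def best_candidate(first_feature, candidates):
    """Return (index of the best-matching candidate, list of all scores)."""
    scores = [bin(first_feature ^ c).count("1") for c in candidates]
    # The lowest score marks the closest match; ties resolve to the earliest index.
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


index, scores = best_candidate(0b1011, [0b0000, 0b1010, 0b0100])
print(index, scores)  # 1 [3, 1, 4]
```

How ties are broken is not stated in the text; resolving them to the earliest index is simply the behavior of `min` here.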

Optionally, in other possible implementations, after the candidate matching scores are calculated they may be smoothed based on historical matching results to obtain feature matching scores, and the second audio feature is then determined based on the feature matching scores. In an exemplary example, step two may further include steps 3 to 5.

Step 3: smooth the at least one candidate matching score to obtain at least one feature matching score.

In a possible implementation, each candidate matching score is smoothed to obtain at least one feature matching score.

In an exemplary example, the calculation of the feature matching score can be expressed as:

Sm(n) = S(n)*b + Sm(n)'*(1-b)

Here Sm(n) denotes the feature matching score, S(n) denotes the candidate matching score, b denotes the smoothing coefficient, which is a decimal between 0 and 1 (the smaller b is, the stronger the smoothing), and Sm(n)' denotes the previous smoothing result.

Exemplarily, as shown in FIG. 10, the 75 candidate matching scores S(1) to S(75) are smoothed to obtain 75 feature matching scores: Sm(1) to Sm(75).
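The smoothing above is a first-order recursive (exponential) average applied to each score. The sketch below assumes that Sm(n)' is the smoothed score kept for the same candidate position from the previous matching round; that interpretation, the function name `smooth_scores`, and the sample values are assumptions.

```python
def smooth_scores(raw_scores, previous_smoothed, b):
    """Apply Sm(n) = S(n)*b + Sm(n)'*(1-b) element by element."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("smoothing coefficient b must lie between 0 and 1")
    return [s * b + p * (1.0 - b)
            for s, p in zip(raw_scores, previous_smoothed)]


print(smooth_scores([10.0, 0.0], [0.0, 10.0], b=0.5))  # [5.0, 5.0]
print(smooth_scores([10.0, 0.0], [0.0, 10.0], b=1.0))  # [10.0, 0.0]: no smoothing
```

With b equal to 1 the history is ignored entirely, and as b approaches 0 the new scores have less and less influence, which is why a smaller b gives stronger smoothing.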

Step 4: determine the minimum value among the feature matching scores as the target matching score.

Because the candidate matching score is negatively correlated with the degree of matching, the feature matching score is also negatively correlated with the degree of matching. That is, the more similar (the better matched) a candidate audio feature is to the first audio feature, the smaller the corresponding feature matching score. Therefore, in a possible implementation, the minimum value among the feature matching scores is determined as the target matching score, and the candidate audio feature corresponding to the target matching score is then determined as the second audio feature.

Step 5: determine the candidate audio feature corresponding to the target matching score as the second audio feature.

In a possible implementation, because the candidate audio feature corresponding to the target matching score is the audio feature that best matches the first audio feature, it can be directly determined as the second audio feature.

可选的，为了进一步提高特征匹配的准确性，在一种可能的实施方式中，设置有匹配分值阈值，当目标匹配分值小于该匹配分值阈值时，才会将其对应的候选音频特征确定为第二音频特征。Optionally, in order to further improve the accuracy of feature matching, in a possible implementation, a matching score threshold is set, and when the target matching score is smaller than the matching score threshold, the corresponding candidate audio The feature is determined as a second audio feature.
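Steps 3 to 5 can be illustrated with a minimal Python sketch. The function names `smooth_scores` and `pick_second_feature`, and the convention of returning `None` when the threshold is not met, are assumptions made for illustration and do not appear in the application.

```python
from typing import List, Optional


def smooth_scores(scores: List[float], previous: List[float], b: float) -> List[float]:
    """Apply Sm(n) = S(n)*b + Sm(n)'*(1-b) to every candidate position n."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("smoothing coefficient b must lie between 0 and 1")
    if len(scores) != len(previous):
        raise ValueError("scores and previous results must have the same length")
    return [s * b + p * (1.0 - b) for s, p in zip(scores, previous)]


def pick_second_feature(smoothed: List[float],
                        threshold: Optional[float] = None) -> Optional[int]:
    """Return the index of the smallest smoothed score (the best match).

    If a threshold is given, the match is accepted only when the smallest
    score is strictly below it; otherwise None is returned (hypothetical
    convention for "no second audio feature found").
    """
    best = min(range(len(smoothed)), key=smoothed.__getitem__)
    if threshold is not None and not smoothed[best] < threshold:
        return None
    return best


smoothed = smooth_scores([10.0, 2.0, 8.0], [6.0, 6.0, 6.0], b=0.5)
print(smoothed)                               # [8.0, 4.0, 7.0]
print(pick_second_feature(smoothed))          # 1
print(pick_second_feature(smoothed, 3.0))     # None
```

A smaller b gives more weight to the previous result, so a single noisy frame is less likely to move the minimum to a wrong candidate.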

Step 506, acquire the target storage location of the second audio feature in the second feature storage area.

In a possible implementation, once the second audio feature has been determined, the target echo delay can be determined from the position of the second audio feature in the second feature storage area, because the second audio feature is also stored in the first-in-first-out second feature storage area and its storage position moves over time from the tail storage position toward the head storage position.

Step 507, determine, based on the target storage location, the target echo delay of the output audio frame corresponding to the second audio feature.

FIG. 9 shows a schematic diagram of the delayed feature matching process according to an exemplary embodiment of the present application. An output audio frame Q at the reference point arrives at the receiving point after a transmission delay Td, yielding the input audio frame Q'. Audio feature extraction on the input audio frame Q' at the receiving point yields the feature G(100), which is stored at the tail storage position of the first feature storage area (the position occupied by G(130) in the figure). Assume the output audio frame Q is the 100th audio frame; audio feature extraction on Q yields the second audio feature F(100), which is stored at the tail storage position (tail) of the second feature storage area. During the transmission delay Td, the reference point acquires 31 new output audio frames, and the 31 candidate audio features computed from them are placed into the second feature storage area in turn. Therefore, by the time the receiving point obtains the first audio feature G(100) corresponding to the input audio frame Q', the second feature storage area has newly stored the 31 candidate audio features from F(100) to F(131). Because of the target delay, feature matching between the first audio feature G(100) and the candidate audio features stored in the second feature storage area is performed only once G(100) has moved from the position of G(130) in the figure to the head of the first feature storage area. By the time the target delay has elapsed, the second feature storage area has additionally stored the candidate audio features from F(131) to F(161). The echo delay at this point is therefore the length of the 61 output audio frames from F(100) to F(161), and it can be calculated from the sampling interval of each output audio frame and the number of audio frames.

In a possible implementation, based on the delayed feature matching process shown in FIG. 9, the echo delay of the output audio frame corresponding to the second audio feature can be determined from the target storage location of the second audio feature in the second feature storage area.

Illustratively, as shown in FIG. 10, the first audio feature G(80) matches F(80) in the second feature storage area, and F(80) corresponds to Sm(14) among the feature matching scores, so the echo delay is the length of the 61 audio frames from Sm(14) to Sm(75). If each audio frame is followed by the next after 4 ms, the echo delay is 61 × 4 ms = 244 ms.
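The position-to-delay conversion in the FIG. 10 example can be sketched as follows. The function name `echo_delay_ms` and its parameter names are hypothetical; positions are counted from the head of the second feature storage area using the same numbering as Sm(1) to Sm(75).

```python
def echo_delay_ms(match_position: int, tail_position: int,
                  frame_interval_ms: float) -> float:
    """Convert the storage position of the matched feature into a delay.

    The matched feature lies (tail_position - match_position) frames
    behind the newest stored feature, and consecutive frames are
    frame_interval_ms apart.
    """
    if match_position > tail_position:
        raise ValueError("the matched position cannot lie beyond the tail")
    return (tail_position - match_position) * frame_interval_ms


# FIG. 10 example: match at Sm(14), tail at Sm(75), 4 ms per frame.
print(echo_delay_ms(14, 75, 4.0))  # 244.0
```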

Step 508, in response to the echo delay being greater than the target delay, perform echo cancellation on the input audio frame based on the echo delay obtained by echo delay estimation.

In a possible implementation, if the echo delay is greater than the target delay, no negative delay has occurred, and echo cancellation can be performed on the input audio frame based on the echo delay obtained by the original delay estimation process.

Step 509, in response to the echo delay being less than the target delay, determine that an abnormal echo delay exists.

For the implementation of step 509, refer to the foregoing embodiments; details are not repeated here.

Step 510, in response to the existence of an abnormal echo delay, perform echo delay estimation again.

In a possible implementation, if an abnormal echo delay exists, the echo delay generated by the original delay estimation process is considered unreliable. A reset instruction is generated accordingly to reset the delay estimation module and the delay alignment module and to clear their buffers, which removes the negative delay condition and lets the delay estimation module estimate the echo delay again.
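The branching in steps 508 to 510 can be summarized in a short Python sketch. The enum `Action` and the function `check_delay` are hypothetical names. The text does not say what happens when the echo delay exactly equals the target delay; treating that case as normal is an assumption of this sketch.

```python
from enum import Enum


class Action(Enum):
    CANCEL_ECHO = "cancel echo with the estimated delay"       # step 508
    RESET_AND_REESTIMATE = "reset modules and re-estimate"     # steps 509-510


def check_delay(monitored_delay_ms: float, target_delay_ms: float) -> Action:
    """Decide how to proceed from the delay found by the monitoring path."""
    if monitored_delay_ms < target_delay_ms:
        # A delay below the target delay indicates a negative delay:
        # the original estimate is unreliable, so buffers are cleared.
        return Action.RESET_AND_REESTIMATE
    # Assumption: a delay equal to the target delay is treated as normal.
    return Action.CANCEL_ECHO


print(check_delay(244.0, 120.0).name)  # CANCEL_ECHO
print(check_delay(60.0, 120.0).name)   # RESET_AND_REESTIMATE
```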

In this embodiment, a first feature storage area is provided for the first audio feature corresponding to the input audio frame, and storing the first audio feature in that area implements delayed feature matching for the first audio feature. In addition, using an audio feature extraction method in the abnormal delay monitoring process that differs from the one used in the original delay estimation process improves the accuracy of feature matching during monitoring, and therefore the accuracy of the echo delay determined during monitoring. This more accurate echo delay is then used to decide whether the original delay estimation process should continue.

In another possible application scenario, to save computation and data storage space, the abnormality monitoring module and the original delay estimation module use the same feature extraction method. In that case, only one additional feature storage area is needed, to store the first audio feature corresponding to the input audio frame.

With reference to FIG. 6, if the same feature extraction method is used, the abnormal delay detection process does not need a separate feature extraction module, which avoids the computation of an additional feature extraction. Likewise, the candidate audio features corresponding to the reference point signal do not need a new feature storage area; only the feature storage area for the first audio feature has to be added, which reduces the storage space taken up by duplicate audio features.

On the basis of FIG. 6, after feature extraction module 3, feature extraction module 4, and feature storage area 3 are removed, FIG. 11 shows a schematic diagram of the delay estimation process and the abnormal delay detection process according to another exemplary embodiment of the present application. In FIG. 11, the candidate audio features stored in feature storage area 1 can be used by feature matching module 1 for feature matching, in order to calculate the echo delay Tde produced in delay estimation process 1101, and can also be used by feature matching module 2 for feature matching, in order to calculate the echo delay Tde2 produced in abnormal delay detection process 1102. The first audio feature stored in feature storage area 2 can be used by feature matching module 1 immediately when it is obtained, in order to estimate the echo delay Tde, and can also be used by feature matching module 2 after the target delay is reached, in order to estimate the echo delay Tde2.

Illustratively, in delay estimation process 1101, feature extraction module 1 performs audio feature extraction on the reference point signal (the output audio frame) and stores the extracted candidate audio features in feature storage area 1. When the receiving point signal is obtained, feature extraction module 2 performs audio feature extraction on the receiving point signal (the input audio frame) and sends the extracted first audio feature to feature matching module 1 for feature matching. Specifically, the matching second audio feature is looked up in feature storage area 1 based on the first audio feature, and the transmission delay Tde is then determined based on delay determination strategy 1.

In abnormal delay detection process 1102, after the first audio feature corresponding to the input audio frame is obtained, it is stored in feature storage area 2. After the target delay, feature matching module 2 reads the first audio feature from feature storage area 2, looks up the matching second audio feature in feature storage area 1 based on it, and calculates the echo delay Tde2 through delay determination strategy 2. The abnormal delay detection stage then judges whether the echo delay Tde2 is an abnormal echo delay, and if so, resets each module in the delay estimation process.

As FIG. 11 shows, the delay estimation process and the abnormal delay detection process use the same feature extraction method. FIG. 12 shows a schematic diagram of audio feature extraction in the delay estimation process according to an exemplary embodiment of the present application. An audio frame from the reference point or the receiving point is downsampled and converted to the frequency domain by a time-frequency transform (for example, the short-time Fourier transform) to obtain the spectrum of the audio frame. The spectrum is divided into frequency bands, for example linearly or into Mel bands according to psychoacoustic theory, giving M sub-bands. Here M is 32, because 32 bits fit exactly into one 32-bit integer variable, which improves computational efficiency; M may also be another value convenient for storage and computation, such as 16 or 64. The sub-band energy E(m) is calculated for each sub-band, and the sub-band energy smoothing module calculates the smoothed sub-band energy of each sub-band as Ep(m) = E(m)*a + Ep(m)'*(1-a), where a is a decimal between 0 and 1 and a smaller a gives stronger smoothing. The comparator compares the sub-band energy with the smoothed sub-band energy and outputs the binary value 1 if the sub-band energy is greater than the smoothed sub-band energy, and 0 otherwise. Assuming the current frame is the nth frame, the binarized output of the first sub-band is stored in bit 0 of the nth frame of F, that is, F(n, 0); likewise, the binarized output of the second sub-band is stored in bit 1 of the nth frame of F, that is, F(n, 1); and so on. The resulting M binary values are the extracted audio feature.
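The smoothing, comparison, and bit-packing stages of FIG. 12 can be sketched in Python as follows. The sketch starts from already computed sub-band energies and leaves out downsampling, the time-frequency transform, and band division. The function name `binary_feature` is hypothetical, and comparing E(m) against the freshly updated Ep(m), rather than the previous one, is an assumption, since the text does not specify which value the comparator sees.

```python
from typing import List, Tuple


def binary_feature(energies: List[float], smoothed_prev: List[float],
                   a: float) -> Tuple[int, List[float]]:
    """Pack one frame's M comparator outputs into a single integer.

    Returns (feature, smoothed), where bit m of feature is 1 when the
    energy of sub-band m exceeds its smoothed energy, and smoothed is the
    updated list Ep(m) = E(m)*a + Ep(m)'*(1-a) to carry into the next frame.
    """
    if len(energies) != len(smoothed_prev):
        raise ValueError("need one previous smoothed value per sub-band")
    feature = 0
    smoothed = []
    for m, (e, prev) in enumerate(zip(energies, smoothed_prev)):
        ep = e * a + prev * (1.0 - a)
        smoothed.append(ep)
        if e > ep:
            feature |= 1 << m  # sub-band m maps to bit m, as in F(n, m)
    return feature, smoothed


# Four sub-bands for brevity; the text uses M = 32.
feature, state = binary_feature([1.0, 5.0, 1.0, 9.0], [2.0, 2.0, 2.0, 2.0], a=0.5)
print(bin(feature))  # 0b1010
print(state)         # [1.5, 3.5, 1.5, 5.5]
```

With M = 32, each frame's feature fits in one 32-bit word, so two features can be compared with a single XOR followed by a bit count.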

In another possible application scenario, to further reduce the computation involved in feature matching, the second storage capacity of the second feature storage area may be smaller than the first storage capacity of the first feature storage area. With this setting, if an output audio frame matching the input audio frame can be found in the second feature storage area, a negative delay has occurred.

On the basis of FIG. 3, as shown in FIG. 13, steps 302 to 304 may be replaced by steps 1301 and 1302.

Step 1301, in response to reaching the target delay, search the second feature storage area for a candidate audio feature matching the first audio feature.

The first storage capacity of the first feature storage area is determined by the target delay and the sampling time interval between adjacent input audio frames, and the number of frames of input audio frames corresponding to the audio features stored in the second feature storage area is smaller than the number of frames of output audio frames corresponding to the audio features stored in the first feature storage area. Illustratively, if the first feature storage area can store the audio features of 31 audio frames, the second feature storage area can be set to store the audio features of 30 audio frames.

In a possible implementation, after the first audio feature is extracted and the target delay is reached, the second feature storage area is searched based on the first audio feature. That is, the first audio feature is matched against each candidate audio feature stored in the second feature storage area, and a matching score between the first audio feature and each candidate audio feature is obtained.

It should be noted that when the second storage capacity of the second feature storage area is smaller than the first storage capacity, the abnormal delay detection process and the delay estimation process cannot share the second feature storage area. In other words, even if the two processes use the same audio feature extraction method, the candidate audio features used by the delay estimation module must be stored in another feature storage area.

Step 1302, in response to finding a candidate audio feature matching the first audio feature, determine that an abnormal echo delay exists.

The abnormal delay detection process only needs to detect the case in which the echo delay is less than the target delay. Therefore, when the second storage capacity of the second feature storage area is smaller than the first storage capacity, under normal conditions the second feature storage area contains no candidate audio feature matching the first audio feature at the time feature matching is performed. Conversely, if a matching candidate audio feature can be found in the second feature storage area, the resulting echo delay is less than the target delay, which indicates a negative delay. Accordingly, in a possible implementation, when a candidate audio feature matching the first audio feature is found in the second feature storage area, it is determined that an abnormal echo delay exists.
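The presence-only check of steps 1301 and 1302 can be sketched in Python. A `collections.deque` with `maxlen` stands in for the reduced second feature storage area, features are assumed to be bit-packed integers, and the Hamming distance with a `max_distance` tolerance is an assumed matching rule; none of these choices is taken from the application.

```python
from collections import deque
from typing import Deque


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two bit-packed features."""
    return bin(a ^ b).count("1")


def has_abnormal_delay(first_feature: int, second_store: Deque[int],
                       max_distance: int) -> bool:
    """True if any candidate still held in the reduced store matches.

    Because the store holds fewer frames than the target delay spans, a
    hit means the echo delay is below the target delay (negative delay).
    """
    return any(hamming(first_feature, c) <= max_distance for c in second_store)


# Capacity 3 for brevity; the text's example uses 30 against 31.
store: Deque[int] = deque([0x0F0F0F0F, 0xFFFF0000, 0x12345678], maxlen=3)
print(has_abnormal_delay(0x12345679, store, max_distance=2))  # True
print(has_abnormal_delay(0x00000000, store, max_distance=2))  # False

store.append(0xAAAAAAAA)  # the oldest feature, 0x0F0F0F0F, is evicted
print(has_abnormal_delay(0x0F0F0F0F, store, max_distance=0))  # False
```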

In this embodiment, setting the storage capacity of the second feature storage area in this way makes it unnecessary to calculate the echo delay: it is only necessary to determine whether the second feature storage area contains a candidate audio feature matching the first audio feature, and if it does, an abnormal echo delay is determined to exist. This simplifies the detection logic and further improves the efficiency of abnormal echo delay detection. In addition, reducing the storage capacity of the second feature storage area lowers the number of feature matching operations without lowering the detection efficiency for abnormal echo delay.

The following are device embodiments of the present application. For details not described in the device embodiments, refer to the method embodiments above.

FIG. 14 is a structural block diagram of an abnormal echo delay identification device provided by an exemplary embodiment of the present application. The device includes:

a feature extraction module 1401, configured to perform audio feature extraction on an input audio frame collected by a microphone to obtain a first audio feature;

a first determining module 1402, configured to, in response to reaching a target delay, determine a second audio feature from candidate audio features based on the first audio feature, where the candidate audio features are audio features corresponding to output audio frames, the output audio frames are to be played by a loudspeaker, and the second audio feature matches the first audio feature;

a second determining module 1403, configured to determine the echo delay of the output audio frame corresponding to the second audio feature;

a third determining module 1404, configured to determine that an abnormal echo delay exists in response to the echo delay being less than the target delay.

Optionally, the first determining module 1402 includes:

a storage unit, configured to store the first audio feature at the tail storage position of a first feature storage area, where the first storage capacity of the first feature storage area is determined by the target delay, and the storage position of the first audio feature moves over time from the tail storage position toward the head storage position;

a first determining unit, configured to, in response to the first audio feature moving to the head storage position of the first feature storage area, determine the second audio feature from the candidate audio features based on the first audio feature.
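The storage unit and the first determining unit together behave like a fixed-length delay line. The following Python sketch illustrates this under stated assumptions: the class name `DelayedFeatureStore` is hypothetical, and deriving the capacity as the target delay divided by the frame interval (rounded) is an assumed formula, since the text says only that the capacity is determined by the target delay.

```python
from collections import deque
from typing import Deque, Optional


class DelayedFeatureStore:
    """First-in-first-out store that releases a feature once it reaches the head."""

    def __init__(self, target_delay_ms: float, frame_interval_ms: float) -> None:
        # Assumed capacity rule: one slot per frame interval in the target delay.
        self.capacity = max(1, round(target_delay_ms / frame_interval_ms))
        self._slots: Deque[int] = deque()

    def push(self, feature: int) -> Optional[int]:
        """Store a new feature at the tail.

        Returns the feature that has reached the head and is now due for
        matching, or None while the store is still filling up.
        """
        self._slots.append(feature)
        if len(self._slots) >= self.capacity:
            return self._slots.popleft()
        return None


store = DelayedFeatureStore(target_delay_ms=12.0, frame_interval_ms=4.0)
print([store.push(f) for f in (10, 11, 12, 13)])  # [None, None, 10, 11]
```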

Optionally, the candidate audio features are stored in a second feature storage area, and the storage position of each candidate audio feature moves over time from the tail storage position toward the head storage position;

the second determining module 1403 includes:

an acquiring unit, configured to acquire the target storage location of the second audio feature in the second feature storage area;

a second determining unit, configured to determine, based on the target storage location, the target echo delay of the output audio frame corresponding to the second audio feature.

Optionally, the second storage capacity of the second feature storage area is smaller than the first storage capacity;

the device further includes:

a search module, configured to, in response to reaching the target delay, search the second feature storage area for a candidate audio feature matching the first audio feature;

a fourth determining module, configured to determine that an abnormal echo delay exists in response to finding a candidate audio feature matching the first audio feature.

Optionally, the first storage capacity is determined by the target delay and the sampling time interval between adjacent input audio frames, and the number of frames of input audio frames corresponding to the audio features stored in the second feature storage area is smaller than the number of frames of output audio frames corresponding to the audio features stored in the first feature storage area.

Optionally, the feature extraction module 1401 includes:

a third determining unit, configured to perform time-frequency conversion and frequency band division on the input audio frame collected by the microphone to determine M sub-bands;

a fourth determining unit, configured to obtain N first frequency-domain feature scores by comparing the sub-band energies of the M sub-bands in the frequency domain, where N is a positive integer and M-N is a positive integer;

a fifth determining unit, configured to determine the set of the N first frequency-domain feature scores as the first audio feature.

Optionally, the fourth determining unit is further configured to:

in response to the jth sub-band energy being the maximum among the sub-band energies of the (j-i)th to (j+i)th sub-bands, determine a first score as the first frequency-domain feature score corresponding to the jth sub-band, where the jth sub-band energy is the sub-band energy of the jth sub-band, i is a positive integer, j-i is a positive integer, and j+i is less than or equal to M;

in response to the jth sub-band energy being less than the maximum among the sub-band energies of the (j-i)th to (j+i)th sub-bands, determine a second score as the first frequency-domain feature score corresponding to the jth sub-band.
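The local-maximum comparison performed by the fourth determining unit can be sketched in Python. The function name `peak_features` is hypothetical, and the concrete values 1 and 0 for the first and second scores are assumptions, since the text does not fix them. The sketch uses 0-based indices, so the 1-based range j = i+1 to M-i becomes `range(i, M - i)` and yields N = M - 2i scores.

```python
from typing import List


def peak_features(energies: List[float], i: int,
                  first_score: int = 1, second_score: int = 0) -> List[int]:
    """Score each sub-band by whether it is the maximum of its neighbourhood.

    Sub-band j gets first_score when its energy equals the maximum over
    sub-bands j-i .. j+i, and second_score otherwise. Sub-bands without a
    full neighbourhood on both sides are skipped.
    """
    m = len(energies)
    if i < 1 or 2 * i >= m:
        raise ValueError("need 1 <= i and 2*i < M")
    scores = []
    for j in range(i, m - i):
        window = energies[j - i: j + i + 1]
        scores.append(first_score if energies[j] == max(window) else second_score)
    return scores


print(peak_features([1, 3, 2, 5, 4, 4, 0], i=1))  # [1, 0, 1, 0, 1]
```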

a feature matching unit, configured to perform feature matching on the first audio feature and the candidate audio features to obtain at least one candidate matching score, where the candidate matching score indicates the matching degree between the first audio feature and the candidate audio feature, and the candidate matching score is negatively correlated with the matching degree;

a sixth determining unit, configured to determine the second audio feature from the candidate audio features based on the at least one candidate matching score.

Optionally, the first audio feature contains N first frequency-domain feature scores, and each candidate audio feature contains N second frequency-domain feature scores, where N is a positive integer;

the feature matching unit is further configured to:

perform a matching operation on the kth first frequency-domain feature score and the kth second frequency-domain feature score to obtain the kth sub-matching score, where k is a positive integer less than or equal to N, and the sub-matching score is the matching degree between the kth frequency-domain feature scores of the first audio feature and the second audio feature;

determine the sum of the N sub-matching scores as the candidate matching score.
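The summation of sub-matching scores can be sketched in Python. The text does not define the matching operation itself; using a mismatch indicator (0 when the two kth scores are equal, 1 otherwise) is an assumption chosen so that a lower total means a better match, consistent with the negative correlation stated above. The function name `candidate_matching_score` is hypothetical.

```python
from typing import Sequence


def candidate_matching_score(first: Sequence[int], candidate: Sequence[int]) -> int:
    """Sum of N sub-matching scores between two feature-score sequences.

    Each sub-matching score is 0 for equal kth scores and 1 otherwise
    (assumed rule), so identical features score 0.
    """
    if len(first) != len(candidate):
        raise ValueError("both features must contain N scores")
    return sum(0 if a == b else 1 for a, b in zip(first, candidate))


print(candidate_matching_score([1, 0, 1, 0, 1], [1, 0, 1, 0, 1]))  # 0
print(candidate_matching_score([1, 0, 1, 0, 1], [1, 1, 1, 0, 0]))  # 2
```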

Optionally, the fifth determining unit is further configured to:

smooth the at least one candidate matching score to obtain at least one feature matching score;

determine the minimum value among the feature matching scores as the target matching score;

determine the candidate audio feature corresponding to the target matching score as the second audio feature.

Optionally, the device further includes:

a reset module, configured to perform echo delay estimation again in response to the existence of an abnormal echo delay.

Optionally, the device further includes:

an echo cancellation module, configured to, in response to the echo delay being greater than the target delay, perform echo cancellation on the input audio frame based on the echo delay obtained by echo delay estimation.

FIG. 15 shows a structural block diagram of a terminal 1500 provided by an exemplary embodiment of the present application. The terminal 1500 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer. The terminal 1500 may also be referred to as user equipment, a portable terminal, a laptop terminal, a desktop terminal, or by other names.

Generally, the terminal 1500 includes a processor 1501 and a memory 1502.

The processor 1501 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 1501 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1501 may also include a main processor and a coprocessor. The main processor, also called a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1501 may be integrated with a GPU (Graphics Processing Unit), which renders and draws the content to be displayed on the display screen. In some embodiments, the processor 1501 may further include an AI (Artificial Intelligence) processor, which handles computing operations related to machine learning.

The memory 1502 may include one or more computer-readable storage media, which may be non-transitory. The memory 1502 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1502 stores at least one instruction, and the at least one instruction is executed by the processor 1501 to implement the information processing method provided by the method embodiments of the present application.

在一些实施例中，终端1500还可选包括有：外围设备接口1503和至少一个外围设备。处理器1501、存储器1502和外围设备接口1503之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口1503相连。具体地，外围设备包括：射频电路1504、显示屏1505、摄像头组件1506、音频电路1507、定位组件1508和电源1509中的至少一种。In some embodiments, the terminal 1500 may optionally further include: a peripheral device interface 1503 and at least one peripheral device. The processor 1501, the memory 1502, and the peripheral device interface 1503 may be connected through buses or signal lines. Each peripheral device can be connected to the peripheral device interface 1503 through a bus, a signal line or a circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1504 , a display screen 1505 , a camera component 1506 , an audio circuit 1507 , a positioning component 1508 and a power supply 1509 .

The peripheral device interface 1503 may be used to connect at least one I/O (Input/Output) peripheral device to the processor 1501 and the memory 1502. In some embodiments, the processor 1501, the memory 1502, and the peripheral device interface 1503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1501, the memory 1502, and the peripheral device interface 1503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 1504 receives and transmits RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1504 communicates with communication networks and other communication devices through electromagnetic signals. It converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 1504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 1504 may communicate with other terminals through at least one wireless communication protocol. The wireless communication protocols include, but are not limited to, the World Wide Web, metropolitan area networks, intranets, the successive generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1504 may also include circuits related to NFC (Near Field Communication), which is not limited in this application.

The display screen 1505 displays a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1505 is a touch display screen, it can also collect touch signals on or above its surface. A touch signal may be input to the processor 1501 as a control signal for processing. In this case, the display screen 1505 may also provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1505, arranged on the front panel of the terminal 1500; in other embodiments, there may be at least two display screens 1505, arranged on different surfaces of the terminal 1500 or in a folding design; in still other embodiments, the display screen 1505 may be a flexible display screen arranged on a curved surface or a folding surface of the terminal 1500. The display screen 1505 may even be given a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 1505 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 1506 captures images or video. Optionally, the camera assembly 1506 includes a front camera and a rear camera. Usually, the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize background blurring, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1506 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.

The audio circuit 1507 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment, converts them into electrical signals, and inputs the signals to the processor 1501 for processing or to the radio frequency circuit 1504 for voice communication. For stereo collection or noise reduction, there may be multiple microphones arranged at different parts of the terminal 1500. The microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 1501 or the radio frequency circuit 1504 into sound waves. The speaker may be a conventional membrane speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans, for purposes such as distance measurement. In some embodiments, the audio circuit 1507 may also include a headphone jack.

The positioning component 1508 locates the current geographic position of the terminal 1500 to implement navigation or LBS (Location Based Service). The positioning component 1508 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.

The power supply 1509 supplies power to the components in the terminal 1500. The power supply 1509 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 1509 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charging technology.

In some embodiments, the terminal 1500 further includes one or more sensors 1510. The one or more sensors 1510 include, but are not limited to, an acceleration sensor 1511, a gyroscope sensor 1512, a pressure sensor 1513, a fingerprint sensor 1514, an optical sensor 1515, and a proximity sensor 1516.

The acceleration sensor 1511 can detect the magnitude of acceleration on the three coordinate axes of a coordinate system established with respect to the terminal 1500. For example, the acceleration sensor 1511 can detect the components of gravitational acceleration on the three coordinate axes. Based on the gravitational acceleration signal collected by the acceleration sensor 1511, the processor 1501 can control the touch display screen 1505 to process information in a landscape view or a portrait view. The acceleration sensor 1511 can also be used to collect game data or the user's motion data.

The gyroscope sensor 1512 can detect the body orientation and rotation angle of the terminal 1500, and can cooperate with the acceleration sensor 1511 to collect the user's 3D actions on the terminal 1500. Based on the data collected by the gyroscope sensor 1512, the processor 1501 can implement the following functions: motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.

The pressure sensor 1513 may be arranged on the side frame of the terminal 1500 and/or beneath the touch display screen 1505. When the pressure sensor 1513 is arranged on the side frame of the terminal 1500, it can detect the user's grip signal on the terminal 1500, and the processor 1501 performs left-hand or right-hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 1513. When the pressure sensor 1513 is arranged beneath the touch display screen 1505, the processor 1501 controls the operable controls on the UI according to the user's pressure operation on the touch display screen 1505. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 1514 collects the user's fingerprint. Either the processor 1501 identifies the user from the fingerprint collected by the fingerprint sensor 1514, or the fingerprint sensor 1514 itself identifies the user from the collected fingerprint. When the user's identity is recognized as trusted, the processor 1501 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 1514 may be arranged on the front, back, or side of the terminal 1500. When the terminal 1500 has a physical button or a manufacturer logo, the fingerprint sensor 1514 may be integrated with the physical button or the manufacturer logo.

The optical sensor 1515 collects the ambient light intensity. In one embodiment, the processor 1501 may control the display brightness of the touch display screen 1505 according to the ambient light intensity collected by the optical sensor 1515. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1505 is increased; when the ambient light intensity is low, the display brightness is decreased. In another embodiment, the processor 1501 may also dynamically adjust the shooting parameters of the camera assembly 1506 according to the ambient light intensity collected by the optical sensor 1515.

The proximity sensor 1516, also called a distance sensor, is usually arranged on the front panel of the terminal 1500. The proximity sensor 1516 collects the distance between the user and the front of the terminal 1500. In one embodiment, when the proximity sensor 1516 detects that the distance between the user and the front of the terminal 1500 is gradually decreasing, the processor 1501 controls the touch display screen 1505 to switch from the screen-on state to the screen-off state; when the proximity sensor 1516 detects that the distance is gradually increasing, the processor 1501 controls the touch display screen 1505 to switch from the screen-off state to the screen-on state.

Those skilled in the art will understand that the structure shown in FIG. 15 does not limit the terminal 1500, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.

The present application also provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the abnormal echo delay identification method provided by any of the above exemplary embodiments.

An embodiment of the present application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. The processor of a terminal reads the computer instructions from the computer-readable storage medium and executes them, causing the terminal to perform the abnormal echo delay identification method provided in the above optional implementations.

Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

The above are only optional embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims

1. An abnormal echo delay identification method, characterized in that the method comprises:

performing audio feature extraction on an input audio frame acquired by a microphone to obtain a first audio feature;

in response to reaching a target delay, determining a second audio feature from candidate audio features based on the first audio feature, wherein the candidate audio features are audio features corresponding to an output audio frame, the output audio frame is used for loudspeaker playing, and the second audio feature is matched with the first audio feature;

determining the echo delay of the output audio frame corresponding to the second audio feature;

and determining that an abnormal echo delay exists in response to the echo delay being less than the target delay.

2. The method of claim 1, wherein the determining, in response to reaching the target delay, a second audio feature from the candidate audio features based on the first audio feature comprises:

storing the first audio feature in a tail storage location of a first feature storage area, wherein a first storage capacity of the first feature storage area is determined by the target delay, and the storage location of the first audio feature moves from the tail storage location to a head storage location over time;

determining the second audio feature from the candidate audio features based on the first audio feature in response to the first audio feature moving to the head storage location of the first feature storage area.
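By way of non-limiting illustration, the first feature storage area of claim 2 can be sketched as a fixed-capacity first-in, first-out buffer. The class name `DelayedFeatureBuffer`, the millisecond parameters, and the rule that the capacity equals the target delay divided by the frame interval are assumptions made for this sketch, not details taken from the embodiments.

```python
from collections import deque


class DelayedFeatureBuffer:
    """FIFO storage area: a feature enters at the tail and is used at the head."""

    def __init__(self, target_delay_ms, frame_interval_ms):
        # Assumption: capacity (in frames) = target delay / frame interval.
        self.capacity = max(1, target_delay_ms // frame_interval_ms)
        self._slots = deque()

    def push(self, feature):
        """Store a feature at the tail; return the feature now at the head,
        or None while the storage area is still filling."""
        self._slots.append(feature)
        if len(self._slots) > self.capacity:
            self._slots.popleft()  # the previous head has already been used
        if len(self._slots) == self.capacity:
            return self._slots[0]
        return None


buf = DelayedFeatureBuffer(target_delay_ms=30, frame_interval_ms=10)
released = [buf.push(name) for name in ["f0", "f1", "f2", "f3"]]
```

In this sketch a feature becomes available for matching only once enough later frames have pushed it to the head, which is one way to realise the "in response to reaching the target delay" condition.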

3. The method of claim 2, wherein the candidate audio features are stored in a second feature storage area, and the storage location of each candidate audio feature moves from a tail storage location to a head storage location over time;

the determining the echo delay of the output audio frame corresponding to the second audio feature comprises:

acquiring a target storage position of the second audio feature in the second feature storage area;

and determining the echo delay of the output audio frame corresponding to the second audio feature based on the target storage position.
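By way of non-limiting illustration, the mapping from a storage position to an echo delay in claim 3 can be sketched as follows. The sketch assumes that position 0 is the head (oldest feature), that the last occupied position is the tail (newest feature), and that one feature is stored per frame interval; the function name and parameters are hypothetical.

```python
def echo_delay_from_position(position, stored_count, frame_interval_ms):
    """Convert a storage position into the age of the matched output frame.

    Assumptions: position 0 is the head (oldest), position stored_count - 1
    is the tail (newest), and one feature is stored per frame interval.
    """
    if not 0 <= position < stored_count:
        raise ValueError("position outside the storage area")
    frames_since_stored = (stored_count - 1) - position
    return frames_since_stored * frame_interval_ms


delay_at_head = echo_delay_from_position(0, stored_count=5, frame_interval_ms=10)
delay_at_tail = echo_delay_from_position(4, stored_count=5, frame_interval_ms=10)
```

Under these assumptions, a match nearer the head corresponds to a longer echo delay and a match nearer the tail to a shorter one.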

4. The method of claim 3, wherein the second storage capacity of the second feature storage area is less than the first storage capacity;

after the audio feature extraction is performed on the input audio frame acquired by the microphone to obtain the first audio feature, the method further includes:

in response to reaching the target delay, searching the second feature storage area for candidate audio features that match the first audio feature;

determining that an abnormal echo delay exists in response to finding a candidate audio feature that matches the first audio feature.

5. The method of claim 4, wherein

the first storage capacity is determined by the target delay and the sampling time interval between adjacent input audio frames, and the number of the input audio frames corresponding to the audio features stored in the second feature storage area is smaller than the number of the output audio frames corresponding to the audio features stored in the first feature storage area.

6. The method according to any one of claims 1 to 5, wherein the performing audio feature extraction on the input audio frame collected by the microphone to obtain the first audio feature comprises:

performing time-frequency conversion and frequency band division on the input audio frame acquired by the microphone to determine M sub-bands;

obtaining N first frequency-domain feature scores by comparing the sub-band energies of the M sub-bands in the frequency domain, wherein N is a positive integer, and M-N is a positive integer;

determining a set of N first frequency-domain feature scores as the first audio feature.

7. The method of claim 6, wherein the obtaining N first frequency-domain feature scores by comparing the sub-band energies of the M sub-bands in the frequency domain comprises:

determining a first score as the first frequency-domain feature score corresponding to the jth sub-band in response to the energy of the jth sub-band being the maximum of the sub-band energies of the (j-i)th sub-band to the (j+i)th sub-band, wherein i is a positive integer, j-i is a positive integer, and j+i is less than or equal to M;

and determining a second score as the first frequency-domain feature score corresponding to the jth sub-band in response to the energy of the jth sub-band being less than the maximum of the sub-band energies of the (j-i)th sub-band to the (j+i)th sub-band.
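By way of non-limiting illustration, the feature extraction of claims 6 and 7 can be sketched as follows. The sketch assumes equal-width sub-bands, a naive discrete Fourier transform in place of an optimized FFT, a first score of 1 and a second score of 0, and a comparison window running from the (j-i)th to the (j+i)th sub-band; the function names are hypothetical.

```python
import cmath


def subband_energies(frame, m):
    """Naive DFT of one frame, then sum |X[k]|^2 over m equal-width sub-bands."""
    n = len(frame)
    half = n // 2  # keep only the non-redundant bins of a real-valued frame
    power = []
    for k in range(half):
        x_k = sum(s * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, s in enumerate(frame))
        power.append(abs(x_k) ** 2)
    width = half // m  # assumption: equal-width sub-bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(m)]


def peak_scores(energies, i, first_score=1, second_score=0):
    """Give sub-band j the first score if its energy is the maximum within
    +/- i neighbouring sub-bands, otherwise the second score."""
    m = len(energies)
    scores = []
    for j in range(i, m - i):  # 0-based indices; yields M - 2*i scores
        window = energies[j - i:j + i + 1]
        scores.append(first_score if energies[j] == max(window) else second_score)
    return scores


example_scores = peak_scores([1.0, 5.0, 2.0, 3.0, 9.0, 4.0], i=1)
```

Because only sub-bands with i neighbours on each side are scored, this sketch produces N = M - 2i scores, which satisfies the condition that M-N is a positive integer.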

8. The method of any of claims 1 to 5, wherein determining the second audio feature from the candidate audio features based on the first audio feature comprises:

performing feature matching on the first audio feature and the candidate audio features to obtain at least one candidate matching score, wherein each candidate matching score indicates the matching degree between the first audio feature and a candidate audio feature, and the candidate matching score is negatively correlated with the matching degree;

determining the second audio feature from the candidate audio features based on at least one of the candidate match scores.

9. The method of claim 8, wherein the first audio feature comprises N first frequency-domain feature scores, wherein the candidate audio features comprise N second frequency-domain feature scores, and wherein N is a positive integer;

the performing feature matching on the first audio feature and the candidate audio feature to obtain at least one candidate matching score includes:

performing a matching operation on the kth first frequency-domain feature score and the kth second frequency-domain feature score to obtain a kth sub-matching score, wherein k is a positive integer less than or equal to N, and the kth sub-matching score indicates the matching degree between the kth frequency-domain feature score of the first audio feature and the kth frequency-domain feature score of the candidate audio feature;

determining a sum of the N sub-match scores as the candidate match score.
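By way of non-limiting illustration, the matching operation of claim 9 can be sketched for binary feature scores. The sketch assumes that the sub-matching score is the exclusive OR of the two kth scores, so that the candidate matching score is a Hamming distance and a lower score means a better match, in line with the negative correlation stated in claim 8. The function name is hypothetical.

```python
def candidate_match_score(first_feature, candidate_feature):
    """Sum of per-score mismatches between two binary feature vectors.

    Assumption: each sub-matching score is a XOR (0 if equal, 1 if different),
    so the total is a Hamming distance and 0 means a perfect match.
    """
    if len(first_feature) != len(candidate_feature):
        raise ValueError("features must contain the same number of scores")
    return sum(a ^ b for a, b in zip(first_feature, candidate_feature))


same = candidate_match_score([1, 0, 0, 1], [1, 0, 0, 1])
different = candidate_match_score([1, 0, 0, 1], [0, 0, 1, 1])
```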

10. The method of claim 8, wherein determining the second audio feature from the candidate audio features based on at least one of the candidate match scores comprises:

performing smoothing processing on at least one candidate matching score to obtain at least one feature matching score;

determining the minimum value in the feature matching scores as a target matching score;

and determining the candidate audio feature corresponding to the target matching score as the second audio feature.
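By way of non-limiting illustration, the smoothing and selection steps of claim 10 can be sketched as follows. The sketch assumes exponential smoothing over time for each storage position, with a hypothetical coefficient `alpha`; the claim itself does not fix a particular smoothing method.

```python
def select_second_feature(candidate_scores, previous_smoothed=None, alpha=0.9):
    """Smooth the candidate matching scores, then pick the minimum.

    Assumption: exponential smoothing per storage position, where alpha is
    the weight given to the previously smoothed score.
    Returns (index of the best candidate, list of smoothed scores).
    """
    if previous_smoothed is None:
        smoothed = [float(s) for s in candidate_scores]
    else:
        smoothed = [alpha * p + (1 - alpha) * s
                    for p, s in zip(previous_smoothed, candidate_scores)]
    target_index = min(range(len(smoothed)), key=smoothed.__getitem__)
    return target_index, smoothed


first_index, state = select_second_feature([4, 1, 3])
# A single noisy frame favouring position 0 does not immediately change the result.
second_index, state = select_second_feature([0, 4, 4], previous_smoothed=state)
```

Smoothing of this kind trades a slower reaction to genuine delay changes for robustness against one-off spurious matches.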

11. The method of any of claims 1 to 5, wherein after determining that an abnormal echo delay exists in response to the echo delay being less than the target delay, the method further comprises:

re-estimating the echo delay in response to the abnormal echo delay existing.

12. The method of any of claims 1 to 5, wherein after determining the echo delay of the output audio frame corresponding to the second audio feature, the method further comprises:

and in response to the echo delay being larger than the target delay, performing echo cancellation processing on the input audio frame based on the echo delay obtained by echo delay estimation.
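By way of non-limiting illustration, the decisions of claims 1, 11, and 12 can be sketched as a single comparison. The names `DelayAction` and `decide` are hypothetical. The claims do not state what happens when the echo delay equals the target delay, so treating that case as normal is an assumption of this sketch.

```python
from enum import Enum


class DelayAction(Enum):
    REESTIMATE = "re-estimate echo delay"
    CANCEL_ECHO = "run echo cancellation"


def decide(echo_delay_ms, target_delay_ms):
    """Map the measured echo delay to the follow-up action."""
    if echo_delay_ms < target_delay_ms:
        # Abnormal echo delay: the previous estimate cannot be trusted.
        return DelayAction.REESTIMATE
    # Assumption: a delay equal to the target is handled like a larger one.
    return DelayAction.CANCEL_ECHO


abnormal = decide(echo_delay_ms=20, target_delay_ms=50)
normal = decide(echo_delay_ms=80, target_delay_ms=50)
```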

13. An abnormal echo delay identifying apparatus, comprising:

a feature extraction module, configured to perform audio feature extraction on an input audio frame collected by a microphone to obtain a first audio feature;

a first determining module, configured to determine, in response to reaching a target delay, a second audio feature from candidate audio features based on the first audio feature, where the candidate audio feature is an audio feature corresponding to an output audio frame, the output audio frame is used for speaker playing, and the second audio feature is matched with the first audio feature;

a second determining module, configured to determine an echo delay of the output audio frame corresponding to the second audio feature;

and a third determining module, configured to determine that an abnormal echo delay exists in response to the echo delay being smaller than the target delay.

14. A terminal, characterized in that the terminal comprises a processor and a memory, wherein the memory stores at least one program, and the at least one program is loaded and executed by the processor to implement the abnormal echo delay identifying method according to any one of claims 1 to 12.

15. A computer-readable storage medium, wherein at least one program is stored in the computer-readable storage medium, and the at least one program is loaded and executed by a processor to implement the method for identifying abnormal echo delay according to any one of claims 1 to 12.