
CN116359843A - Sound source positioning method, device, intelligent equipment and medium - Google Patents

Info

Publication number
CN116359843A
CN116359843A
Authority
CN
China
Prior art keywords
audio
sound source
data
value
domain data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310213510.6A
Other languages
Chinese (zh)
Inventor
孟繁荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Goertek Technology Co Ltd
Original Assignee
Goertek Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Goertek Technology Co Ltd
Priority to CN202310213510.6A priority Critical patent/CN116359843A/en
Publication of CN116359843A publication Critical patent/CN116359843A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 5/00: Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S 5/18: Position-fixing by co-ordinating two or more direction or position line determinations, or two or more distance determinations, using ultrasonic, sonic, or infrasonic waves
    • G01S 5/20: Position of source determined by a plurality of spaced direction-finders

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses a sound source localization method and apparatus, a smart device, and a medium, belonging to the field of speech processing. The method includes: acquiring audio time-domain data separately collected from a sound source by at least two microphones in a microphone array, and a frequency-response error compensation value for each of the at least two microphones; for each piece of audio time-domain data, calibrating the amplitude of each frame based on the corresponding frequency-response error compensation value to obtain calibrated audio time-domain data, so that the amplitudes of corresponding frames in all the calibrated audio time-domain data are consistent; and localizing the sound source based on all the calibrated audio time-domain data. Each microphone used for sound source localization thus produces a consistent amplitude signal for the same source, improving the localization accuracy of the sound source localization method.

Description

Sound Source Localization Method and Apparatus, Smart Device, and Medium

Technical Field

The present application relates to the field of speech processing, and in particular to a sound source localization method and apparatus, a smart device, and a medium.

Background

In the related art, smart devices such as smart speakers and smart TVs are equipped with far-field voice pickup. The front-end processing of the far-field voice pickup function is sound source localization, which depends on the time-delay information between matching frames across the multiple microphones of the microphone array provided on the smart device.

However, individual differences between the microphones in the array, together with installation errors, cause the microphones to produce different frequency-response data for the same source. This degrades the accuracy of the matching-frame computation and, in turn, the computational accuracy of sound source localization.

Summary

The main purpose of the present application is to provide a sound source localization method and apparatus, a smart device, and a medium, aiming to solve the technical problem that the accuracy of existing sound source localization needs improvement.

To achieve the above purpose, the present application provides a sound source localization method applied to a smart device that includes a microphone array. The method includes:

acquiring audio time-domain data separately collected from the sound source by at least two microphones in the microphone array, and a frequency-response error compensation value for each of the at least two microphones;

for each piece of audio time-domain data, calibrating the amplitude of each frame based on the corresponding frequency-response error compensation value to obtain calibrated audio time-domain data, so that the amplitudes of corresponding frames in all the calibrated audio time-domain data are consistent; and

localizing the sound source based on all the calibrated audio time-domain data.

In a possible embodiment of the present application, acquiring the frequency-response error compensation value of each of the at least two microphones includes:

acquiring the actual audio data separately collected by each of the at least two microphones for the same test source data;

determining a test audio energy value for each piece of actual audio data;

determining, based on the test audio energy values, a reference audio energy value corresponding to the same test source data; and

determining, based on each test audio energy value and the reference audio energy value, the frequency-response error compensation value of the corresponding microphone, so as to obtain the frequency-response error compensation value of each microphone.

In a possible embodiment of the present application, determining the frequency-response error compensation value of the corresponding microphone based on the test audio energy value and the reference audio energy value includes:

determining the ratio of the test audio energy value to the reference audio energy value;

obtaining a compensation coefficient of the corresponding microphone based on the ratio; and

obtaining, based on the compensation coefficient, the frequency-response error compensation value of the corresponding microphone in the time domain.

In a possible embodiment of the present application, determining the reference audio energy value corresponding to the same test source data based on the test audio energy values includes:

determining the average of all the test audio energy values; and

using the average energy value as the reference audio energy value.
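The averaging-and-ratio procedure described above can be sketched as follows. The function name and the square-root mapping from energy ratio to amplitude coefficient are illustrative assumptions rather than the patent's exact formula:

```python
import math

def compensation_coefficients(test_energies):
    """Per-microphone amplitude coefficients from test-recording energies.

    Reference energy is the average of all microphones' test energies, as in
    the embodiment above. Since signal energy scales with the square of
    amplitude, scaling each channel by sqrt(reference / measured) brings all
    channels to the same energy for the same source.
    """
    reference = sum(test_energies) / len(test_energies)
    return [math.sqrt(reference / e) for e in test_energies]
```

A microphone whose test recording came out quieter than average receives a coefficient greater than 1, and vice versa; after scaling, every channel's energy equals the reference.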

In a possible embodiment of the present application, the same test source data is preset swept-frequency audio data, and the preset swept-frequency audio data includes at least audio in a target frequency band.

In this case, determining the test audio energy value of each piece of actual audio data includes:

determining the test audio energy value of the audio distributed in the target frequency band within each piece of actual audio data.
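One plausible way to realize the band-limited energy measurement above is to sum the squared DFT magnitudes of the bins that fall inside the target band. The function and parameter names are assumptions, and a naive O(N²) DFT is used purely for clarity; a real implementation would use an FFT over the recorded sweep:

```python
import cmath
import math

def band_energy(samples, sample_rate, f_lo, f_hi):
    """Energy of the portion of `samples` whose DFT bins lie in [f_lo, f_hi] Hz."""
    n = len(samples)
    total = 0.0
    for k in range(n // 2 + 1):  # non-negative frequencies only
        freq = k * sample_rate / n
        if f_lo <= freq <= f_hi:
            bin_value = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                            for t in range(n))
            total += abs(bin_value) ** 2
    return total
```

A microphone whose recording of the sweep is quieter in the target band yields a smaller energy value here, and hence a larger compensation coefficient in the previous step.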

In a possible embodiment of the present application, localizing the sound source based on all the calibrated audio time-domain data includes:

obtaining audio feature values from the calibrated audio time-domain data;

comparing every pair of the calibrated audio time-domain data based on the audio feature values and preset keyword feature values, and selecting the voice keyword frame groups whose audio feature values match;

determining the time-delay information between the two audio frames in a voice keyword frame group; and

localizing the sound source based on the time-delay information.
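The delay-determination step above can be sketched by picking the lag that maximizes the cross-correlation of two calibrated channels. Plain time-domain correlation is shown for illustration (production systems often use variants such as GCC-PHAT), and all names are assumptions:

```python
def estimate_delay(x, y, max_lag):
    """Inter-channel delay in samples between two calibrated channels.

    Returns the lag that maximizes the cross-correlation of x and y.
    A positive result means y received the signal later than x.
    """
    n = len(x)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(x[t] * y[t + lag]
                    for t in range(n) if 0 <= t + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Because the amplitudes were calibrated beforehand, the correlation peak is not biased by per-microphone gain differences, which is precisely the benefit the method above aims for.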

In a possible embodiment of the present application, comparing every pair of the calibrated audio time-domain data based on the audio feature values and the preset keyword feature values, and selecting the voice keyword frame groups whose audio feature values match, includes:

comparing any two pieces of calibrated audio time-domain data based on the audio feature values, and selecting at least one audio frame group whose audio feature values match; and

selecting, from the at least one audio frame group, the voice keyword frame group that matches the preset keyword feature values.
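The two-stage selection above might be sketched as follows, using scalar per-frame "features" and a matching tolerance as deliberately simplified stand-ins for whatever feature representation an implementation actually uses:

```python
def match_keyword_frames(feat_a, feat_b, keyword_feats, tol=1e-6):
    """Two-stage frame-group selection across two calibrated channels.

    Stage 1: pair up frames whose feature values match across channels.
    Stage 2: keep only the pairs whose feature also matches a preset
    keyword feature value.
    """
    groups = [(i, j)
              for i, fa in enumerate(feat_a)
              for j, fb in enumerate(feat_b)
              if abs(fa - fb) <= tol]
    return [(i, j) for i, j in groups
            if any(abs(feat_a[i] - kf) <= tol for kf in keyword_feats)]
```

The surviving pairs are the "voice keyword frame groups" whose frame indices feed the delay estimation of the previous step.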

In a second aspect, the present application further provides a sound source localization apparatus configured on a smart device that includes a microphone array. The apparatus includes:

an information acquisition module, configured to acquire audio time-domain data separately collected from the sound source by at least two microphones in the microphone array, and the frequency-response error compensation value of each of the at least two microphones;

an audio calibration module, configured to calibrate, for each piece of audio time-domain data, the amplitude of each frame based on the corresponding frequency-response error compensation value to obtain calibrated audio time-domain data, so that the amplitudes of corresponding frames in all the calibrated audio time-domain data are consistent; and

a sound source localization module, configured to localize the sound source based on all the calibrated audio time-domain data.

In a third aspect, the present application further provides a smart device, including:

a microphone array including multiple microphones, the microphones being used to collect analog audio data;

an analog-to-digital converter, used to convert the analog audio data into audio time-domain data; and

a controller connected to the analog-to-digital converter to receive the audio time-domain data, the controller including a processor, a memory, and a sound source localization program stored in the memory, where the steps of the sound source localization method are implemented when the sound source localization program is run by the processor.

In a fourth aspect, the present application further provides a computer-readable storage medium on which a sound source localization program is stored, where the steps of the sound source localization method are implemented when the sound source localization program is run by a processor.

Thus, an embodiment of the present application provides a sound source localization method, including: acquiring audio time-domain data separately collected from the sound source by at least two microphones in the microphone array, and the frequency-response error compensation value of each of the at least two microphones; for each piece of audio time-domain data, calibrating the amplitude of each frame based on the corresponding frequency-response error compensation value to obtain calibrated audio time-domain data, so that the amplitudes of corresponding frames in all the calibrated audio time-domain data are consistent; and localizing the sound source based on all the calibrated audio time-domain data.

Therefore, when processing all the audio time-domain data collected by the microphone array, the present application uses the frequency-response error compensation value of each microphone to compensate the amplitude of each frame in its audio time-domain data, thereby eliminating the frequency-response errors caused by individual differences, assembly tolerances, and similar factors. Each microphone thus produces a consistent amplitude signal for the same sound source, which facilitates subsequent processing of the data from different microphones, improves the localization accuracy of the sound source localization method, and yields better recognition accuracy for the far-field voice function of smart devices equipped with microphone arrays.

Brief Description of the Drawings

FIG. 1 is a schematic structural diagram of the smart device of the present application;

FIG. 2 is a schematic flowchart of a first embodiment of the sound source localization method of the present application;

FIG. 3 is a schematic flowchart of a second embodiment of the sound source localization method of the present application;

FIG. 4 is a schematic flowchart of a third embodiment of the sound source localization method of the present application;

FIG. 5 is a block diagram of the sound source localization apparatus of the present application.

The realization of the purpose, functional features, and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.

In the related art, smart devices equipped with a far-field voice pickup function, such as smart speakers and TVs, are increasingly useful in daily life. The front-end processing of the far-field voice pickup function is sound source localization, which depends on the time-delay information between matching frames across the multiple microphones of the device's microphone array. Specifically, sound source localization measures the acoustic signal with multiple microphones at different positions in the environment; because the signal reaches each microphone with a different delay, an algorithm processes the measured signals to obtain the direction of arrival (including azimuth and elevation) and the distance of the sound source relative to the microphones.

In actual manufacturing, however, there are inevitably individual differences between microphones, as well as manufacturing errors, so a frequency-response error exists between any two microphones. Typically the allowed frequency-response error of a microphone is ±3 dB, so for the same source the frequency-response difference between microphones can reach 6 dB. During installation of the microphone array there are also assembly errors; the allowed assembly error may be, for example, 0.03 mm. These individual differences and assembly errors cause the microphones in the array to produce different frequency responses, i.e. different sound pressure values, for the same source. This degrades the accuracy of the matching-frame computation and, in turn, the computational accuracy of both sound source localization and speech recognition.

To this end, the present application provides a solution: determine the frequency-response error compensation value each microphone in the array needs in order to produce the same frequency response when capturing the same source, and then use each microphone's compensation value to compensate the amplitude of each frame of its audio time-domain data. This eliminates the frequency-response errors caused by individual differences and assembly tolerances, so that the microphones in the array produce consistent amplitude signals for the same source. It facilitates subsequent processing of the data from different microphones, improves the localization accuracy of the sound source localization method, and yields better recognition accuracy for the far-field voice function of smart devices equipped with microphone arrays.

The technical terms involved in this embodiment are explained below:

Frequency range: the range between the lowest and the highest effective playback frequency a speaker can reproduce.

Frequency response: when an audio signal output at constant voltage is played through a speaker, the volume the speaker produces rises or falls with frequency and the phase varies with frequency; this frequency-dependent variation of volume and phase is called the frequency response, measured in decibels (dB).

Frequency response error: the difference in average volume produced within the same octave bandwidth, used to measure how accurately a speaker reproduces sound.

The smart device used in the technical implementation of the present application is described below:

Referring to FIG. 1, FIG. 1 is a schematic structural diagram of the smart device in the hardware operating environment involved in the embodiments of the present application.

As shown in FIG. 1, the smart device may include a microphone array 1007, an analog-to-digital converter 1006, and a controller.

The microphone array 1007 includes multiple microphones. The microphone array provided in this embodiment may be a miniature pickup microphone array installed in, for example, a smart speaker, a mobile phone, or a tablet computer. The microphones are used to collect analog audio data.

The analog-to-digital converter 1006 is connected to the microphone array 1007 and to the controller, and converts the analog audio data into audio time-domain data in digital form.

The controller includes: a processor 1001, such as a central processing unit (CPU); a communication bus 1002; a user interface 1003; a network interface 1004; and a memory 1005. The communication bus 1002 implements the connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and optionally a standard wired or wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM) such as disk storage; optionally, the memory 1005 may also be a storage device independent of the processor 1001.

Those skilled in the art will understand that the structure shown in FIG. 1 does not limit the smart device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

As shown in FIG. 1, the memory 1005, as a storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, and a sound source localization program.

In the smart device shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 may be provided in the smart device, which calls the sound source localization program stored in the memory 1005 through the processor 1001 and executes the sound source localization method provided by the embodiments of the present application.

Based on, but not limited to, the above hardware structure of the smart device, the present application provides a first embodiment of the sound source localization method. Referring to FIG. 2, FIG. 2 shows a schematic flowchart of the first embodiment of the sound source localization method of the present application.

It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that given here.

In this embodiment, the sound source localization method includes:

Step S100: acquire audio time-domain data separately collected from the sound source by at least two microphones in the microphone array, and the frequency-response error compensation value of each of the at least two microphones.

Specifically, the method in this embodiment is executed by a smart device. The smart device is equipped with a microphone array and has a far-field voice function; it may be a smart speaker, a smart TV, a mobile phone, or the like. It will be understood that the smart device may also be configured in a smart home system to receive voice control commands and send them to a server, so that the corresponding smart home device performs the corresponding action. The following description takes a smart speaker as the example smart device.

The sound source may be a user issuing a voice control command: the user speaks preset keywords to input the command. For example, in a study-room scene, the microphone array in the smart speaker receives the sound "Xiao x, Xiao x" uttered by the user.

It will be understood that, for that sound source, each microphone in the array captures an analog signal, which is converted into a digital signal by an ADC (analog-to-digital converter); the SoC (system on chip) in the smart speaker then receives the digital signal. That is, the audio time-domain data in this embodiment is the digital signal of the sound captured by each microphone.

The frequency-response error compensation value is a value saved by each smart speaker's SoC: the compensation each microphone requires to produce the same amplitude for the same audio. It can be computed by the manufacturer in a test environment by analyzing recordings of the same test excitation signal captured by the microphones. Alternatively, the smart speaker may provide a calibration mode in which the user can measure the compensation values.

It is worth noting that, since every microphone inevitably has a frequency-response error, the digital signals reconstructed from the analog signals captured by the multiple microphones of one smart device deviate from one another in audio energy, loudness, or sound pressure for the same source. The compensation value in this embodiment makes the digital time-domain signals reconstructed by the microphones consistent with one another in the amplitude of the audio energy, loudness, or sound pressure value; it does not necessarily make those amplitudes equal to the audio energy of the original sound signal.

Since the sound source localization function requires at least two microphones, this embodiment acquires the audio time-domain data collected by at least two microphones and the frequency-response error compensation value of each of those microphones. Of course, to improve localization accuracy, the audio time-domain data and compensation values of all microphones in the array may be used; the following description assumes that they are.

Step S200: for each piece of audio time-domain data, calibrate the amplitude of each frame based on the corresponding frequency-response error compensation value to obtain calibrated audio time-domain data, so that the amplitudes of corresponding frames in all the calibrated audio time-domain data are consistent.

For any frame of the time-domain waveform obtained by converting a microphone's analog signal, the coordinate on the vertical axis of the waveform is the amplitude of the sound. Because of frequency-response errors, the original amplitudes corresponding to the same instant differ across the time-domain waveform data of different microphones. This embodiment compensates the original amplitude of each frame with the microphone's frequency-response error compensation value to obtain new time-domain waveform data, i.e. the calibrated audio time-domain data. All calibrated audio data then have consistent amplitudes for the same audio from the source.

For example, suppose the amplitude of the audio frame emitted by the source at a certain instant is A; the amplitude of the corresponding frame in the first microphone's waveform data is data1 and the first microphone's compensation value is ff1; for the second microphone they are data2 and ff2; and for the third microphone, data3 and ff3. After calibration, the amplitudes corresponding to that frame are data1+ff1, data2+ff2, and data3+ff3 respectively, with data1+ff1 = data2+ff2 = data3+ff3. The specific numerical relation between this common value and A is not restricted in this embodiment.
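The worked example above can be checked numerically. The amplitudes and the common target value below are made-up illustration numbers, and, as in the text, the relation between the calibrated amplitude and the true source amplitude A is left open:

```python
# One frame of the same source as seen by three microphones (illustrative
# numbers standing in for data1..data3).
data = [0.90, 1.10, 1.05]

# Common calibrated amplitude; its relation to the true source amplitude A
# is deliberately left open, as in the text.
target = 1.00

# Additive compensation values ff1..ff3 chosen so the calibrated amplitudes agree.
ff = [target - d for d in data]

# Calibrated amplitudes: data_i + ff_i, identical across channels.
calibrated = [d + f for d, f in zip(data, ff)]
```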

Step S300: locate the sound source based on all the calibrated audio time-domain data.

Once the calibrated audio time-domain data are obtained, the sound source localization algorithm can operate on amplitude signals that are mutually consistent, yielding the direction of arrival (azimuth and elevation) and the distance of the sound source relative to the microphones. This localizes the sound source and in turn enables the subsequent far-field voice pickup function.

In this embodiment, when the multiple audio time-domain data streams collected by the microphone array are processed, the frequency response error compensation value of each microphone is used to compensate the amplitude of every frame. This eliminates the frequency response errors caused by individual microphone differences and assembly tolerances, so every microphone in the array produces a consistent amplitude signal for the same source. That consistency simplifies subsequent cross-microphone processing, improves the localization accuracy of the sound source localization method, and thereby improves the far-field voice recognition accuracy of smart devices equipped with the microphone array.

Based on the foregoing embodiments, a second embodiment of the sound source localization method of the present application is provided. Referring to FIG. 3, FIG. 3 is a schematic flowchart of the second embodiment of the sound source localization method of the present application.

It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.

In this embodiment, the step of obtaining the frequency response error compensation value of each of the at least two microphones specifically includes:

Step S21: acquire the actual audio data that each microphone collects from the same test sound source data.

Specifically, the smart speaker can be provided with a microphone calibration mode, which the user enters by issuing a corresponding preset command. Once in calibration mode, the user plays the test audio data through a playback device such as an artificial mouth. Understandably, to cover the frequency response characteristic, the test audio data may be a preset frequency-sweep signal output at constant volume while its frequency varies continuously from 100 Hz to 20 kHz. Meanwhile the smart speaker commands the microphone array to start recording; each microphone produces one recording file, yielding multiple recording files. Each recording file is the actual audio data collected by the corresponding microphone.
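The preset sweep can be sketched as a constant-amplitude logarithmic chirp rising from 100 Hz to 20 kHz. The sample rate and duration below are illustrative choices, not values from the patent, which only fixes the frequency range and the constant output level.

```python
import math

fs = 48000          # sample rate, Hz (assumed)
dur = 1.0           # sweep duration, s (assumed)
f0, f1 = 100.0, 20000.0  # sweep endpoints from the text

k = math.log(f1 / f0)
n = int(fs * dur)

# Logarithmic sweep: instantaneous frequency is f0 * exp(k * t / dur),
# amplitude is constant at 1.0 throughout.
sweep = [math.sin(2 * math.pi * f0 * dur / k * (math.exp(k * t / (fs * dur)) - 1))
         for t in range(n)]
```

A logarithmic (rather than linear) sweep spends comparable time per octave, which is a common choice when the goal is a frequency-response measurement.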

Step S22: determine the test audio energy value of each set of actual audio data.

The smart speaker can compute the test audio energy value of each set of actual audio data: e1, e2, …, en, where n is the number of microphones in the array and en is the test audio energy value of the n-th microphone.

It is worth noting that in the loudspeaker field, 200 Hz is the crossover point of a subwoofer and 800 Hz the crossover point of a mid-bass driver, while the energy of typical human speech is concentrated in roughly 200–1000 Hz. Therefore, in one embodiment, step S22 specifically includes:

Determine the test audio energy value of the audio distributed in the target frequency band in each set of actual audio data.

Here the target frequency band is 200–800 Hz. That is, in this embodiment the smart speaker computes, from the actual audio data, the test audio energy value of each set of actual audio data within 200–800 Hz: e1, e2, …, en. In this case, the test audio energy value may be the average audio energy of the portion of the actual audio data that falls in the target band.
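The band-limited energy of step S22 can be sketched with a naive DFT restricted to the 200–800 Hz bins. The sample rate, frame length, and test tones below are illustrative assumptions; a real implementation would use an FFT, but the quadratic-time DFT keeps the sketch dependency-free.

```python
import math

def band_energy(samples, fs, f_lo=200.0, f_hi=800.0):
    """Average squared DFT magnitude (per sample) over bins in [f_lo, f_hi]."""
    n = len(samples)
    energies = []
    for k in range(n // 2 + 1):
        f = k * fs / n  # center frequency of bin k
        if f_lo <= f <= f_hi:
            re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = sum(samples[t] * math.sin(-2 * math.pi * k * t / n) for t in range(n))
            energies.append((re * re + im * im) / n)
    return sum(energies) / len(energies)

fs = 8000
n = 400  # 20 Hz bin spacing, so 400 Hz and 2000 Hz fall exactly on bins
tone_in = [math.sin(2 * math.pi * 400 * t / fs) for t in range(n)]    # inside band
tone_out = [math.sin(2 * math.pi * 2000 * t / fs) for t in range(n)]  # outside band
```

A tone inside the 200–800 Hz band yields a large band energy, while a tone outside it contributes essentially nothing, which is the selectivity the embodiment relies on.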

Step S23: determine, based on the test audio energy values, the reference audio energy value corresponding to the same test sound source data.

After all the test sample data, that is, all the test audio energy values, are obtained, a reference audio energy value can be determined from some or all of them. Understandably, a reference determined from all the test audio energy values is more accurate. For example, the reference audio energy value may be the mean, the mode, or another average of all the test audio energy values.

As a specific implementation, step S23 specifically includes:

Step S231: determine the average energy value of all the test audio energy values;

Step S232: use the average energy value as the reference audio energy value.

Specifically, the average energy value of all the test audio energy values is calculated as

e0 = (e1 + e2 + … + en) / n,

and this average energy value is taken as the reference audio energy value e0.

Step S24: determine, based on each test audio energy value and the reference audio energy value, the frequency response error compensation value of the corresponding microphone, thereby obtaining the frequency response error compensation value of every microphone.

From the numerical relationship between a test audio energy value and the reference audio energy value, the frequency response error compensation value of the corresponding microphone can be determined, yielding the compensation value of each microphone.

As an embodiment, step S24 specifically includes:

Step S241: determine the ratio of the test audio energy value to the reference audio energy value.

Step S242: obtain the compensation coefficient of the corresponding microphone based on the ratio.

Step S243: obtain the frequency response error compensation value of the corresponding microphone in the time domain based on the compensation coefficient.

Specifically, for the test audio energy values e1, e2, …, en, the ratio r of each test audio energy value to the reference audio energy value e0 is calculated, e.g.

r1 = e1 / e0, …, rn = en / e0.

The compensation coefficient f is then computed from the ratio r.

As one option, the compensation coefficient f can be calculated from the ratio r according to Formula 1. (Formula 1 appears only as an image in the original publication and is not reproduced here.) In this way the compensation coefficient of each microphone can be calculated: f1, f2, f3, …, fn, where fn is the compensation coefficient of the n-th microphone.

Then, based on the compensation coefficients f1, f2, f3, …, fn, the frequency response error compensation value of each microphone can be calculated.

As an embodiment, the frequency response error compensation value ffn of each microphone is calculated from the compensation coefficients f1, f2, f3, …, fn according to Formula 2:

ffn = 2^31 · fn,

where ffn is the frequency response error compensation value of the n-th microphone.
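Steps S241–S243 can be sketched end to end. Note a loudly labeled assumption: Formula 1, mapping the ratio r to the coefficient f, survives only as an image in the source and is not recoverable, so the inverse ratio f = 1/r is assumed here purely for illustration; only the fixed-point scaling ffn = 2^31 · fn is stated explicitly in the text.

```python
def frequency_response_offsets(energies):
    """Sketch of steps S23 + S241-S243 for a list of per-mic test energies."""
    e0 = sum(energies) / len(energies)      # step S23: reference energy (mean)
    ratios = [e / e0 for e in energies]     # step S241: r_n = e_n / e0
    coeffs = [1.0 / r for r in ratios]      # step S242: ASSUMED Formula 1 (f = 1/r)
    return [2**31 * f for f in coeffs]      # step S243: Formula 2, ff_n = 2^31 * f_n

# Hypothetical test energies for three microphones.
ff = frequency_response_offsets([2.0, 4.0, 6.0])
```

With f = 1/r, a microphone whose measured energy is below the reference gets a coefficient above 1, which matches the intuition of boosting a quieter channel; the true Formula 1 may differ.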

This embodiment provides a way to determine the numerical relationship between the audio energy values of the actual audio data that the microphones obtain from the same test sound source data, and then to derive from that relationship compensation values which make the microphones produce identical amplitude signals when capturing the same audio. Compensation values determined in this way calibrate the amplitude of each frame fairly accurately, bringing the amplitudes of corresponding frames into agreement.

Based on the foregoing embodiments, a third embodiment of the sound source localization method of the present application is provided. Referring to FIG. 4, FIG. 4 is a schematic flowchart of the third embodiment of the sound source localization method of the present application.

It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.

In this embodiment, step S300 specifically includes:

Step S310: obtain the audio feature values of the calibrated audio time-domain data.

Specifically, the audio feature values include, but are not limited to, energy features such as the root-mean-square energy of each frame, of partial waveforms, or of the whole signal, as well as time-domain, frequency-domain, music-theoretic, and perceptual features.

Step S320: based on the audio feature values and preset keyword feature values, compare every pair of calibrated audio time-domain data streams and filter out the voice keyword frame groups whose audio feature values match.

After the audio feature values are computed from the calibrated audio time-domain data, any two of the calibrated streams can be compared on those features to filter out the matching voice keyword frame groups. "Matching audio feature values" means the feature values of the two audio frames agree; a "voice keyword frame group" means both frames contain the features of the voice keyword.

As an embodiment, step S320 specifically includes:

Step S321: based on the audio feature values, compare every pair of calibrated audio time-domain data streams and filter out at least one audio frame group whose feature values match.

Step S322: from the at least one audio frame group, filter out the voice keyword frame group that matches the preset keyword feature values.

Understandably, because the microphones in the array are at different distances from the user, they receive the user's sound signal at different times. For sound source localization, the audio frames that correspond to the same sound signal must first be picked out across the different calibrated streams: any two calibrated audio time-domain data streams are compared and at least one audio frame group with matching feature values is filtered out. Because the streams have been calibrated with the frequency response error compensation values, the amplitudes of frames corresponding to the same sound signal agree across streams, that is, the waveforms for the same segment of the signal are consistent, so the matching audio frame groups can be filtered out more accurately.

Then, according to the preset keyword feature values, the voice keyword frame group matching them is selected from the at least one audio frame group. The preset keyword feature values can be extracted from the preset keyword itself. In one example the preset keyword is "Xiao X, Xiao X": the smart speaker extracts the audio feature values of "Xiao X, Xiao X", stores them as the preset keyword feature values, and then compares the feature values of the frames in each audio frame group against them one by one to filter out the voice keyword frame group.
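A minimal sketch of the pairwise frame matching of step S321, using per-frame RMS as a stand-in audio feature. The frames, the tolerance, and the choice of RMS are illustrative assumptions; the patent allows a much richer feature set (time-domain, frequency-domain, perceptual, etc.).

```python
import math

def rms(frame):
    """Root-mean-square amplitude of one audio frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def matching_pairs(frames_a, frames_b, tol=1e-6):
    """Pair frames from two calibrated channels whose RMS features agree."""
    pairs = []
    for i, fa in enumerate(frames_a):
        for j, fb in enumerate(frames_b):
            if abs(rms(fa) - rms(fb)) < tol:
                pairs.append((i, j))
    return pairs

# Channel B hears the same two frames as channel A, but delayed (index shifted).
frames_a = [[0, 0], [3, 4], [1, 1]]
frames_b = [[3, 4], [0, 0]]
pairs = matching_pairs(frames_a, frames_b)
```

The index offset within each matched pair is exactly the frame-level delay that step S330 turns into time-delay information. Calibration matters here: without it, the same sound would have different RMS on different channels and the tolerance test would fail.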

Step S330: determine the time delay information between the two audio frames in the voice keyword frame group.

Step S340: locate the sound source based on the time delay information.

Once the voice keyword frame group is determined, the time delay information can be computed from the timestamps of the two audio frames, and the sound source can then be located based on the time delay information together with the position information of the corresponding microphones.

For example, if the voice keyword frame group contains the audio frame of the first microphone at time ti and the audio frame of the second microphone at time tj, the time delay information can be computed as |ti − tj|.
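The timestamp step can be sketched together with a common far-field follow-up: converting the delay into an azimuth via the standard two-microphone relation θ = asin(c·τ/d). The patent itself only states the delay |ti − tj|; the microphone spacing, the timestamps, and the asin conversion below are illustrative assumptions, not values or formulas from the source.

```python
import math

c = 343.0   # speed of sound, m/s (room temperature)
d = 0.10    # microphone spacing, m (assumed)

# Arrival timestamps of the matched keyword frames on mic 1 and mic 2 (assumed).
ti, tj = 0.001250, 0.001104

tau = abs(ti - tj)                                 # step S330: delay |ti - tj|
# Far-field azimuth from the delay; clamp the argument against rounding overshoot.
theta = math.degrees(math.asin(min(1.0, c * tau / d)))
```

With more than two microphones, one such angle per microphone pair can be intersected to recover azimuth, elevation, and distance as the text describes.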

Based on the same inventive concept, and referring to FIG. 5, the present application further provides a sound source localization device, including:

an information acquisition module, configured to acquire the audio time-domain data that at least two microphones of a microphone array separately collect from the sound source, and the frequency response error compensation value of each of the at least two microphones;

an audio calibration module, configured to calibrate, for each set of audio time-domain data, the amplitude of each frame based on the corresponding frequency response error compensation value, obtaining calibrated audio time-domain data in which the amplitudes of corresponding frames are consistent across all streams; and

a sound source localization module, configured to locate the sound source based on all the calibrated audio time-domain data.

It should be noted that for the implementations of the sound source localization device in this embodiment and the technical effects they achieve, reference may be made to the implementations of the sound source localization method in the foregoing embodiments, which are not repeated here.

In addition, an embodiment of the present application further provides a computer storage medium on which a sound source localization program is stored; when the sound source localization program is executed by a processor, the steps of the sound source localization method above are realized, so they are not repeated here, nor is the description of the corresponding beneficial effects. For technical details not disclosed in the computer-readable storage medium embodiments of the present application, refer to the description of the method embodiments. By way of example, the program instructions may be deployed to execute on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

It should also be noted that the device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In the drawings of the device embodiments provided in this application, the connections between modules indicate communication links between them, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.

From the description of the above embodiments, those skilled in the art can clearly understand that this application can be implemented by software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and so on. In general, any function performed by a computer program can easily be realized with corresponding hardware, and the specific hardware structure realizing the same function can also vary: analog circuits, digital circuits, dedicated circuits, and so on. For this application, however, a software implementation is in most cases the better choice. On that understanding, the essence of the technical solution of this application, or the part contributing beyond the prior art, can be embodied in the form of a software product stored in a readable storage medium, such as a computer floppy disk, USB drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc, including several instructions to make a computer device (a personal computer, a server, a network device, etc.) execute the methods of the various embodiments of this application.

The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structural or process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (10)

1. A sound source localization method, applied to a smart device comprising a microphone array, the method comprising:
acquiring audio time-domain data separately collected from the sound source by at least two microphones of the microphone array, and a frequency response error compensation value of each of the at least two microphones;
for each set of audio time-domain data, calibrating the amplitude of each frame based on the corresponding frequency response error compensation value to obtain calibrated audio time-domain data, so that the amplitudes of corresponding frames in all the calibrated audio time-domain data are consistent; and
locating the sound source based on all the calibrated audio time-domain data.

2. The sound source localization method according to claim 1, wherein acquiring the frequency response error compensation value of each of the at least two microphones comprises:
acquiring actual audio data separately collected from the same test sound source data by each of the at least two microphones;
determining a test audio energy value of each set of actual audio data;
determining, based on the test audio energy values, a reference audio energy value corresponding to the same test sound source data; and
determining, based on each test audio energy value and the reference audio energy value, the frequency response error compensation value of the corresponding microphone, to obtain the frequency response error compensation value of each microphone.

3. The sound source localization method according to claim 2, wherein determining, based on each test audio energy value and the reference audio energy value, the frequency response error compensation value of the corresponding microphone comprises:
determining the ratio of the test audio energy value to the reference audio energy value;
obtaining a compensation coefficient of the corresponding microphone based on the ratio; and
obtaining the frequency response error compensation value of the corresponding microphone in the time domain based on the compensation coefficient.

4. The sound source localization method according to claim 2, wherein determining the reference audio energy value corresponding to the same test sound source data based on the test audio energy values comprises:
determining the average energy value of all the test audio energy values; and
using the average energy value as the reference audio energy value.

5. The sound source localization method according to any one of claims 2 to 4, wherein the same test sound source data is preset frequency-sweep audio data that includes at least audio in a target frequency band; and
determining the test audio energy value of each set of actual audio data comprises:
determining the test audio energy value of the audio distributed in the target frequency band in each set of actual audio data.

6. The sound source localization method according to claim 1, wherein locating the sound source based on all the calibrated audio time-domain data comprises:
obtaining audio feature values of the calibrated audio time-domain data;
comparing, based on the audio feature values and preset keyword feature values, every pair of the calibrated audio time-domain data to filter out a voice keyword frame group whose audio feature values match;
determining time delay information between the two audio frames in the voice keyword frame group; and
locating the sound source based on the time delay information.

7. The sound source localization method according to claim 6, wherein comparing every pair of the calibrated audio time-domain data to filter out the voice keyword frame group comprises:
comparing, based on the audio feature values, every pair of the calibrated audio time-domain data to filter out at least one audio frame group whose audio feature values match; and
filtering out, from the at least one audio frame group, the voice keyword frame group that matches the preset keyword feature values.

8. A sound source localization device, configured in a smart device comprising a microphone array, the device comprising:
an information acquisition module, configured to acquire audio time-domain data separately collected from the sound source by at least two microphones of the microphone array, and a frequency response error compensation value of each of the at least two microphones;
an audio calibration module, configured to calibrate, for each set of audio time-domain data, the amplitude of each frame based on the corresponding frequency response error compensation value to obtain calibrated audio time-domain data, so that the amplitudes of corresponding frames in all the calibrated audio time-domain data are consistent; and
a sound source localization module, configured to locate the sound source based on all the calibrated audio time-domain data.

9. A smart device, comprising:
a microphone array comprising a plurality of microphones, the microphones being configured to collect audio analog data;
an analog-to-digital converter, configured to convert the audio analog data into audio time-domain data; and
a controller connected to the analog-to-digital converter to receive the audio time-domain data, the controller comprising a processor, a memory, and a sound source localization program stored in the memory, wherein when the sound source localization program is run by the processor, the steps of the sound source localization method according to any one of claims 1 to 7 are realized.

10. A computer-readable storage medium on which a sound source localization program is stored, wherein when the sound source localization program is run by a processor, the steps of the sound source localization method according to any one of claims 1 to 7 are realized.
CN202310213510.6A 2023-03-07 2023-03-07 Sound source positioning method, device, intelligent equipment and medium Pending CN116359843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310213510.6A CN116359843A (en) 2023-03-07 2023-03-07 Sound source positioning method, device, intelligent equipment and medium

Publications (1)

Publication Number Publication Date
CN116359843A true CN116359843A (en) 2023-06-30

Family

ID=86918016


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050254662A1 (en) * 2004-05-14 2005-11-17 Microsoft Corporation System and method for calibration of an acoustic system
CN109451415A (en) * 2018-12-17 2019-03-08 深圳Tcl新技术有限公司 Microphone array auto-collation, device, equipment and storage medium
CN110133594A (en) * 2018-02-09 2019-08-16 北京搜狗科技发展有限公司 A kind of sound localization method, device and the device for auditory localization
CN110600058A (en) * 2019-09-11 2019-12-20 深圳市万睿智能科技有限公司 Method and device for awakening voice assistant based on ultrasonic waves, computer equipment and storage medium
WO2022257499A1 (en) * 2021-06-11 2022-12-15 五邑大学 Sound source localization method and apparatus based on microphone array, and storage medium

Similar Documents

Publication Publication Date Title
CN1901760B (en) Acoustic field measuring device and acoustic field measuring method
US8175284B2 (en) Method and apparatus for calibrating sound-reproducing equipment
JP5729905B2 (en) Audio system calibration method and apparatus
WO2015196729A1 (en) Microphone array speech enhancement method and device
CN107221319A (en) A kind of speech recognition test system and method
CN103634726A (en) Automatic loudspeaker equalization method
TW201127090A (en) Audio processing apparatus and method
CN113259832A (en) Microphone array detection method and device, electronic equipment and storage medium
US11800310B2 (en) Soundbar and method for automatic surround pairing and calibration
CN113257247B (en) Test method and system
CN108882115A (en) loudness adjusting method, device and terminal
CN109545237A (en) Computer readable storage medium and voice interaction sound box applying same
CN116567515B (en) Microphone array calibration method
CN117939360A (en) Audio gain control method and system for Bluetooth loudspeaker box
CN111613248A (en) Pickup testing method, device and system
CN110475181B (en) Equipment configuration method, device, equipment and storage medium
CN111816207A (en) Sound analysis method, system, automobile and storage medium
CN115604630A (en) Sound field expansion method, audio apparatus, and computer-readable storage medium
CN116359843A (en) Sound source positioning method, device, intelligent equipment and medium
US20240221770A1 (en) Information processing device, information processing method, information processing system, and program
Messner et al. Adaptive differential microphone arrays used as a front-end for an automatic speech recognition system
Pessentheiner Differential microphone arrays
CN115529520A (en) Method and device for testing sound quality of earphone, test terminal and storage medium
CN115134733A (en) Testing method and testing system for loudspeaker and microphone of set top box
CN115914949A (en) Sound effect compensation method, projector and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination