CN115038014B - Audio signal processing method and device, electronic equipment and storage medium

Info

Publication number: CN115038014B
Application number: CN202210625968.8A
Authority: CN
Inventors: 邓刚
Original assignee: Shenzhen Changfeng Imaging Equipment Co ltd
Current assignee: Shenzhen Changfeng Imaging Equipment Co ltd
Priority date: 2022-06-02
Filing date: 2022-06-02
Publication date: 2024-10-29
Anticipated expiration: 2042-06-02
Also published as: CN115038014A

Abstract

The embodiment of the invention discloses an audio signal processing method and device, an electronic device and a storage medium. The method comprises: acquiring a first audio signal collected by a directional microphone and a second audio signal collected by an omnidirectional microphone; determining a sound source direction according to the acquisition parameters of the second audio signal; determining a speech mode based on the sound source direction and the intensity of the first audio signal; and fusing the first audio signal and the second audio signal based on the speech mode to obtain a target audio signal. The embodiment can receive audio signals from all directions without the microphone having to be passed between speakers, which improves the user experience. Because the target audio signal is obtained by fusing the first audio signal and the second audio signal according to the speech mode, the received audio signals are processed more accurately, the signal-to-noise ratio of the target audio signal is effectively improved, and the user experience is further improved.

Description

Audio signal processing method, device, electronic device and storage medium

Technical Field

The embodiments of the present invention relate to signal processing technology, and in particular to an audio signal processing method and device, an electronic device and a storage medium.

Background Art

Microphones for collecting audio signals are generally divided into directional microphones and omnidirectional microphones. A directional microphone is used to collect the voice audio signal of the main speaker, while an omnidirectional microphone is used to collect the voice audio signals of multiple people.

In everyday meetings, a directional microphone is often used to pick up the main speaker's voice. When other participants speak, the microphone has to be passed to them, which is very inconvenient. An omnidirectional microphone can pick up the voices of multiple people, but it also picks up more ambient noise, so the signal quality is poor.

Summary of the Invention

The embodiments of the present invention provide an audio signal processing method and device, an electronic device and a storage medium, which can apply different noise reduction processing to received audio signals according to different speech modes.

In a first aspect, an embodiment of the present invention provides an audio signal processing method, the method comprising:

acquiring a first audio signal collected by the directional microphone, and acquiring a second audio signal collected by the omnidirectional microphone;

determining a sound source direction according to acquisition parameters of the second audio signal;

determining a speech mode according to the sound source direction and the intensity of the first audio signal;

fusing the first audio signal and the second audio signal based on the speech mode to obtain a target audio signal.

In a second aspect, an embodiment of the present invention provides an audio signal processing device, the device comprising:

a signal acquisition module, configured to acquire a first audio signal collected by the directional microphone and a second audio signal collected by the omnidirectional microphone;

a direction determination module, configured to determine a sound source direction according to acquisition parameters of the second audio signal;

a mode determination module, configured to determine a speech mode according to the sound source direction and the intensity of the first audio signal;

a target signal acquisition module, configured to fuse the first audio signal and the second audio signal based on the speech mode to obtain a target audio signal.

In a third aspect, an embodiment of the present invention further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the audio signal processing method described in any one of the embodiments of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the audio signal processing method described in any one of the embodiments of the present invention.

The embodiments of the present invention apply to an audio device that includes a directional microphone and an omnidirectional microphone. A first audio signal collected by the directional microphone and a second audio signal collected by the omnidirectional microphone are acquired; the sound source direction is determined according to the acquisition parameters of the second audio signal; the speech mode is determined according to the sound source direction and the intensity of the first audio signal; and the first audio signal and the second audio signal are fused based on the speech mode to obtain a target audio signal. In other words, the audio device has both types of microphone and can obtain the audio signals received by each, combining their advantages: audio signals from all directions can be received without the microphone having to be passed around, which improves the user experience. Fusing the audio signals collected by the two types of microphone according to the speech mode improves the quality of the resulting target audio signal, which further improves the user experience.

Brief Description of the Drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope. A person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

FIG. 1 is a schematic flow chart of an audio signal processing method provided by an embodiment of the present invention;

FIG. 2 is a flow chart of obtaining a target audio signal provided by an embodiment of the present invention;

FIG. 3 is another flow chart of the audio signal processing method provided by an embodiment of the present invention;

FIG. 4 is a schematic flow chart of signal extraction provided by an embodiment of the present invention;

FIG. 5 is a structural diagram of an audio signal processing device provided by an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.

FIG. 1 is a schematic flow chart of an audio signal processing method provided by an embodiment of the present invention. The method is applicable to an audio device having two types of microphones. It can be executed by the audio signal processing device provided by an embodiment of the present invention, and the device can be implemented in software and/or hardware. In a specific embodiment, the device can be integrated in an electronic device, for example an audio device. The following embodiments are described taking integration in an audio device as an example. The audio device includes a directional microphone and an omnidirectional microphone, and there may be one or more directional microphones.

Referring to FIG. 1, the method may specifically include the following steps:

Step 101: Acquire a first audio signal collected by a directional microphone, and acquire a second audio signal collected by an omnidirectional microphone.

The directional microphone is based on multi-diaphragm technology and receives sound signals by varying the cavity configuration of the microphone capsule and by mixing and modulating the outputs of the multiple diaphragms. A directional microphone is highly sensitive to sound from its designated direction and much less sensitive to sound from other directions. An omnidirectional microphone can receive sound from any direction with the same sensitivity in all directions, which makes it suitable for places such as conference rooms and offices. For example, a meeting with multiple participants may be provided with both a directional microphone and an omnidirectional microphone. The directional microphone collects the voice of the person speaking into it, while the omnidirectional microphone collects both that person's voice and the voices of the other participants. The directional microphone also receives ambient audio from its pickup direction along with the voice signal, and the omnidirectional microphone receives the ambient audio of the whole meeting along with all the voice signals. Specifically, the voice audio signal and ambient audio signal received by the directional microphone constitute the first audio signal, and the voice audio signal and ambient audio signal received by the omnidirectional microphone constitute the second audio signal.

Step 102: Determine the sound source direction according to the acquisition parameters of the second audio signal.

The second audio signal includes the voice audio signal and the ambient audio signal collected by the omnidirectional microphone. The acquisition parameters include the time at which the audio signal is received, the speed of sound, and the analog angular frequency, amplitude and phase of the audio signal. Specifically, the sound source direction can be determined by steered beamforming based on maximum output power, by high-resolution spectral estimation, or by sound source localization based on the time difference of arrival. Steered beamforming based on maximum output power forms a beam as a weighted sum of the collected audio signals, steers the beam across the possible positions of the sound source, and adjusts the weights so that the output signal power of the microphone array is maximized. High-resolution spectral estimation estimates the direction of the sound source by decomposing the covariance matrix of the received signals; it is not limited by the sampling frequency and can, under certain conditions, achieve arbitrarily fine localization. Localization based on the time difference of arrival estimates the difference between the times at which the signal reaches each microphone and derives the direction from it, or uses the geometric relationship between the sound signal and the microphones to determine the source position. The sound source direction is then determined by applying one of these localization methods to acquisition parameters such as the reception time, the speed of sound, and the analog angular frequency, amplitude and phase of the audio signal.
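The time-difference approach can be illustrated for the simplest case of two microphones and a distant source. The following is a minimal sketch, not the method of the embodiment: it assumes a far-field (plane-wave) source, and the function name `tdoa_to_angle`, the microphone spacing and the angle convention are hypothetical choices made for illustration.

```python
import math

SPEED_OF_SOUND = 342.0  # m/s, the approximate value used in this description


def tdoa_to_angle(delay_s: float, mic_spacing_m: float,
                  c: float = SPEED_OF_SOUND) -> float:
    """Return the far-field arrival angle in degrees for one microphone pair.

    delay_s is the arrival-time difference between the two microphones.
    0 degrees means broadside (source equidistant from both microphones);
    +/-90 degrees means the source lies on the line through the microphones.
    """
    # Path-length difference implied by the delay, normalised by the spacing.
    ratio = c * delay_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1]; clamp it.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A single pair only resolves the angle relative to the pair's axis; more microphones are needed to remove the remaining ambiguity.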

Step 103: Determine the speech mode according to the sound source direction and the intensity of the first audio signal.

The speech mode is either a single-person speech mode or a multi-person speech mode. The single-person speech mode suits scenes in which audio needs to be collected from only one person or from one specified direction. For example, in a lecture venue only the lecturer's audio signal needs to be collected, and in theater or singing performances only the audio signals of the performers on stage need to be collected. The multi-person speech mode suits scenes in which audio needs to be collected from multiple people or from multiple directions. For example, in a multi-person meeting the audio signals of all participants need to be collected. The speech mode can be determined from the sound source direction and the intensity of the first audio signal. Specifically, when the intensity of the first audio signal collected by the directional microphone reaches a certain level and the direction of that signal lies within the direction covered by the directional microphone, someone is speaking at the position where the directional microphone is placed, and the speech mode is determined to be the single-person speech mode. If the intensity of the first audio signal is weak, or if it reaches that level but the signal does not come from the direction the directional microphone points to, the person speaking is not at the position where the directional microphone is placed, and the speech mode is determined to be the multi-person speech mode.

Step 104: Fuse the first audio signal and the second audio signal based on the speech mode to obtain a target audio signal.

The fusion processing combines the amplified voice audio signal and the noise-suppressed ambient audio signal obtained from the first audio signal and the second audio signal into the target audio signal. FIG. 2 is a flow chart of obtaining the target audio signal provided by an embodiment of the present invention. As shown in FIG. 2, after the first audio signal and the second audio signal are acquired, the voice audio signal and the ambient audio signal are extracted from them. The extracted voice audio signal is amplified to obtain a voice-enhanced signal, and the extracted ambient audio signal is noise-suppressed to obtain a noise-suppressed audio signal. The voice-enhanced signal and the noise-suppressed audio signal are then fused to obtain the target audio signal.
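The amplify, suppress and fuse stages of FIG. 2 can be sketched as follows. This is only an illustration under stated assumptions: the description does not specify how amplification, noise suppression or fusion are carried out, so fixed scalar gains and a sample-wise sum are used here, and the names `fuse`, `voice_gain` and `ambient_gain` are hypothetical.

```python
from typing import List, Sequence


def fuse(voice: Sequence[float], ambient: Sequence[float],
         voice_gain: float = 2.0, ambient_gain: float = 0.25) -> List[float]:
    """Amplify the extracted voice, attenuate the extracted ambient signal, and sum them.

    Assumption: both inputs are time-aligned sample sequences of equal length.
    """
    if len(voice) != len(ambient):
        raise ValueError("voice and ambient must have the same length")
    enhanced = [voice_gain * v for v in voice]        # voice-enhanced signal
    suppressed = [ambient_gain * a for a in ambient]  # noise-suppressed signal
    # Fusion modelled as a sample-wise sum of the two processed signals.
    return [e + s for e, s in zip(enhanced, suppressed)]
```

In practice the gains would likely vary over time and frequency; constant gains are used only to keep the data flow visible.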

In the technical solution of this embodiment, a first audio signal collected by a directional microphone and a second audio signal collected by an omnidirectional microphone are acquired. The sound source direction is determined according to the acquisition parameters of the second audio signal. The speech mode is determined according to the sound source direction and the intensity of the first audio signal. The first audio signal and the second audio signal are fused based on the speech mode to obtain a target audio signal. With this solution, audio signals from all directions can be received, and determining the speech mode from both the sound source direction and the intensity of the first audio signal improves the accuracy of that determination. Fusing the first audio signal and the second audio signal according to the speech mode allows the received audio signals to be processed more accurately, improves the signal-to-noise ratio of the target audio signal, and further improves the user experience.

FIG. 3 is another flow chart of an audio signal processing method provided by an embodiment of the present invention. This embodiment refines the embodiment described above. As shown in FIG. 3, the method may include the following steps:

Step 201: Acquire a first audio signal collected by a directional microphone, and acquire a second audio signal collected by an omnidirectional microphone.

Step 202: Determine the sound source direction according to the acquisition parameters of the second audio signal.

The acquisition parameters include the time at which the second audio signal is received, the speed of sound, and the analog angular frequency, amplitude and phase of the audio signal. Because the omnidirectional microphone can receive audio signals from any direction, the second audio signal it receives can be used to determine the sound source direction of the first audio signal.

In this solution there are at least two omnidirectional microphones, and determining the sound source direction according to the acquisition parameters of the second audio signal includes: for each hypothesized sound source direction, calculating the sum of the phase-transform-weighted generalized cross-correlation (GCC-PHAT) functions of the signals collected by the at least two omnidirectional microphones, based on the acquisition parameters of the second audio signal collected by each of them; and, based on the sum of the GCC-PHAT functions, searching the sound source space for the direction that maximizes the steered response power (SRP) value, thereby obtaining the sound source direction.

Specifically, let $x_m(n)$ denote the signal received by the $m$-th omnidirectional microphone, let $r_l$ and $r_m$ denote the rectangular coordinate vectors of the $l$-th and $m$-th array elements, let $c$ be the speed of sound in air (about 342 m/s), let $q$ be the rectangular coordinate vector of the hypothesized sound source, and let $\tau_{lm}(q)$ denote the difference between the arrival times from the hypothesized sound source at the $l$-th and $m$-th microphones. Then $\tau_{lm}(q)$ is:

$$\tau_{lm}(q) = \frac{\lVert q - r_l \rVert - \lVert q - r_m \rVert}{c}$$

where $\lVert q - r_m \rVert$ denotes the norm of $q - r_m$ and $\lVert q - r_l \rVert$ denotes the norm of $q - r_l$. Let $R_{lm}(\tau)$ denote the GCC-PHAT function of the signals received by the $l$-th and $m$-th microphones. Its expression is:

$$R_{lm}(\tau) = \sum_{k} \frac{X_l(k)\, X_m^{*}(k)}{\lvert X_l(k)\, X_m^{*}(k) \rvert}\, e^{j\omega\tau}$$

where $X_m(k)$ is the fast Fourier transform of $x_m(n)$, $X_l(k)$ is the fast Fourier transform of $x_l(n)$, "$*$" denotes complex conjugation, $k$ indexes the fast Fourier transform points, $\omega$ is the analog angular frequency corresponding to point $k$, and $e^{j\omega\tau}$ is the complex exponential factor.

The SRP-PHAT function based on GCC-PHAT is then:

$$P(q) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} R_{lm}\bigl(\tau_{lm}(q)\bigr)$$

where $M$ is the number of omnidirectional microphones, $q$ is the rectangular coordinate vector of the hypothesized sound source, $\tau_{lm}(q)$ is the arrival time difference from the hypothesized sound source to the $l$-th and $m$-th microphones, and $R_{lm}$ is the GCC-PHAT function of the signals received by the $l$-th and $m$-th microphones. The estimate of the sound source position can then be expressed as:

$$\hat{q} = \arg\max_{q \in Q} P(q)$$

where $\arg\max$ denotes the value of $q$ at which $P(q)$ attains its maximum, and $Q$ is the preset sound source space.
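The search described above (sum the GCC-PHAT values of every microphone pair at the delay implied by a hypothesized source position, then keep the position with the largest sum) can be sketched as follows. This is an illustrative sketch rather than the implementation of the embodiment. It assumes a small set of candidate positions as the sound source space, uses a naive dependency-free DFT in place of a fast Fourier transform, and the function names `dft`, `phat_cross_spectrum`, `gcc_phat` and `srp_phat` are hypothetical.

```python
import cmath
import math
from itertools import combinations


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stands in for an FFT in this sketch)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def phat_cross_spectrum(x_l, x_m):
    """PHAT-weighted cross-spectrum X_l(k) X_m*(k) / |X_l(k) X_m*(k)|."""
    out = []
    for a, b in zip(dft(x_l), dft(x_m)):
        cross = a * b.conjugate()
        mag = abs(cross)
        out.append(cross / mag if mag > 1e-12 else 0j)  # ignore empty bins
    return out


def gcc_phat(spectrum, tau_samples):
    """Evaluate the GCC-PHAT function at a (possibly fractional) lag in samples."""
    n = len(spectrum)
    total = 0j
    for k, value in enumerate(spectrum):
        # Bins above the Nyquist bin represent negative frequencies.
        signed_k = k if k <= n // 2 else k - n
        total += value * cmath.exp(2j * math.pi * signed_k * tau_samples / n)
    return total.real


def srp_phat(signals, mic_positions, candidates, fs, c=342.0):
    """Return the candidate position with the largest summed GCC-PHAT value."""
    pairs = list(combinations(range(len(signals)), 2))
    spectra = {(l, m): phat_cross_spectrum(signals[l], signals[m])
               for l, m in pairs}

    def power(q):
        total = 0.0
        for l, m in pairs:
            # Arrival-time difference of the hypothesized source q at mics l and m.
            tau = (math.dist(q, mic_positions[l])
                   - math.dist(q, mic_positions[m])) / c
            total += gcc_phat(spectra[(l, m)], tau * fs)
        return total

    return max(candidates, key=power)
```

With only two microphones, all positions sharing the same arrival-time difference score equally, so the candidate set must be chosen so that its members are distinguishable; additional microphones reduce this ambiguity.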

Step 203: Determine the speech mode according to the sound source direction and the intensity of the first audio signal.

After the sound source direction of the first audio signal is determined, the speech mode is determined according to the sound source direction and the intensity of the first audio signal.

In this embodiment, optionally, determining the speech mode according to the sound source direction and the intensity of the first audio signal includes the following steps A1 to A3:

Step A1: Determine whether the intensity of the first audio signal is greater than a preset intensity, and determine whether the direction of the first audio signal belongs to the sound source direction.

The preset intensity can be set according to the environment and specific requirements of the actual application scenario. Because the omnidirectional microphone can receive audio signals from any direction, the audio signal it receives can be used to determine the sound source direction of the first audio signal received by the directional microphone. After the first audio signal is received, it is determined whether its intensity is greater than the preset intensity, and the second audio signal received by the omnidirectional microphone is used to determine whether the direction of the first audio signal belongs to the sound source direction.

Step A2: If the intensity of the first audio signal is greater than the preset intensity and the direction of the first audio signal belongs to the sound source direction, determine that the speech mode is the single-person speech mode.

Specifically, when the intensity of the first audio signal is greater than the preset intensity and its direction belongs to the sound source direction, the directional microphone is receiving a strong audio signal from its designated direction, and the speech mode is determined to be the single-person speech mode. For example, during a lecture, the directional microphone placed on the stage receives the lecturer's voice audio signal.

Step A3: If the intensity of the first audio signal is not greater than the preset intensity, or if the intensity of the first audio signal is greater than the preset intensity but its direction does not belong to the sound source direction, determine that the speech mode is the multi-person speech mode.

Specifically, if the intensity of the first audio signal is not greater than the preset intensity, the directional microphone is not receiving an audio signal of sufficient intensity, and the speech mode is determined to be the multi-person speech mode. If the intensity of the first audio signal is greater than the preset intensity but its direction does not belong to the sound source direction, the person producing the audio signal is outside the range covered by the directional microphone, and the speech mode is likewise determined to be the multi-person speech mode. For example, the podium of a conference room is equipped with a directional microphone and the seating area with an omnidirectional microphone. During a discussion-style meeting, participants generally talk from their seats. Even if the directional microphone picks up an audio signal, that signal comes from the seating area rather than the podium, so the speech mode is determined to be the multi-person speech mode.

Determining the speech mode by the above steps improves the accuracy of the determination and lays the foundation for subsequently selecting different audio processing according to the speech mode.
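Steps A1 to A3 amount to a two-condition decision, which can be sketched as follows. The representation is an assumption made for illustration: directions are expressed as azimuth angles in degrees, the pickup zone of the directional microphone is modelled as its axis plus or minus a half-width, and the names `SpeechMode`, `determine_speech_mode` and `beam_half_width_deg` are hypothetical. The description does not fix the unit of intensity, so any consistent measure can be used.

```python
from enum import Enum


class SpeechMode(Enum):
    SINGLE = "single-person"
    MULTI = "multi-person"


def determine_speech_mode(first_signal_intensity: float,
                          source_direction_deg: float,
                          mic_axis_deg: float,
                          preset_intensity: float,
                          beam_half_width_deg: float = 30.0) -> SpeechMode:
    """Apply steps A1 to A3 to one measurement."""
    # Step A1: compare the intensity with the preset intensity ...
    loud_enough = first_signal_intensity > preset_intensity
    # ... and check whether the estimated source direction falls inside the
    # assumed pickup zone. The difference is wrapped into [0, 180] degrees.
    diff = abs((source_direction_deg - mic_axis_deg + 180.0) % 360.0 - 180.0)
    in_beam = diff <= beam_half_width_deg
    # Step A2: both conditions hold, so a single person is speaking into the
    # directional microphone. Step A3: otherwise, multi-person mode.
    return SpeechMode.SINGLE if (loud_enough and in_beam) else SpeechMode.MULTI
```

For instance, `determine_speech_mode(0.8, 120.0, 0.0, 0.5)` yields the multi-person mode because the source lies well outside the assumed 30-degree half-width, even though the signal is loud enough.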

步骤204、基于讲话模式从第一音频信号和第二音频信号中提取人声音频信号。Step 204: extract a human voice audio signal from the first audio signal and the second audio signal based on the speech mode.

在接收到第一音频信号和第二音频信号后，将人声音频信号从第一音频信号和第二音频信号中提取出来。After receiving the first audio signal and the second audio signal, a human voice audio signal is extracted from the first audio signal and the second audio signal.

In this embodiment, optionally, when the speech mode is the multi-person speech mode, the audio signals of all human voices are extracted from the first audio signal and the second audio signal to obtain the human voice audio signal. When the speech mode is the single-person speech mode, the audio signal of the human voice is extracted from the first audio signal to obtain the human voice audio signal.

When the speech mode is determined to be the multi-person speech mode, the human voice audio signals received by both the directional microphone and the omnidirectional microphone need to be extracted. For example, at a press conference, a directional microphone is set up on the podium and an omnidirectional microphone is set up in the press seating area. In this case, both the main speaker's voice received by the directional microphone and the reporters' voices need to be extracted. Accordingly, in the multi-person speech mode, the audio signals of all human voices are extracted from the first audio signal and the second audio signal to obtain the human voice audio signal.

When the speech mode is determined to be the single-person speech mode, only the human voice audio signal received by the directional microphone needs to be extracted. For example, during a speech, the directional microphone on the podium receives the speaker's voice, whereas any human voice received by the omnidirectional microphone in the audience area does not need to be extracted as a human voice audio signal. Accordingly, in the single-person speech mode, the audio signal of the human voice is extracted from the first audio signal to obtain the human voice audio signal.

With the above approach, the appropriate human voice audio signals can be extracted simply and accurately from the first audio signal and the second audio signal according to the speech mode, which improves the signal-to-noise ratio of the target audio signal obtained subsequently.
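The mode-dependent choice of voice sources described above can be sketched as follows. This is only an illustration: the names `SpeechMode` and `voice_sources` are hypothetical, signals are modelled as plain lists of samples, and the actual voice extraction step is not shown.

```python
from enum import Enum
from typing import List

Signal = List[float]  # assumption: a signal is a list of PCM samples


class SpeechMode(Enum):
    SINGLE = "single"  # single-person speech mode
    MULTI = "multi"    # multi-person speech mode


def voice_sources(mode: SpeechMode, first: Signal, second: Signal) -> List[Signal]:
    """Return the signals from which human voice should be extracted.

    first  -- signal from the directional microphone
    second -- signal from the omnidirectional microphone
    """
    if mode is SpeechMode.MULTI:
        # Multi-person mode: voices from both microphones are wanted.
        return [first, second]
    # Single-person mode: only the directional microphone carries wanted voice.
    return [first]
```

In the single-person case the omnidirectional signal is deliberately left out, which matches the statement that audience voices are not extracted as human voice.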

Step 205: Extract an ambient audio signal from the first audio signal and the second audio signal based on the speech mode.

After the speech mode has been determined and the human voice audio signal has been extracted according to the speech mode, the ambient audio signal is extracted from the first audio signal and the second audio signal.

In this embodiment, optionally, extracting the ambient audio signal from the first audio signal and the second audio signal includes the following steps B1 and B2:

Step B1: Determine the similarity between the first audio signal and the second audio signal, on the one hand, and a preset ambient signal, on the other.

The preset ambient signals can be established per type of ambient sound by pre-training on various ambient sound signals with a deep-learning-based method. Once the preset ambient signals are available, the first audio signal and the second audio signal are matched against them to obtain a similarity matching result.

Step B2: Extract, from the first audio signal and the second audio signal, the signals whose similarity to the preset ambient signal exceeds a preset similarity value.

The preset similarity value can be set according to the actual environment and specific requirements, or it can be derived from repeated training on the preset ambient signals. Specifically, after the similarity matching result between the first and second audio signals and the preset ambient signal has been obtained, it is determined whether the similarity of each signal extracted from the first audio signal and the second audio signal exceeds the preset similarity value. If it does, that signal is an ambient audio signal. The signals whose similarity to the preset ambient signal exceeds the preset similarity value are then extracted from the first audio signal and the second audio signal.

In the above method, pre-training on various ambient sound signals with a deep-learning-based method, and extracting from the first audio signal and the second audio signal the signals whose similarity to the preset ambient signal exceeds the preset similarity value, improves the accuracy of identifying whether an audio signal is an ambient audio signal.
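Steps B1 and B2 can be illustrated with a small sketch. The source does not specify the similarity measure or the trained model, so cosine similarity between feature vectors is used here purely as a stand-in, and the function names, the frame representation, and the threshold are all assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def extract_ambient_frames(
    frames: List[List[float]],
    preset_ambient: List[List[float]],
    preset_similarity: float,
) -> List[List[float]]:
    """Step B1: score each frame against every preset ambient template.
    Step B2: keep the frames whose best score exceeds the preset similarity value."""
    ambient = []
    for frame in frames:
        best = max(
            (cosine_similarity(frame, template) for template in preset_ambient),
            default=0.0,
        )
        if best > preset_similarity:
            ambient.append(frame)
    return ambient
```

In a real system the templates would come from the pre-trained ambient sound model rather than being hand-written vectors.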

The ambient audio signal that is extracted differs between speech modes.

In this embodiment, optionally, when the speech mode is the single-person speech mode, the ambient audio is extracted from the first audio signal and the second audio signal, the human voice audio is additionally extracted from the second audio signal, and together these form the ambient audio signal. When the speech mode is the multi-person speech mode, the ambient audio is extracted from the first audio signal and the second audio signal to obtain the ambient audio signal.

Specifically, in the single-person speech mode, the human voice in the second audio signal is not needed as a human voice audio signal. For example, during a speech, the directional microphone on the podium receives the speaker's voice, while the human voice received by the omnidirectional microphone in the audience area can be regarded as ambient noise. Therefore, in the single-person speech mode, the ambient audio is extracted from the first audio signal and the second audio signal, and the human voice audio extracted from the second audio signal is added to it, to obtain the ambient audio signal.

In the multi-person speech mode, the human voice in the second audio signal needs to be extracted as a human voice audio signal. Therefore, in the multi-person speech mode, only the ambient audio is extracted from the first audio signal and the second audio signal to obtain the ambient audio signal.

With the above approach, the appropriate ambient audio signals can be extracted simply and accurately from the first audio signal and the second audio signal according to the speech mode. In the single-person speech mode, treating the human voice in the second audio signal as part of the ambient audio signal improves the signal-to-noise ratio of the target audio signal obtained subsequently.
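As a rough sketch of the rule above, the ambient audio signal can be assembled from its components depending on the speech mode. The names below are hypothetical, and the individual components are assumed to have been extracted already.

```python
from enum import Enum
from typing import List

Signal = List[float]  # assumption: a signal is a list of PCM samples


class SpeechMode(Enum):
    SINGLE = "single"
    MULTI = "multi"


def ambient_components(
    mode: SpeechMode,
    ambient_first: Signal,
    ambient_second: Signal,
    voice_second: Signal,
) -> List[Signal]:
    """Return the components that make up the ambient audio signal.

    ambient_first / ambient_second -- ambient audio from the directional / omnidirectional mic
    voice_second                   -- human voice picked up by the omnidirectional mic
    """
    components = [ambient_first, ambient_second]
    if mode is SpeechMode.SINGLE:
        # Single-person mode: voices on the omnidirectional mic count as ambient noise.
        components.append(voice_second)
    return components
```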

Step 206: Amplify the human voice audio signal to obtain a human voice enhancement signal.

After the human voice audio signal has been obtained, it is amplified by a preset amplification ratio to obtain the human voice enhancement signal. The human voice audio signals extracted in the single-person speech mode and in the multi-person speech mode can be amplified by the same ratio or by different ratios. The amplification ratio can be set according to the actual environment and specific requirements.
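A minimal sketch of step 206, assuming samples normalized to the range [-1, 1]. The clipping limit and the example gain are assumptions added for illustration; the source only specifies a preset amplification ratio.

```python
from typing import List


def amplify(voice: List[float], gain: float, limit: float = 1.0) -> List[float]:
    """Scale every sample by the preset amplification ratio.

    Clipping to [-limit, limit] is an added safeguard, not something the source requires.
    """
    return [max(-limit, min(limit, sample * gain)) for sample in voice]
```

For example, `amplify(voice, 2.0)` doubles each sample and clips anything beyond full scale; a different gain could be chosen per speech mode.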

Step 207: Determine the sound wave amplitude value of the ambient audio signal.

The sound wave amplitude value can be understood as the sound amplitude of the audio signal; it reflects the range and intensity of the sound vibration. Specifically, after the ambient audio signal has been determined from the first audio signal and the second audio signal, the sound wave amplitude value of the ambient audio signal is determined.
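The source does not say how the amplitude value is measured. Two common choices, shown here only as assumptions, are the peak absolute sample value and the root-mean-square (RMS) value of a frame.

```python
import math
from typing import Sequence


def peak_amplitude(frame: Sequence[float]) -> float:
    """Largest absolute sample value in the frame (0.0 for an empty frame)."""
    return max((abs(sample) for sample in frame), default=0.0)


def rms_amplitude(frame: Sequence[float]) -> float:
    """Root-mean-square value of the frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))
```

Peak amplitude reacts to short spikes, while RMS tracks the average energy of the frame; which one suits the preset threshold depends on the deployment.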

Step 208: Filter out of the ambient audio signal the signals whose sound wave amplitude value exceeds a preset sound wave amplitude value to obtain a noise-suppressed audio signal; or use a cancellation signal to cancel the signal value of a designated signal in the ambient audio signal down to no more than the preset sound wave amplitude value to obtain the noise-suppressed audio signal, where the designated signal is a signal in the ambient audio signal whose sound wave amplitude value exceeds the preset signal value.

The preset sound wave amplitude value can be set according to the actual environment, specific requirements, and professional experience. If the ambient audio signal contains a signal whose sound wave amplitude value exceeds the preset sound wave amplitude value, that signal can be regarded as noise and is filtered out of the ambient audio signal. Alternatively, the designated signal is identified in the ambient audio signal, a signal with a similar sound wave amplitude value but opposite polarity is generated, and that signal is used as the cancellation signal. The cancellation signal is then used to cancel the designated signal in the ambient audio signal down to no more than the preset sound wave amplitude value, yielding the noise-suppressed audio signal.

FIG. 4 is a schematic flowchart of signal extraction provided by an embodiment of the present invention. As shown in FIG. 4, when the speech mode is determined to be the multi-person speech mode, the audio signals of all human voices are extracted from the first audio signal and the second audio signal as the human voice audio signal, and the human voice audio signal is amplified to obtain the human voice enhancement signal. The ambient audio is extracted from the first audio signal and the second audio signal, and it is determined whether the similarity between the ambient audio signal and the preset ambient signal exceeds the preset similarity value. If so, the ambient audio signal is filtered or cancelled, finally yielding the noise-suppressed audio signal.

When the speech mode is determined to be the single-person speech mode, the human voice audio signal is extracted from the first audio signal, the ambient audio is extracted from the first audio signal and the second audio signal, and the human voice audio extracted from the second audio signal is also treated as ambient audio. It is then determined whether the similarity between the ambient audio signal and the preset ambient signal exceeds the preset similarity value. If so, the ambient audio signal is filtered or cancelled, finally yielding the noise-suppressed audio signal.
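The two alternatives of step 208 can be sketched frame by frame as follows. This is an illustration under assumptions: the ambient signal is split into frames, peak amplitude is used as the amplitude value, and the cancellation signal is simply a scaled, polarity-inverted copy of the offending frame. A real system would have to estimate the cancellation signal rather than copy it.

```python
from typing import List

Frame = List[float]


def _peak(frame: Frame) -> float:
    return max((abs(sample) for sample in frame), default=0.0)


def suppress_by_filtering(frames: List[Frame], max_amplitude: float) -> List[Frame]:
    """Variant 1: drop every frame whose amplitude exceeds the preset value."""
    return [frame for frame in frames if _peak(frame) <= max_amplitude]


def suppress_by_cancellation(frames: List[Frame], max_amplitude: float) -> List[Frame]:
    """Variant 2: add an opposite-polarity signal so the result does not exceed the preset value."""
    result = []
    for frame in frames:
        peak = _peak(frame)
        if peak > max_amplitude:
            # Inverted copy, scaled so that frame + cancel peaks exactly at max_amplitude.
            scale = 1.0 - max_amplitude / peak
            cancel = [-scale * sample for sample in frame]
            frame = [sample + c for sample, c in zip(frame, cancel)]
        result.append(list(frame))
    return result
```

Filtering removes loud ambient content entirely, whereas cancellation keeps it at a reduced level, which may preserve more of the room ambience.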

Step 209: Fuse the human voice enhancement signal and the noise-suppressed audio signal to obtain the target audio signal.

After the human voice enhancement signal and the noise-suppressed audio signal have been obtained, the two are fused. Fusing them yields a clearer human voice while ensuring that the target audio signal retains good transparency.
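The source does not define the fusion operation. One simple possibility, shown here as an assumption, is a sample-wise weighted sum in which the shorter signal is padded with silence; the `ambient_weight` parameter is hypothetical.

```python
from itertools import zip_longest
from typing import List


def fuse(
    voice_enhanced: List[float],
    noise_suppressed: List[float],
    ambient_weight: float = 0.5,
) -> List[float]:
    """Mix the enhanced voice with a scaled copy of the noise-suppressed ambient signal."""
    return [
        voice + ambient_weight * ambient
        for voice, ambient in zip_longest(voice_enhanced, noise_suppressed, fillvalue=0.0)
    ]
```

Keeping some ambient content in the mix, rather than discarding it, is what would preserve the sense of transparency mentioned above.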

In this embodiment of the present invention, a first audio signal collected by the directional microphone and a second audio signal collected by the omnidirectional microphone are acquired. For each hypothesized sound source direction, the sum of the phase-transform-weighted generalized cross-correlation (GCC-PHAT) functions of the signals collected by the at least two omnidirectional microphones is calculated from the acquisition parameters of the second audio signal collected by each omnidirectional microphone. Based on the sum of the GCC-PHAT functions, the sound source space is searched for the direction that maximizes the steered response power (SRP) value, which gives the sound source direction. The speech mode is determined according to the sound source direction and the intensity of the first audio signal. The human voice audio signal and the ambient audio signal are each extracted from the first audio signal and the second audio signal based on the speech mode. The human voice audio signal is amplified to obtain the human voice enhancement signal, and the sound wave amplitude value of the ambient audio signal is determined. Signals whose sound wave amplitude value exceeds the preset sound wave amplitude value are filtered out of the ambient audio signal to obtain the noise-suppressed audio signal; or a cancellation signal is used to cancel the signal value of a designated signal in the ambient audio signal down to no more than the preset sound wave amplitude value to obtain the noise-suppressed audio signal, where the designated signal is a signal in the ambient audio signal whose sound wave amplitude value exceeds the preset signal value. The human voice enhancement signal and the noise-suppressed audio signal are fused to obtain the target audio signal.

In the technical solution of this embodiment, the speech mode can be determined accurately from the first audio signal and the second audio signal, and the two signals are processed differently depending on the speech mode, which effectively improves the signal-to-noise ratio of the target audio signal. Fusing the human voice enhancement signal and the noise-suppressed audio signal yields a clearer human voice while ensuring that the target audio signal retains good transparency, thereby improving the user experience.

FIG. 5 is a structural diagram of an audio signal processing apparatus provided by an embodiment of the present invention. The apparatus is suitable for executing the audio signal processing method provided by the embodiments of the present invention and includes a directional microphone and an omnidirectional microphone. As shown in FIG. 5, the apparatus may specifically include:

a signal acquisition module 501, configured to acquire a first audio signal collected by the directional microphone and a second audio signal collected by the omnidirectional microphone;

a direction determination module 502, configured to determine a sound source direction according to acquisition parameters of the second audio signal;

a mode determination module 503, configured to determine a speech mode according to the sound source direction and the intensity of the first audio signal;

a target signal acquisition module 504, configured to fuse the first audio signal and the second audio signal based on the speech mode to obtain a target audio signal.

Optionally, the mode determination module 503 is specifically configured to:

determine whether the intensity of the first audio signal is greater than a preset intensity, and determine whether the direction of the first audio signal belongs to the sound source direction;

if the intensity of the first audio signal is greater than the preset intensity and the direction of the first audio signal belongs to the sound source direction, determine that the speech mode is the single-person speech mode;

if the intensity of the first audio signal is not greater than the preset intensity, or if the intensity of the first audio signal is greater than the preset intensity but the direction of the first audio signal does not belong to the sound source direction, determine that the speech mode is the multi-person speech mode.
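The decision rule of the mode determination module can be sketched as below. The source does not say how "belongs to the sound source direction" is tested, so this sketch assumes directions are azimuth angles in degrees and uses a hypothetical angular tolerance.

```python
from enum import Enum


class SpeechMode(Enum):
    SINGLE = "single"
    MULTI = "multi"


def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def determine_mode(
    intensity: float,
    preset_intensity: float,
    mic_direction_deg: float,
    source_direction_deg: float,
    tolerance_deg: float = 15.0,  # hypothetical tolerance, not from the source
) -> SpeechMode:
    """Single-person mode only if the directional signal is strong AND on-axis with the source."""
    strong = intensity > preset_intensity
    on_axis = angular_distance(mic_direction_deg, source_direction_deg) <= tolerance_deg
    return SpeechMode.SINGLE if strong and on_axis else SpeechMode.MULTI
```

Both failure cases in the text (a weak directional signal, or a strong one from the wrong direction) fall through to the multi-person mode.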

Optionally, the target signal acquisition module 504 specifically includes:

a signal extraction unit, configured to extract a human voice audio signal from the first audio signal and the second audio signal based on the speech mode, and to extract an ambient audio signal from the first audio signal and the second audio signal based on the speech mode;

a signal amplification unit, configured to amplify the human voice audio signal to obtain a human voice enhancement signal;

a signal noise suppression unit, configured to perform noise suppression on the ambient audio signal to obtain a noise-suppressed audio signal;

a target signal acquisition unit, configured to fuse the human voice enhancement signal and the noise-suppressed audio signal to obtain the target audio signal.

Optionally, the signal noise suppression unit is specifically configured to:

determine the sound wave amplitude value of the ambient audio signal;

filter out of the ambient audio signal the signals whose sound wave amplitude value exceeds a preset sound wave amplitude value to obtain the noise-suppressed audio signal; or use a cancellation signal to cancel the signal value of a designated signal in the ambient audio signal down to no more than the preset sound wave amplitude value to obtain the noise-suppressed audio signal, where the designated signal is a signal in the ambient audio signal whose sound wave amplitude value exceeds the preset signal value.

Optionally, the signal extraction unit is specifically configured to:

when the speech mode is the multi-person speech mode, extract the audio signals of all human voices from the first audio signal and the second audio signal to obtain the human voice audio signal;

when the speech mode is the single-person speech mode, extract the audio signal of the human voice from the first audio signal to obtain the human voice audio signal.

Optionally, the signal extraction unit is further configured to:

when the speech mode is the multi-person speech mode, extract the ambient audio from the first audio signal and the second audio signal to obtain the ambient audio signal;

when the speech mode is the single-person speech mode, extract the ambient audio from the first audio signal and the second audio signal, and extract the human voice audio from the second audio signal, to obtain the ambient audio signal;

determine the similarity between the first audio signal and the second audio signal and a preset ambient signal;

extract, from the first audio signal and the second audio signal, the signals whose similarity to the preset ambient signal exceeds a preset similarity value.

Optionally, the direction determination module 502 is specifically configured to:

for each hypothesized sound source direction, calculate, from the acquisition parameters of the second audio signal collected by each of the at least two omnidirectional microphones, the sum of the phase-transform-weighted generalized cross-correlation (GCC-PHAT) functions of the signals collected by the at least two omnidirectional microphones;

based on the sum of the GCC-PHAT functions, search the sound source space for the direction that maximizes the steered response power (SRP) value to obtain the sound source direction.
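For reference, the standard textbook form of this search (commonly called SRP-PHAT, for steered response power with phase transform) can be written as follows. The notation is an assumption, since the source gives no formulas: X_i(ω) is the spectrum of the signal from omnidirectional microphone i, τ_ij(θ) is the time difference of arrival between microphones i and j that a source in direction θ would produce, and constant scale factors are omitted.

```latex
% GCC-PHAT between omnidirectional microphones i and j
R_{ij}(\tau) = \int_{-\infty}^{\infty}
  \frac{X_i(\omega)\, X_j^{*}(\omega)}{\left| X_i(\omega)\, X_j^{*}(\omega) \right|}
  \, e^{\mathrm{j}\omega\tau}\, \mathrm{d}\omega

% SRP value for a hypothesized direction \theta, and the estimated source direction
P(\theta) = \sum_{i<j} R_{ij}\bigl(\tau_{ij}(\theta)\bigr),
\qquad
\hat{\theta} = \arg\max_{\theta} P(\theta)
```

The phase transform divides out the magnitude of the cross-spectrum so that only phase (delay) information contributes, which tends to make the peak sharper in reverberant rooms.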

The audio signal processing apparatus provided by the embodiments of the present invention can execute the audio signal processing method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to that method. For content not described in detail in this embodiment, refer to the description in any method embodiment of the present invention.

FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. It shows the structure of a computer system 12 suitable for implementing the electronic device of the embodiments of the present invention. The electronic device shown in FIG. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention. The components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).

The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

The electronic device 12 typically includes a variety of computer-system-readable media. These media can be any available media accessible by the electronic device 12, including volatile and non-volatile media, and removable and non-removable media.

The system memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in FIG. 6, commonly referred to as a "hard disk drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (e.g., at least one) program modules configured to perform the functions of the embodiments of the present invention.

A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.

The electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (e.g., a network card or modem) that enables the electronic device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. In addition, in the electronic device 12 of this embodiment, the display 24 does not exist as a standalone unit but is embedded in a mirror surface; when the display surface of the display 24 is not showing content, it visually merges with the mirror surface. The electronic device 12 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the electronic device 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

The processing unit 16 executes various functional applications and audio signal processing by running the programs stored in the system memory 28, for example implementing the audio signal processing method provided by the embodiments of the present invention: acquiring a first audio signal collected by a directional microphone and a second audio signal collected by an omnidirectional microphone; determining a sound source direction according to acquisition parameters of the second audio signal; determining a speech mode according to the sound source direction and the intensity of the first audio signal; and fusing the first audio signal and the second audio signal based on the speech mode to obtain a target audio signal.

An embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the audio signal processing method provided by the embodiments of the present invention: acquiring a first audio signal collected by a directional microphone and a second audio signal collected by an omnidirectional microphone; determining a sound source direction according to acquisition parameters of the second audio signal; determining a speech mode according to the sound source direction and the intensity of the first audio signal; and fusing the first audio signal and the second audio signal based on the speech mode to obtain a target audio signal.

Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.

计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号，其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式，包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质，该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, which carry computer-readable program code. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. Computer-readable signal media may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.

计算机可读介质上包含的程序代码可以用任何适当的介质传输，包括但不限于无线、电线、光缆、RF等等，或者上述的任意合适的组合。The program code embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

可以以一种或多种程序设计语言或其组合来编写用于执行本发明操作的计算机程序代码，程序设计语言包括面向对象的程序设计语言，诸如Java、Smalltalk、C++，还包括常规的过程式程序设计语言，诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中，远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)—连接到用户计算机，或者，可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。Computer program code for performing the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages such as "C" or similar programming languages. The program code may be executed entirely on the user's computer, partially on the user's computer, as a separate software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider).

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to those embodiments and may include other equivalent embodiments without departing from the concept of the present invention; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. An audio signal processing method, characterized in that it is applied to an audio device including a directional microphone and an omnidirectional microphone, the method comprising:

acquiring a first audio signal collected by the directional microphone, and acquiring a second audio signal collected by the omnidirectional microphone;

determining a sound source direction according to an acquisition parameter of the second audio signal;

determining a speech mode according to the sound source direction and the intensity of the first audio signal, wherein determining the speech mode according to the sound source direction and the intensity of the first audio signal comprises: determining whether the intensity of the first audio signal is greater than a preset intensity, and determining whether the direction of the first audio signal belongs to the sound source direction; if the intensity of the first audio signal is greater than the preset intensity and the direction of the first audio signal belongs to the sound source direction, determining that the speech mode is a single-person speech mode; if the intensity of the first audio signal is not greater than the preset intensity, or if the intensity of the first audio signal is greater than the preset intensity but the direction of the first audio signal does not belong to the sound source direction, determining that the speech mode is a multi-person speech mode; and

fusing the first audio signal and the second audio signal based on the speech mode to obtain a target audio signal.
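
As an illustration only, the mode decision recited in claim 1 can be sketched as follows. The function name, the representation of intensity as a float, and the modelling of "belongs to the sound source direction" as membership in a set of detected direction labels are all assumptions made for this sketch; the claim does not prescribe them.

```python
from enum import Enum


class SpeechMode(Enum):
    SINGLE = "single-person"
    MULTI = "multi-person"


def determine_speech_mode(first_intensity, first_direction,
                          source_directions, preset_intensity):
    """Pick a speech mode from the two conditions recited in claim 1.

    Hypothetical representation: intensity is a float, and directions are
    hashable labels (for example, an angle rounded to a sector).
    """
    loud_enough = first_intensity > preset_intensity
    in_source_direction = first_direction in source_directions
    if loud_enough and in_source_direction:
        return SpeechMode.SINGLE
    # Either not loud enough, or loud but not aligned with the sound source.
    return SpeechMode.MULTI
```

Note that an intensity exactly equal to the preset intensity falls under "not greater than" and therefore yields the multi-person speech mode.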

2. The audio signal processing method according to claim 1, wherein the fusing the first audio signal and the second audio signal based on the speech mode to obtain the target audio signal comprises:

extracting a human voice audio signal from the first audio signal and the second audio signal based on the speech mode, and extracting an ambient audio signal from the first audio signal and the second audio signal based on the speech mode;

amplifying the human voice audio signal to obtain a human voice enhancement signal;

performing noise suppression processing on the ambient audio signal to obtain a noise-suppressed audio signal; and

fusing the human voice enhancement signal and the noise-suppressed audio signal to obtain the target audio signal.
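
For illustration, the amplify, suppress, and fuse steps of claim 2 might be arranged as below. This is a minimal sketch under stated assumptions: signals are equal-length lists of float samples that have already been extracted, amplification is a constant gain, the noise suppression routine is supplied by the caller, and fusion is a sample-wise sum. None of these choices is specified by the claim, and the names `fuse_signals`, `voice_gain`, and `suppress` are hypothetical.

```python
def fuse_signals(voice, ambient, voice_gain, suppress):
    """Amplify the voice, suppress the ambient signal, then sum per sample.

    voice, ambient: equal-length lists of float samples (assumed format).
    voice_gain: constant amplification factor (assumed form of amplification).
    suppress: caller-supplied function mapping a sample list to a sample list.
    """
    if len(voice) != len(ambient):
        raise ValueError("voice and ambient must have the same length")
    enhanced = [voice_gain * s for s in voice]   # human voice enhancement signal
    quiet = suppress(ambient)                    # noise-suppressed audio signal
    return [v + a for v, a in zip(enhanced, quiet)]  # target audio signal


# Usage with a placeholder suppressor that halves every ambient sample.
target = fuse_signals([0.25, -0.5], [1.0, 0.5], 2.0,
                      lambda x: [0.5 * s for s in x])
```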

3. The audio signal processing method according to claim 2, wherein performing noise suppression processing on the ambient audio signal to obtain the noise-suppressed audio signal comprises:

determining the sound wave amplitude value of the ambient audio signal; and

filtering out, from the ambient audio signal, signals whose sound wave amplitude values exceed a preset sound wave amplitude value, to obtain the noise-suppressed audio signal; or

using a cancellation signal to cancel a designated signal in the ambient audio signal so that its signal value does not exceed the preset sound wave amplitude value, thereby obtaining the noise-suppressed audio signal, wherein the designated signal is a signal in the ambient audio signal whose sound wave amplitude value exceeds the preset signal value.
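
The two alternatives in claim 3 can be contrasted with a small sample-level sketch. It assumes the ambient audio signal is a list of float samples, that "filtering out" means zeroing the offending samples so timing is preserved, and that the cancellation signal is a per-sample offset opposing the excess; all three are assumptions for illustration, as are the function names.

```python
def filter_loud(ambient, preset_amplitude):
    """Alternative 1: remove samples whose amplitude exceeds the preset.

    Removed samples are replaced by 0.0 (an assumption that keeps the length).
    """
    return [s if abs(s) <= preset_amplitude else 0.0 for s in ambient]


def cancel_loud(ambient, preset_amplitude):
    """Alternative 2: add a cancellation value so samples stay within the preset."""
    out = []
    for s in ambient:
        if abs(s) > preset_amplitude:
            limit = preset_amplitude if s > 0 else -preset_amplitude
            cancellation = limit - s      # opposes the part above the limit
            out.append(s + cancellation)  # lands on the limit (up to rounding)
        else:
            out.append(s)
    return out
```

The first alternative discards loud content entirely, while the second keeps it at a bounded level.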

4. The audio signal processing method according to claim 2, wherein extracting a human voice audio signal from the first audio signal and the second audio signal based on the speech mode comprises:

when the speech mode is a multi-person speech mode, extracting audio signals of all human voices from the first audio signal and the second audio signal to obtain the human voice audio signal; and

when the speech mode is a single-person speech mode, extracting an audio signal of a human voice from the first audio signal to obtain the human voice audio signal.

5. The audio signal processing method according to claim 2, wherein extracting the ambient audio signal from the first audio signal and the second audio signal based on the speech mode comprises:

when the speech mode is a multi-person speech mode, extracting an ambient audio signal from the first audio signal and the second audio signal to obtain the ambient audio signal; and

when the speech mode is a single-person speech mode, extracting an ambient audio signal from the first audio signal and the second audio signal, and extracting an audio signal of a human voice from the second audio signal, to obtain the ambient audio signal.
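
One reading of claims 4 and 5 is that the speech mode only selects which microphone signals feed the voice path and which feed the ambient path. The sketch below encodes that reading. The mode labels `"single"` and `"multi"`, the function name, and the extractor callables `extract_voice` and `extract_ambient` are hypothetical; the claims do not specify how extraction itself is performed.

```python
def split_by_mode(mode, first, second, extract_voice, extract_ambient):
    """Return (voice_parts, ambient_parts) as lists of extracted signals.

    mode: "single" or "multi" (hypothetical labels for the two speech modes).
    first, second: directional and omnidirectional microphone signals.
    """
    if mode == "multi":
        # All voices from both microphones count as the human voice signal.
        voice = [extract_voice(first), extract_voice(second)]
        ambient = [extract_ambient(first), extract_ambient(second)]
    elif mode == "single":
        # Only the voice from the directional microphone is kept as voice;
        # voices picked up by the omnidirectional microphone join the ambient side.
        voice = [extract_voice(first)]
        ambient = [extract_ambient(first), extract_ambient(second),
                   extract_voice(second)]
    else:
        raise ValueError("unknown speech mode: %r" % (mode,))
    return voice, ambient
```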

6. The audio signal processing method according to claim 5, wherein extracting the ambient audio signal from the first audio signal and the second audio signal comprises:

determining similarities between each of the first audio signal and the second audio signal and a preset environment signal; and

extracting, from the first audio signal and the second audio signal, a signal whose similarity to the preset environment signal exceeds a preset similarity value.
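
Claim 6 does not name a similarity measure. As a hedged example, the sketch below assumes the audio has been cut into fixed-length frames and uses cosine similarity against a preset environment template of the same length; the framing, the measure, and the function names are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sample lists (0.0 if either is silent)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def extract_ambient_frames(frames, preset_environment, preset_similarity):
    """Keep frames whose similarity to the preset environment exceeds the preset value."""
    return [f for f in frames
            if cosine_similarity(f, preset_environment) > preset_similarity]


# Frames from either microphone can be passed in; only the first one matches here.
kept = extract_ambient_frames([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]],
                              [1.0, 0.0], 0.9)
```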

7. An audio signal processing device, characterized in that it is applied to an audio device including a directional microphone and an omnidirectional microphone, and the audio signal processing device comprises:

a signal acquisition module, configured to acquire a first audio signal collected by the directional microphone and to acquire a second audio signal collected by the omnidirectional microphone;

a direction determination module, configured to determine a sound source direction according to an acquisition parameter of the second audio signal;

a mode determination module, configured to determine a speech mode according to the sound source direction and the intensity of the first audio signal, wherein determining the speech mode according to the sound source direction and the intensity of the first audio signal comprises:

determining whether the intensity of the first audio signal is greater than a preset intensity, and determining whether the direction of the first audio signal belongs to the sound source direction; if the intensity of the first audio signal is greater than the preset intensity and the direction of the first audio signal belongs to the sound source direction, determining that the speech mode is a single-person speech mode; if the intensity of the first audio signal is not greater than the preset intensity, or if the intensity of the first audio signal is greater than the preset intensity but the direction of the first audio signal does not belong to the sound source direction, determining that the speech mode is a multi-person speech mode; and

a target signal acquisition module, configured to perform fusion processing on the first audio signal and the second audio signal based on the speech mode to obtain a target audio signal.

8. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the audio signal processing method according to any one of claims 1 to 6 when executing the program.

9. A computer-readable storage medium having a computer program stored thereon, wherein when the program is executed by a processor, the audio signal processing method according to any one of claims 1 to 6 is implemented.