
CN113470675A - Audio signal processing method and device - Google Patents


Info

Publication number
CN113470675A
Authority
CN
China
Prior art keywords
vector
signal
echo
current frame
echo separation
Prior art date
Legal status
Granted
Application number
CN202110739135.XA
Other languages
Chinese (zh)
Other versions
CN113470675B (en)
Inventor
操陈斌
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd and Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202110739135.XA
Publication of CN113470675A
Application granted
Publication of CN113470675B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L2021/02082: Noise filtering, the noise being echo or reverberation of the speech
    • G10L2021/02087: Noise filtering, the noise being separate speech, e.g. cocktail party
    • G10L21/0272: Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The present disclosure relates to the field of voice communication technologies, and in particular to an audio signal processing method and apparatus. An audio signal processing method includes: determining a first signal vector according to a first reference signal, a second reference signal, and a first audio signal picked up by a microphone, where the first audio signal includes a first echo signal generated by a first loudspeaker playing the first reference signal and a second echo signal generated by a second loudspeaker playing the second reference signal; obtaining a first residual signal vector according to the first signal vector of a current frame and an echo separation vector of a previous frame; updating the echo separation vector of the previous frame according to the first signal vector and the first residual signal vector to obtain the echo separation vector of the current frame; and performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target audio signal. The method improves stereo echo cancellation and thereby the quality of voice communication.

Description

Audio signal processing method and device
Technical Field
The present disclosure relates to the field of voice communication technologies, and in particular, to an audio signal processing method and apparatus.
Background
As voice communication systems evolve toward more immersive audio and video experiences, for example in online gaming and video conferencing, two speakers are often used to produce stereo sound. After the two near-end speakers play the sound transmitted from the far end, the near-end microphone picks up that sound again and transmits it back to the far end, generating an acoustic echo.
In the related art, echo cancellation in stereo systems generally uses two adaptive filters, each estimating the echo path from one speaker to the microphone, to cancel stereo echoes. However, in complex acoustic scenes such as double talk, the echo paths of the stereo system change constantly, so the related-art echo cancellation methods cannot estimate the acoustic transfer functions accurately and quickly, resulting in a poor echo cancellation effect.
Disclosure of Invention
In order to improve the echo cancellation effect of a stereo speech system, the embodiments of the present disclosure provide an audio signal processing method and apparatus.
In a first aspect, the disclosed embodiments provide an audio signal processing method, including:
determining a first signal vector according to the first reference signal, the second reference signal and a first audio signal picked up by a microphone; the first audio signal comprises a first echo signal generated by a first speaker playing the first reference signal and a second echo signal generated by a second speaker playing the second reference signal;
obtaining a first residual signal vector according to the first signal vector of the current frame and the echo separation vector of the previous frame;
updating the echo separation vector of the previous frame according to the first signal vector and the first residual signal vector to obtain the echo separation vector of the current frame;
and performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target audio signal.
In some embodiments, the determining a first signal vector from the first reference signal, the second reference signal, and the first audio signal picked up by the microphone comprises:
respectively transforming the first reference signal, the second reference signal and the first audio signal from a time domain to a frequency domain to obtain a first frequency domain reference signal, a second frequency domain reference signal and a first frequency domain audio signal;
and arranging the vectors of the first frequency domain reference signal, the second frequency domain reference signal and the first frequency domain audio signal according to a preset direction to obtain the first signal vector.
In some embodiments, the obtaining a first residual signal vector according to the first signal vector of the current frame and the echo separation vector of the previous frame includes:
under the condition that the current frame is not an initial frame, performing echo separation on the first signal vector based on the echo separation vector of the previous frame to obtain the first residual signal vector;
and under the condition that the current frame is an initial frame, performing echo separation on the first signal vector based on a preset initial echo separation vector to obtain the first residual signal vector.
In some embodiments, the updating the echo separation vector of the previous frame according to the first signal vector and the first residual signal vector to obtain the echo separation vector of the current frame includes:
determining an auxiliary variable of the current frame according to the first signal vector and the first residual signal vector of the current frame and an auxiliary variable of a previous frame;
and updating the echo separation vector of the previous frame according to the auxiliary variable of the current frame to obtain the echo separation vector of the current frame.
In some embodiments, said determining an auxiliary variable of the current frame from said first signal vector and first residual signal vector of the current frame and an auxiliary variable of the previous frame comprises:
determining an evaluation function according to the first residual signal vector of the current frame;
determining a contrast function according to the evaluation function;
determining a first covariance matrix according to the first signal vector of the current frame;
and determining the auxiliary variable of the current frame according to the auxiliary variable of the previous frame, the first covariance matrix, the contrast function and the smoothing function.
In some embodiments, the performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target audio signal includes:
performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target frequency domain signal;
and converting the target frequency domain signal from a frequency domain to a time domain to obtain the target audio signal.
In a second aspect, the present disclosure provides an audio signal processing apparatus, including:
a determining module configured to determine a first signal vector from the first reference signal, the second reference signal and a first audio signal picked up by the microphone; the first audio signal comprises a first echo signal generated by a first speaker playing the first reference signal and a second echo signal generated by a second speaker playing the second reference signal;
a deriving module configured to derive a first residual signal vector from the first signal vector of a current frame and an echo separation vector of a previous frame;
a vector updating module configured to update the echo separation vector of the previous frame according to the first signal vector and the first residual signal vector to obtain an echo separation vector of the current frame;
and the echo separation module is configured to perform echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target audio signal.
In some embodiments, the determining module is specifically configured to:
respectively transforming the first reference signal, the second reference signal and the first audio signal from a time domain to a frequency domain to obtain a first frequency domain reference signal, a second frequency domain reference signal and a first frequency domain audio signal;
and arranging the vectors of the first frequency domain reference signal, the second frequency domain reference signal and the first frequency domain audio signal according to a preset direction to obtain the first signal vector.
In some embodiments, the obtaining module is specifically configured to:
under the condition that the current frame is not an initial frame, performing echo separation on the first signal vector based on the echo separation vector of the previous frame to obtain the first residual signal vector;
and under the condition that the current frame is an initial frame, performing echo separation on the first signal vector based on a preset initial echo separation vector to obtain the first residual signal vector.
In some embodiments, the vector update module is specifically configured to:
determining an auxiliary variable of the current frame according to the first signal vector and the first residual signal vector of the current frame and an auxiliary variable of a previous frame;
and updating the echo separation vector of the previous frame according to the auxiliary variable of the current frame to obtain the echo separation vector of the current frame.
In some embodiments, the vector update module is specifically configured to:
determining an evaluation function according to the first residual signal vector of the current frame;
determining a contrast function according to the evaluation function;
determining a first covariance matrix according to the first signal vector of the current frame;
and determining the auxiliary variable of the current frame according to the auxiliary variable of the previous frame, the first covariance matrix, the contrast function and the smoothing function.
In some embodiments, the echo separation module is specifically configured to:
performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target frequency domain signal;
and converting the target frequency domain signal from a frequency domain to a time domain to obtain the target audio signal.
In a third aspect, the disclosed embodiments provide an electronic device, including:
a microphone;
a first speaker;
a second speaker;
a processor; and
a memory storing computer instructions for causing the processor to perform the method according to any one of the embodiments of the first aspect.
In a fourth aspect, the embodiments of the present disclosure provide a storage medium storing computer instructions for causing a computer to execute the method according to any one of the embodiments of the first aspect.
The audio signal processing method of the embodiments of the present disclosure determines a first signal vector according to a first reference signal, a second reference signal, and a first audio signal picked up by a microphone; obtains a first residual signal vector according to the first signal vector of the current frame and the echo separation vector of the previous frame; updates the echo separation vector of the previous frame according to the first signal vector and the first residual signal vector to obtain the echo separation vector of the current frame; and performs echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target audio signal. In the embodiments of the present disclosure, stereo echoes are separated, and thus cancelled, based on independent vector analysis. Compared with related-art echo cancellation methods, this improves stereo echo cancellation in complex environments such as double-talk scenes, avoids damaging the near-end speech, and improves the quality of voice communication.
Drawings
To more clearly describe the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed for that description are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present disclosure; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of an audio signal processing method in some embodiments according to the present disclosure.
Fig. 2 is a schematic diagram of an audio signal processing method according to some embodiments of the present disclosure.
Fig. 3 is a flow chart of an audio signal processing method in some embodiments according to the present disclosure.
Fig. 4 is a schematic diagram of an analysis window in an audio signal processing method according to some embodiments of the present disclosure.
Fig. 5 is a flow chart of an audio signal processing method in some embodiments according to the present disclosure.
Fig. 6 is a flow chart of an audio signal processing method in some embodiments according to the present disclosure.
Fig. 7 is a block diagram of an audio signal processing apparatus according to some embodiments of the present disclosure.
FIG. 8 is a block diagram of an electronic device suitable for implementing the method of the present disclosure.
Detailed Description
The technical solutions of the present disclosure are described clearly and completely below with reference to the accompanying drawings. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments derived by those of ordinary skill in the art from the embodiments disclosed herein without creative effort fall within the protection scope of the present disclosure. In addition, the technical features of the different embodiments described below may be combined with one another as long as they do not conflict.
Stereophonic sound is sound with a sense of space produced by multiple loudspeaker channels; it is closer to natural sound and achieves a better audio-visual effect, so it is widely used in scenes such as video conferences and online games. However, the sound emitted by the two speakers of a stereo speech system is picked up again by the near-end microphone and transmitted to the far end, forming acoustic echoes that seriously degrade call quality.
In the related art, one method of canceling stereo echo is to provide two adaptive filters, such as NLMS (Normalized Least Mean Squares) filters, with each filter estimating the echo path from one speaker to the microphone to cancel the echo generated by that speaker, thereby implementing stereo echo cancellation. However, in a complex double-talk scene such as multi-user online voice, the echo paths of the stereo system change constantly, and the adaptive filters cannot track these changes quickly and accurately, resulting in a poor echo cancellation effect.
Another related-art method performs echo cancellation based on Independent Component Analysis (ICA), analyzing each loudspeaker echo signal and separating it from the microphone pickup signal. However, in a complex audio scene, the inherent frequency permutation ambiguity of frequency-domain ICA distorts the near-end speech signal and degrades the voice communication effect.
Based on the above drawbacks of the related art, the embodiments of the present disclosure provide an audio signal processing method, an audio signal processing apparatus, an electronic device, and a storage medium, intended to improve the echo cancellation effect of a stereo audio system.
In a first aspect, the embodiments of the present disclosure provide an audio signal processing method, which may be applied to an electronic device with a voice communication system, such as a mobile phone, a tablet computer, a notebook computer, and the like, and the disclosure is not limited thereto.
As shown in fig. 1, in some embodiments, an audio signal processing method of an example of the present disclosure includes:
and S110, determining a first signal vector according to the first reference signal, the second reference signal and the first audio signal picked up by the microphone.
Specifically, the voice communication system of the embodiment of the present disclosure is a stereo system, and the system includes two speakers constituting stereo, that is, a first speaker and a second speaker. The first loudspeaker and the second loudspeaker can be respectively arranged at different positions of the system, thereby forming stereo sound.
It will be appreciated that when the first and second speakers are playing sound, the microphone may pick up echo signals played by both speakers simultaneously. Since the two loudspeakers are located differently, the echoes of the first loudspeaker and the second loudspeaker that are picked up by the microphone have different echo paths.
The first reference signal and the second reference signal are far-end voice signals received by the system, for example, for a call scene, the reference signal refers to voice signals generated by speaking of a far-end speaker received by the system. The first reference signal is a far-end speech signal played through a first speaker, and the second reference signal is a far-end speech signal played through a second speaker.
When the first speaker plays the first reference signal, the first reference signal is propagated through an echo path between the first speaker and the microphone, and the microphone receives the first echo signal when the first reference signal reaches the microphone. Similarly, when the second reference signal is played by the second speaker, the second reference signal is propagated through an echo path between the second speaker and the microphone, and the microphone receives the second echo signal when the second reference signal reaches the microphone.
Meanwhile, for a double-talk scene, the microphone also collects a near-end voice signal generated when a near-end speaker speaks and a near-end background noise signal. That is, the first audio signal picked up by the microphone includes: a near-end speech signal, a background noise signal, a first echo signal, and a second echo signal.
In an embodiment of the disclosure, a first signal vector is determined from a first reference signal, a second reference signal, and a first audio signal picked up by a microphone.
For example, in some embodiments, the first reference signal, the second reference signal, and the first audio signal may be converted from a time domain form to a frequency domain form, thereby combining the signals in the frequency domain form into a matrix, i.e., a first signal vector. The following embodiments of the present disclosure will be described in detail, and will not be described in detail here.
And S120, obtaining a first residual signal vector according to the first signal vector of the current frame and the echo separation vector of the previous frame.
It should be noted that the first audio signal picked up by the microphone is a continuous signal in the time domain, and during signal processing this continuous signal needs to be divided into a sequence of consecutive frames. The "current frame" described in the embodiments of the present disclosure refers to the frame currently being processed, and the "previous frame" refers to the frame immediately before the current frame.
In the embodiment of the present disclosure, as can be seen from S110, the first signal vector is a vector containing both near-end speech and stereo echo, and the echo separation vector is the vector used to separate the stereo echo from the first signal vector. Processing the first signal vector with an echo separation vector separates the stereo echoes from it while preserving the near-end speech.
In the embodiment of the present disclosure, when the first signal vector of the current frame is processed, it is first subjected to separation based on the echo separation vector of the previous frame. In a scene where the echo path changes significantly, the echo signal of the current frame differs considerably from that of the previous frame, so the stereo echo cannot be accurately and completely separated from the first signal vector using the previous frame's echo separation vector. That is, after the first signal vector is processed with the echo separation vector of the previous frame, the resulting first residual signal vector contains the near-end speech signal and a residual stereo echo signal.
Therefore, in the embodiment of the present disclosure, the echo separation vector of the previous frame needs to be updated based on the first residual signal vector, so as to obtain a relatively accurate echo separation vector of the current frame. The echo separation vector of the current frame can relatively accurately represent the stereo echo path of the current frame, so that the stereo echo can be more accurately separated from the first signal vector by using the echo separation vector of the current frame, and the near-end voice is kept. The following S130 to S140 are specifically described.
The following embodiments of the present disclosure will be described in detail with respect to a process of processing a first signal vector of a current frame based on an echo separation vector of a previous frame to obtain a first residual signal vector, which will not be described in detail herein.
And S130, updating the echo separation vector of the previous frame according to the first signal vector and the first residual signal vector to obtain the echo separation vector of the current frame.
In some embodiments, an auxiliary variable of the current frame may be calculated based on the first signal vector, the first residual signal vector, and an auxiliary variable of a previous frame, and an echo separation vector of the current frame may be obtained according to the auxiliary variable of the current frame. The present disclosure is described in detail below, and will not be described in detail here.
It can be understood that the echo separation vector of the current frame corresponds to the stereo echo signal of the current frame. Compared with the echo separation vector of the previous frame, it better represents the stereo echo path at the current time, so the stereo echo signal can be separated more effectively in complex scenes where the echo path changes.
And S140, performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target audio signal.
After obtaining the echo separation vector of the current frame, as can be seen from the foregoing description, the first signal vector represents a vector of the current frame including the near-end speech and the stereo echo, and the echo separation vector of the current frame represents an echo path of the stereo echo of the current frame.
Therefore, performing echo separation on the first signal vector based on the echo separation vector of the current frame, that is, multiplying the two vectors, yields the target audio signal. The target audio signal represents the near-end speech signal after the stereo echo has been cancelled.
The above describes the processing of one frame of the first audio signal. Over consecutive frames, the processing of each successive "current frame" repeats S110 to S140, so that the signal picked up by the microphone is processed continuously and a clean near-end speech signal with the stereo echo removed is obtained.
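As a sketch of the frame-by-frame flow above (all function and variable names here are illustrative, and the S130 update rule is left as a placeholder, since its formulas are given later in the description):

```python
import numpy as np

def process_frame(Xn, Xf1, Xf2, W_prev, update_W):
    """One frame of the S110-S140 loop, evaluated per frequency bin.

    Xn, Xf1, Xf2 : complex arrays of shape (K,), the frequency-domain
        microphone signal and the two reference signals (S110).
    W_prev : complex array of shape (3, K), the echo separation vector
        of the previous frame, one 3-vector per bin.
    update_W : callable standing in for the S130 auxiliary-variable update.
    """
    X = np.stack([Xn, Xf1, Xf2])            # first signal vector, shape (3, K)
    # S120: residual using the previous frame's separation vector
    E1 = np.einsum('ck,ck->k', W_prev, X)   # per-bin dot product W^T X
    # S130: update the separation vector (method-specific, placeholder)
    W = update_W(W_prev, X, E1)
    # S140: echo separation with the current frame's vector
    target = np.einsum('ck,ck->k', W, X)
    return target, W
```

With a separation vector of [1, 0, 0] in every bin and an identity update, the output passes the microphone channel through unchanged, which is a quick sanity check of the shapes involved.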
As can be seen from the above description, in the embodiment of the present disclosure, the stereo echo is eliminated using a method based on Independent Vector Analysis (IVA): the first audio signal picked up by the microphone and the reference signals played by the two speakers are constructed into a first signal vector, and the stereo echo is then separated based on this independent vector. The echo cancellation problem is thereby converted into a multi-channel speech separation problem, and stereo echo cancellation is achieved.
In addition, compared with the dual-filter echo cancellation methods of the related art, no extra double-talk detection or adaptive step-size control is needed, which fundamentally avoids the poor stereo echo cancellation in double-talk scenes caused by inaccurate detection and control. Compared with the related-art ICA method, the near-end speech distortion caused by possible wrong permutation across frequency bands is avoided, and the echo separation vector is updated faster, thereby improving the voice communication effect.
Fig. 2 shows a schematic diagram of an audio signal processing method in some embodiments of the present disclosure, and the method of the present disclosure is described below with reference to fig. 2.
As shown in fig. 2, the audio system of the disclosed example is a stereo system including two speakers, namely a first speaker 210 and a second speaker 220, and a microphone 100. The first speaker 210 plays the received far-end first reference signal x1(n), so that the microphone 100 picks up the first echo signal y1(n) generated by playing the first reference signal x1(n). The second speaker 220 plays the received far-end second reference signal x2(n), so that the microphone 100 picks up the second echo signal y2(n) generated by playing the second reference signal x2(n).
Meanwhile, in a double-talk scene, the near-end speech signal s(n) generated by the near-end speaker and the near-end background noise signal v(n) are also picked up by the microphone 100. That is, the first audio signal d(n) picked up by the microphone 100 can be represented as:
d(n)=s(n)+v(n)+y1(n)+y2(n)
where s (n) + v (n) denotes a near-end audio signal, which includes a near-end speech signal and a background noise signal. y1(n) + y2(n) represents stereo echo signals, and the method of embodiments of the present disclosure aims to cancel the stereo echo signals y1(n) and y2(n) from the first audio signal d (n).
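A minimal numpy sketch of this signal model may help make it concrete. The sample rate, echo-path lengths, and the random stand-ins for speech, noise, and references are all illustrative assumptions, not values from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16000                               # one second at a 16 kHz rate (illustrative)
s = 0.1 * rng.standard_normal(n)        # near-end speech stand-in s(n)
v = 0.01 * rng.standard_normal(n)       # near-end background noise v(n)
x1 = rng.standard_normal(n)             # far-end reference for speaker 1
x2 = rng.standard_normal(n)             # far-end reference for speaker 2
h1 = 0.05 * rng.standard_normal(64)     # hypothetical echo path, speaker 1 to mic
h2 = 0.05 * rng.standard_normal(64)     # hypothetical echo path, speaker 2 to mic
y1 = np.convolve(x1, h1)[:n]            # first echo signal y1(n)
y2 = np.convolve(x2, h2)[:n]            # second echo signal y2(n)
d = s + v + y1 + y2                     # microphone pickup d(n), per the formula
```

The goal of the method is to recover s(n) + v(n) from d given only x1 and x2, without access to h1 and h2.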
As shown in fig. 3, in some embodiments, an audio signal processing method of an example of the present disclosure includes:
s310, respectively transforming the first reference signal, the second reference signal and the first audio signal from the time domain to the frequency domain to obtain a first frequency domain reference signal, a second frequency domain reference signal and a first frequency domain audio signal.
S320, arranging the vectors of the first frequency domain reference signal, the second frequency domain reference signal and the first frequency domain audio signal according to a preset direction to obtain a first signal vector.
Specifically, the first reference signal, the second reference signal and the first audio signal are time domain signals, and in order to facilitate the signal processing calculation, the time domain signals are first converted into a frequency domain.
In some embodiments, a short-time Fourier transform (STFT) may be employed to convert the time-domain signal to a frequency-domain signal. In one example, the process of the STFT in fig. 2 is represented as:
Xn=fft(d.*win)
Xf1=fft(x1.*win)
Xf2=fft(x2.*win)
where d is the first audio signal picked up by the microphone 100, x1 is the first reference signal, x2 is the second reference signal, and fft(·) denotes the fast Fourier transform applied to each windowed frame.
win is a short analysis window, which is expressed as:
win=[0;sqrt(hanning(N-1))]
hanning(n)=0.5*[1-cos(2π*n/N)]
where N is the analysis frame length and hanning(n) is the Hanning window of length N-1. In one example, the short analysis window may be as shown in FIG. 4.
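A minimal NumPy construction of this analysis window, following the formulas above (the frame length N = 512 is an assumed example value): a zero is prepended so the window spans the full frame length, and the square root is taken so the same window can be reused at synthesis.

```python
import numpy as np

N = 512  # analysis frame length (assumed example value)

# Hanning window of length N-1, per the formula hanning(n) = 0.5*(1 - cos(2*pi*n/N))
n_idx = np.arange(N - 1)
hann = 0.5 * (1.0 - np.cos(2.0 * np.pi * n_idx / N))

# Short analysis window win = [0; sqrt(hanning(N-1))]
win = np.concatenate(([0.0], np.sqrt(hann)))
```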
The first frequency domain reference signal Xf1, the second frequency domain reference signal Xf2 and the first frequency domain audio signal Xn are obtained by the above calculation, and are then combined into matrix form to obtain the first signal vector, which is expressed as:
X(l,k) = [Xn(l,k), Xf1(l,k), Xf2(l,k)]^T
where l denotes the frame index, k denotes the frequency bin, and X(l,k) denotes the first signal vector at frequency bin k of the l-th frame.
With continued reference to fig. 2, after obtaining the first signal vector, the echo cancellation module 300 separates the stereo echo signals in the first signal vector. In the embodiment of the present disclosure, a first residual signal vector is obtained according to a first signal vector of a current frame and an echo separation vector of a previous frame, and is represented as:
E1(l,k) = W^T(l-1,k) × X(l,k)
where X(l,k) denotes the first signal vector of the current frame (the l-th frame), W^T(l-1,k) denotes the transpose of the echo separation vector of the previous frame (the (l-1)-th frame), and E1(l,k) denotes the first residual signal vector of the current frame. That is, in the above formula, the stereo echo in the first signal vector of the current frame is first cancelled by the echo separation vector of the previous frame, yielding the first residual signal vector E1(l,k).
It should be noted that, during the signal processing of the initial frame, since the initial frame has no previous frame signal, an initial echo separation vector may be preset; in the case that the current frame is the initial frame, the first residual signal vector is obtained from the first signal vector of the initial frame based on this initial echo separation vector.
When the current frame is not the initial frame, since an echo separation vector is calculated for every frame, the first residual signal vector of the current frame is obtained from its first signal vector based on the echo separation vector of the previous frame. The echo separation vector of the current frame, obtained by the subsequent calculation, then serves as the previous-frame echo separation vector for the next frame, and the processing iterates in this loop.
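The per-frequency stacking and the previous-frame separation step can be sketched as follows. The frequency-domain frames are random placeholders; with the preset initial vector W0 = [1, 0, 0]^T, the residual of the initial frame is just the microphone channel Xn, as expected:

```python
import numpy as np

K = 257  # number of frequency bins (assumed example value)

rng = np.random.default_rng(1)
# Frequency-domain frames of the mic channel and the two references (placeholders)
Xn  = rng.standard_normal(K) + 1j * rng.standard_normal(K)
Xf1 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
Xf2 = rng.standard_normal(K) + 1j * rng.standard_normal(K)

# First signal vector X(l, k): one 3-vector per frequency bin, shape (3, K)
X = np.stack([Xn, Xf1, Xf2])

# Preset initial echo separation vector: pass the mic channel through unchanged
W0 = np.zeros((3, K), dtype=complex)
W0[0, :] = 1.0

# E1(l, k) = W^T(l-1, k) x X(l, k), evaluated bin by bin
E1 = np.einsum('ck,ck->k', W0, X)
```

For the initial frame this yields E1 equal to Xn, i.e. no echo is removed until the separation vector has been adapted.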
As shown in fig. 5, in some embodiments, an audio signal processing method of an example of the present disclosure includes:
S510, determining an auxiliary variable of the current frame according to the first signal vector and the first residual signal vector of the current frame and the auxiliary variable of the previous frame.
S520, updating the echo separation vector of the previous frame according to the auxiliary variable of the current frame to obtain the echo separation vector of the current frame.
Specifically, as can be seen from the foregoing, in the embodiment of the present disclosure, it is necessary to obtain an echo separation vector of the current frame based on the first residual signal vector. In the embodiment of the present disclosure, an auxiliary variable function is introduced to calculate and obtain an echo separation vector of the current frame.
Firstly, an evaluation function r is calculated from the first residual signal E1(l,k) of the current frame, expressed as:
r(l) = sqrt( Σk |E1(l,k)|² )
The evaluation function r represents an analytical evaluation of the first residual signal E1(l,k) across all frequency points. A contrast function φ is then determined from the evaluation function, expressed as:
φ(r) = G'(r)/r
where G(·) is a contrast function; for example, with G(r) = r the contrast function reduces to φ(r) = 1/r.
A first covariance matrix Xf(l,k) is then determined from the first signal vector X(l,k) of the current frame, expressed as:
Xf(l,k) = X(l,k) × X^H(l,k)
where (·)^H denotes the Hermitian conjugate transpose. Then, the auxiliary variable of the previous frame is updated based on the contrast function and the covariance matrix to obtain the auxiliary variable of the current frame, expressed as:
V(l,k) = α·V(l-1,k) + (1-α)·φ(r(l))·Xf(l,k)
where V(l,k) denotes the auxiliary variable of the current frame (the l-th frame), V(l-1,k) denotes the auxiliary variable of the previous frame (the (l-1)-th frame), α denotes a smoothing factor, and φ(r(l)) denotes the contrast function evaluated at r(l).
After determining the auxiliary variable V (l, k) of the current frame, an echo separation vector of the current frame is obtained according to the auxiliary variable V (l, k) of the current frame, and is expressed as:
W(l,k) = V(l,k)^(-1) × i
where W(l,k) denotes the echo separation vector of the current frame (the l-th frame), and i is the unit vector selecting the microphone channel, i = [1, 0, 0]^T.
Through the above process, the echo separation vector W (l, k) of the current frame is calculated, so that the stereo echo in the first signal vector of the current frame can be separated based on the echo separation vector of the current frame.
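One pass of the auxiliary-function update described above can be sketched as follows. The weight φ(r) = 1/r used here corresponds to the contrast function choice G(r) = r, which is an assumption for illustration rather than something the text fixes; the smoothing factor alpha is likewise an assumed value.

```python
import numpy as np

def update_separation_vector(X, W_prev, V_prev, alpha=0.98, eps=1e-8):
    """One auxiliary-function update; X and W_prev are (3, K), V_prev is (K, 3, 3)."""
    K = X.shape[1]
    # First residual signal E1(l,k) using the previous frame's separation vector
    E1 = np.einsum('ck,ck->k', W_prev, X)
    # Evaluation function r(l): norm of the residual over all frequency bins
    r = np.sqrt(np.sum(np.abs(E1) ** 2))
    # Contrast-derived weight; phi(r) = 1/r assumes G(r) = r
    phi = 1.0 / max(r, eps)
    # First covariance matrix Xf(l,k) = X(l,k) X^H(l,k) per frequency bin
    Xf = np.einsum('ck,dk->kcd', X, X.conj())
    # Smoothed auxiliary variable of the current frame
    V = alpha * V_prev + (1.0 - alpha) * phi * Xf
    # Echo separation vector W(l,k) = V(l,k)^(-1) i, i selecting the mic channel
    i_vec = np.array([1.0, 0.0, 0.0], dtype=complex)
    W = np.stack([np.linalg.solve(V[k], i_vec) for k in range(K)], axis=1)
    return W, V
```

In use, V_prev would be initialized to the identity matrix at each bin for the initial frame and then carried forward frame by frame, matching the loop iteration described above.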
As shown in fig. 6, in some embodiments, an audio signal processing method of an example of the present disclosure includes:
S610, performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target frequency domain signal.
S620, converting the target frequency domain signal from the frequency domain to the time domain to obtain a target audio signal.
Specifically, the process of performing echo separation on the first signal vector of the current frame is represented as:
E2(l,k) = W^T(l,k) × X(l,k)
where W^T(l,k) denotes the transpose of the echo separation vector of the current frame, X(l,k) denotes the first signal vector of the current frame, and E2(l,k) denotes the target frequency domain signal. Echo separation is performed on the first signal vector of the current frame based on the echo separation vector of the current frame to obtain the audio signal with the stereo echo removed.
It should be noted that, as shown in fig. 2, after performing echo separation and cancellation on the first signal vector of the current frame, the echo cancellation module 300 obtains a target frequency domain signal in a frequency domain form, so that the target frequency domain signal can be converted into a time domain by inverse short-time fourier transform (ISTFT), and a target audio signal e in a time domain form is obtained, which can be represented as:
e=ifft(E2(l)).*win
where e is the target audio signal and ifft(·) is the inverse fast Fourier transform applied to each frame. The target audio signal e is a clean near-end audio signal after the stereo echo is removed, and mainly includes near-end speech and background noise.
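The synthesis step can be sketched as follows for a single frame. The pass-through separation vector is used here purely for illustration, so E2 equals Xn; with 50% frame overlap, the square-root Hann window applied at both analysis and synthesis satisfies the constant-overlap-add condition, so successive windowed frames sum back to the original signal.

```python
import numpy as np

N = 512  # frame length (assumed example value)
n_idx = np.arange(N - 1)
win = np.concatenate(([0.0], np.sqrt(0.5 * (1.0 - np.cos(2.0 * np.pi * n_idx / N)))))

rng = np.random.default_rng(3)
frame = rng.standard_normal(N)  # one time-domain frame (placeholder data)

# Analysis: window the frame and transform to the frequency domain
Xn = np.fft.fft(frame * win)

# Suppose echo separation has produced E2(l); with the pass-through
# separation vector used here for illustration, E2 is simply Xn
E2 = Xn

# Synthesis: inverse FFT, then apply the same window before overlap-add
e = np.real(np.fft.ifft(E2)) * win
```

After overlap-add across frames, the doubly windowed segments reconstruct the time-domain target signal.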
Therefore, in the embodiment of the disclosure, the echo separation vector of the stereo echo is estimated by using an independent vector analysis technology, and an auxiliary variable is introduced to accelerate the update of the echo separation vector, so as to achieve the elimination of the stereo echo.
In a second aspect, the embodiments of the present disclosure provide an audio signal processing apparatus, which may be applied to an electronic device with a voice communication system, such as a mobile phone, a tablet computer, a notebook computer, and the like, and the disclosure is not limited thereto.
As shown in fig. 7, in some embodiments, an audio signal processing apparatus of an example of the present disclosure includes:
a determining module 701 configured to determine a first signal vector from the first reference signal, the second reference signal and the first audio signal picked up by the microphone; the first audio signal comprises a first echo signal generated by a first loudspeaker playing a first reference signal and a second echo signal generated by a second loudspeaker playing a second reference signal;
a deriving module 702 configured to derive a first residual signal vector according to a first signal vector of a current frame and an echo separation vector of a previous frame;
a vector updating module 703 configured to update the echo separation vector of the previous frame according to the first signal vector and the first residual signal vector to obtain the echo separation vector of the current frame;
and an echo separation module 704 configured to perform echo separation on the first signal vector based on the echo separation vector of the current frame, so as to obtain a target audio signal.
Therefore, in the embodiment of the disclosure, the echo separation vector of the stereo echo is estimated by using an independent vector analysis technology, and an auxiliary variable is introduced to accelerate the update of the echo separation vector, so as to achieve the elimination of the stereo echo.
In some embodiments, the determining module 701 is specifically configured to:
respectively transforming the first reference signal, the second reference signal and the first audio signal from a time domain to a frequency domain to obtain a first frequency domain reference signal, a second frequency domain reference signal and a first frequency domain audio signal;
and arranging the vectors of the first frequency domain reference signal, the second frequency domain reference signal and the first frequency domain audio signal according to a preset direction to obtain a first signal vector.
In some embodiments, the obtaining module 702 is specifically configured to:
under the condition that the current frame is not the initial frame, performing echo separation on the first signal vector based on the echo separation vector of the previous frame to obtain a first residual signal vector;
and under the condition that the current frame is the initial frame, performing echo separation on the first signal vector based on a preset initial echo separation vector to obtain a first residual signal vector.
In some embodiments, the vector update module 703 is specifically configured to:
determining an auxiliary variable of the current frame according to the first signal vector and the first residual signal vector of the current frame and an auxiliary variable of a previous frame;
and updating the echo separation vector of the previous frame according to the auxiliary variable of the current frame to obtain the echo separation vector of the current frame.
In some embodiments, the vector update module 703 is specifically configured to:
determining an evaluation function according to a first residual signal vector of the current frame;
determining a contrast function according to the evaluation function;
determining a first covariance matrix according to a first signal vector of a current frame;
and determining the echo separation vector of the current frame according to the auxiliary variable of the previous frame, the first covariance matrix, the contrast function and the smoothing function.
In some embodiments, the echo separation module 704 is specifically configured to:
performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target frequency domain signal;
and converting the target frequency domain signal from a frequency domain to a time domain to obtain a target audio signal.
In a third aspect, the disclosed embodiments provide an electronic device, including:
a microphone;
a first speaker;
a second speaker;
a processor; and
a memory storing computer instructions for causing the processor to perform the method according to any of the embodiments of the first aspect.
The electronic device according to the embodiment of the present disclosure may be described with reference to any one of the foregoing embodiments, and the present disclosure is not repeated herein.
In a fourth aspect, the disclosed embodiments provide a storage medium storing computer instructions for causing a computer to perform the method according to any one of the embodiments of the first aspect.
Fig. 8 is a block diagram of an electronic device according to some embodiments of the present disclosure, and the following describes principles related to the electronic device and a storage medium according to some embodiments of the present disclosure with reference to fig. 8.
Referring to fig. 8, the electronic device 1800 may include one or more of the following components: processing component 1802, memory 1804, power component 1806, multimedia component 1808, audio component 1810, input/output (I/O) interface 1812, sensor component 1816, and communications component 1818.
The processing component 1802 generally controls the overall operation of the electronic device 1800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1802 may include one or more processors 1820 to execute instructions. Further, the processing component 1802 may include one or more modules that facilitate interaction between the processing component 1802 and other components. For example, the processing component 1802 can include a multimedia module to facilitate interaction between the multimedia component 1808 and the processing component 1802. As another example, the processing component 1802 can read executable instructions from a memory to implement electronic device related functions.
The memory 1804 is configured to store various types of data to support operation at the electronic device 1800. Examples of such data include instructions for any application or method operating on the electronic device 1800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 1806 provides power to various components of the electronic device 1800. The power components 1806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 1800.
The multimedia component 1808 includes a display screen that provides an output interface between the electronic device 1800 and a user. In some embodiments, the multimedia component 1808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera can receive external multimedia data when the electronic device 1800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
Audio component 1810 is configured to output and/or input audio signals. For example, the audio component 1810 can include a Microphone (MIC) that can be configured to receive external audio signals when the electronic device 1800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1804 or transmitted via the communication component 1818. In some embodiments, audio component 1810 also includes a speaker for outputting audio signals.
I/O interface 1812 provides an interface between processing component 1802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 1816 includes one or more sensors to provide status assessments of various aspects of the electronic device 1800. For example, the sensor component 1816 can detect an open/closed state of the electronic device 1800 and the relative positioning of components, such as the display and keypad of the electronic device 1800. The sensor component 1816 can also detect a change in position of the electronic device 1800 or one of its components, the presence or absence of user contact with the electronic device 1800, the orientation or acceleration/deceleration of the electronic device 1800, and a change in its temperature. The sensor component 1816 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact, and a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 1816 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1818 is configured to facilitate communications between the electronic device 1800 and other devices in a wired or wireless manner. The electronic device 1800 may access a wireless network based on a communication standard, such as Wi-Fi, 2G, 3G, 4G, 5G, or 6G, or a combination thereof. In an exemplary embodiment, the communication component 1818 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1818 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components.
It should be understood that the above embodiments are only examples for clearly illustrating the present disclosure, and are not intended to limit it. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments here. Obvious variations or modifications may be made without departing from the scope of the present disclosure.

Claims (14)

1. An audio signal processing method, comprising:
determining a first signal vector according to the first reference signal, the second reference signal and a first audio signal picked up by a microphone; the first audio signal comprises a first echo signal generated by a first speaker playing the first reference signal and a second echo signal generated by a second speaker playing the second reference signal;
obtaining a first residual signal vector according to the first signal vector of the current frame and the echo separation vector of the previous frame;
updating the echo separation vector of the previous frame according to the first signal vector and the first residual signal vector to obtain the echo separation vector of the current frame;
and performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target audio signal.
2. The method of claim 1, wherein determining the first signal vector from the first reference signal, the second reference signal, and the first audio signal picked up by the microphone comprises:
respectively transforming the first reference signal, the second reference signal and the first audio signal from a time domain to a frequency domain to obtain a first frequency domain reference signal, a second frequency domain reference signal and a first frequency domain audio signal;
and arranging the vectors of the first frequency domain reference signal, the second frequency domain reference signal and the first frequency domain audio signal according to a preset direction to obtain the first signal vector.
3. The method of claim 1, wherein obtaining a first residual signal vector according to the first signal vector of the current frame and the echo separation vector of the previous frame comprises:
performing echo separation on the first signal vector based on the echo separation vector of the previous frame to obtain the first residual signal vector under the condition that the current frame is not the initial frame;
and under the condition that the current frame is an initial frame, performing echo separation on the first signal vector based on a preset initial echo separation vector to obtain the first residual signal vector.
4. The method of claim 1, wherein the updating the echo separation vector of the previous frame according to the first signal vector and the first residual signal vector to obtain the echo separation vector of the current frame comprises:
determining an auxiliary variable of the current frame according to the first signal vector and the first residual signal vector of the current frame and an auxiliary variable of a previous frame;
and updating the echo separation vector of the previous frame according to the auxiliary variable of the current frame to obtain the echo separation vector of the current frame.
5. The method of claim 4, wherein determining the auxiliary variable of the current frame according to the first signal vector and the first residual signal vector of the current frame and the auxiliary variable of the previous frame comprises:
determining an evaluation function according to the first residual signal vector of the current frame;
determining a contrast function according to the evaluation function;
determining a first covariance matrix according to the first signal vector of the current frame;
and determining the echo separation vector of the current frame according to the auxiliary variable of the previous frame, the first covariance matrix, the contrast function and the smoothing function.
6. The method of claim 2, wherein the performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target audio signal comprises:
performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target frequency domain signal;
and converting the target frequency domain signal from a frequency domain to a time domain to obtain the target audio signal.
7. An audio signal processing apparatus, comprising:
a determining module configured to determine a first signal vector from the first reference signal, the second reference signal and a first audio signal picked up by the microphone; the first audio signal comprises a first echo signal generated by a first speaker playing the first reference signal and a second echo signal generated by a second speaker playing the second reference signal;
a deriving module configured to derive a first residual signal vector from the first signal vector of a current frame and an echo separation vector of a previous frame;
a vector updating module configured to update the echo separation vector of the previous frame according to the first signal vector and the first residual signal vector to obtain an echo separation vector of the current frame;
and the echo separation module is configured to perform echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target audio signal.
8. The apparatus of claim 7, wherein the determination module is specifically configured to:
respectively transforming the first reference signal, the second reference signal and the first audio signal from a time domain to a frequency domain to obtain a first frequency domain reference signal, a second frequency domain reference signal and a first frequency domain audio signal;
and arranging the vectors of the first frequency domain reference signal, the second frequency domain reference signal and the first frequency domain audio signal according to a preset direction to obtain the first signal vector.
9. The apparatus of claim 7, wherein the obtaining module is specifically configured to:
performing echo separation on the first signal vector based on the echo separation vector of the previous frame to obtain the first residual signal vector under the condition that the current frame is not the initial frame;
and under the condition that the current frame is an initial frame, performing echo separation on the first signal vector based on a preset initial echo separation vector to obtain the first residual signal vector.
10. The apparatus of claim 7, wherein the vector update module is specifically configured to:
determining an auxiliary variable of the current frame according to the first signal vector and the first residual signal vector of the current frame and an auxiliary variable of a previous frame;
and updating the echo separation vector of the previous frame according to the auxiliary variable of the current frame to obtain the echo separation vector of the current frame.
11. The apparatus of claim 10, wherein the vector update module is specifically configured to:
determining an evaluation function according to the first residual signal vector of the current frame;
determining a contrast function according to the evaluation function;
determining a first covariance matrix according to the first signal vector of the current frame;
and determining the echo separation vector of the current frame according to the auxiliary variable of the previous frame, the first covariance matrix, the contrast function and the smoothing function.
12. The apparatus of claim 8, wherein the echo separation module is specifically configured to:
performing echo separation on the first signal vector based on the echo separation vector of the current frame to obtain a target frequency domain signal;
and converting the target frequency domain signal from a frequency domain to a time domain to obtain the target audio signal.
13. An electronic device, comprising:
a microphone;
a first speaker;
a second speaker;
a processor; and
memory storing computer instructions for causing a processor to perform the method according to any one of claims 1 to 6.
14. A storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 6.
CN202110739135.XA 2021-06-30 2021-06-30 Audio signal processing method and device Active CN113470675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110739135.XA CN113470675B (en) 2021-06-30 2021-06-30 Audio signal processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110739135.XA CN113470675B (en) 2021-06-30 2021-06-30 Audio signal processing method and device

Publications (2)

Publication Number Publication Date
CN113470675A true CN113470675A (en) 2021-10-01
CN113470675B CN113470675B (en) 2024-06-25

Family

ID=77876658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110739135.XA Active CN113470675B (en) 2021-06-30 2021-06-30 Audio signal processing method and device

Country Status (1)

Country Link
CN (1) CN113470675B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278455A (en) * 2022-06-27 2022-11-01 深圳市中深澳信息技术有限公司 Audio processing method and device, conference microphone and computer readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160057534A1 (en) * 2014-08-20 2016-02-25 Yuan Ze University Acoustic echo cancellation method and system using the same
CN106887238A (en) * 2017-03-01 2017-06-23 中国科学院上海微系统与信息技术研究所 A kind of acoustical signal blind separating method based on improvement Independent Vector Analysis algorithm
KR101802444B1 (en) * 2016-07-15 2017-11-29 서강대학교산학협력단 Robust speech recognition apparatus and method for Bayesian feature enhancement using independent vector analysis and reverberation parameter reestimation
CN107483761A (en) * 2016-06-07 2017-12-15 电信科学技术研究院 A kind of echo suppressing method and device
US20190272842A1 (en) * 2018-03-01 2019-09-05 Apple Inc. Speech enhancement for an electronic device
CN111128221A (en) * 2019-12-17 2020-05-08 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111161751A (en) * 2019-12-25 2020-05-15 声耕智能科技(西安)研究院有限公司 Distributed microphone pickup system and method under complex scene
WO2020097828A1 (en) * 2018-11-14 2020-05-22 深圳市欢太科技有限公司 Echo cancellation method, delay estimation method, echo cancellation apparatus, delay estimation apparatus, storage medium, and device
CN111418010A (en) * 2017-12-08 2020-07-14 华为技术有限公司 Multi-microphone noise reduction method and device and terminal equipment
CN111524498A (en) * 2020-04-10 2020-08-11 维沃移动通信有限公司 Filtering method and device and electronic equipment



Also Published As

Publication number Publication date
CN113470675B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
US8842851B2 (en) Audio source localization system and method
CN105513596B (en) Voice control method and control equipment
CN113362843B (en) Audio signal processing method and device
US10978086B2 (en) Echo cancellation using a subset of multiple microphones as reference channels
CN115482830B (en) Speech enhancement method and related equipment
US20240144948A1 (en) Sound signal processing method and electronic device
CN112217948B (en) Echo processing method, device, equipment and storage medium for voice call
CN106791245B (en) Method and device for determining filter coefficients
CN112447184A (en) Voice signal processing method and device, electronic equipment and storage medium
CN113421579B (en) Sound processing method, device, electronic equipment and storage medium
CN113470675B (en) Audio signal processing method and device
Tashev Recent advances in human-machine interfaces for gaming and entertainment
US11388281B2 (en) Adaptive method and apparatus for intelligent terminal, and terminal
CN113488066B (en) Audio signal processing method, audio signal processing device and storage medium
CN113489855A (en) Sound processing method, sound processing device, electronic equipment and storage medium
CN113077808A (en) Voice processing method and device for voice processing
CN111292760B (en) Sounding state detection method and user equipment
CN113362842B (en) Audio signal processing method and device
CN113489854B (en) Sound processing method, device, electronic equipment and storage medium
CN113470676B (en) Sound processing method, device, electronic equipment and storage medium
CN113488067B (en) Echo cancellation method, device, electronic equipment and storage medium
CN118522299A (en) Echo cancellation method, device and storage medium
CN111294473B (en) Signal processing method and device
CN113194387A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN114648996A (en) Audio data processing method and device, voice interaction method, equipment and chip, sound box, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant