
CN113810828A - Audio signal processing method and device, readable storage medium and earphone - Google Patents


Info

Publication number
CN113810828A
CN113810828A (application number CN202111093716.7A)
Authority
CN
China
Prior art keywords
signal
audio frame
audio
sound signal
environment sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111093716.7A
Other languages
Chinese (zh)
Inventor
周岭松
王昭
相非
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202111093716.7A
Publication of CN113810828A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00: Circuits for transducers, loudspeakers or microphones
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00: Details of transducers, loudspeakers or microphones
    • H04R 1/10: Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R 1/1083: Reduction of ambient noise

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The disclosure relates to an audio signal processing method, an audio signal processing device, a readable storage medium, and an earphone. The method comprises the following steps: acquiring an ambient sound signal; filtering the ambient sound signal with a preset transparent filter to obtain a first audio signal; extracting the human voice signal from the ambient sound signal to obtain a second audio signal; and sending the first audio signal and the second audio signal to a loudspeaker and controlling the loudspeaker to play them synchronously. In this way, by extracting the human voice signal from the ambient sound signal and superimposing it, through synchronous playback, on the filtered ambient sound signal, the voice component of the passed-through ambient sound can be enhanced without damaging the speech, providing clear voice perception in transparent mode and a good ambient pass-through experience for the user. In addition, because only the human voice signal is superimposed, the noise is not enhanced, so the voice heard by the user is clearer.

Description

Audio signal processing method and device, readable storage medium and earphone
Technical Field
The present disclosure relates to the field of audio processing, and in particular, to an audio signal processing method and apparatus, a readable storage medium, and an earphone.
Background
To adapt to different scenarios, many existing earphones provide a noise reduction mode and a transparent mode: the noise reduction mode blocks external sound signals, while the transparent mode lets external sound signals into the ear. When a user wearing the earphone wants to talk with someone, the user can switch to the transparent mode without taking off the earphone and converse as clearly as if the earphone were removed. However, environments are usually noisy, and during a conversation people want to hear more voice and less noise, so a voice enhancement function is added to make the human voice clearer.
Because the voice frequency band spans roughly 300 Hz to 3400 Hz, current human voice enhancement methods usually process different frequency bands separately: low-frequency noise below 300 Hz is cancelled by applying inverse sound waves; the 300 Hz to 3400 Hz voice band is filtered with a transparent filter and then amplified with a band-pass filter; finally, the band-pass-amplified sound is superimposed on the inverse sound waves and played by the loudspeaker. However, in real environments noise is distributed across the full frequency band, including 300 Hz to 3400 Hz, so amplifying the voice also amplifies the in-band noise, and the user merely hears noise and voice enhanced together. Furthermore, speech energy can also exist below 300 Hz, where the inverse-wave cancellation may damage the voice.
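The band-splitting approach described above, and its limitation, can be illustrated with a short numpy/scipy sketch. The 300 Hz to 3400 Hz band edges come from the text, but the filter order and gain value are illustrative assumptions: boosting the voice band boosts everything inside it, noise included.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandpass_boost(x, fs, low=300.0, high=3400.0, gain_db=6.0):
    """Amplify the nominal voice band, as in the prior-art approach.

    Everything inside 300-3400 Hz is boosted, including in-band noise --
    the limitation the background points out.
    """
    sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
    band = sosfilt(sos, x)
    return x + (10.0 ** (gain_db / 20.0) - 1.0) * band

fs = 16000
t = np.arange(fs) / fs
speech_like = np.sin(2 * np.pi * 1000 * t)   # tone inside the voice band
rng = np.random.default_rng(0)
noise = rng.standard_normal(fs) * 0.1        # broadband noise, also in-band
out = bandpass_boost(speech_like + noise, fs)
```

Both the tone and the in-band portion of the noise gain the same 6 dB, which is why the patent instead extracts the voice before superimposing it.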
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides an audio signal processing method, apparatus, readable storage medium, and earphone.
According to a first aspect of the embodiments of the present disclosure, there is provided an audio signal processing method applied to a headphone, including:
acquiring an environment sound signal, wherein the environment sound signal is a sound signal in the surrounding environment of the earphone;
filtering the environment sound signal according to a preset transparent filter to obtain a first audio signal;
extracting a human voice signal in the environment sound signal to obtain a second audio signal;
and sending the first audio signal and the second audio signal to a loudspeaker, and controlling the loudspeaker to synchronously play the first audio signal and the second audio signal.
Optionally, the extracting the human voice signal in the environment sound signal includes:
and extracting a human voice signal in the environment sound signal through wiener filtering.
Optionally, the extracting, by wiener filtering, a human voice signal in the environment sound signal includes:
transforming the environment sound signal from a time domain to a frequency domain through Fourier transform to obtain a frequency domain signal corresponding to the environment sound signal;
for each audio frame in the frequency domain signal, determining a wiener filter coefficient corresponding to the audio frame;
filtering the audio frame by using a wiener filtering coefficient corresponding to the audio frame to obtain a frequency domain human voice signal in the audio frame;
and carrying out inverse Fourier transform on the frequency domain human voice signal to obtain the human voice signal in the time domain signal corresponding to the audio frame.
Optionally, the determining the wiener filter coefficient corresponding to the audio frame includes:
determining a power spectrum corresponding to the audio frame, and performing noise estimation on the audio frame to obtain a power spectrum corresponding to a noise signal in the audio frame;
and determining a wiener filter coefficient corresponding to the audio frame according to the power spectrum corresponding to the audio frame, the power spectrum corresponding to the noise signal in the audio frame and the power spectrum corresponding to the noise signal in the previous audio frame of the audio frame.
Optionally, the determining, according to the power spectrum corresponding to the audio frame, the power spectrum corresponding to the noise signal in the audio frame, and the power spectrum corresponding to the noise signal in the previous audio frame of the audio frame, the wiener filter coefficient corresponding to the audio frame includes:
determining a posterior signal-to-noise ratio corresponding to the audio frame according to the power spectrum corresponding to the audio frame and the power spectrum corresponding to the noise signal in the audio frame;
determining a priori signal-to-noise ratio estimation value corresponding to the audio frame according to the posterior signal-to-noise ratio and a power spectrum corresponding to a noise signal in a previous audio frame of the audio frame;
and generating a wiener filter coefficient corresponding to the audio frame according to the prior signal-to-noise ratio estimation value.
Optionally, the determining, according to the posterior signal-to-noise ratio and a power spectrum corresponding to a noise signal in a previous audio frame of the audio frame, a priori signal-to-noise ratio estimation value corresponding to the audio frame includes:
determining a priori signal-to-noise ratio estimation value corresponding to the audio frame according to the posteriori signal-to-noise ratio and a power spectrum corresponding to a noise signal in a previous audio frame of the audio frame by the following formula:

ξ̂(n) = α · P̂ss(n-1) / Pdd(n-1) + (1 - α) · max(γ(n) - 1, 0)

wherein ξ̂(n) is the a priori signal-to-noise ratio estimate corresponding to the nth audio frame in the frequency domain signal corresponding to the ambient sound signal, n > 1; Pdd(n-1) is the power spectrum corresponding to the noise signal in the (n-1)th audio frame in the frequency domain signal corresponding to the ambient sound signal; α is a weight coefficient; P̂ss(n-1) is the power spectrum corresponding to the frequency domain human voice signal in the (n-1)th audio frame in the frequency domain signal corresponding to the ambient sound signal; and γ(n) is the a posteriori signal-to-noise ratio corresponding to the nth audio frame in the frequency domain signal corresponding to the ambient sound signal.
Optionally, the extracting the human voice signal in the environment sound signal to obtain a second audio signal includes:
and inputting the environment sound signal into a pre-trained human voice extraction model to obtain a second audio signal.
Optionally, the preset transparent filter is a plurality of cascaded second-order IIR filters.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio signal processing apparatus applied to a headphone, including:
an obtaining module configured to obtain an ambient sound signal, wherein the ambient sound signal is a sound signal in the environment surrounding the headset;
the filtering module is configured to filter the environment sound signal acquired by the acquiring module according to a preset transparent filter to obtain a first audio signal;
the extraction module is configured to extract a human voice signal in the environment sound signal acquired by the acquisition module to obtain a second audio signal;
the playing control module is configured to send the first audio signal and the second audio signal to a loudspeaker, and control the loudspeaker to synchronously play the first audio signal and the second audio signal.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the audio signal processing method provided by the first aspect of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a headset comprising:
a controller and a feedforward microphone and speaker communicatively coupled to the controller;
the feedforward microphone is used for collecting an environment sound signal and sending the environment sound signal to the controller;
the loudspeaker is used for playing an audio signal according to a control instruction of the controller;
the controller comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor; the processor is configured to execute the audio signal processing method provided by the first aspect of the present disclosure when the computer program is executed.
The technical scheme provided by the embodiments of the disclosure can have the following beneficial effects: the ambient sound signal is filtered with a preset transparent filter to obtain a first audio signal, and the human voice signal is extracted from the ambient sound signal to obtain a second audio signal; then, the first audio signal and the second audio signal are sent to a loudspeaker, and the loudspeaker is controlled to play the two signals synchronously. In this way, by extracting the human voice signal from the ambient sound signal and superimposing it, through synchronous playback, on the filtered ambient sound signal, the voice component of the passed-through ambient sound can be enhanced without damaging the speech, providing clear voice perception in transparent mode and a good ambient pass-through experience for the user. The user can thus clearly hear the other party in transparent mode without taking off the earphone, and can also notice important ambient sounds such as sirens and approaching vehicles, which helps keep the user safe. In addition, because only the human voice signal is superimposed, the noise is not enhanced, so the voice heard by the user is clearer.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart illustrating an audio signal processing method according to an exemplary embodiment.
Fig. 2 is a diagram illustrating a frequency response curve a when an ear is empty and a frequency response curve B after passive noise reduction when a headphone is worn according to an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating a target curve C in accordance with an exemplary embodiment.
Fig. 4 is a diagram illustrating a target curve C and a frequency response curve D according to an exemplary embodiment.
Fig. 5 is a flowchart illustrating a method of extracting a human voice signal in an ambient sound signal through wiener filtering according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating an audio signal processing apparatus according to an exemplary embodiment.
Fig. 7 is a block diagram illustrating an audio signal processing apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flow chart illustrating an audio signal processing method according to an exemplary embodiment, wherein the method may be applied to a headset. As shown in fig. 1, the method includes the following S101 to S104.
In S101, an ambient sound signal is acquired.
In the present disclosure, the ambient sound signal is a sound signal in the environment surrounding the headset. The earphone comprises a noise reduction mode and a transparent mode, wherein the noise reduction mode is used for blocking external sound signals, and the transparent mode is used for enabling the external sound signals to enter human ears. The audio signal processing method can be applied to a scene that the earphone is in a transparent mode.
Illustratively, the ambient sound signal may be collected by a feed-forward microphone on the headset.
In S102, the ambient sound signal is filtered according to a preset pass-through filter, so as to obtain a first audio signal.
In the present disclosure, when a user wears the earphone, the passive sound insulation of the earphone creates a sound pressure level difference between the ambient sound outside the ear and inside it. When the earphone is in the transparent mode, the ambient sound signal entering the earphone is filtered by the preset transparent filter to compensate for this sound pressure level difference, so that the user's listening experience is close to that of not wearing the earphone; that is, once the transparent mode is enabled, the ear's response to the outside world approximates an open-ear response.
For example, the preset transparent filter may be a plurality of (e.g., six) cascaded second-order IIR filters, which filter the ambient sound signal according to the frequency response curve D to compensate for the sound pressure level difference caused by the passive sound insulation of the earphone.
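As a sketch of what such a cascade looks like in practice, the snippet below builds six cascaded second-order (biquad) IIR sections and applies them as one filter. The RBJ peaking-EQ design and the specific center frequencies, gains, and Q values are illustrative assumptions, not the patent's actual curve D:

```python
import numpy as np
from scipy.signal import sosfilt

def peaking_biquad(fs, f0, gain_db, q):
    """One RBJ peaking-EQ biquad, returned as a normalized SOS row."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1 + alpha * a_lin, -2 * np.cos(w0), 1 - alpha * a_lin])
    a = np.array([1 + alpha / a_lin, -2 * np.cos(w0), 1 - alpha / a_lin])
    return np.concatenate([b, a]) / a[0]  # normalize so a0 == 1

fs = 16000
# Six cascaded sections (illustrative frequencies / gains / Q values).
sections = [(200, 3.0, 1.0), (500, -2.0, 1.2), (1000, 4.0, 1.5),
            (2000, 2.5, 1.5), (4000, -1.5, 2.0), (6000, 1.0, 2.0)]
sos = np.vstack([peaking_biquad(fs, f, g, q) for f, g, q in sections])

x = np.random.default_rng(1).standard_normal(fs)  # one second of input
y = sosfilt(sos, x)                               # run the whole cascade
```

Cascading second-order sections instead of one high-order filter keeps each stage numerically well conditioned, which matters on the fixed-point DSPs typically found in earphones.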
In S103, the human voice signal in the environment sound signal is extracted to obtain a second audio signal.
In S104, the first audio signal and the second audio signal are sent to a speaker, and the speaker is controlled to play the first audio signal and the second audio signal synchronously.
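Conceptually, synchronous playback of the two signals amounts to sample-aligned superposition before the loudspeaker. A minimal sketch follows; the voice_gain parameter is an illustrative knob, not something the patent specifies:

```python
import numpy as np

def mix_for_playback(first_audio, second_audio, voice_gain=1.0):
    """Superimpose the filtered pass-through signal (first_audio) and the
    extracted human voice signal (second_audio), then clip to [-1, 1]."""
    mixed = first_audio + voice_gain * second_audio
    return np.clip(mixed, -1.0, 1.0)

first = np.array([0.2, -0.1, 0.5])   # pass-through samples
second = np.array([0.1, 0.1, 0.9])   # extracted voice samples
out = mix_for_playback(first, second)
```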
The technical scheme provided by the embodiments of the disclosure can have the following beneficial effects: the ambient sound signal is filtered with a preset transparent filter to obtain a first audio signal, and the human voice signal is extracted from the ambient sound signal to obtain a second audio signal; then, the first audio signal and the second audio signal are sent to a loudspeaker, and the loudspeaker is controlled to play the two signals synchronously. In this way, by extracting the human voice signal from the ambient sound signal and superimposing it, through synchronous playback, on the filtered ambient sound signal, the voice component of the passed-through ambient sound can be enhanced without damaging the speech, providing clear voice perception in transparent mode and a good ambient pass-through experience for the user. The user can thus clearly hear the other party in transparent mode without taking off the earphone, and can also notice important ambient sounds such as sirens and approaching vehicles, which helps keep the user safe. In addition, because only the human voice signal is superimposed, the noise is not enhanced, so the voice heard by the user is clearer.
The following is a detailed description of a method for determining the frequency response curve D used when the pass-through filter (i.e., a plurality of cascaded second-order IIR filters) performs filtering processing on the ambient sound signal.
Specifically, the frequency response curve A of the open ear (i.e., without the earphone worn) and the frequency response curve B after passive noise reduction with the earphone worn can be measured with an artificial head (as shown in fig. 2). Comparing frequency response curve A with frequency response curve B yields the passive noise reduction curve, that is, the target curve C that needs to be compensated (shown in fig. 3).
The target curve C is approximated by designing a plurality of cascaded second-order IIR filters. The specific design steps are as follows: first, randomly initialize the coefficients (frequency, gain, and Q value) of each second-order IIR filter. Then, randomly update the coefficients of each second-order IIR filter, calculate the compensation curve E, and compare it against the target curve C (for example, by computing the cosine distance between the two curves). If the difference between the compensation curve E and the target curve C is smaller than in the last update, continue updating the coefficients from the current IIR coefficients; if the difference is larger, revert to the IIR coefficients from the last update and continue updating from there. Iterate in this way until the difference between the compensation curve E and the target curve C is stable (for example, the spread between the maximum and minimum of the difference over the last 10 iterations is smaller than a preset threshold); then take the compensation curve E obtained by the latest update as the frequency response curve D. For example, the frequency response curve D and the target curve C are shown in fig. 4.
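The accept/revert random search above can be sketched as follows. To keep the sketch short, each "filter section" is modeled abstractly as a Gaussian gain bump parameterized by (center, gain, width), a stand-in for a real biquad's (frequency, gain, Q); the loop structure (perturb, score with cosine distance, keep or revert) mirrors the steps described:

```python
import numpy as np

def curve(params, freqs):
    """Compensation curve E as a sum of Gaussian gain bumps; each row of
    params is (center_hz, gain_db, width_hz)."""
    e = np.zeros_like(freqs)
    for c, g, w in params:
        e += g * np.exp(-0.5 * ((freqs - c) / w) ** 2)
    return e

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
freqs = np.linspace(50, 8000, 256)
target_c = 6 * np.exp(-0.5 * ((freqs - 2500) / 900) ** 2)  # toy target C

params = rng.uniform([100, -6, 200], [7000, 6, 1500], size=(6, 3))  # init
best = cosine_distance(curve(params, freqs), target_c)
initial = best
for _ in range(2000):
    trial = params + rng.normal(scale=[50, 0.2, 20], size=params.shape)
    d = cosine_distance(curve(trial, freqs), target_c)
    if d < best:                 # improvement: keep the new coefficients
        params, best = trial, d
    # otherwise revert, i.e. keep the previous params, and try again

freq_response_d = curve(params, freqs)   # final compensation curve E -> D
```

A production design would evaluate real biquad responses (e.g. via sosfreqz) and use the stability criterion over the last 10 iterations described above, but the accept/revert skeleton is the same.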
The following describes in detail a specific embodiment of extracting the human voice signal from the environment sound signal in S102.
In one embodiment, the ambient sound signal may be input into a pre-trained human voice extraction model to obtain the second audio signal.
In the present disclosure, the above-mentioned human voice extraction model can be obtained by training in the following way:
firstly, acquiring a reference environment sound signal and a human voice signal in the reference environment sound signal;
then, model training is carried out by taking the reference environment sound signal as the input of the human voice extraction model and taking the human voice signal in the reference environment sound signal as the target output of the human voice extraction model, so as to obtain the human voice extraction model.
For example, the above-mentioned human voice extraction model may be a convolutional neural network (CNN) combined with a long short-term memory network (LSTM), a recurrent neural network (RNN), or the like.
In another embodiment, the human voice signal in the environment sound signal can be extracted through wiener filtering. Specifically, it can be realized by S1031 to S1034 shown in fig. 5.
In S1031, the environment sound signal is transformed from the time domain to the frequency domain by fourier transform, and a frequency domain signal corresponding to the environment sound signal is obtained.
In S1032, for each audio frame in the frequency domain signal, a wiener filter coefficient corresponding to the audio frame is determined.
In S1033, the audio frame is filtered by using the wiener filter coefficient corresponding to the audio frame, so as to obtain the frequency domain human voice signal in the audio frame.
For example, the audio frame may be filtered with the wiener filter coefficient corresponding to the audio frame by the following equation (1) to obtain the frequency domain human voice signal in the audio frame:

Ŝ(n) = H(n) · y(n)    (1)

wherein y(n) is the nth audio frame in the frequency domain signal corresponding to the ambient sound signal, n = 1, 2, …, m, and m is the total number of audio frames included in the frequency domain signal corresponding to the ambient sound signal; Ŝ(n) is the frequency domain human voice signal in the nth audio frame y(n); and H(n) is the wiener filter coefficient corresponding to the nth audio frame y(n).
In S1034, inverse fourier transform is performed on the frequency domain vocal signal to obtain the vocal signal in the time domain signal corresponding to the audio frame.
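Steps S1031 to S1034 can be sketched as a standard analysis/synthesis loop in numpy: window and FFT each frame, apply a per-frame spectral gain (where the wiener coefficients of equation (1) would go), inverse FFT, and overlap-add. The frame length and hop size are illustrative choices; with a unity gain the input is reconstructed:

```python
import numpy as np

def stft_process(x, frame_len=512, hop=256, gain_fn=None):
    """S1031-S1034 skeleton: FFT each windowed frame, apply a per-frame
    spectral gain H, inverse FFT, overlap-add with window normalization."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    out = np.zeros(len(x))
    wsum = np.zeros(len(x))
    for i in range(n_frames):
        s = i * hop
        Y = np.fft.rfft(win * x[s:s + frame_len])        # S1031: time -> frequency
        H = np.ones_like(Y) if gain_fn is None else gain_fn(Y, i)  # S1032
        S = H * Y                                        # S1033: equation (1)
        out[s:s + frame_len] += win * np.fft.irfft(S, frame_len)   # S1034
        wsum[s:s + frame_len] += win ** 2
    nz = wsum > 1e-8
    out[nz] /= wsum[nz]                                  # overlap-add normalization
    return out

x = np.random.default_rng(2).standard_normal(4096)
y = stft_process(x)   # unity gain: output should match the input
```

Plugging the per-frame wiener coefficients from S1032 into gain_fn turns this skeleton into the voice extractor; with gain_fn=None it is a transparent round trip, which is a useful correctness check.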
A detailed description will be given below of a specific embodiment of determining the wiener filter coefficient corresponding to the audio frame in S1032. Specifically, the method can be realized by the following steps (1) and (2):
(1) and determining a power spectrum corresponding to the audio frame, and performing noise estimation on the audio frame to obtain a power spectrum corresponding to a noise signal in the audio frame.
For example, noise estimation may be performed on the audio frame by Voice Activity Detection (VAD), Minima Controlled Recursive Averaging (MCRA), or similar methods, to obtain the power spectrum corresponding to the noise signal in the audio frame.
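As a much-simplified stand-in for VAD/MCRA (an illustrative assumption, not the MCRA algorithm itself), the sketch below keeps a recursive noise-power estimate and updates each frequency bin only when its power does not greatly exceed the current estimate, i.e. when the bin looks noise-like:

```python
import numpy as np

def estimate_noise_psd(frame_psds, alpha=0.95, threshold=2.0):
    """Track the noise power spectrum Pdd across frames: a frequency bin is
    updated only when its power is below `threshold` times the current
    estimate, so loud speech bursts are mostly excluded from the average."""
    noise = frame_psds[0].astype(float).copy()
    for psd in frame_psds[1:]:
        noise_like = psd < threshold * noise
        noise[noise_like] = (alpha * noise[noise_like]
                             + (1 - alpha) * psd[noise_like])
    return noise

# Ten noise-only frames with per-bin power 1.0, then one loud "speech" frame.
frames = [np.ones(8) for _ in range(10)] + [np.full(8, 50.0)]
pdd = estimate_noise_psd(frames)
```

Here the loud final frame is skipped entirely, so the estimate stays at the true noise power; real MCRA additionally tracks spectral minima and a speech-presence probability per bin.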
(2) And determining the wiener filter coefficient corresponding to the audio frame according to the power spectrum corresponding to the audio frame, the power spectrum corresponding to the noise signal in the audio frame, and the power spectrum corresponding to the noise signal in the previous audio frame of the audio frame.
Specifically, the wiener filter coefficient corresponding to the audio frame can be determined through the following steps 1) to 3):
1) and determining the posterior signal-to-noise ratio corresponding to the audio frame according to the power spectrum corresponding to the audio frame and the power spectrum corresponding to the noise signal in the audio frame.
For example, the a posteriori signal-to-noise ratio corresponding to the audio frame can be determined from the power spectrum corresponding to the audio frame and the power spectrum corresponding to the noise signal in the audio frame by the following equation (2):

γ(n) = Pyy(n) / Pdd(n)    (2)

wherein γ(n) is the a posteriori signal-to-noise ratio corresponding to the nth audio frame in the frequency domain signal corresponding to the ambient sound signal; Pyy(n) is the power spectrum corresponding to the nth audio frame in the frequency domain signal corresponding to the ambient sound signal; and Pdd(n) is the power spectrum corresponding to the noise signal in the nth audio frame in the frequency domain signal corresponding to the ambient sound signal.
2) And determining the prior signal-to-noise ratio estimation value corresponding to the audio frame according to the posterior signal-to-noise ratio and the power spectrum corresponding to the noise signal in the previous audio frame of the audio frame.
For example, the a priori signal-to-noise ratio estimate corresponding to the audio frame may be determined from the a posteriori signal-to-noise ratio and the power spectrum corresponding to the noise signal in the previous audio frame by the following equation (3):

ξ̂(n) = α · P̂ss(n-1) / Pdd(n-1) + (1 - α) · max(γ(n) - 1, 0)    (3)

wherein ξ̂(n) is the a priori signal-to-noise ratio estimate corresponding to the nth audio frame in the frequency domain signal corresponding to the ambient sound signal, n > 1; Pdd(n-1) is the power spectrum corresponding to the noise signal in the (n-1)th audio frame; α is a weight coefficient; and P̂ss(n-1) is the power spectrum corresponding to the frequency domain human voice signal in the (n-1)th audio frame, where

P̂ss(n-1) = |Ŝ(n-1)|²

and, per equation (1),

Ŝ(n-1) = H(n-1) · y(n-1).
3) and generating a wiener filter coefficient corresponding to the audio frame according to the prior signal-to-noise ratio estimation value.
Illustratively, the wiener filter coefficient corresponding to the audio frame may be generated from the a priori signal-to-noise ratio estimate by the following equation (4):

H(n) = ξ̂(n) / (1 + ξ̂(n))    (4)
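Equations (2) to (4) chain together directly; the sketch below implements them in numpy for one frame (the small floor added to the denominators is an implementation detail to avoid division by zero, not part of the patent's formulas):

```python
import numpy as np

def wiener_gain(pyy, pdd, s_prev_psd, pdd_prev, alpha=0.98):
    """Per-bin wiener coefficient H(n) from the current frame's power
    spectrum (pyy), the noise power spectra of this frame (pdd) and the
    previous frame (pdd_prev), and the previous frame's estimated voice
    power spectrum (s_prev_psd)."""
    gamma = pyy / np.maximum(pdd, 1e-12)                    # eq. (2): posterior SNR
    xi = (alpha * s_prev_psd / np.maximum(pdd_prev, 1e-12)
          + (1 - alpha) * np.maximum(gamma - 1.0, 0.0))     # eq. (3): prior SNR
    return xi / (1.0 + xi)                                  # eq. (4): wiener gain

# One bin: Pyy = 4, Pdd = 1 -> gamma = 4; with alpha = 0.5 and previous
# voice/noise powers of 1, xi = 0.5*1 + 0.5*3 = 2, so H = 2/3.
h = wiener_gain(np.array([4.0]), np.array([1.0]),
                np.array([1.0]), np.array([1.0]), alpha=0.5)
```

Note the gain is always in (0, 1): bins the estimator deems noisy (small ξ̂) are attenuated, and bins dominated by voice pass nearly unchanged.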
based on the same inventive concept, the present disclosure also provides an audio signal processing apparatus. As shown in fig. 6, the apparatus 600 includes:
an obtaining module 601 configured to obtain an ambient sound signal, wherein the ambient sound signal is a sound signal in the environment around the headset;
the filtering module 602 is configured to perform filtering processing on the environment sound signal acquired by the acquiring module according to a preset pass-through filter to obtain a first audio signal;
the extracting module 603 is configured to extract a human voice signal from the environment sound signal acquired by the acquiring module 601 to obtain a second audio signal;
a playing control module 604 configured to send the first audio signal and the second audio signal to a speaker, and control the speaker to play the first audio signal and the second audio signal synchronously.
The technical scheme provided by the embodiments of the disclosure can have the following beneficial effects: the ambient sound signal is filtered with a preset transparent filter to obtain a first audio signal, and the human voice signal is extracted from the ambient sound signal to obtain a second audio signal; then, the first audio signal and the second audio signal are sent to a loudspeaker, and the loudspeaker is controlled to play the two signals synchronously. In this way, by extracting the human voice signal from the ambient sound signal and superimposing it, through synchronous playback, on the filtered ambient sound signal, the voice component of the passed-through ambient sound can be enhanced without damaging the speech, providing clear voice perception in transparent mode and a good ambient pass-through experience for the user. The user can thus clearly hear the other party in transparent mode without taking off the earphone, and can also notice important ambient sounds such as sirens and approaching vehicles, which helps keep the user safe. In addition, because only the human voice signal is superimposed, the noise is not enhanced, so the voice heard by the user is clearer.
Optionally, the extracting module 603 is configured to extract the human voice signal in the environment sound signal through wiener filtering.
Optionally, the extracting module 603 includes:
the first transformation submodule is configured to transform the environment sound signal from a time domain to a frequency domain through Fourier transformation, so as to obtain a frequency domain signal corresponding to the environment sound signal;
a first determining submodule configured to determine, for each audio frame in the frequency domain signal, a wiener filter coefficient corresponding to the audio frame;
the filtering submodule is configured to filter the audio frame by using a wiener filtering coefficient corresponding to the audio frame to obtain a frequency domain human voice signal in the audio frame;
and the second transformation submodule is configured to perform inverse Fourier transformation on the frequency domain human voice signal to obtain a human voice signal in a time domain signal corresponding to the audio frame.
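The four sub-modules above form a standard short-time processing loop: window and FFT each frame, apply per-bin Wiener coefficients, inverse-FFT, and overlap-add. A compact sketch follows (frame length, hop, and window are assumptions, and identity gains stand in for the Wiener coefficients derived later in the description):

```python
import numpy as np

def wiener_extract(x, frame_len=512, hop=256, gain_fn=None):
    # gain_fn maps a per-bin power spectrum to per-bin gains in [0, 1];
    # identity gains are a placeholder for the Wiener coefficients.
    if gain_fn is None:
        gain_fn = lambda power: np.ones_like(power)
    win = np.hanning(frame_len)
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * win
        spec = np.fft.rfft(frame)                    # time -> frequency domain
        gains = gain_fn(np.abs(spec) ** 2)           # per-frame Wiener gains
        voiced = np.fft.irfft(gains * spec, frame_len)  # back to time domain
        out[start:start + frame_len] += voiced * win    # overlap-add synthesis
    return out

# With identity gains the loop reduces to windowed analysis/resynthesis.
x = np.random.randn(2048)
voice = wiener_extract(x)
```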
Optionally, the first determining sub-module includes:
the second determining submodule is configured to determine a power spectrum corresponding to the audio frame, and perform noise estimation on the audio frame to obtain a power spectrum corresponding to a noise signal in the audio frame;
and the third determining submodule is configured to determine the wiener filter coefficient corresponding to the audio frame according to the power spectrum corresponding to the audio frame, the power spectrum corresponding to the noise signal in the audio frame and the power spectrum corresponding to the noise signal in the previous audio frame of the audio frame.
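The patent does not fix a particular noise estimator; one common choice consistent with the description above is a recursive average of the frame power spectrum over noise-dominated bins (the smoothing constant and speech threshold below are assumptions):

```python
import numpy as np

def update_noise_psd(noise_psd, frame_psd, smoothing=0.9, speech_factor=5.0):
    # Bins whose power is close to the current noise floor are treated
    # as noise-only and averaged into the estimate; bins far above it
    # likely contain speech, so the estimate is frozen there.
    noise_like = frame_psd < speech_factor * noise_psd
    return np.where(
        noise_like,
        smoothing * noise_psd + (1.0 - smoothing) * frame_psd,
        noise_psd,
    )

# Example: a frame only slightly above the noise floor updates the estimate.
noise0 = np.ones(4)
updated = update_noise_psd(noise0, np.full(4, 2.0))
```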
Optionally, the third determining sub-module includes:
a fourth determining submodule configured to determine an a posteriori signal-to-noise ratio corresponding to the audio frame according to the power spectrum corresponding to the audio frame and the power spectrum corresponding to the noise signal in the audio frame;
a fifth determining submodule configured to determine an a priori signal-to-noise ratio estimation value corresponding to the audio frame according to the a posteriori signal-to-noise ratio and a power spectrum corresponding to a noise signal in a previous audio frame of the audio frame;
and the generation submodule is configured to generate a wiener filter coefficient corresponding to the audio frame according to the prior signal-to-noise ratio estimation value.
Optionally, the fifth determining sub-module is configured to determine, according to the a posteriori signal-to-noise ratio and a power spectrum corresponding to a noise signal in a previous audio frame of the audio frame, an a priori signal-to-noise ratio estimated value corresponding to the audio frame by using the following formula:
ξ(n) = α·Pss(n-1)/Pdd(n-1) + (1-α)·max(γ(n)-1, 0)
wherein ξ(n) is the prior signal-to-noise ratio estimation value corresponding to the nth audio frame in the frequency domain signal corresponding to the environment sound signal, n > 1; Pdd(n-1) is the power spectrum corresponding to the noise signal in the (n-1)th audio frame in the frequency domain signal corresponding to the environment sound signal; α is a weight coefficient; Pss(n-1) is the power spectrum corresponding to the frequency domain human voice signal in the (n-1)th audio frame in the frequency domain signal corresponding to the environment sound signal; and γ(n) is the posterior signal-to-noise ratio corresponding to the nth audio frame in the frequency domain signal corresponding to the environment sound signal.
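Numerically, the symbol definitions above correspond to a decision-directed a priori SNR estimate, from which the Wiener coefficient follows as ξ/(1+ξ). A sketch (the default value of α is an assumption; typical values are close to 1):

```python
import numpy as np

def decision_directed_gain(frame_psd, noise_psd,
                           prev_voice_psd, prev_noise_psd, alpha=0.98):
    # A posteriori SNR gamma(n): current frame power over noise power.
    gamma = frame_psd / noise_psd
    # A priori SNR estimate xi(n): weighted mix of the previous frame's
    # voice-to-noise power ratio and the current (gamma - 1), floored at 0.
    xi = (alpha * prev_voice_psd / prev_noise_psd
          + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))
    # Wiener coefficient per bin.
    return xi / (1.0 + xi)

# Example: with alpha = 1 only the previous frame's ratio matters.
gain = decision_directed_gain(np.array([4.0]), np.array([1.0]),
                              np.array([1.0]), np.array([1.0]), alpha=1.0)
```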
Optionally, the extracting module 603 is configured to input the environment sound signal into a pre-trained human voice extracting model, resulting in a second audio signal.
Optionally, the preset pass-through filter comprises a plurality of cascaded second-order IIR filters.
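A cascade of second-order IIR (biquad) sections can be sketched in a few lines. The coefficient rows follow the common [b0, b1, b2, a0, a1, a2] second-order-sections convention with a0 = 1; the actual pass-through coefficients, which would be tuned to the earphone's acoustic response, are not given in the patent:

```python
import numpy as np

def sos_filter(sos, x):
    # Run the signal through each biquad section in turn,
    # using the direct form II transposed state recurrence.
    y = np.asarray(x, dtype=float).copy()
    for b0, b1, b2, a0, a1, a2 in sos:
        z1 = z2 = 0.0
        out = np.empty_like(y)
        for n, xn in enumerate(y):
            yn = b0 * xn + z1
            z1 = b1 * xn - a1 * yn + z2
            z2 = b2 * xn - a2 * yn
            out[n] = yn
        y = out
    return y

# An identity section leaves the signal unchanged.
identity_sos = np.array([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]])
passed = sos_filter(identity_sos, [1.0, 2.0, 3.0])
```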
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the above-mentioned audio signal processing method provided by the present disclosure.
The present disclosure also provides a headset, comprising:
a controller, and a feedforward microphone and a loudspeaker communicatively connected to the controller;
the feedforward microphone is used for acquiring the environment sound signal and sending the environment sound signal to the controller;
the loudspeaker is used for playing the audio signal according to the control instruction of the controller;
a controller comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor; the processor is adapted to perform the audio signal processing method provided by the first aspect of the present disclosure when running the computer program.
Fig. 7 is a block diagram illustrating an audio signal processing apparatus 800 according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 7, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the audio signal processing methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power component 806 provides power to the various components of device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800. The sensor assembly 814 may also detect a change in the position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described audio signal processing methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the apparatus 800 to perform the audio signal processing method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the audio signal processing method described above when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. An audio signal processing method applied to a headphone, comprising:
acquiring an environment sound signal, wherein the environment sound signal is a sound signal in the surrounding environment of the earphone;
filtering the environment sound signal according to a preset transparent filter to obtain a first audio signal;
extracting a human voice signal in the environment sound signal to obtain a second audio signal;
and sending the first audio signal and the second audio signal to a loudspeaker, and controlling the loudspeaker to synchronously play the first audio signal and the second audio signal.
2. The method of claim 1, wherein the extracting the human voice signal from the environment sound signal comprises:
and extracting a human voice signal in the environment sound signal through wiener filtering.
3. The method according to claim 2, wherein the extracting the human voice signal from the environment sound signal by wiener filtering comprises:
transforming the environment sound signal from a time domain to a frequency domain through Fourier transform to obtain a frequency domain signal corresponding to the environment sound signal;
for each audio frame in the frequency domain signal, determining a wiener filter coefficient corresponding to the audio frame;
filtering the audio frame by using a wiener filtering coefficient corresponding to the audio frame to obtain a frequency domain human voice signal in the audio frame;
and carrying out inverse Fourier transform on the frequency domain human voice signal to obtain the human voice signal in the time domain signal corresponding to the audio frame.
4. The method of claim 3, wherein the determining the wiener filter coefficients corresponding to the audio frame comprises:
determining a power spectrum corresponding to the audio frame, and performing noise estimation on the audio frame to obtain a power spectrum corresponding to a noise signal in the audio frame;
and determining a wiener filter coefficient corresponding to the audio frame according to the power spectrum corresponding to the audio frame, the power spectrum corresponding to the noise signal in the audio frame and the power spectrum corresponding to the noise signal in the previous audio frame of the audio frame.
5. The method of claim 4, wherein determining the wiener filter coefficients corresponding to the audio frame according to the power spectrum corresponding to the audio frame, the power spectrum corresponding to the noise signal in the audio frame, and the power spectrum corresponding to the noise signal in the previous audio frame of the audio frame comprises:
determining a posterior signal-to-noise ratio corresponding to the audio frame according to the power spectrum corresponding to the audio frame and the power spectrum corresponding to the noise signal in the audio frame;
determining a priori signal-to-noise ratio estimation value corresponding to the audio frame according to the posterior signal-to-noise ratio and a power spectrum corresponding to a noise signal in a previous audio frame of the audio frame;
and generating a wiener filter coefficient corresponding to the audio frame according to the prior signal-to-noise ratio estimation value.
6. The method according to claim 5, wherein said determining an a priori SNR estimate corresponding to said audio frame based on said a posteriori SNR and a power spectrum corresponding to a noise signal in a previous audio frame of said audio frame comprises:
determining a priori signal-to-noise ratio estimation value corresponding to the audio frame according to the posteriori signal-to-noise ratio and a power spectrum corresponding to a noise signal in a previous audio frame of the audio frame by the following formula:
ξ(n) = α·Pss(n-1)/Pdd(n-1) + (1-α)·max(γ(n)-1, 0)
wherein ξ(n) is the prior signal-to-noise ratio estimation value corresponding to the nth audio frame in the frequency domain signal corresponding to the environment sound signal, n > 1; Pdd(n-1) is the power spectrum corresponding to the noise signal in the (n-1)th audio frame in the frequency domain signal corresponding to the environment sound signal; α is a weight coefficient; Pss(n-1) is the power spectrum corresponding to the frequency domain human voice signal in the (n-1)th audio frame in the frequency domain signal corresponding to the environment sound signal; and γ(n) is the posterior signal-to-noise ratio corresponding to the nth audio frame in the frequency domain signal corresponding to the environment sound signal.
7. The method according to claim 1, wherein the extracting the human voice signal from the environment sound signal to obtain a second audio signal comprises:
and inputting the environment sound signal into a pre-trained human voice extraction model to obtain a second audio signal.
8. The method according to any one of claims 1-7, wherein the pre-defined pass-through filter is a plurality of cascaded second-order IIR filters.
9. An audio signal processing apparatus, applied to a headphone, comprising:
an obtaining module configured to obtain an ambient sound signal, wherein the ambient sound signal is a sound signal in the environment surrounding the headset;
the filtering module is configured to filter the environment sound signal acquired by the acquiring module according to a preset transparent filter to obtain a first audio signal;
the extraction module is configured to extract a human voice signal in the environment sound signal acquired by the acquisition module to obtain a second audio signal;
the playing control module is configured to send the first audio signal and the second audio signal to a loudspeaker, and control the loudspeaker to synchronously play the first audio signal and the second audio signal.
10. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 8.
11. An earphone, comprising: a controller and a feedforward microphone and speaker communicatively coupled to the controller;
the feedforward microphone is used for collecting an environment sound signal and sending the environment sound signal to the controller;
the loudspeaker is used for playing an audio signal according to a control instruction of the controller;
the controller comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor; the processor is configured to execute the audio signal processing method according to any one of claims 1 to 8 when the computer program is executed.
CN202111093716.7A 2021-09-17 2021-09-17 Audio signal processing method and device, readable storage medium and earphone Pending CN113810828A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111093716.7A CN113810828A (en) 2021-09-17 2021-09-17 Audio signal processing method and device, readable storage medium and earphone


Publications (1)

Publication Number Publication Date
CN113810828A true CN113810828A (en) 2021-12-17

Family

ID=78939736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111093716.7A Pending CN113810828A (en) 2021-09-17 2021-09-17 Audio signal processing method and device, readable storage medium and earphone

Country Status (1)

Country Link
CN (1) CN113810828A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114374907A (en) * 2021-12-29 2022-04-19 北京小米移动软件有限公司 Sound playback method, device, audio playback device and medium
CN114501224A (en) * 2022-03-10 2022-05-13 北京小米移动软件有限公司 Sound playback method, device, wearable device and storage medium
CN114554353A (en) * 2022-02-24 2022-05-27 北京小米移动软件有限公司 Audio processing method, device, equipment and storage medium
CN114598970A (en) * 2022-03-10 2022-06-07 北京小米移动软件有限公司 Audio processing method and device, electronic equipment and storage medium
WO2024234135A1 (en) * 2023-05-12 2024-11-21 音科思(深圳)技术有限公司 Audio signal processing method and related wearable device, and medium and program product

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001189987A (en) * 1999-12-28 2001-07-10 Pioneer Electronic Corp Narrow directivity microphone unit
CN207783066U (en) * 2017-11-07 2018-08-28 中山市天键通讯技术有限公司 A kind of pickup earphone
CN108810714A (en) * 2012-11-02 2018-11-13 伯斯有限公司 Providing environmental naturalness in ANR headphones
CN109961799A (en) * 2019-01-31 2019-07-02 杭州惠耳听力技术设备有限公司 A kind of hearing aid multicenter voice enhancing algorithm based on Iterative Wiener Filtering
CN110970057A (en) * 2018-09-29 2020-04-07 华为技术有限公司 Sound processing method, device and equipment
CN111508519A (en) * 2020-04-03 2020-08-07 北京达佳互联信息技术有限公司 Method and device for enhancing voice of audio signal
CN111885460A (en) * 2020-07-29 2020-11-03 歌尔科技有限公司 Transparent mode adjusting device and method of wireless earphone and wireless earphone
CN112712818A (en) * 2020-12-29 2021-04-27 苏州科达科技股份有限公司 Voice enhancement method, device and equipment
CN112767908A (en) * 2020-12-29 2021-05-07 安克创新科技股份有限公司 Active noise reduction method based on key sound recognition, electronic equipment and storage medium


WO2024234135A1 (en) * 2023-05-12 2024-11-21 音科思(深圳)技术有限公司 Audio signal processing method and related wearable device, and medium and program product

Similar Documents

Publication Publication Date Title
CN113810828A (en) Audio signal processing method and device, readable storage medium and earphone
CN111968662B (en) Audio signal processing method and device and storage medium
CN114630239B (en) Method, device and storage medium for reducing earphone blocking effect
CN113473304B (en) Howling suppression method, device, earphone and storage medium
CN113596665B (en) Howling sound suppression method, device, earphone and storage medium
CN114363770B (en) Filtering method and device in pass-through mode, earphone and readable storage medium
CN111986693A (en) Audio signal processing method and device, terminal equipment and storage medium
CN111988704B (en) Sound signal processing method, device and storage medium
CN112037825B (en) Audio signal processing method and device and storage medium
CN113345461B (en) Voice processing method and device for voice processing
CN111667842B (en) Audio signal processing method and device
CN114554353B (en) Audio processing method, device, equipment and storage medium
CN115396776A (en) Earphone control method and device, earphone and computer readable storage medium
CN114513723A (en) Howling suppression method, device, earphone and storage medium
CN115714948A (en) Audio signal processing method and device and storage medium
CN111294473B (en) Signal processing method and device
CN113596662A (en) Howling suppression method, howling suppression device, headphone, and storage medium
CN114040284B (en) Noise processing method, noise processing device, terminal and storage medium
CN112637416A (en) Volume adjusting method and device and storage medium
CN113825081B (en) Hearing aid method and device based on masking treatment system
CN113825082B (en) Method and device for relieving hearing aid delay
CN113345456B (en) Echo separation method, device and storage medium
CN114374907A (en) Sound playback method, device, audio playback device and medium
CN116631419A (en) Voice signal processing method and device, electronic equipment and storage medium
CN117501363A (en) Sound effect control method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination