CN119446173A

CN119446173A - A method for processing in-vehicle voice, terminal device and storage medium

Info

Publication number: CN119446173A
Application number: CN202411361388.8A
Authority: CN
Inventors: 张紫璨; 胡若愚; 黄河
Original assignee: Zhejiang Zero Run Technology Co Ltd
Current assignee: Zhejiang Zero Run Technology Co Ltd
Priority date: 2024-09-26
Filing date: 2024-09-26
Publication date: 2025-02-14

Abstract

The present application discloses a method for processing in-vehicle voice, a terminal device, and a storage medium. The processing method includes: obtaining a first audio signal collected by a first microphone, a second audio signal collected by a second microphone, and a third audio signal collected by a third microphone, wherein the distance between the first microphone and the second microphone is less than a preset distance; performing differential beam processing on the first audio signal and the second audio signal respectively to obtain a first processing signal and a second processing signal; performing Kalman filtering processing according to the energy of the first processing signal, the second processing signal, and the third audio signal at each frequency point, filtering out audio signals generated by sound zones that are not corresponding to the microphones themselves in different processing signals or audio signals. The above scheme filters out audio signals that are not generated by the sound zones that are not their own, thereby improving the accuracy of in-vehicle voice interaction.

Description

In-vehicle voice processing method, terminal equipment and storage medium

Technical Field

The present application relates to the field of speech processing technologies, and in particular, to a method for processing in-vehicle speech, a terminal device, and a storage medium.

Background

Along with rapid development of technology, a voice interaction technology is gradually applied to a vehicle-mounted interconnection scene, and users are more and more used to interact with vehicle-mounted equipment through voice, so that requirements and demands on a vehicle-mounted voice interaction system are increased. In order to meet the voice interaction between each user in the vehicle and the vehicle-mounted equipment, the vehicle-mounted voice interaction system promotes the vehicle-mounted multi-voice-zone voice interaction service so as to expand the voice interaction range.

Because personnel in the vehicle are dense and the environment is complex, the audio signal acquired by the vehicle-mounted microphone corresponding to a certain sound zone not only contains the voice of the current sound zone, but also possibly comprises the voice of other sound zones, the audio played by the vehicle and the like. If the audio signal is directly used as a voice command, response errors and other conditions are easy to occur, and the use experience of a user is reduced.

Disclosure of Invention

The application provides a processing method of in-car voice, terminal equipment and a storage medium.

The application adopts a technical scheme that a method for processing voice in a vehicle is provided, wherein at least a first microphone, a second microphone and a third microphone are arranged in the vehicle, and each microphone corresponds to one sound zone, and the method comprises the following steps:

Acquiring a first audio signal acquired by a first microphone, a second audio signal acquired by a second microphone and a third audio signal acquired by a third microphone, wherein the target distance between the first microphone and the second microphone is smaller than a preset threshold value, and the distances between the third microphone and the first microphone and the second microphone are respectively larger than the preset threshold value;

Performing differential beam processing on the first audio signal to obtain a first processing signal, and performing differential beam processing on the second audio signal to obtain a second processing signal;

And carrying out Kalman filtering processing according to the energy of the first processing signal, the second processing signal and the third audio signal at each frequency point, filtering the audio signals generated by the sound areas which are not corresponding to the first microphone in the first processing signal, filtering the audio signals generated by the sound areas which are not corresponding to the second microphone in the second processing signal, and filtering the audio signals generated by the sound areas which are not corresponding to the third microphone in the third audio signal.

Optionally, the kalman filter processing is performed according to the energy of the first processing signal, the second processing signal and the third audio signal at each frequency point, including:

performing Fourier transform on the first processing signal to obtain a first frequency domain signal, performing Fourier transform on the second processing signal to obtain a second frequency domain signal, and performing Fourier transform on the third audio signal to obtain a third frequency domain signal;

Carrying out Kalman filtering processing according to the energy of the first processing signal, the second processing signal and the third audio signal under each frequency point, filtering the audio signals generated by the sound zone corresponding to the non-first microphone in the first processing signal, filtering the audio signals generated by the sound zone corresponding to the non-second microphone in the second processing signal, and filtering the audio signals generated by the sound zone corresponding to the non-third microphone in the third audio signal, wherein the Kalman filtering processing comprises the following steps:

and carrying out Kalman filtering processing on the first frequency domain signal, the second frequency domain signal and the third frequency domain signal according to the energy of each frequency point, filtering audio signals generated by the sound zone corresponding to the non-first microphone in the first frequency domain signal, filtering audio signals generated by the sound zone corresponding to the non-second microphone in the second frequency domain signal, and filtering audio signals generated by the sound zone corresponding to the non-third microphone in the third frequency domain signal to obtain filtering signals corresponding to the sound zones where different microphones are located.

Optionally, according to the energy of each frequency point, performing kalman filtering processing on the first frequency domain signal, the second frequency domain signal and the third frequency domain signal to obtain filtering signals corresponding to the sound areas where different microphones are located, where the steps include:

Determining a target signal from the first frequency domain signal, the second frequency domain signal and the third frequency domain signal in sequence, and taking the rest signals as reference signals;

Comparing the first energy of the target signal at each frequency point with the second energy of each reference signal;

in response to the first energy being smaller than the second energy, performing Kalman filtering processing on the target signal by using the reference signal to obtain a separation signal corresponding to the target signal at each frequency point;

Responsive to the first energy being greater than or equal to the second energy, the target signal is treated as a separation signal;

and obtaining the filtering signals corresponding to the sound areas where the different microphones are positioned according to the separation signals at the different frequency points.

Optionally, obtaining the filtering signals corresponding to different microphones according to the separation signals at different frequency points includes:

calculating a first frequency domain coherence value between the target signal and the separated signal at each frequency point;

calculating a second frequency domain coherence value between the target signal and each reference signal under each frequency point;

determining a suppression factor under each frequency point based on the first frequency domain coherence value and the second frequency domain coherence value corresponding to the same frequency point;

and obtaining the filtering signals corresponding to different microphones by using the suppression factors and the separation signals.

Alternatively, the inhibitor is obtained according to the following formula:

α=max (0.05, min (1-cohxd, cohed)), where α is the suppression factor, cohed is the first frequency domain coherence value, cohxd is the second frequency domain coherence value.

Optionally, the first microphone corresponds to the first sound zone, the second microphone corresponds to the second sound zone, and the third microphone corresponds to the third sound zone;

after the step of obtaining the audio signal generated by each audio zone according to the filtering signals corresponding to the different audio zones, the method comprises the following steps:

performing inverse Fourier transform on the first filtering signal corresponding to the first sound zone to obtain a first target audio signal generated by the first sound zone;

performing inverse Fourier transform on the second filtered signal corresponding to the second sound zone to obtain a second target audio signal generated by the second sound zone;

and performing inverse Fourier transform on the third filtering signal corresponding to the third sound zone to obtain a third target audio signal generated by the third sound zone.

Optionally, the first microphone corresponds to a first sound zone, and the second microphone corresponds to a second sound zone;

performing differential beam processing on the first audio signal to obtain a first processed signal, and performing differential beam processing on the second audio signal to obtain a second processed signal, including:

acquiring a first suppression angle of the first microphone relative to the second sound zone and a second suppression angle of the second microphone relative to the first sound zone;

performing differential beam processing on the first audio signal by using the first suppression angle to obtain a first processing signal; and performing differential beam processing on the second audio signal by using the second suppression angle to obtain a second processed signal.

Optionally, performing differential beam processing on the first audio signal by using the first suppression angle to obtain a first processed signal, including:

determining a first filter coefficient according to the first suppression angle, the target distance and the signal angular frequency;

Performing differential beam processing on the first audio signal by using the first filter coefficient to obtain a first processing signal;

And performing differential beam processing on the second audio signal by using the second suppression angle to obtain a second processed signal, including:

Determining a second filter coefficient according to the second suppression angle, the target distance and the signal angular frequency;

And performing differential beam processing on the second audio signal by using the second filter coefficient to obtain a second processed signal.

Another technical scheme adopted by the application is to provide a terminal device, wherein the terminal device comprises a memory and a processor connected with the memory;

The memory is used for storing program data, and the processor is used for executing the program data to realize the processing method of the in-vehicle voice.

Another technical solution adopted by the present application is to provide a computer storage medium, where the computer storage medium is used to store program data, and when the program data is executed by a computer, the program data is used to implement the method for processing in-vehicle speech as described above.

The method has the beneficial effects that the distance between the first microphone and the second microphone is smaller than the preset threshold value, and the difference between the audio signals acquired by the microphones with smaller distance is increased by performing differential beam processing on the first audio signals acquired by the first microphone and the second audio signals acquired by the second microphone to obtain a first processing signal and a second processing signal. And at each frequency point, carrying out Kalman filtering on the audio signals according to the energy difference between the audio signals, filtering out signals generated by other sound areas in the audio signals, and accurately realizing the separation of voice signals. Further, the method for processing the in-vehicle voice can further improve the accuracy of voice interaction of in-vehicle users and improve the use experience of the users.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.

FIG. 1 is a flowchart illustrating an embodiment of a method for processing in-vehicle speech according to the present application;

FIGS. 2A-2D are schematic diagrams illustrating the distribution of microphones in a vehicle according to various embodiments of the present application;

FIG. 3 is a flowchart illustrating another embodiment of a method for processing in-vehicle speech according to the present application;

fig. 4 is a schematic structural diagram of an embodiment of a terminal device provided by the present application;

fig. 5 is a schematic structural diagram of an embodiment of a computer storage medium provided by the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

In order to improve riding experience of passengers, the intelligent vehicle is provided with vehicle-mounted microphones at different positions inside, and personalized service for different passengers is achieved by acquiring voice control instructions of the passengers at different positions. However, because the personnel in the vehicle are dense, the environment is complex, and the in-vehicle voice signals collected by the vehicle-mounted microphone not only comprise the voice of the target passenger, but also comprise interference signals of passengers at other positions, audio signals played by the vehicle, various environmental noises and the like, how to accurately sense the voice signals of the passengers at different positions is a problem to be solved.

The related art mainly adopts methods of beam forming, blind source separation and the like. Beamforming algorithms are susceptible to the number and location of microphone arrays, such as for linear array and distributed hybrid microphone layouts. The result of blind source separation has the problem of an uncertainty in the alignment. In the voice separation scheme based on the neural network, a certain gap exists between the simulated room impulse response and the real environment in the vehicle, so that the separated voice is distorted.

The application mainly designs a method for improving the accuracy of separating audio signals, and unlike the traditional method, the method starts from the position relation among vehicle-mounted microphones, combines Kalman filtering, can filter audio signals which are not generated in the own voice zone from signals which are acquired by the vehicle-mounted microphones and are mixed with the audio generated in different voice zones, and accurately identifies users speaking in different voice zones.

Referring to fig. 1 specifically, fig. 1 is a flow chart of an embodiment of a method for processing in-vehicle speech according to the present application.

As shown in fig. 1, the method for processing in-vehicle voice according to the embodiment of the present application specifically includes the following steps:

s1, acquiring a first audio signal acquired by a first microphone, a second audio signal acquired by a second microphone and a third audio signal acquired by a third microphone.

Wherein, the vehicle is provided with at least first microphone, second microphone and third microphone in inside, and every microphone corresponds a sound district.

The target distance between the first microphone and the second microphone is smaller than a preset threshold value, and the distances between the third microphone and the first microphone and the distances between the third microphone and the second microphone are respectively larger than the preset threshold value.

As shown in fig. 2A, mic-1 (microphone 1), mic-2 (microphone 2), and mic-3 correspond to the first, second, and third sound zones, respectively, wherein mic-1 and mic-2 form a linear array microphone array.

As shown in fig. 2B, the mic-1, the mic-2, the mic-3, and the mic-4 correspond to the first, second, third, and fourth sound regions, respectively, wherein the mic-1 and the mic-2 form a linear microphone array.

As shown in fig. 2C, mic-1, mic-2, mic-3, mic-4, mic-5, and mic-6 correspond to the first, second, third, fourth, fifth, and sixth sound regions, respectively, wherein mic-1 and mic-2 form a linear microphone array.

As shown in fig. 2D, mic-1, mic-2, mic-3, mic-4, mic-5, and mic-6 correspond to the first, second, third, fourth, fifth, and sixth sound regions, respectively, wherein mic-1 and mic-2 constitute a linear array microphone array, and mic-5 and mic-6 constitute a linear array microphone array.

The method for processing the in-vehicle voice provided by the application is mainly executed by a voice processing device. In some embodiments, the voice processing device may be the car microphone itself. In some embodiments, the distance measuring device may be a device communicatively coupled to the vehicle microphone. For example, the device may be any one or more of a device for monitoring images, a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal digital assistant (Personal DIGITAL ASSISTANT, PDA), a handheld device, a computing device, a car-mounted device, a wearable device, and an autopilot, a robot, a security system, glasses for augmented reality or virtual reality, a helmet. In some possible implementations, the method for processing the in-vehicle speech may be implemented by a processor invoking computer readable instructions stored in a memory.

Specifically, the voice processing device acquires a first audio signal acquired by a first microphone, a second audio signal acquired by a second microphone, and a third audio signal acquired by a third microphone.

Illustratively, the first microphone corresponds to a first sound zone and the second microphone corresponds to a second sound zone. The distance between the first microphone and the second microphone is smaller than a preset threshold, that is, the first microphone and the second microphone form a linear array microphone array (or simply referred to as a microphone array). Because the distance between the first microphone and the second microphone is relatively short, the first audio signal collected by the first microphone not only contains the audio generated in the first sound zone, but also contains the audio generated in the second sound zone, and likewise, the second audio signal collected by the second microphone contains the audio generated in the first sound zone and the second sound zone. The present inventors have found through some time of research that the first audio signal and the second audio signal are directly subjected to the kalman filtering operation, and the audio signal generated not belonging to the own audio zone cannot be obviously separated, and based on this, the first audio signal and the second audio signal need to be subjected to the correlation processing to amplify the difference between the first audio signal and the second audio signal.

In some embodiments, the preset threshold is less than 50cm. In some embodiments, the preset threshold is less than 40cm. In some embodiments, the preset threshold is less than 30cm. In some embodiments, the preset threshold is less than 20cm. In some embodiments, the preset threshold is less than 10cm. The preset threshold is, for example, 15cm.

S2, performing differential beam processing on the first audio signal to obtain a first processing signal, and performing differential beam processing on the second audio signal to obtain a second processing signal.

Specifically, the voice processing device performs differential beam processing on the first audio signal to obtain a first processed signal, and the voice processing device performs differential beam processing on the second audio signal to obtain a second processed signal.

In the related art, a beam forming manner is mainly adopted, however, an application scene corresponding to the technology is that a first microphone and a second microphone form an additive microphone array, and the additive microphone array determines the shape of a main lobe by setting the main lobe. For the first microphone and the second microphone belonging to the vehicle-mounted microphone array, the distance between the microphones is smaller, the width of the main lobe of the beam is larger, the directivity of the beam is poor, the array gain is not obvious, the enhancement effect on the target direction is weaker, and the suppression effect on the interference direction is poorer.

The differential beam algorithm is a microphone array-based sound source localization and sound separation technique that exhibits differences in spatial sound pressure. Compared with the related art, the differential microphone array determines the main lobe shape by setting the zero point direction, and improves the voice energy ratio of the target direction and the interference direction by placing the zero point in the interference direction. The sensitivity of the array to different directions of incidence is represented by the beam pattern, the differential beam satisfying the following relationship:

B[h(ω),θ]=d^H(ω,cosθ)h(ω)

Where d (ω, cos θ) is the steering vector of the array, ω is the angular frequency of the signal, θ is the direction of incidence, H (ω) is the output weight of each microphone, and H is the conjugate transpose. Taking the end emission direction of the double microphones as 0 degrees and the suppression direction as 180 degrees as an example, the calculation formula of the beam pattern formed by the differential beam is as follows:

wherein h' (ω) is Τ is the ratio of microphone array spacing to sound velocity.

In this embodiment, the voice processing apparatus applies a differential beam algorithm to the audio enhancement of the real vehicle linear array microphone array. For example, the voice processing device sets the zero point to be the angle of the second sound zone relative to the microphone array, so that the audio energy of the second sound zone direction is effectively restrained, the audio signal of the first sound zone is highlighted, and the energy difference of the audio signals generated between the first sound zone and the second sound zone is improved.

For another example, the speech processing device sets the null to an angle of the first soundfield relative to the microphone array effective to suppress audio energy in the direction of the first soundfield to emphasize audio signals of the second soundfield.

In this embodiment, the first audio signal and the second audio signal are processed by the differential beam algorithm, the obtained energy difference between the first processed signal and the second processed signal becomes large, and the signal energy distribution is similar to the pickup result of the distributed microphone. Wherein distributed microphones refer to two microphones satisfying a spacing greater than or equal to a preset threshold.

Illustratively, the first microphone corresponds to a first sound zone and the second microphone corresponds to a second sound zone. Because the distance between the first microphone and the second microphone is relatively close, the difference of the energy of the voice signals of the first voice zone and the second voice zone picked up by the voice processing device is relatively small, the effect of the subsequent Kalman filtering is affected, and the voice damage is relatively large. Therefore, differential beam processing is firstly carried out on the first audio signal and the second audio signal which are respectively corresponding to the first microphone and the second microphone which are compactly arranged, and the sound of the first received second sound zone is weakened through directional interference suppression, and the sound of the first sound zone received by the second microphone is weakened.

S3, carrying out Kalman filtering processing according to the energy of the first processing signal, the second processing signal and the third audio signal at each frequency point, filtering the audio signals generated by the sound areas corresponding to the non-first microphones in the first processing signal, filtering the audio signals generated by the sound areas corresponding to the non-second microphones in the second processing signal and filtering the audio signals generated by the sound areas corresponding to the non-third microphones in the third audio signal.

It is assumed that each frequency point corresponds to only one speaker, and it can be understood that the energy of the signal collected by the microphone nearest to the sounding zone is the largest, so that whether the current sounding zone is speaking is judged according to the energy on the frequency point.

The voice processing device performs kalman filtering processing on the first processing signal, the second processing signal and the third processing signal according to the energy between the first processing signal, the second processing signal and the third audio signal at each frequency point, filters the audio signals generated by the sound area corresponding to the non-first microphone in the first processing signal, filters the audio signals generated by the sound area corresponding to the non-second microphone in the second processing signal, and filters the audio signals generated by the sound area corresponding to the non-third microphone in the third audio signal.

According to the scheme, the distance between the first microphone and the second microphone is smaller than the preset threshold, and the difference between the audio signals collected by the microphones with smaller distance is increased by performing differential beam processing on the first audio signals collected by the first microphone and the second audio signals collected by the second microphone to obtain the first processing signals and the second processing signals. And at each frequency point, carrying out Kalman filtering on the audio signals according to the energy difference between the audio signals, filtering out signals generated by other sound areas in the audio signals, and accurately realizing the separation of voice signals. Further, the method for processing the in-vehicle voice can further improve the accuracy of voice interaction of in-vehicle users and improve the use experience of the users.

The other embodiment of the method for processing the in-vehicle voice provided by the application can specifically comprise the following steps:

s11, acquiring a first audio signal acquired by a first microphone, a second audio signal acquired by a second microphone and a third audio signal acquired by a third microphone.

S12, performing differential beam processing on the first audio signal to obtain a first processing signal, and performing differential beam processing on the second audio signal to obtain a second processing signal.

S13, performing Fourier transform on the first processing signal to obtain a first frequency domain signal, performing Fourier transform on the second processing signal to obtain a second frequency domain signal, and performing Fourier transform on the third audio signal to obtain a third frequency domain signal.

Since the first processed signal, the second processed signal and the third audio signal belong to the time domain signal. In the subsequent step, the kalman filtering is required according to the energy of the signals in different frequency domains, so that the conversion operation from the time domain signals to the frequency domain signals is required.

For example, the voice processing apparatus may perform Short Time Fourier Transform (STFT) on the first, second and third processed signals, respectively, to obtain the first, second and third frequency domain signals.

S14, carrying out Kalman filtering processing on the first frequency domain signal, the second frequency domain signal and the third frequency domain signal according to the energy of each frequency point, filtering out audio signals generated by a sound zone which is not corresponding to the first microphone in the first frequency domain signal, filtering out audio signals generated by a sound zone which is not corresponding to the second microphone in the second frequency domain signal, and filtering out audio signals generated by a sound zone which is not corresponding to the third microphone in the third frequency domain signal, thereby obtaining filtering signals corresponding to the sound zones where different microphones are located.

Specifically, the voice processing device performs kalman filtering processing on the first frequency domain signal, the second frequency domain signal and the third frequency domain signal according to the energy of each frequency point, filters out the audio signals generated by the sound area corresponding to the non-first microphone in the first frequency domain signal, filters out the audio signals generated by the sound area corresponding to the non-second microphone in the second frequency domain signal, and filters out the audio signals generated by the sound area corresponding to the non-third microphone in the third frequency domain signal, thereby obtaining the filtering signals corresponding to the sound areas where the different microphones are located.

Illustratively, the first microphone corresponds to the first filtered signal, the second microphone corresponds to the second filtered signal, and the third microphone corresponds to the third filtered signal.

s21, acquiring a first audio signal acquired by a first microphone, a second audio signal acquired by a second microphone and a third audio signal acquired by a third microphone.

S22, performing differential beam processing on the first audio signal to obtain a first processing signal, and performing differential beam processing on the second audio signal to obtain a second processing signal.

S23, performing Fourier transform on the first processing signal to obtain a first frequency domain signal, performing Fourier transform on the second processing signal to obtain a second frequency domain signal, and performing Fourier transform on the third audio signal to obtain a third frequency domain signal.

S24, determining a target signal from the first frequency domain signal, the second frequency domain signal and the third frequency domain signal in sequence, and taking the rest signals as reference signals.

In some embodiments, the speech processing apparatus uses the first frequency domain signal as the target signal and uses the second frequency domain signal and the third frequency domain signal as the reference signal in sequence.

In some embodiments, the speech processing apparatus uses the second frequency domain signal as the target signal and uses the first frequency domain signal and the fourth frequency domain signal as the reference signal in sequence.

In some embodiments, the speech processing apparatus uses the third frequency domain signal as the target signal and uses the first frequency domain signal and the second frequency domain signal as the reference signal in sequence.

S25, comparing the first energy of the target signal under each frequency point with the second energy of each reference signal.

Specifically, the voice processing device traverses all the frequency points to obtain the energy corresponding to the target signal (i.e., the first energy) and the energy corresponding to the reference signal (i.e., the second energy) under each frequency point. Further, the speech processing means compares the magnitude relation between the first energy and the second energy.

S26, in response to the fact that the first energy is smaller than the second energy, kalman filtering processing is conducted on the target signal through the reference signal, and the separation signal corresponding to the target signal under each frequency point is obtained.

Specifically, in the first target frequency point, in response to the first energy being smaller than the second energy, it is stated that the voice is generated in the voice zone corresponding to the reference signal, and the voice processing device performs kalman filtering processing on the target signal by using the reference signal to obtain a separation signal corresponding to the target signal in the first target frequency point.

Wherein the separation signal satisfies the following relationship:

Where d _n is a target signal, x _n is a reference signal, w _n is a kalman filter coefficient, and se _n is a separated signal output after filtering, which may also be referred to as an error signal.

In some embodiments, the separated signal may be a single frequency point, or may be a partial frequency band.

And S27, responding to the fact that the first energy is larger than or equal to the second energy, and taking the target signal as a separation signal.

Specifically, in response to the first energy being greater than or equal to the second energy, it is indicated that the voice is generated in a voice zone corresponding to the target signal, and the voice processing apparatus does not perform any operation.

S28, obtaining filtering signals corresponding to the sound areas where the different microphones are located according to the separation signals at the different frequency points.

In some embodiments, the first frequency domain signal is a target signal, the second frequency domain signal is a reference signal, and the speech processing device determines the first candidate filtered signal according to the obtained first separation signal (point) at different frequency points. Further, the voice processing device uses the first candidate filtering signal as a target signal and the third frequency domain signal as a reference signal, performs the steps of S25-S27 to obtain second separation signals (points) at different frequency points, and determines to obtain a filtering signal corresponding to the sound zone where the first microphone is located, that is, a filtering signal corresponding to the first sound zone.

Similarly, the filtered signal corresponding to the second audio zone and the filtered signal corresponding to the third audio zone can be obtained in the above manner.

S31, acquiring a first audio signal acquired by a first microphone, a second audio signal acquired by a second microphone and a third audio signal acquired by a third microphone.

S32, performing differential beam processing on the first audio signal to obtain a first processing signal, and performing differential beam processing on the second audio signal to obtain a second processing signal.

S33, performing Fourier transform on the first processing signal to obtain a first frequency domain signal, performing Fourier transform on the second processing signal to obtain a second frequency domain signal, and performing Fourier transform on the third audio signal to obtain a third frequency domain signal.

S34, determining a target signal from the first frequency domain signal, the second frequency domain signal and the third frequency domain signal in sequence, and taking the rest signals as reference signals.

S35, comparing the first energy of the target signal under each frequency point with the second energy of each reference signal.

S36, in response to the fact that the first energy is smaller than the second energy, kalman filtering processing is conducted on the target signal through the reference signal, and the separation signal corresponding to the target signal under each frequency point is obtained.

And S37, responding to the fact that the first energy is larger than or equal to the second energy, and taking the target signal as a separation signal.

S38, calculating a first frequency domain coherence value between the target signal and the separated signal at each frequency point.

Specifically, the voice processing device acquires a target separation signal corresponding to a target microphone. And then calculating a first frequency domain coherence value between the target signal and the separated signal at each frequency point.

The first frequency domain coherence value is used for describing the similarity degree between the target signal and the separated signal under the current frequency point. The higher the first frequency domain coherence value, the more similar the energy values of the target signal and the separation signal are at the frequency point.

In some embodiments, the first frequency domain coherence value satisfies the following relationship:

Wherein D (f) is a target signal and E (f) is a separation signal.

S39, calculating a second frequency domain coherence value between the target signal and each reference signal under each frequency point.

The second frequency domain coherence value is used for describing the similarity degree between the target signal and a certain reference signal under the current frequency point. The higher the second frequency domain coherence value, the more similar the energy values of the target signal and the reference signal are at the frequency point.

In some embodiments, the second frequency domain coherence value satisfies the following relationship:

wherein D (f) is a target signal, and X (f) is a reference signal.

S40, determining the suppression factor under each frequency point based on the first frequency domain coherence value and the second frequency domain coherence value corresponding to the same frequency point.

Specifically, the voice processing device calculates the suppression factor corresponding to each frequency point by using the first frequency domain coherence value and the second frequency domain coherence value under each frequency point.

In some embodiments, the inhibitor is derived according to the following formula:

α=max(0.05,min(1-cohxd,cohed))

Where α is a suppression factor, cohed is a first frequency domain coherence value, and cohxd is a second frequency domain coherence value.

S41, obtaining the filter signals corresponding to different microphones by using the suppression factors and the separation signals.

Specifically, the voice processing device calculates the product of the suppression factor and the target separation signal at each frequency point to obtain a target filtering signal corresponding to the target microphone, and so on to obtain filtering signals corresponding to different microphones.

It should be noted that, the kalman filter can only remove the content of the target signal coherent with the reference signal, and some residues exist in the separated signal, so the steps are needed to further remove the residual interference signal.

In some embodiments, S38-S41 may also be referred to as nonlinear processing.

In addition, considering the limitation of the Kalman filter, the application further post-processes the result after the Kalman filter processing to further remove the residual interference signal in the voice signal.

s51, acquiring a first audio signal acquired by the first microphone, a second audio signal acquired by the second microphone and a third audio signal acquired by the third microphone.

S52, performing differential beam processing on the first audio signal to obtain a first processing signal, and performing differential beam processing on the second audio signal to obtain a second processing signal.

S53, performing Fourier transform on the first processing signal to obtain a first frequency domain signal, performing Fourier transform on the second processing signal to obtain a second frequency domain signal, and performing Fourier transform on the third audio signal to obtain a third frequency domain signal.

S54, determining a target signal from the first frequency domain signal, the second frequency domain signal and the third frequency domain signal in sequence, and taking the rest signals as reference signals.

S55, comparing the first energy of the target signal at each frequency point with the second energy of each reference signal.

S56, in response to the fact that the first energy is smaller than the second energy, kalman filtering processing is conducted on the target signal through the reference signal, and the separation signal corresponding to the target signal under each frequency point is obtained.

And S57, responding to the fact that the first energy is larger than or equal to the second energy, and taking the target signal as a separation signal.

And S58, obtaining the filtering signals corresponding to the sound areas where the different microphones are located according to the separation signals at the different frequency points.

S59, performing inverse Fourier transform on the first filtered signal corresponding to the first sound zone to obtain a first target audio signal generated by the first sound zone.

In some embodiments, the first filtered signal is a time domain signal, and it can be appreciated that the time domain signal is difficult to directly apply to a subsequent voice interaction task such as voice wake-up and voice recognition, so that the first filtered signal belonging to the time domain signal needs to be converted into the first target audio signal belonging to the frequency domain signal.

Illustratively, the speech processing means performs a short-time fourier transform on the first processed signal and the speech processing means performs an short-time inverse fourier transform on the first filtered signal.

S60, performing inverse Fourier transform on the second filtered signal corresponding to the second sound zone to obtain a second target audio signal generated by the second sound zone.

In some embodiments, the second filtered signal is a time domain signal, and it can be appreciated that the time domain signal is difficult to directly apply to a subsequent voice interaction task such as voice wake-up and voice recognition, so that the second filtered signal belonging to the time domain signal needs to be converted into the second target audio signal belonging to the frequency domain signal.

The speech processing means may be configured to apply a short-time fourier transform to the second processed signal and to apply an inverse short-time fourier transform to the second filtered signal.

And S61, performing inverse Fourier transform on the third filtering signal corresponding to the third sound zone to obtain a third target audio signal generated by the third sound zone.

In some embodiments, the third filtered signal is a time domain signal, and it can be appreciated that the time domain signal is difficult to directly apply to a subsequent voice interaction task such as voice wake-up and voice recognition, so that the third filtered signal belonging to the time domain signal needs to be converted into a third target audio signal belonging to the frequency domain signal.

Illustratively, the speech processing means performs a short-time fourier transform on the third processed signal and the speech processing means performs an short-time inverse fourier transform on the third filtered signal.

s71, acquiring a first audio signal acquired by the first microphone, a second audio signal acquired by the second microphone and a third audio signal acquired by the third microphone.

Wherein, the vehicle is provided with at least first microphone, second microphone and third microphone in inside, and every microphone corresponds a sound district. The first microphone corresponds to the first sound zone, and the second microphone corresponds to the second sound zone.

S72, a first suppression angle of the first microphone relative to the second sound zone and a second suppression angle of the second microphone relative to the first sound zone are obtained.

Specifically, in response to the target distance between the first microphone and the second microphone being less than a preset threshold, the speech processing apparatus determines that the first microphone and the second microphone constitute a linear array microphone. Further, the voice processing device determines a first suppression angle θ ₂ of the first microphone relative to the second sound zone and a second suppression angle θ ₁ of the second microphone relative to the first sound zone from pickup data of the first microphone and the second microphone acquired by a large number of real vehicles in advance and theoretical modeling results.

And S73, performing differential beam processing on the first audio signal by using the first suppression angle to obtain a first processed signal, and performing differential beam processing on the second audio signal by using the second suppression angle to obtain a second processed signal.

Specifically, the voice processing device performs differential beam processing on the first audio signal by using the first suppression angle θ ₂ to obtain a first processed signal.

And the voice processing device performs differential beam processing on the second audio signal by using the second suppression angle theta ₁ to obtain a second processed signal.

S74, carrying out Kalman filtering processing according to the energy of the first processing signal, the second processing signal and the third audio signal at each frequency point, filtering the audio signals generated by the sound areas corresponding to the non-first microphones in the first processing signal, filtering the audio signals generated by the sound areas corresponding to the non-second microphones in the second processing signal and filtering the audio signals generated by the sound areas corresponding to the non-third microphones in the third audio signal.

In addition, the method for processing the in-vehicle voice provided by the application increases the energy difference of the voice zone signals acquired by the microphones in compact arrangement by utilizing the differential beam processing method, and improves the filtering effect of the subsequent Kalman filtering.

s81, acquiring a first audio signal acquired by the first microphone, a second audio signal acquired by the second microphone and a third audio signal acquired by the third microphone.

S82, a first suppression angle of the first microphone relative to the second sound zone and a second suppression angle of the second microphone relative to the first sound zone are obtained.

S83, determining a first filter coefficient according to the first suppression angle, the target distance and the signal angular frequency.

In some embodiments, the speech processing device calculates a target distance to sound velocity ratio to obtain a target ratio.

Further, the voice processing device determines a first filter coefficient according to the first suppression angle, the target ratio and the signal angular frequency.

Wherein the first filter coefficient satisfies the following relationship:

wherein C1 is the first filter coefficient.

S84, differential beam processing is carried out on the first audio signal by using the first filter coefficient, so as to obtain a first processing signal.

Specifically, the voice processing device performs differential beam processing on the first audio signal by using the first filter coefficient to obtain a first processed signal.

S85, determining a second filter coefficient according to the second suppression angle, the target distance and the signal angular frequency.

In some embodiments, the speech processing device determines the second filter coefficient based on the second suppression angle, the target coefficient, and the signal angular frequency.

Wherein the second filter coefficient satisfies the following relationship:

Wherein C2 is the first filter coefficient.

S86, performing differential beam processing on the second audio signal by using the second filter coefficient to obtain a second processing signal.

Specifically, the voice processing device performs differential beam processing on the second audio signal by using the second filter coefficient to obtain a second processed signal.

S87, carrying out Kalman filtering processing according to the energy of the first processing signal, the second processing signal and the third audio signal at each frequency point, filtering the audio signals generated by the sound areas corresponding to the non-first microphones in the first processing signal, filtering the audio signals generated by the sound areas corresponding to the non-second microphones in the second processing signal and filtering the audio signals generated by the sound areas corresponding to the non-third microphones in the third audio signal.

Referring to fig. 3, fig. 3 is a flowchart illustrating another embodiment of a method for processing in-vehicle speech according to the present application.

As shown in fig. 3, another embodiment of the method for processing in-vehicle speech provided by the present application may specifically include the following steps:

S101, acquiring a first audio signal acquired by a first microphone, a second audio signal acquired by a second microphone, a third audio signal acquired by a third microphone and a fourth audio signal acquired by a fourth microphone.

The first microphone and the second microphone form a linear array microphone array, the third microphone does not form a linear array microphone array with other microphones, and the fourth microphone does not form a linear array microphone array with other microphones.

In the present embodiment, the layout of the vehicle-mounted microphone is shown in fig. 2B.

S102, respectively performing differential beam processing on the first audio signal and the second audio signal to obtain a first processing signal corresponding to the first audio signal and a second processing signal corresponding to the second audio signal.

Because the distance between the first microphone and the second microphone is relatively close, the difference of energy between the voice signals corresponding to the first voice zone and the second voice zone picked up by the first microphone and the second microphone respectively is relatively small, the effect of subsequent Kalman filtering is affected, and the problem of relatively great voice damage exists.

Accordingly, the voice processing apparatus performs corresponding differential beam processing on the first audio signal s ₁ and the second audio signal s ₂ corresponding to the first microphone and the second microphone which are compactly arranged, and obtains the corresponding first processing signal s '₁ and the second processing signal s' ₂ by directionally suppressing interference, weakening the sound of the second sound zone received by the first microphone, and weakening the sound of the first sound zone received by the second microphone.

Through the steps, the energy difference of different sound zone signals acquired by the vehicle-mounted microphone is increased.

S103, performing Fourier transform on the first processing signal to obtain a first frequency domain signal, performing Fourier transform on the second processing signal to obtain a second frequency domain signal, performing Fourier transform on the third audio signal to obtain a third frequency domain signal, and performing Fourier transform on the fourth audio signal to obtain a fourth frequency domain signal.

Specifically, the speech processing apparatus performs fourier transform on the first processed signal s '₁, the second processed signal s' ₂, the third audio signal s ₃, and the fourth audio signal s ₄ to obtain a first frequency domain signal X ₁, a second frequency domain signal X ₂, a third frequency domain signal X ₃, and a fourth frequency domain signal X ₄, so as to implement transformation from a time domain signal to a frequency domain signal.

And S104, respectively carrying out Kalman filtering on the first frequency domain signal, the second frequency domain signal, the third frequency domain signal and the fourth frequency domain signal at each frequency point to obtain corresponding first filtering signals, second filtering signals, third filtering signals and fourth filtering signals.

Specifically, the kalman filter performed by the present embodiment satisfies the following relationship:

Where d _n is the target signal, x _n is the reference signal, w _n is the kalman filter coefficient, and se _n is the error signal, that is, the filtered signal that is output after filtering.

It is assumed that only one speaker is present in each frequency point, and the energy of the signal collected by the microphone closest to the sounding zone is the largest, so that it is determined whether the current sounding zone is speaking according to the energy of the frequency point, and whether the frequency point performs filtering operation is determined. Each microphone signal is subjected to a filtering process with all the remaining microphone signals, respectively, as a result of the separation of the sound zones.

In some application scenarios, S104 may specifically include the following steps:

The voice processing device uses the first frequency domain signal X ₁ as a target signal, uses the second frequency domain signal X ₂, the third frequency domain signal X ₃ and the fourth frequency domain signal X ₄ as reference signals, and traverses all frequency points.

If the energy of the reference signal is greater than the energy of the target signal at a certain frequency point, the speech processing device can be considered to speak into the voice zone corresponding to the reference signal, and the speech processing device performs the Kalman filtering operation on the target signal. Otherwise, the speech processing device does not perform any operation to obtain the separation signal se ₁ of the first voice zone.

Similarly, the speech processing apparatus repeats the above steps with the second frequency domain signal X ₂ as the target signal, the first frequency domain signal X ₁, the third frequency domain signal X ₃, and the fourth frequency domain signal X ₄ as the reference signals, respectively, to obtain the separation signal se ₂ of the second audio region.

Similarly, the speech processing apparatus repeats the above steps with the third frequency domain signal X ₃ as the target signal, the first frequency domain signal X ₁, the second frequency domain signal X ₂, and the fourth frequency domain signal X ₄ as the reference signals, respectively, to obtain the separation signal se ₃ of the third audio zone.

Similarly, the speech processing apparatus repeats the above steps using the fourth frequency domain signal X ₄ as the target signal, the first frequency domain signal X ₁, the second frequency domain signal X ₂, and the third frequency domain signal X ₃ as the reference signals, respectively, to obtain the separation signal se ₄ of the rear row of the assistant driving.

S105, respectively performing nonlinear processing on the first filtering signal, the second filtering signal, the third filtering signal and the fourth filtering signal at each frequency point to obtain a voice separation result corresponding to each voice zone.

The kalman filter can only eliminate the content of the target signal, which is coherent with the reference signal, and some residues can exist. In order to overcome the above problems, the speech processing apparatus calculates the frequency domain coherence values of the target signal d _n and the error signal se _n, and the target signal d _n and the reference signal x _n at each frequency point, and the calculation formulas are as follows:

Wherein D (f), X (f), E (f) correspond to the target signal, the reference signal, and the error signal, respectively. The nonlinear performance is realized by calculating the inhibition factor according to the coherence value, and the calculation formula is as follows:

α=max(0.05,min(1-cohxd,cohed))

Further, the voice processing device calculates the product of the suppression factor and the error signal, and determines a voice separation result S1 corresponding to the first voice zone, a voice separation result S2 corresponding to the second voice zone, a voice separation result S3 corresponding to the third voice zone, and a voice separation result S4 corresponding to the fourth voice zone, so as to suppress the residual signal.

The above-mentioned voice separation results are all time domain signals.

With continued reference to fig. 4, fig. 4 is a schematic structural diagram of an embodiment of a terminal device according to the present application. The terminal device 500 of the embodiment of the present application includes a processor 51, a memory 52.

The processor 51 and the memory 52 are connected to the bus, and the memory 52 stores program data, and the processor 51 is configured to execute the program data to implement the method for processing in-vehicle speech according to the above embodiment.

In an embodiment of the present application, the processor 51 may also be referred to as a CPU (Central Processing Unit ). The processor 51 may be an integrated circuit chip with signal processing capabilities. Processor 51 may also be a general purpose processor, a digital signal processor (DSP, digital Signal Process), an Application SPECIFIC INTEGRATED Circuit (ASIC), a field programmable gate array (FPGA, field Programmable GATE ARRAY) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component. The general purpose processor may be a microprocessor or the processor 51 may be any conventional processor or the like.

The present application further provides a computer storage medium, please continue to refer to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of the computer storage medium provided by the present application, the computer storage medium 600 stores program data 61, and the program data 61 is used to implement the method for processing in-vehicle voice in the above embodiment when being executed by a processor.

Embodiments of the present application may be stored in a computer readable storage medium when implemented in the form of software functional units and sold or used as a stand alone product. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. The storage medium includes a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, an optical disk, or other various media capable of storing program codes.

The foregoing description is only of embodiments of the present application, and is not intended to limit the scope of the application, and the equivalent structures or equivalent processes disclosed in the specification and the drawings are used in the same way or directly or indirectly in other related technical fields, which are also included in the scope of the application.

Claims

1. A method for processing in-vehicle speech, wherein at least a first microphone, a second microphone, and a third microphone are disposed in a vehicle, each microphone corresponding to a sound zone, the method comprising:

acquiring a first audio signal acquired by the first microphone, a second audio signal acquired by the second microphone and a third audio signal acquired by the third microphone, wherein the target distance between the first microphone and the second microphone is smaller than a preset threshold value, and the distances between the third microphone and the first microphone and the second microphone are respectively larger than the preset threshold value;

2. The method of claim 1, wherein the step of determining the position of the substrate comprises,

The kalman filter processing according to the energy of the first processing signal, the second processing signal and the third audio signal at each frequency point comprises the following steps:

the step of performing kalman filtering according to the energies of the first processing signal, the second processing signal and the third audio signal at each frequency point, filtering the audio signals generated in the audio region corresponding to the non-first microphone in the first processing signal, filtering the audio signals generated in the audio region corresponding to the non-second microphone in the second processing signal, and filtering the audio signals generated in the audio region not corresponding to the third microphone in the third audio signal includes:

And carrying out Kalman filtering processing on the first frequency domain signal, the second frequency domain signal and the third frequency domain signal according to the energy of each frequency point, filtering audio signals generated by a sound zone which is not corresponding to the first microphone in the first frequency domain signal, filtering audio signals generated by a sound zone which is not corresponding to the second microphone in the second frequency domain signal, and filtering audio signals generated by a sound zone which is not corresponding to the third microphone in the third frequency domain signal to obtain filtering signals corresponding to the sound zones where different microphones are located.

3. The method of claim 2, wherein the step of determining the position of the substrate comprises,

The kalman filtering processing is performed on the first frequency domain signal, the second frequency domain signal and the third frequency domain signal according to the energy of each frequency point to obtain filtering signals corresponding to the sound areas where different microphones are located, including:

Responding to the first energy being smaller than the second energy, and performing Kalman filtering processing on the target signal by utilizing the reference signal to obtain a separation signal corresponding to the target signal at each frequency point;

Responsive to the first energy being greater than or equal to the second energy, treating the target signal as the separation signal;

And obtaining the filtering signals corresponding to the sound areas where the different microphones are positioned according to the separation signals at different frequency points.

4. The method of claim 3, wherein the step of,

The filtering signals corresponding to different microphones are obtained according to the separation signals under different frequency points, and the filtering signals comprise:

calculating a first frequency domain coherence value between the target signal and the separation signal at each frequency point;

Calculating a second frequency domain coherence value between the target signal and each reference signal at each frequency point;

5. The method of claim 4, wherein the step of determining the position of the first electrode is performed,

The inhibition factor is obtained according to the following formula:

6. The method of claim 3, wherein the step of,

The first microphone corresponds to a first sound zone, the second microphone corresponds to a second sound zone, and the third microphone corresponds to a third sound zone;

after the step of obtaining the filtered signals corresponding to the sound areas where the different microphones are located according to the separated signals at different frequency points, the method comprises the following steps:

Performing inverse Fourier transform on a second filtering signal corresponding to the second sound zone to obtain a second target audio signal generated by the second sound zone;

7. The method of claim 1, wherein the step of determining the position of the substrate comprises,

The first microphone corresponds to a first sound zone, and the second microphone corresponds to a second sound zone;

The differential beam processing is performed on the first audio signal to obtain a first processed signal, and the differential beam processing is performed on the second audio signal to obtain a second processed signal, including:

acquiring a first suppression angle of the first microphone relative to the second sound zone, and a second suppression angle of the second microphone relative to the first sound zone;

Performing differential beam processing on the first audio signal by using the first suppression angle to obtain the first processed signal; and performing differential beam processing on the second audio signal by using the second suppression angle to obtain the second processed signal.

8. The method of claim 7, wherein the step of determining the position of the probe is performed,

The performing differential beam processing on the first audio signal by using the first suppression angle to obtain the first processed signal, including:

performing differential beam processing on the first audio signal by using the first filter coefficient to obtain the first processed signal;

and performing differential beam processing on the second audio signal by using the second suppression angle to obtain the second processed signal, where the differential beam processing includes:

And performing differential beam processing on the second audio signal by using the second filter coefficient to obtain the second processed signal.

9. Terminal equipment, characterized in that it comprises a processor, a memory connected to the processor, wherein,

The memory stores program instructions;

The processor is configured to execute program instructions stored in the memory to implement the method of any one of claims 1 to 8.

10. A computer readable storage medium, characterized in that the storage medium stores program instructions which, when executed, implement the method of any one of claims 1 to 8.