Disclosure of Invention
The present application provides a vehicle-mounted voice interaction method, apparatus, device, and storage medium, aiming to solve the technical problem of how to enable a vehicle head unit to interact with multiple speakers simultaneously.
In order to achieve the above purpose, the present application provides a vehicle-mounted voice interaction method, which includes:
acquiring a microphone audio signal, a head-unit playback reference signal, a camera vision signal, and a seat sensor signal;
processing the microphone audio signal, the head-unit playback reference signal, the camera vision signal, and the seat sensor signal to obtain audio features, visual information, and seat sensor position information;
inputting the audio features, the visual information, and the seat sensor position information into a target voice signal separation model to obtain voice signals corresponding to the positions of the seats;
and responding to the voice signals to complete vehicle-mounted voice interaction.
In an embodiment, the step of processing the microphone audio signal, the head-unit playback reference signal, the camera vision signal, and the seat sensor signal to obtain the audio features, the visual information, and the seat sensor position information includes:
performing short-time Fourier transform on the microphone audio signal and the head-unit playback reference signal to obtain the audio features;
extracting the visual information from the camera vision signal;
and parsing the seat sensor signal to obtain the seat sensor position information.
In an embodiment, the audio features include a first audio feature, a second audio feature, and a third audio feature, and the step of performing short-time Fourier transform on the microphone audio signal and the head-unit playback reference signal to obtain the audio features includes:
performing short-time Fourier transform on the head-unit playback reference signal to obtain the third audio feature;
framing the microphone audio signal to obtain audio frames;
processing the audio frames through a preset window function to obtain weighted audio frames;
processing the weighted audio frames through discrete Fourier transform to obtain frequency domain signals;
and obtaining the first audio feature and the second audio feature from the frequency domain signals.
In an embodiment, the frequency domain signal includes amplitude information and phase information, and the step of obtaining the first audio feature and the second audio feature from the frequency domain signal includes:
extracting the amplitude information to obtain the first audio feature;
obtaining the phase value of each frequency bin from the phase information;
subtracting, for each pair of microphones, the phase values at the same frequency bin to obtain the phase difference at that frequency bin;
and processing the phase differences to obtain the second audio feature.
In an embodiment, the step of extracting visual information from the camera visual signal includes:
extracting facial information from the camera visual signal;
extracting seat information when the face information does not exist;
extracting lip movement information in the face information when the face information exists;
and integrating the facial information, the lip movement information, and the seat information to obtain the visual information.
In an embodiment, before the step of inputting the audio feature, the visual information, and the seat sensor position information into a target speech signal separation model, the method further comprises:
acquiring an initial neural network model, historical voice signals at different positions, and historical noise signals;
mixing the historical voice signals and the historical noise signals to obtain historical mixed signals;
and training the initial neural network model with the historical mixed signals, the historical voice signals, and the historical noise signals to obtain the target voice signal separation model.
In an embodiment, the step of responding to the voice signal to complete the vehicle-mounted voice interaction includes:
parsing the voice signal to obtain a voice command and position information corresponding to the voice command;
and responding according to the voice command and the position information to complete the vehicle-mounted voice interaction.
In addition, in order to achieve the above object, the present application also provides a vehicle-mounted voice interaction device, which includes:
the signal acquisition module is used for acquiring microphone audio signals, head-unit playback reference signals, camera vision signals, and seat sensor signals;
the signal processing module is used for processing the microphone audio signals, the head-unit playback reference signals, the camera vision signals, and the seat sensor signals to obtain audio features, visual information, and seat sensor position information;
The signal separation module is used for inputting the audio characteristics, the visual information and the seat sensor position information into a target voice signal separation model to obtain voice signals corresponding to the positions of the seats;
and the response module is used for responding to the voice signal and completing vehicle-mounted voice interaction.
In addition, in order to achieve the above object, the present application also provides vehicle-mounted voice interaction equipment, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program is configured to implement the steps of the vehicle-mounted voice interaction method described above.
In addition, in order to achieve the above object, the present application also proposes a storage medium, which is a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the vehicle-mounted voice interaction method described above.
One or more technical solutions provided by the present application have at least the following technical effects:
the method obtains a microphone audio signal, a head-unit playback reference signal, a camera vision signal, and a seat sensor signal; processes these signals to obtain audio features, visual information, and seat sensor position information; inputs the audio features, the visual information, and the seat sensor position information into a target voice signal separation model to obtain voice signals corresponding to the positions of the seats; and responds to the voice signals to complete vehicle-mounted voice interaction. First, by acquiring the microphone audio, head-unit playback reference, camera vision, and seat sensor signals, the system comprehensively captures the sound and visual information in the vehicle together with passenger seat position data, establishing a multi-dimensional in-vehicle environment data base; acquiring these signals provides a complete information source for subsequent processing. The signals are then processed: audio features are extracted, the head-unit playback reference signal is analyzed to remove interference, visual information is extracted from the camera feed, and the seat sensor data are parsed, which improves the accuracy and usability of the data and provides clean input for subsequent voice separation. The processed audio features, visual information, and seat position data are then fed into a trained target voice signal separation model, which uses the integrated information to separate an individual voice signal for each seat position, thereby effectively identifying and distinguishing the voice sources within the vehicle. Finally, the separated voice signals are parsed, specific voice commands are identified, and corresponding operations, such as adjusting the volume or setting navigation, are executed according to the instruction from each seat position.
Through this series of steps, the vehicle-mounted system can accurately respond to each passenger's instructions, realizing simultaneous interaction between the head unit and multiple speakers and improving the accuracy and user experience of voice interaction.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.
For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.
With the development of intelligent technology, vehicle-mounted voice interaction systems have become a key feature for improving driving safety and convenience, allowing users to control functions such as navigation, telephone calls, and music playback through voice instructions. Current systems are mainly based on single-channel voice recognition technology, with instructions parsed by a central processing unit; this meets basic interaction requirements, but such systems remain limited in handling complex instructions, distinguishing the voices of multiple speakers, and maintaining accuracy and response speed in noisy environments. In particular, when multiple occupants speak at the same time, existing systems have difficulty efficiently managing and recognizing the multiple voice inputs, which limits the performance and user experience of vehicle-mounted voice interaction technology.
The main solution of the embodiments of the present application is as follows: the system first establishes a comprehensive in-vehicle environment data base by acquiring microphone audio signals, head-unit playback reference signals, camera vision signals, and seat sensor signals. It then processes these signals to extract audio features, remove interference, analyze visual information, and parse seat data, improving the accuracy and usability of the data. The processed information is input into a trained voice signal separation model, so that the voice signal of each seat position is accurately separated and specific voice commands are identified. Finally, corresponding operations, such as adjusting the volume or setting navigation, are executed according to the instruction from each seat, improving the accuracy and functionality of vehicle-mounted voice interaction.
It should be noted that the execution subject of the embodiments of the present application may be a computing service device with data processing, network communication, and program execution functions, such as a tablet computer, a personal computer, or a mobile phone, or an electronic device, vehicle-mounted system, or voice interaction system capable of implementing the above functions. This embodiment and the following embodiments are described by taking a voice interaction system as an example.
Based on this, the embodiment of the application provides a vehicle-mounted voice interaction method, and referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the vehicle-mounted voice interaction method of the application.
In this embodiment, the vehicle-mounted voice interaction method includes steps S10 to S40:
Step S10, obtaining a microphone audio signal, a head-unit playback reference signal, a camera vision signal, and a seat sensor signal;
The microphone audio signal refers to audio data captured by the microphone arrays in the vehicle; these data include speech and background noise in the vehicle and are used to extract speech features and perform sound source separation and localization. The head-unit playback reference signal refers to audio such as music and navigation prompts played by the head unit and looped back into the vehicle-mounted system; this signal helps identify and remove the audio interference emitted by the head unit so as to ensure the accuracy of voice interaction. The camera vision signal comprises video data captured by an in-vehicle camera and is mainly used to extract passengers' facial information or seat positions and to assist in locating the positions and directions of speakers. The seat sensor signals refer to data captured by pressure sensors inside the seats; they provide seat occupancy information, helping to confirm speaker positions and further improving the accuracy of voice signal separation.
It will be appreciated that, first, audio signals are acquired through the microphone array in the vehicle; these signals record the ambient noise and the speakers' voices for subsequent sound source separation and localization. Second, the head-unit playback reference signal refers to sounds such as music and navigation prompts played by the head unit and captured back by the vehicle-mounted system; this signal helps identify and remove the background sound generated by the head unit, improving the clarity of voice instructions. The camera vision signal, in turn, refers to video images captured by the in-vehicle camera, which are used to extract the facial features or seating status of the occupants and help determine the specific position and orientation of each speaker. Finally, the seat sensor signals come from pressure sensors inside the vehicle seats; they provide the occupancy of each seat, which further confirms speaker positions and assists accurate separation of the audio signals. By comprehensively processing these signals, voice commands in the vehicle can be accurately identified and responded to.
Step S20, processing the microphone audio signal, the head-unit playback reference signal, the camera vision signal, and the seat sensor signal to obtain audio features, visual information, and seat sensor position information;
It should be noted that the audio features are key information extracted from the microphone audio signals; these features help identify and separate speech and background noise in the vehicle and support sound source localization and separation. The visual information is data extracted from the camera vision signals, such as facial cues and seat status, and is used to help identify speaker position and motion, improving the accuracy of voice signal separation. The seat sensor position information is data obtained from the seat sensor signals, providing the occupancy state and position of each seat. This information is used to confirm the specific location of each passenger, further optimizing the localization and processing of the audio signals.
It will be appreciated that the microphone audio signal is processed, and audio features, including the spectral and directional information of the sound, are extracted by Fourier transform or similar methods, so as to separate speech from background noise in the vehicle. The head-unit playback reference signal is analyzed to extract audio features that help remove the interfering sounds played by the head unit. The camera vision signal is processed to extract visual information, such as the facial features or seat status of the passengers, which is used to determine each speaker's position and motion. The seat sensor signals are interpreted to obtain seat position information, which provides the occupancy status of each seat and helps accurately confirm each speaker's specific position. By combining these processing results, the sound sources in the vehicle can be accurately identified and located.
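The visual-information branch described above, which uses facial and lip-movement cues when a face is detected and falls back to seat status otherwise, can be sketched as follows. This is only an illustrative outline; the detector functions are hypothetical placeholders, not part of the disclosed method.

```python
# Hedged sketch of the visual-information extraction branch: prefer facial
# and lip-movement cues when a face is visible, otherwise fall back to
# seat status. All detector callables are hypothetical placeholders.
def extract_visual_info(frame, detect_face, detect_lips, read_seats):
    face = detect_face(frame)
    if face is None:
        # No face visible: only seat information is available.
        return {"face": None, "lips": None, "seats": read_seats(frame)}
    # Face visible: also extract lip-movement information from it.
    return {"face": face, "lips": detect_lips(face), "seats": read_seats(frame)}

# Toy stand-ins for the hypothetical detectors:
info = extract_visual_info(
    frame="frame-1",
    detect_face=lambda f: "face-A",
    detect_lips=lambda face: "moving",
    read_seats=lambda f: [True, False, False],
)
```

The integration step then combines the three fields into the visual information passed to the separation model.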
Step S30, inputting the audio characteristics, the visual information and the seat sensor position information into a target voice signal separation model to obtain voice signals corresponding to the positions of the seats;
It should be noted that the target voice signal separation model is a model obtained by training a neural network (such as a DNN, CNN, LSTM, CRN, or Transformer). It processes and integrates the audio features, visual information, and seat sensor position information from the different sensors and, based on these inputs, performs the complex signal processing needed to separate an independent voice signal for each seat position in the vehicle. The voice signals are the signals extracted from the in-vehicle audio after separation by the model; each corresponds to the speech of a specific speaker at a seat position and is used to accurately identify and respond to the voice instructions from that position.
It will be appreciated that the processed audio features, visual information, and seat sensor position information (e.g., occupancy status of each seat) are input into a trained target speech signal separation model. The model uses these comprehensive input data to perform complex analysis and processing to distinguish independent speech signals for each seat position within the vehicle. In this way, the model can accurately separate the voices of multiple speakers in the vehicle and provide corresponding clear voice signals for each seat position, thereby supporting accurate voice recognition and response.
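How the three information streams might be combined into a single model input can be sketched as below; the feature dimensions and the simple per-frame concatenation are assumptions for illustration only, not the model architecture disclosed here.

```python
import numpy as np

# Assemble a per-frame input for a separation network by concatenating
# audio features with (hypothetical) visual and seat-occupancy vectors.
# All shapes are illustrative assumptions.
n_frames, n_bins = 100, 257                         # e.g. STFT frames x frequency bins
audio_feat = np.random.rand(n_frames, n_bins)       # stand-in for STFT magnitudes
visual_info = np.random.rand(n_frames, 16)          # stand-in for a lip/face embedding
seat_pos = np.tile([1, 0, 1, 0, 0], (n_frames, 1))  # occupancy of five seats

model_input = np.concatenate([audio_feat, visual_info, seat_pos], axis=1)
# model_input: one fused feature vector per frame, shape (100, 278)
```

A real system would feed `model_input` to the trained separation network frame by frame; the concatenation merely illustrates that all three modalities condition the separation jointly.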
As an example, before the step of inputting the audio features, the visual information, and the seat sensor position information into the target voice signal separation model, the method further comprises: acquiring an initial neural network model, historical voice signals at different positions, and historical noise signals; mixing the historical voice signals with the historical noise signals to obtain historical mixed signals; and training the initial neural network model with the historical mixed signals, the historical voice signals, and the historical noise signals to obtain the target voice signal separation model.
The initial neural network model refers to a preliminarily constructed neural network framework for processing and separating speech signals, which model has not been specifically optimized for the vehicle environment prior to training. Historical speech signals refer to previously recorded data collected from within the vehicle representing the speech of a speaker at different seat positions that is used to train a model of how to identify and separate speech signals from different sources. Historical noise signals refer to various background noise data recorded in the in-vehicle environment, such as music, navigation sounds, air conditioning noise, etc., that help models learn how to remove interference from the mixed signal. The historical mixed signal refers to a signal obtained by mixing a historical voice signal and a historical noise signal according to the characteristics of an actual vehicle-mounted environment, and is used for training a model to simulate the mixing condition of voice and noise in actual use, so that the separation capability of the model in the actual environment is improved.
An initial neural network model is first obtained, which is a basic model that has not been optimized for a particular application environment. Next, historical speech signals and historical noise signals are collected at different locations, which are used to simulate speech and noise conditions in an actual vehicle environment. The historical speech signal is mixed with the historical noise signal to generate a historical mixed signal which simulates the actual speech and noise mixing conditions in the in-vehicle environment. The historical mixed signal, the historical speech signal, and the historical noise signal are then used to train an initial neural network model to optimize the model to accurately isolate the speech signal within the vehicle. The trained model is the target voice signal separation model, and can accurately separate the voice signals of each seat position according to the input audio characteristics, visual information and seat sensor position information in practical application.
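Constructing the historical mixed signals can be sketched as follows; scaling the noise to hit a target signal-to-noise ratio is an assumption for illustration, since the method above does not prescribe a particular mixing rule.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the mixture has the requested speech-to-noise
    ratio (in dB), then add it to the speech. Illustrative sketch only."""
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2)
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)  # stand-in for a recorded utterance
noise = rng.standard_normal(16000) * 0.3                     # stand-in for cabin noise
mixture = mix_at_snr(speech, noise, snr_db=5)

# Achieved SNR of the mixture (should be ~5 dB by construction):
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((mixture - speech) ** 2))
```

During training, the network would receive `mixture` as input and the clean `speech` (and `noise`) as supervision targets, exactly as the historical mixed, voice, and noise signals are used above.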
Step S40, responding to the voice signal to complete vehicle-mounted voice interaction.
It should be noted that, the response means that the system performs appropriate operation or feedback according to the processed and separated voice signals of the respective seat positions. This includes recognizing and parsing voice commands issued at various locations and then executing corresponding vehicle control commands (e.g., adjusting volume, navigation commands, air conditioning settings, etc.), or providing appropriate voice feedback (e.g., confirming information or requesting further commands). The responding process ensures that the vehicle-mounted voice system can provide accurate and timely interaction according to the positions and instructions of different passengers, and improves user experience in the vehicle.
It will be appreciated that the system first parses and understands the individual speech signals for each seat position derived from the target speech signal separation model, and then the system performs corresponding operations based on the recognized speech signals, such as adjusting in-vehicle settings (e.g., volume, navigation path, air conditioning temperature, etc.), or providing feedback information based on the speech signals. The whole process comprises recognition, processing and execution of voice instructions, so that passengers at each seat position can effectively interact with the vehicle-mounted system, accurate vehicle-mounted function control is realized, and user experience is improved.
As an example, the step of responding to the voice signal to complete vehicle-mounted voice interaction comprises the steps of analyzing the voice signal to obtain a voice command and position information corresponding to the voice command, and responding to the voice command and the position information to complete vehicle-mounted voice interaction.
Voice commands are specific instructions or requests identified from the processed and separated voice signals, such as "adjust the air conditioning temperature" or "navigate home"; they reflect the operations that passengers wish the vehicle system to perform. The position information is the specific seat position data associated with each voice command, such as the driver's seat, the front passenger seat, or the rear left seat; it helps the system determine which passenger issued the command, ensuring that the command is executed for the corresponding position.
In this step, the system first parses the processed voice signals to identify and extract the specific voice commands and the position information corresponding to each command. Then the system performs the corresponding operations based on the extracted voice command and position information, applying each command correctly to its designated position. For example, if the driver's seat issues the command "adjust the air conditioning temperature", the system adjusts the air conditioning settings of the driver's zone accordingly. In this way, the system can effectively complete vehicle-mounted voice interaction, achieve accurate function control, and improve user experience.
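Routing a recognized command to the seat that issued it can be sketched as follows; the command strings and seat names are illustrative assumptions, not the command set of the disclosed system.

```python
# Hypothetical sketch: apply each recognized command to the zone of the
# seat that issued it. Commands and seat names are illustrative only.
def respond(command, position):
    if command == "adjust air conditioning temperature":
        return f"adjusting air conditioning for the {position}"
    if command == "navigate home":
        return f"starting home navigation requested from the {position}"
    return f"unrecognized command from the {position}"

result = respond("adjust air conditioning temperature", "driver's seat")
# result: "adjusting air conditioning for the driver's seat"
```

The key design point mirrored here is that the position information travels with the command, so two simultaneous commands from different seats act on different zones rather than overwriting each other.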
This embodiment provides a vehicle-mounted voice interaction method, which comprises: obtaining a microphone audio signal, a head-unit playback reference signal, a camera vision signal, and a seat sensor signal; processing these signals to obtain audio features, visual information, and seat sensor position information; inputting the audio features, the visual information, and the seat sensor position information into a target voice signal separation model to obtain voice signals corresponding to the positions of the seats; and responding to the voice signals to complete vehicle-mounted voice interaction. First, by acquiring the microphone audio, head-unit playback reference, camera vision, and seat sensor signals, the system comprehensively captures the sound and visual information in the vehicle together with passenger seat position data, establishing a multi-dimensional in-vehicle environment data base; acquiring these signals provides a complete information source for subsequent processing. The signals are then processed: audio features are extracted, the head-unit playback reference signal is analyzed to remove interference, visual information is extracted from the camera feed, and the seat sensor data are parsed, which improves the accuracy and usability of the data and provides clean input for subsequent voice separation. The processed audio features, visual information, and seat position data are then fed into a trained target voice signal separation model, which uses the integrated information to separate an individual voice signal for each seat position, thereby effectively identifying and distinguishing the voice sources within the vehicle.
Finally, the separated voice signals are parsed, specific voice commands are identified, and corresponding operations, such as adjusting the volume or setting navigation, are executed according to the instruction from each seat position. Through this series of steps, the vehicle-mounted system can accurately respond to each passenger's instructions, realizing simultaneous interaction between the head unit and multiple speakers and improving the accuracy and user experience of voice interaction.
In the second embodiment of the present application, the same or similar content as in the first embodiment of the present application may be referred to the description above, and will not be repeated. On this basis, please refer to fig. 2, fig. 2 is a flowchart of a second embodiment of the vehicle-mounted voice interaction method of the present application, wherein step S20 of the vehicle-mounted voice interaction method includes steps S21 to S23:
Step S21, performing short-time Fourier transform on the microphone audio signal and the head-unit playback reference signal to obtain audio features;
It should be noted that the short-time Fourier transform (STFT) is a signal processing technique for analyzing the spectral characteristics of non-stationary signals. It obtains the frequency domain information of a signal within each time window by dividing the signal into short segments and applying a Fourier transform to each segment. The transform combines time and frequency information, so that both the temporal variation and the frequency characteristics of the signal can be observed. In vehicle systems, the STFT is used to extract spectral features of the audio signals, helping to identify and separate different sound sources, such as the playback signal emitted by the head unit and the ambient noise in the vehicle.
It will be appreciated that the signals are first divided into short time windows, and the signal data within each window are Fourier transformed. Through this processing, the system converts the signal from the time domain to the frequency domain, obtaining the spectral features within each time window. These spectral features reflect the intensity and variation of the signal at different frequencies, so that the individual components of the audio signal, such as in-vehicle background noise and head-unit playback audio, can be analyzed and distinguished in detail. The frequency domain information provided by the short-time Fourier transform facilitates further processing and separation of the different sound sources, thereby improving the accuracy of speech recognition and processing.
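The frame-window-transform procedure described above can be sketched with NumPy as follows; the frame length and hop size are illustrative assumptions, not values prescribed by the method.

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Minimal short-time Fourier transform: frame the signal, weight each
    frame with a Hanning window, and take the DFT of each frame."""
    window = np.hanning(frame_len)            # taper to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)        # one complex spectrum per frame

# A 440 Hz tone at 16 kHz: its energy should peak near bin 440*512/16000 ≈ 14.
fs = 16000
t = np.arange(fs) / fs
spec = stft(np.sin(2 * np.pi * 440 * t))
magnitude = np.abs(spec)     # amplitude information (basis of the first audio feature)
phase = np.angle(spec)       # phase information (basis of the second audio feature)
peak_bin = int(magnitude[0].argmax())
```

The magnitude and phase arrays correspond to the amplitude and phase information that the later steps turn into the first and second audio features.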
As an example, the audio features include a first audio feature, a second audio feature, and a third audio feature, and the step of performing short-time Fourier transform on the microphone audio signal and the head-unit playback reference signal to obtain the audio features includes: performing short-time Fourier transform on the head-unit playback reference signal to obtain the third audio feature; framing the microphone audio signal to obtain audio frames; processing the audio frames through a preset window function to obtain weighted audio frames; processing the weighted audio frames through discrete Fourier transform to obtain frequency domain signals; and obtaining the first audio feature and the second audio feature from the frequency domain signals.
The first audio feature is the short-time Fourier spectrum, a spectrogram obtained by the short-time Fourier transform (STFT) that shows the frequency distribution and intensity of the signal over different time windows. The STFT converts the time domain signal into the frequency domain, yielding for each time window a spectrum that shows the energy distribution of the signal over the frequency components and helps identify frequency characteristics of sound, such as the fundamental frequency and harmonic components of speech. The first audio feature is used to analyze the frequency content of the microphone audio signal, helping to identify and separate sounds of different origins within the vehicle, such as passengers' speech versus air conditioning noise, tire noise, road noise, and wind noise.

The second audio feature is the short-time Fourier phase difference, that is, the phase difference between different microphones obtained from the short-time Fourier transform of their signals. The phase difference represents the time delay, or phase shift, between the signals sampled at different locations, which is critical for locating a sound source: the signals captured by microphones at different positions differ in time in a way that reflects the spatial position of the source. Analyzing the phase differences helps determine the sound source positions of the different sound zones in the vehicle and accurately separate and identify the voice signals at different positions.
The third audio feature is spectral information obtained by performing the short-time Fourier transform on the head-unit playback reference signal. The head-unit playback reference signal is the signal of the music or navigation audio played by the head unit; the short-time Fourier transform converts it into frequency domain information giving the intensity distribution of the signal over frequencies, so that the head unit's audio interference can be removed from the in-vehicle noise. The third audio feature is thus used to identify and isolate the audio signals played by the head unit, preventing them from interfering with voice interaction and ensuring that the system can accurately process and respond to passengers' voice instructions.

An audio frame is a fixed-length short segment into which the continuous audio signal is divided in time, each segment containing the audio data of a certain period. The audio signal is split into many such small segments so that a short-time Fourier transform can be performed on each one, analyzing the spectral characteristics of the signal over that period. Audio frames allow the short-time Fourier transform to perform frequency domain analysis over a local time range, extracting the audio features within each time window.

The preset window function is a mathematical function used to weight each audio frame in order to reduce spectral leakage and boundary effects. At the boundaries of an audio frame, abrupt changes in the signal can cause spectral leakage; weighting each frame with a preset window function (such as a Hanning or Hamming window) makes the signal vary more smoothly toward the edges of the window, improving the accuracy of the frequency domain analysis.
By applying the window function to the audio frames, the spectral-analysis quality of the short-time Fourier transform can be improved, errors in the analysis process reduced, and the resulting audio features made more accurate. The weighted audio frames are audio frames processed by the preset window function, which weights the data of each frame to reduce boundary effects. After the window function is applied, the boundary regions of the signal are smoothed, reducing spectral leakage, so that during frequency-domain analysis the weighted audio frames better reflect the true characteristics of the signal. By reducing boundary effects, the weighted audio frames improve the precision of the short-time Fourier transform, making the extracted frequency-domain signal more accurate and improving the recognition and separation of voice signals. The Discrete Fourier Transform (DFT) is a mathematical method that converts a discrete-time signal into its frequency-domain representation, computing the amplitude and phase of the signal at each frequency. The DFT processes the weighted audio frames, converting them from the time domain to the frequency domain and generating a frequency-domain signal that reveals the frequency content of the signal and its intensity; the result of the DFT is a spectrum describing the energy distribution of the signal over the different frequencies. The DFT is used to obtain the frequency-domain features of the audio signal and to provide data support for extracting the first audio feature (the spectrum) and the second audio feature (the phase difference), thereby effectively identifying and separating the sound sources within the vehicle. The frequency-domain signal is the representation of the signal obtained by the Discrete Fourier Transform, showing the amplitude and phase of the signal at different frequencies.
The frequency-domain signal reflects the energy distribution and phase relationships of the signal at different frequencies, so that its spectral characteristics can be analyzed; these characteristics help to identify and separate the different components of the audio signal. The frequency-domain signal provides the detailed spectrum information needed to calculate the first and second audio features, thereby helping to accurately extract and process the different sound signals in the vehicle and improving the accuracy of voice interaction.
First, the car-machine loopback signal is subjected to the short-time Fourier transform to obtain the third audio feature, which helps to identify audio components such as the music and navigation sounds played by the head unit. The microphone audio signal is then framed, each frame representing the audio data of a short period, so that the local characteristics of the signal can be analyzed. A preset window function is applied to the audio frames to obtain weighted audio frames, reducing spectral leakage and boundary effects and improving analysis accuracy. The weighted audio frames are then processed with the Discrete Fourier Transform to obtain a frequency-domain signal, converting the time-domain signal into the frequency domain and revealing its components at different frequencies. Finally, the first audio feature (the short-time Fourier spectrum), showing the intensity distribution of the signal at each frequency, and the second audio feature (the short-time Fourier phase difference), providing the phase information between the different microphones, are extracted from the frequency-domain signal; together they serve to accurately identify and separate the speech signals in the vehicle.
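As an illustrative sketch (not the disclosed implementation), the framing, windowing, and DFT chain described above can be written as follows; the frame length of 512 samples, the hop of 256, and the Hann window are assumptions rather than values taken from this disclosure:

```python
import numpy as np

def to_frequency_domain(signal, frame_len=512, hop=256):
    """Frame a time-domain signal, weight each frame with a preset
    window function, and apply the DFT to obtain the frequency-domain
    signal (one complex spectrum per audio frame)."""
    window = np.hanning(frame_len)                 # preset window function
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])  # audio frames
    weighted = frames * window                     # weighted audio frames
    return np.fft.rfft(weighted, axis=1)           # frequency-domain signal
```

For a 1 s signal at 16 kHz, these parameters yield 61 frames of 257 frequency bins each; a pure tone then concentrates its energy in the bin nearest its frequency, which is the behavior the first audio feature relies on.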
As an example, the frequency-domain signal comprises amplitude information and phase information, and the step of obtaining the first audio feature and the second audio feature according to the frequency-domain signal includes: extracting the amplitude information to obtain the first audio feature; obtaining the phase value of each frequency point according to the phase information; subtracting the phase values of each pair of microphones at the same frequency point to obtain the same-frequency-point phase difference; and processing the same-frequency-point phase difference to obtain the second audio feature.
The amplitude information is a numerical value describing the intensity of each frequency component in the frequency-domain signal; it reflects the energy of the signal at a specific frequency, is obtained by taking the modulus of the signal's Discrete Fourier Transform result, is usually represented as the magnitude part of a spectrogram, and helps to identify and distinguish the main frequency components of the audio signal. The phase information is a numerical value describing the phase angle of each frequency component in the frequency-domain signal, indicating the relative time position of the signal at that frequency; it is obtained by taking the argument (angle) of the signal's Discrete Fourier Transform result and helps in understanding the waveform characteristics of the signal and the relative timing relationships between different audio sources. A phase value is the specific angle given by the phase information at a particular frequency point; the phase value of each frequency point represents the phase angle of that frequency component and can be used to analyze waveform changes in the audio signal. The same-frequency-point phase difference is the difference between the phase values measured at the same frequency point in the frequency-domain signals of two different microphones; it reflects the relative spatial position of the sound sources and their influence on the different microphones.
First, the amplitude information is extracted from the frequency-domain signal; these data show the intensity distribution of the signal at each frequency point, yielding the first audio feature (the short-time Fourier spectrum), used to identify the frequency components of the signal. Next, the phase value of each frequency point is calculated from the phase information, reflecting the phase angle of the signal at the different frequency points. Then, the phase values measured by each pair of microphones at the same frequency point are subtracted to obtain the same-frequency-point phase difference; this difference reflects how the phases of the sound received by the two microphones differ, so that the spatial position of the sound source can be analyzed. Finally, these phase differences are further processed to obtain the second audio feature (the short-time Fourier phase difference), which is used to accurately distinguish different sound sources and their locations, improving the separation and recognition accuracy of the speech signals.
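A minimal sketch of the amplitude extraction and same-frequency-point phase subtraction, assuming a single microphone pair and wrapping the raw difference into (-π, π]; the function name and the pairing are illustrative, not from the disclosure:

```python
import numpy as np

def amplitude_and_phase_difference(spec_a, spec_b):
    """From the complex frequency-domain frames of two microphones,
    derive the first audio feature (magnitude spectrum of microphone A)
    and the second audio feature (per-bin phase difference A - B)."""
    first_feature = np.abs(spec_a)          # amplitude information
    phase_a = np.angle(spec_a)              # phase value per frequency point
    phase_b = np.angle(spec_b)
    raw_diff = phase_a - phase_b            # same-frequency-point subtraction
    # wrap into (-pi, pi] so a constant delay appears as a smooth slope
    second_feature = np.angle(np.exp(1j * raw_diff))
    return first_feature, second_feature
```

With more than two microphones, the same subtraction would be repeated for each microphone pair, one phase-difference map per pair.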
Step S22, visual information is extracted from the camera visual signals;
The visual information refers to feature data for analysis extracted from image data captured by the camera. These feature data may include object recognition in the image, scene analysis, and detection of dynamic changes. By processing these image data, the spatial layout and possible sources of interference within the vehicle can be identified and integrated with the audio signal. Therefore, the sound source can be positioned more accurately, and the accuracy and the reliability of the vehicle-mounted voice interaction system are improved.
It will be appreciated that the step of extracting visual information from the camera visual signal includes analyzing the image data captured by the camera to extract key information that aids in understanding the in-vehicle environment and the locations of the passengers. Such visual information may include the spatial layout within the vehicle, the locations of objects, and other dynamic changes. By processing these image data, key features within the vehicle, such as structural and environmental changes in the cabin, can be identified, which assists the analysis of the audio signal. Using the visual information together with the audio signal makes it possible to accurately judge the source and influence of a sound, improving the overall performance and accuracy of the vehicle-mounted voice interaction system.
As an example, the step of extracting visual information from the camera visual signal includes: extracting face information from the camera visual signal; extracting seat information when no face information is present; extracting lip movement information from the face information when face information is present; and integrating the face information, the lip movement information, and the seat information to obtain the visual information.
The face information refers to passenger facial features identified and extracted from image data captured by the camera. This includes the location, shape, expression, etc. of the face for identifying the identity or location of the passenger. The seat information refers to the position information and configuration of the seats in the vehicle extracted from the camera images, including the presence, position and possible adjustment of the seats, for determining the specific seat position of the passenger. Lip movement information refers to passenger lip movement data extracted from facial information, including lip movement and shape changes, to aid in recognizing the speaker's voice, particularly in noisy environments to help enhance the accuracy of speech recognition.
First, face information is extracted from an image to identify the identity and location of a passenger. If no facial information is detected in the image, seat information is instead extracted to determine which seat positions are free of passengers, thereby excluding interference from these positions. If the face information exists in the image, lip movement information, namely lip movement data of the passenger, is further extracted from the face information so as to assist voice recognition, and particularly, recognition accuracy is improved in a noisy environment. Finally, the extracted face information, lip movement information and seat information are integrated to form comprehensive visual information, which facilitates more accurate analysis and processing of the vehicle-mounted voice signal.
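The face/seat/lip branch above can be sketched as follows; the `VisualInfo` container and the detector callables are hypothetical placeholders for whichever face, lip-movement, and seat models the system actually uses:

```python
from dataclasses import dataclass, field

@dataclass
class VisualInfo:
    """Integrated visual information: faces, lip movement, seat data."""
    faces: list = field(default_factory=list)
    lip_movements: list = field(default_factory=list)
    seats: list = field(default_factory=list)

def extract_visual_info(frame, detect_faces, detect_lips, detect_seats):
    faces = detect_faces(frame)
    if not faces:
        # no face detected: fall back to seat information only
        return VisualInfo(seats=detect_seats(frame))
    # faces present: additionally extract lip movement per face
    lips = [detect_lips(face) for face in faces]
    return VisualInfo(faces=faces, lip_movements=lips,
                      seats=detect_seats(frame))
```

Passing the detectors in as callables keeps the branching logic independent of the particular vision models used.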
Step S23, the seat sensor signals are analyzed to obtain seat sensor position information.
It should be noted that the seat sensor position information refers to specific data related to the seats in the vehicle. It includes the position of the seat: the sensor can detect the exact position of the seat in the vehicle, such as front row, rear row, left side, or right side, which helps to determine the specific seat of a passenger. It also includes the occupancy state of the seat: whether a passenger is sitting in the seat or the seat is idle can be determined by the sensor from detected pressure or weight changes. Finally, it includes seat adjustment: the sensor can also detect whether the seat has been adjusted, such as reclined or moved back and forth, providing the current state of the seat.
It will be appreciated that the step of parsing the seat sensor signals to obtain seat sensor position information includes first reading the data captured by the seat sensors, including weight distribution, pressure changes, or position-adjustment information. By analyzing these signals, the system can determine the specific location of each seat, including its coordinates and orientation within the vehicle. The sensor data can also indicate the occupancy state of a seat, i.e., whether a passenger is sitting on it; if a seat is unoccupied, the sensor sends a corresponding signal marking it as empty. By combining this information, the system can accurately identify the specific position and state of each seat in the vehicle, providing the data support needed for further voice-signal processing and passenger identification.
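A minimal sketch of parsing the sensor readings into position and occupancy records; the weight threshold and the reading format are illustrative assumptions, since the disclosure does not specify them:

```python
OCCUPIED_THRESHOLD_KG = 20.0   # assumed threshold, not from the disclosure

def parse_seat_sensors(readings):
    """Map raw seat-sensor readings (seat position label -> measured
    weight in kg) to seat sensor position information records."""
    return {
        position: {"occupied": weight >= OCCUPIED_THRESHOLD_KG,
                   "weight_kg": weight}
        for position, weight in readings.items()
    }
```

A real system would likely also fold in the seat-adjustment signals described above; this sketch covers only position and occupancy.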
This embodiment performs the short-time Fourier transform on the microphone audio signal and the car-machine loopback signal to obtain the audio features, with the transform of the loopback signal yielding the third audio feature; it extracts visual information from the camera visual signal and parses the seat sensor signal to obtain the seat sensor position information. First, by performing the short-time Fourier transform on the microphone audio signal and the car-machine loopback signal, these signals can be converted from the time domain to the frequency domain and audio features such as amplitude information and phase information extracted. These audio features help to effectively distinguish and identify the different sound sources within the vehicle, such as speech, music, and noise, thereby improving the accuracy of speech recognition. The third audio feature obtained from the loopback signal is further used to identify and suppress the background sound played by the head unit, optimizing the recognition of voice commands. Meanwhile, the visual information extracted from the camera signal can be used to identify the positions and states of the passengers, enabling the system to provide more accurate, personalized services. In addition, by analyzing the seat sensor position information obtained from the seat sensor signals, the occupancy state and position of each seat can be determined, giving the system key information about the occupant distribution. These steps work together to ensure that the vehicle-mounted voice system can accurately recognize and respond to voice instructions in a complex vehicle environment, improving the overall effect and user experience of vehicle-mounted voice interaction.
As an example, to help understand the implementation flow of the vehicle-mounted voice interaction method in combination with the first embodiment, please refer to fig. 3. Fig. 3 is a schematic diagram of the signal feature extraction and processing architecture of the vehicle-mounted voice interaction method according to the second embodiment of the present application. Specifically:
The figure shows the signal processing flow of the vehicle-mounted voice interaction system. The system first acquires an audio signal through the vehicle-mounted microphone array while simultaneously collecting the car-machine loopback signal, the camera visual signal, and the seat sensor signal. These signals are fed to a signal processing module, which performs the short-time Fourier transform to extract audio features, including spectrum information (the first audio feature, audio feature 1 in the figure) and phase-difference information (the second audio feature, audio feature 2 in the figure). In addition, the car-machine loopback signal is subjected to the short-time Fourier transform to obtain the third audio feature (audio feature 3 in the figure), used to distinguish and remove the audio interference played by the head unit. The camera visual signal is processed to extract visual information, such as facial and lip movement information, and the seat sensor signal is analyzed to obtain seat sensor position information. All of this feature information is then input into the target voice signal separation model, a trained neural network, to separate and identify the voice signals from the different seat positions (driver, front passenger, the seats behind them, etc.). Finally, the separated voice signals are sent to a response module, which performs the corresponding vehicle control operations, such as volume adjustment or navigation setting, according to the recognized voice commands and the position information, achieving accurate vehicle-mounted voice interaction. This process not only improves the accuracy of voice recognition but also enhances the user experience, especially in complex vehicle-mounted environments with multiple passengers and high noise.
It should be noted that the foregoing examples are only for understanding the present application, and do not constitute a limitation of the vehicle-mounted voice interaction method of the present application, and many simple changes based on this technical concept are all within the scope of the present application.
The application also provides a vehicle-mounted voice interaction device, referring to fig. 4, the vehicle-mounted voice interaction device comprises:
the signal acquisition module 10 is used for acquiring microphone audio signals, car-machine loopback signals, camera visual signals, and seat sensor signals;
the signal processing module 20 is configured to process the microphone audio signal, the car-machine loopback signal, the camera visual signal, and the seat sensor signal to obtain audio features, visual information, and seat sensor position information;
the signal separation module 30 is configured to input the audio feature, the visual information, and the seat sensor position information into a target voice signal separation model, so as to obtain voice signals corresponding to positions of the seats;
And the response module 40 is used for responding to the voice signal and completing vehicle-mounted voice interaction.
In an embodiment, the audio features include a first audio feature, a second audio feature, and a third audio feature, and the signal processing module 20 is further configured to perform the short-time Fourier transform on the microphone audio signal and the car-machine loopback signal to obtain the audio features, perform the short-time Fourier transform on the car-machine loopback signal to obtain the third audio feature, extract visual information from the camera visual signal, and parse the seat sensor signal to obtain seat sensor position information.
In an embodiment, the signal processing module 20 is further configured to perform the short-time Fourier transform on the car-machine loopback signal to obtain the third audio feature, frame the microphone audio signal to obtain audio frames, process the audio frames through a preset window function to obtain weighted audio frames, process the weighted audio frames through the Discrete Fourier Transform to obtain a frequency-domain signal, and obtain the first audio feature and the second audio feature according to the frequency-domain signal.
In an embodiment, the signal processing module 20 is further configured to extract the amplitude information from the frequency-domain signal to obtain the first audio feature, obtain the phase value of each frequency point according to the phase information, subtract the phase values of each pair of microphones at the same frequency point to obtain the same-frequency-point phase difference, and process the same-frequency-point phase difference to obtain the second audio feature.
In an embodiment, the signal processing module 20 is further configured to extract facial information from the camera visual signal, extract seat information when no facial information is present, extract lip movement information from the facial information when facial information is present, and integrate the facial information, the lip movement information, and the seat information to obtain the visual information.
In an embodiment, the signal separation module 30 is further configured to obtain an initial neural network model, historical voice signals at different positions, and historical noise signals, mix the historical voice signals with the historical noise signals to obtain a historical mixed signal, and train the initial neural network model by using the historical mixed signal, the historical voice signals, and the historical noise signals to obtain a target voice signal separation model.
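The data-preparation step of this training flow, mixing a historical voice signal with a historical noise signal to form the historical mixed signal, can be sketched as below; mixing at a chosen signal-to-noise ratio is an assumption, as the disclosure does not specify how the signals are combined:

```python
import numpy as np

def make_training_example(clean_speech, noise, snr_db=5.0):
    """Mix a historical voice signal with a historical noise signal to
    produce a (historical mixed signal, separation target) pair for
    training the voice signal separation model."""
    speech_power = np.mean(clean_speech ** 2)
    noise_power = np.mean(noise ** 2)
    # scale the noise so the mixture has the requested SNR
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = clean_speech + scale * noise
    return mixed, clean_speech
```

Training then presents the mixed signal as the model input and the clean historical voice signal as the target the separation network must recover.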
In an embodiment, the response module 40 is further configured to parse the voice signal to obtain a voice command and location information corresponding to the voice command, and respond according to the voice command and the location information to complete vehicle-mounted voice interaction.
The vehicle-mounted voice interaction device provided by the application can solve the technical problem of how to realize simultaneous interaction of a vehicle machine and a plurality of speakers by adopting the vehicle-mounted voice interaction method in the embodiment. Compared with the prior art, the beneficial effects of the vehicle-mounted voice interaction device provided by the application are the same as those of the vehicle-mounted voice interaction method provided by the embodiment, and other technical features of the vehicle-mounted voice interaction device are the same as those disclosed by the method of the embodiment, so that the description is omitted herein.
The application provides vehicle-mounted voice interaction equipment which comprises at least one processor and a memory in communication connection with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the vehicle-mounted voice interaction method in the first embodiment.
Referring now to fig. 5, a schematic diagram of an in-vehicle voice interaction device suitable for implementing embodiments of the present application is shown. The in-vehicle voice interaction device in the embodiments of the present application may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (Personal Digital Assistant), a PAD (tablet computer), a PMP (Portable Media Player), an in-vehicle terminal (e.g., an in-vehicle navigation terminal), and the like, as well as fixed terminals such as a digital TV and a desktop computer. The vehicle-mounted voice interaction device shown in fig. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in fig. 5, the in-vehicle voice interaction device may include a processing device 1001 (e.g., a central processing unit, a graphics processor, etc.), which may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage device 1003 into a Random Access Memory (RAM) 1004. The RAM 1004 also stores various programs and data required for the operation of the in-vehicle voice interaction device. The processing device 1001, the ROM 1002, and the RAM 1004 are connected to each other by a bus 1005, and an input/output (I/O) interface 1006 is also connected to the bus. In general, the following may be connected to the I/O interface 1006: an input device 1007 such as a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, or gyroscope; an output device 1008 including a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; the storage device 1003 including a magnetic tape, a hard disk, and the like; and a communication device 1009. The communication device 1009 may allow the in-vehicle voice interaction device to communicate wirelessly or by wire with other devices to exchange data. While the figure illustrates a vehicle-mounted voice interaction device with various systems, it is to be understood that not all illustrated systems are required to be implemented or provided; more or fewer systems may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication device, or installed from the storage device 1003, or installed from the ROM 1002. The above-described functions defined in the method of the disclosed embodiment of the application are performed when the computer program is executed by the processing device 1001.
The vehicle-mounted voice interaction equipment provided by the application can solve the technical problem of how to realize simultaneous interaction of a vehicle machine and a plurality of speakers by adopting the vehicle-mounted voice interaction method in the embodiment. Compared with the prior art, the beneficial effects of the vehicle-mounted voice interaction device provided by the application are the same as those of the vehicle-mounted voice interaction method provided by the embodiment, and other technical features of the vehicle-mounted voice interaction device are the same as those disclosed by the method of the previous embodiment, so that the description is omitted.
It is to be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the description of the above embodiments, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.
The present application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon for performing the vehicle-mounted voice interaction method in the above-described embodiments.
The computer-readable storage medium provided by the present application may be, for example, a USB flash drive, but is not limited thereto; it may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system or device. Program code embodied on a computer-readable storage medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber-optic cable, RF (Radio Frequency), and the like, or any suitable combination of the foregoing.
The computer readable storage medium may be included in the vehicle-mounted voice interaction device or may exist alone without being incorporated in the vehicle-mounted voice interaction device.
The computer-readable storage medium carries one or more programs. When the one or more programs are executed by the vehicle-mounted voice interaction device, the device is caused to: acquire the microphone audio signal, the car-machine loopback signal, the camera visual signal, and the seat sensor signal; process the microphone audio signal, the car-machine loopback signal, the camera visual signal, and the seat sensor signal to obtain audio features, visual information, and seat sensor position information; input the audio features, the visual information, and the seat sensor position information into the target voice signal separation model to obtain the voice signals corresponding to the positions of the seats; and respond to the voice signals to complete the vehicle-mounted voice interaction.
Computer program code for carrying out the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present application may be implemented in software or in hardware, and the name of a module does not, in some cases, constitute a limitation on the module itself.
The readable storage medium provided by the application is a computer readable storage medium, and the computer readable storage medium stores computer readable program instructions (namely computer program) for executing the vehicle-mounted voice interaction method, so that the technical problem of how to realize simultaneous interaction of a vehicle and a plurality of speakers can be solved. Compared with the prior art, the beneficial effects of the computer readable storage medium provided by the application are the same as those of the vehicle-mounted voice interaction method provided by the embodiment, and are not repeated here.
The application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the vehicle-mounted voice interaction method as described above.
The computer program product provided by the application can solve the technical problem of how to realize simultaneous interaction of the vehicle machine and a plurality of speakers. Compared with the prior art, the beneficial effects of the computer program product provided by the application are the same as those of the vehicle-mounted voice interaction method provided by the embodiment, and are not repeated here.
The foregoing description is only a partial embodiment of the present application, and is not intended to limit the scope of the present application, and all the equivalent structural changes made by the description and the accompanying drawings under the technical concept of the present application, or the direct/indirect application in other related technical fields are included in the scope of the present application.