CN114446288A - Voice interaction method, device and equipment - Google Patents
- Publication number
- CN114446288A CN114446288A CN202011116808.8A CN202011116808A CN114446288A CN 114446288 A CN114446288 A CN 114446288A CN 202011116808 A CN202011116808 A CN 202011116808A CN 114446288 A CN114446288 A CN 114446288A
- Authority
- CN
- China
- Prior art keywords
- voice
- data
- module
- audio data
- activity detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
Abstract
A voice interaction method, device, and equipment are disclosed. Voice activity detection is performed on collected audio data; whether the duration of the detected voice data is greater than a first threshold is determined; and if the duration is less than or equal to the first threshold, the voice data is not delivered to the voice recognition system for voice recognition. If the duration of the voice data obtained by voice activity detection is less than or equal to the first threshold, the voice data can be considered echo data (such as residual echo) of the voice output by the device. By not delivering such data for recognition, interference with normal voice interaction caused by recognizing the echo of the device's own output voice is avoided, and the quality of the voice interaction service is improved.
Description
Technical Field
The present disclosure relates to the field of voice interaction, and in particular, to a voice interaction method, apparatus, and device.
Background
With the popularization of intelligent hardware, voice interaction technology is applied ever more widely in daily life, for example in smart televisions, air conditioners, smart speakers, vehicle-mounted voice interaction equipment, voice ticketing machines, and the like.
Voice interaction belongs to the category of human-computer interaction and is a frontier interaction mode to which human-computer interaction has evolved. Voice interaction is the process by which a user gives instructions to a machine through natural language to achieve his or her own objectives.
During voice interaction, the machine generally also responds to the user by voice broadcast to improve the interaction experience. However, echoes of the machine's broadcast speech can interfere with normal speech recognition, so recognition of the machine's own broadcast speech should be avoided.
Taking a voice wake-up scenario as an example: in response to a user's voice wake-up command, a device usually plays a prompt voice to indicate that it is awake. If no echo cancellation is applied to the prompt voice, the prompt voice emitted by the device is picked up by the device itself and fed into the speech recognition system; recognizing it is meaningless and interferes with normal voice interaction. Even when an echo cancellation algorithm is applied, the prompt voice is a sudden, short utterance that cannot be cancelled completely, so a partial residue remains; this residue, once it enters the speech recognition system, also interferes with voice interaction and degrades the recognition result.
Therefore, reducing the influence of the echo of the device's output voice on voice interaction is critical to improving the quality of the voice interaction service.
Disclosure of Invention
The technical problem to be solved by the present disclosure is to provide a voice interaction scheme capable of reducing the influence of echo of a prompt voice on voice interaction.
According to a first aspect of the present disclosure, there is provided a voice interaction device, comprising: the pickup module is used for collecting audio data; the voice activity detection module is used for carrying out voice activity detection on the audio data collected by the pickup module; the judging module is used for judging whether the duration of the voice data detected by the voice activity detecting module is greater than a first threshold value; and the data processing module is used for not delivering the voice data to the voice recognition system for voice recognition if the duration is less than or equal to the first threshold.
According to a second aspect of the present disclosure, there is provided a voice interaction device, comprising: a pickup module, a processor, and an output module. The pickup module collects audio data; the processor performs voice wake-up detection on the audio data collected by the pickup module; in response to a wake-up word being detected, the output module outputs a prompt voice; the processor further performs voice activity detection on the audio data collected after voice wake-up and, when voice data is detected, determines whether the voice data is the prompt voice and decides, according to the result of that determination, whether to deliver the voice data to the voice recognition system for voice recognition.
According to a third aspect of the present disclosure, there is provided a smart device comprising: the pickup module is used for collecting audio data; and the processor is used for carrying out voice activity detection on the audio data collected by the pickup module, judging whether the duration of the detected voice data is greater than a first threshold value or not, and if the duration is less than or equal to the first threshold value, not handing the voice data to the voice recognition system for voice recognition.
According to a fourth aspect of the present disclosure, there is provided an in-vehicle apparatus including: the pickup module is used for collecting audio data; and the processor is used for carrying out voice activity detection on the audio data collected by the pickup module, judging whether the duration of the detected voice data is greater than a first threshold value or not, and if the duration is less than or equal to the first threshold value, not handing the voice data to the voice recognition system for voice recognition.
According to a fifth aspect of the present disclosure, there is provided a voice chip comprising: and the processing module is used for carrying out voice activity detection on the collected audio data, judging whether the duration of the detected voice data is greater than a first threshold value or not, and if the duration is less than or equal to the first threshold value, not handing the voice data to the voice recognition system for voice recognition.
According to a sixth aspect of the present disclosure, there is provided a voice interaction method, comprising: carrying out voice activity detection on the collected audio data; judging whether the duration of the detected voice data is greater than a first threshold value; and if the duration of the voice data is less than or equal to the first threshold, not delivering the voice data to the voice recognition system for voice recognition.
According to a seventh aspect of the present disclosure, there is provided a voice interaction method, comprising: acquiring voice data; judging whether the voice data is output by equipment or not; and determining whether to deliver the voice data to a voice recognition system for voice recognition according to the judgment result of whether the voice data is output by the equipment.
According to an eighth aspect of the present disclosure, there is provided a voice interaction apparatus, comprising: the voice activity detection module is used for carrying out voice activity detection on the collected audio data; the judging module is used for judging whether the duration of the detected voice data is greater than a first threshold value; and the processing module is used for not delivering the voice data to the voice recognition system for voice recognition if the duration of the voice data is less than or equal to the first threshold.
According to a ninth aspect of the present disclosure, there is provided a voice interaction apparatus, comprising: the acquisition module is used for acquiring voice data; the judging module is used for judging whether the voice data is output by equipment or not; and the processing module is used for determining whether to deliver the voice data to the voice recognition system for voice recognition according to the judgment result of whether the voice data is output by the equipment.
According to a tenth aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of the sixth or seventh aspect.
According to an eleventh aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of the sixth or seventh aspect as described above.
Therefore, the duration of the voice data obtained by performing voice activity detection on the collected audio data is compared with the first threshold. If the duration does not exceed the first threshold, the voice data is considered to be echo data (such as residual echo) of the voice output by the device and is not delivered to the voice recognition system for recognition. Interference with normal voice interaction caused by recognizing the echo of the device's own output voice is thus avoided, and the quality of the voice interaction service is improved.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows a schematic flow chart of a method of voice interaction according to one embodiment of the present disclosure.
Fig. 2 and 3 show schematic diagrams of the present disclosure in a voice wake-up scenario.
FIG. 4 shows a schematic structural diagram of a voice interaction device according to one embodiment of the present disclosure.
Fig. 5 shows a schematic structural diagram of a voice interaction device according to another embodiment of the present disclosure.
Fig. 6 shows a schematic structural diagram of a smart device according to an embodiment of the present disclosure.
Fig. 7 shows a schematic configuration diagram of an in-vehicle apparatus according to an embodiment of the present disclosure.
Fig. 8 shows a schematic structural diagram of a voice interaction device according to an embodiment of the present disclosure.
Fig. 9 shows a schematic structural diagram of a voice interaction device according to another embodiment of the present disclosure.
FIG. 10 shows a schematic structural diagram of a computing device, according to one embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to reduce the influence on voice interaction of the echo of device-output voice (such as a prompt voice output after voice wake-up, voice output in response to a specific trigger condition, or voice output during a conversation with the user), the present disclosure determines whether acquired voice data was output by the device (for example, whether it is the prompt voice played after wake-up), and decides accordingly whether to deliver the voice data to a voice recognition system for voice recognition. If the voice data is judged not to be device-output voice, it is delivered to the voice recognition system for recognition; if it is judged to be device-output voice, it is discarded and not delivered for recognition.
Determining whether the voice data is device-output voice means determining whether the voice data originates from the voice emitted by the device itself. Taking the case where an echo cancellation algorithm is applied to the device-output voice during voice interaction, determining whether the voice data is device-output voice amounts to determining whether it is the voice remaining after echo cancellation of the device's output, that is, residual echo data.
Specifically, whether voice data is device-output voice can be judged in various ways. For example, it can be judged by checking whether acoustic characteristics of the voice data (such as frequency, timbre, and pitch) match the device's output voice, or by detecting whether the voice data contains content corresponding to the device's output voice.
In the case where no echo cancellation algorithm is applied to the device-output voice during voice interaction, the present disclosure addresses the influence on voice interaction of the echo of the complete device-output voice, whose duration is generally comparable to that of the prompt voice. The device-output voice is known to the device; that is, the device knows how long its output voice lasts. Taking the case where the device-output voice is a wake-up prompt voice, the prompt is generally a preset response word or short phrase, so its duration is limited and usually a fixed value.
In the case where an echo cancellation algorithm is applied to the device-output voice during voice interaction, the present disclosure addresses the influence on voice interaction of the residual echo (i.e., the remaining device-output voice) left after echo cancellation. The duration of this residual echo is generally bounded by the quality of the echo cancellation applied to the voice output by the voice interaction device: the more accurate the echo cancellation algorithm, the shorter the residual echo.
Generally, the duration of voice uttered by the user during the voice interaction process is longer than the duration of voice output by the device. Therefore, the present disclosure proposes a simple and practical way to determine whether the voice data is output by the device.
As an example, a threshold may be set according to the duration of the device-output voice and/or the quality index of the echo cancellation applied to it; for convenience of distinction, this threshold is referred to as the first threshold. Voice activity detection is used to determine whether the duration of voice data in the collected audio (such as voice data after voice wake-up) exceeds the first threshold. If the duration is less than or equal to the first threshold, the voice data can be considered device-output voice (such as the residue left after echo cancellation) and is not sent to the voice recognition system for recognition. If the duration is greater than the first threshold, the voice data can be considered not to be device-output voice, that is, voice data corresponding to valid recognition content uttered by the user, and it is sent to the voice recognition system for recognition. The first threshold thus characterizes the maximum expected duration of echo data (such as residual echo) of the device-output voice under normal conditions.
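The decision rule above amounts to a simple duration gate. The following sketch illustrates it; the threshold value and function name are illustrative assumptions, not values from the patent:

```python
# Minimal sketch of the "first threshold" duration gate described above.
# The 0.5 s value is an assumed placeholder; the patent leaves the
# concrete value to the implementation (prompt duration / AEC quality).

FIRST_THRESHOLD_S = 0.5


def should_recognize(voice_duration_s: float) -> bool:
    """Return True if a detected voice segment should be sent to ASR.

    Segments no longer than the first threshold are treated as echo
    data (e.g., residual echo of the device's prompt voice) and dropped.
    """
    return voice_duration_s > FIRST_THRESHOLD_S


# A 0.3 s segment (likely residual prompt echo) is dropped,
# while a 2.1 s segment (likely a user command) is passed on.
print(should_recognize(0.3))  # False
print(should_recognize(2.1))  # True
```

Note that a segment exactly equal to the threshold is discarded, matching the "less than or equal to" condition in the text.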
Therefore, in a voice wake-up scenario where the user does not speak a voice interaction instruction immediately after wake-up, even if echo cancellation of the prompt voice leaves a residue, that residue will not start the recognition process and thus cannot affect speech recognition. In the scenario where the user speaks the instruction immediately after wake-up, i.e., a one-shot interaction of wake-up word plus voice interaction instruction, voice activity detection ensures that the instruction is sent to the voice recognition system in full. Here, a one-shot interaction scenario means that the user wakes the device and interacts with it by speaking, in a single utterance, a voice that includes both the wake-up word and the voice interaction instruction.
Further details regarding the present disclosure are provided below.
FIG. 1 shows a schematic flow chart diagram of a voice interaction method according to one embodiment of the present disclosure. The method shown in fig. 1 may be performed by a voice interaction device supporting a voice interaction function, wherein the voice interaction device may be, but is not limited to, a smart television, an air conditioner, a sound box, a vehicle-mounted voice interaction device, a voice ticket purchaser, and the like.
Referring to fig. 1, audio data is collected at step S110. Audio data may be collected by a pickup module in the voice interaction device.
In step S120, voice activity detection is performed on the collected audio data.
Voice Activity Detection (VAD), also referred to as voice endpoint detection, identifies where voice is present and where it is absent in an audio signal. Voice activity detection can be used here to detect whether voice data exists in the captured audio data and, if so, the start time and end time of that voice data.
If the detection result of step S120 is that there is no voice data, step S140 may be executed to directly discard the audio data, that is, not perform voice recognition on the audio data. This is because speech recognition on the audio data does not lead to a valid recognition result at this time, and therefore further speech recognition on the audio data is not required.
If the detection result in step S120 is that there is voice data, step S130 may be executed to determine whether the duration of the voice data is greater than the first threshold. The starting time and the ending time of the voice data in the audio data can be determined through voice activity detection, and the time length of the voice data is obtained by subtracting the starting time from the ending time.
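Steps S120 and S130 can be illustrated with a toy energy-based VAD that returns the start and end times of the first voice region and derives the duration as end time minus start time. This is a deliberately simplified sketch; real systems use more robust VAD, and the frame size and energy threshold here are assumptions:

```python
# Illustrative energy-based VAD: locate the first voice region in a mono
# PCM signal (list of floats) and compute its duration as end - start,
# as described for steps S120/S130. Frame size and energy threshold are
# assumed values for demonstration only.

def detect_voice_segment(samples, sample_rate, frame_ms=20, energy_thresh=0.01):
    """Return (start_s, end_s, duration_s) of the voice region, or None."""
    frame_len = int(sample_rate * frame_ms / 1000)
    active = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        active.append(energy > energy_thresh)
    if not any(active):
        return None  # no voice data: the audio can be discarded directly (S140)
    start_f = active.index(True)                       # first active frame
    end_f = len(active) - 1 - active[::-1].index(True) # last active frame
    start_s = start_f * frame_ms / 1000
    end_s = (end_f + 1) * frame_ms / 1000
    return start_s, end_s, end_s - start_s
```

For example, 0.1 s of silence followed by 0.2 s of signal at amplitude 0.5 and more silence yields a start time of 0.1 s and a duration of 0.2 s, which is then compared against the first threshold.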
Before the method shown in fig. 1 is performed, the first threshold may be set based on the duration of the prompt voice.
In the case where no echo cancellation algorithm is applied to the device-output voice during voice interaction, the present disclosure addresses the influence on voice interaction of the echo of the complete device-output voice, whose duration is generally comparable to that of the prompt voice. The first threshold may then be set comparable to the duration of the prompt voice: equal to it, slightly smaller, or slightly larger.
In the case where an echo cancellation algorithm is applied to the device-output voice during voice interaction, the present disclosure addresses the influence on voice interaction of the residual echo (i.e., the remaining device-output voice) left after echo cancellation. The duration of this residual echo is generally bounded by the quality of the echo cancellation: the more accurate the algorithm, the shorter the residue. Therefore, the first threshold may be set according to a quality index of the echo cancellation applied to the voice output by the voice interaction device, such as the known accuracy of the echo cancellation algorithm or the typical residual-echo duration obtained empirically.
As an example, in the case where an echo cancellation algorithm is applied to the device-output voice during voice interaction, the first threshold may be set by combining the duration of the voice output by the voice interaction device with the quality index of the echo cancellation applied to that voice.
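The two threshold-setting cases above can be sketched in one small helper. The margin-free rule and the 0.3 s "typical residual echo" figure are illustrative assumptions, not values given in the patent:

```python
# Hedged sketch of choosing the first threshold, per the two cases above.
# Without AEC, the echo lasts roughly as long as the prompt itself, so the
# threshold is set comparable to the prompt duration. With AEC, only a
# short residue survives, so the threshold reflects an empirically
# observed residual-echo duration (a quality index of the cancellation).
# The default residual figure is an assumption for illustration.

def first_threshold(prompt_duration_s: float,
                    aec_enabled: bool,
                    typical_residual_s: float = 0.3) -> float:
    """Pick the duration threshold used to reject prompt-voice echo."""
    if not aec_enabled:
        # Full echo case: threshold comparable to the prompt duration.
        return prompt_duration_s
    # Residual echo case: bounded by AEC quality, never above the prompt.
    return min(typical_residual_s, prompt_duration_s)
```

With a 1.0 s prompt, the threshold is 1.0 s without AEC and 0.3 s with AEC under these assumed figures.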
If the duration of the voice data is less than or equal to the first threshold as a result of the determination in step S130, the current voice may be considered as the residual prompt voice, and step S140 may be executed to discard the voice data and not to deliver the voice data to the voice recognition system for voice recognition.
If the determination result in the step S130 is that the duration of the voice data is greater than the first threshold, it may be determined that the current voice is the voice data corresponding to the valid recognition content, and at this time, the step S150 may be executed to deliver the voice data to the voice recognition system for voice recognition.
The speech recognition system can be deployed on the device side or the server side. Taking server-side deployment as an example, not delivering voice data for recognition means not sending it to the server; correspondingly, delivering voice data for recognition means sending it to the server, which then performs speech recognition on it.
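Under server-side deployment, the gate thus also saves an upload. The sketch below illustrates this; `send_to_server` is a hypothetical stand-in for the real transport, not an API from the patent:

```python
# Sketch of the client-side gate when recognition is deployed server-side:
# segments at or below the first threshold are never uploaded.
# `send_to_server` is a hypothetical stub for the real upload path.

def send_to_server(voice_bytes: bytes) -> None:
    # Stand-in for the actual transport (e.g., WebSocket or HTTP upload).
    print(f"uploading {len(voice_bytes)} bytes for recognition")


def handle_segment(voice_bytes: bytes, duration_s: float,
                   first_threshold_s: float) -> bool:
    """Upload the segment for recognition only if it exceeds the threshold.

    Returns True if the segment was sent, False if it was discarded as
    echo data of the device's own output voice.
    """
    if duration_s <= first_threshold_s:
        return False  # likely residual prompt echo: drop, no upload
    send_to_server(voice_bytes)
    return True
```

A short segment is dropped locally with no network traffic, while a longer user utterance is uploaded in full.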
Therefore, when the device outputs voice during voice interaction, the duration of the voice data detected by voice activity detection can be compared with a preset threshold (the first threshold), and whether to deliver the voice data to the voice recognition system can be decided from the comparison result, thereby reducing the influence of the echo data of the device's output voice on speech recognition.
The voice interaction method disclosed by the invention can be suitable for scenes that equipment outputs voice in various voice interaction processes, such as but not limited to a voice awakening scene that the equipment outputs prompt voice after voice awakening, a scene that the equipment outputs voice in response to the triggering of a specific condition, a conversation scene that the equipment makes one or more rounds of conversations with a user through voice, and the like.
Taking the voice wake-up scenario as an example, voice wake-up detection may be performed on the audio data acquired in step S110 to detect whether a designated wake-up word is present; if so, wake-up is considered successful and a prompt voice is output. Voice wake-up detection itself can follow the prior art and, not being the focus of the present disclosure, is not described in detail. The voice activity detection step (step S120) may then operate on the audio data collected after voice wake-up.
Optionally, the audio data collected after voice wake-up may also undergo echo cancellation. Here, an Acoustic Echo Cancellation (AEC) algorithm is mainly used to cancel echoes of the prompt voice played after wake-up, filtering out the echo data corresponding to the prompt voice from the collected audio. The specific implementation of echo cancellation can follow the prior art and, not being the focus of the present disclosure, is not detailed here. In this case, the voice activity detection step (step S120) refers to performing voice activity detection on the audio data after echo cancellation.
Because the prompt voice is a sudden, short utterance, the echo cancellation operation cannot eliminate it completely; a residual part of the prompt voice remains, and this residue, if it enters the voice recognition system, also adversely affects normal voice interaction. For this reason, the influence of the residual prompt voice on voice interaction can be reduced by performing steps S120 to S150. The specific implementation follows the description above and is not repeated here.
Therefore, in the scenario where no voice interaction instruction follows wake-up, even if the audio collected after wake-up still contains residual prompt voice after front-end echo cancellation, or no echo cancellation was performed at all, neither the residual prompt voice nor the echo of the prompt voice will trigger speech recognition; recognition is not triggered until the user speaks a voice interaction instruction.
In other words, for a scenario in which no voice interaction instruction is issued immediately after wake-up, and regardless of whether echo cancellation is applied to the prompt voice, the voice data in the audio collected after wake-up all belongs to echo data of the prompt voice, and its duration is bounded by the duration of the prompt voice. By setting a threshold (the first threshold) based on the duration of the prompt voice and comparing the duration of the detected voice data against it, voice data below the threshold can be treated as echo data of the prompt voice and excluded from recognition. If new voice data is detected afterwards, it can be treated as a voice interaction instruction from the user and delivered directly to the voice recognition system for recognition.
FIG. 2 shows a schematic diagram of the present disclosure in a scenario where a voice interaction instruction is not issued immediately after wake-up. As shown in fig. 2, taking a smart speaker as an example, the user wakes the speaker by speaking a wake-up word ("small H"), and the speaker outputs a short prompt voice after being woken. The echo of the prompt voice is captured again by the speaker; the method determines that the duration of this echo data is below the threshold (the first threshold) and discards it without recognition, so the echo of the prompt voice does not affect voice interaction.
At some moment after the speaker outputs the prompt voice, the user issues a voice interaction instruction such as "help me play Zhou Jieren nunchaku". After the speaker captures this voice data, the method determines that its duration exceeds the threshold (the first threshold), so the voice data is recognized and normal voice interaction is unaffected.
For a scenario in which the voice interaction instruction is issued immediately after wake-up, that is, a one-shot interaction of wake-up word plus voice interaction instruction in a single utterance, the device generally outputs the prompt voice after detecting that the user has stopped speaking. For example, if voice activity detection on the audio collected after voice wake-up detects no voice data for longer than a predetermined duration (a third threshold), the user's voice input can be considered finished. The device may output the prompt voice as soon as the user's input ends, or output a more targeted prompt after the user's input has been recognized.
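The end-of-input rule using the third threshold can be sketched as follows. Per-frame voice flags are assumed to come from a VAD front end; the frame size and threshold value are illustrative assumptions:

```python
# Sketch of the end-of-input rule above: after voice activity, if no
# further voice is detected for longer than a third threshold, the
# user's input is considered finished. `vad_flags` is a per-frame list
# of booleans from VAD (True = voice present), oldest to newest.
# Frame size and threshold values are assumptions for illustration.

def input_finished(vad_flags, frame_ms=20, third_threshold_s=0.6):
    """Return True once trailing silence exceeds the third threshold."""
    trailing_silence = 0
    for flag in reversed(vad_flags):
        if flag:
            break  # most recent voice frame reached
        trailing_silence += 1
    return trailing_silence * frame_ms / 1000 > third_threshold_s
```

With 20 ms frames, 40 trailing silent frames (0.8 s) exceed a 0.6 s threshold and signal end of input, while 10 trailing silent frames (0.2 s) do not.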
In the one-shot mode, the voice data uttered by the user is generally long, while the echo data of the prompt voice (especially the residue left after echo cancellation) is short. The method of comparing the duration of the VAD-detected voice data with the first threshold therefore also applies to one-shot interaction scenarios.
That is, in a one-shot interaction scenario the present disclosure can still distinguish mere prompt-voice echo residue from the user's one-shot voice data: the echo residue is discarded without recognition, while the user's voice data is recognized.
As an example, in a one-shot interaction scenario, if the device outputs a prompt voice immediately after the user utters the voice containing the wake-up word and the voice interaction instruction (i.e., before the user's voice is recognized), the prompt voice may be a content-neutral response such as "OK" or "please wait a moment". The audio data detected after voice wake-up and before the device outputs the prompt voice may be handed to the voice recognition system for voice recognition. For example, the prompt voice may be output (immediately, or within a very short time) in response to detecting the wake-up word and then, while performing voice activity detection on the audio data collected after voice wake-up, detecting no voice data for a duration exceeding a third threshold; the voice data detected after voice wake-up and before the prompt voice is output is handed to the voice recognition system for voice recognition.
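The timing in this example can be sketched as follows. The frame length and threshold values are illustrative assumptions, not values from the disclosure: the prompt is emitted only once the post-wake-up audio has been silent for longer than the third threshold, and the speech collected before the prompt goes to recognition.

```python
THIRD_THRESHOLD = 0.5  # seconds of silence that triggers the prompt (assumed)
FRAME_LEN = 0.02       # 20 ms VAD frames (assumed)

def handle_one_shot(vad_frames):
    """vad_frames: per-frame (frame_id, is_voice) pairs collected after wake-up.

    Returns (speech_frame_ids, prompt_emitted): the speech captured before the
    prompt (to be handed to recognition) and whether the silence threshold
    was reached, i.e. whether the prompt voice may now be output.
    """
    speech, silence = [], 0.0
    for frame_id, is_voice in vad_frames:
        if is_voice:
            speech.append(frame_id)
            silence = 0.0            # any voice resets the silence timer
        else:
            silence += FRAME_LEN
            if silence > THIRD_THRESHOLD:
                return speech, True  # output the prompt; recognize `speech`
    return speech, False

# 20 voiced frames then 30 silent frames (0.6 s of silence): the prompt
# fires, and the 20 voiced frames are kept for recognition.
frames = [(i, True) for i in range(20)] + [(i, False) for i in range(20, 50)]
speech, prompted = handle_one_shot(frames)
assert prompted and len(speech) == 20
```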
In a one-shot interaction scenario, if the device outputs the prompt voice as soon as it detects the wake-up word, the echo of the prompt voice becomes mixed with the user's voice interaction instruction. In that case the voice data in the audio collected after wake-up is a mixture of the voice interaction instruction and the prompt voice (e.g., residual prompt voice), and its duration is generally greater than the preset threshold (i.e., the first threshold).
Therefore, if the duration of the voice data in the audio collected after wake-up is greater than the first threshold, the voice data can be handed to the voice recognition system for voice recognition, so that the complete voice interaction instruction reaches the voice recognition system.
The above analysis shows that comparing the duration of the voice data detected by voice activity detection against the first threshold is applicable to a variety of one-shot interaction scenarios.
Fig. 3 shows a schematic diagram in a one-shot interaction scenario according to an embodiment of the present disclosure.
As shown in fig. 3, the wake-up word of the smart speaker may be "small H". The user may say in one breath "small H, small H, help me play Jay Chou's 'Nunchucks'", which is voice data containing both the wake-up word and a voice interaction instruction.
After collecting the voice uttered by the user, the smart speaker first detects the wake-up word through voice wake-up detection, so voice wake-up succeeds. It then detects the presence of voice data through voice activity detection, and determines that the user has finished the voice input once no voice data has been detected for longer than the predetermined duration. When the user's voice input is judged to have ended, the voice data detected after voice wake-up (or the audio data collected after voice wake-up) can be handed directly to the voice recognition system for recognition.
When the user's voice input is judged to have ended, the smart speaker may output a content-neutral response prompt voice such as "OK" or "please wait a moment". The echo data of this prompt voice can be discarded directly, without voice recognition, using the method shown in fig. 1.
As an optional embodiment of the present disclosure, for a scenario in which the voice interaction instruction is not issued immediately after voice wake-up, for example a scenario in which the user first speaks the wake-up word and only speaks the voice interaction instruction after hearing the prompt voice, step S130 need not be executed; instead, voice data within a certain period after voice wake-up succeeds may be discarded directly and not sent to the voice recognition system. In that case, for voice data collected after voice wake-up, it may be determined whether the voice data falls within a predetermined period after voice wake-up: if it does, the voice data is not handed to the voice recognition system for voice recognition, and/or if it does not, the voice data is handed to the voice recognition system for voice recognition. The predetermined period may be set according to actual conditions, for example 10 s after voice wake-up.
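This optional time-window variant can be sketched as follows; the 10 s window follows the example in the text, while the function and parameter names are illustrative assumptions:

```python
# Example window from the text; in practice it would be tuned to roughly
# cover the prompt voice and its echo.
WINDOW_S = 10.0

def deliver_to_asr(wakeup_time_s: float, capture_time_s: float) -> bool:
    """Time-window variant: drop any voice data captured within WINDOW_S
    of the voice wake-up, and recognize everything captured after it."""
    return (capture_time_s - wakeup_time_s) > WINDOW_S

assert deliver_to_asr(0.0, 4.0) is False   # inside the window: discarded
assert deliver_to_asr(0.0, 12.5) is True   # after the window: recognized
```

Note that, unlike the duration comparison of step S130, this variant needs no per-segment duration measurement at all; it trades that simplicity for a fixed dead time after wake-up.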
Taking as an example the application of the present disclosure to a scenario in which the device actively outputs voice when a voice output condition is satisfied, the voice activity detection step (i.e., step S120) described above in conjunction with fig. 1 may refer to performing voice activity detection on the audio data collected after the moment at which the voice output starts. The voice output condition may include, but is not limited to, a device abnormality event, a user misoperation event, and other events in which the user needs to be reminded through output voice. After hearing the voice output by the device, the user can interact with the device by speaking.
Taking the application of the present disclosure to a scenario in which a device has a conversation with a user through speech as an example, the device output speech may refer to a speech output by the device in response to a speech input by the user, and the speech activity detection step (i.e., step S120) described above in connection with fig. 1 may refer to performing speech activity detection on audio data collected after a time when the speech output starts.
So far, the details of the voice interaction method of the present disclosure have been explained.
The voice interaction method of the present disclosure may also be performed by a voice interaction device. FIG. 4 shows a schematic structural diagram of a voice interaction device according to one embodiment of the present disclosure. In the following, functional modules that the voice interaction device can have and operations that each functional module can perform are briefly described, and for the details related thereto, reference may be made to the above-mentioned related description, which is not described herein again.
It should be noted that, in the drawings described in the present disclosure, the connection lines between different functional modules are used to represent that there is data interaction between the functional modules, and the data interaction between the functional modules may be implemented in a wired or wireless manner. Moreover, for two functional modules with data interaction, the two functional modules can directly communicate to realize data interaction, and the data interaction can also be indirectly realized through other functional modules. The present disclosure is not limited thereto.
Referring to fig. 4, the voice interaction apparatus 400 includes a sound pickup module 410, a voice activity detection module 420, a determination module 430, and a data processing module 440.
the pickup module 410 is used to collect audio data. The voice activity detection module 420 is configured to perform voice activity detection on the audio data collected by the sound pickup module 410. The determining module 430 is configured to determine whether the duration of the voice data detected by the voice activity detecting module 420 is greater than a first threshold. The data processing module 4400 is configured to not deliver the voice data to the voice recognition system for voice recognition if the duration is less than or equal to the first threshold. The voice recognition system may be located in the voice interaction device 400 or may be located in the server.
As an example, the voice interaction device 400 may further include: the voice awakening detection module is used for carrying out voice awakening detection on the audio data collected by the pickup module; the first output module is configured to output a prompt voice in response to the detection of the voice wake-up detection module detecting the wake-up word, where the voice activity detection module 420 may be specifically configured to perform voice activity detection on the audio data acquired by the pickup module 410 after the voice wake-up.
Optionally, the first output module may be specifically configured to output the prompt voice in response to the voice wake-up detection module detecting a wake-up word and the voice activity detection module 420, while performing voice activity detection on the audio data collected by the sound pickup module 410 after voice wake-up, detecting no voice data for a duration exceeding a third threshold. The data processing module 440 may also be configured to hand the voice data detected by the voice activity detection module after voice wake-up and before the first output module outputs the prompt voice to the voice recognition system for voice recognition.
Optionally, the voice interaction apparatus 400 may further include: and an echo cancellation module, configured to perform echo cancellation on the audio data acquired by the pickup module 410 after the voice wake-up to filter a prompt voice in the audio data, where the voice activity detection module 420 is specifically configured to perform voice activity detection on the audio data filtered by the echo cancellation module.
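A minimal sketch of this ordering, with hypothetical aec and vad callables standing in for the echo cancellation module and the voice activity detection module (neither name is an API from the disclosure):

```python
def pipeline(mic_audio, prompt_reference, aec, vad):
    """Run echo cancellation before voice activity detection.

    aec(mic_audio, reference) -> audio with the known prompt signal removed
    vad(audio)                -> detected voice segments in the audio
    """
    filtered = aec(mic_audio, prompt_reference)  # filter out the prompt voice
    return vad(filtered)                         # detect remaining voice

# Toy stand-ins: audio is a list of numbers; "echo" samples equal the
# reference samples, and nonzero samples count as voice.
aec = lambda mic, ref: [s for s in mic if s not in ref]
vad = lambda audio: [s for s in audio if s != 0]
assert pipeline([0, 7, 7, 3], prompt_reference=[7], aec=aec, vad=vad) == [3]
```

The point of the ordering is that the VAD then mostly sees either the user's voice or short echo residue, which is exactly what the first-threshold comparison is designed to separate.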
As an example, the voice interaction device 400 may further include: a second output module for outputting voice when the voice output condition is satisfied, and the voice activity detection module 420 may specifically perform voice activity detection on the audio data collected by the sound pickup module 410 after the time when the second output module starts to output voice.
As an example, the voice interaction device 400 may further include: a third output module configured to output a reply voice for a voice input, where the voice activity detection module 420 may be specifically configured to perform voice activity detection on the audio data collected by the sound pickup module 410 after the moment at which the third output module starts outputting the voice.
The data processing module 440 may be further configured to deliver the voice data to a voice recognition system for voice recognition if the duration is greater than the first threshold.
Fig. 5 shows a schematic structural diagram of a voice interaction device according to another embodiment of the present disclosure. As shown in fig. 5, the voice interaction apparatus 500 includes a pickup module 510, a processor 520, and an output module 530.
The sound pickup module 510 collects audio data. The processor 520 performs voice wake-up detection on the audio data collected by the sound pickup module. In response to a wake-up word being detected, the output module 530 outputs a prompt voice. The processor 520 further performs voice activity detection on the audio data collected by the sound pickup module 510 after voice wake-up, determines, when voice data is detected, whether the voice data is the prompt voice (i.e., its echo), and decides, according to that determination, whether to hand the voice data to the voice recognition system for voice recognition.
As an example, the output module 530 may be specifically configured to output the prompt voice in response to the processor 520 detecting a wake-up word and, while performing voice activity detection on the audio data collected by the sound pickup module 510 after voice wake-up, detecting no voice data for a duration exceeding a third threshold. The processor 520 may also be configured to hand the voice data detected after voice wake-up and before the output module 530 outputs the prompt voice to the voice recognition system for voice recognition.
Optionally, the processor 520 may further be configured to perform echo cancellation on the audio data acquired by the sound pickup module 510 after the voice wake-up to filter a prompt voice in the audio data, where the processor 520 is specifically configured to perform voice activity detection on the audio data after the echo cancellation processing.
Processor 520 may be further configured to pass the speech data to a speech recognition system for speech recognition if the duration is greater than a first threshold.
The voice interaction method of the present disclosure may also be performed by a smart device. The intelligent device can be, but is not limited to, an electronic device such as a smart phone, a smart television, a smart refrigerator, a smart air conditioner, a smart sound box, a voice ticket buying machine, and the like.
Fig. 6 shows a schematic structural diagram of a smart device according to an embodiment of the present disclosure. In the following, functional modules that the intelligent device may have and operations that each functional module may perform are briefly described, and details related thereto may be referred to the above description, which is not repeated herein.
Referring to fig. 6, the smart device 600 includes a pickup module 610 and a processor 620.
The pickup module 610 is used to collect audio data. The processor 620 is configured to perform voice activity detection on the audio data collected by the pickup module, determine whether a duration of the detected voice data is greater than a first threshold, and if the duration is less than or equal to the first threshold, not deliver the voice data to the voice recognition system for voice recognition. The voice recognition system may be located in the smart device 600 or may be located in the server.
As an example, the processor 620 may be further configured to perform voice wake-up detection on the audio data collected by the sound pickup module 610, and the smart device 600 may further include an output module configured to output a prompt voice in response to the processor 620 detecting a wake-up word, where the processor 620 may be specifically configured to perform voice activity detection on the audio data collected by the sound pickup module 610 after the voice wake-up.
Optionally, the output module may be specifically configured to output the prompt voice in response to the processor 620 detecting a wake-up word and, while performing voice activity detection on the audio data collected by the sound pickup module 610 after voice wake-up, detecting no voice data for a duration exceeding the third threshold. The processor 620 may be further configured to hand the voice data detected after voice wake-up and before the output module outputs the prompt voice to a voice recognition system for voice recognition.
Optionally, the processor 620 may be further configured to perform echo cancellation on the audio data collected by the sound pickup module 610 after voice wake-up so as to filter out the prompt voice in the audio data, where the processor 620 may be specifically configured to perform voice activity detection on the audio data after the echo cancellation filtering.
As an example, the smart device 600 may further include: a second output module for outputting voice when the voice output condition is satisfied, and the processor 620 may specifically perform voice activity detection on the audio data collected by the sound pickup module 610 after the time when the second output module starts outputting voice.
As an example, the smart device 600 may further include: a third output module configured to output a reply voice for a voice input, and the processor 620 may be specifically configured to perform voice activity detection on the audio data collected by the sound pickup module 610 after the moment at which the third output module starts outputting the voice.
The processor 620 may be further configured to forward the voice data to a voice recognition system for voice recognition if the duration is greater than the first threshold.
The voice interaction method can also be executed by vehicle-mounted equipment (such as a vehicle-mounted central control computer). Fig. 7 shows a schematic configuration diagram of an in-vehicle apparatus according to an embodiment of the present disclosure. In the following, functional modules that the vehicle-mounted device may have and operations that each functional module may perform are briefly described, and for details related thereto, reference may be made to the above-mentioned related description, which is not described herein again.
Referring to fig. 7, the in-vehicle apparatus 700 includes a pickup module 710 and a processor 720. The pickup module 710 is used for collecting audio data; the processor 720 is configured to perform voice activity detection on the audio data collected by the pickup module, determine whether the duration of the detected voice data is greater than a first threshold, and if the duration is less than or equal to the first threshold, not deliver the voice data to the voice recognition system for voice recognition.
The speech recognition system may be located in the in-vehicle device 700 or in the server.
As an example, the processor 720 may further be configured to perform voice wake-up detection on the audio data collected by the sound pickup module 710, and the vehicle device 700 may further include an output module configured to output a prompt voice in response to the processor 720 detecting a wake-up word, where the processor 720 may be specifically configured to perform voice activity detection on the audio data collected by the sound pickup module 710 after the voice wake-up.
Optionally, the output module may be specifically configured to output the prompt voice in response to the processor 720 detecting a wake-up word and, while performing voice activity detection on the audio data collected by the sound pickup module 710 after voice wake-up, detecting no voice data for a duration exceeding the third threshold. The processor 720 may be further configured to hand the voice data detected after voice wake-up and before the output module outputs the prompt voice to a voice recognition system for voice recognition.
Optionally, the processor 720 may further be configured to perform echo cancellation on the audio data collected by the sound pickup module 710 after the voice wake-up to filter the prompt voice in the audio data, where the processor 720 may be specifically configured to perform voice activity detection on the audio data after the echo cancellation filtering.
As an example, the in-vehicle apparatus 700 may further include: a second output module for outputting voice when the voice output condition is satisfied, and the processor 720 may specifically perform voice activity detection on the audio data collected by the sound pickup module 710 after the time when the second output module starts outputting voice.
As an example, the in-vehicle apparatus 700 may further include: a third output module configured to output a reply voice for a voice input, and the processor 720 may be specifically configured to perform voice activity detection on the audio data collected by the sound pickup module 710 after the moment at which the third output module starts outputting the voice.
Processor 720 may be further configured to pass the speech data to a speech recognition system for speech recognition if the duration is greater than a first threshold.
The voice interaction method disclosed by the invention can be applied to a voice chip or a chip module, namely, the voice chip or the chip module can execute the voice interaction method disclosed by the invention. The voice chip or the chip module can be deployed in the electronic equipment to provide the electronic equipment with a voice interaction function.
The voice chip may include a processing module configured to perform voice activity detection on the acquired audio data, determine whether a duration of the detected voice data is greater than a first threshold, and if the duration is less than or equal to the first threshold, not deliver the voice data to the voice recognition system for voice recognition. The audio data may be collected by a voice chip or by a device where the voice chip is located.
The voice interaction method of the present disclosure can also be implemented as a voice interaction apparatus. Fig. 8 shows a schematic structural diagram of a voice interaction device according to an embodiment of the present disclosure. Wherein the functional elements of the voice interaction device can be implemented by hardware, software, or a combination of hardware and software implementing the principles of the present disclosure. It will be appreciated by those skilled in the art that the functional units described in fig. 8 may be combined or divided into sub-units to implement the inventive principles described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional units described herein.
In the following, functional units that the voice interaction apparatus can have and operations that each functional unit can perform are briefly described, and for details related thereto, reference may be made to the above-mentioned related description, which is not described herein again.
Referring to fig. 8, the voice interaction apparatus 800 includes a voice activity detection module 810, a determination module 820, and a processing module 830.
The voice activity detection module 810 is configured to perform voice activity detection on the collected audio data. The determining module 820 is used for determining whether the duration of the detected voice data is greater than a first threshold. The processing module 830 is configured to not deliver the voice data to the voice recognition system for voice recognition if the duration of the voice data is less than or equal to the first threshold. The voice recognition system may be located in the voice interaction apparatus 800 or may be located in the server. The processing module 830 may be further configured to, if the duration is greater than the first threshold, deliver the voice data to a voice recognition system for voice recognition.
The voice interaction apparatus 800 may further include: the voice awakening detection module is used for carrying out voice awakening detection on the acquired audio data; the first output module is configured to output a prompt voice in response to detecting the wake-up word, where the voice activity detection module 810 may be specifically configured to perform voice activity detection on audio data acquired after voice wake-up.
The voice interaction apparatus 800 may further include: the echo cancellation processing module is configured to perform echo cancellation processing on the audio data acquired after the voice wake-up to filter a prompt voice in the audio data, where the voice activity detection module 810 may be specifically configured to perform voice activity detection on the filtered audio data to determine voice data in the audio data.
The first output module may be specifically configured to output the prompt voice in response to the voice wake-up detection module detecting a wake-up word and the voice activity detection module, while performing voice activity detection on the audio data collected after voice wake-up, detecting no voice data for a duration exceeding a third threshold. The processing module may also be configured to hand the voice data detected by the voice activity detection module after voice wake-up and before the first output module outputs the prompt voice to the voice recognition system for voice recognition.
The voice interaction apparatus 800 may further include: a second output module configured to output voice when the voice output condition is satisfied, where the voice activity detection module may be specifically configured to perform voice activity detection on the audio data collected after the moment at which the second output module starts outputting the voice.
The voice interaction apparatus 800 may further include: a third output module configured to output a reply voice for a voice input, where the voice activity detection module may be specifically configured to perform voice activity detection on the audio data collected after the moment at which the third output module starts outputting the voice.
Fig. 9 shows a schematic structural diagram of a voice interaction device according to another embodiment of the present disclosure. Wherein the functional elements of the voice interaction device can be implemented by hardware, software, or a combination of hardware and software implementing the principles of the present disclosure. It will be appreciated by those skilled in the art that the functional units described in fig. 9 may be combined or divided into sub-units to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional units described herein.
In the following, functional units that the voice interaction apparatus can have and operations that each functional unit can perform are briefly described, and for details related thereto, reference may be made to the above-mentioned related description, which is not described herein again.
Referring to fig. 9, the voice interaction apparatus 900 includes an obtaining module 910, a determining module 920, and a processing module 930.
The obtaining module 910 is configured to obtain voice data. The judging module 920 is configured to judge whether the voice data is voice output by the device. The processing module 930 is configured to determine, according to the judgment result, whether to hand the voice data to the voice recognition system for voice recognition.
The obtaining module 910 may include a voice activity detection module for performing voice activity detection on the collected audio data. The determining module 920 may be specifically configured to determine whether the duration of the voice data detected by the voice activity detecting module is greater than a first threshold. The processing module 930 may be specifically configured to not deliver the voice data to the voice recognition system for voice recognition if the duration of the voice data is less than or equal to the first threshold. The voice recognition system may be located in the voice interaction apparatus 900, or may be located in the server. The processing module 930 may be further configured to, if the duration is greater than the first threshold, forward the voice data to the voice recognition system for voice recognition.
The voice interaction apparatus 900 may further include: the voice awakening detection module is used for carrying out voice awakening detection on the acquired audio data; the voice activity detection module can be specifically used for carrying out voice activity detection on audio data acquired after voice awakening.
The voice interaction apparatus 900 may further include: an echo cancellation processing module configured to perform echo cancellation processing on the audio data collected after voice wake-up so as to filter out the prompt voice in the audio data, where the voice activity detection module may be specifically configured to perform voice activity detection on the filtered audio data to determine the voice data therein.
The first output module may be specifically configured to output the prompt voice in response to the voice wake-up detection module detecting a wake-up word and the voice activity detection module, while performing voice activity detection on the audio data collected after voice wake-up, detecting no voice data for a duration exceeding a third threshold. The processing module may also be configured to hand the voice data detected by the voice activity detection module after voice wake-up and before the first output module outputs the prompt voice to the voice recognition system for voice recognition.
The voice interaction apparatus 900 may further include: a second output module configured to output voice when the voice output condition is satisfied, where the voice activity detection module may be specifically configured to perform voice activity detection on the audio data collected after the moment at which the second output module starts outputting the voice.
The voice interaction apparatus 900 may further include: a third output module configured to output a reply voice for a voice input, where the voice activity detection module may be specifically configured to perform voice activity detection on the audio data collected after the moment at which the third output module starts outputting the voice.
FIG. 10 is a schematic structural diagram of a computing device that can be used to implement the voice interaction method according to an embodiment of the present disclosure.
Referring to fig. 10, the computing device 1000 includes a memory 1010 and a processor 1020.
The processor 1020 may be a multi-core processor or may include multiple processors. In some embodiments, processor 1020 may include a general-purpose host processor and one or more special purpose coprocessors such as a Graphics Processor (GPU), Digital Signal Processor (DSP), or the like. In some embodiments, processor 1020 may be implemented using custom circuits, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions needed by the processor 1020 or other modules of the computer. The persistent storage may be a read-write storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at runtime. Further, the memory 1010 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks, and/or optical disks. In some embodiments, the memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, mini SD card, or Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 1010 has stored thereon executable code that, when processed by the processor 1020, may cause the processor 1020 to perform the voice interaction methods described above.
The voice interaction method, apparatus and device according to the present invention have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (26)
1. A voice interaction device, comprising:
the pickup module is used for collecting audio data;
the voice activity detection module is used for carrying out voice activity detection on the audio data collected by the pickup module;
the judging module is used for judging whether the duration of the voice data detected by the voice activity detecting module is greater than a first threshold value;
and the data processing module is used for not delivering the voice data to a voice recognition system for voice recognition if the duration is less than or equal to the first threshold.
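Outside the claim language, the duration-gating behavior of claim 1 can be illustrated with a minimal sketch. The function name and the 300 ms default for the first threshold are illustrative assumptions only; the patent does not fix a value.

```python
def should_recognize(segment_ms: int, first_threshold_ms: int = 300) -> bool:
    """Hand a detected voice segment to the recognizer only if its
    duration exceeds the first threshold; shorter segments (likely
    noise bursts or echo residue) are discarded.

    `first_threshold_ms=300` is a hypothetical value for illustration.
    """
    return segment_ms > first_threshold_ms
```

For example, a 120 ms blip would be dropped while an 800 ms utterance would be forwarded; a segment exactly at the threshold is also dropped, matching the claim's "less than or equal to" wording.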
2. The voice interaction device of claim 1, further comprising:
the voice wake-up detection module is used for carrying out voice wake-up detection on the audio data collected by the pickup module;
and the first output module is used for outputting a prompt voice in response to the voice wake-up detection module detecting a wake-up word, wherein the voice activity detection module is specifically used for performing voice activity detection on the audio data collected by the pickup module after voice wake-up.
3. The voice interaction device of claim 2, further comprising:
the echo cancellation module is used for carrying out echo cancellation on the audio data acquired by the pickup module after voice awakening so as to filter prompt voice in the audio data, wherein the voice activity detection module is specifically used for carrying out voice activity detection on the audio data filtered by the echo cancellation module.
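The echo cancellation module of claim 3 removes the device's own prompt voice from the captured audio before voice activity detection runs. The patent does not specify an algorithm; as an illustration only, a minimal normalized-LMS (NLMS) adaptive canceller over the playback reference might look like this, with all names and parameters being assumptions:

```python
import math

def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-8):
    """Adaptive NLMS filter: estimates the component of the microphone
    signal that is an echo of the reference (prompt) signal, and returns
    the residual after subtracting that estimate."""
    w = [0.0] * taps
    residual = []
    for n in range(len(mic)):
        # most recent `taps` reference samples (zero-padded at the start)
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))   # estimated echo
        e = mic[n] - y                             # echo-cancelled sample
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        residual.append(e)
    return residual
```

On a toy signal where the microphone picks up a delayed, attenuated copy of the reference, the residual decays toward zero as the filter converges, which is exactly what lets the downstream VAD ignore the prompt.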
4. The voice interaction device of claim 2,
the first output module is specifically configured to output the prompt voice in response to the voice wake-up detection module detecting a wake-up word and the voice activity detection module detecting no voice data, in the audio data collected by the pickup module after voice wake-up, for a duration exceeding a third threshold,
the data processing module is further configured to deliver the voice data detected by the voice activity detection module after the voice wake-up and before the first output module outputs the prompt voice to a voice recognition system for voice recognition.
5. The voice interaction device of claim 1, further comprising:
a second output module for outputting a voice when a voice output condition is satisfied,
wherein the voice activity detection module is specifically used for performing voice activity detection on the audio data collected by the pickup module after the moment when the second output module starts outputting the voice.
6. The voice interaction device of claim 1, further comprising:
a third output module for outputting a voice of a reply to the voice input,
the voice activity detection module is specifically used for detecting voice activity of audio data collected by the pickup module after the time when the third output module starts outputting voice.
7. The voice interaction device of any of claims 2-6, wherein the first threshold is set based on the following parameters:
the duration of the voice output by the voice interaction equipment; and/or
a quality indicator of the echo cancellation processing performed on the voice output by the voice interaction device.
8. The voice interaction device of claim 1,
and the data processing module is further used for handing the voice data to a voice recognition system for voice recognition if the duration is greater than a first threshold.
9. A voice interaction device, comprising: a sound pickup module, a processor and an output module,
the sound pickup module collects audio data and transmits the audio data,
the processor performs voice wake-up detection on the audio data collected by the pickup module,
in response to detecting the wake-up word, the output module outputs a prompt voice,
the processor further performs voice activity detection on the audio data collected by the pickup module after voice wake-up, determines, in the case that voice data is detected, whether the voice data is the prompt voice, and determines whether to deliver the voice data to a voice recognition system for voice recognition according to the result of that determination.
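One simple way to realize the "is this the prompt voice?" judgment of claim 9, offered purely as an illustrative assumption rather than the patent's method, is to test whether a detected voice segment falls inside the known playback window of the prompt:

```python
def is_prompt_voice(seg_start_ms: int, seg_end_ms: int,
                    prompt_start_ms: int, prompt_end_ms: int,
                    slack_ms: int = 100) -> bool:
    """Treat a detected segment as the device's own prompt if it lies
    (within some slack) inside the prompt playback window.

    `slack_ms=100` is a hypothetical tolerance for clock skew.
    """
    return (seg_start_ms >= prompt_start_ms - slack_ms and
            seg_end_ms <= prompt_end_ms + slack_ms)
```

A segment spanning 50–900 ms during a 0–1000 ms prompt would be classified as the prompt (and withheld from recognition), while a segment starting at 1200 ms would be treated as the user's reply.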
10. A smart device, comprising:
the pickup module is used for collecting audio data;
and the processor is used for carrying out voice activity detection on the audio data collected by the pickup module, judging whether the duration of the detected voice data is greater than a first threshold value or not, and if the duration is less than or equal to the first threshold value, not handing the voice data to a voice recognition system for voice recognition.
11. An in-vehicle apparatus comprising:
the pickup module is used for collecting audio data;
and the processor is used for carrying out voice activity detection on the audio data collected by the pickup module, judging whether the duration of the detected voice data is greater than a first threshold value or not, and if the duration is less than or equal to the first threshold value, not handing the voice data to a voice recognition system for voice recognition.
12. A speech chip comprising:
the processing module is used for carrying out voice activity detection on the collected audio data, judging whether the duration of the detected voice data is greater than a first threshold value or not, and if the duration is less than or equal to the first threshold value, not handing the voice data to a voice recognition system for voice recognition.
13. A voice interaction method, comprising:
carrying out voice activity detection on the collected audio data;
judging whether the duration of the detected voice data is greater than a first threshold value;
and if the duration of the voice data is less than or equal to the first threshold, not submitting the voice data to a voice recognition system for voice recognition.
14. The voice interaction method of claim 13, further comprising:
carrying out voice awakening detection on the collected audio data;
responding to the detection of the awakening word, and outputting prompt voice, wherein the step of performing voice activity detection on the collected audio data comprises the following steps: and carrying out voice activity detection on the audio data collected after voice awakening.
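The ordering in claims 13–14 (wake-word gating first, then voice activity detection only on post-wake-up audio) can be sketched as a toy state machine. The class name and the energy-threshold VAD are deliberate simplifications assumed for illustration, not the patent's detector:

```python
class WakeThenListen:
    """Frames are ignored until a wake word is seen; after wake-up,
    a trivial energy-based VAD labels each frame as voice/non-voice."""

    def __init__(self, vad_threshold: float = 0.1):
        self.awake = False
        self.vad_threshold = vad_threshold

    def feed(self, frame_energy: float, wake_word: bool = False):
        if not self.awake:
            if wake_word:
                self.awake = True   # start listening from the next frame
            return None             # not listening yet: no VAD decision
        return frame_energy > self.vad_threshold
```

Before wake-up, even loud frames yield no VAD decision; after wake-up, frames are classified by energy.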
15. The voice interaction method of claim 14, further comprising:
and performing echo cancellation processing on the audio data acquired after the voice awakening so as to filter prompt voice in the audio data, wherein the step of performing voice activity detection on the audio data acquired after the voice awakening comprises the following steps of: and performing voice activity detection on the filtered audio data to determine voice data in the audio data.
16. The voice interaction method of claim 14, wherein outputting the prompt voice in response to detecting the wake-up word comprises: outputting the prompt voice in response to detecting the wake-up word and to voice activity detection, performed on the audio data collected after voice wake-up, detecting no voice data for a duration greater than a third threshold, the method further comprising:
delivering voice data detected after voice wake-up and before the prompt voice is output to a voice recognition system for voice recognition.
17. The voice interaction method of claim 13, further comprising:
when the prompt voice output condition is met, outputting prompt voice, wherein the step of carrying out voice activity detection on the collected audio data comprises the following steps: and carrying out voice activity detection on the audio data collected after the moment of starting outputting the prompt voice.
18. The voice interaction method of claim 13, further comprising:
outputting a voice of a reply to the voice input, wherein the step of voice activity detection of the captured audio data comprises: and carrying out voice activity detection on audio data collected after the moment of starting outputting the voice.
19. The voice interaction method of claim 13, further comprising:
and if the duration is greater than a first threshold value, the voice data is delivered to a voice recognition system for voice recognition.
20. The voice interaction method of claim 13, further comprising:
judging whether the voice data is voice data within a preset time period after voice awakening;
and if the voice data is the voice data within the preset time period range after voice awakening, the voice data is not delivered to a voice recognition system for voice recognition, and/or if the voice data is not the voice data within the preset time period range after voice awakening, the voice data is delivered to the voice recognition system for voice recognition.
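Claim 20's preset-time-period test can be illustrated by a guard window immediately after wake-up, during which detected voice (likely the device's own prompt) is withheld from recognition. The 1500 ms default is a hypothetical value, not taken from the patent:

```python
def gate_after_wakeup(seg_start_ms: int, wake_ms: int,
                      guard_ms: int = 1500) -> bool:
    """Return True if the segment should be sent to the recognizer,
    i.e. it starts after the guard window following wake-up.

    `guard_ms=1500` is an assumed preset time period for illustration.
    """
    return seg_start_ms - wake_ms > guard_ms
```

With wake-up at t=0, a segment starting at 500 ms is suppressed while one starting at 2000 ms is recognized.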
21. The voice interaction method of claim 13, further comprising:
and if the duration of the voice data is less than or equal to the first threshold and new voice data is acquired after the voice data is acquired, handing the new voice data to a voice recognition system for voice recognition.
22. A voice interaction method, comprising:
acquiring voice data;
judging whether the voice data was output by the device itself;
and determining whether to deliver the voice data to a voice recognition system for voice recognition according to the result of judging whether the voice data was output by the device.
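Claim 22 leaves open how "output by the device" is judged. One illustrative possibility, stated as an assumption rather than the patent's method, is to correlate the captured voice with the device's own playback reference; a high normalized correlation suggests the device is hearing its own output:

```python
import math

def likely_device_output(mic, ref, corr_threshold: float = 0.8) -> bool:
    """Normalized zero-lag cross-correlation between the captured voice
    and the playback reference; above the threshold, classify the
    capture as the device's own output.

    `corr_threshold=0.8` is a hypothetical decision threshold.
    """
    num = sum(m * r for m, r in zip(mic, ref))
    den = math.sqrt(sum(m * m for m in mic) * sum(r * r for r in ref)) or 1.0
    return num / den > corr_threshold
```

A scaled copy of the reference correlates near 1.0 and is flagged as device output; an unrelated signal correlates near 0 and is passed on for recognition.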
23. A voice interaction device, comprising:
the voice activity detection module is used for carrying out voice activity detection on the collected audio data;
the judging module is used for judging whether the duration of the detected voice data is greater than a first threshold value;
and the processing module is used for not delivering the voice data to a voice recognition system for voice recognition if the duration of the voice data is less than or equal to the first threshold.
24. A voice interaction device, comprising:
the acquisition module is used for acquiring voice data;
the judging module is used for judging whether the voice data was output by the device itself;
and the processing module is used for determining whether to deliver the voice data to a voice recognition system for voice recognition according to the result of judging whether the voice data was output by the device.
25. A computing device, comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method of any of claims 13 to 21.
26. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 13 to 21.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011116808.8A CN114446288A (en) | 2020-10-19 | 2020-10-19 | Voice interaction method, device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114446288A true CN114446288A (en) | 2022-05-06 |
Family
ID=81358302
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011116808.8A Pending CN114446288A (en) | 2020-10-19 | 2020-10-19 | Voice interaction method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114446288A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170270919A1 (en) * | 2016-03-21 | 2017-09-21 | Amazon Technologies, Inc. | Anchored speech detection and speech recognition |
CN108962262A (en) * | 2018-08-14 | 2018-12-07 | 苏州思必驰信息科技有限公司 | Voice data processing method and device |
CN109189365A (en) * | 2018-08-17 | 2019-01-11 | 平安普惠企业管理有限公司 | A kind of audio recognition method, storage medium and terminal device |
CN111009239A (en) * | 2019-11-18 | 2020-04-14 | 北京小米移动软件有限公司 | Echo cancellation method, echo cancellation device and electronic equipment |
CN111540357A (en) * | 2020-04-21 | 2020-08-14 | 海信视像科技股份有限公司 | Voice processing method, device, terminal, server and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109473123B (en) | Voice activity detection method and device | |
US9598070B2 (en) | Infotainment system control | |
US8972252B2 (en) | Signal processing apparatus having voice activity detection unit and related signal processing methods | |
CN110010126B (en) | Speech recognition method, apparatus, device and storage medium | |
JP3920097B2 (en) | Voice recognition device for in-vehicle equipment | |
US7610199B2 (en) | Method and apparatus for obtaining complete speech signals for speech recognition applications | |
CN109671426B (en) | Voice control method and device, storage medium and air conditioner | |
US11037574B2 (en) | Speaker recognition and speaker change detection | |
CN107886944B (en) | Voice recognition method, device, equipment and storage medium | |
CN108932944B (en) | Decoding method and device | |
CN108986822A (en) | Audio recognition method, device, electronic equipment and non-transient computer storage medium | |
CN109686368B (en) | Voice wake-up response processing method and device, electronic equipment and storage medium | |
US12217751B2 (en) | Digital signal processor-based continued conversation | |
CN103811014B (en) | Voice interference filtering method and voice interference filtering system | |
US20200312305A1 (en) | Performing speaker change detection and speaker recognition on a trigger phrase | |
CN113643704A (en) | Test method, host computer, system and storage medium for vehicle-machine voice system | |
CN112669822A (en) | Audio processing method and device, electronic equipment and storage medium | |
CN111402880A (en) | Data processing method and device and electronic equipment | |
CN114446288A (en) | Voice interaction method, device and equipment | |
CN112927688B (en) | Voice interaction method and system for vehicle | |
CN112712799A (en) | Method, device, equipment and storage medium for acquiring false trigger voice information | |
CN113362845A (en) | Method, apparatus, device, storage medium and program product for noise reduction of sound data | |
EP3195314B1 (en) | Methods and apparatus for unsupervised wakeup | |
JPS59180600A (en) | Voice recognition controller to be carried on vehicle | |
CN113707156B (en) | Vehicle-mounted voice recognition method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||