CN105261368B

CN105261368B - Voice wake-up method and device

Info

Publication number: CN105261368B
Application number: CN201510549435.6A
Authority: CN
Inventors: 马涛
Original assignee: Huawei Technologies Co Ltd
Current assignee: Guangdong Gaohang Intellectual Property Operation Co ltd; Nanjing Advanced Biomaterials And Process Equipment Research Institute Co ltd
Priority date: 2015-08-31
Filing date: 2015-08-31
Publication date: 2019-05-21
Anticipated expiration: 2035-08-31
Also published as: CN105261368A

Abstract

Embodiments of the present invention provide a voice wake-up method and device. The method comprises: periodically sampling an audio signal, wherein a sampled signal is obtained by sampling at time t_i; calculating the audio energy of the sampled signal; when the audio energy is greater than or equal to the first threshold at time t_i, waking up a DSP to perform voice activity detection (VAD); and, when the VAD detection fails, the n consecutive detections before time t_i have also failed, and the difference between a first noise energy and the first threshold at time t_i is greater than a preset first threshold value, generating a second threshold according to the first noise energy and using the second threshold as the first threshold at time t_{i+1}, wherein the first noise energy is obtained by decimating the sampled signal at a first decimation rate 1/x and performing slow tracking filtering on the decimated sample points. Embodiments of the present invention can reduce the number of VAD runs and thereby reduce terminal power consumption in a noisy environment.

Description

Voice wake-up method and device

Technical Field

The present invention relates to voice wake-up technologies, and in particular, to a voice wake-up method and apparatus.

Background

With the development of science and technology, terminals now commonly provide a voice wake-up function, which lets a user wake the terminal by voice and then control it with voice commands.

The current voice wake-up scheme wakes the terminal through two-stage cooperation between a Microphone Activity Detection (MAD) circuit and a Digital Signal Processor (DSP). If the MAD circuit detects that the energy of the current audio signal is greater than a preset threshold, the DSP is woken up to perform Voice Activity Detection (VAD), which identifies whether the audio signal is the user's voice. If it is, the terminal is woken up; if it is not, the DSP wake-up was an invalid or false wake-up. Specifically, the VAD determines whether the audio signal is the user's voice by comparing the characteristics of the audio signal with the characteristics of the user's voice.

With this scheme the preset threshold is fixed, so when the terminal moves between environments, for example from a quiet environment to a noisy one, invalid or false wake-ups occur frequently, and the power consumption of the terminal in the noisy environment is high.
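The cost of a fixed threshold can be illustrated with a minimal sketch. The frame energies, the 50 dB threshold, and the function name below are hypothetical values chosen for illustration, not figures from the patent; the sketch only counts how often the first stage would wake the DSP.

```python
def count_dsp_wakeups(frame_energies, threshold):
    """Count the frames whose energy reaches a fixed first-stage threshold."""
    return sum(1 for energy in frame_energies if energy >= threshold)

# Hypothetical per-frame energies in dB.
quiet = [30, 32, 31, 55, 33, 30]   # one speech burst at 55 dB
noisy = [62, 65, 61, 75, 63, 66]   # steady street noise, one burst at 75 dB

FIXED_THRESHOLD = 50  # assumed value, tuned for the quiet case

print(count_dsp_wakeups(quiet, FIXED_THRESHOLD))  # 1
print(count_dsp_wakeups(noisy, FIXED_THRESHOLD))  # 6
```

In the noisy sequence every frame crosses the threshold, so the DSP would run VAD on all of them even though at most one frame is likely to contain speech.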

Disclosure of Invention

Embodiments of the present invention provide a voice wake-up method and device for reducing the power consumption of a terminal in a noisy environment.

In a first aspect, an embodiment of the present invention provides a voice wake-up method, including:

periodically sampling an audio signal, wherein a sampled signal y_i is obtained by sampling at time t_i, and i is a positive integer;

calculating the audio energy T_i of the sampled signal y_i;

performing voice activity detection (VAD) when the audio energy T_i is greater than or equal to the first threshold A₀ at time t_i;

when the VAD detection fails, the n consecutive detections before time t_i have also failed, and the difference between a first noise energy S₀ and the first threshold A₀ at time t_i is greater than a preset first threshold value M₀, generating a second threshold A₁ according to the first noise energy S₀ and using the second threshold A₁ as the first threshold A₀ at time t_{i+1}, wherein the first noise energy S₀ is obtained by decimating the sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sample points ys, x is a natural number greater than 1, n is a positive integer, and n is smaller than i.

With reference to the first aspect, in a first possible implementation manner of the first aspect, generating the second threshold A₁ according to the first noise energy S₀ includes:

using the first noise energy S₀ as the second threshold A₁;

or, using the sum of the first noise energy S₀ and a preset first correction amount N₀ as the second threshold A₁;

or, using the product of the first noise energy S₀ and a preset first coefficient a₀ as the second threshold A₁.

With reference to the first aspect, in a second possible implementation manner of the first aspect, after calculating the audio energy T_i of the sampled signal y_i, the method further includes:

performing VAD when the audio energy T_i is less than the first threshold A₀ at time t_i, and the differences between the respective first thresholds A₀ at each time from t_{i-m} to t_i and a second noise energy F₀ are all greater than a preset second threshold value M₁, wherein m is a positive integer and m is smaller than i;

when the VAD detection succeeds, generating a third threshold A₂ according to the second noise energy F₀ and using the third threshold A₂ as the first threshold A₀ at time t_{i+1}, wherein the second noise energy F₀ is obtained by decimating the sampled signal y_i at a second decimation rate 1/z and performing fast tracking filtering on the decimated sample points yf, and z is a natural number greater than x.
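The threshold-lowering path of this implementation manner can be sketched as follows. This is a simplified illustration under stated assumptions: F₀ is treated as a single current value rather than one value per time, the third threshold is formed as F₀ plus a correction amount, and the function names and numeric values are hypothetical.

```python
def should_probe(energy, thresholds, fast_noise, m1):
    """Decide whether to run VAD although the energy is below the threshold.

    thresholds: first thresholds A0 for times t_{i-m} .. t_i, oldest first.
    fast_noise: second noise energy F0 (assumed to be one current value).
    m1: preset second threshold value M1.
    """
    if energy >= thresholds[-1]:
        return False  # the ordinary trigger path covers this case
    return all(a0 - fast_noise > m1 for a0 in thresholds)

def next_threshold(current, fast_noise, vad_ok, n1=3.0):
    """Return A0 for t_{i+1}: A2 = F0 + N1 if VAD succeeded, else unchanged."""
    return fast_noise + n1 if vad_ok else current

history = [71.0, 71.0, 71.0]   # threshold left over from a noisy period
print(should_probe(60.0, history, fast_noise=35.0, m1=20.0))  # True
print(next_threshold(71.0, 35.0, vad_ok=True))                # 38.0
```

The check over the whole window from t_{i-m} to t_i means a single quiet instant is not enough to trigger the extra VAD run; the threshold has to sit well above the fast noise estimate for m + 1 consecutive times.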

With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, generating the third threshold A₂ according to the second noise energy F₀ includes:

using the second noise energy F₀ as the third threshold A₂;

or, using the sum of the second noise energy F₀ and a preset second correction amount N₁ as the third threshold A₂;

or, using the product of the second noise energy F₀ and a preset second coefficient a₁ as the third threshold A₂.

With reference to the second or third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, before using the third threshold A₂ as the first threshold A₀ at time t_{i+1}, the method further includes:

recording time t_i as a threshold-lowering time;

when the interval between time t_i and the previous threshold-lowering time is greater than a preset value T_time, performing the step of using the third threshold A₂ as the first threshold A₀ at time t_{i+1}; otherwise, not performing that step.
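A minimal sketch of this time-interval guard is given below. The class name, the time units, and the choice to allow the very first request are assumptions made for illustration; the text only states that t_i is recorded and compared with the previous threshold-lowering time.

```python
class LowerGuard:
    """Allow a threshold-lowering step only if enough time has passed."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # preset value T_time
        self.last_time = None             # previous threshold-lowering time

    def allow(self, now):
        """Record `now` as a lowering time and report whether lowering may proceed."""
        previous = self.last_time
        self.last_time = now
        # Assumption: with no earlier lowering time on record, lowering is allowed.
        return previous is None or now - previous > self.min_interval

guard = LowerGuard(min_interval=10)
print([guard.allow(t) for t in (0, 5, 12, 30)])  # [True, False, False, True]
```

Because every request is recorded before the comparison, closely spaced requests keep pushing the reference time forward, which keeps the threshold from dropping repeatedly during brief lulls.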

With reference to the first aspect, in a fifth possible implementation manner of the first aspect, after calculating the audio energy T_i of the sampled signal y_i, the method further includes:

when the audio energy T_i is less than the first threshold A₀ at time t_i, and the difference between the first threshold A₀ at time t_i and the first noise energy S₀ is greater than a preset third threshold value M₂, generating a fourth threshold A₃ according to the first noise energy S₀ and using the fourth threshold A₃ as the first threshold A₀ at time t_{i+1}.
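The rule in this implementation manner can be sketched in a few lines. The sketch assumes the fourth threshold is formed as S₀ plus a correction amount N₂ (one of several options) and uses hypothetical numeric values.

```python
def relax_threshold(energy, threshold, slow_noise, m2=10.0, n2=3.0):
    """Return A0 for t_{i+1} under the fifth implementation manner.

    Lowers the threshold toward the slow noise estimate S0 when the
    current energy is below A0 and A0 exceeds S0 by more than M2.
    """
    if energy < threshold and threshold - slow_noise > m2:
        return slow_noise + n2  # A3 = S0 + N2 (assumed form)
    return threshold

print(relax_threshold(40.0, 71.0, 38.0))  # 41.0
print(relax_threshold(40.0, 45.0, 38.0))  # 45.0
```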

With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, generating the fourth threshold A₃ according to the first noise energy S₀ includes:

using the first noise energy S₀ as the fourth threshold A₃;

or, using the sum of the first noise energy S₀ and a preset third correction amount N₂ as the fourth threshold A₃;

or, using the product of the first noise energy S₀ and a preset third coefficient a₂ as the fourth threshold A₃.

With reference to the fifth or sixth possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, before using the fourth threshold A₃ as the first threshold A₀ at time t_{i+1}, the method further includes:

recording time t_i as a threshold-lowering time;

when the interval between time t_i and the previous threshold-lowering time is greater than a preset value T_time, performing the step of using the fourth threshold A₃ as the first threshold A₀ at time t_{i+1}; otherwise, not performing that step.

In a second aspect, an embodiment of the present invention provides a voice wake-up apparatus, including:

a sampling frequency converter (SRC), configured to periodically sample an audio signal, wherein a sampled signal y_i is obtained by sampling at time t_i, and i is a positive integer;

an arithmetic circuit, configured to calculate the audio energy T_i of the sampled signal y_i;

a threshold decision circuit, configured to determine whether the audio energy T_i is greater than or equal to the first threshold A₀ at time t_i and, when it is, to trigger an interrupt processing circuit to output an interrupt pulse signal to an interrupt control circuit, so that the interrupt control circuit enables a digital signal processor (DSP) or a processor to perform voice activity detection (VAD);

a first decimator, whose input is coupled to the output of the SRC, configured to decimate the sampled signal y_i at a first decimation rate 1/x to obtain sample points ys, wherein x is a natural number greater than 1;

a slow tracking filter (STF), whose input is coupled to the output of the first decimator, configured to perform slow tracking filtering on the decimated sample points ys to obtain a first noise energy S₀;

a comparator, whose input is coupled to the output of the STF and to the threshold decision circuit, configured to determine whether the difference between the first noise energy S₀ and the first threshold A₀ at time t_i is greater than a preset first threshold value M₀;

a configurator, configured to: when the VAD detection fails, the n consecutive detections before time t_i have also failed, and the difference between the first noise energy S₀ and the first threshold A₀ at time t_i is greater than the preset first threshold value M₀, generate a second threshold A₁ according to the first noise energy S₀, use the second threshold A₁ as the first threshold A₀ at time t_{i+1}, and send it to the threshold decision circuit, wherein n is a positive integer and n is smaller than i.

With reference to the second aspect, in a first possible implementation manner of the second aspect, the configurator is specifically configured to:

use the first noise energy S₀ as the second threshold A₁;

With reference to the second aspect, in a second possible implementation manner of the second aspect, the apparatus further includes:

a second decimator, whose input is coupled to the output of the SRC, configured to decimate the sampled signal y_i at a second decimation rate 1/z to obtain sample points yf, wherein z is a natural number greater than x;

a fast tracking filter (FTF), whose input is coupled to the output of the second decimator, configured to perform fast tracking filtering on the decimated sample points yf to obtain a second noise energy F₀;

the comparator is further coupled to the output of the FTF and is further configured to: when the audio energy T_i is less than the first threshold A₀ at time t_i, determine whether the difference between the first threshold A₀ at each time and the second noise energy F₀ is greater than a preset second threshold value M₁; and, when the differences between the respective first thresholds A₀ at each time from t_{i-m} to t_i and the second noise energy F₀ are all greater than the preset second threshold value M₁, trigger the interrupt processing circuit to output an interrupt pulse signal to the interrupt control circuit, so that the interrupt control circuit enables the DSP or the processor to perform VAD, wherein m is a positive integer and m is smaller than i;

the configurator is further configured to: when the VAD detection succeeds, generate a third threshold A₂ according to the second noise energy F₀, use the third threshold A₂ as the first threshold A₀ at time t_{i+1}, and send it to the threshold decision circuit.

With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the configurator is specifically configured to:

use the second noise energy F₀ as the third threshold A₂;

With reference to the second or third possible implementation manner of the second aspect, in a fourth possible implementation manner of the second aspect, the configurator is further configured to:

record time t_i as a threshold-lowering time;

when the interval between time t_i and the previous threshold-lowering time is greater than a preset value T_time, perform the step of using the third threshold A₂ as the first threshold A₀ at time t_{i+1}; otherwise, not perform that step.

With reference to the second aspect, in a fifth possible implementation manner of the second aspect, the configurator is further configured to:

With reference to the fifth possible implementation manner of the second aspect, in a sixth possible implementation manner of the second aspect, the configurator is specifically configured to:

use the first noise energy S₀ as the fourth threshold A₃;

With reference to the fifth or sixth possible implementation manner of the second aspect, in a seventh possible implementation manner of the second aspect, the configurator is further configured to:

record time t_i as a threshold-lowering time;

when the interval between time t_i and the previous threshold-lowering time is greater than a preset value T_time, perform the step of using the fourth threshold A₃ as the first threshold A₀ at time t_{i+1}; otherwise, not perform that step.

Embodiments of the present invention provide a voice wake-up method and device. The audio energy T_i of the sampled signal y_i obtained at time t_i is calculated, and VAD is performed when T_i is greater than or equal to the first threshold A₀ at time t_i. When the VAD detection fails, the n consecutive detections before time t_i have also failed, and the difference between the first noise energy S₀ and the first threshold A₀ at time t_i is greater than a preset first threshold value M₀, the first threshold A₀ is adjusted to obtain the first threshold A₀ for time t_{i+1}: a second threshold A₁ is generated according to the first noise energy S₀ and used as the first threshold A₀ at time t_{i+1}. The first noise energy S₀ is obtained by decimating the sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sample points ys. In other words, the first threshold A₀ at time t_{i+1} is derived from the first noise energy S₀ at time t_i. The terminal can therefore adjust the first threshold A₀ for the next time according to the current ambient noise, so that the first threshold A₀ at each time matches the environment; this reduces the number of VAD runs and thus the power consumption of the terminal in a noisy environment.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

FIG. 1 is a flowchart of a voice wake-up method according to a first embodiment of the present invention;

FIG. 2 is a diagram illustrating an example of a first threshold value under different environments according to the voice wake-up method of the present invention;

FIG. 3 is a flowchart of a voice wake-up method according to a second embodiment of the present invention;

FIG. 4 is a flowchart of a voice wake-up method according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a voice wake-up apparatus according to a first embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a voice wake-up apparatus according to a second embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a voice wake-up apparatus according to a third embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Voice wake-up means that, in any situation, the terminal can be activated and made to run a specific application by speaking a predefined wake-up word, much as pressing a key lights the screen and activates the phone. The advantage of voice wake-up is that it frees the user's hands.

In one voice wake-up scheme for a smartphone, the standby power consumption is about 2.2 mA × 3.8 V in a quiet environment and about 5.5 mA × 3.8 V in a noisy environment. The difference in power consumption between the noisy and quiet environments is therefore about 12 mW, since (5.5 − 2.2) mA × 3.8 V ≈ 12.5 mW.

According to the power consumption estimation model, average power consumption = quiet power consumption × 70% + noisy power consumption × 30%. Power consumption in a noisy environment should therefore be reduced, and the embodiments of the present invention focus on power optimization in a noisy environment.
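As a worked example, the estimation model can be evaluated with the standby figures quoted above (2.2 mA and 5.5 mA at 3.8 V). The function and constant names are illustrative only.

```python
VOLTAGE_V = 3.8
QUIET_MA = 2.2
NOISY_MA = 5.5

def average_power_mw(quiet_share=0.7, noisy_share=0.3):
    """Average power = quiet power x quiet share + noisy power x noisy share."""
    quiet_mw = QUIET_MA * VOLTAGE_V   # mA x V gives mW
    noisy_mw = NOISY_MA * VOLTAGE_V
    return quiet_share * quiet_mw + noisy_share * noisy_mw

print(round(average_power_mw(), 2))  # 12.12
```

Of that roughly 12.1 mW average, the noisy term contributes about 6.3 mW, just over half, even though it is weighted at only 30%.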

The embodiment of the invention provides a method and a device for waking up a digital signal processor in a terminal by voice, which are used for reducing the frequency of waking up a DSP in the terminal to perform VAD and realizing the reduction of power consumption of the terminal in a noisy environment.

Fig. 1 is a flowchart of a voice wake-up method according to a first embodiment of the present invention. The method may be performed by a voice wake-up apparatus, which may be implemented in hardware. The voice wake-up device may be integrated in a terminal, such as a tablet computer, a smart phone, a Personal Digital Assistant (PDA), and the like. As shown in fig. 1, the voice wake-up method includes:

S101, periodically sampling an audio signal, wherein a sampled signal y_i is obtained by sampling at time t_i, and i is a positive integer.

Similarly, the sampled signal at time t_{i-1} is denoted y_{i-1}, the sampled signal at time t_{i+1} is denoted y_{i+1}, and so on; these are not listed one by one here.

In any embodiment of the present invention, the audio signal may be a signal collected by a sound collection device such as a microphone. An audio signal collected by a sound collection device such as a microphone is periodically sampled by a sampling frequency converter (SRC). Or, after the audio signal collected by the sound collection device such as the microphone is processed by a filter such as a band pass filter, the audio signal is periodically sampled by the SRC, which is not limited in the embodiment of the present invention.

S102, calculating the audio energy T_i of the sampled signal y_i.

It should be noted that the audio energy is calculated each time a sampled signal is obtained. For example, after the sampled signal y_{i-1} is obtained at time t_{i-1}, the corresponding audio energy T_{i-1} is also calculated.

Those skilled in the art will appreciate that, once sampled, the signal y_i is fixed, so its audio energy T_i can be obtained by calculation.

Specifically, x(j) denotes the amplitude of the sampled signal y_i at the j-th sample point, and x(j)·x(j) denotes the energy of y_i at the j-th sample point, where j is an integer from 0 to M-1 and M is the total number of sample points; the coefficient a_j denotes the weight of each sample point; and T_i denotes the audio energy of the sampled signal y_i. For example, the following equation is a normalized calculation, expressing the percentage of the total energy at each sample point:

wherein,

The above is only one example of calculating the audio energy T_i of the sampled signal y_i, and the embodiments of the present invention are not limited to it; T_i may also be obtained by root mean square (RMS) or other similar methods, for example without normalization.
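A minimal sketch of two such energy measures follows. It assumes a weighted sum of the per-sample energies x(j)·x(j) with weights a_j, defaulting to uniform weights 1/M; the uniform default and the function names are assumptions for illustration and are not taken from the patent.

```python
import math

def weighted_energy(samples, weights=None):
    """Weighted sum of per-sample energies: sum of a_j * x(j) * x(j).

    Assumption: if no weights are given, uniform weights 1/M are used.
    """
    m = len(samples)
    if weights is None:
        weights = [1.0 / m] * m
    return sum(a * x * x for a, x in zip(weights, samples))

def rms(samples):
    """Root mean square of the sample amplitudes."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

frame = [0.5, -0.5, 0.5, -0.5]
print(weighted_energy(frame))  # 0.25
print(rms(frame))              # 0.5
```

With uniform weights the weighted energy is the mean square of the frame, so the RMS value is simply its square root.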

S103, performing VAD when the audio energy T_i is greater than or equal to the first threshold A₀ at time t_i.

Specifically, the component that performs VAD may be a DSP or a processor in the terminal.

S104, when the VAD detection fails, the n consecutive detections before time t_i have also failed, and the difference between a first noise energy S₀ and the first threshold A₀ at time t_i is greater than a preset first threshold value M₀, generating a second threshold A₁ according to the first noise energy S₀ and using the second threshold A₁ as the first threshold A₀ at time t_{i+1}, wherein the first noise energy S₀ is obtained by decimating the sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sample points ys, x is a natural number greater than 1, n is a positive integer, and n is smaller than i.

Note that "the VAD detection fails and the n consecutive detections before time t_i have also failed" means that the VAD detection at time t_i fails and all VAD detections performed from time t_{i-n} to time t_{i-1} failed. For example, if n is 2, it means that the VAD detection at time t_i fails and the detections at the two preceding times (that is, times t_{i-2} and t_{i-1}) both failed. To aid understanding, consider an example of a VAD failure: a car engine is currently producing sound whose audio energy is greater than the first threshold A₀ at the current time, so VAD must be performed, but the VAD fails because it determines that the sound is not the user's voice. In other words, if the terminal is in a high-noise environment, the energy of the ambient noise is relatively high; once it exceeds the first threshold A₀ at the current time, VAD must be started, but because the ambient noise is itself chaotic, no useful voice signal can be detected in it and the VAD detection fails. The first noise energy S₀ represents the energy level of the stationary noise in the terminal's environment. The preset first threshold value M₀ is a preset parameter that can be determined through tuning.
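The interplay of S103 and S104 can be sketched as a small state machine. This is an illustrative sketch under several assumptions: the "difference" is taken as S₀ − A₀, the new threshold is formed as S₀ + N₀, the failure streak is counted over VAD runs only, and `vad` is a hypothetical callback standing in for the DSP's detector.

```python
from dataclasses import dataclass

@dataclass
class WakeState:
    threshold: float       # first threshold A0 for the current time
    fail_streak: int = 0   # consecutive failed VAD runs so far

def step(state, energy, slow_noise, vad, n=2, m0=5.0, n0=3.0):
    """Process one sampling time; return True if the terminal should wake."""
    if energy < state.threshold:
        return False                 # S103 not met: VAD is not run
    if vad():                        # hypothetical VAD callback
        state.fail_streak = 0
        return True
    prior_failures = state.fail_streak   # failures before this time
    state.fail_streak += 1
    # S104: this run failed, the n runs before it failed, and S0 - A0 > M0.
    if prior_failures >= n and slow_noise - state.threshold > m0:
        state.threshold = slow_noise + n0    # A1 = S0 + N0 (assumed form)
    return False

state = WakeState(threshold=50.0)
for _ in range(3):
    step(state, energy=70.0, slow_noise=68.0, vad=lambda: False)
print(state.threshold)  # 71.0
```

After the third failed detection the threshold moves above the 70.0 noise frames, so subsequent frames at that level no longer trigger VAD.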

It should also be noted that, in any embodiment of the present invention, "first" and "second" are used only to distinguish items that share a name. For example, the "first" in "first threshold" and the "second" in "second threshold" are merely labels for distinguishing different thresholds and do not represent an order between them.

In practical applications, the noise level differs between scenarios. For example, in a quiet environment the noise is about 30 to 35 decibels (dB); in noisy environments, ambient noise is roughly as follows: about 60 dB in a shopping mall, about 70 dB on a road, about 70 dB inside an aircraft cabin, about 80 dB on public transport, about 90 dB in a subway, and so on. The noise level also differs from place to place and from time to time; for example, the noise at the same location may differ by 10 to 15 dB between day and night.

Moreover, when a user carries out conversation or talk in a noisy environment, the voice volume can be increased subconsciously, so that the Signal to Noise Ratio (SNR) is increased, and a feasible basis is provided for voice awakening.

Therefore, in the current voice awakening scheme adopting a uniform noise threshold, namely a preset threshold, when the terminal is awakened by voice, a quiet environment and a noisy environment cannot be distinguished, and if the preset threshold is set to be too high, voice missing detection can be caused; if the preset threshold is set too low, the processor is frequently awakened, and the power consumption is large.

In the embodiments of the present invention, the size of the first threshold A₀ at each time is adjusted in a timely manner.

Specifically, through S101 to S103, the audio energy T_i of the sampled signal y_i obtained at time t_i is calculated and compared with the first threshold A₀ at time t_i; when T_i is greater than or equal to that threshold, VAD is performed by the DSP, the processor, or a similar component, and whether to wake up the terminal is decided from the VAD result. The VAD detection succeeds when the DSP, processor, or other VAD-capable component detects the user's voice in the sampled signal y_i, in which case the terminal is woken up; otherwise the VAD detection fails, that is, the user's voice is not detected in y_i, and the terminal is not woken up.

In S104, when the difference between the first noise energy S₀ and the first threshold A₀ at time t_i is greater than the preset first threshold value M₀, the terminal may currently be in an environment with high background noise. In this case, a second threshold A₁ is generated according to the first noise energy S₀ and used as the first threshold A₀ at time t_{i+1}. The first noise energy S₀ is obtained by decimating the sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sample points ys, where x is a natural number greater than 1 and n is a positive integer smaller than i. In practice, the sampled signal y_i may include both the user's voice and the ambient noise at time t_i, or only the ambient noise at time t_i. The first threshold A₀ for time t_{i+1} is obtained at time t_i; it is the first threshold that the terminal uses when executing S103 and S104 of the voice wake-up method at time t_{i+1}.
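Decimation followed by slow tracking can be sketched as below. The patent does not specify the filter structure here, so the one-pole smoother, the coefficient `alpha`, and the treatment of the inputs as per-sample energy values are all assumptions made for illustration.

```python
def decimate(values, x):
    """Keep every x-th value, i.e. decimate at rate 1/x."""
    return values[::x]

def slow_track(values, alpha=0.1, start=0.0):
    """One-pole smoother: the estimate moves a fraction alpha toward each input.

    A small alpha gives slow tracking (an assumed filter form).
    """
    estimate = start
    for value in values:
        estimate += alpha * (value - estimate)
    return estimate

# Hypothetical energy sequence: the environment jumps from 40 to 80.
energies = [40.0] * 8 + [80.0] * 8
ys = decimate(energies, 4)
s0 = slow_track(ys, alpha=0.1, start=40.0)
print(ys)            # [40.0, 40.0, 80.0, 80.0]
print(round(s0, 1))  # 47.6
```

The slow estimate has moved only part of the way toward the new level after two decimated points, which is what makes it a measure of stationary background noise rather than of short bursts.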

If the voice wake-up at time t_i is the first voice wake-up, the first threshold A₀ at time t_i may be preset. The preset first threshold A₀ can be regarded as a parameter optimized for a likely application scenario; for example, a first threshold A₀ preset to 50 dB can be regarded as the background noise threshold for a quiet environment. FIG. 2 illustrates the first threshold in a quiet environment and in a noisy environment. As shown in FIG. 2, in a quiet environment the first threshold is higher than the ambient noise by a first preset value, and in a noisy environment it is higher than the ambient noise by a second preset value. In addition, the first threshold in a noisy environment is higher than the first threshold in a quiet environment.

In addition, S103 may instead be: 1) performing VAD when the difference between the audio energy T_i and the audio energy T_{i-1} at time t_{i-1} is greater than or equal to a difference threshold A₀₀ at time t_i; or 2) performing VAD when the audio energy T_i is greater than or equal to the first threshold A₀ at time t_i and the difference between T_i and T_{i-1} is greater than or equal to the difference threshold A₀₀ at time t_i; or 3) performing VAD when either of these conditions holds, that is, when T_i is greater than or equal to the first threshold A₀ at time t_i or the difference between T_i and T_{i-1} is greater than or equal to the difference threshold A₀₀ at time t_i. The audio energy T_{i-1} at time t_{i-1} is cached in the terminal; it is the audio energy of the sampled signal y_{i-1} calculated at time t_{i-1}.

In case 1), the difference threshold A₀₀ at time t_i is adjusted in a manner similar to the adjustment of the first threshold A₀ at time t_i; in case 2), both the first threshold A₀ and the difference threshold A₀₀ at time t_i are adjusted in that manner; in case 3), either the first threshold A₀ or the difference threshold A₀₀ at time t_i is adjusted in that manner.
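The three trigger variants can be compared side by side in a short sketch. The mode names are hypothetical labels, and the difference is taken as the signed rise T_i − T_{i-1}, which is an assumption; the text does not say whether an absolute difference is intended.

```python
def triggers(energy, prev_energy, a0, a00, mode):
    """Return True if VAD should run under the chosen variant of S103.

    a0:  first threshold A0 at time t_i
    a00: difference threshold A00 at time t_i
    """
    level = energy >= a0                  # original S103 condition
    jump = energy - prev_energy >= a00    # energy rise since t_{i-1}
    if mode == "jump":      # variant 1
        return jump
    if mode == "both":      # variant 2
        return level and jump
    if mode == "either":    # variant 3
        return level or jump
    raise ValueError(f"unknown mode: {mode}")

# Steady loud noise: above the level threshold but with no sudden rise.
for mode in ("jump", "both", "either"):
    print(mode, triggers(60.0, 58.0, a0=50.0, a00=10.0, mode=mode))
```

For steady loud noise the first two variants stay silent while the third still fires, which shows how the choice of variant trades missed detections against DSP wake-ups.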

In this embodiment of the present invention, the audio energy T_i of the sampled signal y_i obtained at time t_i is calculated, and VAD is performed when T_i is greater than or equal to the first threshold A₀ at time t_i. When the VAD detection fails, the n consecutive detections before time t_i have also failed, and the difference between the first noise energy S₀ and the first threshold A₀ at time t_i is greater than a preset first threshold value M₀, the first threshold A₀ is adjusted to obtain the first threshold A₀ for time t_{i+1}: a second threshold A₁ is generated according to the first noise energy S₀ and used as the first threshold A₀ at time t_{i+1}. The first noise energy S₀ is obtained by decimating the sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sample points ys. In other words, the first threshold A₀ at time t_{i+1} is derived from the first noise energy S₀ at time t_i. The terminal can therefore adjust the first threshold A₀ for the next time according to the current ambient noise, so that the first threshold A₀ at each time matches the environment; this reduces the number of VAD runs and thus the power consumption of the terminal in a noisy environment.

In the above embodiment, generating the second threshold A₁ according to the first noise energy S₀ may include: using the first noise energy S₀ as the second threshold A₁; or using the sum of the first noise energy S₀ and a preset first correction amount N₀ as the second threshold A₁, that is, A₁ = S₀ + N₀; or using the product of the first noise energy S₀ and a preset first coefficient a₀ as the second threshold A₁, that is, A₁ = a₀ × S₀.

A larger first correction amount N₀ means that the second threshold A₁ rises quickly above the first noise energy S₀; a smaller first correction amount N₀ means that the second threshold A₁ rises slowly above the first noise energy S₀. The rate of rise can be set according to actual requirements. The value of the first correction amount N₀ can be set according to the actual scenario and is not limited in the embodiment of the invention. Similarly, a larger first coefficient a₀ means that the second threshold A₁ rises quickly above the first noise energy S₀, and a smaller first coefficient a₀ means that it rises slowly; the rate of rise can be set according to actual requirements. The value of the first coefficient a₀ can be set according to the actual scenario and is not limited in the embodiment of the invention.

Optionally, the product of the first noise energy S₀ and the preset first coefficient a₀, plus the preset first correction amount N₀, may also be used as the second threshold A₁, that is, A₁ = a₀ × S₀ + N₀.
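All four ways of deriving a threshold from a noise energy are special cases of A = a × S + N. A minimal sketch, in which the function name `make_threshold` and its default arguments are illustrative assumptions:

```python
def make_threshold(noise_energy, coeff=1.0, correction=0.0):
    """Derive a threshold from a noise energy as coeff * noise_energy + correction.

    coeff=1, correction=0 : the noise energy itself (A = S)
    correction only       : A = S + N
    coeff only            : A = a * S
    both                  : A = a * S + N
    """
    return coeff * noise_energy + correction
```

The same helper applies unchanged to any threshold that is derived from a noise energy with its own coefficient and correction amount.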

Fig. 3 is a flowchart of a voice wake-up method according to a second embodiment of the present invention. As shown in fig. 3, the method may include:

S301, periodically sampling the audio signal, where a sampled signal y_i is obtained at time t_i and i is a positive integer.

S302, calculating the audio energy T_i of the sampled signal y_i.

S303, performing VAD when the audio energy T_i is less than the first threshold A₀ at time t_i and the differences between the respective first thresholds A₀ and the second noise energy F₀ from time t_i-m to time t_i are all greater than a preset second threshold value M₁, where m is a positive integer and m is less than i.

For example, if m is 2, VAD is performed when the audio energy T_i is less than the first threshold A₀ at time t_i and the difference between the first threshold A₀ and the second noise energy F₀ is greater than the second threshold value M₁ at each of the times t_i-2, t_i-1 and t_i.

S304, when the VAD detection succeeds, generating a third threshold A₂ according to the second noise energy F₀ and using the third threshold A₂ as the first threshold A₀ at time t_i+1, where the second noise energy F₀ is obtained by decimating the sampled signal y_i at a second decimation rate 1/z and performing fast tracking filtering on the decimated sampling points yf, and z is a natural number greater than x.

For specific description of S301 and S302, reference may be made to the embodiment shown in fig. 1, which is not repeated herein.

For S303: in prior-art voice wake-up schemes, VAD is not performed when the audio energy T_i is less than the first threshold A₀ at time t_i, so the user's voice may be missed. For example, the first threshold A₀ at time t_i may be suited to a noisy environment while the terminal is actually in a relatively quiet environment (for example, one with low background noise), so that the user's voice in the sampled signal y_i goes undetected. In the embodiment of the invention, S303 and S304 change the first threshold A₀ at time t_i+1 so that it matches the current environment.

When the differences between the respective first thresholds A₀ and the second noise energy F₀ from time t_i-m to time t_i are all greater than the preset second threshold value M₁, that is, when the difference between the first threshold A₀ and the second noise energy F₀ has exceeded M₁ on m+1 successive occasions, the terminal is in a quiet environment (one with low background noise) and the current first threshold A₀ is too large, so it needs to be lowered to match the quiet environment. The second threshold value M₁ is a preset parameter that can be obtained through debugging.

For S304: a successful VAD detection indicates that the sampled signal y_i contains the user's voice. To avoid missing the user's voice, a third threshold A₂ is generated according to the second noise energy F₀ and used as the first threshold A₀ at time t_i+1. The second noise energy F₀ is obtained by decimating the sampled signal y_i at a second decimation rate 1/z and performing fast tracking filtering on the decimated sampling points yf, so the second noise energy F₀ reflects, to some extent, the transient noise energy of the environment in which the terminal is located.

In this embodiment of the invention, the audio energy T_i of the sampled signal y_i obtained at time t_i is calculated, and VAD is performed when the audio energy T_i is less than the first threshold A₀ at time t_i and the differences between the respective first thresholds A₀ and the second noise energy F₀ from time t_i-m to time t_i are all greater than the preset second threshold value M₁. When the VAD detection succeeds, a third threshold A₂ is generated according to the second noise energy F₀ and used as the first threshold A₀ at time t_i+1. The second noise energy F₀ is obtained by decimating the sampled signal y_i at a second decimation rate 1/z and performing fast tracking filtering on the decimated sampling points yf. That is, the first threshold A₀ at time t_i+1 is derived from the second noise energy F₀ at time t_i. In this way, the terminal adjusts the first threshold A₀ for the next moment according to the current environmental noise, so that the first threshold A₀ at each moment matches the environment. On top of reducing the number of VAD runs and the power consumption of the terminal in a noisy environment, this further avoids missing the user's voice in the sampled signal y_i.
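The window check of S303 and the threshold update of S304 can be sketched as below. The class and function names are hypothetical, the difference is assumed to be A₀ - F₀, and the third threshold is assumed to take the form A₂ = F₀ + N₁; the check only fires once m+1 samples have been seen, which is one possible reading of "from time t_i-m to time t_i".

```python
from collections import deque


class QuietEnvironmentDetector:
    """Tracks A0 - F0 over the last m+1 sampling moments (t_i-m .. t_i)."""

    def __init__(self, m, M1):
        self.M1 = M1
        self.diffs = deque(maxlen=m + 1)

    def step(self, T_i, A0, F0):
        """Record this moment's difference; return True when S303 says to run VAD."""
        self.diffs.append(A0 - F0)
        window_full = len(self.diffs) == self.diffs.maxlen
        return (T_i < A0 and window_full
                and all(d > self.M1 for d in self.diffs))


def next_threshold_after_vad(vad_succeeded, A0, F0, N1=0.0):
    """S304: on VAD success use A2 = F0 + N1 as A0 for t_i+1, else keep A0."""
    return F0 + N1 if vad_succeeded else A0
```

With m = 2 the detector stays silent for the first two moments and requests VAD on the third, provided the difference exceeded M₁ at all three.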

In the above embodiment, generating the third threshold A₂ according to the second noise energy F₀ may specifically include: using the second noise energy F₀ as the third threshold A₂; or using the sum of the second noise energy F₀ and a preset second correction amount N₁ as the third threshold A₂, that is, A₂ = F₀ + N₁; or using the product of the second noise energy F₀ and a preset second coefficient a₁ as the third threshold A₂, that is, A₂ = a₁ × F₀.

A larger second correction amount N₁ means that the third threshold A₂ rises quickly above the second noise energy F₀; a smaller second correction amount N₁ means that the third threshold A₂ rises slowly above the second noise energy F₀. The rate of rise can be set according to actual requirements. The value of the second correction amount N₁ can be set according to the actual scenario and is not limited in the embodiment of the invention. Similarly, a larger second coefficient a₁ means that the third threshold A₂ rises quickly above the second noise energy F₀, and a smaller second coefficient a₁ means that it rises slowly; the rate of rise can be set according to actual requirements. The value of the second coefficient a₁ can be set according to the actual scenario and is not limited in the embodiment of the invention.

Optionally, the product of the second noise energy F₀ and the preset second coefficient a₁, plus the preset second correction amount N₁, may also be used as the third threshold A₂, that is, A₂ = a₁ × F₀ + N₁.

Fig. 4 is a flowchart of a voice wake-up method according to a third embodiment of the present invention. As shown in fig. 4, the method may include:

S401, periodically sampling the audio signal, where a sampled signal y_i is obtained at time t_i and i is a positive integer.

S402, calculating the audio energy T_i of the sampled signal y_i.

S403, when the audio energy T_i is less than the first threshold A₀ at time t_i and the difference between the first threshold A₀ at time t_i and the first noise energy S₀ is greater than a preset third threshold value M₂, generating a fourth threshold A₃ according to the first noise energy S₀ and using the fourth threshold A₃ as the first threshold A₀ at time t_i+1.

For specific description of S401 and S402, reference may be made to the embodiment shown in fig. 1, which is not described herein again.

For S403: in prior-art voice wake-up schemes, VAD is not performed when the audio energy T_i is less than the first threshold A₀ at time t_i, so the user's voice may be missed. For example, the first threshold A₀ at time t_i may be suited to a noisy environment while the terminal is actually in a relatively quiet environment, so that the user's voice in the sampled signal y_i goes undetected. In the embodiment of the invention, S403 changes the first threshold A₀ at time t_i+1 so that it matches the current environment.

When the difference between the first threshold A₀ at time t_i and the first noise energy S₀ is greater than the preset third threshold value M₂, that is, when the first threshold A₀ at time t_i is considerably larger than the first noise energy S₀, the terminal is in a relatively quiet environment and the first threshold A₀ at time t_i is too large, so it needs to be lowered to match the environment. The third threshold value M₂ is a preset parameter that can be obtained through debugging.

Because the first noise energy S₀ is obtained by decimating the sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sampling points ys, the first noise energy S₀ reflects the stable noise energy of the environment. Therefore, unlike S303, S403 does not need to check at multiple moments whether the difference between the first threshold A₀ and the first noise energy S₀ is greater than the preset third threshold value M₂. When the difference between the first threshold A₀ at time t_i and the first noise energy S₀ is greater than the preset third threshold value M₂, the sampled signal y_i may contain the user's voice; to avoid missing the user's voice, a fourth threshold A₃ is generated according to the first noise energy S₀ and used as the first threshold A₀ at time t_i+1.

In this embodiment of the invention, the audio energy T_i of the sampled signal y_i obtained at time t_i is calculated; when the audio energy T_i is less than the first threshold A₀ at time t_i and the difference between the first threshold A₀ at time t_i and the first noise energy S₀ is greater than the preset third threshold value M₂, a fourth threshold A₃ is generated according to the first noise energy S₀ and used as the first threshold A₀ at time t_i+1. The first noise energy S₀ is obtained by decimating the sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sampling points ys. That is, the first threshold A₀ at time t_i+1 is derived from the first noise energy S₀ at time t_i. In this way, the terminal adjusts the first threshold A₀ for the next moment according to the current environmental noise, so that the first threshold A₀ at each moment matches the environment. On top of reducing the number of VAD runs and the power consumption of the terminal in a noisy environment, this further avoids missing the user's voice in the sampled signal y_i.
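Because the slowly tracked noise energy is already stable, the rule of S403 needs only a single comparison. A minimal sketch, assuming the difference is A₀ - S₀ and the fourth threshold takes the form A₃ = S₀ + N₂ (the function name is hypothetical):

```python
def lower_threshold_if_quiet(T_i, A0, S0, M2, N2=0.0):
    """Return the first threshold for time t_i+1 according to S403.

    T_i : audio energy at time t_i
    A0  : first threshold at time t_i
    S0  : first (slowly tracked) noise energy
    M2  : preset third threshold value for the difference A0 - S0
    N2  : preset third correction amount (A3 = S0 + N2)
    """
    if T_i < A0 and (A0 - S0) > M2:
        return S0 + N2   # A3 becomes A0 at t_i+1
    return A0            # threshold already matches the environment
```

For example, with T_i = 3, A₀ = 20, S₀ = 4, M₂ = 5 and N₂ = 3, the threshold for the next moment drops from 20 to 7.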

In the above embodiment, generating the fourth threshold A₃ according to the first noise energy S₀ may include: using the first noise energy S₀ as the fourth threshold A₃; or using the sum of the first noise energy S₀ and a preset third correction amount N₂ as the fourth threshold A₃, that is, A₃ = S₀ + N₂; or using the product of the first noise energy S₀ and a preset third coefficient a₂ as the fourth threshold A₃, that is, A₃ = a₂ × S₀.

A larger third correction amount N₂ means that the fourth threshold A₃ rises quickly above the first noise energy S₀; a smaller third correction amount N₂ means that the fourth threshold A₃ rises slowly above the first noise energy S₀. The rate of rise can be set according to actual requirements. The value of the third correction amount N₂ can be set according to the actual scenario and is not limited in the embodiment of the invention. Similarly, a larger third coefficient a₂ means that the fourth threshold A₃ rises quickly above the first noise energy S₀, and a smaller third coefficient a₂ means that it rises slowly; the rate of rise can be set according to actual requirements. The value of the third coefficient a₂ can be set according to the actual scenario and is not limited in the embodiment of the invention.

Optionally, the product of the first noise energy S₀ and the preset third coefficient a₂, plus the preset third correction amount N₂, may also be used as the fourth threshold A₃, that is, A₃ = a₂ × S₀ + N₂.

It should be noted that the second correction amount N₁ and the third correction amount N₂ reflect how far the first threshold A₀ is raised above the noise energy under different conditions: the first threshold A₀ exceeds the second noise energy F₀ by the second correction amount N₁, and exceeds the first noise energy S₀ by the third correction amount N₂. In addition, because the first noise energy S₀ is obtained by slow tracking filtering and the second noise energy F₀ by fast tracking filtering, the third correction amount N₂ is optionally greater than the second correction amount N₁ in order to match the environment quickly.

Furthermore, the embodiment of the invention may also record each change of the first threshold. When the first threshold is raised, the time of the raise may be recorded; when the first threshold is lowered, the time of the lowering may be recorded.

Specifically, before the third threshold A₂ is used as the first threshold A₀ at time t_i+1, the method may further include: recording time t_i as a threshold-lowering time; when the interval between time t_i and the previous threshold-lowering time is greater than a preset value T_time, performing the step of using the third threshold A₂ as the first threshold A₀ at time t_i+1; otherwise, not performing the step of using the third threshold A₂ as the first threshold A₀ at time t_i+1.

Similarly, before the fourth threshold A₃ is used as the first threshold A₀ at time t_i+1, the method may further include: recording time t_i as a threshold-lowering time; when the interval between time t_i and the previous threshold-lowering time is greater than the preset value T_time, performing the step of using the fourth threshold A₃ as the first threshold A₀ at time t_i+1; otherwise, not performing the step of using the fourth threshold A₃ as the first threshold A₀ at time t_i+1.

The above two implementations prevent ping-pong switching of the first threshold A₀, so that the reliability of voice detection is not affected and the probability of missed voice detection is reduced.
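The hold-off that prevents ping-pong switching can be sketched as a small guard object. The class name `LoweringGuard` is hypothetical, and two points are assumptions rather than statements of the embodiment: the very first lowering is always allowed, and only lowerings that are actually applied update the recorded threshold-lowering time.

```python
class LoweringGuard:
    """Allows the first threshold to be lowered at most once per T_time."""

    def __init__(self, T_time):
        self.T_time = T_time
        self.last_lowering = None   # time of the previous applied lowering

    def try_lower(self, t_i, A0, candidate):
        """Return the first threshold to use at t_i+1.

        candidate is the proposed lower threshold (A2 or A3).
        """
        too_soon = (self.last_lowering is not None
                    and (t_i - self.last_lowering) <= self.T_time)
        if too_soon:
            return A0               # skip the update, keep the current threshold
        self.last_lowering = t_i    # assumption: record only applied lowerings
        return candidate
```

With T_time = 10, a lowering at time 0 is applied, a second attempt at time 5 is ignored, and an attempt at time 11 is applied again.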

The embodiment of the invention continuously monitors and tracks the environmental background noise and adaptively adjusts the first threshold A₀ according to the level of that noise. The first threshold A₀ is raised or lowered slowly, which reduces the probability of missed voice detection. In addition, the dynamic adjustment of the first threshold A₀ brings the power consumption in quiet and noisy environments close to each other, which improves user experience and product competitiveness.

Fig. 5 is a schematic structural diagram of a voice wake-up apparatus according to a first embodiment of the present invention. The voice wake-up apparatus may be implemented in hardware and may be integrated in a terminal such as a tablet computer, a smart phone or a PDA. As shown in fig. 5, the voice wake-up apparatus 10 includes: an SRC11, an arithmetic circuit 12, a threshold decision circuit 13, a first decimator 14, a Slow Tracking Filter (STF) 15, a comparator 16, a configurator 17, and an interrupt processing circuit 18.

The SRC11 is configured to periodically sample the audio signal, where a sampled signal y_i is obtained at time t_i and i is a positive integer. The arithmetic circuit 12 is configured to calculate the audio energy T_i of the sampled signal y_i. The threshold decision circuit 13 is configured to determine whether the audio energy T_i is greater than or equal to the first threshold A₀ at time t_i and, when it is, to trigger the interrupt processing circuit 18 to output an interrupt pulse signal to the interrupt control circuit 20, which enables the DSP or processor 30 to perform VAD. An input of the first decimator 14 is coupled to an output of the SRC11; the first decimator 14 is configured to decimate the sampled signal y_i at a first decimation rate 1/x to obtain and output sampling points ys, where x is a natural number greater than 1. An input of the STF15 is coupled to an output of the first decimator 14; the STF15 is configured to perform slow tracking filtering on the decimated sampling points ys to obtain the first noise energy S₀. Inputs of the comparator 16 are coupled to the outputs of the STF15 and the threshold decision circuit 13; the comparator 16 is configured to determine whether the difference between the first noise energy S₀ and the first threshold A₀ at time t_i is greater than a preset first threshold value M₀.
The configurator 17 is configured to, when the VAD detection fails, the n consecutive detections before time t_i have also failed, and the difference between the first noise energy S₀ and the first threshold A₀ at time t_i is greater than the preset first threshold value M₀, generate a second threshold A₁ according to the first noise energy S₀, use the second threshold A₁ as the first threshold A₀ at time t_i+1, and send it to the threshold decision circuit 13, where n is a positive integer and n is less than i.

Referring to fig. 5, the configurator 17 configures parameters for the voice wake-up apparatus 10, such as the first threshold A₀ mentioned above. Those skilled in the art will understand that the configurator 17 receives configuration parameters from the terminal and converts them into control signals for the logic modules in the voice wake-up apparatus 10, which include the arithmetic circuit 12, the threshold decision circuit 13, the interrupt processing circuit 18, and the like. The SRC11 may specifically sample the audio signal by down-sampling, for example, converting 32 kilohertz (kHz) data into 16 kHz data.

The sampled signal y_i flows through fig. 5 as follows:

SRC11 -> arithmetic circuit 12 -> threshold decision circuit 13 -> interrupt processing circuit 18 (optional) -> interrupt control circuit 20 (optional) -> DSP or processor 30 (optional).

When the audio energy T_i is greater than or equal to the first threshold A₀ at time t_i, the flow of the sampled signal y_i includes the optional portions above; when the audio energy T_i is less than the first threshold A₀ at time t_i, the flow of the sampled signal y_i does not include the optional portions.

The first decimator 14, the STF15 and the comparator 16 do not affect normal voice wake-up; they are used only together with the configurator 17 to change the first threshold A₀ used in voice wake-up.

The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again.

In the above embodiment, the configurator 17 may be specifically configured to: use the first noise energy S₀ as the second threshold A₁; or use the sum of the first noise energy S₀ and a preset first correction amount N₀ as the second threshold A₁, that is, A₁ = S₀ + N₀; or use the product of the first noise energy S₀ and a preset first coefficient a₀ as the second threshold A₁, that is, A₁ = a₀ × S₀; and so on. The embodiment of the invention is not limited in this respect.

Fig. 6 is a schematic structural diagram of a voice wake-up apparatus according to a second embodiment of the present invention. The voice wake-up apparatus may be implemented in hardware and may be integrated in a terminal such as a tablet computer, a smart phone or a PDA. As shown in fig. 6, the voice wake-up apparatus 100 includes: an SRC110, an arithmetic circuit 120, a threshold decision circuit 130, a second decimator 140, a Fast Tracking Filter (FTF) 150, a comparator 160, a configurator 170, and an interrupt processing circuit 180.

The SRC110 is configured to periodically sample the audio signal, where a sampled signal y_i is obtained at time t_i and i is a positive integer. The arithmetic circuit 120 is configured to calculate the audio energy T_i of the sampled signal y_i. The threshold decision circuit 130 is configured to determine whether the audio energy T_i is greater than or equal to the first threshold A₀ at time t_i. An input of the second decimator 140 is coupled to an output of the SRC110; the second decimator 140 is configured to decimate the sampled signal y_i at a second decimation rate 1/z to obtain sampling points yf, where z is a natural number greater than x. An input of the FTF 150 is coupled to an output of the second decimator 140; the FTF 150 is configured to perform fast tracking filtering on the decimated sampling points yf to obtain the second noise energy F₀. An input of the comparator 160 is coupled to an output of the FTF 150; the comparator 160 is configured to, when the audio energy T_i is less than the first threshold A₀ at time t_i, determine whether the difference between the first threshold A₀ and the second noise energy F₀ at each moment is greater than a preset second threshold value M₁, and, when the differences between the respective first thresholds A₀ and the second noise energy F₀ from time t_i-m to time t_i are all greater than the preset second threshold value M₁, to trigger the interrupt processing circuit 180 to output an interrupt pulse signal to the interrupt control circuit 200, which enables the DSP or processor 300 to perform VAD, where m is a positive integer and m is less than i. The configurator 170 is configured to, when the VAD detection succeeds, generate a third threshold A₂ according to the second noise energy F₀, use the third threshold A₂ as the first threshold A₀ at time t_i+1, and send it to the threshold decision circuit 130.

The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 3, and the implementation principle and the technical effect are similar, which are not described herein again.

On the basis of the above embodiment, the configurator 170 may be specifically configured to: use the second noise energy F₀ as the third threshold A₂; or use the sum of the second noise energy F₀ and a preset second correction amount N₁ as the third threshold A₂, that is, A₂ = F₀ + N₁; or use the product of the second noise energy F₀ and a preset second coefficient a₁ as the third threshold A₂, that is, A₂ = a₁ × F₀; and so on. The embodiment of the invention is not limited in this respect.

Optionally, the configurator 170 may be further configured to: record time t_i as a threshold-lowering time; when the interval between time t_i and the previous threshold-lowering time is greater than a preset value T_time, perform the step of using the third threshold A₂ as the first threshold A₀ at time t_i+1; otherwise, not perform that step. This prevents ping-pong switching of the first threshold A₀, so that the reliability of voice detection is not affected and the probability of missed voice detection is reduced.

Referring to fig. 5, the configurator 17 may also be configured to: when the audio energy T_i is less than the first threshold A₀ at time t_i and the difference between the first threshold A₀ at time t_i and the first noise energy S₀ is greater than a preset third threshold value M₂, generate a fourth threshold A₃ according to the first noise energy S₀ and use the fourth threshold A₃ as the first threshold A₀ at time t_i+1.

At this time, the apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 4, and the implementation principle and the technical effect are similar, which are not described herein again.

Further, the configurator 17 may be specifically configured to: use the first noise energy S₀ as the fourth threshold A₃; or use the sum of the first noise energy S₀ and a preset third correction amount N₂ as the fourth threshold A₃, that is, A₃ = S₀ + N₂; or use the product of the first noise energy S₀ and a preset third coefficient a₂ as the fourth threshold A₃, that is, A₃ = a₂ × S₀; and so on. The embodiment of the invention is not limited in this respect.

Still further, the configurator 17 may also be configured to: record time t_i as a threshold-lowering time; when the interval between time t_i and the previous threshold-lowering time is greater than the preset value T_time, perform the step of using the fourth threshold A₃ as the first threshold A₀ at time t_i+1; otherwise, not perform that step. This prevents ping-pong switching of the first threshold A₀, so that the reliability of voice detection is not affected and the probability of missed voice detection is reduced.

Referring to fig. 5 and fig. 6, the first decimator 14 and the second decimator 140 decimate data over a long period and a short period, respectively. The STF15 is a slowly converging filter used to track changes in environmental noise stably. The FTF 150 is a fast converging filter used to track changes in environmental noise quickly. The STF15 and the FTF 150 track the energy of the current calculation window and are constructed similarly to the arithmetic circuit 12 or the arithmetic circuit 120. They differ in filter order and parameters, which are set according to actual debugging. The FTF 150 performs short-period filtering, that is, recent changes in the data quickly affect the filter output. The STF15 performs long-period filtering, that is, recent changes in the data affect the filter output only slightly and slowly.

Alternatively, on the basis of fig. 5, in combination with fig. 6, the structure shown in fig. 7 is obtained. Fig. 7 is a schematic structural diagram of a voice wake-up apparatus according to a third embodiment of the present invention. As shown in fig. 7, the voice wake-up apparatus 1000 includes: SRC11, arithmetic circuit 12, threshold decision circuit 13, first decimator 14, second decimator 140, STF15, FTF 150, comparator 16, configurator 17, and interrupt processing circuit 18.

The threshold decision circuit 13 also has the functions of the threshold decision circuit 130; the comparator 16 also has the functions of the comparator 160; the configurator 17 also has the functions of the configurator 170; and the interrupt processing circuit 18 also has the functions of the interrupt processing circuit 180. The specific principles are the same as in the above embodiments and are not described again here.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units or modules is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or modules may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A voice wake-up method, comprising:

calculating the audio energy T_i of the sampled signal y_i;

when VAD detection fails, the n consecutive detections before said time t_i have also failed, and the difference between a first noise energy S₀ and the first threshold A₀ at said time t_i is greater than a preset first threshold value M₀, generating a second threshold A₁ according to said first noise energy S₀ and using said second threshold A₁ as the first threshold A₀ at time t_i+1, wherein the first noise energy S₀ is obtained by decimating said sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sampling points ys, x is a natural number greater than 1, n is a positive integer and n is less than i.

2. The method of claim 1, wherein said first noise energy S is based on said first noise energy₀Generating a second threshold A₁The method comprises the following steps:

the first noise energy S₀As the second threshold value A₁；

3. The method of claim 1, wherein said computing said sampled signal y_iAudio energy T of_iThen, the method further comprises the following steps:

when the audio energy T_i is less than the first threshold A₀ at time t_i, and the differences between the respective first thresholds A₀ and a second noise energy F₀ from time t_i-m until time t_i are all greater than a preset second threshold M₁, performing VAD, wherein m is a positive integer and m is smaller than i;

4. The method of claim 3, wherein generating the third threshold A₂ according to the second noise energy F₀ comprises:

using the second noise energy F₀ as the third threshold A₂;

5. The method of claim 3 or 4, wherein before using the third threshold A₂ as the first threshold A₀ at time t_i+1, the method further comprises:

recording the time t_i as the time at which the threshold is lowered;

6. The method of claim 1, wherein after calculating the audio energy T_i of the sampled signal y_i, the method further comprises:

7. The method of claim 6, wherein generating the fourth threshold A₃ according to the first noise energy S₀ comprises:

using the first noise energy S₀ as the fourth threshold A₃;

8. The method of claim 6 or 7, wherein before using the fourth threshold A₃ as the first threshold A₀ at time t_i+1, the method further comprises:

recording the time t_i as the time at which the threshold is lowered;

9. A voice wake-up apparatus, comprising:

a threshold decision circuit, configured to determine whether the audio energy T_i is greater than or equal to the first threshold A₀ at time t_i, and, when the audio energy T_i is greater than or equal to the first threshold A₀ at time t_i, to trigger an interrupt processing circuit to output an interrupt pulse signal to an interrupt control circuit, so that the interrupt control circuit enables a processor to perform voice activation detection (VAD);

a comparator, having an input coupled to the output of the first decimator and to the threshold decision circuit, configured to determine whether the difference between the first noise energy S₀ and the first threshold A₀ at time t_i is greater than a preset first threshold M₀;

10. The apparatus of claim 9, wherein the configurator is specifically configured to:

use the first noise energy S₀ as the second threshold A₁;

11. The apparatus of claim 9, further comprising:

a fast tracking filter (FTF), having an input coupled to an output of the second decimator, configured to perform fast tracking filtering on the decimated sampling points yf to obtain a second noise energy F₀;

the comparator, whose input is also coupled to the output of the FTF, is further configured to: when the audio energy T_i is less than the first threshold A₀ at time t_i, determine whether the difference between the first threshold A₀ at each time and the second noise energy F₀ is greater than a preset second threshold M₁; and, when the differences between the respective first thresholds A₀ and the second noise energy F₀ from time t_i-m until time t_i are all greater than the preset second threshold M₁, trigger the interrupt processing circuit to output an interrupt pulse signal to the interrupt control circuit, so that the interrupt control circuit enables the processor to perform VAD, wherein m is a positive integer and m is smaller than i;

the configurator is further configured to: when the VAD detection succeeds, generate a third threshold A₂ according to the second noise energy F₀, and issue the third threshold A₂ to the threshold decision circuit as the first threshold A₀ at time t_i+1.

12. The apparatus of claim 11, wherein the configurator is specifically configured to:

use the second noise energy F₀ as the third threshold A₂;

13. The apparatus of claim 11 or 12, wherein the configurator is further configured to:

record the time t_i as the time at which the threshold is lowered;

14. The apparatus of claim 9, wherein the configurator is further configured to:

15. The apparatus of claim 14, wherein the configurator is specifically configured to:

use the first noise energy S₀ as the fourth threshold A₃;

16. The apparatus of claim 14 or 15, wherein the configurator is further configured to:

record the time t_i as the time at which the threshold is lowered;