Disclosure of Invention
Embodiments of the present invention provide a voice wake-up method and apparatus for reducing the power consumption of a terminal in a noisy environment.
In a first aspect, an embodiment of the present invention provides a voice wake-up method, including:
periodically sampling an audio signal, wherein sampling at time t_i yields a sampled signal y_i, and i is a positive integer;
calculating the audio energy T_i of the sampled signal y_i;
performing voice activity detection (VAD) when the audio energy T_i is greater than or equal to the first threshold A0 at time t_i;
when the VAD detection fails, the n consecutive VAD detections before time t_i also failed, and the difference between a first noise energy S0 and the first threshold A0 at time t_i is greater than a preset first threshold M0, generating a second threshold A1 according to the first noise energy S0 and using the second threshold A1 as the first threshold A0 at time t_(i+1), wherein the first noise energy S0 is obtained by decimating the sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sampling points ys, x is a natural number greater than 1, n is a positive integer, and n is smaller than i.
With reference to the first aspect, in a first possible implementation manner of the first aspect, generating the second threshold A1 according to the first noise energy S0 includes:
using the first noise energy S0 as the second threshold A1;
or using the sum of the first noise energy S0 and a preset first correction quantity N0 as the second threshold A1;
or using the product of the first noise energy S0 and a preset first coefficient a0 as the second threshold A1.
With reference to the first aspect, in a second possible implementation manner of the first aspect, after calculating the audio energy T_i of the sampled signal y_i, the method further includes:
performing VAD when the audio energy T_i is less than the first threshold A0 at time t_i and the differences between the respective first thresholds A0 from time t_(i-m) to time t_i and a second noise energy F0 are all greater than a preset second threshold M1, wherein m is a positive integer and m is smaller than i;
when the VAD detection succeeds, generating a third threshold A2 according to the second noise energy F0 and using the third threshold A2 as the first threshold A0 at time t_(i+1), wherein the second noise energy F0 is obtained by decimating the sampled signal y_i at a second decimation rate 1/z and performing fast tracking filtering on the decimated sampling points yf, and z is a natural number greater than x.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, generating the third threshold A2 according to the second noise energy F0 includes:
using the second noise energy F0 as the third threshold A2;
or using the sum of the second noise energy F0 and a preset second correction quantity N1 as the third threshold A2;
or using the product of the second noise energy F0 and a preset second coefficient a1 as the third threshold A2.
With reference to the second or third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, before using the third threshold A2 as the first threshold A0 at time t_(i+1), the method further includes:
recording time t_i as a threshold-lowering moment;
when the time interval between time t_i and the previous threshold-lowering moment is greater than a preset value T_time, executing the step of using the third threshold A2 as the first threshold A0 at time t_(i+1); otherwise, not executing the step of using the third threshold A2 as the first threshold A0 at time t_(i+1).
With reference to the first aspect, in a fifth possible implementation manner of the first aspect, after calculating the audio energy T_i of the sampled signal y_i, the method further includes:
when the audio energy T_i is less than the first threshold A0 at time t_i and the difference between the first threshold A0 at time t_i and the first noise energy S0 is greater than a preset third threshold M2, generating a fourth threshold A3 according to the first noise energy S0 and using the fourth threshold A3 as the first threshold A0 at time t_(i+1).
With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, generating the fourth threshold A3 according to the first noise energy S0 includes:
using the first noise energy S0 as the fourth threshold A3;
or using the sum of the first noise energy S0 and a preset third correction quantity N2 as the fourth threshold A3;
or using the product of the first noise energy S0 and a preset third coefficient a2 as the fourth threshold A3.
With reference to the fifth or sixth possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, before using the fourth threshold A3 as the first threshold A0 at time t_(i+1), the method further includes:
recording time t_i as a threshold-lowering moment;
when the time interval between time t_i and the previous threshold-lowering moment is greater than a preset value T_time, executing the step of using the fourth threshold A3 as the first threshold A0 at time t_(i+1); otherwise, not executing the step of using the fourth threshold A3 as the first threshold A0 at time t_(i+1).
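The threshold-lowering branch of the fifth to seventh implementation manners can be sketched in Python as follows. This is only an illustration under stated assumptions: the names LoweringGuard and maybe_lower are hypothetical, the first lowering request is assumed to be allowed, a request is assumed to be recorded as a threshold-lowering moment even when it is rejected (one reading of the recording step), and the simplest variant A3 = S0 is used.

```python
from typing import Optional


class LoweringGuard:
    """Rate limiter for threshold lowering (hypothetical helper)."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval          # plays the role of the preset T_time
        self.last_lowering: Optional[float] = None

    def allow(self, now: float) -> bool:
        previous = self.last_lowering
        # Record t_i as a threshold-lowering moment before checking the interval.
        # Recording rejected requests as well is an assumption of this sketch.
        self.last_lowering = now
        # Assumption: the very first request has no predecessor and is allowed.
        return previous is None or (now - previous) > self.min_interval


def maybe_lower(threshold: float, energy: float, slow_noise: float,
                m2: float, guard: LoweringGuard, now: float) -> float:
    """Return the first threshold A0 to use at t_(i+1)."""
    below = energy < threshold                    # T_i < A0
    gap_large = (threshold - slow_noise) > m2     # A0 - S0 > M2
    if below and gap_large and guard.allow(now):
        return slow_noise                         # simplest variant: A3 = S0
    return threshold
```

With min_interval set to 5, a lowering request at time 0 is applied, a second request at time 3 is skipped, and a third request at time 9 is applied again, so the threshold cannot be lowered in rapid succession.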
In a second aspect, an embodiment of the present invention provides a voice wake-up apparatus, including:
a sampling frequency converter (SRC), configured to periodically sample an audio signal, wherein sampling at time t_i yields a sampled signal y_i, and i is a positive integer;
an arithmetic circuit, configured to calculate the audio energy T_i of the sampled signal y_i;
a threshold decision circuit, configured to determine whether the audio energy T_i is greater than or equal to the first threshold A0 at time t_i and, when it is, to trigger an interrupt processing circuit to output an interrupt pulse signal to an interrupt control circuit, so that the interrupt control circuit enables a digital signal processor (DSP) or a processor to perform voice activity detection (VAD);
a first decimator, an input of which is coupled to an output of the SRC, configured to decimate the sampled signal y_i at a first decimation rate 1/x to obtain sampling points ys, wherein x is a natural number greater than 1;
a slow tracking filter (STF), an input of which is coupled to an output of the first decimator, configured to perform slow tracking filtering on the decimated sampling points ys to obtain a first noise energy S0;
a comparator, an input of which is coupled to an output of the STF and to the threshold decision circuit, configured to compare whether the difference between the first noise energy S0 and the first threshold A0 at time t_i is greater than a preset first threshold M0;
a configurator, configured to, when the VAD detection fails, the n consecutive VAD detections before time t_i also failed, and the difference between the first noise energy S0 and the first threshold A0 at time t_i is greater than the preset first threshold M0, generate a second threshold A1 according to the first noise energy S0, use the second threshold A1 as the first threshold A0 at time t_(i+1), and send it to the threshold decision circuit, wherein n is a positive integer and n is smaller than i.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the configurator is specifically configured to:
use the first noise energy S0 as the second threshold A1;
or use the sum of the first noise energy S0 and a preset first correction quantity N0 as the second threshold A1;
or use the product of the first noise energy S0 and a preset first coefficient a0 as the second threshold A1.
With reference to the second aspect, in a second possible implementation manner of the second aspect, the apparatus further includes:
a second decimator, an input of which is coupled to an output of the SRC, configured to decimate the sampled signal y_i at a second decimation rate 1/z to obtain sampling points yf, wherein z is a natural number greater than x;
a fast tracking filter (FTF), an input of which is coupled to an output of the second decimator, configured to perform fast tracking filtering on the decimated sampling points yf to obtain a second noise energy F0;
the comparator is further coupled to an output of the FTF and is further configured to, when the audio energy T_i is less than the first threshold A0 at time t_i, compare whether the difference between the first threshold A0 at each time and the second noise energy F0 is greater than a preset second threshold M1, and, when the differences between the respective first thresholds A0 from time t_(i-m) to time t_i and the second noise energy F0 are all greater than the preset second threshold M1, trigger the interrupt processing circuit to output an interrupt pulse signal to the interrupt control circuit, so that the interrupt control circuit enables the DSP or the processor to perform VAD, wherein m is a positive integer and m is smaller than i;
the configurator is further configured to, when the VAD detection succeeds, generate a third threshold A2 according to the second noise energy F0, use the third threshold A2 as the first threshold A0 at time t_(i+1), and send it to the threshold decision circuit.
With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the configurator is specifically configured to:
use the second noise energy F0 as the third threshold A2;
or use the sum of the second noise energy F0 and a preset second correction quantity N1 as the third threshold A2;
or use the product of the second noise energy F0 and a preset second coefficient a1 as the third threshold A2.
With reference to the second or third possible implementation manner of the second aspect, in a fourth possible implementation manner of the second aspect, the configurator is further configured to:
record time t_i as a threshold-lowering moment;
when the time interval between time t_i and the previous threshold-lowering moment is greater than a preset value T_time, execute the step of using the third threshold A2 as the first threshold A0 at time t_(i+1); otherwise, not execute the step of using the third threshold A2 as the first threshold A0 at time t_(i+1).
With reference to the second aspect, in a fifth possible implementation manner of the second aspect, the configurator is further configured to:
when the audio energy T_i is less than the first threshold A0 at time t_i and the difference between the first threshold A0 at time t_i and the first noise energy S0 is greater than a preset third threshold M2, generate a fourth threshold A3 according to the first noise energy S0 and use the fourth threshold A3 as the first threshold A0 at time t_(i+1).
With reference to the fifth possible implementation manner of the second aspect, in a sixth possible implementation manner of the second aspect, the configurator is specifically configured to:
use the first noise energy S0 as the fourth threshold A3;
or use the sum of the first noise energy S0 and a preset third correction quantity N2 as the fourth threshold A3;
or use the product of the first noise energy S0 and a preset third coefficient a2 as the fourth threshold A3.
With reference to the fifth or sixth possible implementation manner of the second aspect, in a seventh possible implementation manner of the second aspect, the configurator is further configured to:
record time t_i as a threshold-lowering moment;
when the time interval between time t_i and the previous threshold-lowering moment is greater than a preset value T_time, execute the step of using the fourth threshold A3 as the first threshold A0 at time t_(i+1); otherwise, not execute the step of using the fourth threshold A3 as the first threshold A0 at time t_(i+1).
Embodiments of the present invention provide a voice wake-up method and apparatus. The audio energy T_i of the sampled signal y_i obtained by sampling at time t_i is calculated, and VAD is performed when the audio energy T_i is greater than or equal to the first threshold A0 at time t_i. When the VAD detection fails, the n consecutive VAD detections before time t_i also failed, and the difference between the first noise energy S0 and the first threshold A0 at time t_i is greater than a preset first threshold M0, the first threshold A0 is adjusted to obtain the first threshold A0 for time t_(i+1): a second threshold A1 is generated according to the first noise energy S0 and is used as the first threshold A0 at time t_(i+1). The first noise energy S0 is obtained by decimating the sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sampling points ys. In other words, the first threshold A0 at time t_(i+1) is derived from the first noise energy S0 at time t_i. The terminal can thus adjust the first threshold A0 for the next time according to the current ambient noise, so that the first threshold A0 at each time matches the environment. This reduces the number of VAD runs and therefore the power consumption of the terminal in a noisy environment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Voice wake-up means that, in any situation, the terminal can be activated and a specific application can be executed by means of a predefined wake-up word. This is comparable to the user pressing a key to light the screen and activate the processing of the mobile phone. The advantage of voice wake-up is that it frees the user's hands.
In a voice wake-up scheme for a smartphone, the standby power consumption of the smartphone is about 2.2 milliamps × 3.8 volts in a quiet environment and about 5.5 milliamps × 3.8 volts in a noisy environment. The difference in power consumption between the noisy and quiet environments is therefore about 12.5 milliwatts: (5.5 - 2.2) × 3.8 ≈ 12.5.
According to the power consumption estimation model, average power consumption = quiet power consumption × 70% + noisy power consumption × 30%. Power consumption in a noisy environment should therefore be reduced, and the embodiments of the invention focus on power consumption optimization in a noisy environment.
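The arithmetic behind these figures can be checked with a short Python sketch. The currents, the voltage, and the 70%/30% weighting are taken from the text above; the function and variable names are illustrative only.

```python
# Standby figures quoted in the text: current in milliamps, supply voltage in volts.
QUIET_MA = 2.2
NOISY_MA = 5.5
SUPPLY_V = 3.8


def power_mw(current_ma: float, voltage_v: float = SUPPLY_V) -> float:
    """Power in milliwatts (mA x V = mW)."""
    return current_ma * voltage_v


def average_power_mw(quiet_mw: float, noisy_mw: float,
                     quiet_share: float = 0.7) -> float:
    """Estimation model from the text: quiet x 70% + noisy x 30%."""
    return quiet_mw * quiet_share + noisy_mw * (1.0 - quiet_share)


quiet_mw = power_mw(QUIET_MA)      # about 8.36 mW
noisy_mw = power_mw(NOISY_MA)      # about 20.9 mW
delta_mw = noisy_mw - quiet_mw     # about 12.54 mW
avg_mw = average_power_mw(quiet_mw, noisy_mw)   # about 12.12 mW
```

Under this model the noisy environment accounts for roughly half of the average (about 6.27 mW of 12.12 mW) although it is assumed to occupy only 30% of the time, which is consistent with the focus on the noisy case.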
The embodiment of the invention provides a method and a device for waking up a digital signal processor in a terminal by voice, which are used for reducing the frequency of waking up a DSP in the terminal to perform VAD and realizing the reduction of power consumption of the terminal in a noisy environment.
Fig. 1 is a flowchart of a voice wake-up method according to a first embodiment of the present invention. The method may be performed by a voice wake-up apparatus, which may be implemented in hardware. The voice wake-up device may be integrated in a terminal, such as a tablet computer, a smart phone, a Personal Digital Assistant (PDA), and the like. As shown in fig. 1, the voice wake-up method includes:
S101, periodically sampling an audio signal, wherein sampling at time t_i yields a sampled signal y_i, and i is a positive integer.
Similarly, the sampled signal at time t_(i-1) can be denoted y_(i-1), the sampled signal at time t_(i+1) can be denoted y_(i+1), and so on; these are not enumerated one by one.
In any embodiment of the present invention, the audio signal may be a signal collected by a sound collection device such as a microphone. An audio signal collected by a sound collection device such as a microphone is periodically sampled by a sampling frequency converter (SRC). Or, after the audio signal collected by the sound collection device such as the microphone is processed by a filter such as a band pass filter, the audio signal is periodically sampled by the SRC, which is not limited in the embodiment of the present invention.
S102, calculating the audio energy T_i of the sampled signal y_i.
It should be noted that the audio energy of a sampled signal may be calculated once the sampled signal has been obtained. For example, after the sampled signal y_(i-1) is obtained by sampling at time t_(i-1), the corresponding audio energy T_(i-1) is also calculated.
Those skilled in the art will appreciate that the sampled signal y_i is a known quantity, and therefore the audio energy T_i of the sampled signal y_i can be obtained by calculation.
Specifically, x(j) represents the amplitude of the sampled signal y_i at the j-th sampling point, x(j)·x(j) represents the energy of the sampled signal y_i at the j-th sampling point, j is an integer from 0 to M-1, M is the total number of sampling points, the coefficient a_j represents the weight of each sampling point, and T_i represents the audio energy of the sampled signal y_i. For example, the following equation is a normalized process, specifically representing the percentage of the total energy occupied at each sample point:
wherein,
The above is only one example of calculating the audio energy T_i of the sampled signal y_i, and the embodiment of the present invention is not limited thereto. The audio energy T_i of the sampled signal y_i may also be obtained by root mean square (RMS) or other similar methods, for example without normalization.
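As a rough illustration of S102, the Python sketch below computes an audio energy from the sample amplitudes x(j). It is not the formula of the embodiment: it assumes that T_i is a weighted sum of the squared amplitudes, sum over j of a_j · x(j)², and it assumes uniform weights a_j = 1/M when none are given. The RMS variant mentioned in the text is shown alongside. The function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def weighted_energy(samples: Sequence[float],
                    weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of squared amplitudes: sum_j a_j * x(j)^2.

    Assumption of this sketch: when no weights are supplied, every
    sampling point gets the same weight 1/M.
    """
    m = len(samples)
    if m == 0:
        return 0.0
    if weights is None:
        weights = [1.0 / m] * m
    if len(weights) != m:
        raise ValueError("weights and samples must have the same length")
    return sum(a * x * x for a, x in zip(weights, samples))


def rms(samples: Sequence[float]) -> float:
    """Root mean square of the amplitudes, the alternative named in the text."""
    m = len(samples)
    if m == 0:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / m)
```

Either value can serve as T_i in the comparison of S103, provided the first threshold A0 is expressed on the same scale.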
S103, performing VAD when the audio energy T_i is greater than or equal to the first threshold A0 at time t_i.
Specifically, the component performing VAD may be a DSP or a processor in the terminal.
S104, when the VAD detection fails, the n consecutive VAD detections before time t_i also failed, and the difference between the first noise energy S0 and the first threshold A0 at time t_i is greater than a preset first threshold M0, generating a second threshold A1 according to the first noise energy S0 and using the second threshold A1 as the first threshold A0 at time t_(i+1), wherein the first noise energy S0 is obtained by decimating the sampled signal y_i at a first decimation rate 1/x and performing slow tracking filtering on the decimated sampling points ys, x is a natural number greater than 1, n is a positive integer, and n is smaller than i.
Note that "the VAD detection fails and the n consecutive VAD detections before time t_i also failed" means that the VAD detection at time t_i fails and all VAD detections performed from time t_(i-n) to time t_(i-1) failed. For example, if n is 2, the condition means that, in addition to the failure at time t_i, the VAD detections at the two preceding times (t_(i-2) and t_(i-1)) both failed. To illustrate a VAD detection failure: suppose the sound of an automobile engine is present and its audio energy is greater than the first threshold A0 at the current time. VAD must then be performed, but it fails because the sound is determined not to be the user's voice. In other words, if the terminal is in a high-noise environment, the energy of the ambient noise is relatively high; once it exceeds the first threshold A0 at the current time, VAD must be started, but because ambient noise is disordered, no useful voice signal can be detected and the VAD detection fails. The first noise energy S0 represents the energy level of the stationary noise of the environment in which the terminal is located. The first threshold M0 is a preset parameter and can be determined by debugging.
It should also be noted that, in any embodiment of the present invention, the first and the second are used for distinguishing the same term, for example, the "first" of the "first threshold" and the "second" of the "second threshold" are only named ways for distinguishing different thresholds, and do not represent the order between the thresholds.
In actual application scenarios, the noise level differs from scenario to scenario. For example, in a quiet environment, the noise is about 30 to 35 decibels (dB). In a noisy environment, the following data may be used as a reference for ambient noise: about 60 dB in a shopping mall, about 70 dB on a road, about 70 dB inside an aircraft cabin, about 80 dB on public transport, about 90 dB in a subway, and so on. In addition, the noise level at the same place differs at different times. For example, the noise at the same location may differ by 10 to 15 dB between day and night.
Moreover, when a user carries out conversation or talk in a noisy environment, the voice volume can be increased subconsciously, so that the Signal to Noise Ratio (SNR) is increased, and a feasible basis is provided for voice awakening.
Therefore, in the current voice awakening scheme adopting a uniform noise threshold, namely a preset threshold, when the terminal is awakened by voice, a quiet environment and a noisy environment cannot be distinguished, and if the preset threshold is set to be too high, voice missing detection can be caused; if the preset threshold is set too low, the processor is frequently awakened, and the power consumption is large.
In the embodiments of the present invention, the magnitude of the first threshold A0 at each time is adjusted in a timely manner.
Specifically, through S101 to S103, the audio energy T_i of the sampled signal y_i obtained by sampling at time t_i is calculated and compared with the first threshold A0 at time t_i. When the audio energy T_i is greater than or equal to the first threshold A0 at time t_i, VAD is started, so that the DSP, the processor, or a similar component performs VAD, and whether to wake up the terminal is determined from the VAD result. The VAD detection succeeds if the DSP, the processor, or another element capable of VAD detects the user's voice in the sampled signal y_i, in which case the terminal is woken up. Otherwise the VAD detection fails, that is, the user's voice is not detected in the sampled signal y_i, and the terminal is not woken up.
In S104, a difference between the first noise energy S0 and the first threshold A0 at time t_i that is greater than the preset first threshold M0 indicates that the terminal may currently be in an environment with high background noise. In this case, the second threshold A1 is generated according to the first noise energy S0 and is used as the first threshold A0 at time t_(i+1). The first noise energy S0 is obtained by decimating the sampled signal y_i at the first decimation rate 1/x and performing slow tracking filtering on the decimated sampling points ys, where x is a natural number greater than 1 and n is a positive integer smaller than i. In practice, the sampled signal y_i may include both the user's voice and the ambient noise at time t_i, or only the ambient noise at time t_i. The first threshold A0 for time t_(i+1), obtained at time t_i, is the first threshold that the terminal uses when executing S103 and S104 of the voice wake-up method at time t_(i+1).
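The decision made in S104 can be sketched in Python as follows. The sketch rests on several assumptions: the names RaiseState and update_after_vad are hypothetical, the function is assumed to be called only at times when VAD was actually run, the difference is taken as S0 minus A0, and the second threshold defaults to the simplest variant A1 = S0.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RaiseState:
    threshold: float                 # first threshold A0 at the current time
    consecutive_failures: int = 0    # failed VAD runs immediately before this one


def update_after_vad(state: RaiseState, vad_succeeded: bool, slow_noise: float,
                     n: int, m0: float,
                     make_threshold: Callable[[float], float] = lambda s0: s0
                     ) -> float:
    """Return the first threshold A0 to use at t_(i+1) after a VAD run at t_i."""
    if vad_succeeded:
        state.consecutive_failures = 0
        return state.threshold
    # The VAD run at t_i failed. Raise only if the n runs before it failed too
    # and the noise floor exceeds the threshold by more than M0.
    # Assumption: the difference is S0 - A0.
    raise_needed = (state.consecutive_failures >= n
                    and (slow_noise - state.threshold) > m0)
    state.consecutive_failures += 1
    if raise_needed:
        state.threshold = make_threshold(slow_noise)   # A1 becomes the next A0
    return state.threshold
```

With n = 2, M0 = 5, an initial threshold of 50 and a slow-tracked noise energy of 60, the first two failures leave the threshold unchanged and the third failure raises it to 60.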
If the voice wake-up at time t_i is the first voice wake-up, the first threshold A0 at time t_i may be preset. The preset first threshold A0 can be regarded as an optimized parameter for a likely application scenario. For example, a first threshold A0 preset to 50 dB can be regarded as the background noise threshold in a quiet environment. Fig. 2 illustrates the first threshold in a quiet environment and in a noisy environment. As shown in Fig. 2, in a quiet environment the first threshold is higher than the ambient noise by a first preset value, and in a noisy environment the first threshold is higher than the ambient noise by a second preset value. In addition, the first threshold for a noisy environment is higher than the first threshold for a quiet environment.
In addition, S103 may be any of the following: 1) performing VAD when the difference between the audio energy T_i and the audio energy T_(i-1) at time t_(i-1) is greater than or equal to the difference threshold A00 at time t_i; or 2) performing VAD when the audio energy T_i is greater than or equal to the first threshold A0 at time t_i and the difference between the audio energy T_i and the audio energy T_(i-1) at time t_(i-1) is greater than or equal to the difference threshold A00 at time t_i; or 3) performing VAD when either condition is satisfied, that is, when the audio energy T_i is greater than or equal to the first threshold A0 at time t_i, or the difference between the audio energy T_i and the audio energy T_(i-1) at time t_(i-1) is greater than or equal to the difference threshold A00 at time t_i. The audio energy T_(i-1) at time t_(i-1) is the audio energy of the sampled signal y_(i-1), calculated at time t_(i-1) and cached in the terminal.
In case 1), the difference threshold A00 at time t_i is adjusted in a manner similar to the adjustment of the first threshold A0 at time t_i. In case 2), both the first threshold A0 and the difference threshold A00 at time t_i are adjusted in that manner. In case 3), either the first threshold A0 or the difference threshold A00 at time t_i is adjusted in that manner.
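The three trigger variants can be written as one small Python predicate. This is a sketch only: the function name and the mode labels are hypothetical, and the difference is assumed to be T_i minus T_(i-1), so that a jump in energy triggers VAD.

```python
def should_run_vad(mode: str, energy: float, prev_energy: float,
                   threshold: float, diff_threshold: float) -> bool:
    """Decide whether to start VAD at time t_i.

    mode "difference" corresponds to variant 1), "both" to variant 2),
    and "either" to variant 3).
    """
    level_hit = energy >= threshold                       # T_i >= A0
    # Assumption: the difference is T_i - T_(i-1).
    jump_hit = (energy - prev_energy) >= diff_threshold   # difference >= A00
    if mode == "difference":
        return jump_hit
    if mode == "both":
        return level_hit and jump_hit
    if mode == "either":
        return level_hit or jump_hit
    raise ValueError(f"unknown mode: {mode}")
```

For example, with A0 = 50 and A00 = 10, an energy that rises from 30 to 45 triggers VAD under variants 1) and 3) but not under variant 2).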
In this embodiment of the invention, the audio energy T_i of the sampled signal y_i obtained by sampling at time t_i is calculated, and VAD is performed when the audio energy T_i is greater than or equal to the first threshold A0 at time t_i. When the VAD detection fails, the n consecutive VAD detections before time t_i also failed, and the difference between the first noise energy S0 and the first threshold A0 at time t_i is greater than the preset first threshold M0, the first threshold A0 is adjusted to obtain the first threshold A0 for time t_(i+1): the second threshold A1 is generated according to the first noise energy S0 and is used as the first threshold A0 at time t_(i+1). The first noise energy S0 is obtained by decimating the sampled signal y_i at the first decimation rate 1/x and performing slow tracking filtering on the decimated sampling points ys. In other words, the first threshold A0 at time t_(i+1) is derived from the first noise energy S0 at time t_i. The terminal can thus adjust the first threshold A0 for the next time according to the current ambient noise, so that the first threshold A0 at each time matches the environment. This reduces the number of VAD runs and therefore the power consumption of the terminal in a noisy environment.
In the above embodiment, generating the second threshold A1 according to the first noise energy S0 includes: using the first noise energy S0 as the second threshold A1, that is, A1 = S0; or using the sum of the first noise energy S0 and a preset first correction quantity N0 as the second threshold A1, that is, A1 = S0 + N0; or using the product of the first noise energy S0 and a preset first coefficient a0 as the second threshold A1, that is, A1 = a0 × S0.
If the first correction quantity N0 is larger, the second threshold A1 rises quickly on the basis of the first noise energy S0; if the first correction quantity N0 is smaller, the second threshold A1 rises slowly on the basis of the first noise energy S0. The magnitude of the first correction quantity N0 can be set according to the actual scenario, and the embodiment of the invention does not limit it. Similarly, if the first coefficient a0 is larger, the second threshold A1 rises quickly on the basis of the first noise energy S0; if the first coefficient a0 is smaller, the second threshold A1 rises slowly on the basis of the first noise energy S0. The magnitude of the first coefficient a0 can likewise be set according to the actual scenario, and the embodiment of the invention does not limit it.
Optionally, the sum of the preset first correction quantity N0 and the product of the first noise energy S0 and the preset first coefficient a0 may also be used as the second threshold A1, that is, A1 = a0 × S0 + N0.
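All four ways of generating the second threshold are instances of A1 = a0 × S0 + N0, as the Python sketch below shows. The function name and the default values are illustrative; a0 = 1 and N0 = 0 reproduce the variants in which the coefficient or the correction quantity is absent.

```python
def make_threshold(noise_energy: float, coefficient: float = 1.0,
                   correction: float = 0.0) -> float:
    """Generate a threshold from a noise energy: A1 = a0 * S0 + N0.

    coefficient=1, correction=0  ->  A1 = S0
    coefficient=1                ->  A1 = S0 + N0
    correction=0                 ->  A1 = a0 * S0
    """
    return coefficient * noise_energy + correction


a1_plain = make_threshold(40.0)                                      # 40.0
a1_offset = make_threshold(40.0, correction=6.0)                     # 46.0
a1_scaled = make_threshold(40.0, coefficient=1.25)                   # 50.0
a1_both = make_threshold(40.0, coefficient=1.25, correction=6.0)     # 56.0
```

A larger N0 or a0 places the new threshold further above the noise floor, which lowers the number of VAD runs at the cost of requiring a louder voice to trigger detection.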
Fig. 3 is a flowchart of a voice wake-up method according to a second embodiment of the present invention. As shown in fig. 3, the method may include:
S301, periodically sampling an audio signal, wherein sampling at time t_i yields a sampled signal y_i, and i is a positive integer.
S302, calculating the audio energy T_i of the sampled signal y_i.
S303, performing VAD when the audio energy T_i is less than the first threshold A0 at time t_i and the differences between the respective first thresholds A0 from time t_(i-m) to time t_i and a second noise energy F0 are all greater than a preset second threshold M1, wherein m is a positive integer and m is smaller than i.
For example, if m is 2, VAD is performed when the audio energy T_i is less than the first threshold A0 at time t_i, the difference between the first threshold A0 at time t_(i-2) and the second noise energy F0 is greater than the second threshold M1, the difference between the first threshold A0 at time t_(i-1) and the second noise energy F0 is greater than the second threshold M1, and the difference between the first threshold A0 at time t_i and the second noise energy F0 is greater than the second threshold M1.
S304, when the VAD detection succeeds, generating a third threshold A2 according to the second noise energy F0 and using the third threshold A2 as the first threshold A0 at time t_(i+1), wherein the second noise energy F0 is obtained by decimating the sampled signal y_i at a second decimation rate 1/z and performing fast tracking filtering on the decimated sampling points yf, and z is a natural number greater than x.
For specific description of S301 and S302, reference may be made to the embodiment shown in fig. 1, which is not repeated herein.
For S303, at audio energy TiLess than tiFirst threshold value A of time0In the case of prior art voice wake-up schemes, VAD is no longer performed, and thus, there may be a situation where the user's voice is missed. E.g. tiFirst threshold value A of time0Suitable for noisy environments, but when the terminal is in a relatively quiet environment (e.g., low background noise environment), resulting in a sampled signal yiMissed detection of the user's voice. In the embodiment of the invention, t is changed through S303 and S304i+1First threshold value A of time0Matching it to the current environment.
When, at each time from t_{i-m} to t_i, the difference between the first threshold A_0 at that time and the second noise energy F_0 is greater than the preset second threshold value M_1, that is, when this condition has been met m+1 times in a row, the terminal is in a quiet environment (one with low background noise) and the current first threshold A_0 is too large, so it needs to be lowered to match the quiet environment. The second threshold value M_1 is a preset parameter that can be obtained through debugging.
For S304: successful VAD detection indicates that the sampled signal y_i contains the user's voice. To avoid missing the user's voice, a third threshold A_2 is generated according to the second noise energy F_0 and used as the first threshold A_0 at time t_{i+1}. The second noise energy F_0 is obtained by decimating the sampled signal y_i at the second decimation rate 1/z and performing fast tracking filtering on the decimated sampling points yf, so the second noise energy F_0 reflects, to some extent, the transient noise energy level of the environment in which the terminal is located.
In this embodiment of the invention, the audio energy T_i of the sampled signal y_i obtained by sampling at time t_i is calculated, and VAD is performed when the audio energy T_i is less than the first threshold A_0 at time t_i and, at each time from t_{i-m} to t_i, the difference between the first threshold A_0 at that time and the second noise energy F_0 is greater than the preset second threshold value M_1. When the VAD detection succeeds, a third threshold A_2 is generated according to the second noise energy F_0 and used as the first threshold A_0 at time t_{i+1}. The second noise energy F_0 is obtained by decimating the sampled signal y_i at the second decimation rate 1/z and performing fast tracking filtering on the decimated sampling points yf. In other words, the first threshold A_0 at time t_{i+1} is derived from the second noise energy F_0 at time t_i. The terminal can therefore adjust the first threshold A_0 for the next time according to the current environmental noise, so that the first threshold A_0 at each time matches the environment. This reduces the number of VAD runs, and hence the power consumption of the terminal in a noisy environment, while further avoiding missed detection of the user's voice in the sampled signal y_i.
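The decision logic of S303 and S304 can be sketched as follows. This is only an illustrative sketch, not the claimed implementation: the class name `QuietEnvironmentCheck`, the `run_vad` callback, and the choice of the simplest variant A_2 = F_0 are assumptions made for the example, as is the reading that the difference A_0 - F_0 is evaluated with the values available at each time.

```python
from collections import deque
from typing import Callable


class QuietEnvironmentCheck:
    """Hypothetical helper illustrating S303/S304 (names are assumptions)."""

    def __init__(self, m: int, M1: float):
        self.M1 = M1
        # Holds A_0 - F_0 for the times t_{i-m} .. t_i (m + 1 entries).
        self.gaps = deque(maxlen=m + 1)

    def step(self, T_i: float, A0: float, F0: float,
             run_vad: Callable[[], bool]) -> float:
        """Return the first threshold A_0 to use at time t_{i+1}."""
        self.gaps.append(A0 - F0)
        window_full = len(self.gaps) == self.gaps.maxlen
        if T_i < A0 and window_full and all(g > self.M1 for g in self.gaps):
            if run_vad():      # S303: VAD is performed although T_i < A_0
                return F0      # S304: A_2 = F_0 (simplest variant, assumed)
        return A0              # otherwise the threshold is left unchanged


check = QuietEnvironmentCheck(m=2, M1=5.0)
A0 = 20.0
for _ in range(3):
    A0 = check.step(T_i=10.0, A0=A0, F0=3.0, run_vad=lambda: True)
```

In this example the threshold is lowered only on the third call, once the gap has exceeded M_1 at m + 1 = 3 times in a row and the VAD callback reports success.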
In the above embodiment, generating the third threshold A_2 according to the second noise energy F_0 may specifically include: using the second noise energy F_0 as the third threshold A_2; or using the sum of the second noise energy F_0 and a preset second correction amount N_1 as the third threshold A_2, i.e., A_2 = F_0 + N_1; or using the product of the second noise energy F_0 and a preset second coefficient a_1 as the third threshold A_2, i.e., A_2 = a_1 × F_0.
A larger second correction amount N_1 means that the third threshold A_2 rises quickly above the second noise energy F_0; a smaller second correction amount N_1 means that it rises slowly. The magnitude of the second correction amount N_1 can be set according to the actual scenario, which the embodiment of the invention does not limit. Similarly, a larger second coefficient a_1 means that the third threshold A_2 rises quickly above the second noise energy F_0, and a smaller second coefficient a_1 means that it rises slowly. The magnitude of the second coefficient a_1 can likewise be set according to the actual scenario, which the embodiment of the invention does not limit.
Optionally, the product of the second noise energy F_0 and the preset second coefficient a_1, plus the preset second correction amount N_1, may also be used as the third threshold A_2, i.e., A_2 = a_1 × F_0 + N_1.
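The four variants above are special cases of one affine rule, threshold = coefficient × noise energy + correction. A minimal sketch, in which the function name `make_threshold` and its default arguments are assumptions made for illustration:

```python
def make_threshold(noise_energy: float,
                   coeff: float = 1.0,
                   correction: float = 0.0) -> float:
    """Generate a threshold from a noise energy.

    coeff=1, correction=0   -> threshold = noise energy
    correction=N_1          -> A_2 = F_0 + N_1
    coeff=a_1               -> A_2 = a_1 * F_0
    both                    -> A_2 = a_1 * F_0 + N_1
    """
    return coeff * noise_energy + correction


F0 = 4.0
A2_plain = make_threshold(F0)                             # F_0
A2_offset = make_threshold(F0, correction=2.0)            # F_0 + N_1
A2_scaled = make_threshold(F0, coeff=1.5)                 # a_1 * F_0
A2_both = make_threshold(F0, coeff=1.5, correction=2.0)   # a_1 * F_0 + N_1
```

The values 4.0, 1.5 and 2.0 are arbitrary; the text leaves N_1 and a_1 to be set per scenario.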
Fig. 4 is a flowchart of a voice wake-up method according to a third embodiment of the present invention. As shown in fig. 4, the method may include:
S401, periodically sampling the audio signal, wherein a sampled signal y_i is obtained by sampling at time t_i, and i is a positive integer.
S402, calculating the audio energy T_i of the sampled signal y_i.
S403, when the audio energy T_i is less than the first threshold A_0 at time t_i and the difference between the first threshold A_0 at time t_i and the first noise energy S_0 is greater than a preset third threshold value M_2, generating a fourth threshold A_3 according to the first noise energy S_0 and using the fourth threshold A_3 as the first threshold A_0 at time t_{i+1}.
For specific description of S401 and S402, reference may be made to the embodiment shown in fig. 1, which is not described herein again.
For S403: when the audio energy T_i is less than the first threshold A_0 at time t_i, prior-art voice wake-up schemes no longer perform VAD, so the user's voice may be missed. For example, the first threshold A_0 at time t_i may suit a noisy environment while the terminal is actually in a relatively quiet environment, so that the user's voice in the sampled signal y_i goes undetected. In the embodiment of the invention, S403 changes the first threshold A_0 at time t_{i+1} so that it matches the current environment.
When the difference between the first threshold A_0 at time t_i and the first noise energy S_0 is greater than the preset third threshold value M_2, that is, when the first threshold A_0 at time t_i is considerably larger than the first noise energy S_0, the terminal is in a relatively quiet environment and the first threshold A_0 at time t_i is too large, so it needs to be lowered to match the environment. The third threshold value M_2 is a preset parameter that can be obtained through debugging.
The first noise energy S_0 is obtained by decimating the sampled signal y_i at the first decimation rate 1/x and performing slow tracking filtering on the decimated sampling points ys, so the first noise energy S_0 reflects the steady-state energy of the environment. Therefore, unlike S303, S403 does not need to check at multiple times whether the difference between the first threshold A_0 and the first noise energy S_0 is greater than the preset third threshold value M_2. When the difference between the first threshold A_0 at time t_i and the first noise energy S_0 is greater than the preset third threshold value M_2, this can indicate that the sampled signal y_i contains the user's voice; to avoid missing the user's voice, the fourth threshold A_3 is generated according to the first noise energy S_0 and used as the first threshold A_0 at time t_{i+1}.
In this embodiment of the invention, the audio energy T_i of the sampled signal y_i obtained by sampling at time t_i is calculated, and when the audio energy T_i is less than the first threshold A_0 at time t_i and the difference between the first threshold A_0 at time t_i and the first noise energy S_0 is greater than the preset third threshold value M_2, a fourth threshold A_3 is generated according to the first noise energy S_0 and used as the first threshold A_0 at time t_{i+1}. The first noise energy S_0 is obtained by decimating the sampled signal y_i at the first decimation rate 1/x and performing slow tracking filtering on the decimated sampling points ys. In other words, the first threshold A_0 at time t_{i+1} is derived from the first noise energy S_0 at time t_i. The terminal can therefore adjust the first threshold A_0 for the next time according to the current environmental noise, so that the first threshold A_0 at each time matches the environment. This reduces the number of VAD runs, and hence the power consumption of the terminal in a noisy environment, while further avoiding missed detection of the user's voice in the sampled signal y_i.
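Because S403 uses only the values at time t_i, it reduces to a single comparison. The following sketch assumes the variant A_3 = S_0 + N_2; the function name `next_threshold_s403` and the sample numbers are hypothetical.

```python
def next_threshold_s403(T_i: float, A0: float, S0: float,
                        M2: float, N2: float = 0.0) -> float:
    """Return the first threshold A_0 for time t_{i+1} per S403.

    The threshold is lowered to A_3 = S_0 + N_2 (assumed variant) only when
    the audio energy is below the threshold AND the threshold sits more
    than M_2 above the slowly tracked noise energy S_0.
    """
    if T_i < A0 and (A0 - S0) > M2:
        return S0 + N2
    return A0


lowered = next_threshold_s403(T_i=10.0, A0=20.0, S0=3.0, M2=5.0, N2=2.0)
unchanged = next_threshold_s403(T_i=25.0, A0=20.0, S0=3.0, M2=5.0, N2=2.0)
```

In the first call the gap A_0 - S_0 = 17 exceeds M_2 = 5 and T_i is below A_0, so the threshold drops to 5; in the second call T_i is not below A_0, so S403 does not apply and the threshold stays at 20.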
On the basis of the above embodiment, generating the fourth threshold A_3 according to the first noise energy S_0 may include: using the first noise energy S_0 as the fourth threshold A_3; or using the sum of the first noise energy S_0 and a preset third correction amount N_2 as the fourth threshold A_3, i.e., A_3 = S_0 + N_2; or using the product of the first noise energy S_0 and a preset third coefficient a_2 as the fourth threshold A_3, i.e., A_3 = a_2 × S_0.
A larger third correction amount N_2 means that the fourth threshold A_3 rises quickly above the first noise energy S_0; a smaller third correction amount N_2 means that it rises slowly. The magnitude of the third correction amount N_2 can be set according to the actual scenario, which the embodiment of the invention does not limit. Similarly, a larger third coefficient a_2 means that the fourth threshold A_3 rises quickly above the first noise energy S_0, and a smaller third coefficient a_2 means that it rises slowly. The magnitude of the third coefficient a_2 can likewise be set according to the actual scenario, which the embodiment of the invention does not limit.
Optionally, the product of the first noise energy S_0 and the preset third coefficient a_2, plus the preset third correction amount N_2, may also be used as the fourth threshold A_3, i.e., A_3 = a_2 × S_0 + N_2.
It should be noted that the second correction amount N_1 and the third correction amount N_2 each reflect how far the first threshold A_0 is raised above the noise energy under different conditions: the first threshold A_0 exceeds the second noise energy F_0 by the second correction amount N_1, and exceeds the first noise energy S_0 by the third correction amount N_2. In addition, because the first noise energy S_0 is obtained by slow tracking filtering and the second noise energy F_0 by fast tracking filtering, the third correction amount N_2 is optionally greater than the second correction amount N_1, to achieve fast matching to the environment.
Furthermore, the embodiment of the present invention may also record when the first threshold changes. When the first threshold is raised, the time at which it is raised can be recorded; when the first threshold is lowered, the time at which it is lowered can be recorded.
Specifically, before the third threshold A_2 is used as the first threshold A_0 at time t_{i+1}, the method may further comprise: recording time t_i as a threshold-lowering time; when the interval between time t_i and the previous threshold-lowering time is greater than a preset value T_time, performing the step of using the third threshold A_2 as the first threshold A_0 at time t_{i+1}; otherwise, not performing that step.
Likewise, before the fourth threshold A_3 is used as the first threshold A_0 at time t_{i+1}, the method may further comprise: recording time t_i as a threshold-lowering time; when the interval between time t_i and the previous threshold-lowering time is greater than the preset value T_time, performing the step of using the fourth threshold A_3 as the first threshold A_0 at time t_{i+1}; otherwise, not performing that step.
The above two implementations prevent ping-pong switching of the first threshold A_0, so that the reliability of voice detection is not affected and the probability of missed voice detection is reduced.
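The interval check that prevents ping-pong switching can be sketched as a small guard object. This sketch follows one literal reading of the text, in which every candidate time t_i is recorded as a threshold-lowering time before the interval is evaluated; recording only the lowerings that are actually applied would be an equally plausible reading. The class name `LoweringGuard` and the decision to allow the very first lowering are assumptions.

```python
from typing import Optional


class LoweringGuard:
    """Hypothetical guard against ping-pong switching of A_0."""

    def __init__(self, T_time: float):
        self.T_time = T_time
        self.last_lowering: Optional[float] = None

    def allow(self, t_i: float) -> bool:
        """Record t_i and report whether the lowering step may be executed."""
        previous = self.last_lowering
        self.last_lowering = t_i  # record t_i as a threshold-lowering time
        # Assumption: with no earlier record, the lowering is allowed.
        return previous is None or (t_i - previous) > self.T_time


guard = LoweringGuard(T_time=1.0)
decisions = [guard.allow(t) for t in (0.0, 0.5, 2.0, 2.5)]
```

With T_time = 1.0, the requests at 0.5 and 2.5 are rejected because each arrives within T_time of the previously recorded time.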
The embodiment of the invention continuously monitors and tracks the environmental background noise and adaptively adjusts the first threshold A_0 according to the level of that noise. The first threshold A_0 is adjusted by slow rising or slow falling, which reduces the probability of missed voice detection. In addition, dynamic adjustment of the first threshold A_0 brings the power consumption in quiet and noisy environments close to each other, which improves user experience and product competitiveness.
Fig. 5 is a schematic structural diagram of a voice wake-up apparatus according to a first embodiment of the present invention. The voice wake-up apparatus can be implemented in hardware and can be integrated in a terminal such as a tablet computer, a smartphone or a PDA. As shown in Fig. 5, the voice wake-up apparatus 10 includes: an SRC 11, an arithmetic circuit 12, a threshold decision circuit 13, a first decimator 14, a Slow Tracking Filter (STF) 15, a comparator 16, a configurator 17, and an interrupt processing circuit 18.
The SRC 11 is configured to periodically sample the audio signal, wherein a sampled signal y_i is obtained by sampling at time t_i, and i is a positive integer. The arithmetic circuit 12 is configured to calculate the audio energy T_i of the sampled signal y_i. The threshold decision circuit 13 is configured to determine whether the audio energy T_i is greater than or equal to the first threshold A_0 at time t_i; when the audio energy T_i is greater than or equal to the first threshold A_0 at time t_i, it triggers the interrupt processing circuit 18 to output an interrupt pulse signal to the interrupt control circuit 20, and the interrupt control circuit 20 enables the DSP or processor 30 to perform VAD. An input of the first decimator 14 is coupled to an output of the SRC 11; the first decimator 14 is configured to decimate the sampled signal y_i at a first decimation rate 1/x to obtain and output sampling points ys, where x is a natural number greater than 1. An input of the STF 15 is coupled to the output of the first decimator 14; the STF 15 is configured to perform slow tracking filtering on the decimated sampling points ys to obtain the first noise energy S_0. Inputs of the comparator 16 are coupled to the output of the STF 15 and to the threshold decision circuit 13; the comparator 16 is configured to determine whether the difference between the first noise energy S_0 and the first threshold A_0 at time t_i is greater than a preset first threshold value M_0.
The configurator 17 is configured to, when the VAD detection fails at time t_i, there have been n consecutive detection failures before time t_i, and the difference between the first noise energy S_0 and the first threshold A_0 at time t_i is greater than the preset first threshold value M_0, generate a second threshold A_1 according to the first noise energy S_0, use the second threshold A_1 as the first threshold A_0 at time t_{i+1}, and send it to the threshold decision circuit 13, where n is a positive integer and n is smaller than i.
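The threshold-raising behaviour that the threshold decision circuit, comparator and configurator realise together can be modelled in software as follows. This is a behavioural sketch only, not the circuit itself. It assumes that the difference is taken as S_0 - A_0, that the failure counter is untouched at times when VAD is not run, and that the variant A_1 = S_0 + N_0 is used; the class name `NoisyEnvironmentCheck` and the `run_vad` callback are hypothetical.

```python
from typing import Callable


class NoisyEnvironmentCheck:
    """Hypothetical model of raising A_0 after repeated VAD failures."""

    def __init__(self, n: int, M0: float, N0: float = 0.0):
        self.n = n
        self.M0 = M0
        self.N0 = N0
        self.failures_before = 0  # consecutive VAD failures before t_i

    def step(self, T_i: float, A0: float, S0: float,
             run_vad: Callable[[], bool]) -> float:
        """Return the first threshold A_0 to use at time t_{i+1}."""
        if T_i < A0:
            return A0                      # VAD not triggered at t_i
        if run_vad():
            self.failures_before = 0       # success breaks the failure run
            return A0
        # VAD failed at t_i: raise only after n earlier consecutive failures
        # and when the noise energy exceeds the threshold by more than M_0.
        raise_it = self.failures_before >= self.n and (S0 - A0) > self.M0
        self.failures_before += 1
        return S0 + self.N0 if raise_it else A0


check = NoisyEnvironmentCheck(n=2, M0=1.0, N0=2.0)
A0 = 5.0
history = []
for _ in range(4):
    A0 = check.step(T_i=9.0, A0=A0, S0=8.0, run_vad=lambda: False)
    history.append(A0)
```

After the threshold has been raised to 10 on the third call, the same audio energy of 9 no longer reaches the threshold, so VAD is not triggered on the fourth call.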
Referring to Fig. 5, the configurator 17 configures parameters for the voice wake-up apparatus 10, such as the first threshold A_0 mentioned above. Those skilled in the art will understand that the configurator 17 receives configuration parameters from the terminal and converts them into corresponding control signals for the logic modules in the voice wake-up apparatus 10, where the logic modules include the arithmetic circuit 12, the threshold decision circuit 13, the interrupt processing circuit 18, and the like. The SRC 11 may specifically sample the audio signal by down-sampling, for example by converting 32 kilohertz (kHz) data to 16 kHz.
The sampled signal y_i flows through Fig. 5 as follows:
SRC 11 -> arithmetic circuit 12 -> threshold decision circuit 13 -> interrupt processing circuit 18 (optional) -> interrupt control circuit 20 (optional) -> DSP or processor 30 (optional).
When the audio energy T_i is greater than or equal to the first threshold A_0 at time t_i, the flow of the sampled signal y_i includes the optional part above; when the audio energy T_i is less than the first threshold A_0 at time t_i, the flow of the sampled signal y_i does not include the optional part.
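The conditional signal flow can be written out as a short routine that lists the stages a sample passes through. The function name `signal_path` is an assumption; the stage names are taken from Fig. 5 as described in the text.

```python
from typing import List


def signal_path(T_i: float, A0: float) -> List[str]:
    """List the Fig. 5 stages traversed by the sampled signal y_i."""
    path = ["SRC 11", "arithmetic circuit 12", "threshold decision circuit 13"]
    if T_i >= A0:
        # Optional part: taken only when the energy reaches the threshold.
        path += [
            "interrupt processing circuit 18",
            "interrupt control circuit 20",
            "DSP or processor 30",
        ]
    return path


loud = signal_path(T_i=12.0, A0=10.0)   # optional part included
quiet = signal_path(T_i=4.0, A0=10.0)   # stops at the threshold decision
```

The power saving comes from the second case: the DSP or processor is not enabled when the path ends at the threshold decision circuit.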
The first decimator 14, the STF 15 and the comparator 16 do not affect normal voice wake-up; they are used only together with the configurator 17 to change the first threshold A_0 used in voice wake-up.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again.
In the above embodiments, the configurator 17 may be specifically configured to: use the first noise energy S_0 as the second threshold A_1; or use the sum of the first noise energy S_0 and a preset first correction amount N_0 as the second threshold A_1, i.e., A_1 = S_0 + N_0; or use the product of the first noise energy S_0 and a preset first coefficient a_0 as the second threshold A_1, i.e., A_1 = a_0 × S_0; and so on. Embodiments of the present invention are not limited thereto.
Fig. 6 is a schematic structural diagram of a voice wake-up apparatus according to a second embodiment of the present invention. The voice wake-up apparatus can be implemented in hardware and can be integrated in a terminal such as a tablet computer, a smartphone or a PDA. As shown in Fig. 6, the voice wake-up apparatus 100 includes: an SRC 110, an arithmetic circuit 120, a threshold decision circuit 130, a second decimator 140, a Fast Tracking Filter (FTF) 150, a comparator 160, a configurator 170, and an interrupt processing circuit 180.
The SRC 110 is configured to periodically sample the audio signal, wherein a sampled signal y_i is obtained by sampling at time t_i, and i is a positive integer. The arithmetic circuit 120 is configured to calculate the audio energy T_i of the sampled signal y_i. The threshold decision circuit 130 is configured to determine whether the audio energy T_i is greater than or equal to the first threshold A_0 at time t_i. An input of the second decimator 140 is coupled to an output of the SRC 110; the second decimator 140 is configured to decimate the sampled signal y_i at a second decimation rate 1/z to obtain sampling points yf, where z is a natural number greater than x. An input of the FTF 150 is coupled to the output of the second decimator 140; the FTF 150 is configured to perform fast tracking filtering on the decimated sampling points yf to obtain the second noise energy F_0. An input of the comparator 160 is coupled to the output of the FTF 150; the comparator 160 is configured to, when the audio energy T_i is less than the first threshold A_0 at time t_i, determine at each time whether the difference between the first threshold A_0 and the second noise energy F_0 is greater than a preset second threshold value M_1, and, when this difference is greater than the preset second threshold value M_1 at each time from t_{i-m} to t_i, trigger the interrupt processing circuit 180 to output an interrupt pulse signal to the interrupt control circuit 200, whereupon the interrupt control circuit 200 enables the DSP or processor 300 to perform VAD, where m is a positive integer and m is smaller than i. The configurator 170 is configured to, when the VAD detection succeeds, generate a third threshold A_2 according to the second noise energy F_0, use the third threshold A_2 as the first threshold A_0 at time t_{i+1}, and send it to the threshold decision circuit 130.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 3, and the implementation principle and the technical effect are similar, which are not described herein again.
On the basis of the above embodiments, the configurator 170 may be specifically configured to: use the second noise energy F_0 as the third threshold A_2; or use the sum of the second noise energy F_0 and a preset second correction amount N_1 as the third threshold A_2, i.e., A_2 = F_0 + N_1; or use the product of the second noise energy F_0 and a preset second coefficient a_1 as the third threshold A_2, i.e., A_2 = a_1 × F_0; and so on. Embodiments of the present invention are not limited thereto.
Optionally, the configurator 170 may be further configured to: record time t_i as a threshold-lowering time; when the interval between time t_i and the previous threshold-lowering time is greater than a preset value T_time, perform the step of using the third threshold A_2 as the first threshold A_0 at time t_{i+1}; otherwise, not perform that step. This prevents ping-pong switching of the first threshold A_0, so that the reliability of voice detection is not affected and the probability of missed voice detection is reduced.
Referring to Fig. 5, the configurator 17 may also be configured to: when the audio energy T_i is less than the first threshold A_0 at time t_i and the difference between the first threshold A_0 at time t_i and the first noise energy S_0 is greater than a preset third threshold value M_2, generate a fourth threshold A_3 according to the first noise energy S_0 and use the fourth threshold A_3 as the first threshold A_0 at time t_{i+1}.
At this time, the apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 4, and the implementation principle and the technical effect are similar, which are not described herein again.
Further, the configurator 17 may be specifically configured to: use the first noise energy S_0 as the fourth threshold A_3; or use the sum of the first noise energy S_0 and a preset third correction amount N_2 as the fourth threshold A_3, i.e., A_3 = S_0 + N_2; or use the product of the first noise energy S_0 and a preset third coefficient a_2 as the fourth threshold A_3, i.e., A_3 = a_2 × S_0; and so on. Embodiments of the present invention are not limited thereto.
Still further, the configurator 17 may also be configured to: record time t_i as a threshold-lowering time; when the interval between time t_i and the previous threshold-lowering time is greater than the preset value T_time, perform the step of using the fourth threshold A_3 as the first threshold A_0 at time t_{i+1}; otherwise, not perform that step. This prevents ping-pong switching of the first threshold A_0, so that the reliability of voice detection is not affected and the probability of missed voice detection is reduced.
Referring to Fig. 5 and Fig. 6, the first decimator 14 and the second decimator 140 perform data decimation over a long period and a short period, respectively. The STF 15 is a slowly converging filter for stably tracking changes in environmental noise. The FTF 150 is a fast converging filter for quickly tracking changes in environmental noise. The STF 15 and the FTF 150 track the energy of the current calculation window and are constructed similarly to the arithmetic circuit 12 or the arithmetic circuit 120. The STF 15 and the FTF 150 differ in filter order and parameters, which are set according to actual debugging. The FTF 150 performs short-period filtering, i.e., recent data changes quickly affect the filter output. The STF 15 is a long-period filter, i.e., recent data changes have a relatively small and slow effect on the filter output.
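The contrast between slow and fast tracking can be illustrated with a first-order exponential smoother applied to the energy of the decimated samples. This is an assumed filter structure chosen for illustration: the text states only that filter order and parameters are set by debugging. The function names, the smoothing factors 0.01 and 0.5, and the decimation factors x = 2 and z = 4 (picked so that z > x, as in the text) are all hypothetical.

```python
from typing import List, Sequence


def decimate(samples: Sequence[float], factor: int) -> List[float]:
    """Keep every `factor`-th sample (decimation rate 1/factor)."""
    return list(samples[::factor])


def track_energy(points: Sequence[float], alpha: float,
                 initial: float) -> float:
    """First-order tracking of sample energy (assumed filter form).

    A small alpha converges slowly (STF-like behaviour); a large alpha
    converges quickly (FTF-like behaviour).
    """
    estimate = initial
    for p in points:
        estimate += alpha * (p * p - estimate)
    return estimate


# Background at amplitude 1.0, followed by a short burst at amplitude 3.0.
y = [1.0] * 100 + [3.0] * 40

ys = decimate(y, 2)                                # first decimation, x = 2
yf = decimate(y, 4)                                # second decimation, z = 4
S0 = track_energy(ys, alpha=0.01, initial=1.0)     # slow tracking
F0 = track_energy(yf, alpha=0.5, initial=1.0)      # fast tracking
```

After the burst, the fast estimate F0 has almost reached the burst energy of 9, while the slow estimate S0 has moved only a little above the background energy of 1, which matches the short-period versus long-period behaviour described above.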
Alternatively, on the basis of fig. 5, in combination with fig. 6, the structure shown in fig. 7 is obtained. Fig. 7 is a schematic structural diagram of a voice wake-up apparatus according to a third embodiment of the present invention. As shown in fig. 7, the voice wake-up apparatus 1000 includes: SRC11, arithmetic circuit 12, threshold decision circuit 13, first decimator 14, second decimator 140, STF15, FTF 150, comparator 16, configurator 17, and interrupt processing circuit 18.
The threshold decision circuit 13 also has the functions of the threshold decision circuit 130; the comparator 16 also has the functions of the comparator 160; the configurator 17 also has the functions of the configurator 170; and the interrupt processing circuit 18 also has the functions of the interrupt processing circuit 180. The specific principles are as described in the above embodiments and are not repeated here.
The embodiment of the invention continuously monitors and tracks the environmental background noise and adaptively adjusts the first threshold A_0 according to the level of that noise. The first threshold A_0 is adjusted by slow rising or slow falling, which reduces the probability of missed voice detection. In addition, dynamic adjustment of the first threshold A_0 brings the power consumption in quiet and noisy environments close to each other, which improves user experience and product competitiveness.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units or modules is only a logical division, and other divisions are possible in an actual implementation. For instance, a plurality of units or modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be electrical, mechanical or in another form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.