CN118571214A - Model training method, VAD parameter determining method, device and equipment - Google Patents

Info

Publication number
CN118571214A
CN118571214A
Authority
CN
China
Prior art keywords
vad
model
current environment
voice
vad parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410655233.9A
Other languages
Chinese (zh)
Inventor
田红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Avatr Technology Chongqing Co Ltd
Original Assignee
Avatr Technology Chongqing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Avatr Technology Chongqing Co Ltd filed Critical Avatr Technology Chongqing Co Ltd
Priority to CN202410655233.9A priority Critical patent/CN118571214A/en
Publication of CN118571214A publication Critical patent/CN118571214A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 Speech recognition
            • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
              • G10L15/063 Training
            • G10L15/08 Speech classification or search
              • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
              • G10L15/16 Speech classification or search using artificial neural networks
            • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
          • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/78 Detection of presence or absence of voice signals
              • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Telephone Function (AREA)

Abstract

The embodiment of the application relates to the technical field of voice recognition, and discloses a model training method, a VAD parameter determining method, a device and equipment. The model training method comprises the following steps: performing model training according to voice signal samples and a preset VAD parameter sequence to obtain an initial VAD parameter determination model; then continuing to train the initial VAD parameter determination model according to the voice signal of the current environment to generate the VAD parameter determination model. The preset VAD parameter sequence comprises the preset VAD parameter corresponding to each voice signal sample. The target VAD parameters generated by this technical scheme fit the current environment better and are more accurate, which further improves the accuracy and robustness of VAD.

Description

Model training method, VAD parameter determining method, device and equipment
Technical Field
The embodiment of the application relates to the technical field of voice recognition, in particular to a model training method, a VAD parameter determining method, a device and equipment.
Background
In applications such as real-time communication, speech recognition and speech enhancement, voice activity detection (Voice Activity Detection, VAD) techniques play an important role.
At present, after detecting the start of speech, VAD technology monitors voice activity in real time against a set judgment threshold, and starts a timer once the voice signal has continuously exceeded the judgment threshold for a certain time. If a continuous speech signal is detected within a fixed time delay, the system considers that the user is still speaking and extends the delay; otherwise, if no speech signal is detected within the fixed time delay, the system considers that the user has finished speaking and starts the subsequent speech processing flow.
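The hangover-timer logic described above can be sketched as follows. The frame energies, threshold, frame length and 800 ms delay are illustrative assumptions, not values taken from the patent.

```python
def endpoint(frame_energies, threshold, frame_ms, hangover_ms):
    """Return the index of the frame at which the utterance is judged to
    have ended, or None if the end is never reached within the signal."""
    silence_ms = 0
    in_speech = False
    for i, energy in enumerate(frame_energies):
        if energy > threshold:
            in_speech = True
            silence_ms = 0           # speech continues: reset the timer
        elif in_speech:
            silence_ms += frame_ms   # accumulate trailing silence
            if silence_ms >= hangover_ms:
                return i             # fixed delay elapsed: utterance over
    return None

# Two speech frames followed by 800 ms of silence (4 frames of 200 ms):
# the utterance is judged finished at the frame where the delay elapses.
end = endpoint([5, 5, 0, 0, 0, 0], threshold=1, frame_ms=200, hangover_ms=800)
```

With these example values, `end` is 5: the fourth silent 200 ms frame brings the accumulated silence to the 800 ms hangover.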
However, since the fixed time delay in the prior art is set manually, it suffers from low accuracy.
Disclosure of Invention
In view of the above problems, embodiments of the present application provide a model training method, a VAD parameter determining method, a device and equipment, which are used to solve the problem of low accuracy in the prior art.
According to an aspect of an embodiment of the present application, there is provided a model training method, the method including:
Model training is carried out according to a plurality of voice signal samples and a preset VAD parameter sequence, an initial VAD parameter determination model is obtained, and the preset VAD parameter sequence comprises preset VAD parameters corresponding to each voice signal sample;
inputting a voice signal of a current environment into the initial VAD parameter determining model, and obtaining VAD parameters to be adjusted output by the initial VAD parameter determining model;
adjusting the VAD parameters to be adjusted according to the voice signals of the current environment to generate VAD parameters of the current environment;
And continuing training the initial VAD parameter determination model according to the voice signal of the current environment and the VAD parameter of the current environment to generate a VAD parameter determination model.
According to another aspect of the embodiments of the present application, there is provided a method for determining VAD parameters, the method including:
Acquiring a voice signal to be recognized acquired in a current environment;
Inputting the voice signal to be recognized into a voice activity detection VAD parameter determination model, and obtaining target VAD parameters output by the VAD parameter determination model; the VAD parameter determination model is a model obtained by training according to the model training method;
Outputting the target VAD parameters.
According to another aspect of an embodiment of the present application, there is provided a model training apparatus including:
the training module is used for carrying out model training according to a plurality of voice signal samples and a preset VAD parameter sequence to obtain an initial VAD parameter determination model, wherein the preset VAD parameter sequence comprises preset VAD parameters corresponding to each voice signal sample;
the input module is used for inputting the voice signal of the current environment into the initial VAD parameter determining model and obtaining the VAD parameter to be adjusted output by the initial VAD parameter determining model;
The adjusting module is used for adjusting the VAD parameters to be adjusted according to the voice signals of the current environment to generate the VAD parameters of the current environment;
The training module is further used for continuing training the initial VAD parameter determination model according to the voice signal of the current environment and the VAD parameter of the current environment to generate a VAD parameter determination model.
According to another aspect of the embodiments of the present application, there is provided a VAD parameter determining apparatus, the apparatus including:
The acquisition module is used for acquiring the voice signal to be identified acquired in the current environment;
The input module is used for inputting the voice signal to be recognized into a voice activity detection VAD parameter determination model and obtaining target VAD parameters output by the VAD parameter determination model; the VAD parameter determination model is a model obtained by training according to the model training method;
and the output module is used for outputting the target VAD parameters.
According to another aspect of an embodiment of the present application, there is provided an electronic apparatus including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the model training method and the VAD parameter determination method described above.
According to yet another aspect of an embodiment of the present application, there is provided a computer-readable storage medium having stored therein at least one executable instruction for causing an electronic device/apparatus to perform the following model training operations:
model training is carried out according to a plurality of voice signal samples and a preset voice activity detection VAD parameter sequence, an initial VAD parameter determination model is obtained, and the preset VAD parameter sequence comprises preset VAD parameters corresponding to each voice signal sample;
inputting a voice signal of a current environment into the initial VAD parameter determining model, and obtaining VAD parameters to be adjusted output by the initial VAD parameter determining model;
adjusting the VAD parameters to be adjusted according to the voice signals of the current environment to generate VAD parameters of the current environment;
And continuing training the initial VAD parameter determination model according to the voice signal of the current environment and the VAD parameter of the current environment to generate a VAD parameter determination model.
Or the executable instructions cause the electronic device/apparatus to perform the following VAD parameter determination operations:
Acquiring a voice signal to be recognized acquired in a current environment;
Inputting the voice signal to be recognized into a voice activity detection VAD parameter determination model, and obtaining target VAD parameters output by the VAD parameter determination model; the VAD parameter determination model is a model obtained by training according to the model training method;
Outputting the target VAD parameters.
According to the model training method, the VAD parameter determining device and the VAD parameter determining equipment provided by the embodiment of the application, model training is carried out according to a plurality of voice signal samples and a preset VAD parameter sequence, and an initial VAD parameter determining model is obtained. And then, training the initial VAD parameter determination model according to the voice signal of the current environment to generate the VAD parameter determination model. The preset VAD parameter sequence comprises preset VAD parameters corresponding to each voice signal sample. In practical application, a voice signal to be recognized acquired in a current environment is acquired, the voice signal to be recognized is input into a VAD parameter determination model, and target VAD parameters output by the VAD parameter determination model are acquired. Compared with fixed VAD parameters, the technical scheme outputs the corresponding target VAD parameters of the current environment through the VAD parameter determination model, and effectively improves the accuracy and the robustness of the VAD. In addition, in the training process, the initial VAD parameter determining model is trained again based on the noise environment and the voice signal of the application environment, and the accuracy of the VAD parameter determining model is improved again.
The foregoing description is only an overview of the technical solutions of the embodiments of the present application. To make the technical means of the embodiments of the present application more clearly understandable, so that they can be implemented according to the content of the specification, specific embodiments of the present application are given below.
Drawings
The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic diagram of a speech recognition system according to the present application;
FIG. 2 shows a flow chart of a first embodiment of the model training method provided by the present application;
FIG. 3 shows a flow chart of a second embodiment of the model training method provided by the present application;
FIG. 4 shows a flow chart of a third embodiment of the model training method provided by the present application;
FIG. 5 is a flowchart of a first embodiment of a method for determining VAD parameters provided by the present application;
FIG. 6 shows a schematic structural diagram of an embodiment of the model training apparatus provided by the present application;
fig. 7 is a schematic structural diagram of an embodiment of a VAD parameter determination apparatus provided by the present application;
fig. 8 is a schematic structural diagram of a first embodiment of an electronic device according to the present application;
fig. 9 shows a schematic structural diagram of a second embodiment of the electronic device provided by the application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein.
First, an application background to which the present application relates will be explained.
VAD is a signal processing technique used to identify speech and non-speech portions (e.g., background noise, silence, etc.) in a speech signal. Its main function is to transmit the voice signal when a speech part is recognized and to keep silent during non-speech parts, thereby improving communication quality, reducing the amount of transmitted data, and effectively reducing the interference of background noise with voice communication and voice recognition. VAD technology is widely applied in fields such as mobile phone calls, network telephony and voice recognition, and plays an important role in improving user experience and communication quality.
Two problems in detecting voice activity deserve attention. The first is the background noise problem, i.e. how to detect silence in loud background noise. The second is the front and back edge clipping problem, which refers to the time delay between the actual start of speaking and the detection of the voice; this delay may cause the front or back of the voice signal to be truncated or omitted. In this case, if the user pauses before beginning to speak, or pauses for a relatively long time after speaking a sentence, the speech recognition system is likely to fail to accurately capture the entire content of the speech signal, thereby affecting the accuracy of speech understanding.
Currently, the delay used by voice providers for front and back edge clipping is fixed, say 800 ms: if no human speech is detected within 800 ms, the utterance is considered to have ended. However, a manually set time delay cannot adapt to all environments, so if there is a pause in the middle of speech or the environment is noisy, the semantics that the user wants to express cannot be correctly understood.
In summary, the prior art uses a fixed time delay to detect voice activity, which suffers from low accuracy.
Based on the technical problems, the technical conception of the application is as follows: preliminary model training may be performed based on a plurality of speech signal samples to obtain an initial VAD parameter determination model capable of outputting VAD parameters based on the speech signal. Meanwhile, according to the voice signals acquired in the environment where voice activity detection is actually carried out, the initial VAD parameter determination model can be trained again in consideration of different requirements of different environments on time delay, so that the VAD parameter determination model obtained through training can be more suitable for the environment where voice activity detection is actually carried out, and accuracy of VAD parameters output in the actual application process is improved. It should be appreciated that the VAD parameter is the fixed delay.
Based on the technical conception, the application provides a model training method, a VAD parameter determining device and VAD parameter determining equipment.
Fig. 1 shows a schematic diagram of a voice recognition system according to the present application. As shown in fig. 1, the speech recognition system 100 includes a VAD parameter determination model 110 and a speech recognition model 120.
The VAD parameter determination model 110 and the speech recognition model 120 may be deployed on the same device or on different devices, which may be determined according to the actual situation, which is not particularly limited in the embodiment of the present application.
When actually detecting voice activity, the VAD parameter determination model 110 acquires the voice signal to be recognized acquired by the current environment in real time, and determines the target VAD parameter corresponding to the current environment according to the voice signal to be recognized. Then, the voice recognition model 120 obtains the target VAD parameter output by the VAD parameter determination model 110, and performs voice recognition on the voice signal to be recognized according to the target VAD parameter, so as to obtain a recognition result.
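The two-stage pipeline of Fig. 1 can be roughly illustrated by the sketch below. Both classes, their method names, and the returned values are hypothetical placeholders standing in for trained models, not the patent's implementation.

```python
class VadParameterModel:
    """Stand-in for the VAD parameter determination model 110."""

    def predict_delay_ms(self, signal):
        # A trained model would infer the target VAD parameter (the delay)
        # from the signal; 800 ms is a placeholder default.
        return 800


class SpeechRecognizer:
    """Stand-in for the speech recognition model 120."""

    def recognize(self, signal, vad_delay_ms):
        # The recognizer applies the target VAD parameter when endpointing
        # the signal; the returned transcript here is a placeholder.
        return {"delay_used": vad_delay_ms, "text": "<transcript>"}


def run_pipeline(signal):
    # Stage 1: determine the target VAD parameter for the current signal.
    delay = VadParameterModel().predict_delay_ms(signal)
    # Stage 2: recognize the signal using that parameter.
    return SpeechRecognizer().recognize(signal, delay)
```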
The VAD parameter determination model is obtained by pre-training based on the current environment, and a specific training process will be explained in the following embodiments, which are not described herein.
Fig. 2 shows a flowchart of a first embodiment of the model training method provided by the application, which is performed by an electronic device. As shown in fig. 2, the method comprises the steps of:
step 210: model training is carried out according to a plurality of voice signal samples and a preset voice activity detection VAD parameter sequence, and an initial VAD parameter determination model is obtained.
The execution body of the embodiment is an electronic device, which may be a terminal device, for example, a mobile phone, a desktop computer, a notebook computer, or a server. In practical application, the electronic device is specifically a terminal device or a server, and may be determined according to practical situations, which is not specifically limited.
In this step, a training set needs to be collected or created before model training can be performed. Since the purpose of model training is to determine the corresponding VAD parameters from the speech signal, the training set may be a plurality of speech signal samples and the VAD parameters for each speech signal sample.
In practical application, the VAD parameter of each speech signal sample is that sample's ground-truth value, obtained by manual calibration, and is used for supervised model training. Alternatively, the voice signal samples may be voice signals for which voice activity was previously detected successfully, and the VAD parameters of those samples may be the VAD parameters used in that previous detection.
Alternatively, the speech signal samples may contain only human voice, or both human voice and noise.
Specifically, speech signal samples comprising only pure human voice may be obtained, and noise of different types and/or at different signal-to-noise ratios may be added to them to produce samples that contain both human voice and noise.
That is, the noise types between different speech signal samples may be the same or different; the signal-to-noise ratio between different speech signal samples may be the same or different.
Alternatively, the noise may be added by way of additive noise, multiplicative noise, channel simulation, reverberation addition, time domain disturbance, or the like.
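Of the noise-adding methods listed, additive noise at a target signal-to-noise ratio is the most common. A minimal sketch, assuming plain Python lists of samples and the standard power-ratio definition of SNR:

```python
import math

def add_noise(clean, noise, snr_db):
    """Mix `noise` into `clean` so the result has roughly `snr_db` dB SNR."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Scale the noise so that p_clean / (scale^2 * p_noise) == 10^(snr_db/10).
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]

# Example: mix at 0 dB SNR, i.e. equal speech and noise power.
noisy = add_noise([1.0, -1.0] * 50, [0.5, -0.5] * 50, snr_db=0)
```

Repeating this over several noise types and SNR values yields the varied training samples described above.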
In one possible implementation, the speech signal samples and the preset VAD parameter sequence may be taken as a training set, and model training may be performed on this training set, thereby obtaining an initial VAD parameter determination model.
The preset VAD parameter sequence comprises preset VAD parameters corresponding to each voice signal sample.
In another possible implementation, signal feature vectors and emotion feature vectors for each speech signal sample may be extracted. And performing model training according to a preset VAD parameter sequence and signal characteristic vectors and emotion characteristic vectors of each voice signal sample to obtain an initial VAD parameter determination model.
In the implementation mode, a preset VAD parameter sequence, a signal characteristic vector and an emotion characteristic vector of each voice signal sample are determined to be a training set, and model training is carried out through the training set, so that an initial VAD parameter determination model is obtained.
Optionally, the signal feature vector is used to distinguish between speech and non-speech parts, and may include short-time energy (Short-Time Energy, STE), short-time zero-crossing rate (Short-Time Zero Crossing Rate, ZCC), frequency domain features, cepstral features, harmonic features, long-time information features, and the like.
The emotion feature vector generally refers to a feature vector for describing emotion content in a speech signal sample. These feature vectors may include basic features of sound such as pitch, volume, speed of speech, etc., as well as advanced features such as emotional tendency, emotional intensity, emotional type, etc.
In this implementation, model training is performed using the signal feature vector and emotion feature vector of each speech signal sample as input features, and a preset VAD parameter sequence as desired output.
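Two of the named signal features, short-time energy and the short-time zero-crossing rate, can be computed per frame as in the sketch below. The frame length and list-based signal representation are illustrative, and the emotion features (pitch, volume, speed of speech, etc.) are omitted.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one frame (STE)."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def frame_features(signal, frame_len):
    """Split the signal into non-overlapping frames and return
    an (STE, ZCR) pair for each frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        feats.append((short_time_energy(frame), zero_crossing_rate(frame)))
    return feats
```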
In the two implementations, the initial model may be trained according to the training set, so as to obtain an initial VAD parameter determination model. Alternatively, the initial model may be a gaussian mixture model (Gaussian Mixture Model, GMM) or a neural network. For example, when the initial model is a gaussian mixture model, the initial VAD parameter determination model generated by training the initial model through the training set can support multiple sampling rates and frame lengths, and provide different modes to adapt to different application scenarios.
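As a hedged stand-in for this supervised training step (the patent's model would be a Gaussian mixture model or a neural network), the closed-form least-squares fit below maps a single scalar feature per sample to its preset VAD parameter, illustrating only the input/target pairing:

```python
def fit_linear(features, targets):
    """Fit y = slope * x + intercept by least squares and return
    the fitted predictor as a callable."""
    n = len(features)
    mx = sum(features) / n
    my = sum(targets) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(features, targets))
    var = sum((x - mx) ** 2 for x in features)
    slope = cov / var
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

# Toy training set: one feature value per sample (e.g. mean noise energy)
# paired with that sample's preset VAD parameter in milliseconds.
predict = fit_linear([0.0, 1.0, 2.0], [800.0, 900.0, 1000.0])
```

Here `predict(3.0)` extrapolates to 1100 ms; a real model would use the full signal and emotion feature vectors rather than one scalar.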
Step 220: and inputting the voice signal of the current environment into an initial VAD parameter determining model, and obtaining the VAD parameters to be adjusted output by the initial VAD parameter determining model.
In this step, because different environments require different VAD parameters, before the initial VAD parameter determination model is put into practical use, the model parameters in the initial VAD parameter determination model need to be adjusted according to the current environment, so as to generate the VAD parameter determination model.
In order to adjust the model parameters in the initial VAD parameter determination model according to the current environment, a voice signal of the current environment and the VAD parameters corresponding to the voice signal need to be acquired. In order to improve accuracy of VAD parameters corresponding to the voice signals, the voice signals of the current environment can be input into an initial VAD parameter determining model, and then VAD parameters to be adjusted, which are output by the initial VAD parameter determining model, are obtained, so that the VAD parameters to be adjusted are adjusted according to the current environment, and the adjusted target VAD parameters are determined to be the VAD parameters corresponding to the voice signals of the current environment.
In one possible implementation, the voice signal of the current environment may be acquired in real time by the voice acquisition device as the person speaks, and input into the initial VAD parameter determination model.
Step 230: and adjusting the VAD parameters to be adjusted according to the voice signals of the current environment to generate the VAD parameters of the current environment.
In this step, different environments require different VAD parameters. For example, if the environment is noisy, the user's voice signal may be relatively difficult to capture accurately, and a larger delay is required for voice activity detection; if the environment is relatively quiet, the user's voice signal is more easily and accurately recognized, and a smaller delay suffices. Therefore, the VAD parameter to be adjusted can be adjusted according to the voice signal of the current environment to generate the VAD parameter of the current environment, so that the initial VAD parameter determination model can be trained again according to the voice signal of the current environment and the corresponding VAD parameter of the current environment.
Step 240: and continuing training the initial VAD parameter determination model according to the voice signal of the current environment and the VAD parameters of the current environment to generate a VAD parameter determination model.
In this step, the voice signal of the current environment and the VAD parameter of the current environment may be determined as a new training set, and the initial VAD parameter determination model may be continuously trained according to the new training set to generate the VAD parameter determination model.
In one possible implementation, signal feature vectors and emotion feature vectors of a speech signal of a current environment may be extracted. And training the initial VAD parameter determining model continuously according to the VAD parameters of the current environment and the signal characteristic vector and emotion characteristic vector of the voice signal of the current environment to generate the VAD parameter determining model.
The embodiment of the application provides a model training method. First, model training is performed according to a plurality of voice signal samples and a preset VAD parameter sequence to obtain an initial VAD parameter determination model. Then, the voice signal of the current environment is input into the initial VAD parameter determination model, and the VAD parameters to be adjusted output by the initial VAD parameter determination model are obtained. Next, the VAD parameters to be adjusted are adjusted according to the voice signal of the current environment to generate the VAD parameters of the current environment. Finally, the initial VAD parameter determination model is trained further according to the voice signal of the current environment and the VAD parameters of the current environment to generate the VAD parameter determination model. The preset VAD parameter sequence comprises the preset VAD parameter corresponding to each voice signal sample. According to this technical scheme, the VAD parameter determination model is obtained through machine learning and, in practical application, can output corresponding VAD parameters according to the input voice signal, which effectively improves the accuracy and robustness of VAD compared with fixed VAD parameters. Further, after the initial VAD parameter determination model is generated, it is retrained based on the noise environment and the speech signal of the application environment, thereby further improving the accuracy of the VAD parameter determination model.
Based on the embodiment shown in fig. 2, step 230 is specifically described below.
Fig. 3 shows a flowchart of a second embodiment of the model training method provided by the application, which is performed by an electronic device. As shown in fig. 3, step 230 may be implemented by:
Step 310: and detecting voice activity of the voice signal in the current environment, and if voice activity cannot be detected, acquiring the noise level of the current environment.
In this step, if voice activity can be detected according to the voice signal of the current environment, this shows that the VAD parameter to be adjusted output by the initial VAD parameter determination model is adapted to the current environment, and it may be directly determined as the VAD parameter of the current environment. If voice activity cannot be detected according to the voice signal of the current environment, the noise level of the current environment needs to be further determined in order to decide whether the VAD parameter to be adjusted should be increased or decreased.
In one specific implementation, voice activity detection may be performed on the voice signal by a VAD. Specifically, when the voice signal exceeds the judgment threshold, there is voice activity; otherwise, when the voice signal is smaller than or equal to the judgment threshold, there is no voice activity.
Step 320: and if the noise level of the current environment is greater than the preset noise level, increasing the VAD parameter to be adjusted to obtain the VAD parameter of the current environment.
In this step, if the noise level of the current environment is greater than the preset noise level, it indicates that the current environment is noisy, the user's voice signal may be relatively difficult to capture accurately, and a larger time delay is required to detect voice activity. Therefore, the VAD parameter to be adjusted needs to be increased to obtain the VAD parameter of the current environment.
It should be understood that the preset noise level may be preset according to the actual situation, which is not particularly limited in the embodiment of the present application.
It should be understood that, in practical applications, when the noise level of the current environment is greater than the preset noise level, the VAD parameter to be adjusted may be increased by a preset value, so as to generate the VAD parameter of the current environment. The difference between the noise level of the current environment and the preset noise level can also be calculated, the increment corresponding to the difference is determined according to the preset mapping relation, and the sum of the VAD parameter to be adjusted and the increment is determined as the VAD parameter of the current environment.
Optionally, the preset mapping relationship is used to represent a correspondence between the difference and the increment.
Alternatively, the preset value may be preset according to the actual situation, and this is not particularly limited.
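Both adjustment strategies described above — adding a fixed preset value, or looking up an increment from a preset mapping keyed by the noise-level difference — can be sketched as follows. The step size and mapping values are illustrative assumptions:

```python
def adjust_vad_parameter(param, noise_level, preset_level, step=5.0, mapping=None):
    """Adjust the VAD parameter output by the initial model for the
    current environment: increase it in a noisy environment, decrease
    it in a quiet one."""
    if noise_level > preset_level:
        if mapping:
            diff = noise_level - preset_level
            # increment for the largest mapped difference not exceeding diff
            keys = [k for k in sorted(mapping) if k <= diff]
            return param + (mapping[keys[-1]] if keys else step)
        return param + step
    # noise level at or below the preset level: reduce the parameter
    return param - step

# Fixed-step increase, mapping-based increase, and fixed-step decrease.
print(adjust_vad_parameter(100.0, 65.0, 60.0))                              # 105.0
print(adjust_vad_parameter(100.0, 70.0, 60.0, mapping={0.0: 2.0, 10.0: 8.0}))  # 108.0
print(adjust_vad_parameter(100.0, 50.0, 60.0))                              # 95.0
```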
Step 330: and if the noise level of the current environment is smaller than or equal to the preset noise level, reducing the VAD parameter to be adjusted to obtain the VAD parameter of the current environment.
In this step, if the noise level of the current environment is less than or equal to the preset noise level, it indicates that the current environment is relatively quiet, the user's voice signal is more easily and accurately identified, and a smaller delay is required for voice activity detection. Therefore, the VAD parameter to be adjusted needs to be reduced to obtain the VAD parameter of the current environment.
It should be understood that the manner of decreasing the VAD parameter to be adjusted may refer to the related content on increasing the VAD parameter to be adjusted in step 320, which is not described herein again.
In this embodiment, the VAD parameter requirement of the current environment is determined from the noise level and the voice signal of the current environment, so that the VAD parameter to be adjusted output by the initial VAD parameter determination model is adjusted accordingly, which effectively improves the accuracy of the generated VAD parameter of the current environment. The initial VAD parameter determination model is then trained again based on the voice signal of the current environment and the corresponding VAD parameter of the current environment, further improving the accuracy of the trained VAD parameter determination model.
The above-described embodiment is explained below with a specific example.
Fig. 4 shows a flowchart of a third embodiment of the model training method provided by the application, which is performed by an electronic device. As shown in fig. 4, step 230 may be implemented by:
Step 410, initializing VAD parameters to be adjusted of the current environment.
In this step, when it is determined that the current environment needs to apply the VAD parameter determination model, a voice signal of the current environment may be collected, and the voice signal of the current environment may be input into the initial VAD parameter determination model, so as to obtain the VAD parameter to be adjusted output by the initial VAD parameter determination model.
Step 420, performing voice activity detection on the voice signal of the current environment.
If voice activity can be detected, then step 430 is performed; if voice activity cannot be detected, step 440 is performed.
Step 430, determining the VAD parameter to be adjusted as the current environmental VAD parameter.
Step 440, determining whether the noise level of the current environment is greater than a preset noise level.
If yes, go to step 450; if not, step 460 is performed.
Step 450, increasing the VAD parameter to be adjusted to obtain the current environmental VAD parameter.
Step 460, the VAD parameter to be adjusted is reduced, and the current environmental VAD parameter is obtained.
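Steps 410 through 460 can be sketched as a single decision function. The `detect` callable and the toy amplitude-threshold detector below are hypothetical stand-ins for an actual VAD, and the step size is an assumed fixed adjustment:

```python
def determine_environment_vad_param(param, frames, noise_level,
                                    preset_noise_level, detect, step=5.0):
    """Keep the model's output when it already detects voice activity,
    otherwise raise or lower it according to the noise level."""
    if detect(frames, param):             # step 420
        return param                      # step 430
    if noise_level > preset_noise_level:  # step 440
        return param + step               # step 450
    return param - step                   # step 460

# Toy detector: "activity" when any sample magnitude exceeds the parameter.
detect = lambda frames, p: any(abs(s) > p for s in frames)
print(determine_environment_vad_param(0.3, [0.5, -0.2], 80, 60, detect))            # 0.3
print(determine_environment_vad_param(0.9, [0.5, -0.2], 80, 60, detect, step=0.1))  # 1.0
```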
Optionally, in some embodiments, after the VAD parameter determination model is generated, the performance of the VAD parameter determination model may also be evaluated using evaluation metrics such as accuracy, recall, and F1 score.
In this embodiment, the VAD parameter determination model and parameters may be further adjusted according to the evaluation result, so as to further improve the accuracy and robustness of the VAD parameter determination model.
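A frame-level sketch of these evaluation metrics, assuming binary VAD decisions per frame (the decision sequences are illustrative):

```python
def evaluate_vad(predictions, labels):
    """Frame-level accuracy, precision, recall and F1 for a sequence of
    binary VAD decisions against ground-truth labels."""
    tp = sum(bool(p) and bool(l) for p, l in zip(predictions, labels))
    fp = sum(bool(p) and not l for p, l in zip(predictions, labels))
    fn = sum(not p and bool(l) for p, l in zip(predictions, labels))
    tn = sum(not p and not l for p, l in zip(predictions, labels))
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = evaluate_vad([1, 1, 0, 0], [1, 0, 1, 0])
print(metrics["f1"])  # 0.5
```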
After the VAD parameter determination model is obtained, the VAD parameter corresponding to the current environment may be determined using the VAD parameter determination model. The method for determining the VAD parameters of the current environment using the VAD parameter determination model is described in detail below with reference to specific embodiments. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
In particular, the execution subject of the model training method may be an electronic device having processing capability, such as a terminal or a server. It should be understood that the electronic device for determining the VAD parameter and the electronic device for performing the model training method may be the same device or different devices.
Fig. 5 shows a flowchart of a first embodiment of a method for determining VAD parameters provided by the present application, which is performed by an electronic device. As shown in fig. 5, the VAD parameter determination method may be implemented by the following steps:
Step 510: and acquiring the voice signal to be recognized acquired in the current environment.
The execution body of the embodiment is an electronic device, which may be a terminal device, for example, a mobile phone, a desktop computer, a notebook computer, or a server. In practical application, the electronic device is specifically a terminal device or a server, and may be determined according to practical situations, which is not specifically limited.
In practical application, when determining to detect voice activity of the current environment, the voice signal to be recognized of the current environment needs to be collected, so that the VAD parameter corresponding to the current environment is determined according to the VAD parameter determination model.
In one possible implementation, the voice signal to be recognized in the current environment may be acquired in real time by the voice acquisition device, and the voice signal to be recognized is input into the initial VAD parameter determination model.
For example, in the application scenarios of real-time conversation and voice enhancement, the voice signal in the conversation process can be collected in real time in the conversation process; in the voice recognition scene, the voice signal of the current environment can be acquired in real time when the trigger condition is met.
Illustratively, the trigger condition may be:
1. A specific voice instruction or keyword is received: for example, when the voice signal to be recognized is collected by a voice assistant, collection may be started after a specific wake phrase such as "Hey voice assistant" is detected.
2. The sound level reaches a certain threshold.
3. A particular gesture or action is detected: for example, when the voice signal to be recognized is collected by the voice assistant, the collection may be started under the triggering of a gesture such as a user clapping or waving a hand.
4. And (3) timing triggering: the collection of the speech signal to be recognized is started at a predetermined point in time, typically for a timed recording or a timed task execution.
5. External event triggering: the collection of the speech signal is started upon receipt of an external event signal. For example, when a voice signal to be recognized is collected by the in-vehicle voice recognition system, recording may be started upon receiving a vehicle start signal or a driver operation.
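The five trigger conditions above can be sketched as a single dispatch function; the event fields, keyword values, and signal names below are hypothetical:

```python
def should_start_capture(event):
    """Decide whether to start collecting the speech signal to be
    recognized. `event` is a dict describing what was observed."""
    if event.get("keyword") == "Hey voice assistant":
        return True                       # 1. wake word / voice instruction
    if event.get("sound_level", 0) >= event.get("level_threshold", float("inf")):
        return True                       # 2. sound level reaches a threshold
    if event.get("gesture") in {"clap", "wave"}:
        return True                       # 3. specific gesture or action
    if event.get("timer_fired"):
        return True                       # 4. timed trigger
    if event.get("external_signal") in {"vehicle_start", "driver_action"}:
        return True                       # 5. external event
    return False

print(should_start_capture({"keyword": "Hey voice assistant"}))  # True
print(should_start_capture({}))                                  # False
```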
Step 520: and inputting the voice signal to be identified into the VAD parameter determining model, and obtaining the target VAD parameter output by the VAD parameter determining model.
In this step, the VAD parameter determination model is a model trained according to the model training method described in any of the above embodiments. The speech signal to be identified may be input into a VAD parameter determination model to obtain a target VAD parameter corresponding to the current environment.
In one possible implementation, the signal feature vector and the emotion feature vector of the voice signal to be recognized may be extracted. These feature vectors are then input into the VAD parameter determination model to obtain the target VAD parameter output by the VAD parameter determination model.
Step 530: outputting the target VAD parameter.
In this step, the accuracy of voice activity detection for the current environment is higher when using the target VAD parameter of the current environment determined in steps 510 and 520. In real life, there are many scenarios involving voice activity detection, such as voice recognition. Thus, the target VAD parameter may also be output for subsequent use in other scenarios related to the current environment.
According to the VAD parameter determining method provided by the embodiment of the application, the voice signal to be recognized acquired in the current environment is acquired, the voice signal to be recognized is input into the VAD parameter determining model, the target VAD parameter output by the VAD parameter determining model is acquired, and the target VAD parameter is output. The VAD parameter determination model is a model trained according to the model training method described in any of the above embodiments. In the technical scheme, because the VAD parameter determination model is obtained by model training based on the noise environment and the voice signal of the current environment, the accuracy of the determined target VAD parameter corresponding to the current environment is higher.
Alternatively, in some embodiments, the trained VAD parameter determination model may be applied to the speech recognition system shown in FIG. 1. That is, after obtaining the target VAD parameter output by the VAD parameter determination model, the speech signal to be recognized may also be input into the speech recognition model, and the recognition result output by the speech recognition model may be obtained.
The voice recognition model carries out voice recognition on the voice signal to be recognized through the target VAD parameters.
Specifically, the voice recognition model detects voice activity of the voice signal to be recognized through the target VAD parameters, and obtains a plurality of sub-voice signals obtained through detection. And then, carrying out voice recognition on each sub-voice signal and outputting a recognition result.
In one possible implementation, the voice signal to be recognized comprises the speech of the user and the surrounding ambient sound. The voice recognition model performs voice activity detection on the voice signal to be recognized through the target VAD parameter, extracts the user's speech from the voice signal to be recognized, and segments the user's speech into a plurality of sub-voice signals in units of sentences. That is, each sub-voice signal contains one sentence, or a plurality of consecutive sentences. Further, the voice recognition model performs voice recognition on each sub-voice signal to obtain a corresponding sub-recognition result, and the recognition result of the voice signal to be recognized comprises the sub-recognition results corresponding to all the sub-voice signals.
It should be understood that after the sub-recognition result corresponding to each sub-voice signal is generated, the sub-recognition results may be spliced according to the time sequence of the sub-voice signals, so as to generate the recognition result of the voice signal to be recognized.
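Splicing sub-recognition results in time order can be sketched as follows, assuming each sub-result carries the start time of its sub-voice signal (the pair layout and sample texts are illustrative):

```python
def splice_recognition_results(sub_results):
    """Concatenate per-segment recognition results in time order to
    form the recognition result of the whole voice signal.
    Each item is a (start_time, text) pair."""
    ordered = sorted(sub_results, key=lambda item: item[0])
    return " ".join(text for _, text in ordered)

segments = [(2.0, "turn on the radio"), (0.0, "hello assistant")]
print(splice_recognition_results(segments))  # hello assistant turn on the radio
```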
The recognition result may be text corresponding to the voice signal to be recognized. Alternatively, the recognition result may take other forms, for example, an identity confirmation from voiceprint recognition or an emotion state from emotion recognition, which may be determined according to the practical situation; the embodiment of the present application does not particularly limit this.
It should be understood that the voice signal to be recognized in this embodiment and the voice signal to be recognized in fig. 5 may be the same or different. Specifically, the method shown in the embodiment of fig. 5 may first be used to collect a voice signal to be recognized and determine the target VAD parameter corresponding to the current environment, and the voice recognition model is configured according to the target VAD parameter; the configured voice recognition model then only needs to continuously collect voice signals to be recognized in the current environment and perform voice recognition on them. In this case, the voice signal to be recognized in fig. 5 is used to configure the voice recognition model and does not participate in the subsequent voice recognition processing; that is, the voice signal to be recognized in this embodiment is different from the voice signal to be recognized in fig. 5. Of course, it is also possible to collect the voice signal to be recognized, determine the target VAD parameter corresponding to the current environment, configure the voice recognition model according to the target VAD parameter, and then perform voice recognition on that same voice signal with the configured voice recognition model. In this case, the voice signal to be recognized in this embodiment is the same as the voice signal to be recognized in fig. 5, and the target VAD parameter is determined once before each voice recognition, further improving the accuracy of voice recognition.
Fig. 6 shows a schematic structural diagram of an embodiment of the model training apparatus provided by the present application. As shown in fig. 6, the model training apparatus 600 includes: a training module 610, an input module 620, and an adjustment module 630.
The training module 610 is configured to perform model training according to a plurality of voice signal samples and a preset voice activity detection VAD parameter sequence, obtain an initial VAD parameter determination model, and the preset VAD parameter sequence includes a preset VAD parameter corresponding to each voice signal sample.
The input module 620 is configured to input a voice signal of the current environment into the initial VAD parameter determination model, and obtain VAD parameters to be adjusted output by the initial VAD parameter determination model.
The adjusting module 630 is configured to adjust the VAD parameter to be adjusted according to the voice signal of the current environment, and generate the VAD parameter of the current environment.
The training module 610 is further configured to continue training the initial VAD parameter determination model according to the voice signal of the current environment and the VAD parameter of the current environment, so as to generate the VAD parameter determination model.
In one possible implementation, the adjusting module 630 is specifically configured to:
and detecting voice activity of the voice signal in the current environment, and if voice activity cannot be detected, acquiring the noise level of the current environment.
And if the noise level of the current environment is greater than the preset noise level, increasing the VAD parameter to be adjusted to obtain the VAD parameter of the current environment.
And if the noise level of the current environment is smaller than or equal to the preset noise level, reducing the VAD parameter to be adjusted to obtain the VAD parameter of the current environment.
In one possible implementation, the speech signal sample contains only human voice, or both human voice and noise.
In one possible implementation, the training module 610 is specifically configured to:
and extracting a signal characteristic vector and an emotion characteristic vector of each voice signal sample.
And performing model training according to a preset VAD parameter sequence and signal characteristic vectors and emotion characteristic vectors of each voice signal sample to obtain an initial VAD parameter determination model.
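As an illustration of extracting a signal feature vector, the sketch below computes RMS energy and zero-crossing rate per sample sequence. These specific features, and the omission of the emotion feature vector (which would come from a separate model), are assumptions for illustration only:

```python
import math

def extract_signal_features(samples):
    """Illustrative signal feature vector for one voice signal sample:
    [RMS energy, zero-crossing rate]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    zcr = sum(1 for a, b in zip(samples, samples[1:])
              if (a >= 0) != (b >= 0)) / (len(samples) - 1)
    return [rms, zcr]

# An alternating-sign signal: full energy, maximal zero-crossing rate.
print(extract_signal_features([1.0, -1.0, 1.0, -1.0]))  # [1.0, 1.0]
```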
It should be noted that the division of the modules of the above apparatus is merely a division of logical functions; in actual implementation, the modules may be fully or partially integrated into one physical entity or may be physically separate. These modules may all be implemented in the form of software invoked by a processing element, or entirely in hardware; some modules may also be implemented in the form of software invoked by a processing element while other modules are implemented in hardware. In addition, all or part of the modules may be integrated together or implemented independently. The processing element here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be implemented by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
As can be seen from the above, compared with the fixed VAD parameters, the model training device provided by the embodiment of the present application generates the VAD parameters by determining the model by the VAD parameters, thereby effectively improving the accuracy and robustness of the VAD. Further, in this embodiment, after the initial VAD parameter determination model is obtained, the initial VAD parameter determination model is retrained according to the voice signal of the current environment, so as to generate the VAD parameter determination model, thereby improving the accuracy of the VAD parameter determination model again.
Fig. 7 is a schematic structural diagram of an embodiment of a VAD parameter determination apparatus provided by the present application. As shown in fig. 7, the VAD parameter determining apparatus 700 includes: an acquisition module 710 and an input module 720.
The acquiring module 710 is configured to acquire a voice signal to be recognized acquired in a current environment.
The input module 720 is configured to input the voice signal to be recognized into the voice activity detection VAD parameter determination model, and obtain the target VAD parameter output by the VAD parameter determination model. The VAD parameter determination model is a model trained according to the model training method described in any of the above embodiments.
An output module 730 for outputting the target VAD parameter.
In one possible implementation, the obtaining module 710 is further configured to input the voice signal to be recognized into a voice recognition model and obtain the recognition result output by the voice recognition model; the voice recognition model performs voice recognition on the voice signal to be recognized using the target VAD parameter.
In one possible implementation, the obtaining module 710 is specifically configured to:
detecting voice activity of a voice signal to be recognized through target VAD parameters, and obtaining a plurality of sub-voice signals obtained through detection;
and carrying out voice recognition on each sub-voice signal and outputting a recognition result.
As can be seen from the above, the VAD parameter determining apparatus provided by the embodiment of the present application can acquire the voice signal to be recognized in the current environment, and input the voice signal to be recognized into the VAD parameter determining model, so as to obtain the target VAD parameter output by the VAD parameter determining model.
Fig. 8 is a schematic structural diagram of a first embodiment of an electronic device according to the present application, and the specific embodiment of the present application is not limited to the specific implementation of the electronic device.
As shown in fig. 8, the electronic device may include: a processor (processor) 802, a communication interface (Communications Interface) 804, a memory (memory) 806, and a communication bus 808.
Wherein: processor 802, communication interface 804, and memory 806 communicate with each other via a communication bus 808. A communication interface 804 for communicating with network elements of other devices, such as clients or other servers. The processor 802 is configured to execute the program 810, and may specifically perform relevant steps in the above-described embodiment of the method for determining the model training method or VAD parameters.
In particular, program 810 may include program code including computer-executable instructions.
The processor 802 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application. The one or more processors comprised by the electronic device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
Memory 806 for storing a program 810. The memory 806 may include high-speed RAM memory or may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
Program 810 may be specifically invoked by processor 802 to cause an electronic device to perform the following model training operations:
Model training is carried out according to a plurality of voice signal samples and a preset voice activity detection VAD parameter sequence, an initial VAD parameter determination model is obtained, and the preset VAD parameter sequence comprises preset VAD parameters corresponding to each voice signal sample;
Inputting the voice signal of the current environment into an initial VAD parameter determining model, and obtaining VAD parameters to be adjusted output by the initial VAD parameter determining model;
adjusting VAD parameters to be adjusted according to voice signals of the current environment to generate VAD parameters of the current environment;
And continuing training the initial VAD parameter determination model according to the voice signal of the current environment and the VAD parameters of the current environment to generate a VAD parameter determination model.
Or executable instructions cause the electronic device/apparatus to perform the following VAD parameter determination operations:
Acquiring a voice signal to be recognized acquired in a current environment;
Inputting the voice signal to be recognized into a voice activity detection VAD parameter determination model, and obtaining target VAD parameters output by the VAD parameter determination model; the VAD parameter determination model is a model obtained by training according to the model training method;
Outputting the target VAD parameter.
As can be seen from the above, when the electronic device provided by the embodiment of the application performs model training, after the initial VAD parameter determination model is obtained, the initial VAD parameter determination model is retrained according to the voice signal of the current environment, so as to generate the VAD parameter determination model, and the accuracy of the VAD parameter determination model is improved again. In practical application, VAD parameters are generated through the VAD parameter determination model, so that the accuracy and the robustness of the VAD are effectively improved. And, compared with fixed VAD parameters, the VAD parameters generated by the VAD parameter determination model are more fit with the current environment, so that the accuracy of VAD processing is higher.
Fig. 9 shows a schematic structural diagram of a second embodiment of the electronic device provided by the application. As shown in fig. 9, the apparatus 900 includes: a sound sensor 910, one or more processors 920, and a communication interface 930.
The sound sensor is used for collecting voice signals.
The processor is configured to perform the steps of the model training method and the method embodiment of determining VAD parameters described above.
As can be seen from the above, when the electronic device provided by the embodiment of the application performs model training, after the initial VAD parameter determination model is obtained, the initial VAD parameter determination model is retrained according to the voice signal of the current environment, so as to generate the VAD parameter determination model, and the accuracy of the VAD parameter determination model is improved again. In practical application, VAD parameters are generated through the VAD parameter determination model, so that the accuracy and the robustness of the VAD are effectively improved. And, compared with fixed VAD parameters, the VAD parameters generated by the VAD parameter determination model are more fit with the current environment, so that the accuracy of VAD processing is higher.
The embodiment of the application provides a computer readable storage medium, which stores at least one executable instruction, and the executable instruction enables an electronic device to execute the model training method and the VAD parameter determining method in any method embodiment when the executable instruction runs on the electronic device.
The executable instructions may be particularly useful for causing an electronic device/apparatus to perform the following model training operations:
Model training is carried out according to a plurality of voice signal samples and a preset voice activity detection VAD parameter sequence, an initial VAD parameter determination model is obtained, and the preset VAD parameter sequence comprises preset VAD parameters corresponding to each voice signal sample;
Inputting the voice signal of the current environment into an initial VAD parameter determining model, and obtaining VAD parameters to be adjusted output by the initial VAD parameter determining model;
adjusting VAD parameters to be adjusted according to voice signals of the current environment to generate VAD parameters of the current environment;
And continuing training the initial VAD parameter determination model according to the voice signal of the current environment and the VAD parameters of the current environment to generate a VAD parameter determination model.
Or executable instructions cause the electronic device/apparatus to perform the following VAD parameter determination operations:
Acquiring a voice signal to be recognized acquired in a current environment;
Inputting the voice signal to be recognized into a voice activity detection VAD parameter determination model, and obtaining target VAD parameters output by the VAD parameter determination model; the VAD parameter determination model is a model obtained by training according to the model training method;
Outputting the target VAD parameter.
As can be seen from the above, when the computer readable storage medium provided by the embodiment of the present application performs model training, after the initial VAD parameter determination model is obtained, the initial VAD parameter determination model is retrained according to the voice signal of the current environment, so as to generate the VAD parameter determination model, and the accuracy of the VAD parameter determination model is improved again. In practical application, VAD parameters are generated through the VAD parameter determination model, so that the accuracy and the robustness of the VAD are effectively improved. And, compared with fixed VAD parameters, the VAD parameters generated by the VAD parameter determination model are more fit with the current environment, so that the accuracy of VAD processing is higher.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. In addition, embodiments of the present application are not directed to any particular programming language.
In the description provided herein, numerous specific details are set forth. It will be appreciated, however, that embodiments of the application may be practiced without these specific details. Similarly, in the above description of exemplary embodiments of the application, various features of the embodiments are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. The claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component, and may furthermore be divided into a plurality of sub-modules or sub-units or sub-components, except where at least some of such features and/or processes or elements are mutually exclusive.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (10)

1. A method of model training, the method comprising:
performing model training according to a plurality of voice signal samples and a preset voice activity detection (VAD) parameter sequence to obtain an initial VAD parameter determination model, wherein the preset VAD parameter sequence comprises a preset VAD parameter corresponding to each voice signal sample;
inputting a voice signal of a current environment into the initial VAD parameter determination model, and obtaining a to-be-adjusted VAD parameter output by the initial VAD parameter determination model;
adjusting the to-be-adjusted VAD parameter according to the voice signal of the current environment to generate a current environment VAD parameter; and
continuing to train the initial VAD parameter determination model according to the voice signal of the current environment and the current environment VAD parameter to generate a VAD parameter determination model.
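The two-stage pipeline of claim 1 — fit an initial model on (sample, preset VAD parameter) pairs, then use it to predict a to-be-adjusted parameter for the current environment — can be sketched as follows. This is a minimal illustration only: the feature set (log energy, zero-crossing rate) and the least-squares fit are placeholder choices, not the model the patent describes, and all function names are hypothetical.

```python
import numpy as np

def extract_features(signal):
    # Illustrative features only: log energy and zero-crossing rate.
    energy = np.log(np.mean(np.square(signal)) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(signal)))) / 2.0
    return np.array([energy, zcr])

def train_initial_model(samples, preset_vad_params):
    # Least-squares fit from sample features to the preset VAD parameter
    # of each sample -- a stand-in for real supervised model training.
    X = np.stack([extract_features(np.asarray(s)) for s in samples])
    X = np.hstack([X, np.ones((len(X), 1))])          # bias column
    y = np.asarray(preset_vad_params, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_vad_param(model_w, signal):
    # Predict the to-be-adjusted VAD parameter for a new signal.
    f = np.append(extract_features(np.asarray(signal)), 1.0)
    return float(f @ model_w)
```

In the claimed method, the predicted value is then adjusted against the current environment and fed back as a new training target, so the model adapts to deployment conditions over time.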
2. The method according to claim 1, wherein the adjusting the to-be-adjusted VAD parameter according to the voice signal of the current environment to generate a current environment VAD parameter comprises:
performing voice activity detection on the voice signal of the current environment, and if no voice activity is detected, acquiring a noise level of the current environment;
if the noise level of the current environment is greater than a preset noise level, increasing the to-be-adjusted VAD parameter to obtain the current environment VAD parameter; and
if the noise level of the current environment is less than or equal to the preset noise level, decreasing the to-be-adjusted VAD parameter to obtain the current environment VAD parameter.
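The adjustment rule of claim 2 can be sketched as a small function. The voice detector and noise estimator are caller-supplied callables here, and the step size and preset noise level are illustrative defaults — the patent does not specify concrete values.

```python
def adjust_vad_param(param, signal, detect_voice, noise_level,
                     preset_noise_level=0.5, step=0.05):
    # Hypothetical adjustment mirroring claim 2: only adjust when no
    # voice activity is detected in the current-environment signal.
    if detect_voice(signal):
        return param                      # voice found: keep parameter
    if noise_level(signal) > preset_noise_level:
        return param + step               # noisy environment: raise
    return param - step                   # quiet environment: lower
```

Raising the parameter in noise makes the detector stricter (fewer false speech triggers); lowering it in quiet makes the detector more sensitive.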
3. The method according to claim 1 or 2, wherein the voice signal samples comprise human voice only, or both human voice and noise.
4. The method according to claim 1 or 2, wherein the performing model training according to a plurality of voice signal samples and a preset VAD parameter sequence to obtain an initial VAD parameter determination model comprises:
extracting a signal feature vector and an emotion feature vector of each voice signal sample; and
performing model training according to the preset VAD parameter sequence and the signal feature vector and emotion feature vector of each voice signal sample to obtain the initial VAD parameter determination model.
5. A method for determining a VAD parameter, the method comprising:
acquiring a to-be-recognized voice signal collected in a current environment;
inputting the to-be-recognized voice signal into a VAD parameter determination model, and obtaining a target VAD parameter output by the VAD parameter determination model, wherein the VAD parameter determination model is trained according to the method of any one of claims 1-4; and
outputting the target VAD parameter.
6. The method of claim 5, further comprising:
inputting the to-be-recognized voice signal into a voice recognition model, and obtaining a recognition result output by the voice recognition model, wherein the voice recognition model performs voice recognition on the to-be-recognized voice signal by using the target VAD parameter.
7. The method of claim 6, wherein the inputting the to-be-recognized voice signal into a voice recognition model and obtaining a recognition result output by the voice recognition model comprises:
performing voice activity detection on the to-be-recognized voice signal by using the target VAD parameter to obtain a plurality of detected sub-voice signals; and
performing voice recognition on each sub-voice signal, and outputting the recognition result.
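The VAD step of claim 7 — splitting the signal into sub-voice signals under a threshold-style parameter — can be illustrated with a simple frame-energy detector. This is a simplified stand-in, assuming the VAD parameter acts as an RMS-energy threshold; the frame length and merging logic are illustrative.

```python
import numpy as np

def split_speech(signal, vad_threshold, frame_len=160):
    # Frames whose RMS energy exceeds `vad_threshold` are treated as
    # speech; runs of contiguous speech frames are merged into
    # sub-voice signals, each of which would then be passed to the
    # voice recognition model.
    signal = np.asarray(signal, dtype=float)
    n = len(signal) // frame_len
    speech = [np.sqrt(np.mean(signal[i*frame_len:(i+1)*frame_len] ** 2))
              > vad_threshold for i in range(n)]
    segments, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append(signal[start*frame_len:i*frame_len])
            start = None
    if start is not None:                 # speech ran to the end
        segments.append(signal[start*frame_len:n*frame_len])
    return segments
```

A larger `vad_threshold` suppresses low-energy noise but risks clipping quiet speech, which is why the claimed method tunes this parameter to the current environment.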
8. A model training apparatus, the apparatus comprising:
a training module, configured to perform model training according to a plurality of voice signal samples and a preset voice activity detection (VAD) parameter sequence to obtain an initial VAD parameter determination model, wherein the preset VAD parameter sequence comprises a preset VAD parameter corresponding to each voice signal sample;
an input module, configured to input a voice signal of a current environment into the initial VAD parameter determination model and obtain a to-be-adjusted VAD parameter output by the initial VAD parameter determination model; and
an adjustment module, configured to adjust the to-be-adjusted VAD parameter according to the voice signal of the current environment to generate a current environment VAD parameter;
wherein the training module is further configured to continue to train the initial VAD parameter determination model according to the voice signal of the current environment and the current environment VAD parameter to generate a VAD parameter determination model.
9. An apparatus for determining a VAD parameter, the apparatus comprising:
an acquisition module, configured to acquire a to-be-recognized voice signal collected in a current environment;
an input module, configured to input the to-be-recognized voice signal into a voice activity detection (VAD) parameter determination model and obtain a target VAD parameter output by the VAD parameter determination model, wherein the VAD parameter determination model is trained according to the method of any one of claims 1-4; and
an output module, configured to output the target VAD parameter.
10. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus; and
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the method of any one of claims 1-7.
CN202410655233.9A 2024-05-24 2024-05-24 Model training method, VAD parameter determining method, device and equipment Pending CN118571214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410655233.9A CN118571214A (en) 2024-05-24 2024-05-24 Model training method, VAD parameter determining method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410655233.9A CN118571214A (en) 2024-05-24 2024-05-24 Model training method, VAD parameter determining method, device and equipment

Publications (1)

Publication Number Publication Date
CN118571214A true CN118571214A (en) 2024-08-30

Family

ID=92477987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410655233.9A Pending CN118571214A (en) 2024-05-24 2024-05-24 Model training method, VAD parameter determining method, device and equipment

Country Status (1)

Country Link
CN (1) CN118571214A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination