CN109192224B - Voice evaluation method, device and equipment and readable storage medium
- Publication number: CN109192224B
- Application number: CN201811073869.3A
- Authority: CN (China)
- Prior art keywords: keywords, speech, keyword, evaluated, hit
- Legal status: Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
Abstract
The application discloses a voice evaluation method, device, equipment and readable storage medium. The method acquires a speech to be evaluated and keywords serving as the evaluation standard, detects whether speech segments corresponding to the keywords exist in the speech to be evaluated to obtain a detection result, and determines the evaluation result of the speech to be evaluated according to the detection result. By acquiring keywords that serve as the evaluation standard, whether the speech to be evaluated contains the corresponding speech segments can be detected automatically and the evaluation result determined from the detection result, without manual evaluation.
Description
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech evaluation method, apparatus, device, and readable storage medium.
Background
With the continuous deepening of education reform, spoken language examinations are being carried out in provinces and cities across the country. A spoken test typically provides a piece of material and sets several questions about that material. After reading the material, the test taker answers each question orally.
In most existing spoken language examinations, professional teachers evaluate the examinees' answers against the correct answer information corresponding to each question. This manual evaluation is highly susceptible to human subjectivity, so the evaluation result suffers human interference, and it also consumes a large amount of labor cost.
Disclosure of Invention
In view of the above, the present application provides a speech evaluation method, apparatus, device and readable storage medium to overcome the drawbacks of the existing manual evaluation of spoken language tests.
In order to achieve the above object, the following solutions are proposed:
a speech evaluation method comprises the following steps:
acquiring a voice to be evaluated and a keyword serving as an evaluation standard;
detecting whether a voice segment corresponding to the keyword exists in the voice to be evaluated or not to obtain a detection result;
and determining the evaluation result of the speech to be evaluated according to the detection result.
Preferably, the detecting whether a speech segment corresponding to the keyword exists in the speech to be evaluated includes:
recognizing the speech to be evaluated to obtain recognized text information;
and matching the keywords with the text information to obtain a matching result, wherein the matching result shows the inclusion condition of the speech to be evaluated on the speech segments corresponding to the keywords.
Preferably, the recognizing the speech to be evaluated to obtain recognized text information includes:
extracting acoustic features of the speech to be evaluated;
and inputting the acoustic features into a preset first acoustic recognition model to obtain text information corresponding to the speech to be evaluated and output by the first acoustic recognition model.
Preferably, the first acoustic recognition model is a general acoustic recognition model, or an acoustic recognition model obtained by adapting the general acoustic recognition model with the result of recognizing the speech to be evaluated using the general acoustic recognition model.
Preferably, the first acoustic recognition model is an acoustic recognition model corresponding to a decoding space formed by the keywords and a filter, and the filter represents all the non-keywords.
Preferably, the detecting whether a speech segment corresponding to the keyword exists in the speech to be evaluated further includes:
obtaining hidden layer average acoustic features output by a hidden layer of the first acoustic recognition model and converted from the acoustic features;
inputting the hidden layer average acoustic features and the word vector features of the keywords into a preset first keyword classifier to obtain the classification results of the speech to be evaluated, which are output by the first keyword classifier, on the speech segments corresponding to the keywords and the non-keywords;
the first keyword classifier is trained by taking, as training samples, the hidden layer average acoustic features obtained after the acoustic features of speech training data are converted by a hidden layer of the first acoustic recognition model, together with the word vector features of the keywords, and by taking, as sample labels, the classification labeling results of the speech training data with respect to the speech segments corresponding to the keywords and non-keywords.
Preferably, the detecting whether a speech segment corresponding to the keyword exists in the speech to be evaluated further includes:
windowing the voice to be evaluated to obtain at least one windowed voice to be evaluated;
for the windowed acoustic features of each windowed speech to be evaluated, obtaining the hidden layer average windowing acoustic features output after conversion by a hidden layer of the first acoustic recognition model;
inputting each hidden layer average windowing acoustic feature into a preset second keyword classifier to obtain a classification result of each windowed to-be-evaluated voice output by the second keyword classifier on the voice segments corresponding to the keywords and the non-keywords;
the second keyword classifier is a keyword classifier trained with the hidden layer average acoustic features of keywords and of non-keywords, obtained after the acoustic features of the speech segments corresponding to the keywords and the non-keywords in speech training data are converted by a hidden layer of the first acoustic recognition model.
Preferably, the determining an evaluation result of the speech to be evaluated according to the detection result includes:
determining evaluation features according to the detection result, wherein the evaluation features include any one or more of the hit keywords, the confidence of the hit keywords, the keyword hit rate, and the Gaussian duration of the hit keywords;
the hit keywords are the keywords whose corresponding speech segments exist in the speech to be evaluated; the confidence of a hit keyword is the recognition confidence of the first acoustic recognition model for the hit keyword, or the classification confidence of the first keyword classifier for the hit keyword, or the classification confidence of the second keyword classifier for the hit keyword; the keyword hit rate is the proportion of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined by the pronunciation duration of the hit keywords in the speech to be evaluated;
and determining the evaluation result of the speech to be evaluated according to the evaluation characteristic.
A speech evaluation apparatus comprising:
the data acquisition unit is used for acquiring the speech to be evaluated and the keywords serving as the evaluation standard;
the voice detection unit is used for detecting whether a voice segment corresponding to the keyword exists in the voice to be evaluated or not to obtain a detection result;
and the evaluation result determining unit is used for determining the evaluation result of the speech to be evaluated according to the detection result.
Preferably, the voice detection unit includes:
the text recognition unit is used for recognizing the speech to be evaluated to obtain recognized text information;
and the text matching unit is used for matching the keywords with the text information to obtain a matching result, and the matching result shows the inclusion condition of the speech to be evaluated on the speech segments corresponding to the keywords.
Preferably, the text recognition unit includes:
the acoustic feature extraction unit is used for extracting the acoustic features of the speech to be evaluated;
and the first acoustic recognition model prediction unit is used for inputting the acoustic features into a preset first acoustic recognition model to obtain text information corresponding to the speech to be evaluated and output by the first acoustic recognition model.
Preferably, the voice detection unit further includes:
a global hidden layer feature obtaining unit, configured to obtain hidden layer average acoustic features output by a hidden layer of the first acoustic recognition model and obtained after the acoustic features are converted;
the first keyword classifier prediction unit is used for inputting the hidden layer average acoustic features and the word vector features of the keywords into a preset first keyword classifier to obtain the classification results of the speech to be evaluated, which are output by the first keyword classifier, on the speech segments corresponding to the keywords and the non-keywords;
the first keyword classifier is trained by taking, as training samples, the hidden layer average acoustic features obtained after the acoustic features of speech training data are converted by a hidden layer of the first acoustic recognition model, together with the word vector features of the keywords, and by taking, as sample labels, the classification labeling results of the speech training data with respect to the speech segments corresponding to the keywords and non-keywords.
Preferably, the voice detection unit further includes:
the voice windowing unit is used for windowing the voice to be evaluated to obtain at least one windowed voice to be evaluated;
a windowing hidden layer feature obtaining unit, configured to obtain, for the windowed acoustic features of each windowed speech to be evaluated, the hidden layer average windowing acoustic features output after conversion by a hidden layer of the first acoustic recognition model;
the second keyword classifier prediction unit is used for inputting each hidden layer average windowing acoustic feature into a preset second keyword classifier to obtain a classification result of each windowed speech to be evaluated, which is output by the second keyword classifier, on the speech segments corresponding to the keywords and the non-keywords;
the second keyword classifier is a keyword classifier trained with the hidden layer average acoustic features of keywords and of non-keywords, obtained after the acoustic features of the speech segments corresponding to the keywords and the non-keywords in speech training data are converted by a hidden layer of the first acoustic recognition model.
Preferably, the evaluation result determination unit includes:
the first evaluation characteristic determining unit is used for determining evaluation characteristics according to the detection result, wherein the evaluation characteristics comprise any one or more of hit keywords, the confidence of the hit keywords, the keyword hit rate and the Gaussian duration of the hit keywords;
the hit keywords are the keywords whose corresponding speech segments exist in the speech to be evaluated; the confidence of a hit keyword is the recognition confidence of the first acoustic recognition model for the hit keyword; the keyword hit rate is the proportion of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined by the pronunciation duration of the hit keywords in the speech to be evaluated;
and the first evaluation characteristic processing unit is used for determining an evaluation result of the speech to be evaluated according to the evaluation characteristic.
Preferably, the evaluation result determination unit includes:
the second evaluation characteristic determining unit is used for determining evaluation characteristics according to the detection result, wherein the evaluation characteristics comprise any one or more of hit keywords, the confidence of the hit keywords, the keyword hit rate and the Gaussian duration of the hit keywords;
the hit keywords are the keywords whose corresponding speech segments exist in the speech to be evaluated; the confidence of a hit keyword is the classification confidence of the first keyword classifier for the hit keyword, or the classification confidence of the second keyword classifier for the hit keyword; the keyword hit rate is the proportion of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined by the pronunciation duration of the hit keywords in the speech to be evaluated;
and the second evaluation characteristic processing unit is used for determining an evaluation result of the speech to be evaluated according to the evaluation characteristic.
A speech evaluation device includes a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program to realize the steps of the voice evaluation method.
A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech evaluation method as described above.
According to the technical scheme, the speech evaluation method provided by the embodiments of the application acquires the speech to be evaluated and the keywords serving as the evaluation standard, detects whether speech segments corresponding to the keywords exist in the speech to be evaluated to obtain a detection result, and determines the evaluation result of the speech to be evaluated according to the detection result. By acquiring keywords that serve as the evaluation standard, the application can automatically detect whether the speech to be evaluated contains the corresponding speech segments and determine the evaluation result from the detection result, without manual evaluation.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of a speech evaluation method disclosed in the embodiments of the present application;
FIG. 2 illustrates a schematic diagram of hidden layer average feature extraction of keywords and non-keywords of a speech sample to be evaluated;
FIG. 3 is a schematic structural diagram of a speech evaluation device disclosed in the embodiments of the present application;
fig. 4 is a block diagram of a hardware structure of a speech evaluation apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to solve the problems of existing manual spoken language evaluation, namely human interference in the evaluation result and wasted labor cost, the speech evaluation method of this application realizes automatic speech evaluation based on speech detection technology. Described in detail with reference to FIG. 1, the method may include the following steps:
Step S100: obtaining the speech to be evaluated and the keywords serving as the evaluation standard.
Specifically, taking a spoken test scenario as an example, the speech to be evaluated may be an examinee's recorded spoken answer. Correspondingly, in this embodiment, keywords serving as the evaluation criteria may be preset. Taking a material-reading spoken test question as an example, the keywords serving as the evaluation criteria may be keywords extracted from the reading material. For spoken tests of other question types, the keywords may be extracted from the answers corresponding to the questions.
In this step, the obtaining mode of the speech to be evaluated may be receiving through a recording device, and the recording device may include a microphone, such as a head-mounted microphone.
The keywords used as evaluation criteria reflect the core points of the standard answer. The keywords may be specified by a user in advance, or extracted from the answers corresponding to the questions with a keyword extraction technique, such as the TF-IDF (term frequency-inverse document frequency) method.
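As an illustration only, the following is a minimal Python sketch of TF-IDF keyword extraction; the function name `extract_keywords`, the use of scikit-learn, and the assumption that texts are pre-tokenized into space-separated words are hypothetical, not prescribed by this application.

```python
# A minimal sketch of TF-IDF keyword extraction (hypothetical, not the
# patent's prescribed implementation). Texts are assumed pre-tokenized
# into space-separated words.
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(answer_text, background_corpus, top_n=5):
    """Rank the words of `answer_text` by TF-IDF weight and keep the top ones."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(background_corpus + [answer_text])   # learn IDF statistics
    weights = vectorizer.transform([answer_text]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(vocab, weights), key=lambda p: p[1], reverse=True)
    return [word for word, w in ranked[:top_n] if w > 0]
```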
It is to be understood that the number of keywords as evaluation criteria is not limited, and may be one or more.
Step S110: detecting whether a speech segment corresponding to the keyword exists in the speech to be evaluated, to obtain a detection result.
Specifically, the foregoing keywords that are determined as the evaluation criteria reflect the core points of the answers, and in this step, keyword detection may be performed on the speech to be evaluated, that is, whether a speech segment corresponding to the keyword exists in the speech to be evaluated is detected, so as to obtain a detection result.
The detection result reflects whether the speech to be evaluated contains the speech segments corresponding to the keywords. When there is one keyword, the detection result is whether the speech to be evaluated contains the speech segment corresponding to that keyword. When there are at least two keywords, the detection result covers the inclusion of the speech segment corresponding to each keyword.
Step S120: determining the evaluation result of the speech to be evaluated according to the detection result.
Specifically, as can be seen from the foregoing description, the keyword reflects a core point of the answer corresponding to the question, and therefore the keyword can represent the answer corresponding to the question to some extent. In this step, the evaluation result of the speech to be evaluated is determined according to the inclusion condition of the speech to be evaluated on the speech segment corresponding to the keyword.
It can be understood that the more speech segments corresponding to the keywords the speech to be evaluated contains, the better its evaluation result.
With the speech evaluation method provided by this embodiment, whether speech segments corresponding to the keywords exist in the speech to be evaluated can be detected automatically by acquiring the keywords serving as the evaluation standard, and the evaluation result is determined from the detection result. Since no manual evaluation is needed, interference from human subjectivity is avoided and labor cost is reduced.
In an embodiment of the present application, an implementation of step S110, detecting whether a speech segment corresponding to the keyword exists in the speech to be evaluated, is introduced. Optionally, the process may include:
S1: recognizing the speech to be evaluated to obtain recognized text information.
Specifically, speech recognition may be performed on the speech to be evaluated to obtain the recognized text information.
S2: matching the keywords with the text information to obtain a matching result, wherein the matching result shows whether the speech to be evaluated contains the speech segments corresponding to the keywords.
In this step, based on the keyword that is used as the evaluation criterion a priori, the keyword is matched with the text information to obtain a matching result, and the matching result can be used as the detection result.
It will be appreciated that the matching result may include whether a word matching each keyword exists in the text information. Further optionally, the matching result may also include the confidence of each matching word, which may be the recognition confidence of that word in the speech recognition process.
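A minimal sketch of this matching step, assuming the recognizer returns the text plus optional word-level confidences; the function name `match_keywords` and the dictionary layout are illustrative assumptions.

```python
def match_keywords(keywords, recognized_text, word_confidences=None):
    """For each keyword, record whether a matching word exists in the
    recognized text and, when available, its recognition confidence."""
    word_confidences = word_confidences or {}
    matching_result = {}
    for kw in keywords:
        hit = kw in recognized_text
        matching_result[kw] = {
            "hit": hit,
            "confidence": word_confidences.get(kw) if hit else None,
        }
    return matching_result

# Example: two of three keywords are found in the recognized text.
print(match_keywords(["apple", "tree", "river"],
                     "the apple fell from the tree",
                     {"apple": 0.93, "tree": 0.88}))
```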
The process in S1 of recognizing the speech to be evaluated and obtaining the recognized text information may include:
S11: extracting the acoustic features of the speech to be evaluated.
The acoustic features are used for speech recognition and are typically spectral features of the speech data, such as Mel-Frequency Cepstral Coefficient (MFCC) features or Perceptual Linear Prediction (PLP) features.
In specific extraction, the speech to be evaluated may first be divided into frames, pre-emphasis may then be applied to the framed speech, and finally the spectral features of each frame may be extracted in sequence.
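A sketch of the feature extraction just described, using librosa as one possible toolkit (an assumption, not the patent's requirement); here pre-emphasis is applied to the whole waveform and framing is handled inside the MFCC computation, a common simplification.

```python
import numpy as np
import librosa

def extract_mfcc(wav_path, n_mfcc=13, frame_len=0.025, frame_shift=0.010):
    """Pre-emphasize the waveform, then extract frame-level MFCC features."""
    signal, sr = librosa.load(wav_path, sr=16000)       # speech commonly at 16 kHz
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(frame_len * sr),                      # framing: window length
        hop_length=int(frame_shift * sr))               # framing: frame shift
    return mfcc.T                                        # shape: (frames, n_mfcc)
```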
S12: inputting the acoustic features into a preset first acoustic recognition model to obtain the text information, output by the first acoustic recognition model, corresponding to the speech to be evaluated.
The first acoustic recognition model may be an acoustic recognition model in a neural network form obtained by training using a training corpus.
Several alternative configurations of the first acoustic recognition model are provided in this embodiment and are described in turn below.
First, the first acoustic recognition model may be a general acoustic recognition model, i.e., one trained using an existing training corpus.
It should be noted that although the general acoustic recognition model can perform acoustic recognition, its training corpus may not cover all spoken-test scenes; because spoken-test scenes vary greatly and pronunciations differ greatly across regions, the recognition accuracy of the general model in spoken-test scenes is reduced.
On this basis, this embodiment first performs one pass of recognition on the speech to be evaluated with the general acoustic recognition model to obtain a first-pass recognition result. The first-pass result and the speech to be evaluated can then be used as training data to adapt the general acoustic recognition model, and the adapted model is used as the first acoustic recognition model.
Optionally, when adapting the general acoustic recognition model, only the first-pass results whose recognition confidence exceeds a set threshold may be selected and combined with the corresponding speech to be evaluated as training data.
Further, the present embodiment introduces a first acoustic recognition model of yet another structure.
This embodiment aims to detect, based on the prior keywords, whether speech segments corresponding to the keywords exist in the speech to be evaluated. To further improve keyword detection accuracy, an acoustic recognition model with a new decoding space is designed in this embodiment. Unlike the existing decoding space formed by all words in the dictionary, the new decoding space is formed by the keywords and a filter, and the filter absorbs all non-keywords other than the keywords.
For example, keywords include: A. b and C, with N representing the filter, the new decoding space includes A, B, C and N.
The first acoustic recognition model with the new decoding space turns the speech recognition process into active detection of the prior keywords. With this model, keyword recognition accuracy is higher because it is not affected by the uneven distribution of keywords and non-keywords in the training data.
The acoustic recognition model corresponding to the new decoding space is taken as the first acoustic recognition model. When speech recognition is performed with it, the recognition result contains only keywords and the filter; the influence of all non-keywords is filtered out by the filter. Further, during recognition the model judges whether the recognition confidence of a keyword exceeds a set confidence threshold: if so, the segment is recognized as the corresponding keyword; if not, it is recognized as the filter.
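The confidence-threshold decision can be sketched as below; the keyword set {A, B, C} and the filter symbol N follow the example above, while the function shape itself is a hypothetical illustration.

```python
def decode_token(keyword_scores, threshold, filler="N"):
    """Emit a keyword only if its recognition confidence exceeds the set
    threshold; otherwise the segment is absorbed by the filter."""
    keyword, confidence = max(keyword_scores.items(), key=lambda p: p[1])
    return keyword if confidence > threshold else filler

# Example with keywords A, B, C and filter N:
print(decode_token({"A": 0.81, "B": 0.10, "C": 0.05}, threshold=0.6))  # -> "A"
print(decode_token({"A": 0.40, "B": 0.35, "C": 0.20}, threshold=0.6))  # -> "N"
```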
Further, since the acoustic pronunciations of different words or phrases differ, a single keyword confidence threshold does not suit all questions, so the application provides an adaptive method for the keyword confidence threshold. An alarm set and a recall set can be constructed from speech to be evaluated that has been scored manually: the alarm set contains low-scoring speech, and the recall set contains high-scoring speech. The confidence threshold of each keyword serving as an evaluation criterion is then adjusted based on the alarm set and the recall set.
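The application does not spell out the adjustment rule, so the sketch below assumes one plausible criterion: per keyword, pick the threshold that keeps the most hits in the recall set while keeping the fewest in the alarm set.

```python
def adapt_threshold(alarm_confidences, recall_confidences, candidates):
    """alarm_confidences / recall_confidences: confidences of one keyword's
    hits in the low-scoring (alarm) and high-scoring (recall) answer sets."""
    best_threshold, best_gap = None, float("-inf")
    for t in candidates:
        false_alarms = sum(c >= t for c in alarm_confidences) / max(len(alarm_confidences), 1)
        recall = sum(c >= t for c in recall_confidences) / max(len(recall_confidences), 1)
        if recall - false_alarms > best_gap:
            best_threshold, best_gap = t, recall - false_alarms
    return best_threshold
```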
In another embodiment of the present application, another implementation of step S110, detecting whether a speech segment corresponding to the keyword exists in the speech to be evaluated, is introduced. On the basis of the foregoing S1-S2, a process of keyword classification of the speech to be evaluated by a keyword classifier can be added.
In this embodiment, two forms of keyword classifiers are introduced, which are introduced separately.
A first keyword classifier:
Following the foregoing S11-S12, the acoustic features of the speech to be evaluated are input into the first acoustic recognition model, which outputs the corresponding text information. In this embodiment, the hidden layer average acoustic features, output by a hidden layer of the first acoustic recognition model after converting the acoustic features, may also be obtained.
The hidden layer average acoustic feature is a high-level abstract representation of the input acoustic features: it is the result of averaging the hidden layer acoustic features over all frames of the speech to be evaluated.
Further, the hidden layer average acoustic features and the word vector features of the keywords are input into a preset first keyword classifier, and classification results of the speech to be evaluated, which are output by the first keyword classifier, on the speech segments corresponding to the keywords and the non-keywords are obtained.
Specifically, the classification result obtained in the present embodiment may be used as the detection result in step S110.
The first keyword classifier is trained by taking, as training samples, the hidden layer average acoustic features obtained after the acoustic features of speech training data are converted by a hidden layer of the first acoustic recognition model, together with the word vector features of the keywords, and by taking, as sample labels, the classification labeling results of the speech training data with respect to the speech segments corresponding to the keywords and non-keywords.
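A sketch of such a classifier as a small feed-forward network over the concatenated utterance-level hidden feature and keyword word vector; PyTorch, the layer sizes, and the two-class output are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class FirstKeywordClassifier(nn.Module):
    """Classifies whether the speech contains the segment for a given keyword."""
    def __init__(self, hidden_dim, word_vec_dim, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + word_vec_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes))   # keyword segment present / absent

    def forward(self, frame_hidden_feats, keyword_vec):
        # frame_hidden_feats: (frames, hidden_dim) from a hidden layer of the
        # first acoustic recognition model; averaging over frames yields the
        # hidden layer average acoustic feature.
        avg_feat = frame_hidden_feats.mean(dim=0)
        return self.net(torch.cat([avg_feat, keyword_vec], dim=-1))

# Example forward pass with random stand-in features.
clf = FirstKeywordClassifier(hidden_dim=256, word_vec_dim=100)
logits = clf(torch.randn(300, 256), torch.randn(100))
```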
A second keyword classifier:
In this embodiment, the speech to be evaluated may be windowed to obtain at least one windowed speech to be evaluated. The window length may be a first set number of frames, such as 40 frames, and the window shift step may be a second set number of frames, such as 5 frames. Corresponding windowed acoustic features can be extracted for each windowed speech to be evaluated. Then, the hidden layer average windowing acoustic features, output by a hidden layer of the first acoustic recognition model after converting the windowed acoustic features of each windowed speech, are obtained.
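The windowing itself is straightforward; a sketch with the 40-frame window and 5-frame shift from the example above:

```python
def window_features(frame_features, win_len=40, hop=5):
    """Slide a fixed-length window over frame-level features, yielding one
    windowed feature block per window position."""
    n = len(frame_features)
    if n <= win_len:
        return [frame_features]            # at least one windowed segment
    return [frame_features[s:s + win_len]
            for s in range(0, n - win_len + 1, hop)]
```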
Further, inputting each hidden layer average windowing acoustic feature into a preset second keyword classifier to obtain a classification result of each windowed to-be-evaluated voice output by the second keyword classifier on the voice segments corresponding to the keywords and the non-keywords.
Specifically, the classification result obtained in the present embodiment may be used as the detection result in step S110.
The second keyword classifier is a keyword classifier trained with the hidden layer average acoustic features of keywords and of non-keywords, obtained after the acoustic features of the speech segments corresponding to keywords and non-keywords in speech training data are converted by a hidden layer of the first acoustic recognition model.
Specifically, when training the second keyword classifier, the first acoustic recognition model may be used to recognize the speech training data, and the speech segments corresponding to keywords and those corresponding to non-keywords are determined from the recognition result. The second keyword classifier is then trained with the hidden layer average features of the keyword segments and of the non-keyword segments after hidden layer conversion by the first acoustic recognition model.
Referring to fig. 2, a schematic diagram of extracting hidden layer average features of keywords and non-keywords of a speech sample to be evaluated is shown.
One or both of the two keyword classifiers in the above example may be used, and the classification result obtained by the keyword classifier is used as the detection result in step S110.
In this embodiment, a hidden layer feature of the first acoustic recognition model is added as an input feature of the keyword classifier, and the keyword classifier outputs the classification result of the speech to be evaluated over the speech segments corresponding to keywords and non-keywords. This classification result, together with the S1-S2 matching result between the keywords and the text information corresponding to the speech to be evaluated, can serve as the detection result for determining whether speech segments corresponding to the keywords exist in the speech to be evaluated.
In another embodiment of the present application, the aforementioned step S120 is introduced, and a process of determining an evaluation result of the speech to be evaluated according to the detection result is provided.
Based on the detection results determined in the foregoing embodiments, the evaluation result of the speech to be evaluated can be determined. The process can include two links, described below.
the first link is as follows:
and determining an evaluation characteristic according to the detection result.
In this embodiment, various types of evaluation features are introduced, which are respectively:
1) Hit keywords:
A hit keyword is a keyword whose corresponding speech segment exists in the speech to be evaluated. From the determined detection result, it can be determined which keywords' corresponding speech segments the speech to be evaluated contains.
When hit keywords are used as an evaluation feature, they can be expressed as a one-hot style vector: an N-dimensional vector, where N is the number of keywords and each element position corresponds to a unique keyword. Each element takes one of two values, a first value indicating that the keyword is hit and a second value indicating that it is not; the first value may be 1 and the second value 0.
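For example, for keywords [A, B, C] where only A and C were detected, the feature would be the vector [1, 0, 1]; a sketch:

```python
def hit_keyword_vector(all_keywords, hit_keywords):
    """N-dimensional vector: 1 where the keyword was hit, 0 otherwise."""
    hits = set(hit_keywords)
    return [1 if kw in hits else 0 for kw in all_keywords]

print(hit_keyword_vector(["A", "B", "C"], ["A", "C"]))   # -> [1, 0, 1]
```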
2) Confidence of hit keyword:
the confidence of the hit keyword may be a recognition confidence of the first acoustic recognition model for the hit keyword, or may be a classification confidence of the first keyword classifier or the second keyword classifier for the hit keyword.
3) Keyword hit rate:
the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords.
4) Gaussian duration of hit keyword:
and the Gaussian duration of the hit keywords is determined by the pronunciation duration of the hit keywords in the speech to be evaluated. The gaussian duration of the hit keyword can be used as a measure of the pronunciation characteristics of the hit keyword by the examinee.
Specifically, the voice segment corresponding to which keywords are included in the voice to be evaluated and the position of the voice segment corresponding to the keywords can be determined according to the detection result. According to the pronunciation duration of the speech segment corresponding to the keyword in the speech to be evaluated, the Gaussian duration of the keyword can be determined.
The Gaussian duration assumes that the pronunciation duration of each syllable obeys normal distribution, the Gaussian duration hitting keywords can describe the pronunciation characteristics of the keywords of the examinee, and the extraction method comprises the following steps: first, a pronunciation duration mean and variance distribution table of each hit keyword or keyword component (such as syllable, phoneme, etc.) is constructed. The hit keyword component is taken as an example of syllable:
the Gaussian duration score of each syllable in the hit keyword can be calculated based on the constructed syllable pronunciation duration mean value and variance distribution table, the Gaussian duration score of all the syllables is averaged to be used as the Gaussian duration score of the hit keyword, and the calculation formula refers to the following steps:
wherein, wgaussIs the Gaussian duration of the hit keyword, K is the number of syllables of the hit keyword, phgauss(k) Is the Gaussian duration of the kth syllable, mukAnd σkMean and variance of pronunciation duration, x, of the kth syllable, respectivelykThe pronunciation duration of the kth syllable which hits the keyword in the speech to be evaluated.
The table of syllable pronunciation duration means and variances can be a general table constructed from a large amount of spoken-test data, or an adaptive table constructed from the current spoken-test data.
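A sketch of the Gaussian duration computation under the reconstructed formula above; representing the table as per-syllable (mean, variance) pairs is an assumed data layout.

```python
import math

def gaussian_duration(syllable_durations, duration_table):
    """Average the per-syllable Gaussian duration scores of one hit keyword.
    syllable_durations: observed pronunciation duration x_k of each syllable;
    duration_table: list of (mu_k, var_k) pairs from the statistics table."""
    scores = [math.exp(-(x - mu) ** 2 / (2 * var))
              for x, (mu, var) in zip(syllable_durations, duration_table)]
    return sum(scores) / len(scores)

# Example: a two-syllable hit keyword.
print(gaussian_duration([0.21, 0.34], [(0.20, 0.004), (0.30, 0.009)]))
```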
The second link is as follows:
and determining the evaluation result of the speech to be evaluated according to the evaluation characteristic.
Specifically, a plurality of evaluation features are determined in the first link, one or more combinations of the evaluation features can be selected, and an evaluation result of the speech to be evaluated is determined based on the selected evaluation features.
In this embodiment, the evaluation result of the speech to be evaluated may be determined based on the evaluation features and the pre-trained scoring regression model.
The scoring regression model may be linear regression, gaussian regression, neural network regression, or the like.
During training, the evaluation features of the speech training data can be used as training samples, and the annotated evaluation results of the speech training data as sample labels.
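A sketch of the scoring regression, assuming each utterance's evaluation features are already assembled into a fixed-length vector and human scores exist for the training data; linear regression stands in here for any of the regression types named above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def train_scoring_model(train_feature_vectors, human_scores):
    """Fit the scoring regression on (evaluation features, annotated score) pairs."""
    X = np.array(train_feature_vectors)    # (n_utterances, n_features)
    y = np.array(human_scores)
    return LinearRegression().fit(X, y)

def score(model, eval_feature_vector):
    """Predict the evaluation result for one utterance's feature vector."""
    return float(model.predict(np.array([eval_feature_vector]))[0])
```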
The following describes the speech evaluation device provided in the embodiment of the present application, and the speech evaluation device described below and the speech evaluation method described above may be referred to correspondingly.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a speech evaluation device disclosed in the embodiment of the present application. As shown in fig. 3, the apparatus may include:
the data acquisition unit 11 is used for acquiring a speech to be evaluated for a target question and the keywords serving as the evaluation standard;
the voice detection unit 12 is configured to detect whether a voice segment corresponding to the keyword exists in the voice to be evaluated, so as to obtain a detection result;
and an evaluation result determining unit 13, configured to determine an evaluation result of the speech to be evaluated according to the detection result.
Optionally, the voice detection unit may include:
the text recognition unit is used for recognizing the speech to be evaluated to obtain recognized text information;
and the text matching unit is used for matching the keywords with the text information to obtain a matching result, and the matching result shows the inclusion condition of the speech to be evaluated on the speech segments corresponding to the keywords.
Optionally, the text recognition unit may include:
the acoustic feature extraction unit is used for extracting the acoustic features of the speech to be evaluated;
and the first acoustic recognition model prediction unit is used for inputting the acoustic features into a preset first acoustic recognition model to obtain text information corresponding to the speech to be evaluated and output by the first acoustic recognition model.
Optionally, the first acoustic recognition model may be a general acoustic recognition model, or an acoustic recognition model obtained by adapting the general acoustic recognition model with the result of recognizing the speech to be evaluated using the general acoustic recognition model.
Optionally, the first acoustic recognition model may be an acoustic recognition model corresponding to a decoding space formed by the keywords and the filter, and the filter represents all the non-keywords.
Optionally, the voice detection unit may further include:
a global hidden layer feature obtaining unit, configured to obtain hidden layer average acoustic features output by a hidden layer of the first acoustic recognition model and obtained after the acoustic features are converted;
the first keyword classifier prediction unit is used for inputting the hidden layer average acoustic features and the word vector features of the keywords into a preset first keyword classifier to obtain the classification results of the speech to be evaluated, which are output by the first keyword classifier, on the speech segments corresponding to the keywords and the non-keywords;
the first keyword classifier is trained by taking, as training samples, the hidden layer average acoustic features obtained after the acoustic features of speech training data are converted by a hidden layer of the first acoustic recognition model, together with the word vector features of the keywords, and by taking, as sample labels, the classification labeling results of the speech training data with respect to the speech segments corresponding to the keywords and non-keywords.
Optionally, the voice detection unit may further include:
the voice windowing unit is used for windowing the voice to be evaluated to obtain at least one windowed voice to be evaluated;
a windowing hidden layer feature obtaining unit, configured to obtain, for the windowed acoustic features of each windowed speech to be evaluated, the hidden layer average windowing acoustic features output after conversion by a hidden layer of the first acoustic recognition model;
the second keyword classifier prediction unit is used for inputting each hidden layer average windowing acoustic feature into a preset second keyword classifier to obtain a classification result of each windowed speech to be evaluated, which is output by the second keyword classifier, on the speech segments corresponding to the keywords and the non-keywords;
the second keyword classifier is a keyword classifier trained with the hidden layer average acoustic features of keywords and of non-keywords, obtained after the acoustic features of the speech segments corresponding to the keywords and the non-keywords in speech training data are converted by a hidden layer of the first acoustic recognition model.
Optionally, the present application illustrates two optional structures of the evaluation result determining unit, which are respectively introduced as follows:
first, the evaluation result determination unit may include:
the first evaluation characteristic determining unit is used for determining evaluation characteristics according to the detection result, wherein the evaluation characteristics comprise any one or more of hit keywords, the confidence of the hit keywords, the keyword hit rate and the Gaussian duration of the hit keywords;
the hit keywords are the keywords whose corresponding speech segments exist in the speech to be evaluated; the confidence of a hit keyword is the recognition confidence of the first acoustic recognition model for the hit keyword; the keyword hit rate is the proportion of the number of hit keywords to the total number of keywords; the Gaussian duration is determined by the pronunciation duration of the hit keywords in the speech to be evaluated;
and the first evaluation characteristic processing unit is used for determining an evaluation result of the speech to be evaluated according to the evaluation characteristic.
Second, the evaluation result determination unit may include:
the second evaluation characteristic determining unit is used for determining evaluation characteristics according to the detection result, wherein the evaluation characteristics comprise any one or more of hit keywords, the confidence of the hit keywords, the keyword hit rate and the Gaussian duration of the hit keywords;
the hit keywords are the keywords whose corresponding speech segments exist in the speech to be evaluated; the confidence of a hit keyword is the classification confidence of the first keyword classifier for the hit keyword, or the classification confidence of the second keyword classifier for the hit keyword; the keyword hit rate is the proportion of the number of hit keywords to the total number of keywords; the Gaussian duration is determined by the pronunciation duration of the hit keywords in the speech to be evaluated;
and the second evaluation characteristic processing unit is used for determining an evaluation result of the speech to be evaluated according to the evaluation characteristic.
The voice evaluation device provided by the embodiment of the application can be applied to voice evaluation equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Optionally, fig. 4 shows a block diagram of a hardware structure of the speech evaluation device, and referring to fig. 4, the hardware structure of the speech evaluation device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), or an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one disk memory;
the memory stores a program, and the processor can call the program stored in the memory, the program being used to:
acquiring a speech to be evaluated for a target question and keywords serving as the evaluation standard;
detecting whether a voice segment corresponding to the keyword exists in the voice to be evaluated or not to obtain a detection result;
and determining the evaluation result of the speech to be evaluated according to the detection result.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring a speech to be evaluated for a target question and keywords serving as the evaluation standard;
detecting whether a voice segment corresponding to the keyword exists in the voice to be evaluated or not to obtain a detection result;
and determining the evaluation result of the speech to be evaluated according to the detection result.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (14)
1. A speech evaluation method, comprising:
acquiring a voice to be evaluated and a keyword serving as an evaluation standard;
extracting acoustic features of the speech to be evaluated;
inputting the acoustic features into a preset first acoustic recognition model to obtain text information corresponding to the speech to be evaluated and output by the first acoustic recognition model;
performing keyword classification on the voice to be evaluated through a first keyword classifier to obtain a classification result, and using the classification result as a detection result for detecting whether the voice to be evaluated contains a voice segment corresponding to the keyword; the first keyword classifier is obtained by taking hidden layer average acoustic features of voice training data after hidden layer conversion of a first acoustic recognition model and word vector features of the keywords as training samples and taking classification labeling results of the voice training data on the voice segments corresponding to the keywords and non-keywords as sample labels for training;
and determining the evaluation result of the speech to be evaluated according to the detection result.
2. The method of claim 1, further comprising:
matching the keywords with the text information to obtain a matching result, and using the matching result as a detection result for detecting whether the speech to be evaluated contains a speech segment corresponding to the keywords; and the matching result shows the inclusion condition of the speech to be evaluated on the speech segment corresponding to the keyword.
3. The method according to claim 1, wherein the first acoustic recognition model is a general acoustic recognition model, or an acoustic recognition model obtained by adapting the general acoustic recognition model with the result of recognizing the speech to be evaluated using the general acoustic recognition model.
4. The method of claim 1, wherein the first acoustic recognition model is an acoustic recognition model corresponding to a decoding space formed by the keywords and a filter, and wherein the filter characterizes all non-keywords.
5. The method according to claim 1, wherein the performing keyword classification on the speech to be evaluated through a first keyword classifier to obtain a classification result comprises:
obtaining hidden layer average acoustic features output by a hidden layer of the first acoustic recognition model and converted from the acoustic features;
and inputting the hidden layer average acoustic features and the word vector features of the keywords into a preset first keyword classifier to obtain the classification result of the speech to be evaluated, which is output by the first keyword classifier, on the speech segments corresponding to the keywords and the non-keywords.
6. The method according to claim 2, wherein the determining an evaluation result of the speech to be evaluated according to the detection result comprises:
determining evaluation characteristics according to the detection result, wherein the evaluation characteristics comprise any one or more of hit keywords, confidence of the hit keywords, keyword hit rate and Gaussian duration of the hit keywords;
the hit keywords are the keywords whose corresponding speech segments exist in the speech to be evaluated; the confidence of a hit keyword is the recognition confidence of the first acoustic recognition model for the hit keyword; the keyword hit rate is the proportion of the number of hit keywords to the total number of keywords; the Gaussian duration is determined by the pronunciation duration of the hit keywords in the speech to be evaluated;
and determining the evaluation result of the speech to be evaluated according to the evaluation characteristic.
7. The method according to claim 1, wherein the determining an evaluation result of the speech to be evaluated according to the detection result comprises:
determining evaluation characteristics according to the detection result, wherein the evaluation characteristics comprise any one or more of hit keywords, confidence of the hit keywords, keyword hit rate and Gaussian duration of the hit keywords;
the hit keywords are the keywords whose corresponding speech segments exist in the speech to be evaluated; the confidence of a hit keyword is the classification confidence of the first keyword classifier for the hit keyword; the keyword hit rate is the proportion of the number of hit keywords to the total number of keywords; the Gaussian duration is determined by the pronunciation duration of the hit keywords in the speech to be evaluated;
and determining the evaluation result of the speech to be evaluated according to the evaluation characteristic.
8. A speech evaluation apparatus, comprising:
a data acquisition unit, configured to acquire the speech to be evaluated and the keywords serving as the evaluation standard;
a speech detection unit, configured to extract acoustic features of the speech to be evaluated and input the acoustic features into a preset first acoustic recognition model, obtaining text information corresponding to the speech to be evaluated as output by the first acoustic recognition model; and to perform keyword classification on the speech to be evaluated through a first keyword classifier, the resulting classification result serving as the detection result of whether the speech to be evaluated contains speech segments corresponding to the keywords; wherein the first keyword classifier is trained by taking, as training samples, the hidden-layer average acoustic features of speech training data after conversion by a hidden layer of the first acoustic recognition model together with the word-vector features of the keywords, and taking, as sample labels, the keyword/non-keyword classification annotations of the speech training data (a minimal training sketch follows this claim);
and an evaluation result determining unit, configured to determine the evaluation result of the speech to be evaluated according to the detection result.
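As promised above, here is a minimal training sketch for the first keyword classifier, assuming plain logistic regression on synthetic, linearly separable data; neither the classifier family nor the optimizer is prescribed by the claim, and all names and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
H, D, N = 256, 100, 1000   # hidden size, word-vector size, number of samples

# Training samples: hidden-layer average acoustic features of the speech
# training data, concatenated with the paired keyword's word-vector feature.
X = rng.normal(size=(N, H + D))
# Sample labels: 1 if the utterance contains a speech segment for the paired
# keyword, 0 otherwise (synthetic separable labels, purely for this demo).
w_true = rng.normal(size=H + D)
y = (X @ w_true > 0).astype(float)

# Logistic-regression training loop (batch gradient descent).
w, b = np.zeros(H + D), 0.0
lr, epochs = 0.5, 200
for _ in range(epochs):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(keyword)
    w -= lr * X.T @ (p - y) / N              # binary cross-entropy gradients
    b -= lr * float(np.mean(p - y))

accuracy = float(np.mean((p > 0.5) == (y == 1.0)))
print(f"training accuracy: {accuracy:.2f}")
```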
9. The apparatus of claim 8, wherein the speech detection unit is further configured to:
match the keywords against the text information to obtain a matching result, and use the matching result as the detection result of whether the speech to be evaluated contains speech segments corresponding to the keywords, the matching result indicating which keyword speech segments the speech to be evaluated contains.
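A minimal sketch of this text-matching detection, assuming exact substring matching against the recognized text; the claim itself does not fix the matching strategy, so this is one possible reading, not the patented method.

```python
def match_keywords(recognized_text: str, keywords: list) -> dict:
    """Return, per keyword, whether the recognized text contains it."""
    text = recognized_text.lower()
    return {kw: kw.lower() in text for kw in keywords}

result = match_keywords(
    "plants use chlorophyll to drive photosynthesis",
    ["photosynthesis", "chlorophyll", "mitochondria"])
print(result)  # {'photosynthesis': True, 'chlorophyll': True, 'mitochondria': False}
```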
10. The apparatus of claim 8, wherein the speech detection unit comprises:
a global hidden-layer feature obtaining unit, configured to obtain the hidden-layer average acoustic features, i.e., the average of the hidden-layer outputs into which the first acoustic recognition model converts the acoustic features;
and a first keyword classifier prediction unit, configured to input the hidden-layer average acoustic features and the word-vector features of the keywords into a preset first keyword classifier, obtaining the classification result, output by the first keyword classifier, indicating whether the speech to be evaluated contains speech segments corresponding to the keywords or to non-keywords.
11. The apparatus according to claim 9, wherein the evaluation result determining unit comprises:
a first evaluation feature determining unit, configured to determine evaluation features according to the detection result, wherein the evaluation features comprise any one or more of: the hit keywords, the confidence of the hit keywords, the keyword hit rate, and the Gaussian duration of the hit keywords;
wherein the hit keywords are the keywords whose corresponding speech segments are present in the speech to be evaluated; the confidence of a hit keyword is the recognition confidence of the first acoustic recognition model for that hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
and a first evaluation feature processing unit, configured to determine the evaluation result of the speech to be evaluated according to the evaluation features.
12. The apparatus according to claim 8, wherein the evaluation result determining unit comprises:
a second evaluation feature determining unit, configured to determine evaluation features according to the detection result, wherein the evaluation features comprise any one or more of: the hit keywords, the confidence of the hit keywords, the keyword hit rate, and the Gaussian duration of the hit keywords;
wherein the hit keywords are the keywords whose corresponding speech segments are present in the speech to be evaluated; the confidence of a hit keyword is the classification confidence of the first keyword classifier for that hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
and a second evaluation feature processing unit, configured to determine the evaluation result of the speech to be evaluated according to the evaluation features.
13. A speech evaluation device, comprising a memory and a processor;
wherein the memory is configured to store a program;
and the processor is configured to execute the program to implement the steps of the speech evaluation method according to any one of claims 1 to 7.
14. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech evaluation method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811073869.3A | 2018-09-14 | 2018-09-14 | Voice evaluation method, device and equipment and readable storage medium
Publications (2)
Publication Number | Publication Date |
---|---|
CN109192224A (en) | 2019-01-11
CN109192224B (en) | 2021-08-17
Family ID: 64910988
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811073869.3A | Voice evaluation method, device and equipment and readable storage medium | 2018-09-14 | 2018-09-14
Country Status (1)
Country | Link |
---|---|
CN | CN109192224B (en)
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN109215632B (en) * | 2018-09-30 | 2021-10-08 | iFLYTEK Co., Ltd. | Voice evaluation method, device and equipment and readable storage medium
CN109887498A (en) * | 2019-03-11 | 2019-06-14 | Xidian University | Scoring method for polite expressions at highway entrances
CN109979482B (en) * | 2019-05-21 | 2021-12-07 | iFLYTEK Co., Ltd. | Audio evaluation method and device
CN110322871A (en) * | 2019-05-30 | 2019-10-11 | Tsinghua University | Sample keyword retrieval method based on acoustic representation vectors
CN113793589A (en) * | 2020-05-26 | 2021-12-14 | Huawei Technologies Co., Ltd. | Speech synthesis method and device
CN111833853B (en) * | 2020-07-01 | 2023-10-27 | Tencent Technology (Shenzhen) Co., Ltd. | Voice processing method and device, electronic equipment and computer readable storage medium
CN112289308A (en) * | 2020-10-23 | 2021-01-29 | Shanghai Kaishi Information Technology Co., Ltd. | Voice dictation scoring method and device and electronic equipment
CN113658586B (en) * | 2021-08-13 | 2024-04-09 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Training method of voice recognition model, voice interaction method and device
CN115171664A (en) * | 2022-07-05 | 2022-10-11 | Xiaomi Automobile Technology Co., Ltd. | Voice wake-up method and device, intelligent voice equipment, vehicle and storage medium
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104143328A (en) * | 2013-08-15 | 2014-11-12 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for detecting keywords
CN103559881A (en) * | 2013-11-08 | 2014-02-05 | Anhui iFLYTEK Information Technology Co., Ltd. | Language-independent keyword recognition method and system
CN104810017A (en) * | 2015-04-08 | 2015-07-29 | Guangdong University of Foreign Studies | Semantic analysis-based oral language evaluating method and system
CN106856092A (en) * | 2015-12-09 | 2017-06-16 | Institute of Acoustics, Chinese Academy of Sciences | Chinese speech keyword retrieval method based on a feedforward neural network language model
CN105741831A (en) * | 2016-01-27 | 2016-07-06 | Guangdong University of Foreign Studies | Spoken language evaluation method and system based on grammatical analysis
JP2018081294A (en) * | 2016-11-10 | 2018-05-24 | Nippon Telegraph and Telephone Corp. | Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program
CN107230475A (en) * | 2017-05-27 | 2017-10-03 | Tencent Technology (Shenzhen) Co., Ltd. | Voice keyword recognition method, device, terminal and server
CN108052504A (en) * | 2017-12-26 | 2018-05-18 | iFLYTEK Co., Ltd. | Structural analysis method and system for answer results of subjective mathematics questions
Similar Documents
Publication | Title
---|---
CN109192224B (en) | Voice evaluation method, device and equipment and readable storage medium
CN109545243B (en) | Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium
CN107221318B (en) | English spoken language pronunciation scoring method and system
CN103971678B (en) | Keyword spotting method and apparatus
CN105374352B (en) | Voice activation method and system
US8352263B2 (en) | Method for speech recognition on all languages and for inputting words using speech recognition
US6618702B1 (en) | Method of and device for phone-based speaker recognition
US5621857A (en) | Method and system for identifying and recognizing speech
KR101609473B1 (en) | System and method for automatic fluency evaluation of English speaking tests
CN101751919B (en) | A method for automatic detection of accent in spoken Chinese
CN108428382A (en) | Spoken repetition scoring method and system
US8160866B2 (en) | Speech recognition method for both English and Chinese
Levitan et al. | Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection
CN112614510B (en) | Audio quality assessment method and device
CN109300339A (en) | Practice method and system for spoken English
CN112687291B (en) | Pronunciation defect recognition model training method and pronunciation defect recognition method
Pao et al. | Combining acoustic features for improved emotion recognition in Mandarin speech
CN112967711B (en) | Method, system and storage medium for evaluating spoken pronunciation in less commonly used languages
CN109065024B (en) | Abnormal voice data detection method and device
US8145483B2 (en) | Speech recognition method for all languages without using samples
US20210065684A1 (en) | Information processing apparatus, keyword detecting apparatus, and information processing method
Hanani et al. | Palestinian Arabic regional accent recognition
Yousfi et al. | Holy Qur'an speech recognition system Imaalah checking rule for Warsh recitation
Badenhorst et al. | Quality measurements for mobile data collection in the developing world
Guo et al. | Speaker Verification Using Short Utterances with DNN-Based Estimation of Subglottal Acoustic Features
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant