CN109192224A - Speech evaluation method, apparatus, device and readable storage medium - Google Patents
- Publication number: CN109192224A
- Application number: CN201811073869.3A
- Authority
- CN
- China
- Prior art keywords
- keyword
- voice
- evaluated
- hit
- evaluation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Electrically Operated Instructional Devices (AREA)
- Machine Translation (AREA)
Abstract
This application discloses a speech evaluation method, apparatus, device and readable storage medium. The application obtains speech to be evaluated and a keyword serving as an evaluation criterion, detects whether a speech segment corresponding to the keyword is present in the speech to be evaluated to obtain a detection result, and determines an evaluation result of the speech to be evaluated according to the detection result. By obtaining the keyword serving as the evaluation criterion, the application can automatically detect whether the speech segment corresponding to the keyword is present in the speech to be evaluated and determine the evaluation result from the detection result. Since no manual evaluation is needed, the method both avoids interference of human subjectivity with the evaluation result and reduces labor cost.
Description
Technical field
This application relates to the field of speech processing technology, and more specifically to a speech evaluation method, apparatus, device and readable storage medium.
Background Art
With the continuous deepening of educational reform, speaking tests are being carried out in provinces and cities across the country. A speaking test usually provides a passage of material and sets several questions about it. After reading the material, the examinee gives a spoken answer to each question.
In existing speaking tests, professional teachers mostly evaluate the examinee's answers against the correct-answer information for each question. Such manual evaluation is highly susceptible to human subjectivity, so the evaluation result suffers from human interference, and it also consumes a large amount of labor cost.
Summary of the invention
In view of this, this application provides a speech evaluation method, apparatus, device and readable storage medium, to overcome the disadvantages of the existing manner of manually evaluating speaking tests.
To achieve the above goal, the following schemes are proposed:
A speech evaluation method, comprising:
obtaining speech to be evaluated, and a keyword serving as an evaluation criterion;
detecting whether a speech segment corresponding to the keyword is present in the speech to be evaluated, to obtain a detection result;
determining an evaluation result of the speech to be evaluated according to the detection result.
Preferably, detecting whether a speech segment corresponding to the keyword is present in the speech to be evaluated comprises:
recognizing the speech to be evaluated, to obtain recognized text information;
matching the keyword with the text information to obtain a matching result, the matching result indicating whether the speech to be evaluated contains the speech segment corresponding to the keyword.
Preferably, recognizing the speech to be evaluated to obtain the recognized text information comprises:
extracting an acoustic feature of the speech to be evaluated;
inputting the acoustic feature into a preset first acoustic recognition model, to obtain the text information corresponding to the speech to be evaluated output by the first acoustic recognition model.
Preferably, the first acoustic recognition model is a general acoustic recognition model, or an acoustic recognition model obtained by adapting the general acoustic recognition model using the recognition result of the general acoustic recognition model on the speech to be evaluated.
Preferably, the first acoustic recognition model is an acoustic recognition model corresponding to a decoding space formed by the keyword and a filler, the filler representing all non-keyword words.
Preferably, detecting whether a speech segment corresponding to the keyword is present in the speech to be evaluated further comprises:
obtaining the hidden-layer output of the first acoustic recognition model, namely the hidden-layer averaged acoustic feature converted from the acoustic feature;
inputting the hidden-layer averaged acoustic feature and the word-vector feature of the keyword into a preset first keyword classifier, to obtain a classification result, output by the first keyword classifier, of whether the speech to be evaluated contains speech segments corresponding to the keyword or to non-keywords;
the first keyword classifier being trained with, as training samples, hidden-layer averaged acoustic features obtained by converting acoustic features of speech training data through the hidden layer of the first acoustic recognition model, together with the word-vector feature of the keyword, and with, as sample labels, classification annotations of whether the speech training data contains speech segments corresponding to the keyword or to non-keywords.
Preferably, detecting whether a speech segment corresponding to the keyword is present in the speech to be evaluated further comprises:
windowing the speech to be evaluated, to obtain at least one windowed speech segment to be evaluated;
obtaining, from the hidden layer of the first acoustic recognition model, the hidden-layer averaged windowed acoustic feature converted from the windowed acoustic feature of each windowed speech segment;
inputting each hidden-layer averaged windowed acoustic feature into a preset second keyword classifier, to obtain a classification result, output by the second keyword classifier, of whether each windowed speech segment corresponds to the keyword or to a non-keyword;
the second keyword classifier being trained with keyword hidden-layer averaged acoustic features and non-keyword hidden-layer averaged features obtained by converting, through the hidden layer of the first acoustic recognition model, the acoustic features of the speech segments corresponding to keywords and to non-keywords in speech training data.
Preferably, determining the evaluation result of the speech to be evaluated according to the detection result comprises:
determining an evaluation feature according to the detection result, the evaluation feature including any one or more of hit keywords, the confidence of a hit keyword, the keyword hit rate, and the Gaussian duration of a hit keyword;
wherein a hit keyword is a keyword whose corresponding speech segment is present in the speech to be evaluated; the confidence of a hit keyword is the recognition confidence of the first acoustic recognition model for the hit keyword, or the classification confidence of the first keyword classifier for the hit keyword, or the classification confidence of the second keyword classifier for the hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
determining the evaluation result of the speech to be evaluated according to the evaluation feature.
A speech evaluation apparatus, comprising:
a data acquisition unit, configured to obtain speech to be evaluated, and a keyword serving as an evaluation criterion;
a speech detection unit, configured to detect whether a speech segment corresponding to the keyword is present in the speech to be evaluated, to obtain a detection result;
an evaluation result determination unit, configured to determine an evaluation result of the speech to be evaluated according to the detection result.
Preferably, the speech detection unit includes:
a text recognition unit, configured to recognize the speech to be evaluated, to obtain recognized text information;
a text matching unit, configured to match the keyword with the text information to obtain a matching result, the matching result indicating whether the speech to be evaluated contains the speech segment corresponding to the keyword.
Preferably, the text recognition unit includes:
an acoustic feature extraction unit, configured to extract an acoustic feature of the speech to be evaluated;
a first acoustic recognition model prediction unit, configured to input the acoustic feature into a preset first acoustic recognition model, to obtain the text information corresponding to the speech to be evaluated output by the first acoustic recognition model.
Preferably, the speech detection unit further includes:
a global hidden-layer feature acquisition unit, configured to obtain the hidden-layer output of the first acoustic recognition model, namely the hidden-layer averaged acoustic feature converted from the acoustic feature;
a first keyword classifier prediction unit, configured to input the hidden-layer averaged acoustic feature and the word-vector feature of the keyword into a preset first keyword classifier, to obtain a classification result, output by the first keyword classifier, of whether the speech to be evaluated contains speech segments corresponding to the keyword or to non-keywords;
the first keyword classifier being trained with, as training samples, hidden-layer averaged acoustic features obtained by converting acoustic features of speech training data through the hidden layer of the first acoustic recognition model, together with the word-vector feature of the keyword, and with, as sample labels, classification annotations of whether the speech training data contains speech segments corresponding to the keyword or to non-keywords.
Preferably, the speech detection unit further includes:
a speech windowing unit, configured to window the speech to be evaluated, to obtain at least one windowed speech segment to be evaluated;
a windowed hidden-layer feature acquisition unit, configured to obtain, from the hidden layer of the first acoustic recognition model, the hidden-layer averaged windowed acoustic feature converted from the windowed acoustic feature of each windowed speech segment;
a second keyword classifier prediction unit, configured to input each hidden-layer averaged windowed acoustic feature into a preset second keyword classifier, to obtain a classification result, output by the second keyword classifier, of whether each windowed speech segment corresponds to the keyword or to a non-keyword;
the second keyword classifier being trained with keyword hidden-layer averaged acoustic features and non-keyword hidden-layer averaged features obtained by converting, through the hidden layer of the first acoustic recognition model, the acoustic features of the speech segments corresponding to keywords and to non-keywords in speech training data.
Preferably, the evaluation result determination unit includes:
a first evaluation feature determination unit, configured to determine an evaluation feature according to the detection result, the evaluation feature including any one or more of hit keywords, the confidence of a hit keyword, the keyword hit rate, and the Gaussian duration of a hit keyword;
wherein a hit keyword is a keyword whose corresponding speech segment is present in the speech to be evaluated; the confidence of a hit keyword is the recognition confidence of the first acoustic recognition model for the hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
a first evaluation feature processing unit, configured to determine the evaluation result of the speech to be evaluated according to the evaluation feature.
Preferably, the evaluation result determination unit includes:
a second evaluation feature determination unit, configured to determine an evaluation feature according to the detection result, the evaluation feature including any one or more of hit keywords, the confidence of a hit keyword, the keyword hit rate, and the Gaussian duration of a hit keyword;
wherein a hit keyword is a keyword whose corresponding speech segment is present in the speech to be evaluated; the confidence of a hit keyword is the classification confidence of the first keyword classifier for the hit keyword, or the classification confidence of the second keyword classifier for the hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
a second evaluation feature processing unit, configured to determine the evaluation result of the speech to be evaluated according to the evaluation feature.
A speech evaluation device, including a memory and a processor;
the memory being configured to store a program;
the processor being configured to execute the program to implement each step of the speech evaluation method described above.
A readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing each step of the speech evaluation method described above.
It can be seen from the above technical solutions that the speech evaluation method provided by the embodiments of this application obtains speech to be evaluated and a keyword serving as an evaluation criterion, detects whether a speech segment corresponding to the keyword is present in the speech to be evaluated to obtain a detection result, and determines the evaluation result of the speech to be evaluated according to the detection result. By obtaining the keyword serving as the evaluation criterion, the application can automatically detect whether the speech segment corresponding to the keyword is present in the speech to be evaluated and determine the evaluation result from the detection result. Since no manual evaluation is needed, the method both avoids interference of human subjectivity with the evaluation result and reduces labor cost.
Detailed description of the invention
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a speech evaluation method disclosed in an embodiment of this application;
Fig. 2 is a schematic diagram of extracting keyword and non-keyword hidden-layer averaged features from a speech sample to be evaluated;
Fig. 3 is a schematic structural diagram of a speech evaluation apparatus disclosed in an embodiment of this application;
Fig. 4 is a hardware block diagram of a speech evaluation device disclosed in an embodiment of this application.
Specific Embodiments
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
To solve the problems that existing spoken-language assessment is performed manually, leaving the evaluation result subject to human interference and wasting labor cost, this application realizes automated speech evaluation based on speech detection technology. As illustrated in Fig. 1, the speech evaluation method may include:
Step S100: obtain speech to be evaluated, and a keyword serving as an evaluation criterion.
Specifically, taking a speaking-test scenario as an example, the speech to be evaluated may be a recording of the spoken answer given by an examinee. Correspondingly, the keyword serving as the evaluation criterion may be preset in this embodiment. For a speaking-test question based on reading material, the keyword serving as the evaluation criterion may be extracted from the reading material. For other types of spoken questions, the keyword serving as the evaluation criterion may be extracted from the answer corresponding to the question.
In this step, the speech to be evaluated may be captured by a recording device, which may include a microphone, such as a headset microphone.
The keyword serving as the evaluation criterion reflects the core points of the model answer. The keyword may be specified by the user in advance, or extracted from the answer corresponding to the question using a keyword extraction technique, such as the common TF-IDF (term frequency-inverse document frequency) keyword extraction method.
It is understood that the number of keywords serving as the evaluation criterion is not limited; there may be one or more.
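As an illustrative aside (not part of the original disclosure), a minimal sketch of TF-IDF keyword extraction in Python might look as follows; scikit-learn, the function layout, and the top-k selection are all assumptions:

```python
# Sketch: pick evaluation keywords from a reference answer by TF-IDF weight.
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(reference_answer: str, background_corpus: list[str], top_k: int = 5) -> list[str]:
    vectorizer = TfidfVectorizer()
    # Fit IDF statistics on the background corpus plus the reference answer.
    vectorizer.fit(background_corpus + [reference_answer])
    scores = vectorizer.transform([reference_answer]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    # Keep the top_k highest-weighted terms as evaluation keywords.
    ranked = sorted(zip(vocab, scores), key=lambda t: t[1], reverse=True)
    return [word for word, score in ranked[:top_k] if score > 0]
```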
Step S110: detect whether a speech segment corresponding to the keyword is present in the speech to be evaluated, to obtain a detection result.
Specifically, since the keyword determined above as the evaluation criterion reflects the core points of the answer, keyword detection can be performed on the speech to be evaluated in this step; that is, whether a speech segment corresponding to the keyword is present in the speech to be evaluated is detected, to obtain a detection result.
The detection result reflects whether the speech to be evaluated contains the speech segments corresponding to the keywords. When there is one keyword, the detection result is whether the speech to be evaluated contains the speech segment corresponding to that keyword. When there are at least two keywords, the detection result covers, for each keyword, whether the speech to be evaluated contains its corresponding speech segment.
Step S120: determine the evaluation result of the speech to be evaluated according to the detection result.
Specifically, as described above, the keywords reflect the core points of the answer to the question, so they can represent the answer to a certain extent. In this step, the evaluation result of the speech to be evaluated is determined according to whether the speech to be evaluated contains the speech segments corresponding to the keywords.
It is understood that the more keyword-corresponding speech segments are present in the speech to be evaluated, the better the evaluation result of the speech to be evaluated.
By obtaining the keyword serving as the evaluation criterion, the speech evaluation method provided by the embodiments of this application can automatically detect whether the speech segment corresponding to the keyword is present in the speech to be evaluated, and determine the evaluation result of the speech to be evaluated according to the detection result. Since no manual evaluation is needed, it both avoids interference of human subjectivity with the evaluation result and reduces labor cost.
In one embodiment of this application, an optional implementation of the above step S110, detecting whether a speech segment corresponding to the keyword is present in the speech to be evaluated, may include:
S1: recognize the speech to be evaluated, to obtain recognized text information.
Specifically, speech recognition may be performed on the speech to be evaluated to obtain the recognized text information.
S2: match the keyword with the text information to obtain a matching result, the matching result indicating whether the speech to be evaluated contains the speech segment corresponding to the keyword.
In this step, the a-priori keyword serving as the evaluation criterion is matched with the text information to obtain a matching result, which can serve as the aforementioned detection result.
It is understood that the matching result may include whether a word matching each keyword is present in the text information. Optionally, the matching result may further include the confidence of each word present in the text information that matches a keyword; the confidence of such a word may be its recognition confidence from the speech recognition process.
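For illustration, a minimal sketch of this matching step, assuming the recognizer supplies word-level confidences; the data layout and function are hypothetical:

```python
# Sketch: match evaluation keywords against the recognized transcript.
# Each transcript entry is (word, recognition_confidence) from the ASR step.
def match_keywords(keywords, transcript):
    hits = {}
    for kw in keywords:
        # A keyword is "hit" if any recognized word equals it; keep the best confidence.
        confs = [c for w, c in transcript if w == kw]
        hits[kw] = max(confs) if confs else None
    return hits

result = match_keywords(
    ["gravity", "orbit"],
    [("the", 0.99), ("gravity", 0.93), ("pulls", 0.88)],
)
# {'gravity': 0.93, 'orbit': None} -> 'gravity' is hit, 'orbit' is missed
```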
The above S1, recognizing the speech to be evaluated to obtain the recognized text information, may include:
S11: extract an acoustic feature of the speech to be evaluated.
The acoustic feature is used for speech recognition and is generally a spectral feature of the speech data, such as the Mel Frequency Cepstral Coefficient (MFCC) feature or the Perceptual Linear Predictive (PLP) feature.
During extraction, the speech to be evaluated may first be divided into frames, the framed speech may then be pre-emphasized, and finally the spectral feature of each frame is extracted in turn.
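A sketch of this extraction pipeline, assuming the librosa library and illustrative parameter values (the patent does not prescribe any specific toolkit):

```python
# Sketch: pre-emphasis, framing and MFCC extraction for the speech to be evaluated.
import librosa

def extract_acoustic_features(wav_path: str):
    y, sr = librosa.load(wav_path, sr=16000)        # load as 16 kHz mono
    y = librosa.effects.preemphasis(y, coef=0.97)   # pre-emphasis
    # Framing is handled internally: 25 ms windows with a 10 ms hop.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
    )
    return mfcc.T  # one 13-dimensional feature vector per frame
```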
S12: input the acoustic feature into a preset first acoustic recognition model, to obtain the text information corresponding to the speech to be evaluated output by the first acoustic recognition model.
The first acoustic recognition model may be a neural-network acoustic recognition model trained on a training corpus.
This embodiment provides several alternative constructions of the first acoustic recognition model, introduced in turn below.
First, the first acoustic recognition model may be a general acoustic recognition model, i.e., a general acoustic recognition model trained on an existing training corpus.
It should be noted that although a general acoustic recognition model can perform acoustic recognition, its training corpus may not cover all speaking-test scenarios, and speaking-test scenarios differ greatly, with large pronunciation differences across regions, so the recognition accuracy of the general acoustic recognition model in a speaking-test scenario can degrade.
On this basis, this embodiment performs a first-pass recognition of the speech to be evaluated with the general acoustic recognition model to obtain a first-pass recognition result. The first-pass recognition result and the speech to be evaluated may then be used as training data to adapt the general acoustic recognition model, and the adapted acoustic recognition model serves as the first acoustic recognition model.
Optionally, when adapting the general acoustic recognition model, the first-pass recognition results whose recognition confidence exceeds a set threshold may be selected, together with the corresponding speech to be evaluated, as the training data.
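A minimal sketch of this data-selection step; the structures and the 0.9 threshold are illustrative assumptions:

```python
# Sketch: keep only high-confidence first-pass recognition results as adaptation data.
def select_adaptation_data(first_pass_results, threshold=0.9):
    """first_pass_results: list of (audio_segment, hypothesis_text, confidence)."""
    return [
        (audio, text)
        for audio, text, conf in first_pass_results
        if conf >= threshold  # discard low-confidence hypotheses
    ]
# The retained (audio, text) pairs can then be used to adapt the general acoustic model.
```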
Further, this embodiment describes yet another construction of the first acoustic recognition model.
Since the goal of this application is to detect, based on a-priori keywords, whether speech segments corresponding to the keywords are present in the speech to be evaluated, this embodiment designs an acoustic recognition model with a new decoding space to further improve keyword detection accuracy. Unlike the existing decoding space formed by all the words in a dictionary, the new decoding space of this embodiment is formed by the keywords and a filler, the filler absorbing all non-keyword words besides the keywords.
For example, if the keywords include A, B and C, and N represents the filler, the new decoding space consists of A, B, C and N.
The first acoustic recognition model with this new decoding space converts the speech recognition process into an active keyword detection process based on the a-priori keywords. With this first acoustic recognition model, keyword recognition accuracy is higher because the model is not affected by the uneven distribution of keywords and non-keywords in the training data.
The acoustic recognition model corresponding to the new decoding space serves as the first acoustic recognition model. When speech recognition is performed based on this model, the recognition result contains only keywords and the filler; the influence of all non-keyword words is filtered out by the filler. Further, during recognition the model judges whether the recognition confidence of a keyword exceeds a set confidence threshold; if so, the corresponding keyword is recognized, otherwise the filler is recognized.
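As a sketch of the per-token decision rule just described (a real decoder scores whole hypothesis paths; this only illustrates the threshold test, and the 0.6 value is an assumption):

```python
# Sketch: decoding over a space of keywords plus one filler token.
KEYWORDS = {"A", "B", "C"}
FILLER = "N"

def decode_token(hypothesis: str, confidence: float, threshold: float = 0.6) -> str:
    # Emit the keyword only when its recognition confidence clears the threshold;
    # everything else is absorbed by the filler.
    if hypothesis in KEYWORDS and confidence >= threshold:
        return hypothesis
    return FILLER
```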
Further, since the acoustic pronunciations of different words or phrases are not necessarily alike, the above keyword confidence threshold does not suit all questions, so this application proposes an adaptive method for the keyword confidence threshold. Using manual scores of the speech to be evaluated, an alarm set and a recall set can be built: the alarm set contains speech to be evaluated with low scores, and the recall set contains speech to be evaluated with high scores. The confidence threshold of the keyword serving as the evaluation criterion is then adjusted based on the alarm set and the recall set.
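A minimal sketch of one way such threshold adaptation could work; the objective function and candidate grid are illustrative choices, not taken from the patent:

```python
# Sketch: adapt a keyword's confidence threshold from an "alarm" set
# (low-scored answers) and a "recall" set (high-scored answers).
def tune_threshold(alarm_confs, recall_confs, candidates=None):
    """alarm_confs / recall_confs: detection confidences of this keyword
    in low-scored and high-scored answers respectively."""
    candidates = candidates or [i / 100 for i in range(30, 96, 5)]
    def objective(t):
        false_alarms = sum(c >= t for c in alarm_confs)   # detections to suppress
        recalled = sum(c >= t for c in recall_confs)      # detections to keep
        return recalled - false_alarms
    return max(candidates, key=objective)
```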
Another embodiment of this application describes a further implementation of the above step S110, detecting whether a speech segment corresponding to the keyword is present in the speech to be evaluated. On the basis of the aforementioned S1-S2, a process of classifying the speech to be evaluated with a keyword classifier may be added.
Two forms of keyword classifier are described in this embodiment and introduced in turn.
First keyword classifier:
Following the aforementioned S11-S12, the acoustic feature of the speech to be evaluated is input into the first acoustic recognition model, which outputs the text information corresponding to the speech to be evaluated. In this embodiment, the hidden-layer output of the first acoustic recognition model, i.e., the hidden-layer averaged acoustic feature converted from the acoustic feature, may be obtained.
This hidden-layer averaged acoustic feature is a highly abstract representation of the input acoustic feature: it is the result of averaging the hidden-layer acoustic features of all frames in the speech to be evaluated.
Further, the hidden-layer averaged acoustic feature and the word-vector feature of the keyword are input into the preset first keyword classifier, which outputs the classification result of whether the speech to be evaluated contains speech segments corresponding to the keyword or to non-keywords.
Specifically, the classification result obtained in this embodiment can serve as the detection result in step S110.
The first keyword classifier is trained with, as training samples, the hidden-layer averaged acoustic features obtained by converting the acoustic features of speech training data through the hidden layer of the first acoustic recognition model, together with the word-vector feature of the keyword, and with, as sample labels, the classification annotations of whether the speech training data contains speech segments corresponding to the keyword or to non-keywords.
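For illustration, a sketch of such a classifier in PyTorch; the framework, network shape, and all dimensions are assumptions:

```python
# Sketch of the first keyword classifier: the utterance-level averaged hidden-layer
# acoustic feature is concatenated with a keyword's word-vector and classified.
import torch
import torch.nn as nn

class KeywordClassifier(nn.Module):
    def __init__(self, acoustic_dim=256, wordvec_dim=100, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim + wordvec_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # 2 classes: keyword segment present / absent
        )

    def forward(self, avg_hidden_feat, keyword_vec):
        x = torch.cat([avg_hidden_feat, keyword_vec], dim=-1)
        return self.net(x)

# avg_hidden_feat: mean of the acoustic model's hidden-layer outputs over all frames.
# One forward pass is made per (utterance, keyword) pair.
```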
Second keyword classifier:
In this embodiment, the speech to be evaluated may be windowed to obtain at least one windowed speech segment to be evaluated. The window length may be a first set number of frames, such as 40 frames, and the window step may be a second set number of frames, such as 5 frames. For each windowed speech segment, the corresponding windowed acoustic feature can be extracted. Further, from the hidden-layer output of the first acoustic recognition model, the hidden-layer averaged windowed acoustic feature converted from the windowed acoustic feature is obtained for each windowed speech segment.
Further, each hidden-layer averaged windowed acoustic feature is input into the preset second keyword classifier, which outputs the classification result of whether each windowed speech segment corresponds to the keyword or to a non-keyword.
Specifically, the classification result obtained in this embodiment can serve as the detection result in step S110.
The second keyword classifier is trained with the keyword hidden-layer averaged acoustic features and the non-keyword hidden-layer averaged features obtained by converting, through the hidden layer of the first acoustic recognition model, the acoustic features of the speech segments corresponding to keywords and to non-keywords in speech training data.
Specifically, when training the second keyword classifier, the first acoustic recognition model may be used to recognize the speech training data, and the keyword-corresponding and non-keyword-corresponding speech segments in the speech training data are determined according to the recognition result. The second keyword classifier is then trained using the keyword hidden-layer averaged features of the keyword-corresponding speech segments, converted through the hidden layer of the first acoustic recognition model, and the non-keyword hidden-layer averaged features of the non-keyword-corresponding speech segments, converted in the same way.
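A minimal sketch of the windowed feature preparation, using the 40-frame window and 5-frame step from the embodiment; numpy and the function layout are assumptions:

```python
# Sketch: window the frame-level hidden-layer features and average each window.
import numpy as np

def windowed_hidden_averages(hidden_frames: np.ndarray,
                             win: int = 40, step: int = 5) -> np.ndarray:
    """hidden_frames: (num_frames, hidden_dim) hidden-layer outputs of the acoustic model."""
    windows = [
        hidden_frames[s:s + win].mean(axis=0)
        for s in range(0, max(1, len(hidden_frames) - win + 1), step)
    ]
    return np.stack(windows)  # one averaged feature per window, fed to classifier 2
```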
Fig. 2 shows a schematic diagram of extracting keyword and non-keyword hidden-layer averaged features from a speech sample to be evaluated.
Of the two keyword classifiers exemplified above, either one may be used, or both may be used together; the classification results obtained by the keyword classifiers serve as the detection result in step S110.
In this embodiment, the hidden-layer features of the first acoustic recognition model are further used as input features of the keyword classifier, and the keyword classifier outputs the classification result of whether the speech to be evaluated contains speech segments corresponding to the keyword or to non-keywords. This classification result can be combined with the matching result between the keyword and the text information of the speech to be evaluated from the aforementioned S1-S2, the two together serving as the detection result for determining whether a speech segment corresponding to the keyword is present in the speech to be evaluated.
Another embodiment of this application introduces the process of the above step S120, determining the evaluation result of the speech to be evaluated according to the detection result.
Based on the detection result determined in the above embodiments, the evaluation result of the speech to be evaluated can be determined. This process may include two stages, as follows:
First stage: determining an evaluation feature according to the detection result.
Several types of evaluation features are described in this embodiment and introduced in turn:
1) Hit keywords:
A hit keyword is a keyword whose corresponding speech segment is present in the speech to be evaluated. According to the detection result determined above, it can be determined which keyword-corresponding speech segments the speech to be evaluated contains.
When hit keywords serve as an evaluation feature, they can be represented as a one-hot style vector, i.e., an N-dimensional vector, where N is the number of keywords and each element position corresponds to a unique keyword. Each element takes one of two values: the first value indicates that the keyword is a hit keyword, and the second value indicates that the keyword is a missed keyword, where the first value may be 1 and the second value may be 0.
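A one-line sketch of this encoding:

```python
# Sketch: encode hit keywords as the N-dimensional 0/1 vector described above.
def hit_vector(keywords, hit_set):
    # 1 marks a hit keyword, 0 a missed one; positions follow the keyword list.
    return [1 if kw in hit_set else 0 for kw in keywords]

hit_vector(["A", "B", "C"], {"A", "C"})  # -> [1, 0, 1]
```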
2) Confidence of a hit keyword:
The confidence of a hit keyword may be the recognition confidence of the aforementioned first acoustic recognition model for the hit keyword, or the classification confidence of an aforementioned keyword classifier, such as the first keyword classifier or the second keyword classifier, for the hit keyword.
3) Keyword hit rate:
The keyword hit rate is the ratio of the number of hit keywords to the total number of keywords.
4) Gaussian duration of a hit keyword:
The Gaussian duration of a hit keyword is determined from the pronunciation duration of the hit keyword in the speech to be evaluated, and can serve as a measure of the examinee's pronunciation characteristics for the hit keyword.
Specifically, according to the detection result, it can be determined which keyword-corresponding speech segments the speech to be evaluated contains, and where each keyword-corresponding speech segment lies. The Gaussian duration of a keyword can then be determined from the pronunciation duration of its corresponding speech segment in the speech to be evaluated.
The Gaussian duration assumes that the pronunciation duration of each syllable follows a normal distribution, and the Gaussian duration of a hit keyword can describe the examinee's pronunciation characteristics for that keyword. It is extracted as follows: first, a distribution table of the pronunciation duration mean and variance of each hit keyword or keyword component unit (such as a syllable or phoneme) is constructed. Taking syllables as the keyword component units for illustration: based on the constructed table of syllable pronunciation duration means and variances, the Gaussian duration score of each syllable in a hit keyword can be computed, and the average of the syllable Gaussian duration scores is taken as the Gaussian duration score of the hit keyword. The calculation takes the following form, with the per-syllable score being the Gaussian score of the observed duration under N(μ_k, σ_k²):

w_gauss = (1/K) * Σ_{k=1..K} ph_gauss(k),  ph_gauss(k) = exp(-(x_k - μ_k)^2 / (2σ_k^2))

where w_gauss is the Gaussian duration of the hit keyword, K is the number of syllables of the hit keyword, ph_gauss(k) is the Gaussian duration of the k-th syllable, μ_k and σ_k are the pronunciation duration mean and variance of the k-th syllable, and x_k is the pronunciation duration of the k-th syllable of the hit keyword in the speech to be evaluated.

The above table of syllable pronunciation duration means and variances may be a common table built from a large amount of speaking-test data, or an adaptive table built from the current speaking-test data.
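A minimal sketch of this computation, with the per-syllable score form reconstructed as above:

```python
# Sketch: Gaussian duration score of a hit keyword, following the formula above.
import math

def gauss_duration(syllable_durations, duration_stats):
    """syllable_durations: observed duration x_k of each syllable of the hit keyword.
    duration_stats: list of (mu_k, sigma_k) from the duration distribution table."""
    scores = [
        math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
        for x, (mu, sigma) in zip(syllable_durations, duration_stats)
    ]
    return sum(scores) / len(scores)  # average over the K syllables
```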
Second stage: determining the evaluation result of the speech to be evaluated according to the evaluation feature.
Specifically, from the evaluation features determined in the first stage, one or more of them can be chosen in combination, and the evaluation result of the speech to be evaluated is determined based on the chosen evaluation features.
In this embodiment, the evaluation result of the speech to be evaluated can be determined based on the evaluation features and a pre-trained scoring regression model.
The scoring regression model may take forms such as linear regression, Gaussian regression, or neural-network regression.
During training, the evaluation features of speech training data may be used as training samples, and the annotated evaluation results of the speech training data as sample labels.
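For illustration, a minimal sketch using linear regression (one of the forms mentioned); scikit-learn and the toy feature layout are assumptions:

```python
# Sketch: train a scoring regression model on evaluation features.
from sklearn.linear_model import LinearRegression

# Each row: [keyword hit rate, mean hit confidence, mean Gaussian duration]
X_train = [[0.8, 0.91, 0.75], [0.2, 0.55, 0.40], [0.6, 0.83, 0.66]]  # toy values
y_train = [4.5, 1.0, 3.5]  # annotated scores of the training speech (toy values)

model = LinearRegression().fit(X_train, y_train)
score = model.predict([[0.7, 0.88, 0.70]])[0]  # evaluation result for new speech
```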
The speech evaluation apparatus provided by the embodiments of this application is described below; the speech evaluation apparatus described below and the speech evaluation method described above may be referred to in correspondence with each other.
Referring to Fig. 3, Fig. 3 is a schematic structural diagram of a speech evaluation apparatus disclosed in an embodiment of this application. As shown in Fig. 3, the apparatus may include:
a data acquisition unit 11, configured to obtain speech to be evaluated for a target question, and a keyword serving as an evaluation criterion;
a speech detection unit 12, configured to detect whether a speech segment corresponding to the keyword is present in the speech to be evaluated, to obtain a detection result;
an evaluation result determination unit 13, configured to determine an evaluation result of the speech to be evaluated according to the detection result.
Optionally, the speech detection unit may include:
a text recognition unit, configured to recognize the speech to be evaluated, to obtain recognized text information;
a text matching unit, configured to match the keyword with the text information to obtain a matching result, the matching result indicating whether the speech to be evaluated contains the speech segment corresponding to the keyword.
Optionally, the text recognition unit may include:
an acoustic feature extraction unit, configured to extract an acoustic feature of the speech to be evaluated;
a first acoustic recognition model prediction unit, configured to input the acoustic feature into a preset first acoustic recognition model, to obtain the text information corresponding to the speech to be evaluated output by the first acoustic recognition model.
Optionally, the first acoustic recognition model may be a general acoustic recognition model, or an acoustic recognition model obtained by adapting the general acoustic recognition model using the recognition result of the general acoustic recognition model on the speech to be evaluated.
Optionally, the first acoustic recognition model may be an acoustic recognition model corresponding to a decoding space formed by the keyword and a filler, the filler representing all non-keyword words.
Optionally, the speech detection unit may further include:
a global hidden-layer feature acquisition unit, configured to obtain the hidden-layer output of the first acoustic recognition model, namely the hidden-layer averaged acoustic feature converted from the acoustic feature;
a first keyword classifier prediction unit, configured to input the hidden-layer averaged acoustic feature and the word-vector feature of the keyword into a preset first keyword classifier, to obtain a classification result, output by the first keyword classifier, of whether the speech to be evaluated contains speech segments corresponding to the keyword or to non-keywords;
the first keyword classifier being trained with, as training samples, hidden-layer averaged acoustic features obtained by converting acoustic features of speech training data through the hidden layer of the first acoustic recognition model, together with the word-vector feature of the keyword, and with, as sample labels, classification annotations of whether the speech training data contains speech segments corresponding to the keyword or to non-keywords.
Optionally, the speech detection unit may further include:
a speech windowing unit, configured to window the speech to be evaluated, to obtain at least one windowed speech segment to be evaluated;
a windowed hidden-layer feature acquisition unit, configured to obtain, from the hidden layer of the first acoustic recognition model, the hidden-layer averaged windowed acoustic feature converted from the windowed acoustic feature of each windowed speech segment;
a second keyword classifier prediction unit, configured to input each hidden-layer averaged windowed acoustic feature into a preset second keyword classifier, to obtain a classification result, output by the second keyword classifier, of whether each windowed speech segment corresponds to the keyword or to a non-keyword;
the second keyword classifier being trained with keyword hidden-layer averaged acoustic features and non-keyword hidden-layer averaged features obtained by converting, through the hidden layer of the first acoustic recognition model, the acoustic features of the speech segments corresponding to keywords and to non-keywords in speech training data.
Optionally, this application illustrates two alternative constructions of the evaluation result determination unit, described below in turn:
First, the evaluation result determination unit may include:
a first evaluation feature determination unit, configured to determine an evaluation feature according to the detection result, the evaluation feature including any one or more of hit keywords, the confidence of a hit keyword, the keyword hit rate, and the Gaussian duration of a hit keyword;
wherein a hit keyword is a keyword whose corresponding speech segment is present in the speech to be evaluated; the confidence of a hit keyword is the recognition confidence of the first acoustic recognition model for the hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
a first evaluation feature processing unit, configured to determine the evaluation result of the speech to be evaluated according to the evaluation feature.
Second, the evaluation result determination unit may include:
a second evaluation feature determination unit, configured to determine an evaluation feature according to the detection result, the evaluation feature including any one or more of hit keywords, the confidence of a hit keyword, the keyword hit rate, and the Gaussian duration of a hit keyword;
wherein a hit keyword is a keyword whose corresponding speech segment is present in the speech to be evaluated; the confidence of a hit keyword is the classification confidence of the first keyword classifier for the hit keyword, or the classification confidence of the second keyword classifier for the hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
a second evaluation feature processing unit, configured to determine the evaluation result of the speech to be evaluated according to the evaluation feature.
The speech evaluation apparatus provided by the embodiments of this application can be applied to speech evaluation devices such as PC terminals, cloud platforms, servers, and server clusters. Optionally, Fig. 4 shows a hardware block diagram of a speech evaluation device. Referring to Fig. 4, the hardware structure of the speech evaluation device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4.
In the embodiments of this application, there is at least one each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another via the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, for example at least one magnetic disk memory;
the memory stores a program that the processor can call, the program being used for:
obtaining speech to be evaluated for a target question, and a keyword serving as an evaluation criterion;
detecting whether a speech segment corresponding to the keyword is present in the speech to be evaluated, to obtain a detection result;
determining an evaluation result of the speech to be evaluated according to the detection result.
Optionally, for refinements and extensions of the program's functions, refer to the description above.
An embodiment of this application further provides a readable storage medium storing a program suitable for execution by a processor, the program being used for:
obtaining speech to be evaluated for a target question, and a keyword serving as an evaluation criterion;
detecting whether a speech segment corresponding to the keyword is present in the speech to be evaluated, to obtain a detection result;
determining an evaluation result of the speech to be evaluated according to the detection result.
Optionally, for refinements and extensions of the program's functions, refer to the description above.
Finally, it should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes it.
The embodiments in this specification are described in a progressive manner; each embodiment highlights its differences from the others, and the same or similar parts among the embodiments may be referred to mutually.
The above description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (16)
1. A speech evaluation method, characterized by comprising:
obtaining speech to be evaluated, and a keyword serving as an evaluation criterion;
detecting whether a speech segment corresponding to the keyword is present in the speech to be evaluated, to obtain a detection result;
determining an evaluation result of the speech to be evaluated according to the detection result.
2. The method according to claim 1, characterized in that detecting whether a speech segment corresponding to the keyword is present in the speech to be evaluated comprises:
recognizing the speech to be evaluated, to obtain recognized text information;
matching the keyword with the text information to obtain a matching result, the matching result indicating whether the speech to be evaluated contains the speech segment corresponding to the keyword.
3. The method according to claim 2, characterized in that recognizing the speech to be evaluated to obtain the recognized text information comprises:
extracting an acoustic feature of the speech to be evaluated;
inputting the acoustic feature into a preset first acoustic recognition model, to obtain the text information corresponding to the speech to be evaluated output by the first acoustic recognition model.
4. The method according to claim 3, characterized in that the first acoustic recognition model is a general acoustic recognition model, or an acoustic recognition model obtained by adapting the general acoustic recognition model using the recognition result of the general acoustic recognition model on the speech to be evaluated.
5. The method according to claim 3, characterized in that the first acoustic recognition model is an acoustic recognition model corresponding to a decoding space formed by the keyword and a filler, the filler representing all non-keyword words.
6. The method according to claim 3, wherein the detecting whether a speech segment corresponding to the keyword exists in the speech to be evaluated further comprises:
obtaining the hidden-layer averaged acoustic feature that a hidden layer of the first acoustic recognition model outputs after transforming the acoustic features;
inputting the hidden-layer averaged acoustic feature and a word-vector feature of the keyword into a preset first keyword classifier, to obtain a classification result, output by the first keyword classifier, of the speech segments of the speech to be evaluated corresponding to the keyword and to non-keyword words;
wherein the first keyword classifier is trained by taking, as training samples, the hidden-layer averaged acoustic features obtained by transforming the acoustic features of speech training data through the hidden layer of the first acoustic recognition model, together with the word-vector feature of the keyword, and by taking, as sample labels, the classification annotations of the speech segments of the speech training data corresponding to the keyword and to non-keyword words.
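A sketch of the classifier input and training of claim 6: the hidden-layer output of the acoustic model is averaged over frames and concatenated with the keyword's word vector, and a classifier trained on annotated keyword / non-keyword segments then scores each candidate segment. The dimensions, the random stand-in data, and the logistic-regression model are assumptions — the claim does not fix the classifier type:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

HID, EMB = 256, 100  # illustrative hidden-layer and word-vector sizes

def classifier_input(hidden_frames: np.ndarray, keyword_vec: np.ndarray) -> np.ndarray:
    pooled = hidden_frames.mean(axis=0)           # average over frames -> (HID,)
    return np.concatenate([pooled, keyword_vec])  # -> (HID + EMB,)

rng = np.random.default_rng(0)
X = np.stack([classifier_input(rng.normal(size=(50, HID)), rng.normal(size=EMB))
              for _ in range(200)])
y = rng.integers(0, 2, size=200)  # stand-in keyword / non-keyword annotations
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:1]))   # per-segment classification confidence
```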
7. The method according to any one of claims 1 to 5, wherein the determining an evaluation result of the speech to be evaluated according to the detection result comprises:
determining evaluation features according to the detection result, the evaluation features comprising any one or more of: hit keywords, the confidence of the hit keywords, a keyword hit rate, and a Gaussian duration of the hit keywords;
wherein a hit keyword is a keyword for which a corresponding speech segment exists in the speech to be evaluated; the confidence of a hit keyword is the recognition confidence of the first acoustic recognition model for the hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
determining the evaluation result of the speech to be evaluated according to the evaluation features.
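A sketch of the evaluation features of claim 7. The confidence threshold for counting a hit, and the reading of "Gaussian duration" as the likelihood of the observed pronunciation duration under a Gaussian duration model, are assumptions, not definitions taken from the claim:

```python
import math
from typing import Dict, List

def evaluation_features(conf: Dict[str, float], durations: Dict[str, float],
                        keywords: List[str], mu: float = 0.4,
                        sigma: float = 0.15, threshold: float = 0.5) -> dict:
    # conf: per-keyword detection confidence; durations: pronunciation
    # duration in seconds of each hit. mu/sigma/threshold are illustrative.
    hits = [kw for kw in keywords if conf.get(kw, 0.0) >= threshold]
    hit_rate = len(hits) / len(keywords)          # hit count over keyword count
    avg_conf = sum(conf[kw] for kw in hits) / len(hits) if hits else 0.0
    gauss = {kw: math.exp(-((durations[kw] - mu) ** 2) / (2 * sigma ** 2))
             for kw in hits if kw in durations}   # duration plausibility per hit
    return {"hits": hits, "hit_rate": hit_rate,
            "avg_confidence": avg_conf, "gauss_duration": gauss}

print(evaluation_features({"apple": 0.9, "pear": 0.3}, {"apple": 0.45},
                          ["apple", "pear"]))
```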
8. The method according to claim 6, wherein the determining an evaluation result of the speech to be evaluated according to the detection result comprises:
determining evaluation features according to the detection result, the evaluation features comprising any one or more of: hit keywords, the confidence of the hit keywords, a keyword hit rate, and a Gaussian duration of the hit keywords;
wherein a hit keyword is a keyword for which a corresponding speech segment exists in the speech to be evaluated; the confidence of a hit keyword is the classification confidence of the first keyword classifier for the hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
determining the evaluation result of the speech to be evaluated according to the evaluation features.
9. A speech evaluation apparatus, comprising:
a data obtaining unit, configured to obtain speech to be evaluated and a keyword serving as an evaluation criterion;
a speech detection unit, configured to detect whether a speech segment corresponding to the keyword exists in the speech to be evaluated, to obtain a detection result;
an evaluation result determining unit, configured to determine an evaluation result of the speech to be evaluated according to the detection result.
10. The apparatus according to claim 9, wherein the speech detection unit comprises:
a text recognition unit, configured to recognize the speech to be evaluated to obtain recognized text information;
a text matching unit, configured to match the keyword against the text information to obtain a matching result, the matching result indicating whether the speech to be evaluated contains a speech segment corresponding to the keyword.
11. The apparatus according to claim 10, wherein the text recognition unit comprises:
an acoustic feature extraction unit, configured to extract acoustic features of the speech to be evaluated;
a first acoustic recognition model prediction unit, configured to input the acoustic features into a preset first acoustic recognition model, to obtain the text information corresponding to the speech to be evaluated that is output by the first acoustic recognition model.
12. The apparatus according to claim 11, wherein the speech detection unit further comprises:
a global hidden-layer feature obtaining unit, configured to obtain the hidden-layer averaged acoustic feature that a hidden layer of the first acoustic recognition model outputs after transforming the acoustic features;
a first keyword classifier prediction unit, configured to input the hidden-layer averaged acoustic feature and a word-vector feature of the keyword into a preset first keyword classifier, to obtain a classification result, output by the first keyword classifier, of the speech segments of the speech to be evaluated corresponding to the keyword and to non-keyword words;
wherein the first keyword classifier is trained by taking, as training samples, the hidden-layer averaged acoustic features obtained by transforming the acoustic features of speech training data through the hidden layer of the first acoustic recognition model, together with the word-vector feature of the keyword, and by taking, as sample labels, the classification annotations of the speech segments of the speech training data corresponding to the keyword and to non-keyword words.
13. The apparatus according to any one of claims 9 to 11, wherein the evaluation result determining unit comprises:
a first evaluation feature determining unit, configured to determine evaluation features according to the detection result, the evaluation features comprising any one or more of: hit keywords, the confidence of the hit keywords, a keyword hit rate, and a Gaussian duration of the hit keywords;
wherein a hit keyword is a keyword for which a corresponding speech segment exists in the speech to be evaluated; the confidence of a hit keyword is the recognition confidence of the first acoustic recognition model for the hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
a first evaluation feature processing unit, configured to determine the evaluation result of the speech to be evaluated according to the evaluation features.
14. The apparatus according to claim 12, wherein the evaluation result determining unit comprises:
a second evaluation feature determining unit, configured to determine evaluation features according to the detection result, the evaluation features comprising any one or more of: hit keywords, the confidence of the hit keywords, a keyword hit rate, and a Gaussian duration of the hit keywords;
wherein a hit keyword is a keyword for which a corresponding speech segment exists in the speech to be evaluated; the confidence of a hit keyword is the classification confidence of the first keyword classifier for the hit keyword; the keyword hit rate is the ratio of the number of hit keywords to the total number of keywords; and the Gaussian duration is determined from the pronunciation duration of the hit keyword in the speech to be evaluated;
a second evaluation feature processing unit, configured to determine the evaluation result of the speech to be evaluated according to the evaluation features.
15. A speech evaluation device, comprising a memory and a processor;
the memory being configured to store a program;
the processor being configured to execute the program to implement the steps of the speech evaluation method according to any one of claims 1 to 8.
16. A readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, the steps of the speech evaluation method according to any one of claims 1 to 8 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811073869.3A CN109192224B (en) | 2018-09-14 | 2018-09-14 | Voice evaluation method, device and equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109192224A true CN109192224A (en) | 2019-01-11 |
CN109192224B CN109192224B (en) | 2021-08-17 |
Family
ID=64910988
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811073869.3A Active CN109192224B (en) | 2018-09-14 | 2018-09-14 | Voice evaluation method, device and equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109192224B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104143328A (en) * | 2013-08-15 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and device for detecting keywords |
CN103559881A (en) * | 2013-11-08 | 2014-02-05 | 安徽科大讯飞信息科技股份有限公司 | Language-irrelevant key word recognition method and system |
CN104810017A (en) * | 2015-04-08 | 2015-07-29 | 广东外语外贸大学 | Semantic analysis-based oral language evaluating method and system |
CN106856092A (en) * | 2015-12-09 | 2017-06-16 | 中国科学院声学研究所 | Chinese speech keyword retrieval method based on feedforward neural network language model |
CN105741831A (en) * | 2016-01-27 | 2016-07-06 | 广东外语外贸大学 | Spoken language evaluation method based on grammatical analysis and spoken language evaluation system |
JP2018081294A (en) * | 2016-11-10 | 2018-05-24 | 日本電信電話株式会社 | Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program |
CN107230475A (en) * | 2017-05-27 | 2017-10-03 | 腾讯科技(深圳)有限公司 | A kind of voice keyword recognition method, device, terminal and server |
CN108052504A (en) * | 2017-12-26 | 2018-05-18 | 科大讯飞股份有限公司 | Mathematics subjective item answers the structure analysis method and system of result |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109215632A (en) * | 2018-09-30 | 2019-01-15 | 科大讯飞股份有限公司 | A kind of speech evaluating method, device, equipment and readable storage medium storing program for executing |
CN109887498A (en) * | 2019-03-11 | 2019-06-14 | 西安电子科技大学 | Scoring method of polite expressions at highway entrances |
CN109979482A (en) * | 2019-05-21 | 2019-07-05 | 科大讯飞股份有限公司 | A kind of evaluating method and device for audio |
CN109979482B (en) * | 2019-05-21 | 2021-12-07 | 科大讯飞股份有限公司 | Audio evaluation method and device |
CN110322871A (en) * | 2019-05-30 | 2019-10-11 | 清华大学 | A kind of sample keyword retrieval method based on acoustics characterization vector |
CN113793589A (en) * | 2020-05-26 | 2021-12-14 | 华为技术有限公司 | Speech synthesis method and device |
CN111833853B (en) * | 2020-07-01 | 2023-10-27 | 腾讯科技(深圳)有限公司 | Voice processing method and device, electronic equipment and computer readable storage medium |
CN111833853A (en) * | 2020-07-01 | 2020-10-27 | 腾讯科技(深圳)有限公司 | Voice processing method and device, electronic equipment and computer readable storage medium |
CN112289308A (en) * | 2020-10-23 | 2021-01-29 | 上海凯石信息技术有限公司 | Voice dictation scoring method and device and electronic equipment |
CN113658586A (en) * | 2021-08-13 | 2021-11-16 | 北京百度网讯科技有限公司 | Training method of voice recognition model, voice interaction method and device |
CN113658586B (en) * | 2021-08-13 | 2024-04-09 | 北京百度网讯科技有限公司 | Training method of voice recognition model, voice interaction method and device |
CN114155831A (en) * | 2021-12-06 | 2022-03-08 | 科大讯飞股份有限公司 | Voice evaluation method, related equipment and readable storage medium |
CN114974214A (en) * | 2022-04-28 | 2022-08-30 | 北京明略昭辉科技有限公司 | Voice keyword detection method and device, electronic equipment and storage medium |
CN115171664A (en) * | 2022-07-05 | 2022-10-11 | 小米汽车科技有限公司 | Voice awakening method and device, intelligent voice equipment, vehicle and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109192224A | Speech evaluation method, apparatus, device, and readable storage medium | |
JP6902010B2 | Speech evaluation method, device, equipment, and readable storage medium | |
CN107274916B | Method and device for operating audio/video files based on voiceprint information | |
CN105938716B | Automatic detection method for copied speech samples based on multi-precision fitting | |
CN104900235B | Voiceprint recognition method based on pitch-period composite feature parameters | |
CN110085261A | Pronunciation correction method, apparatus, equipment, and computer-readable storage medium | |
CN108305616A | Audio scene recognition method and device based on long- and short-term feature extraction | |
CN105374352B | Voice activation method and system | |
CN108520753B | Speech lie detection method based on a convolutional bidirectional long short-term memory network | |
CN105957531B | Method and device for extracting speech content based on a cloud platform | |
CN111640456B | Method, device, and equipment for detecting overlapped speech | |
CN106683666B | Domain adaptation method based on deep neural networks | |
CN102915729B | Speech keyword spotting system, and method of creating a dictionary for the speech keyword spotting system | |
CN108711421A | Method and device for building an acoustic model for speech recognition, and electronic equipment | |
CN108899033B | Method and device for determining speaker characteristics | |
CN108986798B | Method, device, and equipment for processing voice data | |
CN106941008A | Blind detection method for heterologous audio splicing tampering based on silent segments | |
CN103971700A | Voice monitoring method and device | |
US9799325B1 | Methods and systems for identifying keywords in speech signal | |
CN109376264A | Audio detection method, device, equipment, and computer-readable storage medium | |
CN110459242A | Voice change detection method, terminal, and computer-readable storage medium | |
CN112017690B | Audio processing method, device, equipment, and medium | |
CN110223678A | Speech recognition method and system | |
CN111951825A | Pronunciation evaluation method, medium, device, and computing equipment | |
CN104700831B | Method and apparatus for analyzing phonetic features of audio files | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |