
CN101551947A - Computer system for assisting spoken language learning - Google Patents

Computer system for assisting spoken language learning Download PDF

Info

Publication number
CN101551947A
CN101551947A CNA2008101232015A CN200810123201A
Authority
CN
China
Prior art keywords
user
feature
spoken language
language learning
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008101232015A
Other languages
Chinese (zh)
Inventor
Yu Kai (俞凯)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Speech Information Technology Co Ltd
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CNA2008101232015A priority Critical patent/CN101551947A/en
Publication of CN101551947A publication Critical patent/CN101551947A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a computer system for assisting spoken language learning, comprising the following components: a user interface that prompts the user to complete a given spoken-language learning task and collects the user's voice response data; a database containing a group of pattern features that describe acoustic and linguistic aspects of performance relevant to the user's spoken language learning; a speech analysis system for analyzing the voice response data and extracting acoustic or linguistic pattern features; a pattern matching system for matching one or more subsets of the pattern features extracted from the voice response data against the pattern features in the database and generating feedback data according to the matching result; and a feedback system for returning the feedback data to the user and helping the user master the learning content. The computer system is highly interactive and adaptive; in particular, it can heuristically capture user errors and intelligently feed back rich teaching guidance, thereby remedying a deficiency of the prior art.

Description

Computer system for assisting spoken language learning
Technical field
The present invention relates to a computer system for assisting spoken language learning.
Background technology
Because learners lack a language-practice environment and targeted individual coaching, spoken language is the most difficult part of learning a foreign language. Although computers have long been used for language learning in a general sense, computer-aided instruction in the spoken-language field remains of limited effectiveness and is far from satisfactory. At present, some technologies related to speech recognition and pronunciation evaluation have been tentatively applied to assisted spoken language learning, but their performance is still significantly limited.
Published documents such as WO 2006/031536, WO 2006/057896, WO 02/50803, US 6,963,841, US 2005/144010 and WO 99/40556 describe spoken-language learning methods and computer systems built on speech recognition and pronunciation-evaluation techniques. In these methods and systems, however, the use of such techniques is largely confined to correcting and imitating the user's speech: acoustic parameters of the collected user speech, such as intonation, speaking rate and voice quality, are mapped onto the machine's model voice so that the model voice resembles the user's, or a prestored model voice similar to the user's speech is selected directly, and the user's spoken-dialogue training is advanced simply by having the user imitate and compare against the model voice. During such training the system may issue some simple error prompts, mostly when the user's speech fails to match the model voice, requiring the user to repeat the spoken-dialogue exercise. The system does not give heuristic feedback on, or assessment of, the user's learning in this process; in particular, it cannot heuristically capture the errors in the user's speech and correct and coach them. As a result, users cannot discover their own weaknesses in time and practise in a targeted way to improve their spoken proficiency.
In short, existing computer systems applied to assisted spoken-language learning cannot heuristically guide the user into deeper study of conversational language. Learners usually end up simply imitating correct spoken pronunciation without obtaining rich coaching information from their interaction with the system, so the performance of such systems is incomplete. Providing a computer system for assisting spoken language learning with more complete performance is therefore a genuinely important problem awaiting urgent solution.
Summary of the invention
The object of the present invention is to provide a computer system for assisting spoken language learning with comparatively complete performance. The system applies speech recognition together with speech- and language-analysis techniques to generate pattern features relevant to spoken language learning, and uses structured teaching pattern information; in particular, it can heuristically capture user errors and intelligently feed back rich teaching guidance, thereby remedying the deficiency of the prior art.
The technical scheme of the present invention is a computer system for assisting spoken language learning comprising the following components:
a user interface, comprising machine prompts that ask the user to complete a given spoken-language learning task, and means for collecting the user's voice response data during that task;
a database comprising characteristic data items, each containing a group of pattern features that describe acoustic and linguistic aspects of performance relevant to the user's spoken language learning, the quantified pattern features corresponding to feedback coaching information and to specific learning content;
a speech analysis system for analyzing the collected voice response data and extracting acoustic or linguistic pattern features from it;
a pattern matching system for matching one or more subsets of the acoustic or linguistic pattern features extracted from the user's voice response data against the pattern features in the database, and for generating feedback data according to the matching result;
a feedback system for returning the feedback data to the user to help the user master the spoken-language learning content.
In the main technical scheme above, the computer system for assisting spoken language learning specifically adopts a predefined, structured database. The database contains error instances that may be encountered in various kinds of language learning together with the corresponding teaching guidance; the error instances include both acoustically related and linguistically related instances. An error instance is described by a series of pattern features (i.e. a feature vector), and a pattern feature may be a word sequence, a number or a symbol. The "machine" referred to in the main technical scheme may be a computer or another electronic device, for example, but not limited to, a desktop or notebook computer or a mobile computing device such as a PDA. The computer system of the invention may also be realized in a distributed way over the Internet, for example as a client/server system. The invention can be applied to learning various languages, for example Chinese and English, and, depending on the content provided, to both teaching and testing; these are described in detail below.
In an embodiment of the present invention, the structured database further stores a group of interrelated data items. These associated items include a "characteristic data item", for example a feature vector containing a group of pattern features used to identify the various performances, in particular erroneous performances, that the user may exhibit in spoken-language practice. They also include a "guidance data item" containing coaching information that corresponds one-to-one with the acoustic or linguistic pattern features in the characteristic data item and guides the user to improve or correct mistakes in spoken pronunciation (or rewards the user's corrections). The coaching information can take various forms, for example spoken guidance (using a speech synthesizer), and/or textual guidance (output as text), and/or graphical guidance (output as diagrams). The associated items further include a "learning content data item" identifying specific learning targets within the spoken-language learning content; through this item the particular content of a learning target is associated with the data items above. In summary, these associated data items comprise the preset user responses to specific spoken-language learning content together with the coaching data item corresponding to each kind of response. The learning content can take many forms, for example practising pronunciation, fluency, intonation (e.g. the time trajectory of the fundamental frequency), tone (for tonal languages), stress, vocabulary choice and other related content. Taking a tonal language as an example, the learning content can be designed as tone practice: the collected speech data of the user's pronunciation, and in particular the features extracted from it, can be matched against one of a known group of tones (for example five).
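The tone-practice matching described above can be sketched as nearest-template classification of a fundamental-frequency (F0) contour. The templates, their five-point shape and the distance measure below are illustrative assumptions, not the patent's actual model.

```python
# Sketch: match an extracted F0 contour against stylized tone templates,
# as described for tonal-language practice. Values are normalized pitch.

def mean_squared_distance(a, b):
    """Average squared difference between two equal-length contours."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Stylized 5-point F0 templates for the four Mandarin tones (assumed shapes).
TONE_TEMPLATES = {
    1: [0.9, 0.9, 0.9, 0.9, 0.9],    # high level
    2: [0.4, 0.5, 0.6, 0.75, 0.9],   # rising
    3: [0.5, 0.3, 0.2, 0.35, 0.6],   # dipping
    4: [0.9, 0.75, 0.55, 0.35, 0.2], # falling
}

def classify_tone(f0_contour):
    """Return the tone whose template contour is closest to the input."""
    return min(TONE_TEMPLATES,
               key=lambda t: mean_squared_distance(f0_contour, TONE_TEMPLATES[t]))

rising = [0.42, 0.5, 0.62, 0.78, 0.88]
print(classify_tone(rising))  # -> 2
```

A real system would first time-normalize the contour and would likely use a statistical model rather than fixed templates; the sketch only shows the matching step.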
In essence, the invention uses a pattern matching system to match the pattern features of the user's pronunciation against the characteristic data items (feature vectors) preset for the corresponding learning content, then looks up the corresponding coaching data item according to the matching result and delivers the corresponding guidance. In this way, specific guidance content is associated with a number of preset learning situations. For example, a group of possible learning errors and the corresponding coaching content are preset, and the most relevant coaching content is selected according to the pattern-matching result and fed back to the user. In an idealized system there are generally many preset situations, although this is of course not a strict restriction.
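The "match the user's feature to a preset error, then return its coaching content" step can be sketched as a nearest-neighbour lookup. The error names, toy two-dimensional features and guidance strings below are invented for illustration.

```python
# Sketch: a small preset table maps known error patterns to coaching text;
# the learner's extracted feature is matched to the closest preset entry.

PRESET_ERRORS = {
    "l_for_r":    {"feature": (1.0, 0.0),
                   "guidance": "Curl the tongue tip back for /r/."},
    "flat_tone2": {"feature": (0.0, 1.0),
                   "guidance": "Tone 2 should rise; raise pitch toward the end."},
}

def match_guidance(user_feature):
    """Return the guidance whose preset feature is nearest to the user's."""
    def dist(name):
        preset = PRESET_ERRORS[name]["feature"]
        return sum((a - b) ** 2 for a, b in zip(user_feature, preset))
    best = min(PRESET_ERRORS, key=dist)
    return PRESET_ERRORS[best]["guidance"]

print(match_guidance((0.9, 0.1)))  # closest to the l_for_r entry
```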
In a real system, the pattern features used for matching are directly related to the acoustic or linguistic characteristics of the user's speech input: for example, features related to the fundamental-frequency trajectory and/or energy level of a syllable or phoneme, or features related to higher-level linguistic factors such as word order and abstract semantics. A group of pattern features can be described by a single vector whose elements may have different data types, for example an array of real numbers (such as a fundamental-frequency trajectory), an ordered list (such as a word sequence), or other similar elements. The acoustic and linguistic pattern features are described in detail below.
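A minimal sketch of such a mixed-type feature vector, using a dataclass; the field names are assumptions chosen to mirror the examples in the text (an F0 trajectory as a real-number array, a word sequence as an ordered list, and a scalar energy value).

```python
# Sketch: one pattern-feature "vector" whose elements have different types.
from dataclasses import dataclass

@dataclass
class PatternFeature:
    f0_trajectory: list   # real-number array, e.g. fundamental-frequency track
    word_sequence: list   # ordered list of recognized words
    energy: float = 0.0   # scalar acoustic feature

feat = PatternFeature(f0_trajectory=[120.0, 131.5, 140.2],
                      word_sequence=["please", "take", "the", "bottle"],
                      energy=0.72)
print(len(feat.f0_trajectory), feat.word_sequence[0])
```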
In an embodiment of the present invention, the speech analysis system is composed of an acoustic pattern analysis system and a linguistic pattern analysis system. Both use the output of an underlying speech recognition system, which itself comprises an acoustic model and a statistics-based language model. The acoustic model describes the degree of similarity between a speech segment and a given syllable or phoneme; the language model describes the mapping from syllables/phonemes to words and the statistical prior probabilities of words. In a further preferred embodiment, the speech recognition system is used mainly to generate the time boundaries of words and phonemes/syllables, according to which the collected speech data are effectively segmented and the pattern features of the acoustic and language models are grouped.
In an embodiment of the present invention, the output of the speech recognition system can be any of the following three kinds of information: (1) phonemes/words; (2) phonemes/words plus their time boundaries; (3) the time boundaries of phonemes/words. The recognition system may also output all three simultaneously.
In an embodiment of the present invention, the acoustic pattern analysis system assembles one or more phonemes, syllables and sentences, their time boundaries, the corresponding confidence information and the related prosodic information into an acoustic feature vector. These features differ from those used by the speech recognition system itself: such acoustic features (for example the fundamental-frequency track of a phoneme or its average energy) correspond to the phonetic characteristics relevant to teaching.
In an embodiment of the present invention, the acoustic pattern analysis system can recognize pronunciation at multiple levels of the spoken input, for example one or more phonemes, syllables and sentences, and provide a corresponding confidence measure, such as the posterior probability of a syllable. The acoustic pattern features thus comprise one or more phonemes, syllables and sentences together with their confidence data. In a further preferred embodiment, the acoustic pattern analysis system can also extract prosodic features from the collected voice response data, including the fundamental-frequency features of a given speech segment (corresponding to a phoneme or syllable in the collected speech) and the duration and energy value of that segment.
In an embodiment of the present invention, the linguistic pattern analysis system is used to recognize the syntactic structure of the user's speech. The database holds a large number of syntactic structures of various types; the analysis system compares the collected speech against the existing database records and produces a recognition result. A simple example: given the sentence "please take the bottle to the kitchen", the linguistic pattern analysis system identifies its syntactic structure as "take X to Y". It then checks whether this syntactic structure exists in the database and returns the corresponding index number.
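The structure lookup just described can be sketched as template identification followed by a database index lookup. The regular expressions, the second template and the index numbers are illustrative assumptions.

```python
# Sketch: reduce a sentence to a syntactic template, then look the
# template up in a small structure database that returns an index number.
import re

STRUCTURE_DB = {
    "take X to Y": 101,
    "X is Y": 102,
}

def identify_structure(sentence):
    """Map a sentence to a known syntactic template, or None."""
    if re.search(r"\btake\b.+\bto\b", sentence):
        return "take X to Y"
    if re.search(r"\bis\b", sentence):
        return "X is Y"
    return None

def lookup_index(sentence):
    """Return the database index of the sentence's structure, if any."""
    return STRUCTURE_DB.get(identify_structure(sentence))

print(lookup_index("please take the bottle to the kitchen"))  # -> 101
```

A real system would derive the template from a parse tree rather than surface regular expressions; the sketch only shows the template-to-index mapping.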
In a further preferred embodiment of the present invention, the linguistic pattern analysis system also performs semantic decoding, matching the collected and recognized speech against certain more abstract semantic representations. For example, the sentence "where is the restaurant?" can be abstracted at the semantic level as "query" plus "place" plus "dining". Research results on semantic-level analysis using speech recognition systems have been published in the scientific literature (for example S. Seneff, Robust parsing for spoken language systems, in Proc. ICASSP, 2000). Here, the semantic structure of the collected speech is used to determine one of the elements of the feature vector that indexes the coaching-information database.
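The abstraction of the patent's example sentence into "query" + "place" + "dining" can be sketched as keyword-triggered semantic labelling. The keyword lists below are invented; a real semantic decoder would use a parser, not word sets.

```python
# Sketch: map a recognized sentence to a set of abstract semantic labels.
SEMANTIC_LEXICON = {
    "query":  {"where", "what", "how", "when"},
    "place":  {"restaurant", "kitchen", "station", "hotel"},
    "dining": {"restaurant", "food", "eat", "menu"},
}

def decode_semantics(sentence):
    """Return the abstract semantic labels triggered by the sentence's words."""
    words = set(sentence.lower().replace("?", "").split())
    return {label for label, keys in SEMANTIC_LEXICON.items() if words & keys}

print(sorted(decode_semantics("where is the restaurant?")))
```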
In a further preferred embodiment of the present invention, the linguistic pattern analysis system can also recognize one or more keywords in the user's speech, in particular "grammatical" keywords such as conjunctions and prepositions. The acoustic pattern analysis system then provides a confidence score for each recognized keyword. These keyword confidence scores likewise serve as elements of the feature vector used to determine the index into the coaching-information database. This feature is particularly valuable when the keywords are important for understanding the meaning of an utterance.
In an embodiment of the present invention, the acoustic pattern analysis system and the linguistic pattern analysis system, or either of them, can recognize erroneous or correct acoustic and linguistic/syntactic structures. The system of the invention can accordingly recognize frequent errors in the user's spoken-language learning and correct or improve them. For example, native speakers of Japanese often pronounce the phoneme "R" as the phoneme "L" (because Japanese has no "R" sound); the system can recognize this and guide the user to correct the pronunciation. Similarly, the system can recognize the reply "how do you do?" as the standard answer in a given dialogue exercise, yet also suggest to the user alternative answers with informal syntactic structures to help the user improve spoken proficiency.
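The heuristic capture of a known confusion pair, such as the R/L substitution above, can be sketched as comparing recognizer confidences for the expected phoneme and its common substitute. The confidence format and the guidance text are assumptions.

```python
# Sketch: flag a phoneme substitution when the recognizer's confidence
# favors a known substitute over the expected target phoneme.
CONFUSION_PAIRS = {
    ("r", "l"): "Many learners substitute /l/ for /r/; try curling the tongue tip back.",
}

def check_substitution(expected, scores):
    """scores: phoneme -> recognizer confidence for this speech segment.
    Return guidance text if a known substitution is detected, else None."""
    for (target, substitute), guidance in CONFUSION_PAIRS.items():
        if expected == target and scores.get(substitute, 0.0) > scores.get(target, 0.0):
            return guidance
    return None

print(check_substitution("r", {"r": 0.2, "l": 0.7}) is not None)  # -> True
```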
In an embodiment of the present invention, the feedback data consist of an index into the coaching information in the structured database. The index is determined by the degree of match between the user's acoustic pattern features and certain prestored acoustic pattern features in the database. With the current learning content known, the best match on features such as phonemes, syllables or syntactic structures can be used to judge whether the user's pronunciation and sentence structure are correct (or their degree of correctness). The guidance given to the user therefore feels closer to natural language learning.
In an embodiment of the present invention, the coaching information is organized hierarchically, comprising at least acoustic-level and linguistic-level guidance. The system can accordingly select coaching information according to the training level chosen by the user, the user's oral proficiency, and/or the complexity of the learning content. For example, for a beginner the system feeds back acoustic-level guidance, while for a more advanced learner it feeds back linguistic-level or semantic-level guidance. The user may also select the desired level of guidance.
In an embodiment of the present invention, the system's feedback includes a score. A computer-generated score can in principle lie in any numerical range, but in real language assessment human teachers are highly consistent when rating a speaker's spoken output as good or bad, or when giving a score from 1 to 10. Based on this observation, some embodiments add a mapping function that converts the score derived from the pattern-feature matching degree into the score output by the system. This function is trained on a group of training data (collected speech) for which human teachers' scores are known. It is used to convert the computer-generated score so that, within a given scoring range, the system's scores correlate with human teachers' scores at a level of 0.5, 0.6, 0.7, 0.8, 0.9 or 0.95 or above.
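The score-mapping step can be sketched as fitting a simple linear map from raw match scores to teacher scores and checking that the correlation meets the stated threshold. The data points below are invented; a real system would train on a substantial labelled corpus.

```python
# Sketch: calibrate raw machine match scores against human-teacher scores
# with ordinary least squares, then verify the Pearson correlation >= 0.5.

def fit_linear(xs, ys):
    """Least-squares slope and intercept mapping xs to ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return slope, my - slope * mx

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

machine = [0.2, 0.4, 0.5, 0.7, 0.9]   # raw pattern-match scores (invented)
teacher = [3.0, 4.5, 5.0, 7.5, 9.0]   # human scores on a 1-10 scale (invented)
a, b = fit_linear(machine, teacher)
r = pearson(machine, teacher)
print(r >= 0.5)  # -> True
```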
In a preferred embodiment of the invention, the spoken languages taught by the machine-assisted system include tonal languages, for example Chinese. Accordingly, the system's feedback data include fundamental-frequency trajectory data and corresponding graphical information.
The computer system for assisting spoken language learning of the present invention is adaptive and can learn from its users. In practice the system accumulates a large number of historical data records, for example users' speech data and the associated acoustic and/or linguistic feature vectors. Statistical analysis of these records can reveal features that occur frequently but find no close match in the database. In that case the database generates a new record, which in effect corresponds to a new common error type. Some embodiments therefore include a coding module that identifies, from the historical data, new features absent from the database and adds them to it. In some cases the module reclassifies existing features in the database, for example splitting an original fundamental-frequency range of 40 Hz to 100 Hz into two ranges of 40 Hz to 70 Hz and 70 Hz to 100 Hz. In a further preferred embodiment, the system provides an interface through which experts can verify the newly inferred features; the experts can then supply corresponding coaching information for the new features and add it to the database. The system can also ask users whether a new feature is strongly related to a particular error, and this information can be added to the database in text form. Before the database is updated, the system sends these "correction" data to other users who make the same mistake, in order to verify that the corresponding coaching information effectively helps most users correct the error.
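The adaptive mechanism just described can be sketched as counting unmatched features and promoting a frequently recurring one to a new database record awaiting expert guidance. The threshold and record format are assumptions.

```python
# Sketch: features that repeatedly fail to match any record are counted;
# once one recurs often enough it becomes a new record (guidance pending).
from collections import Counter

class AdaptiveDatabase:
    def __init__(self, threshold=3):
        self.records = {}            # feature -> guidance (None = awaiting expert)
        self.unmatched = Counter()   # frequency of unknown features
        self.threshold = threshold

    def match(self, feature):
        """Return guidance for a known feature, or track it as unknown."""
        if feature in self.records:
            return self.records[feature]
        self.unmatched[feature] += 1
        if self.unmatched[feature] >= self.threshold:
            # Frequent unknown pattern: create a new record for expert review.
            self.records[feature] = None
        return None

db = AdaptiveDatabase()
for _ in range(3):
    db.match("flat_tone3")
print("flat_tone3" in db.records)  # -> True
```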
The computer system for assisting spoken language learning of the present invention can also be used for assisted oral testing. In this case the feedback system can generate a test report that supplements or replaces the feedback to the user.
In summary, the present invention discloses an adaptive computer system for assisting spoken language learning. The system performs automatic speech recognition, extracts the user's relevant acoustic and linguistic pattern features, heuristically captures user errors (analyzing multiple aspects of spoken-language teaching) and provides teaching guidance. Its basic procedure is as follows: the user practises spoken pronunciation with an electronic device; the system analyzes the user's speech using speech recognition and speech/linguistic feature-analysis techniques and thereby generates acoustic and linguistic error features; the system then searches for the user's error features in a database of preset errors and corresponding guidance; and when a matching error feature is found, personalized error analysis and teaching guidance are fed back to the user in an appropriate intelligent manner. The system can also adjust itself automatically according to users' learning experience, thereby acquiring new knowledge or new personalized coaching content. It can teach short sentences in an interactive dialogue mode, and long sentences and paragraphs in a read-aloud mode.
In a further preferred embodiment, the computer system for assisting spoken language learning can provide objective, quantitative machine scoring as feedback. This scoring has been validated, and the system simultaneously provides assessment feedback on rich, concrete acoustic and linguistic learning points, as well as extensible, personalized teaching guidance delivered intelligently to help correct pronunciation errors or improve spoken skills. Although the score is generated by a computer, the scoring process has been verified against human teachers' scores, so the score is reliable. In addition, the invention facilitates the collection of new knowledge and can therefore be improved dynamically.
The present invention also provides computer program code for realizing the computer system for assisting spoken language learning described in any of the preceding claims. The program code implements the following functions: a user interface, comprising machine prompts that ask the user to complete a given spoken-language learning task and collecting the user's voice response data during that task; a database comprising characteristic data items, each containing a group of pattern features describing the acoustic and linguistic aspects of performance relevant to the user's spoken language learning; a speech analysis system for analyzing the collected voice response data and extracting acoustic or linguistic pattern features from it; a pattern matching system for matching one or more subsets of the extracted acoustic or linguistic pattern features against the pattern features in the database and generating feedback data according to the matching result; and a feedback system for returning the feedback data to the user to help the user master the learning content. The code may be supplied on a carrier such as a CD-ROM or DVD-ROM, or in programmable memory, for example firmware. The code (and/or data) of embodiments of the invention comprises source code, object code or executable code (in a computer programming language such as C or assembly language), code for setting up or controlling an application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA), or code in a hardware description language, such as
[hardware description language name shown as image in original: Figure A20081012320100131]
or the Very High Speed Integrated Circuit (VHSIC) hardware description language (VHDL).
Advantage of the present invention is:
1. The computer system for assisting spoken language learning of the present invention can provide feedback and assessment information to the learner on many acoustic and linguistic aspects. This information is statistically validated, rich and accurate, and gives users a full and complete picture of their overall performance.
2. The computer system of the present invention is highly interactive. Based on the user's input and intelligent feedback from the database, it can offer the user rich, extensible, personalized teaching guidance, including how to correct a mistake and sentence-making guidance tailored to the user, so that users can discover their weaknesses in time and practise in a targeted way, helping to correct pronunciation errors or improve spoken skills.
3. The computer system of the present invention is highly adaptive: it can acquire new knowledge (new spoken pattern features and coaching content), adjust itself automatically according to users' learning experience, thereby mastering new knowledge or new personalized coaching content, and improve continually over time. Compared with existing non-heuristic spoken-language learning computer systems, the invention is therefore more intelligent and more practical.
Description of drawings
The present invention is further described below with reference to the drawings and embodiments:
Fig. 1 is an overall functional block diagram of the present invention;
Fig. 2 is the detailed flowchart within dashed box I of Fig. 1;
Fig. 3 is the detailed flowchart within dashed box II of Fig. 1;
Fig. 4 is the detailed flowchart within dashed box III of Fig. 1;
Fig. 5 is a detailed structural view of the pattern characteristic data items of the database in module B of Fig. 1;
Fig. 6 is a schematic diagram of the interrelations of the parts of module B in Fig. 1;
Fig. 7 shows a left-to-right hidden Markov model (HMM) with three output states;
Fig. 8 shows the time-boundary information of a recognized sentence;
Fig. 9 shows a fundamental-frequency track feature used for tone learning, including the fundamental-frequency tracks of the four Mandarin tones.
Embodiment
Specifically, the invention discloses a computer system for assisting spoken language learning that applies speech recognition together with speech- and language-analysis techniques to generate pattern features relevant to spoken language learning, uses structured teaching pattern information (in particular error patterns), and intelligently feeds back rich teaching guidance. Possible acoustic and linguistic pattern features (learning errors, or multiple spoken expressions of the same meaning) can be collected from real foreign-language teaching cases. The invention applies machine-learning methods to analyze these patterns and derive a group of concise feature vectors reflecting different teaching directions. These feature vectors can be combined to give concrete or overall scores for different teaching content, for example the correctness of pronunciation, fluency or grammar usage. Statistically regressing these machine scores against human teachers' scores guarantees a high correlation between the two. In addition, the database classifies and stores these pattern features and associates each kind of feature with different specific teaching guidance; teaching guidance can therefore be regarded as a function of the pattern features of the user's speech.
When a language learner practises speech in front of the machine, the input audio is processed by the machine to generate acoustic and linguistic pattern features. The system then searches the database for one or more matching entries, finds the corresponding series of teaching-guidance content, synthesizes it into a complete guidance feedback, and outputs it as text or multimedia. The system can also return the guidance in spoken form using a speech synthesizer or recordings of human speech. When the system cannot match a suitable feature in the database, the unknown feature is fed back to a central database. Each time a similar unknown feature is found, the system counts and analyzes it and, when appropriate, adds it to the database as new knowledge. When a user tries to overcome a specific error type, the user may be asked to enter his or her study notes, and this information is likewise classified and added to the database as new knowledge.
Below, with reference to Fig. 1 through Fig. 6, each part of the computer system for assisting spoken language learning of the present invention, and in particular each functional module, is explained in detail; an embodiment with concrete language-learning content is then given.
Module 1 is the front-end processing module. This module performs signal processing on the input speech and extracts a series of raw feature vectors for subsequent speech recognition and analysis. These feature vectors are real-valued vectors. The features include, but are not limited to, the following:
Mel-frequency cepstral coefficients (MFCC)
Perceptual linear prediction (PLP) coefficients
Waveform energy
Waveform fundamental frequency
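As an illustrative sketch only (not part of the claimed invention), the short-time framing that underlies raw features of this kind can be expressed as follows. The 25 ms window / 10 ms hop values match the embodiment described later; the function names are invented for this sketch.

```python
def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping frames
    (e.g. a 25 ms window advanced in 10 ms steps)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def frame_energy(frame):
    """Short-time energy: sum of squared samples within one frame."""
    return sum(s * s for s in frame)
```

Each frame would then feed the MFCC/PLP, energy, and fundamental-frequency extractors listed above, yielding one feature vector per hop.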
Module 2 is the speech recognition module. This module recognizes the word sequence and the time boundary of each syllable from the input speech, and can also output a confidence score for each syllable. It uses all or part of the raw acoustic features output by Module 1. Recognition methods include, but are not limited to, the following:
Template matching: the standard acoustic template of each syllable is matched against the input features, and the best-matching template is output.
Probability-model-based methods: a probability model, for example a hidden Markov model (HMM), can be used to describe the probability of the raw feature vectors conditioned on a known word sequence, and/or the prior probability of the word sequence. Given the input acoustic features, the probability model outputs the word sequence with the maximum posterior probability. During recognition, a grammar-based word network or a statistical language model can be used to reduce the dimensionality of the search space. The recognition process automatically outputs the time boundary of each syllable. The confidence score is calculated by, but not limited to, the following methods:
- Posterior probability of syllables in the word confusion network. The recognizer can output recognition results for multiple hypotheses. The present invention computes the posterior probability of each syllable in a hypothesis, i.e. the likelihood of that syllable given all possible hypotheses. This posterior probability can be used directly, or after a suitable linear transformation, as the confidence score of the corresponding syllable.
- Comparison with a background-model likelihood. A background model is trained on a large amount of mixed speech data; unlike an ordinary language-specific model, it has no power to discriminate between words. It can be used to compute a probability value for the raw features of a recognized syllable. This value is then compared with the probability computed by the correct statistical model, and the comparison result, for example a percentage, is used as the basis of the confidence score.
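A minimal sketch of the posterior-based confidence idea above, assuming each competing hypothesis for a syllable carries a log-likelihood; the data layout and function name are invented for illustration.

```python
import math


def syllable_posteriors(hypotheses):
    """hypotheses: list of (syllable, log_likelihood) for competing hypotheses.
    The posterior of each syllable is its likelihood normalized over all
    competitors (the max is subtracted first for numerical stability)."""
    m = max(ll for _, ll in hypotheses)
    weights = {s: math.exp(ll - m) for s, ll in hypotheses}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}
```

In a real system the normalization runs over arcs of a word/phoneme lattice via the forward-backward algorithm, as described in the embodiment below; this sketch only shows the normalization step itself.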
Module A is an optional add-on module representing the case where the text corresponding to the user's input speech is known in advance. This information can replace the text output of Module 2, or accelerate the recognition in Module 2, and can be used directly in the subsequent pattern feature analysis. In general, this module is used for purely acoustic teaching, i.e. its output goes mainly to Module 3.
Module 3 is the acoustic pattern feature extraction module. It uses the outputs of Module 2 and Module 1 and, where applicable, the information from Module A, to produce a set of acoustic pattern features for teaching purposes. These features are quantitative and directly reflect the acoustic characteristics of the speech, for example pronunciation, tone, and fluency. They include, but are not limited to, the following:
The raw speech signal (waveform) of each syllable;
The raw acoustic features of each syllable, obtained from Module 1;
The duration of each syllable and/or phoneme (the smallest acoustic unit);
The average energy of each syllable;
The fundamental frequency (pitch) values of each syllable and/or sentence;
The confidence score of each syllable, phoneme, or sentence.
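A sketch of how such per-syllable features might be aggregated from frame-level values given the time boundaries produced by Module 2; the field names are invented for illustration.

```python
def syllable_features(frames_f0, frames_energy, start, end):
    """Aggregate frame-level f0 and energy values over a syllable whose
    time boundary covers frames [start, end)."""
    n = end - start
    return {
        "duration_frames": n,                              # syllable duration
        "mean_energy": sum(frames_energy[start:end]) / n,  # average energy
        "f0_track": frames_f0[start:end],                  # pitch trajectory
    }
```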
Module 4 is the linguistic pattern feature extraction module. It uses the output of Module 2 to produce a set of linguistic pattern features for teaching purposes. These features include, but are not limited to, the following:
The word sequence of the user's input;
The vocabulary used by the user;
The occurrence probabilities of grammar keywords;
Predefined grammar indices;
The semantic items of the input word sequence.
By matching the word sequence against a series of predefined finite-state grammar structures, the index of the best-matching syntactic structure can be returned. The grammar keyword features and the syntactic structure indices together constitute the user's grammatical pattern features. The semantic items are obtained by a semantic parser, which maps the word sequence to standardized semantic items.
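The grammar-index step above can be illustrated with a deliberately crude stand-in: instead of true finite-state matching, the sketch below scores each predefined template by keyword overlap and returns the index of the best match. All names and templates are invented.

```python
def match_grammar(words, grammars):
    """Return the index of the grammar template sharing the most words with
    the input word sequence (a simplified stand-in for finite-state
    grammar matching)."""
    best_idx, best_overlap = -1, -1
    wset = set(words)
    for i, template in enumerate(grammars):
        overlap = len(wset & set(template))
        if overlap > best_overlap:
            best_idx, best_overlap = i, overlap
    return best_idx
```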
Module 5 is the teaching pattern analysis module. It takes the acoustic pattern features output by Module 3 and the linguistic pattern features output by Module 4, and matches them against the teaching pattern and guidance content database, Module B.
Module B is a predefined teaching pattern and guidance content database. Each entry of the database has two parts: a teaching pattern feature and the corresponding guidance content. Teaching pattern features comprise the acoustic and linguistic pattern features described above; the structure of a feature entry is shown in Fig. 5. A feature can be a real-valued vector, a symbol, or an index value. The guidance content is the explanatory information corresponding to the teaching pattern; it can be text, pictures, speech or video samples, or any other form the machine can exchange with the user. Building this database requires collecting in advance sufficient speech data, the corresponding text content, human teachers' scores, and human teachers' guidance content. Features are then extracted from the training data and classified according to the different guidance contents. For a given teaching content, each pattern corresponding to an expression, or to an error, is associated with teaching-guidance content in the database. This structure is illustrated in Fig. 6.
Matching is based on a generalized distance between the input features and the reference features in the database. This distance is calculated by, but not limited to, the following methods:
For real-valued features, the features are first normalized so that their value range lies in [0, 1]; the Euclidean distance is then computed.
A probability model and the corresponding likelihood value can be used instead of the Euclidean distance.
For index-valued features, 1 is returned if an index with the same value exists in the database, otherwise 0.
For symbolic features, for example word sequences, the Hamming distance is computed.
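The three distance flavors above can be sketched directly, under the assumption that the value range of each real feature is known; the function names are invented.

```python
import math


def real_distance(x, y, lo, hi):
    """Normalize each real-valued feature into [0, 1] by its value range
    [lo, hi], then take the Euclidean distance."""
    nx = [(a - l) / (h - l) for a, l, h in zip(x, lo, hi)]
    ny = [(b - l) / (h - l) for b, l, h in zip(y, lo, hi)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(nx, ny)))


def index_match(a, b):
    """Index-valued features match exactly (1) or not at all (0)."""
    return 1 if a == b else 0


def hamming(seq_a, seq_b):
    """Hamming distance between equal-length symbol sequences
    (e.g. word sequences)."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)
```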
After the above search, guidance information is extracted from the database: depending on the teaching pattern, either the record with the smallest matching distance is selected, or multiple records are selected according to the ranked matching results. The guidance information includes error-correction guidance or teaching suggestions, and can be text, speech, or other multimedia. For tonal languages such as Chinese, the guidance for tone teaching can be the fundamental-frequency calibration figure described above.
In addition to guidance information, the outputs of Module 3 and Module 4 can be used to compute quantitative scores, including scores for individual teaching aspects and an overall score for general performance. In general, a score has a linear or nonlinear relationship with the generalized distance between the input features and the reference template features in the database. Scores include, but are not limited to, the following:
Pronunciation score for a sentence, syllable, or phoneme, computed from the confidence, duration, and energy values.
Tone score for a syllable or phoneme, computed from the fundamental frequency values.
Fluency score, computed from the confidence scores and fundamental frequency values.
Pass rate, computed as the proportion of words receiving high pronunciation and tone/fluency scores among the whole practice vocabulary.
Skill level, computed as a linear weighting of all the above scores.
The raw scores above are further transformed by a linear or nonlinear mapping so as to conform to the grading standards of human teachers. This mapping is trained by statistical methods on a large number of language-learning samples scored by both human teachers and the computer. The scores are shown to the user in numeric or graphic form, using statistical charts such as comparison tables, histograms, pie charts, and bar charts.
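One common nonlinear mapping of this kind is a sigmoid onto the teacher's grading scale; the sketch below assumes the slope and offset have already been fit by regression against human scores (the parameter values used here are placeholders, not trained values).

```python
import math


def map_to_teacher_scale(raw, a, b, lo=0.0, hi=100.0):
    """Sigmoid mapping of a raw machine score onto a teacher grading scale
    [lo, hi]. In practice a and b are trained so machine scores correlate
    highly with human teachers' scores; here they are placeholders."""
    return lo + (hi - lo) / (1.0 + math.exp(-(a * raw + b)))
```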
In summary, the output of Module 5 comprises the guidance information and the quantitative scores described above.
Module 6 is the feedback generation module. It synthesizes the guidance information and quantitative scores output by Module 5 into organized, fluent, and comprehensive teaching-guidance content. The final guidance content comprises text-based instruction and multimedia samples, including both overall instruction and specific instruction targeted at individual acoustic or linguistic learning points. In addition, the quantitative scores produced by Module 5 are visualized as histograms or other graphics to present the learning outcome.
Module 7 is an optional text-to-speech module. The text-based guidance output by Module 6 can be converted to speech by speech synthesis, or replaced by pre-recorded human pronunciations.
Module 8 is the adaptation module. It updates the learning pattern and guidance information database (Module B), the speech recognition module (Module 2), and the acoustic analysis module (Module 3).
Module B can be updated as follows. First, the module organizes the possible feedback information according to the current user's learning needs. It keeps statistics on the user's production patterns (in particular, error patterns) and stores them in the database; these statistics mainly comprise counts of the user's pattern features and indices to the corresponding analysis results. The next time the same user studies, he or she can recover the study history, or recognize progress, by comparing the current statistics with the historical statistics in the database. These statistics can also be used to design personalized learning materials, for example personalized exercise courses or advanced reading materials. The statistics are displayed in numeric or graphic form.
Second, the adaptation module adjusts the database to accommodate new situations. When a new pattern feature is found, it is fed back over the network to a central database, for example on a server on the Internet. The central database counts these new features and, once a sufficient number has accumulated, classifies them. When a new class appears, for example a new kind of learning error, the database is updated accordingly, and the new data can be reused by all users. In addition, when a user makes progress, the system may ask the user to enter his or her learning tips; this information is also returned to the system and added to the database. In this way the adaptation module maintains a dynamic database with rich and personalized content.
Modules 2 and 3 are updated by updating their model parameters. The updatable parameters of Module 2 include, but are not limited to, the parameters of the hidden Markov models and the prior probabilities of the language model; maximum likelihood estimation can be used for the update. The updatable parameters of Module 3 include, but are not limited to, the permitted frequency range of the fundamental frequency and the permitted range of the average energy.
A specific embodiment of the present invention, an English learning system, is described in detail below. In this embodiment, the learners are native speakers of Chinese, the learning domain is travel information, and the course model is sentence-level dialogue. The whole system runs on a personal computer (PC) connected to the Internet, with a microphone and headphones as the user's input/output devices.
First, the user interface prompts the user with a sentence in Chinese (for example, "You want a restaurant with expensive prices."), and asks the user to express the same meaning with a sentence in English. After the user speaks the English sentence to the computer, the computer analyzes various acoustic and linguistic features and produces a rich assessment report with suggestions for improvement. The core of the computer system for assisting spoken language learning of the present invention is therefore the speech analysis system and the feedback system. With further reference to Fig. 1 through Fig. 9, the functioning of this embodiment is described step by step:
First, Module 1 performs front-end processing (raw feature extraction). The user's input to the computer is first converted into a digital speech waveform in WAV format. The waveform data is divided into a series of overlapping segments: each segment is 25 milliseconds long, and adjacent segments are offset by 10 milliseconds. A raw acoustic feature vector is extracted from each segment, i.e. one feature vector every 10 milliseconds; such a segment is called a "frame". During feature extraction, the speech signal in each frame is first subjected to a short-time Fourier transform to obtain its spectral information; perceptual linear prediction (PLP) features, energy, and the fundamental frequency (f0) are then extracted. To suppress the fundamental-frequency doubling problem in signal processing, the present invention applies a moving Gaussian smoothing window to the raw f0 values. For PLP feature extraction, see [H. Hermansky, N. Morgan, A. Bayya, and P. Kohn. RASTA-PLP speech analysis technique. In Proc. ICASSP, 1992]; for f0 extraction, see [A. de Cheveigné and H. Kawahara. YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111(4), 2002]. The energy value is the sum of squares of all samples in the segment.
The PLP and energy features are input to a statistical speech recognition module (Module 2), which produces:
1. the statistically most likely word sequence and phoneme sequence;
2. the N-best alternative word/phoneme sequences, in lattice form;
3. the acoustic-model probability and language-model probability of each word/phoneme;
4. the time boundaries of each word and phoneme.
The statistical speech recognition system comprises an acoustic model, a language model, and a dictionary. The dictionary gives the mapping from syllables/phonemes to words.
The present invention uses a multi-pronunciation dictionary that covers non-native pronunciation variants. The language model is a trigram model giving the prior probabilities of single words, word pairs, and word triples. The acoustic model is a continuous-density hidden Markov model (HMM) describing the probability distribution of the feature (observation) vectors conditioned on a given phoneme.
Fig. 7 shows a left-to-right hidden Markov model. As shown in Fig. 7, we adopt state-clustered cross-word triphone hidden Markov models. The state output probabilities are Gaussian mixture models over the PLP feature vectors (including statics and first- and second-order derivatives). The search uses the token-passing algorithm, keeping multiple tokens during the search to obtain multiple alternative syllable/phoneme sequences. The recognition results are given in the HTK lattice format; for technical details see [S. J. Young, D. Kershaw, J. J. Odell, D. Ollason, V. Valtchev, and P. C. Woodland. (for HTK version 3.0). Cambridge University Engineering Department, 2000]. The Viterbi algorithm also identifies the time boundaries of the syllables/phonemes, and this information is used for the subsequent analysis shown in Fig. 8.
For some learning tasks, the text content of the user's utterance is known in advance, i.e. Module A is present and contains the known content of the user's speech. In this case the recognition module can be simplified, so the corresponding recognizer runs faster: the search space of Module 2 is greatly reduced, and it produces only the time boundary information for the given text and a small number of recognition results similar to the user's pronunciation in Module A. After speech recognition, the corresponding text (possibly taken directly from Module A) and the acoustic information are input to Module 3 and Module 4 for acoustic and linguistic pattern analysis, respectively. Module 3 compiles or extracts the following acoustic pattern features relevant to the teaching content:
1. the duration of each syllable/phoneme;
2. the energy of each syllable/phoneme;
3. the fundamental frequency value of each syllable/phoneme and its time trajectory;
4. the confidence score of each syllable/phoneme;
5. the multiple recognized phoneme sequences.
The durations of syllables/phonemes are output by Module 2. The energy of a syllable is computed as the average energy of the frames in the syllable:
E_w = (1/T) Σ_{t=1}^{T} E_t    (1)
where E_w is the energy of the syllable and E_t is the energy of each frame obtained from Module 1. The phoneme energies and the fundamental frequency values of syllables/phonemes are computed in a similar way. The time trajectory of the fundamental frequency is the sequence of f0 values corresponding to a syllable/phoneme; we use a dynamic time warping algorithm to normalize this sequence to a standard length. The confidence score of a syllable is computed from the word lattice output by the recognizer. Given the acoustic probability and language probability of each phoneme (or word) arc, the posterior probability of each arc can be computed with the forward-backward algorithm. The raw phoneme (word) lattice can thereby be converted into a phoneme (word) confusion network, in which phonemes (words) with similar time boundaries and similar content are merged. The posterior probability of each phoneme (word) is then updated and used as its confidence score; for technical details see [G. Evermann and P. C. Woodland. Posterior probability decoding, confidence estimation and system combination. In Proc. of the NIST Speech Transcription Workshop, 2000]. The phoneme sequence finally output is the one with maximum probability among all possible sequences.
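The length normalization of an f0 trajectory can be sketched with simple linear interpolation; this is a simplified stand-in for the dynamic-time-warping normalization described above, and the function name is invented.

```python
def normalize_length(track, target_len):
    """Stretch or compress an f0 trajectory to a fixed length by linear
    interpolation (a simplified stand-in for DTW-based normalization)."""
    if len(track) == 1:
        return [track[0]] * target_len
    out = []
    for i in range(target_len):
        pos = i * (len(track) - 1) / (target_len - 1)
        j = int(pos)
        frac = pos - j
        if j + 1 < len(track):
            out.append(track[j] * (1 - frac) + track[j + 1] * frac)
        else:
            out.append(track[j])
    return out
```

Once all trajectories share a common length, point-wise distances such as Eq. (3) below become directly computable.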
Module 4 extracts the linguistic features of the user's input speech, including:
1. the best word sequence;
2. the user's vocabulary;
3. the occurrence probabilities of grammar keywords;
4. predefined grammar indices;
5. the semantic interpretation of the utterance.
The best word sequence is the output of Module 2. The vocabulary comprises the distinct words used by the user.
The grammar keywords are a predefined list of words, retrieved in the real system through a hash table. The occurrence probability of a grammar keyword comes from the confidence score of the recognized keyword.
The grammar keywords and the predefined grammar index values together represent the user's grammatical pattern features.
The semantic interpretation of a sentence is generated by parsing the word sequence with a set of grammars. Each word is first tagged as a noun, verb, etc., and the sequence is then compared with each predefined syntactic structure, for example "please take [noun/phrase] to [noun/phrase]". The predefined syntactic structures are not necessarily only correct grammar; the grammar list also includes many frequent erroneous syntactic structures and different syntactic structures expressing the same meaning. The extraction algorithm is similar to the semantic extraction algorithm, with ordinary semantic items replaced by syntactic structures/terms.
The system adopts robust semantic parsing to understand the user's input. Here we use a phrase-template-based method; for algorithm details see [S. Seneff. Robust parsing for spoken language systems. In Proc. ICASSP, 2000]. The semantic decoding result is output in the form "request(type=bar, food=Chinese, drink=beer)".
After the teaching-related acoustic and linguistic pattern features have been generated, the system matches them against the predefined pattern features and teaching guidance in Module B, the learning pattern and guidance information database. Because this part is closely tied to the intelligent feedback, we first describe how this database is constructed. For a given language-learning content, the database contains a large number of paired "pattern-guidance" entries. The acoustic pattern set includes the following continuous features:
1. the mean and variance of syllable/phoneme durations for correct pronunciation (by native speakers, and by Chinese speakers with excellent pronunciation);
2. the mean and variance of syllable/phoneme durations for pronunciations at five skill levels (from good to poor).
The energies and fundamental frequency values of syllables/phonemes are stored following the same pattern.
For the fundamental frequency time trajectories, the normalized trajectory of each phoneme and syllable is stored in the database. The duration of a normalized trajectory is the mean duration of that syllable/phoneme, called the "normalized duration". The f0 trajectories of all training data are warped to the normalized duration with the dynamic time warping algorithm. From each trajectory the average f0 value is subtracted, so that its baseline is always zero. At each time point, the average f0 value of the training data is used as the normalized reference value. Notably, we use three normalized f0 trajectories, corresponding to good, medium, and poor pronunciation.
For the confidence scores, the mean values for good, medium, and poor speakers are stored in the database. To cover the phoneme combinations of correct syllables and of various typical error types, the database holds multiple phoneme/syllable-to-word mappings. For example, "thanks" has two different phoneme combinations: one is correct, and the other is a typical erroneous combination.
For the linguistic pattern features, the high-frequency words and phrases of the specific teaching content are stored in the database, along with the complete vocabulary, the grammar keywords, the semantic interpretations, and so on. Common erroneous vocabulary and grammar keywords are also stored separately in the database.
In summary, the teaching-related acoustic and linguistic patterns are trained from data collected in advance and stored in the database; in the statistical sense, these patterns represent the many possible patterns (different expressions or specific errors). For a given teaching content, each pattern corresponding to an expression or to an error is associated with teaching-guidance content in the database. This guidance is collected from human teachers and exists in text and multimedia form. For example, the database contains text-based and speech-based guidance describing how to distinguish the correct pronunciation of "thanks" from a typical erroneous pronunciation.
Module 5 matches the user's pattern features obtained by acoustic and linguistic analysis (the outputs of the acoustic analysis module 3 and the linguistic analysis module 4) against the pattern features stored in the database. According to the matching results, Module 5 computes and outputs objective scores and teaching-guidance content. The distances between the feature values input from Modules 3/4 and the database entries are defined as follows:
1. Matching of syllable/phoneme durations uses the Mahalanobis distance between the duration of the user's speech and the reference value:
Δ_d = (d − μ_d)² / σ_d    (2)
where Δ_d is the distance between the duration d of the user's speech and the reference value in the database, μ_d is the mean duration of the specific phoneme or syllable at a given skill level, and σ_d is the corresponding variance.
2. Matching of the syllable/phoneme energy value Δ_e and fundamental frequency value Δ_p uses the same method as formula (2).
3. Matching of the fundamental frequency time trajectory. The f0 trajectory of the user's speech is first normalized, and its distance to the reference value in the database is then computed as:
Δ_trj = (1/T) Σ_{t=1}^{T} (f(t) − μ_f(t))²    (3)
where Δ_trj is the distance between the f0 trajectory of the user's speech and the reference value in the database, T is the normalized duration, f(t) is the normalized f0 value of the user's speech, and μ_f(t) is the normalized reference f0 value in the database.
4. Matching of symbol sequences (phonemes, words, or semantic items). The user's symbol sequence is first aligned with the reference sequence in the database; the distance between the two is the sum of the substitution, deletion, and insertion errors. The alignment is computed by a dynamic programming algorithm.
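The substitution/deletion/insertion distance described in item 4 is the classic edit distance; a standard dynamic-programming sketch (the function name is ours):

```python
def edit_distance(ref, hyp):
    """Minimum number of substitution + deletion + insertion errors needed
    to turn the reference symbol sequence into the user's sequence."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all remaining reference symbols
    for j in range(n + 1):
        d[0][j] = j          # insert all remaining hypothesis symbols
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match or substitution
    return d[m][n]
```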
Once the above distances to the correct acoustic patterns in the database have been obtained, the user's pronunciation can be scored objectively at the phoneme, syllable, or sentence level. The phone-level score is defined as follows:
Δ_phn = −(1/2) log(w_1 Δ_d + w_2 Δ_e + w_3 Δ_p)    (4)
S_phn = w_4 / (1 + exp(α Δ_phn + β)) + w_5 C_phn    (5)
where w_1 + w_2 + w_3 = 1 and w_4 + w_5 = 1, with all weights positive (for example 0.1, 0.5, etc.), C_phn is the confidence score of the phoneme, and α and β are parameters of the score function. The syllable-level score S_wrd is defined similarly. The sentence-level score is defined as the average of the syllable-level scores:
S_sent = (1/N_wrd) Σ_wrd S_wrd    (6)
where N_wrd is the number of syllables in the sentence. Notably, the parameters α and β used in the phoneme and syllable score calculation are obtained by prior training, so as to guarantee a high correlation between the machine scores and human teachers' scores.
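Equations (4)-(6) can be transcribed directly; the default weights and the α, β values below are illustrative placeholders, since in the system they are trained against human scores.

```python
import math


def phone_score(d_dur, d_energy, d_pitch, c_phn,
                w=(0.4, 0.3, 0.3), w4=0.5, w5=0.5, alpha=-1.0, beta=0.0):
    """Phone-level score per Eqs. (4)-(5): a log-compressed weighted distance
    squashed by a sigmoid, blended with the phone confidence c_phn.
    Weight and alpha/beta values here are placeholders, not trained values."""
    delta = -0.5 * math.log(w[0] * d_dur + w[1] * d_energy + w[2] * d_pitch)
    return w4 / (1.0 + math.exp(alpha * delta + beta)) + w5 * c_phn


def sentence_score(syllable_scores):
    """Sentence-level score per Eq. (6): mean of the syllable-level scores."""
    return sum(syllable_scores) / len(syllable_scores)
```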
The linguistic score is computed from the error rates of words and semantic items. Given the user's word-error count Θ_wrd and semantic-item error count Θ_sem (the distances to the corresponding reference values in the database), the linguistic score is computed as:
S_ling = 1 − (w_1 Θ_wrd / N_wrd + w_2 Θ_sem / N_sem)    (7)
where w_1 + w_2 = 1 and both are positive (for example 0.1 or 0.2), N_wrd is the number of words in the correct word sequence in the database, and N_sem is the number of correct semantic items in the database.
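Equation (7) transcribed directly; the equal weights are a placeholder choice.

```python
def linguistic_score(word_errors, n_words, sem_errors, n_sems, w1=0.5, w2=0.5):
    """Linguistic score per Eq. (7): one minus the weighted word- and
    semantic-item error rates (w1 + w2 = 1; the values here are placeholders)."""
    return 1.0 - (w1 * word_errors / n_words + w2 * sem_errors / n_sems)
```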
Besides the objective scores, the system generates teaching-guidance information for correcting errors and/or improving spoken proficiency. It does so by matching the user's features against the specific error pattern features, or other spoken pattern features, prestored in the database. On the acoustic side, the system provides the following personalized guidance:
1. Phoneme pronunciation errors. Using the distance between the phoneme sequence of each syllable of the user's speech and the sequences in the database, the closest phoneme sequence in the database is found. If the resulting phoneme sequence is a typical error, the corresponding teaching-guidance information is selected.
2. Intonation analysis. The fundamental frequency time trajectory provides the tone information of syllables, phonemes, and words. Given the distance of the user's f0 trajectory, typical intonation errors can be found and the corresponding guidance provided.
On the linguistic side, the following personalized guidance is produced:
1. Vocabulary guidance. The system keeps counts of the user's words (over repeated practice of the same content). If a word is recorded in the database as one that should have a high usage frequency but the user has used it only rarely, the system provides guidance encouraging the user to use that word.
2. Grammar correction. If the matched grammar index corresponds to a predefined erroneous grammar, the system provides the corresponding guidance. If it corresponds to a predefined correct grammar, the system offers an answer expressing the same meaning with an alternative syntactic structure.
3. Grammar keyword guidance. For a given learning content, the correct grammar keywords are known in advance. Based on the occurrence probabilities of the grammar keywords in the user's speech, the system provides guidance on keywords that are omitted or underused.
4. Semantic guidance. If the matched semantic sequence is found to be incorrect, the cause of the semantic misunderstanding is explained.
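The nearest-pattern lookup behind guidance item 1 above can be sketched as follows; the database entries and guidance strings are invented examples, and a production system would use the proper alignment distance rather than this positional comparison.

```python
def find_guidance(user_phones, pattern_db):
    """Return the guidance text of the stored phone sequence closest to the
    user's sequence. pattern_db maps a tuple of phones to a tutoring message;
    the distance is a crude positional mismatch count plus length difference."""
    best, best_dist = None, None
    for pattern, guidance in pattern_db.items():
        dist = (sum(1 for a, b in zip(user_phones, pattern) if a != b)
                + abs(len(user_phones) - len(pattern)))
        if best_dist is None or dist < best_dist:
            best, best_dist = guidance, dist
    return best
```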
Module 5 thus provides the various scores and guidance information. The feedback generation module, Module 6, synthesizes them and outputs a detailed score report and complete guidance.
The syllable/phoneme and word scores are shown as histograms, and the overall score as a pie chart. A tone comparison figure is also given, containing the f0 curve of the correct pronunciation and that of the user's pronunciation (tone figures are shown only for the problematic syllables). The guidance information is divided into "vocabulary usage", "grammar evaluation", and "intelligibility" sections. Within these sections, the machine organizes the feedback using common sentence frames, for example "In addition, you can use ... to express the same meaning."
Module 7 is carried out the conversion of Text To Speech.Here, we use a voice operation demonstrator based on hidden Markov model.Long or instruct when comprising content of multimedia when the content of text of some teaching-guiding item, system does not carry out this module.
Module 8 maintains a dynamic database (i.e. it updates module B) so that its content stays rich and personalized. If no matching content is found in the database, we give generic guidance telling the user to improve further, for example "Your pronunciation is still some distance from the standard pronunciation; please change your study level." At the same time, the system stores this pattern together with the raw speech data. When the program ends, the saved data are sent to a server over the Internet. The central server keeps count statistics on these patterns, and once a pattern has accumulated a certain number of occurrences, the central server groups it. When a new group of patterns emerges, a human teacher analyzes it and supplies the corresponding guidance, and the server adds the new pattern, for example "a new error pattern", to the database, where it becomes available to other users. In addition, when a user makes progress, the system may ask the user to enter study notes and experience; this information is likewise sent back to the central server and added to the database.
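The server-side accumulation step described above can be sketched as follows. The class name, threshold and review-queue mechanism are illustrative assumptions; the patent does not specify them:

```python
from collections import defaultdict

class PatternStore:
    """Count unmatched patterns reported by clients; once a pattern
    reaches the threshold, queue it for grouping and review by a
    human teacher, after which it can be added to the database."""
    def __init__(self, threshold=50):
        self.threshold = threshold
        self.counts = defaultdict(int)
        self.review_queue = []

    def report(self, pattern_key, audio_ref):
        # one client observation of an unmatched pattern, with its audio
        self.counts[pattern_key] += 1
        if self.counts[pattern_key] == self.threshold:
            self.review_queue.append((pattern_key, audio_ref))
```

Thresholding before human review keeps the teacher's workload proportional to recurring patterns rather than to one-off recognition noise.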
Besides adaptive content updates, the user's raw speech data are also used to update the hidden Markov models (HMMs) used for speech recognition, which in turn updates the parameters of recognition module 2 and acoustic analysis module 3. Here we use the maximum likelihood linear regression (MLLR) algorithm to update the means and variances of the Gaussian mixture components of each HMM; see [C. J. Leggetter and P. C. Woodland, "Speaker adaptation of continuous density HMMs using multivariate linear regression", ICSLP, pages 451-454, 1994]. The updated models recognize the specific user's speech better.
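In MLLR, each Gaussian mean vector mu is replaced by mu_hat = A @ mu + b, where the transform W = [b | A] is estimated to maximize the likelihood of the adaptation data. The sketch below only applies a given transform; estimating W itself (the row-wise closed-form solution in the cited paper) is omitted:

```python
import numpy as np

def mllr_adapt_means(means, W):
    """Apply an MLLR transform W = [b | A] to a matrix of Gaussian
    mean vectors (one per row): mu_hat = A @ mu + b, computed as
    W @ [1, mu] for each extended mean vector."""
    ext = np.hstack([np.ones((means.shape[0], 1)), means])  # rows [1, mu]
    return ext @ W.T
```

With W = [0 | I] the means are unchanged; a non-trivial estimated W shifts all means of the speaker-independent HMM set toward the specific user's speech.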
In addition, the learning-pattern and teaching-guidance database, module B, also stores statistics of the user's production patterns (particularly error patterns). These statistics mainly comprise occurrence counts of each production pattern and indices to the corresponding analysis results. The next time the same user starts learning, he or she can retrieve the study history, or see progress by comparing the current analysis results with the historical records in the database. The statistics are also used to design personalized teaching material, such as individualized exercise courses or advanced reading material, and can be displayed in numerical or graphical form.
The system can be implemented in a similar way to assist the teaching of other languages. For tone-based languages such as Chinese, an additional feature of the system is tone training guided by the fundamental-frequency alignment comparison figure shown in Figure 9.
In Figure 9, the reference fundamental-frequency values are drawn as a solid line, representing the pitch of the corresponding phonemes or syllables. The fundamental-frequency time trajectory of the learner's pronunciation is drawn as a dashed line, aligned with the reference curve. The learner can see at a glance whether his or her pronunciation is correct. Because this tone-correction exercise is visual, it is very helpful for improving the learner's command of tones. The curves, figures, colors and other graphical attributes may vary.
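The alignment behind such a comparison figure can be sketched as resampling both F0 trajectories onto a common time axis. Simple linear time normalization is used here as a stand-in; a real system would more likely align on the phone boundaries produced by the recognizer:

```python
import numpy as np

def align_pitch_curves(ref_f0, usr_f0, n_points=100):
    """Resample the reference and learner F0 trajectories onto a
    shared time axis so they can be overlaid as the solid and
    dashed curves of the comparison figure."""
    t = np.linspace(0.0, 1.0, n_points)
    ref = np.interp(t, np.linspace(0.0, 1.0, len(ref_f0)), ref_f0)
    usr = np.interp(t, np.linspace(0.0, 1.0, len(usr_f0)), usr_f0)
    return ref, usr
```

Once resampled, the two curves can be plotted directly, and the pointwise difference gives a rough per-frame tone deviation.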
It should be noted that the present invention is not limited to the embodiments described above; a person skilled in the art can modify them, according to the preceding technical and architectural description, to form other embodiments.

Claims (18)

1. A computer system for assisting spoken language learning, comprising the following components:
a user interface, comprising prompts from the machine asking the user to complete a given item of spoken language learning content, and collecting the user's voice response data during this process;
a database comprising feature data items, each containing a set of pattern features that describe the acoustic and linguistic aspects of the user's spoken language learning, the quantified pattern features being associated with feedback guidance information and with specific learning content;
a speech analysis system for analyzing the collected voice response data and extracting acoustic pattern features or linguistic pattern features from it;
a pattern matching system for matching one or more subsets of the acoustic or linguistic pattern features extracted from the voice response data against the pattern features in the database, and generating feedback data according to the matching result;
a feedback system for feeding the feedback data back to the user, thereby helping the user master the spoken language learning content.
2. The computer system for assisting spoken language learning of claim 1, wherein the database stores groups of associated data items, each group comprising a feature data item, a guidance information data item and a learning content data item; the pattern features in the feature data item comprise acoustic or linguistic pattern features that the user may exhibit during spoken language learning; the guidance information data item comprises guidance messages in one-to-one correspondence with those pattern features, for guiding the user to improve, or to correct errors or deficiencies in spoken pronunciation found through pattern matching, the feedback data comprising such guidance messages; and the learning content data item identifies a specific language learning objective within the spoken language learning content and associates the particular content of that objective with the group of associated data items.
3. The computer system for assisting spoken language learning of claim 1 or 2, wherein the speech analysis system comprises an acoustic pattern analysis system that recognizes one or more phonemes, words and sentences from the collected voice response data and provides corresponding confidence data; the acoustic pattern features comprise these phonemes, words and sentences together with their confidence data.
4. The computer system for assisting spoken language learning of claim 3, wherein the acoustic pattern analysis system also recognizes prosodic features from the obtained voice response data; the prosodic features comprise the fundamental frequency, duration and energy of a given segment of speech; and the acoustic pattern features comprise these prosodic features.
5. The computer system for assisting spoken language learning of claim 3, wherein the speech analysis system further comprises a linguistic pattern analysis system that matches syntactic structures in the collected voice response data against one or more types of syntactic structure in the database; the linguistic pattern features comprise the user's grammatical pattern features.
6. The computer system for assisting spoken language learning of claim 5, wherein the one or more types of syntactic structure include error types of syntactic structure at the phoneme, word and sentence levels.
7. The computer system for assisting spoken language learning of claim 5, wherein the linguistic pattern analysis system can also recognize one or more keywords in the user's speech, and the acoustic pattern analysis system provides confidence data for the recognized keywords.
8. The computer system for assisting spoken language learning of claim 1 or 2, wherein the speech analysis system comprises a speech recognition system that includes both an acoustic model, describing the acoustic features of particular words or acoustic units and the mapping from acoustic units to words, and a language model, describing the prior statistics of word sequences.
9. The computer system for assisting spoken language learning of claim 8, wherein the speech recognition system is used to recognize phonemes or words together with their time boundaries; the pattern features in the feature data item further comprise features of the user's speech at the phoneme or word time boundaries.
10. The computer system for assisting spoken language learning of claim 2, wherein the feedback data comprise an index value used to select guidance information in response to the pattern matching result and the specific language learning content; the guidance information contains detailed guidance data for correcting spoken language errors found through pattern matching or for offering better ways of oral expression.
11. The computer system for assisting spoken language learning of claim 2, wherein the guidance information is organized hierarchically, comprising at least an acoustic level and a linguistic level; and the feedback system can select among these levels according to the corresponding specific spoken language learning content, or to the proficiency level predefined by the user, or to both.
12. The computer system for assisting spoken language learning of claim 1 or 2, wherein the spoken language prompted in the spoken language learning content comprises a tone-based language; and the feedback data comprise fundamental-frequency time-trajectory data and corresponding graphical information.
13. The computer system for assisting spoken language learning of claim 12, wherein the feedback data comprise tone feedback given to the user, the tone feedback comprising the fundamental-frequency time trajectory of a phoneme, word or sentence in the user's pronunciation, displayed graphically, and the feedback data further comprise a plot of the fundamental-frequency time trajectory of the corresponding standard pronunciation.
14. The computer system for assisting spoken language learning of claim 1 or 2, further comprising historical data storage for different users; the historical data storage comprises known acoustic pattern features and/or known linguistic pattern features, together with the system's evaluation of the user, and also features recognized by the system that do not yet exist in the database, which are treated as new features and added to the structured database accordingly.
15. The computer system for assisting spoken language learning of claim 14, further comprising a system module for adding feedback data corresponding to a new feature to the structured database, the new feedback data comprising teaching guidance information obtained by consulting experts or by putting questions to users, for example asking a user how he or she overcame the error corresponding to the new feature.
16. The computer system for assisting spoken language learning of claim 1 or 2, wherein the feedback given to the user comprises a score; the score is obtained by a mapping function that transforms the numerical result of the pattern matching; and the mapping function makes the score produced by the computer system correlate with the scores given by human teachers.
17. The computer system for assisting spoken language learning of claim 1 or 2, wherein the computer system can test the spoken language either within or outside the assisted learning process, and the feedback system generates a test report when giving feedback to the user or separately.
18. A carrier of computer program code for implementing the computer system for assisting spoken language learning of any preceding claim, the computer program code implementing the following functions:
a user interface, comprising prompts from the machine asking the user to complete a given item of spoken language learning content, and collecting the user's voice response data during this process;
a database comprising feature data items, each containing a set of pattern features that describe the acoustic and linguistic aspects of the user's spoken language learning, the quantified pattern features being associated with feedback guidance information and with specific learning content;
a speech analysis system for analyzing the collected voice response data and extracting acoustic pattern features or linguistic pattern features from it;
a pattern matching system for matching one or more subsets of the acoustic or linguistic pattern features extracted from the voice response data against the pattern features in the database, and generating feedback data according to the matching result;
a feedback system for feeding the feedback data back to the user, thereby helping the user master the spoken language learning content.
CNA2008101232015A 2008-06-11 2008-06-11 Computer system for assisting spoken language learning Pending CN101551947A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008101232015A CN101551947A (en) 2008-06-11 2008-06-11 Computer system for assisting spoken language learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008101232015A CN101551947A (en) 2008-06-11 2008-06-11 Computer system for assisting spoken language learning

Publications (1)

Publication Number Publication Date
CN101551947A true CN101551947A (en) 2009-10-07

Family

ID=41156171

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008101232015A Pending CN101551947A (en) 2008-06-11 2008-06-11 Computer system for assisting spoken language learning

Country Status (1)

Country Link
CN (1) CN101551947A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1750121A (en) * 2004-09-16 2006-03-22 北京中科信利技术有限公司 A Pronunciation Evaluation Method Based on Speech Recognition and Speech Analysis
CN1811915A (en) * 2005-01-28 2006-08-02 中国科学院计算技术研究所 Estimating and detecting method and system for telephone continuous speech recognition system performance
WO2007015869A2 (en) * 2005-07-20 2007-02-08 Ordinate Corporation Spoken language proficiency assessment by computer
CN1952995A (en) * 2005-10-18 2007-04-25 说宝堂信息科技(上海)有限公司 Intelligent interaction language exercise device and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Hao: "Speech Command Recognition Technology Based on the GSM EFR Coder", Master's Thesis, North China Institute of Technology *

Cited By (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102237086A (en) * 2010-04-28 2011-11-09 三星电子株式会社 Compensation device and method for voice recognition equipment
CN101916326A (en) * 2010-07-16 2010-12-15 华东师范大学 Communication assistive device in communication system
CN103151042A (en) * 2013-01-23 2013-06-12 中国科学院深圳先进技术研究院 Full-automatic oral language evaluating management and scoring system and scoring method thereof
CN103151042B (en) * 2013-01-23 2016-02-24 中国科学院深圳先进技术研究院 Full-automatic oral evaluation management and points-scoring system and methods of marking thereof
WO2015062465A1 (en) * 2013-10-30 2015-05-07 上海流利说信息技术有限公司 Real-time oral english evaluation system and method on mobile device
EP3065119A4 (en) * 2013-10-30 2017-04-19 Shanghai Liulishuo Information Technology Co. Ltd. Real-time oral english evaluation system and method on mobile device
CN103730129A (en) * 2013-11-18 2014-04-16 长江大学 Voice query system for database information query
CN107077863A (en) * 2014-08-15 2017-08-18 智能-枢纽私人有限公司 Method and system for the auxiliary improvement user speech in appointed language
CN104299612B (en) * 2014-11-10 2017-11-07 科大讯飞股份有限公司 The detection method and device of imitative sound similarity
CN104299612A (en) * 2014-11-10 2015-01-21 科大讯飞股份有限公司 Method and device for detecting imitative sound similarity
CN104505103A (en) * 2014-12-04 2015-04-08 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system
CN104485115B (en) * 2014-12-04 2019-05-03 上海流利说信息技术有限公司 Pronounce valuator device, method and system
CN104485115A (en) * 2014-12-04 2015-04-01 上海流利说信息技术有限公司 Pronunciation evaluation equipment, method and system
CN104485036A (en) * 2014-12-05 2015-04-01 沈阳理工大学 Automatic spoken language learning system
CN104485036B (en) * 2014-12-05 2018-08-10 沈阳理工大学 A kind of automatic speech learning system
CN104658350A (en) * 2015-03-12 2015-05-27 马盼盼 English teaching system
CN107851434A (en) * 2015-05-26 2018-03-27 鲁汶大学 Use the speech recognition system and method for auto-adaptive increment learning method
CN105303909A (en) * 2015-10-23 2016-02-03 广东小天才科技有限公司 English learning method, device and system based on vibration
CN105575384A (en) * 2016-01-13 2016-05-11 广东小天才科技有限公司 Method, device and equipment for automatically adjusting playing resources according to user level
CN105654785A (en) * 2016-03-18 2016-06-08 上海语知义信息技术有限公司 Personalized spoken foreign language learning system and method
CN106056207B (en) * 2016-05-09 2018-10-23 武汉科技大学 A kind of robot depth interaction and inference method and device based on natural language
CN106056207A (en) * 2016-05-09 2016-10-26 武汉科技大学 Natural language-based robot deep interacting and reasoning method and device
CN105845139A (en) * 2016-05-20 2016-08-10 北方民族大学 Off-line speech control method and device
CN106297841A (en) * 2016-07-29 2017-01-04 广东小天才科技有限公司 Audio follow-up reading guiding method and device
CN108154735A (en) * 2016-12-06 2018-06-12 爱天教育科技(北京)有限公司 Oral English Practice assessment method and device
CN110603536A (en) * 2017-03-25 2019-12-20 斯皮蔡斯有限责任公司 Teaching and assessment of spoken skills through fine-grained evaluation of human speech
CN107909995A (en) * 2017-11-16 2018-04-13 北京小米移动软件有限公司 Voice interactive method and device
CN107909995B (en) * 2017-11-16 2021-08-17 北京小米移动软件有限公司 Voice interaction method and device
CN109994008A (en) * 2017-12-30 2019-07-09 无锡虹业自动化工程有限公司 A kind of information processing mechanism of experience system
CN111936958B (en) * 2018-03-27 2024-05-10 日本电信电话株式会社 Adaptive interface providing device, adaptive interface providing method, and program
CN111936958A (en) * 2018-03-27 2020-11-13 日本电信电话株式会社 Adaptive interface providing device, adaptive interface providing method, and program
CN108766415A (en) * 2018-05-22 2018-11-06 清华大学 A kind of voice assessment method
CN108766415B (en) * 2018-05-22 2020-11-24 清华大学 A method of voice assessment
CN108961889A (en) * 2018-08-06 2018-12-07 苏州承儒信息科技有限公司 A kind of educational system based on comentropy degree of change
CN109035896A (en) * 2018-08-13 2018-12-18 广东小天才科技有限公司 Oral training method and learning equipment
CN109394442A (en) * 2018-10-19 2019-03-01 南通大学附属医院 A kind of showing for E.E.G control is multifunction nursing bed with voice prompting
CN109119067A (en) * 2018-11-19 2019-01-01 苏州思必驰信息科技有限公司 Phoneme synthesizing method and device
CN109637543A (en) * 2018-12-12 2019-04-16 平安科技(深圳)有限公司 The voice data processing method and device of sound card
CN110084371A (en) * 2019-03-27 2019-08-02 平安国际智慧城市科技股份有限公司 Model iteration update method, device and computer equipment based on machine learning
CN111046220A (en) * 2019-04-29 2020-04-21 广东小天才科技有限公司 Method for replaying reading voice in dictation process and electronic equipment
CN110288977A (en) * 2019-06-29 2019-09-27 联想(北京)有限公司 A kind of data processing method, device and electronic equipment
CN110288977B (en) * 2019-06-29 2022-05-31 联想(北京)有限公司 Data processing method and device and electronic equipment
CN111353066B (en) * 2020-02-20 2023-11-21 联想(北京)有限公司 Information processing method and electronic equipment
CN111353066A (en) * 2020-02-20 2020-06-30 联想(北京)有限公司 Information processing method and electronic equipment
CN111489597A (en) * 2020-04-24 2020-08-04 湖南工学院 Intelligent English teaching system for English teaching
WO2022003104A1 (en) * 2020-07-01 2022-01-06 Iliescu Alexandru System and method for interactive and handsfree language learning
CN111739518A (en) * 2020-08-10 2020-10-02 腾讯科技(深圳)有限公司 Audio identification method and device, storage medium and electronic equipment
CN112992124A (en) * 2020-11-09 2021-06-18 深圳市神经科学研究院 Feedback type language intervention method, system, electronic equipment and storage medium
CN112309183A (en) * 2020-11-12 2021-02-02 江苏经贸职业技术学院 Interactive listening and speaking exercise system suitable for foreign language teaching
WO2022246782A1 (en) * 2021-05-28 2022-12-01 Microsoft Technology Licensing, Llc Method and system of detecting and improving real-time mispronunciation of words
CN114550701A (en) * 2022-02-25 2022-05-27 昆山杜克大学 Deep neural network-based Chinese electronic larynx voice conversion device and method
TWI845430B (en) * 2023-10-06 2024-06-11 陸清達 Interactive language teaching system
CN118553231A (en) * 2024-07-24 2024-08-27 南京听说科技有限公司 Speech recognition method for multiple languages
CN118553231B (en) * 2024-07-24 2024-12-06 南京听说科技有限公司 Speech recognition method for multiple languages

Similar Documents

Publication Publication Date Title
CN101551947A (en) Computer system for assisting spoken language learning
Chen et al. Automated scoring of nonnative speech using the speechrater sm v. 5.0 engine
US20090258333A1 (en) Spoken language learning systems
Mak et al. PLASER: Pronunciation learning via automatic speech recognition
US20060122834A1 (en) Emotion detection device & method for use in distributed systems
AU2003300130A1 (en) Speech recognition method
WO2007022058A2 (en) Processing of synchronized pattern recognition data for creation of shared speaker-dependent profile
Daniels et al. The suitability of cloudbased speech recognition engines for language learning
Peabody Methods for pronunciation assessment in computer aided language learning
CN109697975A (en) A kind of Speech Assessment Methods and device
Dai [Retracted] An Automatic Pronunciation Error Detection and Correction Mechanism in English Teaching Based on an Improved Random Forest Model
Zhang et al. Children’s emotion recognition in an intelligent tutoring scenario
Zechner et al. Automatic scoring of children’s read-aloud text passages and word lists
Price et al. Assessment of emerging reading skills in young native speakers and language learners
CN111508522A (en) Statement analysis processing method and system
CN1295673C (en) Pronunciation correction apparatus and method
Delmonte Prosodic tools for language learning
CN115881119A (en) Disambiguation method, system, refrigeration equipment and storage medium for fusion of prosodic features
CN115440193A (en) Pronunciation evaluation scoring method based on deep learning
Li et al. English sentence pronunciation evaluation using rhythm and intonation
Bang et al. An automatic feedback system for English speaking integrating pronunciation and prosody assessments.
Zheng [Retracted] An Analysis and Research on Chinese College Students’ Psychological Barriers in Oral English Output from a Cross‐Cultural Perspective
Çalık et al. A novel framework for mispronunciation detection of Arabic phonemes using audio-oriented transformer models
Biczysko Automatic annotation of speech: Exploring boundaries within forced alignment for Swedish and Norwegian
Elfahal Automatic recognition and identification for mixed sudanese arabic–english languages speech

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SUZHOU AISPEECH INFORMATION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: YU KAI

Effective date: 20120116

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20120116

Address after: 215123 C402 room, single lake library, Suzhou Industrial Park, Jiangsu, China

Applicant after: Suzhou Speech Information Technology Co., Ltd.

Address before: 215123, Suzhou City, Jiangsu Province alone Lake Villa library C402, Suzhou, Chi Chi, Mdt InfoTech Ltd

Applicant before: Yu Kai

C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20091007