
WO2002103672A1 - Language assisted recognition module - Google Patents

Language assisted recognition module

Info

Publication number
WO2002103672A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
recognition module
assisted recognition
dialogue
language model
Prior art date
Application number
PCT/AU2002/000801
Other languages
French (fr)
Inventor
Dominique Estival
Ben Hutchison
Original Assignee
Kaz Group Limited
Application filed by Kaz Group Limited
Publication of WO2002103672A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A language assisted recognition module for an automated speech recognition system which incorporates sub-modules, wherein sub-modules include one or more modules selected from a duration model; a language repairer; an adaptive language assisted recognition message processor; a culler; a language model/digit sum checker; a confidence model. The module includes a dialogue processing system; dialogue flow controller and feedback means.

Description

LANGUAGE ASSISTED RECOGNITION MODULE
The present invention relates to a language assisted recognition module and, more particularly, to such a module suited for use with an automated speech recognition system.
BACKGROUND
Automated speech recognition is a complex task in itself. Automated speech understanding sufficient to provide automated dialogue with a user adds a further layer of complexity.
In this specification the term "automated speech recognition system" will refer to automated or substantially automated systems which perform automated speech recognition and also attempt automated speech understanding, at least to predetermined levels sufficient to provide a capability for at least limited automated dialogue with a user.
A generalized diagram of a commercial grade automated speech recognition system as can be used for example in call centres and the like is illustrated in Fig. 1.
With advances in digital computers and a significant lowering in cost per unit of computing capacity there have been a number of attempts in the commercial marketplace to install such automated speech recognition systems implemented substantially by means of digital computers.
However, to date, there remain problems in achieving 100% recognition and/or 100% understanding in real time.
Furthermore, current systems suffer from overly tight integration, rendering them extremely inflexible and thereby particularly difficult to optimize for specific commercial contexts.
It is an object of the present invention to address or ameliorate one or more of the abovementioned disadvantages.
BRIEF DESCRIPTION OF INVENTION
Accordingly, in one broad form of the invention there is provided a language assisted recognition module for an automated speech recognition system, said module incorporating a plurality of sub-modules.
Preferably said sub-modules process data in series.
Preferably said sub-modules include one or more modules selected from a duration model; a language repairer; an adaptive language assisted recognition message processor; a culler; a language model/digit sum checker; a confidence model.
Preferably said language assisted recognition module is in communication with a dialogue processing system. Preferably said dialogue processing system includes a dialogue flow controller.
Preferably said language assisted recognition module incorporates feedback means from said utterance processing system to said language assisted recognition module.
Preferably said feedback means passes feedback data from said utterance processing system to said language assisted recognition module for adaptive processing within said language assisted recognition module.
Preferably said feedback data comprises an adaptive language assisted recognition message derived from a dialogue flow controller within said automated speech recognition system.
BRIEF DESCRIPTION OF DRAWINGS
Embodiments of the present invention will now be described with reference to the accompanying drawings wherein:
Fig. 1 is a generalized block diagram of a prior art automated speech recognition system;
Fig. 2 is a generalized block diagram of an automated speech recognition system suited for use in conjunction with an embodiment of the present invention;
Fig. 3 is a more detailed block diagram of the utterance processing and dialogue processing portions of the system of Fig. 2;
Fig. 4 is a generalized block diagram of the system of Fig. 2 incorporating a language assisted recognition module in accordance with a first preferred embodiment of the present invention;
Fig. 5 is a detailed block diagram of the language assisted recognition module of Fig. 4; and
Fig. 6 is a block diagram of a class structure of an LAR system in accordance with a first example of the system of Fig. 4.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
With reference to Fig. 2 there is illustrated a generalized block diagram of an automated speech recognition system 10 adapted to receive human speech derived from user 11, and to process that speech with a view to recognizing and understanding the speech to a sufficient level of accuracy that a response 12 can be returned to user 11 by system 10. In the context of systems to which embodiments of the present invention are applicable the response 12 can take the form of an auditory communication, a written or visual communication or any other form of communication intelligible to user 11 or a combination thereof.
In all cases input from user 11 is in the form of a plurality of utterances 13 which are received by transducer 14 (for example a microphone) and converted into an electronic representation 15 of the utterances 13. In one exemplary form the electronic representation 15 comprises a digital representation of the utterances 13 in .WAV format. Each electronic representation 15 represents an entire utterance 13. The electronic representations 15 are processed through front end processor 16 to produce a stream of vectors 17, one vector for example for each 10ms segment of utterance 13. The vectors 17 are matched against knowledge base vectors 18 derived from knowledge base 19 by back end processor 20 so as to produce ranked results 1-N in the form of N best results 21. The results can comprise for example subwords, words or phrases but will depend on the application. N can vary from 1 to very high values, again depending on the application.
An utterance processing system 26 receives the N best results 21 and begins the task of assembling the results into a meaning representation 25 for example based on the data contained in language knowledge database 31. The utterance processing system 26 orders the resulting tokens or words 23 contained in N-best results 21 into a meaning representation 25 of token or word candidates which are passed to the dialogue processing system 27 where sufficient understanding is attained so as to permit functional utilization of speech input 15 from user 11 for the task to be performed by the automated speech recognition system 10. In this case the functionality includes attaining of sufficient understanding to permit at least a limited dialogue to be entered into with user/caller 11 by means of response 12 in the form of prompts so as to elicit further speech input from the user 11. In the alternative or in addition, the functionality for example can include a sufficient understanding to permit interaction with extended databases for data identification.
Fig. 3 illustrates further detail of the system of Fig. 2, including the further functional components which make up the utterance processing system 26 and the dialogue processing system 27 and their interaction. Like components are numbered as for the arrangement of Fig. 2.
The utterance processing system 26 and the dialogue processing system 27 together form a natural language processing system. The utterance processing system 26 is event-driven and processes each of the utterances 13 of caller/user 11 individually. The dialogue processing system 27 puts any given utterance 13 of caller/user 11 into the context of the current conversation (usually in the context of a telephone conversation) . Broadly, in a telephone answering context, it will try to resolve the query from the caller and decide on an appropriate answer to be provided by way of response 12.
The utterance processing system 26 takes as input the output of the acoustic or speech recogniser 30 and produces a meaning representation 25 for passing to dialogue processing system 27.
In a typical, but not limiting form, the meaning representation 25 can take the form of value pairs. For example, the utterance "I want to go from Melbourne to Sydney on Wednesday" may be presented to the dialogue processing system 27 in the form of three value pairs, comprising:
1. Start; Melbourne
2. Destination; Sydney
3. Date; Wednesday
where, in this instance, the components Melbourne, Sydney, Wednesday of the value pairs 24 comprise tokens or words 23. With particular reference to Fig. 3 the recogniser 30 provides as output N best results 21 usually in the form of tokens or words 23 to the utterance processing system 26 where it is first disambiguated by language model 32. In one form the language model 32 is based on trigrams with cut off.
Analyser 33 specifies how words derived from language model 32 can be grouped together to form meaningful phrases which are used to interpret utterance 13. In one form the analyzer is based on a series of simple finite state automata which produce robust parses of phrasal chunks - for example noun phrases for entities and concepts and WH- phrases for questions, dates. Analyser 33 is driven by grammars such as meta-grammar 34. The grammars themselves must be tailored for each application and can be thought of as data created for a specific customer.
The resolver 35 then uses semantic information associated with the words of the phrases recognized as relevant by the analyzer 33 to refine the meaning representation 25 into its final form for passing through the dialogue flow controller 36 within dialogue processing system 27.
The dialogue processing system 27, in this instance with reference to Fig. 3, receives meaning representation 25 from resolver 35 and processes the dialogue according to the appropriate dialogue models. Again, dialogue models will be specific to different applications but some may be reusable . For example a protocol model may handle greetings, closures, interruptions, errors and the like across a number of different applications.
The dialogue flow controller 36 uses the dialogue history to keep track of the interactions.
The logic engine 37, in this instance, creates SQL queries based on the meaning representation 25. Again it will be dependent on the specific application and its domain knowledge base .
The generator 38 produces responses 12 (for example speech out) . In the simplest form the generator 38 can utilize generic text to speech (TTS) systems to produce a voiced response.
Language knowledge database 31 comprises, in the instance of Fig. 3, a lexicon 39 operating in conjunction with database 40. The lexicon 39 and database 40 operating in conjunction with knowledge base mapping tools 41 and, as appropriate, language model 32 and grammars 34 constitutes a language knowledge database 31 or knowledge base which deals with domain specific data. The structure and grouping of data is modeled in the knowledge base 31. Database 40 comprises raw data provided by a customer. In one instance this data may comprise names, addresses, places, dates and is usually organised in a way that logically relates to the way it will be used. The database 40 may remain unchanged or it may be updated throughout the lifetime of an application. Functional implementation can be by way of database servers such as MySQL, Oracle, Postgres .
As will be observed particularly with reference to Fig. 3, interaction between a number of components in the system can be quite complex with lexicon 39, in particular, being used by and interacting with multiple components of System 10.
With reference to Figs. 4, 5 and 6 a language assisted recognition module 710 and a first example of its operation are described in the context of an automated speech recognition system 10 of the type described with reference to Fig. 2. Like components are numbered as for the arrangement of Fig . 2.
Initially, with particular reference to Fig. 4 and Fig. 5, there is interposed between recogniser 30 and utterance processing system 26 the language assisted recognition module 710. The module 710 includes a plurality of sub-modules 711 which, in a preferred form, are linked in series as illustrated in Fig. 4 so as to provide a pipelined processing of the token or word - sequence candidates 23 (in this instance in the form of an N best list) so as to produce an M-best list of language assisted sequence candidates 721 for input into utterance processing system 26.
In this arrangement it is also to be noted that there is a feedback path 713 from the utterance processing system 26 back to the language assisted recognition module 710. In preferred forms the feedback path 713 provides feedback data 714.
With particular reference to Fig. 5 the utterance processing system 26 includes a dialogue flow controller
(DFC) 715 adapted, as will be discussed in further detail below, to provide feedback data 714 in the form of an adaptive language assisted recognition message (ALARM) 716 to at least one of the sub-modules 711.
In the instance of Fig. 5 the sub-modules 711 comprise, in this instance, respectively and in series, a duration model 717, a language repairer 718, an ALARM processor 719, a culler 720, a language model/digit sum checker 721, a confidence model 722 and a second culler 723. This modular arrangement, in preferred forms in conjunction with the feedback data 714, provides an exceedingly flexible arrangement for interposing into the processing path of data flowing through the automated speech recognition system 710 of Fig. 4.
Broadly speaking the problem which this arrangement seeks to overcome and a re-statement of the characteristics of the system above described with reference to Figs. 4 and 5 will now follow. Subsequently a detailed description of a first embodiment of a language assisted recognition module and that of a second embodiment of the language assisted recognition module will be described followed by the provision of a detailed first example of an implementation in a software environment.
In the past, the problem has been addressed by one of the following:
1. restricting the naturalness of utterances, by constraints on the form of utterances, and using recognition grammars that specify a narrow range of possibilities
2. using linguistic information in the recognition grammars, resulting in loss of efficiency in recognition
3. allowing a wider range of recognition results to be sent to language processing components.
While solutions (1) and (2) are not acceptable for efficient robust recognition of natural spoken language utterances, solution (3) puts an unnecessary processing burden on the language processing components and also results in loss of efficiency and modularity.
The solution according to a first preferred embodiment is to create a separate modular component 710, which contains a set of procedures to 1) eliminate some of the results returned by the speech recogniser, 2) rescore those results, and 3) perform some repair on the results.
The solution differs from previous attempts in that:
1. the techniques used for rescoring and repairing are not incorporated in the speech recogniser;
2. the linguistic and application domain information added to the recognition grammar for repairing is minimal and does not impact on efficiency;
3. the linguistic and application domain information used by the LAR is not needed by the language processing components, resulting in greater modularity of application design and development.
There are four types of linguistic information that the LAR module 710 can use. They are:
1. N-gram frequencies of words
2. Parts of speech of words
3. Knowledge of which words do not make sense in the context of the dialogue
4. Information on when and how people speak disfluently (e.g. go um or ah, or correct or repeat themselves)
The LAR module 710 is split into sub-modules 711 which use these different kinds of information. The sub-modules include:
• Language model (uses information type 1) : uses n-grams to estimate the likelihood of sequences of words and re-score the N-best list.
• Class based language model (uses 1 and 2) : same as above, but also uses part of speech information to overcome the sparse data problem
• Adaptive LAR (uses 3) : receives messages from the DFC telling it which concepts have been denied at a previous stage of the dialogue, and uses this information to rule out candidates in the N-best list.
• Language repairer (uses 4) : removes disfluencies from each utterance (assumes that the recogniser has correctly recognised the disfluencies) .
In any given application, any subset of these sub-modules may be used, and the method of configuring the LAR 710 is flexible enough to allow the sub-modules 711 to be applied in any order.
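As an illustration only, the following Python sketch shows one way such a configurable chain of sub-modules could be expressed; the class names, the (score, words) candidate representation and the sub-module interface are assumptions made for the sketch rather than the actual implementation.

# Illustrative sketch: a configurable pipeline of LAR sub-modules.
# The interfaces below are assumed, not taken from the patent text.
class SubModule:
    def process(self, n_best):
        """Takes a list of (score, words) candidates and returns a new list."""
        raise NotImplementedError

class LanguageModelRescorer(SubModule):
    def __init__(self, lm):
        self.lm = lm  # any object exposing a score(words) method
    def process(self, n_best):
        return [(score + self.lm.score(words), words) for score, words in n_best]

class Culler(SubModule):
    def __init__(self, m):
        self.m = m  # reduce the (rescored) N-best list to an M-best list
    def process(self, n_best):
        return sorted(n_best, key=lambda c: c[0], reverse=True)[:self.m]

class LARPipeline:
    """Applies any subset of sub-modules, in any configured order."""
    def __init__(self, sub_modules):
        self.sub_modules = sub_modules
    def run(self, n_best):
        for module in self.sub_modules:
            n_best = module.process(n_best)
        return n_best

An application could then be configured with, say, LARPipeline([LanguageModelRescorer(lm), Culler(5)]), or with the modules omitted or reordered as required.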
The techniques used for rescoring and repairing are not incorporated in the speech recogniser itself, and thus they do not impact on speech recognition efficiency. As the information added to the recognition grammar for repairing is minimal, it also does not impact on speech recognition efficiency.
As the linguistic and application domain information used by the LAR 710 is not needed by the language processing components of processing system 26, language processing is also more efficient, but the main advantage is that this architecture results in greater modularity of application design and development. The encapsulation of the procedures in the LAR component allows for greater modularity of the software components. The separation of information for speech recognition, LAR and language processing, which is usually integrated in the architecture of other spoken dialogue systems, allows for greater reusability of the language data used by these components. The data used by the LAR procedures are also more reusable and can be shared more easily across applications.
A core concept underlying the LAR is the encapsulation of procedures into a separate module 710, resulting in greater flexibility for development, while enhancing the output of the speech recognition component without negative consequences for efficiency.
The modularity of the architecture ensures rapid application development and greater reusability of data. Variations within the scope of embodiments of the invention include the addition of new modules, e.g. keyword spotting, modification of existing modules, e.g. new algorithm for language modeling, as well as applying the procedures to data structures other than an N-best list or a lattice.
FIRST EMBODIMENT
• From SYCON: an N-best list, tagged with acoustic scores and word durations
resolution would be used to cull the list further.
• A Language Repairer which repairs repetitions ("the final, final of the swimming"), pauses ("the um fencing") and corrections ("freestyle, no, swimming"). This would introduce new candidates into the N-best list.
• A Language Model, which uses N-gram statistics to rescore the N-best list
• An ALARM Processor, which interprets Adaptive LAR Messages and rescores the N-best list as a result
• A Confidence Model, which rescores the N-best list based on the frequency and/or confidence scores of words in the N-best list
• A Duration Model, which rescores the N-best list based upon the duration of words or phrases. An adaptive duration model may also keep track of the durations of words/phrases in previous utterances.
• A Culler which reduces a (rescored) N-best list to an M-best list
I leave the order in which the sub-modules should be applied as a matter for further investigation.
Resources
Different sub-modules would require different resources:
• The Language Repairer requires a specification of the transformations it is allowed to make. These may be transformations on POS sequences (e.g. "GENDER, SIL NO GENDER2" -> "GENDER2"). Thus it would also require the Lexicon.
• The Language Model requires N-gram statistics. If class-based it will also require the Lexicon.
• The ALARM Processor requires no external resources
• The Confidence Model requires no external resources.
• The Duration Model requires statistics on word and phrase durations.
• The Culler requires no additional resources.
SECOND EMBODIMENT
Introduction
At later stages of the dialogue, the DFC should be able to constrain the LAR. To see why, suppose our first recognition is "when is mens freestyle swimming". Our DFC asks for confirmation, and in response gets "no, mens freestyle swimming". Obviously there has been a misrecognition here. If the DFC tells the LAR module to ignore "mens freestyle swimming", then the next best candidate, e.g. "no, womens freestyle swimming", will be chosen. We will refer to this as Adaptive LAR.
The ALARM Processor
At each stage of the dialogue, the DFC will pass to the LAR module a (possibly NULL) restriction on the output of the recogniser. I will call this an Adaptive LAR Message, or ALARM. For example, the DFC may send an ALARM:
"mens freestyle" The LAR ww* l then exclude all candidates containing the substring "mens freestyle".
The LAR Module will have a sub-module, the ALARM Processor, which is responsible for interpreting and executing ALARMs.
ALARM format
An ALARM specifies what will be filtered out of the N-best list. It consists of:
1. words, phrases
2. AND, OR
3. parentheses: "(" and ")"
Examples:
"mens" AND "freestyle" filters out candidates containing both "mens" and "freestyle".
"mens" OR "freestyle" filters out candidates containing at least one of "mens" and "freestyle".
"mens freestyle" filters out candidates containing the phrase "mens freestyle".
This should give the Dialogue Designer freedom to specify any
A More Flexible Variant
I now discuss a variation of the above proposal which has two new features, sketched in code below:
1. The ALARMs can specify both which words/phrases are bad, and which are good.
2. Candidates specified by an ALARM are not removed from the N-best list, but are assigned penalty scores.
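A minimal Python sketch of this penalty-based variant follows; the predicate combinators, the default penalty value and the (score, candidate-string) representation are illustrative assumptions only.

# Sketch of the penalty-based ALARM variant. Candidates matched by the
# ALARM are penalised rather than removed from the N-best list.
DEFAULT_PENALTY = -20.0  # assumed default when the ALARM gives no penalty size

def contains(phrase):
    return lambda candidate: phrase in candidate

def AND(*preds):
    return lambda candidate: all(p(candidate) for p in preds)

def OR(*preds):
    return lambda candidate: any(p(candidate) for p in preds)

def NOT(pred):
    return lambda candidate: not pred(candidate)

def apply_alarm(n_best, matches, penalty=DEFAULT_PENALTY):
    """Add the penalty to the score of every candidate matched by the ALARM."""
    return [(score + penalty, words) if matches(words) else (score, words)
            for score, words in n_best]

For example, AND(NOT(contains("mens")), NOT(contains("womens"))) would penalise every candidate that contains neither gender term.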
EXAMPLE 1
Introduction
Language Assisted Recognition (LAR) is the process of using linguistic information to improve recognition. Some types of linguistic information that are useful are:
1 ) N-gram frequencies of words
2) Parts of speech of words
3) Knowledge of which words do not make sense in the context of the dialogue
4) Information on when and how people speak disfluently (e.g. go um or ah, or correct or repeat themselves)
In the current, linear, architecture, the "Language Assisted" and the "Recognition" parts of LAR are performed by different components. The speech recogniser produces a scored N-best list; the "LAR module" then rescores this list using linguistic information. This document describes the theory behind, and design of, the LAR module, and also contains a user's guide.
Module structure
There are four types of linguistic information that the LAR module uses. They are:
1 ) N-gram frequencies of words
2) Parts of speech of words
3) Knowledge of which words do not make sense in the context of the dialogue
4) Information on when and how people speak disfluently (e.g. go um or ah, or correct or repeat themselves)
The LAR module is split into sub-modules which use these different kinds of information. The sub-modules are:
• Language model (uses information type 1 ): uses n-grams to estimate the likelihood of sequences of words and re-score the N-best list.
• Class based language model (uses 1 and 2): same as above, but also uses part of speech information to overcome the sparse data problem
• Adaptive LAR (uses 3): receives messages from the DFC telling it which concepts have been denied at a previous stage of the dialogue, and uses this information to rule out candidates in the N-best list.
• Language repairer (uses 4): removes disfluencies from each utterance (assumes that the recogniser has correctly recognised the disfluencies).
In any given application, any subset of these sub-modules may be used, and the method of configuring the LAR is flexible enough to allow the sub-modules to be applied in any order.
Language model
The language model used is a trigram model with backoff. It requires unigram, bigram and trigram frequencies, and it first attempts to estimate the probability of a trigram using the trigram frequency. If, however, the frequency of the trigram falls below a certain threshold (denoted by trigger3 below), then the language model reverts to a bigram estimation. If, additionally, the bigram frequency falls below another threshold (called trigger2 below) then the language model reverts to just the unigrams to estimate the probability.
Once the language model has estimated the probability of an utterance, this probability is combined with the recogniser's score to produce a new score for the utterance. In order to combine the two scores, a scaling function must first be applied. This scaling function is set in the configuration file. For example, a typical way to combine the acoustic and LM scores is:
New score = acoustic score + 50 x LOG(LM score) / #words in utterance
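A one-function Python sketch of this combination (using the LOGDIV-style scaling, i.e. the log probability divided by the number of words, and the weight of 50 from the example above) may make the formula concrete; the function name and signature are assumptions.

import math

def combined_score(acoustic_score, lm_probability, num_words, lm_weight=50.0):
    # LOGDIV scaling: natural log of the LM probability divided by the word count
    scaled_lm = math.log(lm_probability) / num_words
    return acoustic_score + lm_weight * scaled_lm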
Formulas
This section contains the formulas used in estimating the probability of an utterance. It can safely be skipped unless you are really interested in the nitty-gritty.
The language model uses the following formula for sentence {Start w1 w2 ... wn End} scores:

score = P(w1|Start) * P(w2|Start w1) * P(w3|w1 w2) * P(w4|w2 w3) * ... * P(End|w(n-1) wn)    (1)

and the following for utterance {w1 w2 ... wn} scores:

score = P(w1) * P(w2|w1) * P(w3|w1 w2) * P(w4|w2 w3) * ... * P(wn|w(n-2) w(n-1))    (2)

The back-off probabilities Pback, used in (1) and (2) in place of the regular probabilities P(C|AB), P(B|A) and P(A), are:

Pback(C|AB) = P(C|AB) = F(ABC)/F(AB), if F(ABC) >= trigger3
Pback(C|AB) = normFactor3(AB) * Pback(C|B), if F(ABC) < trigger3

Pback(C|B) = P(C|B) = F(BC)/F(B), if F(BC) >= trigger2
Pback(C|B) = normFactor2(B) * P(C), if F(BC) < trigger2

P(C) = F(C)/AllF

where F denotes frequency and AllF is the sum of all unigram frequencies.
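The back-off scheme above can be sketched in Python as follows; the dict-based frequency tables, the precomputed normalisation factors and the class interface are assumptions made purely for the illustration.

class BackoffTrigramLM:
    def __init__(self, f1, f2, f3, trigger2, trigger3, norm2, norm3):
        self.f1, self.f2, self.f3 = f1, f2, f3      # unigram, bigram, trigram counts
        self.trigger2, self.trigger3 = trigger2, trigger3
        self.norm2, self.norm3 = norm2, norm3       # normFactor2(B), normFactor3(AB)
        self.all_f = sum(f1.values())               # AllF

    def p_uni(self, c):
        return self.f1.get(c, 0) / self.all_f

    def p_bi(self, c, b):
        if self.f2.get((b, c), 0) >= self.trigger2:
            return self.f2[(b, c)] / self.f1[b]
        return self.norm2.get(b, 1.0) * self.p_uni(c)

    def p_tri(self, c, a, b):
        if self.f3.get((a, b, c), 0) >= self.trigger3:
            return self.f3[(a, b, c)] / self.f2[(a, b)]
        return self.norm3.get((a, b), 1.0) * self.p_bi(c, b)

    def utterance_probability(self, words):
        # Formula (2): product of back-off probabilities over the utterance
        p = self.p_uni(words[0])
        if len(words) > 1:
            p *= self.p_bi(words[1], words[0])
        for i in range(2, len(words)):
            p *= self.p_tri(words[i], words[i - 2], words[i - 1])
        return p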
Class based language model
There is often a problem of insufficient data when training a language model. In fact, a great deal of data is required in order to get accurate relative probabilities of a word in all its different contexts. One way to overcome this problem is by using a class based language model. The class based LM estimates the probability of a word class given a context, and then multiplies this by the probability of a word given a word class. Thus it carries the underlying assumption that the probability of a word given a word class is independent of context. This assumption does not always hold, but class based language modeling still remains an effective tool when large sets of training data are unavailable.
Formulas
The class based LM calculates the probability of an utterance by the following formula:
Pc(w1 w2 w3 ... wn) = P(C1 C2 C3 ... Cn) * P(w1|C1) * P(w2|C2) * ... * P(wn|Cn)
where Ci is the class of word wi.
This can be approximated by the product, over each word wk, of
Pc(wk | wk-1 wk-2) = P(wk | Ck) * P(Ck | Ck-1 Ck-2)
The probability P(Ck | Ck-1 Ck-2) is calculated using the same trigram backoff formula as the standard language model.
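A short Python sketch of this class-based estimate follows, reusing a class trigram model with the same back-off interface as the sketch above; the part-of-speech lookup and the P(word | class) table are assumed inputs.

def class_based_utterance_probability(words, pos_of, p_word_given_class, class_lm):
    classes = [pos_of[w] for w in words]            # Ci: the class (POS) of word wi
    p = class_lm.utterance_probability(classes)     # P(C1 C2 ... Cn) via trigram back-off
    for w, c in zip(words, classes):
        p *= p_word_given_class.get((w, c), 0.0)    # P(wi | Ci), assumed context-independent
    return p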
Adaptive LAR
At later stages of the dialogue, the DFC should be able to constrain the LAR. To see why, suppose our first recognition is "when is mens freestyle swimming". Our DFC asks for confirmation, and in response gets "no, mens freestyle swimming". Obviously there has been a misrecognition here. If the DFC tells the LAR module to ignore "mens freestyle swimming", then the next best candidate in the n-best, say "no, womens freestyle swimming", will be chosen. We will refer to this as Adaptive LAR.
At each stage of the dialogue, the DFC will pass to the LAR module a (possibly NULL) restriction on the output of the recogniser. This restriction is called an Adaptive LAR Message, or ALARM, and specifies a semantic value which is to be excluded. All candidates containing words with this value will then be penalised.
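As an illustration of the idea (not the actual ALARM processor), a penalisation pass over the n-best list might look like the following; the NBestEntry structure, the semantic-value lookup and the penalty size are assumptions made for the example.

// Sketch of applying an ALARM: candidates whose semantic values include the
// excluded value are penalised. All names here are illustrative assumptions.
#include <string>
#include <vector>

struct NBestEntry {
    std::string text;                          // candidate utterance
    double score;                              // combined recognition score
    std::vector<std::string> semanticValues;   // e.g. {"mens", "freestyle", "swimming"}
};

void applyAlarm(std::vector<NBestEntry>& nbest,
                const std::string& excludedValue,
                double penalty = 1000.0) {
    for (auto& candidate : nbest) {
        for (const auto& value : candidate.semanticValues) {
            if (value == excludedValue) {
                candidate.score -= penalty;    // penalised, not removed outright
                break;
            }
        }
    }
}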
Using Language Assisted Recognition
Using adaptive LAR
Using the language repairer
Language repairer
The language repairer receives candidate utterances which may contain marked-up disfluencies and produces fluent versions of the utterances. For this purpose, three recognition tokens are reserved as meta-tokens.
The disfluency meta-tokens are:
• STARTMOD start of a phrase that is later modified
• ENDMOD end of a phrase that is later modified
• RESTART restart of an utterance
The actual phonemic spellings of these tokens may be either _ps_ (a short pause), a noise model, or a filled pause ("umm" or "ahh").
The placing of disfluencies, with corresponding meta-tokens, in the fluent grammars is a challenging process. In contrast, the action performed by the language repairer is quite simple. The Language Repairer takes in an N-best list and edits each utterance by:
• Removing STARTMOD and ENDMOD tokens and everything in between
• Removing everything before a RESTART token
Note that the same functionality could be obtained by treating restarts as a special case of modifications, and marking both the beginnings and ends of false starts with STARTMOD and ENDMOD respectively. However, this leads to left branching in the grammar, which is known to lead to larger recognition lattices and hence less efficient recognition. Left branching is unavoidable in the case of modifications.
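A minimal sketch of the edit step, assuming candidates are held as token sequences (the real module operates on an N-best list of Candidate objects):

// Sketch of the repair edits: drop everything before the last RESTART, and drop
// STARTMOD..ENDMOD spans. The token-vector representation is an assumption.
#include <string>
#include <vector>

std::vector<std::string> repairUtterance(const std::vector<std::string>& tokens) {
    // Remove everything before the (last) RESTART token.
    std::size_t start = 0;
    for (std::size_t i = 0; i < tokens.size(); ++i)
        if (tokens[i] == "RESTART") start = i + 1;

    // Remove STARTMOD and ENDMOD tokens and everything in between.
    std::vector<std::string> fluent;
    bool insideModifiedPhrase = false;
    for (std::size_t i = start; i < tokens.size(); ++i) {
        if (tokens[i] == "STARTMOD") { insideModifiedPhrase = true; continue; }
        if (tokens[i] == "ENDMOD")   { insideModifiedPhrase = false; continue; }
        if (!insideModifiedPhrase) fluent.push_back(tokens[i]);
    }
    return fluent;
}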
Using Language Assisted Recognition
Using the language models
The first thing to do is to create your n-gram files. The method differs depending on which language model you wish to use. If you are using the word based language model, then this just involves running ngramDir.pl to produce the files unigram.bin, bigram.bin and trigram.bin. If you are using the class based language model, however, you will first need to tag your training data. This can be done either with the Brill tagger, or using the tagger.pl script. Once your data is tagged, run ngramclasses.pl to produce classesgram.bin, unigram.bin, bigram.bin and trigram.bin.
The next step is to configure the language model. The configuration file has the following relevant settings:
# Language Model mandatory parameters
LM_FILES = (unigram.bin, bigram.bin, trigram.bin)
CBLM_FILE = classesgram.bin
LM_NMAX = 3
LM_PROBABILISTIC = 1
LM_TrigramBackOffTrigger = 2
LM_BigramBackOffTrigger = 2
LM_SCALING = LOGDIV
# Language model optional parameters
LM_ACWEIGHT = 1
LM_LAWEIGHT = 50
A description of each parameter follows.
• LM_FILES: takes a list containing the names of the files containing the unigram, bigram and trigram frequencies.
• CBLM_FILE: the name of the file containing the frequencies of each word occurring with a particular part of speech.
• LM_NMAX: MUST BE SET TO 3
• LM_PROBABILISTIC: 1 for probabilistic mode, 0 for non-probabilistic (IS THIS STILL SUPPORTED?)
• LM_TrigramBackOffTrigger: if the trigram frequency is less than this number, the language model will back off
• LM_BigramBackOffTrigger: if the bigram frequency is less than this number, the language model will back off
• LM_SCALING: specifies a function that is applied to the language model probability before it gets combined with the acoustic score. Possible values are:
o LINEAR: identity function
o LOG: natural logarithm of the probability
o LOGDIV: logarithm divided by the number of words in the utterance
• LM_ACWEIGHT and LM_LAWEIGHT: the multiples of the acoustic and language model score that will get combined to produce the final score.
For example, in the configuration shown above, the final score will be:
1 x AcousticScore + 50 x Log(LMScore) / (#words in utterance)
Programmer's Guide to LAR
Classes
When using LAR, you need to know about the following classes.
• UserSaid
• SpellingLanguageModel
• POSLanguageModel
• LanguageRepairer
• ALARM
• Reject
UserSaid is the n-best list of utterances from the recognizer. SpellingLanguageModel, POSLanguageModel, LanguageRepairer, and ALARM represent LAR modules that operate on the n-best list. Reject is used in conjunction with ALARM; it represents a rejected attribute-value pair.
Programming Overview
To use the LAR component, you need to instantiate a UserSaid object with an n-best list of utterances. Also, you need to determine which LAR modules you are going to use and create an instance for each. These objects are then used to operate on the n-best list.
The LAR modules can be applied in any order. Most of the time, the language model is applied before the repairer. This is because the repairer removes disfluent parts of utterances and the data used by the language model typically includes the disfluencies. However, if the language model data does not include disfluencies, the repairer should be applied first.
Each module is optional, with the possible exception of the LanguageRepairer: if a recognition grammar is used that outputs disfluency meta-tokens for language repair, then the repairer is required.
There are two language models. One is word based (SpellingLanguageModel), the other is class based (POSLanguageModel). Which one to use depends on whether the data used by the system uses spellings (word based model) or word classes (class based model).
The LAR API
The following sample code shows the use of LAR modules.
// Instantiate LAR modules
LanguageRepairer repairer;
SpellingLanguageModel lm;
ALARM alarm;
... other code ...
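The remainder of the original sample appeared in a figure that is not reproduced here. The following hedged sketch shows one way the objects above might be chained; the UserSaid constructor and the process() and setRejection() calls are hypothetical names introduced for illustration, not the documented SYLAN API.

// Hypothetical continuation of the sample above - all method names are assumptions.
// UserSaid userSaid(nBestList);        // wrap the recogniser's n-best list
//
// lm.process(userSaid);                // rescore candidates with the language model
// repairer.process(userSaid);          // strip disfluencies from each candidate
//
// Reject reject("gender", "mens");     // attribute-value pair to exclude
// alarm.setRejection(reject);
// alarm.process(userSaid);             // penalise candidates matching the rejection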
Class Structure
The class structure of the LAR system is shown below. In addition to the dependencies shown in the diagram, the LanguageModel also instantiates each of the SentenceScore subclasses.
Design Notes
Classes
SpellingLanguageModel
The SpellingLanguageModel class implements a word based language model. The name SpellingLanguageModel was chosen instead of "word language model" since in the SYLAN code, a Word is a class containing lexical information such as part of speech (POS).
POSLanguageModel
The POSLanguageModel class implements a class based language model. It uses the POS rather than the spelling of each word.
Candidate
A Candidate represents a scored candidate sentence that is part of an n-best list.
A Candidate is an array of Word. It can be in one of two states: either the words in the Candidate are just spellings (i.e., only the spelling field of the component words has been filled in), or they have been retrieved from the lexicon and contain full lexical details. The Candidate class has an attribute that tracks which of the two states the object is in.
Language Model Algorithms
Implementation
A straightforward implementation of the language models would calculate and retain backoff probabilities for all n-grams. This straightforward implementation is unacceptable because, for a 1000-word vocabulary, the size of Pback(C|AB) would be 10^9, and access to a particular back-off probability would be too time consuming.
To solve this problem we do not store Pback values with zero frequencies and instead calculate them at run time. This requires calculating and storing norm factors instead of Pback. Since the dimensions of the norm factors are (vocabulary size)^2 and (vocabulary size)^1, this is a large saving compared to (vocabulary size)^3, (vocabulary size)^2 and (vocabulary size)^1 for storing all Pback.
The number of ABCs with F(ABC) > 0, where F is the frequency, is normally significantly smaller than (vocabulary size)^3, and the number of ABs with F(AB) > 0 is normally significantly smaller than (vocabulary size)^2.
Moreover, if F(AB) = 0 then normfactor(AB) = 1 and we do not need to store/calculate the normfactor(AB) in that case. This means that the size of normfactor(AB) is equal to the size of the stored Pback(AB).
In this implementation, the same containers are used for probabilities and frequencies.
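For illustration, the stored tables might look like the following sketch; the container type and names are assumptions and not the actual SYLAN structures.

// Sketch: only n-grams with non-zero frequency are stored, and frequencies,
// probabilities and norm factors share the same container type. Missing
// norm factors are implicitly 1 (the F(AB) = 0 case). Names are illustrative.
#include <map>
#include <string>
#include <vector>

struct NGramStore {
    std::map<std::vector<std::string>, double> entries;   // n-gram -> value

    double lookup(const std::vector<std::string>& key, double fallback) const {
        auto it = entries.find(key);
        return it == entries.end() ? fallback : it->second;
    }
};

// NGramStore triFreq, biFreq, uniFreq;   // frequencies, non-zero entries only
// NGramStore normFactor3, normFactor2;   // lookup(key, 1.0) where the frequency is 0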
While reading the original n-grams (non-zero frequencies), the sums needed for the norm factor calculations are accumulated.
This document describes how to compile N-gram statistics for use in the language model, and also how to use the Language Model.
Section 2 discusses the problem of selecting, obtaining and pre-processing corpora. Section 3 discusses the calculation of the N-gram statistics using the programs ngram.pl and ngformat.pl. Section 4 discusses using the Language Model to evaluate N-best lists, and discusses the various configurations of SYLM that can be set. Section 5 gives some examples of SYLM in action.
2. Corpora
Corpora can come from a variety of sources: transcriptions of dialogues, text downloaded from the web or from other sources, and text generated from a knowledge base.
Since we will be applying the Language Model to spoken data, ideally we would like a corpus of spoken utterances. We would also like a corpus that is domain-specific; transcriptions of appropriate dialogues are the most domain specific. The larger the corpus, the more accurate the statistics will be, provided that the corpus is from the right domain. Written corpora are the easiest large corpora to acquire. Furthermore, we would like a corpus that is complete, that is, it should contain any trigram that we expect the user to utter. Information generated from a knowledge base will contain the most complete lists of referring expressions. This suggests that a corpus which is a mixture of ones from the three sources should be used.
One problem with using written corpora such as newspaper reports is that they will usually only contain declarative sentences, whereas we may be dealing with questions and so will want training data with structures such as "when is", "where is", "can you", "do you", etc. Note also that newspaper reports will not use the first or second person pronouns unless reporting speech, so we will not get N-grams for "I want to know", "can you tell me", "do you know", etc., whereas first and second person pronouns are very common in spoken dialogue.
When mixing corpora of various types, you must consider the balance of the different corpora.
Once you have your text corpus, you should run the Perl script clean.pl, which removes all punctuation, converts to lower case, and adds two new lines at the end of every sentence.
3. Calculating statistics
To calculate N-gram statistics for use by SYLM, you will need to run the programs getMultiWordTokens.pl and ngramDir.pl (N-gram compiler, Directory version). First export your lexicon file using VDT. The lexicon file I mean here is the one in a comma separated list:
1271, core, can, MD, , ,
1272, core, can_i, QUERY, QUERY, ,
1273, core, can_ e, QUERY, QUERY, ,
Then run getMultiWordTokens.pl on the lexicon, putting the results in a file tokenFile. Now you are ready to use ngramDir.pl. First, put your corpus in a directory; ngramDir.pl will recursively search this directory and process all files found. Second, you will need a file which contains all the multi-word tokens from the lexicon, one per line. Then you will need to run ngramDir.pl three times:
> perl getMultiWordTokens.pl lingdata > tokenFile
> perl ngramDir.pl uni tokenFile CORPUS/
> perl ngramDir.pl bi tokenFile CORPUS/
> perl ngramDir.pl tri tokenFile CORPUS/
> perl dicFilter.pl lingdata
This will create a set of files (which, despite being called unigram.bin, bigram.bin and trigram.bin, are not actually binary files). These files will contain statistics on all words found in the corpus. We could then use these files in the language model; however, it would be quite slow. It is best to first filter out all words that the recognition grammar won't recognise anyway. This is done using the script dicFilter.pl, which filters the N-gram statistics that will be kept; there is no point in keeping statistics about words which the recogniser will not be recognising.
Now you just have to put the binaries in the Language Model's data directory.
4. Language Model
4.1 Using the Language Model
The Language Model calculates the probabilities of certain sequences of words using uni-, bi- and trigram statistics. It takes in a file containing an N-best list returned from SYCON, which also contains acoustic scores for each entry. The Language Model calculates a probability, weights this together with the acoustic score, and returns the best (or perhaps, in the future, the M-best) analyses.
The N-best list should be in a file with the following format:
-100.7 when is the platform diving
-1000.8 when is the flat form diving
-2000.4 when is the mens riding
The filename of the N-best list is entered at the command prompt. In order to get your file into the above format, you should run SYCON to output in "Sycon3" mode, and then run the script cleanNbest.pl.
The configuration file looks like this:
# Configuration file for NLP components
SYRINX_data = I:\nlp\public\sylancomponents30\data
# Language Model mandatory parameters
LM_WEIGHTS = {0.1, 0.3, 0.6}
LM_FILES = {unigram.bin, bigram.bin, trigram.bin}
LM_SCALING = LOGDIV
LM_PROBABILISTIC = 1
# Language model optional parameters
LM_ACWEIGHT = 1.0
LM_LAWEIGHT = 50.0
# The following parameter turns on Language model logging
# to the specified file. For logging to the terminal use
# LM_LOGFILE = SCREEN. The log is appended so use it
# only for testing.
#LM_LOGFILE = SCREEN
LM_LOGFILE = I:\nlp\public\sylancomponents30\logs\lm.log
# The following parameter is only needed when running the
# Language model test in file mode
LM_TESTFILES_DIR = I:\nlp\public\sylancomponents30\testdata
# Analyzer parameters
LingFile = num.lin
GrammarFile = numgram.par
The configuration settings a Language Model researcher will most likely want to play with are the following:
The array LM_WEIGHTS assigns the relative importance placed on uni-, bi- and trigrams, respectively. The Language Model currently uses a smoothed N-gram model, with the smoothing parameters set by LM_WEIGHTS.
LM_PROBABILISTIC allows the user to toggle between statistical and non-statistical modes by setting LM_PROBABILISTIC to 1 or 0, respectively.
LM_SCALING allows the user to apply a function to the probability score. The purpose of doing this is to make the language score comparable with the acoustic score, as well as to include factors such as the number of words in each input string. Currently available options for scaling are LINEAR, LOG and LOGDIV.
LM_ACWEIGHT and LM_LAWEIGHT specify the weightings of the acoustic score and the language model's score respectively.
It should be noted that the Language Model can be run to take either files or keyboard input, and can output to either the screen or a log file.
4.2 Theory behind the Language Model
If P denotes the N-gram approximation of probabilities, then we use LM_WEIGHTS = (a1, a2, a3) to define smoothed probabilities as follows:

F(wi | wi-2, wi-1) = a1*P(wi) + a2*P(wi | wi-1) + a3*P(wi | wi-2, wi-1)

F(wi | wi-1) = a1/(a1+a2)*P(wi) + a2/(a1+a2)*P(wi | wi-1)

F(wi) = P(wi)
In the non-statistical case, the same formulas are used, except that we redefine P as:
P(wi | wi-1) = 1, if the bigram (wi-1, wi) exists in our N-gram files
P(wi | wi-1) = 0, otherwise
The language score is the product of the language function applied to each word in the string successively.
L(w1, w2, ..., wn) = F(w1) * F(w2 | w1) * F(w3 | w1, w2) * ... * F(wn | wn-2, wn-1)
The final language model score is a weighting of the acoustic score A and the scaled language score:
LM(w1, w2, ..., wn) = b*A(w1, ..., wn) + (1-b)*scaling(L(w1, ..., wn))
Where scaling is the scaling function specified by the user.
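As a sketch of the calculation, assuming the component probabilities P are already available (for example from the back-off tables described earlier in this document):

// Sketch of the smoothed score, the language score and its combination with the
// acoustic score. Weights mirror LM_WEIGHTS, LM_ACWEIGHT and LM_LAWEIGHT; the
// LOGDIV scaling divides the log score by the number of words. Illustrative only.
#include <cmath>
#include <cstddef>
#include <vector>

double smoothedTrigram(double p1, double p2, double p3,        // P(wi), P(wi|wi-1), P(wi|wi-2,wi-1)
                       double a1, double a2, double a3) {      // LM_WEIGHTS
    return a1 * p1 + a2 * p2 + a3 * p3;                        // F(wi | wi-2, wi-1)
}

double languageScore(const std::vector<double>& smoothedProbs) {
    double score = 1.0;
    for (double p : smoothedProbs) score *= p;                 // L(w1, ..., wn)
    return score;
}

double finalScore(double acousticScore, double lmScore, std::size_t nWords,
                  double acWeight = 1.0, double laWeight = 50.0) {
    double scaled = std::log(lmScore) / static_cast<double>(nWords);   // LOGDIV
    return acWeight * acousticScore + laWeight * scaled;
}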
The output of the Language Model is then the N-best list candidate with the highest final language model score. We leave open the possibility of returning, in the future, an M-best list of the highest scoring candidates.
5. Examples
In this section I show the results that were obtained by running the Language Model with three different settings. In each case the input N-best list was SYCON's analysis of "When are the men's one thirty two eliminations?".
Example 1
In the first example we use a corpus of 250,000 words from the Sydney Morning Herald, together with referring expressions generated from the database, and the following configuration settings:
LM_WEIGHTS = (0.1, 0.3, 0.6)
LM_SCALING = LOGDIV
LM_PROBABILISTIC = 1
LM_ACWEIGHT = 1.0
LM_LAWEIGHT = 50.0
Example 2
In the second example we use a corpus of 250,000 words from the Sydney Morning Herald, together with referring expressions generated from the database, and also questions from Robert and Lyn's reports, with the following configuration settings:
LM_WEIGHTS = (0.1, 0.3, 0.6)
LM_SCALING = LOGDIV
LM_PROBABILISTIC = 1
This can be useful in situations such as:
SYLAN: Did you say the mens or the womens breaststroke?
USER: The mens.
In this situation the DFC would send an ALARM that would specify that we're listening for "mens" and "womens", so that rival candidates to "the mens" that did not contain a gender term (e.g. "tennis", "swimming") would be penalised. However the user may respond as below:
SYLAN:
USER:
In these cases till be optimal if its acoustic sc
This proposal would require two changes to the ALARM format.
1. The ALARM syntax would include "NOT", so that "NOT mens AND NOT womens" would penalise candidates that didn't contain a gender term. If we have already ascertained that the speaker didn't say "breaststroke", the ALARM would be:
"(NOT mens AND NOT womens) OR breaststroke"
2. An ALARM could optionally include a penalty size, e.g. "-20", which would be added to the scores of candidates meeting the ALARM requirements. In the absence of a penalty size, the ALARM Processor would revert to a default. If the penalty size is set extremely high, it will have the effect of killing off candidates for good. A sketch of this extended format follows.
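The following sketch illustrates what the extended format might look like in the ALARM Processor; the expression representation and the reuse of the NBestEntry structure from the earlier sketch are assumptions made for illustration only.

// Sketch of the proposed extension: an ALARM carries a boolean condition over a
// candidate's semantic values plus an optional penalty size. Illustrative only.
#include <functional>
#include <string>
#include <vector>

struct ExtendedAlarm {
    // e.g. a condition equivalent to "(NOT mens AND NOT womens) OR breaststroke"
    std::function<bool(const std::vector<std::string>&)> matches;
    double penalty = 20.0;                      // default used when no size is given
};

void applyExtendedAlarm(std::vector<NBestEntry>& nbest, const ExtendedAlarm& alarm) {
    for (auto& candidate : nbest)
        if (alarm.matches(candidate.semanticValues))
            candidate.score -= alarm.penalty;   // a very large penalty effectively kills the candidate
}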
Example 3
In the third example we use a corpus of 250,000 words from the Sydney Morning Herald, together with referring expressions generated from the database, and also questions from Robert and Lyn's reports, with the following configuration settings:
LM_WEIGHTS = (0.1,0.3,0.6)
LM_SCALING = LOGDIV
LM_PROBABILISTIC = 0
LM_ACWEIGHT = 1.0
LM_LAWEIGHT = 50.0
Example 4
Here I show how using the statistical setting can help us. We run the LM twice on the same N-best list, once in statistical mode and once in non-statistical mode. In each case the corpus is 250,000 words from the Sydney Morning Herald, together with referring expressions generated from the database, and also questions from Robert and Lyn's reports, with the following configuration settings:
LM_WEIGHTS = (0.1,0.3,0.6)
LM_SCALING = LOGDIV
LM_PROBABILISTIC = 0/1
LM_ACWEIGHT = 1.0
LM_LAWEIGHT = 50.0
In non-statistical mode, the results were:
Note that the first two candidates received almost the same acoustic score, and the first candidate won on the basis of its linguistic score.
When the statistical method was used, however, the situation was reversed; the second candidate gets a better linguistic score and wins the contest. This can be attributed to the fact that "final" is a more common word than "quarterfinal".
The above describes only some embodiments of the present invention and modifications, obvious to those skilled in the art, can be made thereto without departing from the scope and spirit of the present invention.

Claims

1. A language assisted recognition module for an automated speech recognition system, said module incorporating a plurality of sub-modules.
2. The language assisted recognition module of Claim 1 wherein said sub-modules process data in series.
3. The language assisted recognition module of Claim 1 or Claim 2 wherein said sub-modules include one or more modules selected from a duration model; a language repairer; an adaptive language assisted recognition message processor; a culler; a language model/digit sum checker; a confidence model.
4. The language assisted recognition module of any previous claim in communication with a dialogue processing system.
5. The language assisted recognition module of Claim 4 wherein said dialogue processing system includes a dialogue flow controller.
6. The language assisted recognition module of Claim 4 or Claim 5 incorporating feedback means from said utterance processing system to said language assisted recognition module.
7. The language assisted recognition module of Claim 6 wherein said feedback means passes feedback data from said utterance processing system to said language assisted recognition module for adaptive processing within said language assisted recognition module.
8. The language assisted recognition module of Claim 7 wherein said feedback data comprises an adaptive language assisted recognition message derived from a dialogue flow controller within said automated speech recognition system.
PCT/AU2002/000801 2001-06-19 2002-06-19 Language assisted recognition module WO2002103672A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AUPR5788A AUPR578801A0 (en) 2001-06-19 2001-06-19 Language assisted recognition module
AUPR5788 2001-06-19

Publications (1)

Publication Number Publication Date
WO2002103672A1 true WO2002103672A1 (en) 2002-12-27

Family

ID=3829759

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2002/000801 WO2002103672A1 (en) 2001-06-19 2002-06-19 Language assisted recognition module

Country Status (2)

Country Link
AU (1) AUPR578801A0 (en)
WO (1) WO2002103672A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2197191A1 (en) 2002-01-11 2010-06-16 Thomson Licensing Directory delivery system and method for a digital subscriber line modem

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5566272A (en) * 1993-10-27 1996-10-15 Lucent Technologies Inc. Automatic speech recognition (ASR) processing using confidence measures
EP1020847A2 (en) * 1999-01-18 2000-07-19 Nokia Mobile Phones Ltd. Method for multistage speech recognition using confidence measures
US6125345A (en) * 1997-09-19 2000-09-26 At&T Corporation Method and apparatus for discriminative utterance verification using multiple confidence measures
US6208964B1 (en) * 1998-08-31 2001-03-27 Nortel Networks Limited Method and apparatus for providing unsupervised adaptation of transcriptions

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5566272A (en) * 1993-10-27 1996-10-15 Lucent Technologies Inc. Automatic speech recognition (ASR) processing using confidence measures
US6125345A (en) * 1997-09-19 2000-09-26 At&T Corporation Method and apparatus for discriminative utterance verification using multiple confidence measures
US6208964B1 (en) * 1998-08-31 2001-03-27 Nortel Networks Limited Method and apparatus for providing unsupervised adaptation of transcriptions
EP1020847A2 (en) * 1999-01-18 2000-07-19 Nokia Mobile Phones Ltd. Method for multistage speech recognition using confidence measures


Also Published As

Publication number Publication date
AUPR578801A0 (en) 2001-07-12

Similar Documents

Publication Publication Date Title
US6501833B2 (en) Method and apparatus for dynamic adaptation of a large vocabulary speech recognition system and for use of constraints from a database in a large vocabulary speech recognition system
US7249019B2 (en) Method and apparatus for providing an integrated speech recognition and natural language understanding for a dialog system
US8612212B2 (en) Method and system for automatically detecting morphemes in a task classification system using lattices
US20180137109A1 (en) Methodology for automatic multilingual speech recognition
US10170107B1 (en) Extendable label recognition of linguistic input
Ward Extracting information in spontaneous speech.
US20030191625A1 (en) Method and system for creating a named entity language model
US5875426A (en) Recognizing speech having word liaisons by adding a phoneme to reference word models
EP1538535A2 (en) Determination of meaning for text input in natural language understanding systems
US20020133346A1 (en) Method for processing initially recognized speech in a speech recognition session
CA2481080C (en) Method and system for detecting and extracting named entities from spontaneous communications
JP2000200273A (en) Speaking intention recognizing device
Beaufays et al. Learning name pronunciations in automatic speech recognition systems
AbuZeina et al. Cross-word modeling for Arabic speech recognition
López-Cózar et al. Combining language models in the input interface of a spoken dialogue system
KR20050101695A (en) A system for statistical speech recognition using recognition results, and method thereof
US6772116B2 (en) Method of decoding telegraphic speech
KR20050101694A (en) A system for statistical speech recognition with grammatical constraints, and method thereof
WO2002103672A1 (en) Language assisted recognition module
Duchateau et al. Handling disfluencies in spontaneous language models
JPH10232693A (en) Voice recognition device
Ringger A robust loose coupling for speech recognition and natural language understanding
Roussel et al. Filtering errors and repairing linguistic anomalies for spoken dialogue systems
JP2779333B2 (en) Language analyzer
WO2003001504A1 (en) Stochastic chunk parser

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP