CN100568222C - Divergence elimination language model - Google Patents
- Publication number
- CN100568222C (application numbers CNB021065306A, CN02106530A)
- Authority
- CN
- China
- Prior art keywords
- character
- phrase
- language model
- word
- word phrase
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Machine Translation (AREA)
Abstract
A language model for a language processing system, such as a speech recognition system, is constructed as a function of associated characters, word phrases, and context markers. Also disclosed are a method and apparatus for generating a training corpus used to train the language model, and a system or module that uses the disclosed language model.
Description
Background of the Invention
The present invention relates to language modeling. More particularly, the invention relates to creating and using a language model for minimizing ambiguity during recognition of characters, such as characters in input speech.
Accurate speech recognition requires more than just an acoustic model for selecting the correct word spoken by the user. In other words, if a speech recognizer must choose or determine which word has been spoken, and all the candidate words have the same pronunciation, the recognizer obviously cannot perform satisfactorily. A language model provides a method or means of specifying which word sequences in the vocabulary are likely, or, more generally, provides information about the likelihood of various word sequences.
Speech recognition is often viewed as a form of top-down language processing. Two general forms of language processing are "top-down" and "bottom-up". Top-down language processing begins with the largest unit of language to be recognized, such as a sentence, and processes it by classifying it into smaller units, such as phrases, which are in turn divided into still smaller units, such as words. In contrast, bottom-up language processing begins with words and builds larger phrases and/or sentences from them. Both forms of language processing can benefit from a language model.
One well-known technique uses an N-gram language model. Because N-grams can be trained on large amounts of data, the dependence among N words usually captures shallow syntactic and semantic structure at the same time. Although N-gram language models can perform well for general dictation, homophones can cause significant errors. A homophone is an element of a language's code, such as a character or syllable, that is one of two or more elements pronounced alike but spelled differently. For example, when a user is spelling out characters, the speech recognition module may output the wrong character because some characters are pronounced identically. Likewise, the module may output the wrong character for characters that sound similar to each other when pronounced (for example, "m" and "n").
The ambiguity problem is especially prevalent in languages such as Japanese or Chinese, which are written primarily with a character-based writing system. The characters of these languages are numerous, complex pictographs representing both sound and meaning. The characters map to a limited set of syllables, which in turn produces a large number of homophones, greatly increasing the time required to generate a document by dictation. In particular, incorrect homophone characters must be identified in the document and the correct homophone characters inserted.
There is thus a continuing need to develop new methods for minimizing the ambiguity that arises when a speaker utters homophones, or similar-sounding speech with different meanings. As technology develops and speech recognition is provided in more applications, a more accurate language model is required.
Summary of the Invention
Speech recognizers commonly use a language model, such as an N-gram language model, to improve accuracy. A first aspect of the invention is generating a language model that is particularly useful when a speaker is identifying a character or characters (e.g., a syllable), such as when spelling a word. The language model helps disambiguate homophones and different characters that sound similar to each other. The language model is constructed from a training corpus containing associated elements: a character string (which can be a single character), a word phrase containing the character string (which can be a single word), and a context marker. Using a word list or dictionary, the training corpus can be generated automatically by forming, for each word phrase, a partial sentence or phrase comprising the word phrase, the context marker, and a character string of the word phrase. In a further embodiment, a phrase is generated for each character of the word phrase.
Another aspect of the present invention is a system or module that uses the language model described above for recognizing spoken characters. When a character string is spoken together with a context marker and an associated word phrase, the speech recognition module determines that the user is spelling out or identifying characters. The speech recognition module then outputs only the recognized characters, not the context marker or the associated word phrase. In yet another embodiment, the speech recognition module compares the recognized character with a recognized word phrase to verify that the correct character has been recognized. If the recognized character is not present in the recognized word phrase, a character of the recognized word phrase is output instead.
Brief Description of the Drawings
FIG. 1 is a block diagram of a language processing system.
FIG. 2 is a block diagram of an exemplary computing environment.
FIG. 3 is a block diagram of an exemplary speech recognition system.
FIG. 4 is a flow chart of a method of the present invention.
FIG. 5 is a block diagram of modules for implementing the method of FIG. 4.
FIG. 6 is a block diagram of a speech recognition module and an optional character verification module.
Detailed Description of Illustrative Embodiments
FIG. 1 shows a language processing system 10 that receives a language input 12 and processes the language input 12 to provide a language output 14. For example, the language processing system 10 can be embodied as a speech recognition system or module that receives, as the language input 12, speech spoken or recorded by a user. The language processing system 10 processes the spoken language and provides, as an output, recognized words and/or characters in the form of text.
During processing, the speech recognition system or module 10 can access a language model 16 to determine which word, and in particular which homophone or other similar-sounding element of the spoken language, has been uttered. The language model 16 encodes a particular language, such as English, Chinese, Japanese, and so on. In the illustrated embodiment, the language model 16 can be a statistical language model, such as an N-gram language model, a context-free grammar, or a hybrid of these, all of which are well known in the art. One primary aspect of the invention is a method of creating and constructing the language model 16. Another primary aspect is the use of that method in speech recognition.
Before discussing the invention in detail, an overview of the operating environment may be useful. FIG. 2 and the associated discussion provide a brief, general description of a suitable computing environment 20 in which the invention may be implemented. The computing environment 20 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Nor should the computing environment 20 be interpreted as having any dependency or requirement relating to any one component, or combination of components, illustrated in the exemplary operating environment 20.
The invention operates with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with the invention include, but are not limited to, personal computers, server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. In addition, the invention can be used in telephone systems.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so on, that perform particular tasks or implement particular abstract data types. The invention is also applicable in distributed computing environments, where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices. The tasks performed by the programs and modules are described below with the aid of the figures.
Referring to FIG. 2, an exemplary system for implementing the invention includes a general-purpose computing device in the form of a computer 30. Components of the computer 30 include, but are not limited to, a processing unit 40, a system memory 50, and a system bus 41 that couples various system components, including the system memory, to the processing unit 40. The system bus 41 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus, also known as the Mezzanine bus.
The computer 30 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 30 and include both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the computer 30. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above are also included within the scope of computer-readable media.
The system memory 50 includes computer storage media in the form of volatile and/or nonvolatile memory, such as read-only memory (ROM) 51 and random access memory (RAM) 52. A basic input/output system (BIOS) 53, containing the basic routines that help to transfer information between components within the computer 30, such as during start-up, is typically stored in the ROM 51. The RAM 52 typically contains data and/or program modules that are immediately accessible to, and/or presently being operated on by, the processing unit 40. By way of example, and not limitation, FIG. 2 illustrates an operating system 54, application programs 55, other program modules 56, and program data 57.
The computer 30 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 2 illustrates a hard disk drive 61 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 71 that reads from or writes to a removable, nonvolatile magnetic disk 72, and an optical disk drive 75 that reads from or writes to a removable, nonvolatile optical disk 76, such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid-state RAM, solid-state ROM, and the like. The hard disk drive 61 is typically connected to the system bus 41 through a non-removable memory interface, such as interface 60, and the magnetic disk drive 71 and the optical disk drive 75 are typically connected to the system bus 41 through a removable memory interface, such as interface 70.
The drives and their associated computer storage media, discussed above and illustrated in FIG. 2, provide storage of computer-readable instructions, data structures, program modules, and other data for the computer 30. In FIG. 2, for example, the hard disk drive 61 is illustrated as storing an operating system 64, application programs 65, other program modules 66, and program data 67. Note that these components can either be the same as, or different from, the operating system 54, application programs 55, other program modules 56, and program data 57. The operating system 64, application programs 65, other program modules 66, and program data 67 are given different reference numbers here to illustrate that, at a minimum, they are different copies.
A user may enter commands and information into the computer 30 through input devices such as a keyboard 82, a microphone 83, and a pointing device 81, such as a mouse, trackball, or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 40 through a user input interface 80 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port, or universal serial bus (USB). A monitor 84 or other type of display device is also connected to the system bus 41 via an interface, such as a video interface 85. In addition to the monitor, computers may also include other peripheral output devices, such as speakers 87 and a printer 86, which are connected through an output peripheral interface 88.
The computer 30 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 94. The remote computer 94 may be a personal computer, a handheld device, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the components described above relative to the computer 30. The logical connections depicted in FIG. 2 include a local area network (LAN) 91 and a wide area network (WAN) 93, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.
When used in a LAN networking environment, the computer 30 is connected to the LAN 91 through a network interface or adapter 90. When used in a WAN networking environment, the computer 30 typically includes a modem 92 or other means for establishing communications over the WAN 93, such as the Internet. The modem 92, which may be internal or external, is connected to the system bus 41 via the user input interface 80 or another appropriate mechanism. In a networked environment, program modules described relative to the computer 30, or portions thereof, may be stored in a remote memory storage device. By way of example, and not limitation, FIG. 2 illustrates remote application programs 95 as residing on the remote computer 94. It will be appreciated that the network connections shown are exemplary, and other means of establishing a communications link between the computers may be used.
An exemplary embodiment of a speech recognition system 100 is illustrated in FIG. 3. The speech recognition system 100 includes the microphone 83, an analog-to-digital (A/D) converter 104, a training module 105, a feature extraction module 106, a lexicon storage module 110, an acoustic model 112 with senone trees, a tree search engine 114, the language model 16, and a general-purpose language model 111. It should be noted that the entire system 100, or part of the speech recognition system 100, can be implemented in the environment illustrated in FIG. 2. For example, the microphone 83 can serve as an input device to the computer 30 through an appropriate interface and through the A/D converter 104. The training module 105 and the feature extraction module 106 can be either hardware modules in the computer 30 or software modules stored in any of the information storage devices disclosed in FIG. 2 and accessible by the processing unit 40 or another suitable processor. In addition, the lexicon storage module 110, the acoustic model 112, and the language models 16 and 111 are preferably also stored in any of the memory devices shown in FIG. 2. Furthermore, the tree search engine 114 is implemented in the processing unit 40 (which can include one or more processors) or can be performed by a dedicated speech recognition processor employed by the computer 30.
In the illustrated embodiment, during speech recognition, speech is provided by the user as an input to the system 100, in the form of an audio voice signal, to the microphone 83. The microphone 83 converts the audio speech signal into an analog electronic signal, which is provided to the A/D converter 104. The A/D converter 104 converts the analog speech signal into a sequence of digital signals, which is provided to the feature extraction module 106. In one embodiment, the feature extraction module 106 is a conventional array processor that performs spectral analysis on the digital signals and computes a magnitude value for each frequency band of a frequency spectrum. In one illustrative embodiment, the signals are provided to the feature extraction module 106 by the A/D converter 104 at a sample rate of approximately 16 kHz.
The feature extraction module 106 divides the digital signal received from the A/D converter 104 into frames that include a plurality of digital samples. The duration of each frame is approximately 10 milliseconds. The frames are then encoded by the feature extraction module 106 into feature vectors reflecting the spectral characteristics of a plurality of frequency bands. In the case of discrete and semi-continuous hidden Markov modeling, the feature extraction module 106 also encodes the feature vectors into one or more code words, using vector quantization techniques and a codebook derived from training data. Thus, the feature extraction module 106 provides, at its output, feature vectors (or code words) for each spoken utterance, at a rate of approximately one feature vector (or code word) every 10 milliseconds.
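For illustration only, the framing step described above can be sketched as follows (a toy sketch under the ~10 ms / 16 kHz figures given in the text; the patent's feature extraction module additionally performs spectral encoding, which is omitted here):

```python
def frame_signal(samples, sample_rate=16000, frame_ms=10):
    """Split a digitized signal into consecutive frames of ~frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame: 160 at 16 kHz / 10 ms
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

frames = frame_signal(list(range(480)))  # 30 ms of fake audio
# yields three 160-sample frames
```

Each frame would then be mapped to a feature vector (or code word) by the spectral-analysis stage.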
Output probability distributions are then computed against hidden Markov models using the feature vectors (or code words) of the particular frame being analyzed. These probability distributions are later used in executing a Viterbi decoding process or a similar type of processing technique.
Upon receiving the code words from the feature extraction module 106, the tree search engine 114 accesses information stored in the acoustic model 112. The model 112 stores acoustic models, such as hidden Markov models, which represent speech units to be detected by the speech recognition system 100. In one embodiment, the acoustic model 112 includes a senone tree associated with each Markov state in a hidden Markov model. In one illustrative embodiment, the hidden Markov models represent phonemes. Based on the senones in the acoustic model 112, the tree search engine 114 determines the most likely phonemes represented by the feature vectors (or code words) received from the feature extraction module 106, and hence representative of the utterance received from the user of the system.
The tree search engine 114 also accesses the lexicon stored in the module 110. The information received by the tree search engine 114 based on its accessing of the acoustic model 112 is used in searching the lexicon storage module 110 to determine a word that most likely represents the code words or feature vectors received from the feature extraction module 106. At the same time, the search engine 114 accesses the language models 16 and 111. In one embodiment, the language model 16 is a word N-gram used to identify the most likely character or characters represented by the input speech; it comprises a character (or characters), a context marker, and a word phrase identifying the character. For example, the input speech can be "N as in Nancy", where "N" (which can also be lowercase) is the desired character, "as in" is the context marker, and "Nancy" is a word phrase associated with the character "N" to clarify or identify the desired character. For the phrase "N as in Nancy", the output of the speech recognition system 100 can be the character "N" alone. In other words, after analyzing the input speech data for the phrase "N as in Nancy", the speech recognition system 100 determines that the user has chosen to spell out a character. Accordingly, the context marker and the associated word phrase are omitted from the output text. The search engine 114 can delete the context marker and the associated word phrase when necessary.
It should be noted that, in this embodiment, the language model 111 is a word N-gram used to identify the most likely words represented by input speech for general dictation. For example, when the speech recognition system 100 is embodied as a dictation system, the language model 111 provides an indication of the most likely words for general dictation; however, when the user utters a phrase containing a context marker, the output from the language model 16 will have a higher value than that of the language model 111 for the same phrase. The higher value from the language model 16 is used in the system 100 as an indication that the user is identifying characters by means of a context marker and a word phrase. Accordingly, for an input phrase containing a context marker, the search engine 114 or other processing components of the speech recognition system 100 will omit the context marker and the word phrase and output only the desired character. The use of the language model 16 is discussed further below.
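The model-comparison behavior described above can be illustrated schematically (a sketch only: the scorers and the `choose_output` helper are toy stand-ins for the system's actual language-model probabilities, not the patent's implementation):

```python
def choose_output(phrase, score_spelling_lm, score_general_lm, extract_char):
    """If the spelling language model scores the phrase higher than the general
    dictation model, treat it as a spelling phrase and emit only the character."""
    if score_spelling_lm(phrase) > score_general_lm(phrase):
        return extract_char(phrase)
    return phrase

# Toy scorers: the spelling LM strongly prefers phrases containing a context marker.
spelling_lm = lambda p: 0.9 if " as in " in p else 0.01
general_lm = lambda p: 0.2
first_char = lambda p: p.split()[0]

choose_output("N as in Nancy", spelling_lm, general_lm, first_char)  # → 'N'
choose_output("the meeting is at noon", spelling_lm, general_lm, first_char)
# → 'the meeting is at noon'
```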
Although the speech recognition system 100 described herein uses HMM modeling and senone trees, it should be understood that this is only one illustrative embodiment. Those of ordinary skill in the art will recognize that the speech recognition system 100 can take many forms; all that is required is that it use the features of the language model 16 and provide, as an output, the text spoken by the user.
As is well known, a statistical N-gram language model produces a probability estimate for a word given the word sequence up to that word (i.e., given the word's history H). An N-gram language model considers only the preceding (n-1) words in the history H as having any influence on the probability of the next word. For example, a bigram (or 2-gram) language model considers the previous word as having an influence on the next word. Therefore, in an N-gram language model, the probability of a word occurring is represented as follows:
P(w/H) = P(w/w1, w2, … w(n-1))    (1)
where w is the word of interest;
w1 is the word located n-1 positions prior to the word w;
w2 is the word located n-2 positions prior to the word w; and
w(n-1) is the first word prior to the word w in the sequence.
Also, the probability of a word sequence is determined based on the multiplication of the probability of each word given its history. Thus, the probability of a word sequence (w1 … wm) is represented as follows:

P(w1, w2, … wm) = ∏ P(wi/w(i-n+1), … w(i-1))    (2)

where the product runs over i = 1 to m, and words with indices below 1 are taken as sentence-initial context.
N-gram models are obtained by applying an N-gram algorithm to a corpus of training text data (a collection of phrases, sentences, sentence fragments, paragraphs, and so on). An N-gram algorithm may use, for example, known statistical techniques such as the Katz technique or the binomial posterior distribution backoff technique. In applying these techniques, the algorithm estimates the probability that a word w(n) will follow a sequence of words w1, w2, …, w(n-1). These probability values collectively form the N-gram language model. Certain aspects of the invention described below can also be applied in constructing a standard statistical N-gram model.
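The estimation step can be illustrated with a toy maximum-likelihood bigram trainer (a sketch only; it omits the Katz or backoff smoothing the text mentions, and the `train_bigram` helper is hypothetical, not part of the patent):

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(w | w_prev) from raw counts over a toy corpus of phrases."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split()      # "<s>" marks the phrase start
        unigrams.update(words[:-1])             # count each word used as a history
        bigrams.update(zip(words[:-1], words[1:]))
    return lambda w, prev: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

# Duplicating a phrase raises its probability, as discussed for the corpus below.
corpus = ["N as in Nancy", "N as in Nancy", "N as in notch"]
p = train_bigram(corpus)
p("Nancy", "in")  # 2 of the 3 phrases continue "in" with "Nancy" → 2/3
```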
The first primary aspect of the invention is illustrated in FIG. 4 as a method 140 of creating a language model for a language processing system for identifying characters. Reference is also made to FIG. 5, a system or apparatus 142 comprising modules with instructions for implementing the method 140. Generally, the method 140 includes, for each word phrase of a word phrase list, associating, at step 144, a character string of the word phrase and the word phrase with a context marker that indicates identification of the character string. It should be noted that the character string can comprise a single character. Likewise, a word phrase can comprise a single word. For example, for a character string equal to one character and a word phrase equal to one word, step 144 associates a character of the word with a context marker for each word in the word list 141. A context marker is typically a word or word phrase in the particular language used by a speaker to identify a language element in a word phrase. Examples of context markers in English include "as in", "for example", "as found in", "like", "such as", and the like. Similar words or word phrases can be found in other languages, for example の in Japanese and 的 in Chinese. In one embodiment, step 144 includes constructing a corpus of word phrases 143. Each phrase includes a character string, a word phrase, and a context marker. Typically, when a single character is associated with a word, the first character is used, although another character of the word can also be used. Examples of such phrases include "N as in Nancy", "P as in Paul", and "Z as in zebra".
In another embodiment, another character of the word is associated with the word and the context marker, and in some languages, such as Chinese, where many words comprise only one, two, or three characters, it can be helpful to associate each character of the word with the word and the context marker. As indicated above, a simple way to associate the desired character with the corresponding word and context marker is to form the same kind of word phrase. Therefore, given a word list 141, a corpus of word phrases 143 for training the language model can easily be generated with all of the desired context markers.
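The corpus-generation step can be sketched as follows (a minimal illustration assuming an English word list and the context marker "as in"; the function name and flags are hypothetical, not the patent's):

```python
def build_training_corpus(word_list, marker="as in", every_char=False):
    """Form a phrase associating a character, a context marker, and a word phrase,
    e.g. 'N as in Nancy', for each word in the word list."""
    corpus = []
    for word in word_list:
        chars = word if every_char else word[0]   # each character, or just the first
        for ch in chars:
            corpus.append(f"{ch.upper()} {marker} {word}")
    return corpus

build_training_corpus(["Nancy", "Paul", "zebra"])
# ['N as in Nancy', 'P as in Paul', 'Z as in zebra']
```

With `every_char=True` the generator emits one phrase per character, the variant the text suggests for languages whose words have only a few characters.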
Based on the corpus 143, the language model 16 is constructed using a conventional model builder 146, such as an N-gram model builder, implementing well-known techniques for constructing the language model 16. Block 148 represents constructing the language model 16 in the method 140, where the language model 16 includes, but is not limited to, an N-gram language model, a context-free grammar, or a hybrid of the two.
The generated phrases can be assigned an appropriate value that, upon formation of the language model, will yield an appropriate probability value. In the example above, "N as in Nancy" is more likely to be spoken than the phrase "N as in notch". Therefore, yet another feature of the invention includes adjusting the probability score for each associated character string and word phrase in the language model. The probability scores can be adjusted upon creation of the language model 16. In another embodiment, the probability scores can be adjusted by including a sufficient number of identical word phrases in the corpus 143 to yield an appropriate probability value for the associated characters and word phrases in the language model. The probability value can also be a function of the likelihood of use of the word phrase. Generally, some word phrases are used more frequently than others to identify a character or characters. Such word phrases can be assigned, or otherwise provided with, a higher probability value in the language model.
FIG. 6 illustrates a speech recognition module 180 and the language model 16. The speech recognition module 180 can be of the type described above; however, it should be understood that the speech recognition module 180 is not limited to that embodiment and can take many forms. As indicated above, the speech recognition module 180 receives data representing input speech and accesses the language model 16 to determine whether the input speech includes a phrase having a context marker. Upon detecting a word phrase having a context marker, the speech recognition module 180 provides, as an output, only the character or characters associated with the context marker and the word phrase, rather than the context marker or the word phrase. In other words, although the speech recognition module detects the complete phrase "N as in Nancy", it provides only "N" as the output. This output is particularly useful in a dictation system, where the speaker has individually chosen to identify the desired character or characters.
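The output behavior described above, emitting only the character while suppressing the context marker and word phrase, can be sketched as a post-processing step (a hypothetical helper; the patent leaves the deletion mechanism to the search engine, and this sketch assumes the English marker "as in"):

```python
import re

# Pattern for a recognized spelling phrase of the form "<char> as in <word>".
SPELLING_PHRASE = re.compile(r"^([A-Za-z]) as in (\w+)$")

def filter_output(recognized_text):
    """If the recognized text is a spelling phrase, return only the character."""
    m = SPELLING_PHRASE.match(recognized_text)
    return m.group(1) if m else recognized_text

filter_output("N as in Nancy")  # → 'N'
filter_output("hello world")    # → 'hello world'
```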
At this point, it should be noted that the language model 16 described above is constructed essentially from associated character strings, word phrases, and context markers, thereby allowing the language model 16 to be sensitive to input speech having this form. In the embodiment of FIG. 3, the general-purpose language model 111 can be used for input speech that does not take the particular form of character strings, word phrases, and context markers. It should be understood, however, that in these two embodiments the language models 16 and 111 can be combined if desired.
Upon receipt of the input speech and access to the language model 16, the speech recognition module 180 determines a recognized character string and a recognized word phrase for the input speech. In many cases, the recognized character string will be correct due to the use of the language model 16. However, in yet another embodiment, a character verification module 182 can be included to correct at least some of the errors made by the speech recognition module 180. The character verification module 182 accesses the recognized character string and the recognized word phrase determined by the speech recognition module 180 and compares them, in particular, verifying that the recognized character string is present in the recognized word phrase. If the recognized character string is not in the recognized word phrase, an error has clearly occurred, although the error can originate either with the speaker, who dictated an incorrect phrase such as "M as in Nancy", or with the speech recognition module 180 misinterpreting the recognized character string or the recognized word phrase. In one embodiment, the character verification module 182 can assume that the error is more likely to be in the recognized character string, and therefore replaces the recognized character string with a character present in the recognized word phrase. Substituting a character of the recognized word phrase for the recognized character string can be performed based on a comparison of the acoustic similarity between the recognized character string and the characters present in the recognized word phrase. To this end, the character verification module 182 can access stored data pertaining to the sounds of the individual characters when spoken. Using the characters present in the recognized word phrase, the character verification module 182 compares the stored sound data of each character in the recognized word phrase with the recognized character string, and the closest character is provided as the output. As appreciated by those skilled in the art, the character verification module 182 can be included in the speech recognition module 180; however, for purposes of explanation, the character verification module 182 is illustrated separately.
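A character verification module along these lines can be sketched as follows (a simplification: "acoustic similarity" is stood in for by a small hand-made confusion table, whereas the patent derives it from stored sound data; all names here are hypothetical):

```python
# Hypothetical acoustic-confusion table: characters that sound alike when spoken.
CONFUSABLE = {"m": {"n"}, "n": {"m"}, "b": {"d", "p"}, "d": {"b", "t"}}

def verify_character(recognized_char, recognized_word):
    """Return recognized_char if it occurs in recognized_word; otherwise substitute
    the character of the word that sounds closest to the recognized character."""
    c = recognized_char.lower()
    word_chars = set(recognized_word.lower())
    if c in word_chars:
        return recognized_char
    # Prefer a character of the word known to be confusable with the recognized one.
    for cand in CONFUSABLE.get(c, set()):
        if cand in word_chars:
            return cand.upper()
    return recognized_word[0].upper()  # fall back to the word's first character

verify_character("M", "Nancy")  # 'M' is not in "Nancy"; 'n' is confusable → 'N'
verify_character("N", "Nancy")  # → 'N'
```

This mirrors the dictation-error example in the text: "M as in Nancy" is corrected to "N".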
Although the present invention has been described with reference to preferred embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.
Claims (24)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/773,342 | 2001-01-31 | ||
US09/773,342 US6507453B2 (en) | 2000-02-02 | 2001-01-31 | Thin floppy disk drive capable of preventing an eject lever from erroneously operating |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1369830A CN1369830A (en) | 2002-09-18 |
CN100568222C true CN100568222C (en) | 2009-12-09 |
Family
ID=25097940
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB021065306A Expired - Fee Related CN100568222C (en) | 2001-01-31 | 2002-01-29 | Divergence elimination language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100568222C (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1727024A1 (en) * | 2005-05-27 | 2006-11-29 | Sony Ericsson Mobile Communications AB | Automatic language selection for text input in messaging context |
CN1940915B (en) * | 2005-09-29 | 2010-05-05 | 国际商业机器公司 | Corpus expansion system and method |
CN101256624B (en) * | 2007-02-28 | 2012-10-10 | 微软公司 | Method and system for establishing HMM topological structure being suitable for recognizing hand-written East Asia character |
JP5638948B2 (en) * | 2007-08-01 | 2014-12-10 | ジンジャー ソフトウェア、インコーポレイティッド | Automatic correction and improvement of context-sensitive languages using an Internet corpus |
US8326333B2 (en) | 2009-11-11 | 2012-12-04 | Sony Ericsson Mobile Communications Ab | Electronic device and method of controlling the electronic device |
WO2014203370A1 (en) * | 2013-06-20 | 2014-12-24 | 株式会社東芝 | Speech synthesis dictionary creation device and speech synthesis dictionary creation method |
CN103943109A (en) * | 2014-04-28 | 2014-07-23 | 深圳如果技术有限公司 | Method and device for converting voice to characters |
JP2016024212A (en) * | 2014-07-16 | 2016-02-08 | ソニー株式会社 | Information processing device, information processing method and program |
US10229687B2 (en) * | 2016-03-10 | 2019-03-12 | Microsoft Technology Licensing, Llc | Scalable endpoint-dependent natural language understanding |
CN113034995B (en) * | 2021-04-26 | 2023-04-11 | 读书郎教育科技有限公司 | Method and system for generating dictation content by student tablet |
- 2002-01-29: CN application CNB021065306A filed; granted as patent CN100568222C (status: not active, Expired - Fee Related)
Also Published As
Publication number | Publication date |
---|---|
CN1369830A (en) | 2002-09-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6934683B2 (en) | Disambiguation language model | |
US7590533B2 (en) | New-word pronunciation learning using a pronunciation graph | |
JP5040909B2 (en) | Speech recognition dictionary creation support system, speech recognition dictionary creation support method, and speech recognition dictionary creation support program | |
CN100568223C (en) | Method and apparatus for multimodal input of ideographic languages | |
US6539353B1 (en) | Confidence measures using sub-word-dependent weighting of sub-word confidence scores for robust speech recognition | |
US7124080B2 (en) | Method and apparatus for adapting a class entity dictionary used with language models | |
US6694296B1 (en) | Method and apparatus for the recognition of spelled spoken words | |
US6910012B2 (en) | Method and system for speech recognition using phonetically similar word alternatives | |
US6973427B2 (en) | Method for adding phonetic descriptions to a speech recognition lexicon | |
JP5014785B2 (en) | Phonetic-based speech recognition system and method | |
US7890325B2 (en) | Subword unit posterior probability for measuring confidence | |
US8135590B2 (en) | Position-dependent phonetic models for reliable pronunciation identification | |
US20080221890A1 (en) | Unsupervised lexicon acquisition from speech and text | |
US7844457B2 (en) | Unsupervised labeling of sentence level accent | |
US20080027725A1 (en) | Automatic Accent Detection With Limited Manually Labeled Data | |
US6449589B1 (en) | Elimination of left recursion from context-free grammars | |
US6502072B2 (en) | Two-tier noise rejection in speech recognition | |
CN100568222C (en) | Divergence elimination language model | |
KR20130126570A (en) | Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof | |
KR102299269B1 (en) | Method and apparatus for building voice database by aligning voice and script | |
Büler et al. | Using language modelling to integrate speech recognition with a flat semantic analysis | |
JP2005221752A (en) | Speech recognition apparatus, speech recognition method and program | |
Reddy et al. | Incorporating knowledge of source language text in a system for dictation of document translations | |
Reddy | Speech based machine aided human translation for a document translation task | |
Shinozaki et al. | A new lexicon optimization method for LVCSR based on linguistic and acoustic characteristics of words. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20091209. Termination date: 20150129. |
EXPY | Termination of patent right or utility model |