CN1755795B

CN1755795B - Method of constructing digital voice database of Chinese characters, Chinese digital series synthesis system and method

Info

Publication number: CN1755795B
Application number: CN2004100831497A
Authority: CN
Inventors: 夏海荣; 吴翔; 董娜; 贾磊
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2004-09-30
Filing date: 2004-09-30
Publication date: 2010-12-15
Anticipated expiration: 2024-09-30
Also published as: CN1755795A

Abstract

The disclosed method of constructing a Chinese numeral sound library comprises: generating an original sound library containing a plurality of numeral unit strings that represent the pronunciation of corresponding Chinese numeral strings; selecting from the original sound library bigram numeral units whose numeral units lie at the head, middle and tail of the Chinese numeral string; similarly selecting unigram numeral units; pruning those bigram numeral units whose adjacent numeral units influence each other only weakly; and building the target sound library from the pruned bigram numeral units and the selected unigram numeral units. The invention keeps the synthesized speech highly natural while reducing the size of the sound library to suit embedded devices with small memory.

Description

Method for constructing a Chinese numeral sound library, and system and method for synthesizing Chinese numeral strings

Technical field

The present invention relates to a method of constructing a sound library of Chinese numerals and to a system and method for synthesizing Chinese numeral strings, and in particular to a method of constructing a highly natural Chinese numeral sound library suited to small-memory environments, and to a highly natural Chinese numeral string synthesis system and method suited to such environments.

Background art

With advances in software and hardware, the computing power and storage capacity of information products such as PDAs and smartphones have improved greatly, which in turn has strengthened the demand for better user interfaces. Text-to-speech (TTS) technology converts text input into synthesized speech that is audible to humans and has a certain degree of intelligibility and naturalness. A computer system using this technology can greatly reduce the number of prerecorded speech files while still providing dynamically generated spoken responses, giving information products a better human-machine interface.

Numeral string synthesis is an important application of Chinese TTS technology, for example reading out telephone numbers on high-end mobile phones or directory-enquiry services, and reporting securities prices or industrial instrument readings. Because the numeral strings in a sentence usually carry more information, they draw more of the listener's attention, so numeral string synthesis is expected to deliver better sound quality, that is, higher naturalness and intelligibility.

Traditionally, the technique used to build a synthesis system depends on the application platform and setting. Unit concatenation based on a large-scale sound library offers high sound quality and speed and therefore suits server or desktop computing environments. Synthesis based on a vocal tract model needs fewer resources and is better suited to embedded applications, but yields only lower-quality output speech. A highly natural Chinese numeral string synthesis technique that works in small-memory environments is therefore needed.

Summary of the invention

The problem addressed by the present invention is therefore that existing Chinese numeral string synthesis methods cannot be used on mobile communication terminals: mobile devices have little memory, so a large-scale sound library designed for desktop computing environments cannot be ported to them directly.

That is, the present invention prunes the original sound library once it has been built, so that the library becomes small enough for small-memory applications while the quality of the synthesized speech is preserved.

In one aspect of the present invention, a method of constructing a Chinese numeral sound library is proposed, comprising the steps of: generating an original sound library comprising a plurality of numeral unit strings, each numeral unit string being a representation of the pronunciation of a corresponding Chinese numeral string; selecting from the original sound library bigram numeral units whose numeral units are located at the head, middle and tail of the Chinese numeral string; selecting from the original sound library unigram numeral units whose numeral units are located at the head, middle and tail of the Chinese numeral string; pruning those bigram numeral units in which the adjacent numeral units influence each other only weakly; and forming the target sound library from the pruned bigram numeral units and the selected unigram numeral units.

With the method of constructing a Chinese numeral sound library according to the present invention, a sound library that meets the small-memory requirement of embedded devices can be built while retaining a high degree of naturalness.

In addition, the method of constructing a Chinese numeral sound library according to the present invention further comprises the step of building, for each numeral unit, an index structure describing its location. The index structure comprises the offset of the numeral unit in the target sound library, the duration of the numeral unit, and the context structure of the numeral unit. The context structure comprises the code of the preceding numeral unit, the code of the following numeral unit and the code of the current numeral unit. The numeral units include representations of the pronunciation of the following Chinese characters: 零 (zero), 一 (one), 二 (two), 三 (three), 四 (four), 五 (five), 六 (six), 七 (seven), 八 (eight), 九 (nine), 十 (ten), 百 (hundred), 千 (thousand), 万 (ten thousand), 亿 (hundred million), 点 (point), 分 (fen), 之 (zhi) and 又 (you).

In addition, in the method of constructing a Chinese numeral sound library according to the present invention, for a numeral unit in a bigram numeral unit selected from the original sound library, its context unit in the original sound library is either a numeral string boundary or another unit that only weakly influences the numeral units of the bigram.

In addition, in the method of constructing a Chinese numeral sound library according to the present invention, for the numeral unit in a unigram numeral unit selected from the original sound library, its context unit in the original sound library is either a numeral string boundary or another numeral unit that only weakly influences the numeral unit of the unigram.

In addition, in the method of constructing a Chinese numeral sound library according to the present invention, the numeral units in the pruned bigram numeral units include at least: zero, one, five, six, and the tone-changed forms yao2, wu2, wan4, yi4 and you4.

In another aspect of the present invention, a method of synthesizing Chinese numeral strings is proposed, comprising the steps of: performing prosodic grouping according to the length and type of the input numeral string to obtain numeral string groups; extracting the context features of the numerals in each numeral string group, the context features comprising the code of the preceding numeral, the code of the following numeral and the code of the current numeral; selecting the corresponding numeral unit from the sound library according to the context features, thereby obtaining a numeral unit for each numeral in the group; and concatenating the waveforms of these numeral units for output; wherein the sound library is constructed using the method of constructing a Chinese numeral sound library described above.

With the method of synthesizing Chinese numeral strings according to the present invention, the reading sounds more pleasant and natural, and the noise caused by energy and period mismatches at concatenation points is eliminated.

In addition, in the method of synthesizing Chinese numeral strings according to the present invention, when a numeral is at the head of a numeral string or of a numeral string group, the code of the preceding numeral in its context features is the boundary marker. When a numeral is at the tail of a numeral string or of a numeral string group, the code of the following numeral in its context features is the boundary marker.
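The context extraction described here can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `context_features` is hypothetical, and units are represented by plain strings rather than by the numeric codes of Figure 1. The boundary marker "#" follows the notation used in the detailed description.

```python
BOUNDARY = "#"  # boundary marker used at the head and tail of a group


def context_features(group):
    """Return one (previous, current, next) triple per unit in the group.

    `group` is a sequence of unit labels, e.g. ["0", "2", "5"].
    Positions outside the group are replaced by the boundary marker.
    """
    features = []
    last = len(group) - 1
    for i, current in enumerate(group):
        previous = group[i - 1] if i > 0 else BOUNDARY
        following = group[i + 1] if i < last else BOUNDARY
        features.append((previous, current, following))
    return features
```

For the group "025", the first unit gets the boundary marker as its preceding context and the last unit gets it as its following context.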

In addition, in the method of constructing a Chinese numeral sound library according to the present invention, selecting a suitable numeral unit from the original sound library comprises scoring all candidates for the numeral unit of the current numeral and selecting the candidate with the highest score. The score reflects how well the context features of the current numeral match the context features of the candidate numeral unit.
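One way to picture the scoring step is the sketch below. The scoring rule is an assumption made for illustration: it simply counts how many of the two context positions match exactly, whereas the patent leaves room for graded influence factors. The names `score` and `select_unit` are hypothetical.

```python
def score(target, candidate):
    """Score a candidate (prev, cur, next) triple against the target triple.

    Assumed rule: a candidate for a different unit is unusable (-1);
    otherwise one point per matching context position.
    """
    if target[1] != candidate[1]:
        return -1
    return int(target[0] == candidate[0]) + int(target[2] == candidate[2])


def select_unit(target, candidates):
    """Return the highest-scoring candidate (the first one wins ties)."""
    return max(candidates, key=lambda c: score(target, c))
```

A weighted variant could replace the exact-match test with a lookup in the pronunciation influence factor table, so that a mismatch on a weakly influencing neighbour costs less than a mismatch on a strongly influencing one.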

In addition, in the method of constructing a Chinese numeral sound library according to the present invention, if the two numeral units selected from the sound library for two adjacent numerals are themselves adjacent, their waveforms are concatenated directly; otherwise, windowed smoothing is applied when the two waveforms are concatenated. The window used in the windowed smoothing is 20 ms long, and the overlap between windows is 0 to 10 ms long.
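The concatenation rule can be illustrated with a small cross-fade sketch. Several points are assumptions and not taken from the patent: the 16 kHz sample rate, the use of a Hann-shaped window whose two 10 ms halves serve as fade-out and fade-in ramps, the overlap-add of the faded regions, and the function name `splice`. Waveforms are plain lists of float samples.

```python
import math


def splice(left, right, adjacent, sample_rate=16000, window_ms=20, overlap_ms=10):
    """Concatenate two waveforms, smoothing the joint unless the units are adjacent."""
    if adjacent:
        return list(left) + list(right)  # contiguous recording: no smoothing needed

    # Half of the 20 ms window is used as a ramp on each side of the joint.
    half = min(int(sample_rate * window_ms / 1000) // 2, len(left), len(right))
    overlap = min(int(sample_rate * overlap_ms / 1000), half)
    ramp = [0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / half) for i in range(half)]

    l, r = list(left), list(right)
    for i in range(half):
        l[len(l) - half + i] *= ramp[half - 1 - i]  # fade out the tail of the left unit
        r[i] *= ramp[i]                             # fade in the head of the right unit

    if overlap == 0:
        return l + r
    mixed = [a + b for a, b in zip(l[-overlap:], r[:overlap])]
    return l[:len(l) - overlap] + mixed + r[overlap:]
```

With the full 10 ms overlap the two ramps sum to one at every sample, so a constant signal passes through the joint unchanged; with a smaller overlap the joint dips in energy, which acts like a very short pause.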

In yet another aspect of the present invention, a system corresponding to the above method of synthesizing Chinese numeral strings is proposed, comprising: grouping means for performing prosodic grouping of a numeral string according to its length and type to obtain numeral string groups; extracting means for extracting the context features of the numerals in each numeral string group, the context features comprising the code of the preceding numeral, the code of the following numeral and the code of the current numeral; selecting means for selecting the corresponding numeral unit from the sound library according to the context features, thereby obtaining a numeral unit for each numeral in the group; and concatenating means for concatenating the waveforms of these numeral units for output; wherein the sound library is constructed using the method of constructing a Chinese numeral sound library described above.

With the numeral string synthesis system of the present invention, the reading sounds more pleasant and natural, and the noise caused by energy and period mismatches at concatenation points is eliminated.

Brief description of the drawings

Figure 1 is a table of the numeral units used in Chinese numeral string synthesis;

Figure 2 shows the waveform when the pronunciations of adjacent numerals do not run together;

Figure 3 shows the waveform when the pronunciations of adjacent numerals run together;

Figure 4 shows the consonant classification table for the numeral units;

Figure 5 is a table of the degree of influence between the pronunciations of Chinese numerals;

Figure 6 is a schematic diagram of the sound library index structure;

Figure 7 shows the format of the 'feature' sub-item in the sound library index structure of Figure 6;

Figure 8 is a flowchart of the method of constructing a Chinese numeral sound library according to the present invention;

Figure 9 is a flowchart of the Chinese numeral string synthesis method according to the present invention;

Figure 10 shows the process of prosodic grouping of an input numeral string according to one embodiment;

Figure 11 shows, in table form, the prosodic grouping of input Chinese numeral strings;

Figure 12 shows the process of selecting a suitable numeral unit from the sound library according to one embodiment of the present invention;

Figure 13 is a schematic diagram of concatenating the waveforms of the selected numeral units in the Chinese numeral string synthesis method of the present invention;

Figure 14 is a block diagram of the Chinese numeral synthesis system of the present invention;

Figure 15 shows the evaluation results for the output of the Chinese numeral synthesis system of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the drawings. Details and functions that are not necessary to the invention are omitted so as not to obscure the understanding of the invention.

[Definition of numeral units and description of their features]

Chinese has ten basic numerals, '零' (zero), '一' (one), '二' (two), '三' (three), '四' (four), '五' (five), '六' (six), '七' (seven), '八' (eight) and '九' (nine) (or their formal written forms such as 壹 and 贰), together with units and markers such as "十", "百", "千", "万", "亿", "点", "分", "之" and "又". Numeral strings built from these nineteen units include strings read by value, such as amounts of money, and strings read digit by digit, such as telephone numbers. To make the synthesized output as natural as possible, third-tone sandhi in spoken Chinese must be taken into account: when two third-tone syllables are adjacent, the first is pronounced with the second tone. For example, "九九" is pronounced "jiu2 jiu3" rather than "jiu3 jiu3". (Here the Arabic numeral after the pinyin letters denotes the tone: 1 is the first tone (阴平), 2 is the second tone (阳平), 3 is the third tone (上声), and 4 is the fourth tone (去声).)
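The sandhi rule stated above can be sketched in a few lines. The function name is hypothetical, and the treatment of runs of three or more third-tone syllables (every third-tone syllable followed by an original third-tone syllable changes) is a simplifying assumption; real speech depends on prosodic grouping.

```python
def apply_third_tone_sandhi(syllables):
    """Change tone 3 to tone 2 for any syllable followed by a third-tone syllable.

    `syllables` is a list of pinyin strings ending in a tone digit, e.g. "jiu3".
    The decision uses the original (citation) tones of the input.
    """
    result = []
    for i, syllable in enumerate(syllables):
        following = syllables[i + 1] if i + 1 < len(syllables) else None
        if syllable.endswith("3") and following is not None and following.endswith("3"):
            result.append(syllable[:-1] + "2")
        else:
            result.append(syllable)
    return result
```

Under this rule "jiu3 jiu3" becomes "jiu2 jiu3", while "wu3 er4" is left unchanged.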

Taking tone sandhi into account, the units above must also include "wu2", the sandhi form of "五"; "jiu2", the sandhi form of "九"; "dian2", the sandhi form of "点"; and "bai2", the sandhi form of "百". In addition, to distinguish it more easily from "七", "一" is read "yao1" on some occasions. Furthermore, when a numeral is at the head or tail of a numeral string, its preceding or following context is actually a string-group boundary. Such boundaries are grouped into one class, denoted "#"; although it is not pronounced, it is still treated as a special numeral unit.

In summary, a Chinese numeral string synthesis system involves 25 numeral units in total: 24 pronounced units and one silent unit, as shown in Figure 1.

Apart from isolated numerals, numeral units occur in numeral strings. Within a string, a numeral unit and the units in its context influence one another. A numeral unit can therefore be described by its context in the string: the preceding numeral unit, the following numeral unit and the unit itself form a vector that serves as the feature of that numeral unit.

It should be noted that 'numeral' and 'numeral unit' correspond to each other in the present invention. Because no two different numerals share the same pronunciation, the two terms have the same meaning in some contexts and need not be specially distinguished. Likewise, 'numeral string' and 'numeral unit string' have the same meaning in some contexts.

[Pronunciation influence factor table]

As mentioned above, when a numeral string is read continuously in the normal way, the numerals affect one another's pronunciation, so that a numeral sounds different, sometimes very different, from the same numeral spoken in isolation. In the waveform, adjacent numerals may be clearly separated from each other, or they may run together (adhesion).

Figures 2 and 3 give examples of adjacent numerals that are clearly separated and that run together, respectively, when read continuously. As shown in Figure 2, when '02' (零二) is read continuously, the pronunciations of '0' (零) and '2' (二) hardly affect each other and are easy to tell apart. When '25' (二五) is read continuously, however, the pronunciations of '2' (二) and '5' (五) run together, and no clear dividing line between them can be seen in the waveform, as shown in Figure 3.

Therefore, to measure the degree of pronunciation influence between adjacent numeral units, a pronunciation influence factor table can be built. The table is used to prune the sound library when it is constructed and to evaluate candidate numeral units in the sound library when numerals are synthesized.

To build the pronunciation influence factor table, the pinyin of all numeral units is first classified according to how it is articulated. Figure 4 shows the consonant classification table for the numeral units.

Using the classification in Figure 4, waveform analysis is performed on the numeral units in the original sound library. The results for 13 of the basic numeral units are shown in Figure 5: when the following numeral belongs to the lateral or semivowel class, the pronunciation of the preceding numeral is strongly affected, that is, the adhesion is strong, whereas the mutual influence between other adjacent numerals is small.

Figure 5 shows the adhesion between two adjacent numerals pronounced in succession: Y denotes a strong influence and X a weak one. Among the remaining basic numeral units, those strongly affected include "wan4", "yi4" and "you4". All 24 pronounced numeral units can be analysed in the same way to obtain the degree of pronunciation influence between them.

In this embodiment of the pronunciation influence factor, the degree of correlation between two adjacent basic numeral units is described only as 'strong' or 'weak'. As an alternative embodiment, a finer-grained description may be used, for example three or more levels, or weighted values.

[Construction of the original sound library]

To obtain highly natural synthesized numeral strings in the end, the original sound library must not only be recorded at sufficiently high quality but must also cover as many pronunciation phenomena as possible, that is, the combinations that may occur in real speech. An original sound library designed on this principle ensures that, when a numeral string is synthesized, each numeral can find the best-matching candidate unit possible in the library. It also preserves the quality of the target sound library as far as possible when the library is pruned to reduce its size.

To cover how numerals connect in different combinations, the numeral script of the original sound library must be long enough. In a numeral string, the position of each numeral can be classified as head, middle or tail. Likewise, in a numeral unit string, the position of each numeral unit can be classified as head, middle or tail. Because a three-digit numeral string already provides the pronunciation of three numerals in different positions and contexts, the original sound library consists of the strings from "000" through "999" in sequence, including the combinations in which the Arabic numeral "1" is read as "幺" (yao). Of course, a person of ordinary skill in the art may use longer numeral strings to build the original sound library.
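A recording script of this kind could be generated as sketched below. The pinyin citation forms are standard; everything else is an assumption for illustration: the function name `build_script`, the choice to add a single alternative reading in which every "1" is read "yao1", and the omission of tone sandhi from the written script.

```python
PINYIN = {"0": "ling2", "1": "yi1", "2": "er4", "3": "san1", "4": "si4",
          "5": "wu3", "6": "liu4", "7": "qi1", "8": "ba1", "9": "jiu3"}


def build_script():
    """Return (digits, reading) pairs for "000".."999" plus the yao1 variants."""
    script = []
    for n in range(1000):
        digits = f"{n:03d}"
        script.append((digits, [PINYIN[d] for d in digits]))
        if "1" in digits:
            # Assumed variant: every "1" in the string is read as yao1.
            script.append((digits, ["yao1" if d == "1" else PINYIN[d] for d in digits]))
    return script
```

Under these assumptions the script has 1271 entries: 1000 base strings plus one extra reading for each of the 271 strings that contain a "1".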

Once the script has been designed, it is read aloud and recorded under defined recording conditions by a speaker who meets a defined standard of pronunciation, and syllable boundaries are then marked. The result is a high-quality original sound library for Chinese numeral string synthesis.

[Construction of the target sound library]

Although the original sound library covers a large number of linguistic phenomena, it also contains a great deal of redundancy, so in practice it must be pruned to form the target sound library. The target sound library is compact yet still covers enough linguistic phenomena.

When two consecutively pronounced Chinese numerals run together, the pronunciation of the first strongly affects that of the second, and vice versa. In that case the acoustic characteristics of each numeral differ markedly from its pronunciation in isolation. To obtain high naturalness, the present invention therefore proposes building the target sound library by selecting and retaining bigram links of numeral pronunciations as basic units.

The full set of bigram links gives good coverage but needs considerable storage. In embedded and similar environments, limited storage makes it necessary to prune and compress the bigram links further.

According to the pronunciation influence factor table, for two consecutive numeral units that exhibit adhesion, keeping their pronunciation samples improves the naturalness of actual synthesis. Such numeral units include 0, 1, 5, 6, yao2, wu2, wan4, yi4 and you4; all numeral bigrams connected to these units are selected into the target sound library. In addition, because third-tone sandhi is a special case, it is explicitly retained when the target sound library is built: all combinations of numeral units in which third-tone sandhi may occur, such as wu2-wu3, wu2-wu2, wu2-jiu3, wu2-jiu2 and jiu2-wu3, are selected into the target sound library.
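The retention rule in this paragraph can be sketched as a simple filter. The unit labels are copied from the text as written (plain digits for the base numerals, pinyin for the variants). The reading that a bigram is kept when either member is one of the listed units, the rule that a bigram starting with a sandhi form is kept, and the names `keep_bigram` and `prune` are all assumptions for illustration.

```python
# Units listed in the text as taking part in strong adhesion.
STICKY_UNITS = {"0", "1", "5", "6", "yao2", "wu2", "wan4", "yi4", "you4"}
# Third-tone sandhi forms named in the description.
SANDHI_FORMS = {"wu2", "jiu2", "dian2", "bai2"}


def keep_bigram(first, second):
    """Return True if the bigram (first, second) stays in the target library."""
    if first in STICKY_UNITS or second in STICKY_UNITS:
        return True
    # A sandhi form only occurs before another third-tone unit, so keep the pair.
    return first in SANDHI_FORMS


def prune(bigrams):
    """Drop bigrams whose members influence each other only weakly."""
    return [pair for pair in bigrams if keep_bigram(*pair)]
```

For example, ("3", "5") and ("jiu2", "9") are kept, while ("3", "9") is dropped and would be covered by unigram samples instead.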

Studies show that the acoustic characteristics of two consecutive numerals that are not isolated also vary with their context. For example, the adhering pair "35" in "356" differs from the "35" in "353", because the following numeral "6" or "3" still affects the pronunciation of the preceding numerals. Choosing, according to the influence factor table, a context with weak influence therefore yields more generally usable numeral samples.

Studies also show that two consecutive numerals with strong adhesion are pronounced differently depending on their position in the numeral string group. For example, the "35" in "353" and the "35" in "335" do not behave the same acoustically. Keeping both kinds of pronunciation sample improves the naturalness of the synthesis.

When a numeral unit adheres only weakly to the following numeral unit, its pronunciation is relatively independent. The pronunciation of such units still varies with their position in the numeral string group; for example, a "9" in the middle of a numeral string is shorter than a "9" at the tail.

More importantly, a pause perceived in the way a numeral unit itself is spoken sounds more natural than a pause produced by inserting silence, and a major factor in that perceived pause is the position of the numeral unit in the numeral string group. These numeral units must therefore retain separate pronunciation samples for the head, middle and tail positions, selected as follows:

1) If the numeral is at the head of the numeral string, its adhesion to the pronunciation of the following numeral in the string should be weak;

2) If the numeral is in the middle of the numeral string, its adhesion to the pronunciation of both the preceding and the following numeral should be weak;

3) If the numeral is at the tail of the numeral string, its adhesion to the pronunciation of the preceding numeral should be weak.

As one embodiment, using "3" as the context numeral ensures that the pronunciation of the current numeral is little affected by its context; that is, the numeral "×" is extracted from each of the three positions using the strings "×33", "3×3" and "33×". Because "wu2" and "jiu2" are never pronounced in isolation, these two numeral units are not considered when unigrams are extracted. The preceding and following context of numeral units of this kind is regarded as a head or tail position marker or a middle position marker.
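The carrier-string idea in this embodiment can be sketched as follows. The function name `carrier_strings` and the dictionary layout are hypothetical; the carrier "3" and the exclusion of "wu2" and "jiu2" come from the text.

```python
NOT_ISOLATED = {"wu2", "jiu2"}  # sandhi forms that never stand alone


def carrier_strings(unit, carrier="3"):
    """Return the three-unit strings from which `unit` is cut at each position.

    Returns an empty dict for units that are not extracted as unigrams.
    """
    if unit in NOT_ISOLATED:
        return {}
    return {
        "head": [unit, carrier, carrier],    # pattern x33
        "middle": [carrier, unit, carrier],  # pattern 3x3
        "tail": [carrier, carrier, unit],    # pattern 33x
    }
```

For the unit "9" this yields the source strings "933", "393" and "339", one for each position-specific sample.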

Once the library units have been selected, they are extracted from the original sound library and stored in a defined order to form a separate target sound library, and an index structure describes the location of each numeral unit. The index structure is shown in Figure 6.
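A possible in-memory form of one index entry is sketched below. The three fields (offset, duration, context feature) come from the description; the rest is assumed for illustration. In particular, the symbol codes, the 5-bit field layout inside a 16-bit value, the "&" position marker as a coded symbol, and the names `IndexEntry`, `pack_feature` and `unpack_feature` are hypothetical and are not taken from Figures 1, 6 or 7. The units of offset and duration are not specified in this section.

```python
from dataclasses import dataclass

# Hypothetical code table: two markers plus the 24 pronounced units (26 symbols).
SYMBOLS = ["#", "&",
           "ling2", "yi1", "er4", "san1", "si4", "wu3", "liu4", "qi1", "ba1", "jiu3",
           "shi2", "bai3", "qian1", "wan4", "yi4", "dian3", "fen1", "zhi1", "you4",
           "wu2", "jiu2", "dian2", "bai2", "yao1"]
CODE = {symbol: i for i, symbol in enumerate(SYMBOLS)}


def pack_feature(previous, current, following):
    """Pack three 5-bit codes into one value that fits in 16 bits (assumed layout)."""
    return (CODE[previous] << 10) | (CODE[current] << 5) | CODE[following]


def unpack_feature(value):
    """Inverse of pack_feature."""
    return (SYMBOLS[(value >> 10) & 0x1F],
            SYMBOLS[(value >> 5) & 0x1F],
            SYMBOLS[value & 0x1F])


@dataclass(frozen=True)
class IndexEntry:
    offset: int    # position of the unit's waveform in the target library
    duration: int  # length of the unit's waveform
    feature: int   # packed (previous, current, following) context
```

Since 26 symbols fit in 5 bits, three codes need 15 bits, which is one way a context triple could fit in a 16-bit field.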

The 'feature' item in the table of Figure 6 describes the context of the numeral unit; it can be represented by a 16-bit value as shown in Figure 7. The three sub-items of 'feature' are assigned according to the following rules:

1) When the digit unit is at the head of the string, the previous digit in its 'feature' is marked as #;

2) When the digit unit is at the tail of the string, the next digit in its 'feature' is marked as #;

3) When the current digit unit is not at the head of the string but is the first element of the selected one-tuple or two-tuple, the previous digit unit in its 'feature' is marked as &;

4) When the current digit unit is not at the tail of the string but is the second element of the selected one-tuple or two-tuple, the next digit unit in its 'feature' is marked as &;

5) Otherwise, the previous and next digit units in the 'feature' are the actual context units of the current digit unit.
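The five rules above can be sketched as a small function. This is only an illustrative sketch under stated assumptions, not the patent's implementation: the name `context_feature`, the use of pinyin strings as unit codes, and the `(start, length)` description of the selected tuple are all introduced here for illustration. For a one-tuple, the single unit is treated as both the first and the last element of the tuple.

```python
STRING_EDGE = "#"  # mark for the head or tail of the recorded string
TUPLE_EDGE = "&"   # mark for the edge of the selected one-tuple or two-tuple


def context_feature(units, start, length, i):
    """Return (previous, current, next) for units[i].

    units  : unit codes of one recorded string, e.g. ["yi1", "er4", "san1"]
    start  : index of the first element of the selected tuple (assumed layout)
    length : 1 for a one-tuple, 2 for a two-tuple
    i      : index of the unit being described; must lie inside the tuple
    """
    if not start <= i < start + length:
        raise ValueError("unit index lies outside the selected tuple")

    if i == 0:                          # rule 1: head of the string
        prev = STRING_EDGE
    elif i == start:                    # rule 3: first element of the tuple
        prev = TUPLE_EDGE
    else:                               # rule 5: real left neighbour
        prev = units[i - 1]

    if i == len(units) - 1:             # rule 2: tail of the string
        nxt = STRING_EDGE
    elif i == start + length - 1:       # rule 4: last element of the tuple
        nxt = TUPLE_EDGE
    else:                               # rule 5: real right neighbour
        nxt = units[i + 1]

    return (prev, units[i], nxt)
```

For the two-tuple formed by the last two units of a three-unit string, the first element gets `&` as its previous context and its real neighbour as its next context, while the second element gets its real neighbour and `#`.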

Fig. 8 is a flowchart of a method for constructing a Chinese digit sound bank according to an embodiment of the present invention. The method is described in detail below with reference to Fig. 8.

In step S81, an original sound bank comprising a plurality of digit unit strings is generated. For example, a speaker reads aloud the digit strings from "000" through "999" in order, including the combinations in which the Arabic numeral "1" is read as "yao" (幺); the readings are recorded and the unit boundaries are marked, yielding an original sound bank for Chinese digit string synthesis.

Next, in step S82, two-tuple digit units located at the head, in the middle and at the tail of a string are selected from the original sound bank. In other words, two-tuples are selected from the three-tuple digit unit strings of the original sound bank, taking into account whether the digits of the two-tuple lie at the head, in the middle or at the tail of the three-tuple string.

Then, in step S83, one-tuple digit units are selected from the original sound bank, that is, a digit unit is selected for each position that a single digit can occupy in a three-tuple digit string. These two selection steps yield an intermediate sound bank containing two-tuple digit unit strings and one-tuple digit units. Although this bank is smaller than the original sound bank, the two-tuple digit unit strings still contain redundancy and must be pruned further.

In step S84, the two-tuples are pruned. The mutual influence between the two digit units of a two-tuple may be strong or weak; a two-tuple whose units influence each other only weakly can be synthesized from single digit units without reducing the naturalness of the result, so such two-tuple digit unit strings are removed. Finally, in step S85, the remaining two-tuple digit unit strings and the one-tuple digit units selected from the original sound bank together form the final target sound bank.
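Steps S84 and S85 can be sketched as follows. The text does not state how the degree of influence between two units is measured, so the predicate `is_weak_pair` is a hypothetical placeholder supplied by the caller, and the function name and the dictionary layout of the result are assumptions made only for this sketch.

```python
def build_target_bank(two_tuples, one_tuples, is_weak_pair):
    """Prune weakly coupled two-tuples (S84) and assemble the target bank (S85).

    two_tuples   : iterable of (first_unit, second_unit) pairs selected in S82
    one_tuples   : iterable of single units selected in S83
    is_weak_pair : hypothetical predicate; True when the two units influence
                   each other so little that single units can replace the pair
    """
    kept = [pair for pair in two_tuples if not is_weak_pair(*pair)]
    return {"two_tuples": kept, "one_tuples": list(one_tuples)}
```

A caller might, for instance, keep only the pairs listed in a hand-made set of strongly coupled pairs and treat every other pair as weak; that set would be application data, not something given in the text.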

In addition, the positions of the digit units in the target sound bank are indexed to form an index file, so that suitable candidate digit units can be looked up when a digit string is synthesized.

A sound bank constructed by the above process occupies very little storage while retaining high naturalness, and therefore meets the requirements of embedded systems.
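One possible shape for an index entry is sketched below. The fields (offset, duration, context feature) follow the index structure described in the text and in claim 2, but the bit layout of the 16-bit 'feature' is defined in Fig. 7, which is not reproduced here. The 5-bit-per-sub-item layout, the integer unit codes, and all names in the sketch are therefore assumptions; 5 bits are chosen only because 25 digit units plus the two marks # and & fit into 32 codes.

```python
from dataclasses import dataclass

BITS = 5                 # assumed width of each sub-item; 3 x 5 = 15 bits
MASK = (1 << BITS) - 1


def pack_feature(prev_code, cur_code, next_code):
    """Pack three small integer codes into one value that fits in 16 bits."""
    for code in (prev_code, cur_code, next_code):
        if not 0 <= code <= MASK:
            raise ValueError("code does not fit in the assumed 5-bit field")
    return (prev_code << (2 * BITS)) | (cur_code << BITS) | next_code


def unpack_feature(value):
    """Inverse of pack_feature: return (prev_code, cur_code, next_code)."""
    return (value >> (2 * BITS)) & MASK, (value >> BITS) & MASK, value & MASK


@dataclass(frozen=True)
class IndexEntry:
    offset: int    # position of the unit's waveform in the target sound bank
    duration: int  # length of the unit's waveform (unit of measure assumed)
    feature: int   # packed context feature


def candidates_for(index, cur_code):
    """Return every index entry whose current-unit code equals cur_code."""
    return [e for e in index if unpack_feature(e.feature)[1] == cur_code]
```

With this layout a lookup scans the index once per digit, which is cheap for a bank of this size.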

【Synthesis of Chinese Digit Strings】

The process of synthesizing a Chinese digit string according to an embodiment of the present invention comprises digit string input, prosodic grouping, feature extraction, unit selection, waveform splicing and waveform output. The flow is shown in Fig. 9.

1. Digit string input

Typically, a stored digit string is read from a specific area of memory in response to an instruction, and the length and type of the digit string are known in advance, for example a telephone number, identity card number, driver's license number, securities data or an industrial meter reading.

2. Prosodic grouping

Normally, to give the listener enough time to understand what is being read and to make the reading sound more pleasant and natural, a long digit string should be split into shorter groups. A naturally read digit string thus has a certain prosodic structure. This structure depends first on the syntax and semantics of the digit string, such as the information fields within an identity card number, and second on reading habits and ability, such as the grouping of an 8-digit telephone number.

The prosodic grouping of digit strings generally follows regular patterns. As an example, Fig. 10 shows the process of prosodically grouping an input telephone number.

First, a telephone number of predetermined length L is read in. It is then determined whether L is greater than 6; if so, the number is divided into groups of 3 or 4 digits. For example, an 8-digit telephone number is grouped as 4-4.

If L is not greater than 6, it is determined whether L equals 6; if so, the number is grouped as 3-3 or left ungrouped.

If L is less than 6, it is determined whether L equals 5; if so, the number is grouped as 2-3 or left ungrouped.

Finally, if L is less than 5, as for a company-internal extension number, no further grouping is performed.
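The decision flow of Fig. 10 can be sketched as below. The text only says that numbers longer than 6 digits are split into groups of 3 or 4 and gives 4-4 for 8 digits; how the 3s and 4s are ordered for other lengths is listed in Fig. 11, which is not reproduced here. The sketch therefore assumes the smallest number of groups with the 3-digit groups placed first, and it always applies the 3-3 and 2-3 splits although the text also allows leaving such numbers ungrouped. The function name is hypothetical.

```python
def group_phone_number(number):
    """Split a telephone number string into prosodic groups."""
    n = len(number)
    if n > 6:
        groups = -(-n // 4)            # ceiling of n / 4
        threes = 4 * groups - n        # how many groups must shrink to 3 digits
        sizes = [3] * threes + [4] * (groups - threes)
    elif n == 6:
        sizes = [3, 3]
    elif n == 5:
        sizes = [2, 3]
    else:
        return [number]                # short numbers are not grouped

    result, pos = [], 0
    for size in sizes:
        result.append(number[pos:pos + size])
        pos += size
    return result
```

Under these assumptions an 8-digit number yields two 4-digit groups, matching the example in the text.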

In addition, Fig. 11 lists the prosodic grouping of telephone numbers in more detail. Prosodic grouping makes the feature of each digit describe the properties of that digit more accurately, which helps in selecting the correct candidate from the sound bank.

3. Feature extraction

After the input digit string has been prosodically grouped, the feature of each digit, that is, its context, can be extracted. Third-tone sandhi must be taken into account when processing the digits: when wu3, jiu3, bai3 and dian3 are pronounced in succession, the preceding syllable changes to wu2, jiu2, bai2 or dian2, respectively. The principle of feature extraction is essentially the same as the feature description of digit units given above under 'Construction of the target sound bank', except that the previous digit unit of the feature is set to # only when the digit is at the head of the string or of a group, and the next digit unit of the feature is set to # when the digit is at the tail of the string or of a group. When feature extraction is complete, a context feature description is available for every digit; it is used to select suitable candidate units from the sound bank.
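A minimal sketch of this feature extraction step follows. It assumes that each group is already given as a list of pinyin unit codes and that tone sandhi is applied only inside a group, which the text does not specify. The function names and the tuple representation `(previous, current, next)` are illustrative.

```python
# Third-tone units named in the text and their sandhi forms.
SANDHI = {"wu3": "wu2", "jiu3": "jiu2", "bai3": "bai2", "dian3": "dian2"}
BOUNDARY = "#"


def apply_sandhi(group):
    """Change a third-tone unit to second tone when another one follows it."""
    out = list(group)
    for i in range(len(group) - 1):
        if group[i] in SANDHI and group[i + 1] in SANDHI:
            out[i] = SANDHI[group[i]]
    return out


def extract_features(groups):
    """Return one (previous, current, next) tuple per digit, in reading order."""
    features = []
    for group in groups:
        units = apply_sandhi(group)
        for i, unit in enumerate(units):
            prev = BOUNDARY if i == 0 else units[i - 1]
            nxt = BOUNDARY if i == len(units) - 1 else units[i + 1]
            features.append((prev, unit, nxt))
    return features
```

Because every group starts and ends with `#`, digits on either side of a group boundary never see each other as context.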

4. Unit selection

Unit selection picks the most suitable candidate digit unit from the target sound bank. The process first uses the sound bank index file to find all candidates for the digit unit currently being matched, then scores each candidate in turn, and takes the candidate with the highest score as the final target candidate.

The scoring method considers how well the contexts match. One possible scoring function is:

S = Sp + Sn

Here Sp evaluates whether the previous digit unit in the context matches, and Sn evaluates whether the next digit unit in the context matches. Sp and Sn are computed by first testing for an exact match and, failing that, testing whether the context of the sound bank candidate is the tuple boundary mark &. As shown in Fig. 12, the scoring is as follows:

If the previous digit unit matches, Sp = 3;

otherwise, if the previous digit unit of the candidate digit unit in the target sound bank is &, Sp = 2;

otherwise, Sp = 1.

If the next digit unit matches, Sn = 3;

otherwise, if the next digit unit of the candidate digit unit in the target sound bank is &, Sn = 2;

otherwise, Sn = 1.

Sp and Sn are then summed to give the final score of the candidate digit unit.

Repeating these steps gives the final scores of all candidate digit units. The digit unit with the highest score is selected, and its waveform is then retrieved according to the index structure of that digit unit.
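The scoring rule S = Sp + Sn can be sketched directly. The representation of a candidate as a `(previous, current, next)` tuple and the function names are assumptions for illustration; when several candidates tie, this sketch simply keeps the first one, a choice the text leaves open.

```python
TUPLE_EDGE = "&"


def side_score(wanted, candidate_context):
    """Score one side of the context: 3 = exact match, 2 = tuple edge, 1 = other."""
    if candidate_context == wanted:
        return 3
    if candidate_context == TUPLE_EDGE:
        return 2
    return 1


def select_unit(target, candidates):
    """Pick the candidate with the highest S = Sp + Sn for the target feature."""
    prev, cur, nxt = target
    pool = [c for c in candidates if c[1] == cur]
    if not pool:
        raise LookupError("no candidate for unit %r" % (cur,))
    return max(pool, key=lambda c: side_score(prev, c[0]) + side_score(nxt, c[2]))
```

A candidate cut at a tuple edge scores higher than one recorded next to a different neighbour, which reflects the idea that units with weak contextual influence are the safer substitutes.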

5. Waveform splicing

Waveform splicing concatenates the waveforms of the selected digit units in order to form a complete speech waveform. To remove the noise ("clicks") caused by energy and period mismatch when two digit units are joined directly, the splice point is smoothed with a window; two digit units that are already adjacent in the original sound bank are joined directly. One possible window function is shown in Fig. 14: the window length is about 20 ms, the overlap region is between 0 and 10 ms long, and the amplitude rises or falls with a fixed slope from the start of the window to its end.

Adjacent digits with a bonding relationship were all selected into the target sound bank when it was built. Therefore, when the input digit string contains such a digit chain, the unit selection process can always pick the most suitable digit units from the target sound bank and splice them directly, which ensures the highest naturalness.
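The smoothing at a splice point can be illustrated with a linear cross-fade, which is one simple way to realize a fixed-slope ramp over an overlap region. This is a sketch under assumptions and not the window of the figure: the 8 kHz default sample rate, the representation of waveforms as lists of floats, the choice to fade over the whole overlap, and the function name are all introduced here.

```python
def crossfade(left, right, sample_rate=8000, overlap_ms=10.0):
    """Join two waveforms, blending them linearly over the overlap region.

    left, right : sequences of samples
    sample_rate : samples per second (assumed value)
    overlap_ms  : overlap length in milliseconds (0 to 10 in the text)
    """
    n = int(sample_rate * overlap_ms / 1000)
    n = min(n, len(left), len(right))
    if n == 0:
        return list(left) + list(right)    # no overlap: plain concatenation

    blended = []
    for i in range(n):
        w = (i + 1) / (n + 1)               # weight of the right unit rises linearly
        blended.append(left[len(left) - n + i] * (1 - w) + right[i] * w)
    return list(left[:len(left) - n]) + blended + list(right[n:])
```

Units that were adjacent in the recording would bypass this function and be concatenated as they are.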

6. Waveform output

In the waveform output step, the spliced waveform is output from a loudspeaker directly or after amplification.

Fig. 14 further shows a block diagram of the Chinese digit string synthesis system according to the present invention.

As shown in Fig. 14, the Chinese digit string synthesis system 10 of the present invention comprises an input section 141, a grouping section 142, an extraction section 143, a selection section 144, a splicing section 145, an output section 146 and a storage section 147. These components are described below.

The input section 141 inputs a digit string of predetermined length, and the grouping section 142 prosodically groups the digit string according to its type and length. For example, a 15-digit identity card number is usually grouped as 3-3-6-3 and an 18-digit identity card number as 3-3-4-4-4. Telephone numbers are grouped according to the cases listed in Fig. 11.

The extraction section 143 extracts the context feature of each digit to be synthesized; the context feature comprises the code of the previous digit, the code of the next digit and the code of the digit itself. The selection section 144 then selects, according to the context feature of the digit to be synthesized, the most suitable digit unit from the storage section 147, which stores the target sound bank. The matching process is the same as that described above with reference to Fig. 13.

Next, the splicing section 145 splices the selected waveforms. If the digit units to be synthesized are adjacent to each other in the target sound bank, they are joined directly. Otherwise, windowing is applied to remove the "click" caused by energy and period mismatch when two digit units are joined directly. Finally, the output section 146 outputs the spliced waveform from a loudspeaker directly or after amplification.

In the above embodiment the components are separate, but it is clear to those of ordinary skill in the art that one or more of them may be integrated. For example, the function of the input section 141 may be integrated into the prosodic grouping section, or the function of the output section 146 may be integrated directly into the waveform splicing section 145, and so on.

【Implementation Effect】

Fig. 15 shows the relationship between the average number of samples per pronunciation unit in the target sound bank and the subjective evaluation score of the speech synthesized by the system. The subjective evaluation targets naturalness: each evaluator listens to the system and gives a score between 0 and 5, and the final score is the weighted average of all scores.

Fig. 15 shows that once the number of samples per pronunciation unit reaches 15 to 30, the subjective score is already above 4, indicating that the sound quality is quite good.

The results in Fig. 15 can be summarized as follows: with the method provided by the present invention, high naturalness can be obtained even with a small sound bank. While the synthesized speech remains highly natural, the sound bank is small enough to be ported to embedded devices with little memory.

【Variations】

The present invention has been described in detail above by way of embodiments, but these embodiments serve only to illustrate the invention clearly and are not intended to limit its scope. For example, 25 digit units are defined in 'Definition of digit units and description of their features', but this is only one application. In industrial applications, the digit strings to be converted to speech may carry units of measure such as 'degree', 'meter' or 'pascal'. Everyday devices with text-to-speech functions may need to read out units of measure such as 'jin' and 'liang'. Queue-calling systems in banks, tax offices and similar places need to synthesize speech such as 'number ×××'. In other words, in a specific application the number of digit units may exceed the 25 described above.

Accordingly, the present invention may be embodied in other specific forms without departing from its spirit and essential characteristics. The embodiments are therefore to be considered in all respects illustrative and not restrictive; the scope of the invention is defined by the claims rather than by the foregoing description or embodiments, and all changes within the meaning and range of equivalency of the claims are intended to be embraced therein.

Claims

1. A method for constructing a Chinese digit sound bank, comprising the steps of:

generating an original sound bank comprising a plurality of digit unit strings, each digit unit string being a representation of the pronunciation of a corresponding Chinese digit string;

selecting, from the original sound bank, two-tuple digit units whose digit units are located at the head, in the middle and at the tail of a string, respectively;

selecting, from the original sound bank, one-tuple digit units whose digit units are located at the head, in the middle and at the tail of a string, respectively;

pruning those two-tuple digit units in which the degree of influence between the adjacent digit units is weak; and

forming a target sound bank from the two-tuple digit units remaining after the pruning and the selected one-tuple digit units.

2. The method according to claim 1, further comprising the step of establishing, for each digit unit, an index structure describing its position, said index structure comprising the offset of the digit unit in the target sound bank, the duration of the digit unit and the context structure of the digit unit.

3. The method according to claim 2, wherein the context structure of the digit unit comprises the code of the previous digit unit, the code of the next digit unit and the code of the current digit unit.

4. The method according to claim 2, wherein said number unit comprises a representation of the pronunciation of the following Chinese characters: zero, one, two, three, four, five, six, seven, eight, nine, ten, hundred, thousand, Ten thousand, one hundred million, one point, one point, one, and another.

5. The method according to claim 4, wherein the digit units further comprise a representation of the pronunciation of the unit of measure of the digit string.

6. The method according to any one of claims 2-5, wherein the context unit, in the original sound bank, of a digit unit in a two-tuple digit unit selected from the original sound bank is a digit string boundary.

7. The method according to any one of claims 2-5, wherein the context unit, in the original sound bank, of a digit unit in a two-tuple digit unit selected from the original sound bank is another digit unit that only weakly influences the digit unit in the two-tuple.

8. The method according to any one of claims 2-5, wherein the context unit, in the original sound bank, of the digit unit in a one-tuple digit unit selected from the original sound bank is a digit string boundary.

9. The method according to any one of claims 2-5, wherein the context unit of the digit unit in a one-tuple digit unit selected from the original sound bank is another digit unit that only weakly influences the digit unit in the one-tuple.

10. The method according to any one of claims 2-5, wherein the digital units in the pruned two-tuple digital units at least include: zero, one, five, six and transposition yao2, wu2, wan4, yi4 and you4.

11. The method according to claim 6, wherein each of the plurality of digit unit strings in the original sound bank is a three-tuple digit unit string.

12. The method according to claim 7, wherein each of the plurality of digit unit strings in the original sound bank is a three-tuple digit unit string.

13. The method according to claim 8, wherein each of the plurality of digit unit strings in the original sound bank is a three-tuple digit unit string.

14. The method according to claim 9, wherein each of the plurality of digit unit strings in the original sound bank is a three-tuple digit unit string.

15. The method according to claim 10, wherein each of the plurality of digit unit strings in the original sound bank is a three-tuple digit unit string.

16. A method for synthesizing a Chinese digit string, comprising the steps of:

prosodically grouping an input digit string according to its length and type to obtain digit string groups;

extracting the context feature of each digit in the digit string groups, the context feature comprising the code of the previous digit, the code of the next digit and the code of the current digit;

selecting the corresponding digit unit from a sound bank according to the context feature, thereby obtaining the digit unit of each digit in the digit string groups; and

splicing together the waveforms of the digit units of the respective digits for output;

wherein the sound bank is constructed using the method for constructing a Chinese digit sound bank according to claim 1.

17. The method according to claim 16, wherein, when a digit is at the head of the digit string or of a digit string group, the code of the previous digit in its context feature is a boundary mark.

18. The method according to claim 16, wherein, when a digit is at the tail of the digit string or of a digit string group, the code of the next digit in its context feature is a boundary mark.

19. The method according to claim 16, wherein selecting a suitable digit unit from the original sound bank comprises scoring all candidates for the digit unit of the current digit and selecting the candidate digit unit with the highest score.

20. The method according to claim 19, wherein the scoring is performed according to the degree of matching between the context feature of the current digit and the context feature of the candidate digit unit.

21. The method according to claim 16, wherein, if the two digit units selected from the sound bank for two adjacent digits are adjacent, the waveforms of the two digit units are spliced directly; otherwise, windowed smoothing is applied to the waveforms of the two digit units when they are spliced.

22. The method according to claim 21, wherein the length of the window used in the windowed smoothing is 20 milliseconds and the length of the overlap region between windows is 0 to 10 milliseconds.

23. The method according to claim 16, wherein the types of digit string include: telephone number, identity card number, driver's license number, securities data and industrial meter reading.

24. A Chinese digit string synthesis system, comprising:

a grouping device for prosodically grouping a digit string according to the length and type of the digit string to obtain digit string groups;

an extracting device for extracting the context feature of each digit in the digit string, the context feature comprising the code of the previous digit, the code of the next digit and the code of the current digit;

a selecting device for selecting the corresponding digit unit from a sound bank according to the context feature, thereby obtaining the digit unit of each digit in the digit string groups; and

a splicing device for splicing together and outputting the waveforms of the digit units of the respective digits.

25. The system according to claim 24, wherein, when a digit is at the head of the digit string or of a digit string group, the code of the previous digit in its context feature is a boundary mark.

26. The system according to claim 24, wherein, when a digit is at the tail of the digit string or of a digit string group, the code of the next digit in its context feature is a boundary mark.

27. The system according to claim 24, wherein the selecting device scores all candidates for the digit unit of the current digit and selects the candidate digit unit with the highest score.

28. The system according to claim 27, wherein the selecting device performs the scoring according to the degree of matching between the context feature of the current digit and the context feature of the candidate digit unit.

29. The system according to claim 24, wherein, if the two digit units selected from the sound bank for two adjacent digits are adjacent, the waveforms of the two digit units are spliced directly; otherwise, windowed smoothing is applied to the waveforms of the two digit units when they are spliced.

30. The system according to claim 29, wherein the length of the window used in the windowed smoothing is 20 milliseconds and the length of the overlap region between windows is 0 to 10 milliseconds.

31. The system according to claim 24, wherein the types of digit string include: telephone number, identity card number, driver's license number, securities data and industrial meter reading.