CN114495939B

CN114495939B - A neural speech transliteration method based on distinctive features

Info

Publication number: CN114495939B
Application number: CN202111610301.2A
Authority: CN
Inventors: 郭宇航; 王志鹏; 陈朔鹰
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2021-12-27
Filing date: 2021-12-27
Publication date: 2025-02-14
Anticipated expiration: 2041-12-27
Also published as: CN114495939A

Abstract

The present invention relates to a neural speech transliteration method based on distinctive features, and belongs to the field of speech recognition technology. The method first recognizes the audio using end-to-end speech recognition, and then applies rule-based conversion to the recognized sequence according to distinctive features. The end-to-end speech recognition part receives English speech and recognizes the corresponding International Phonetic Alphabet (IPA) sequence; the rule part converts the English IPA sequence into the closest IPA sequence of Chinese pinyin initials/finals according to the distinctive features, and then converts that IPA sequence into a pinyin sequence according to the mapping rules between IPA and pinyin. The method combines distinctive features with end-to-end speech recognition, which is helpful for cross-border communication and English learning, can also be a source of fun in daily life, and has good practical applicability.

Description

A neural speech transliteration method based on distinctive features

Technical Field

The invention relates to a neural speech transliteration method based on distinctive features, and belongs to the technical field of speech recognition.

Background Art

With the rapid development of computer technology, computers are now applied in every area of society, which has also raised problems such as the difficulty of processing massive amounts of speech data and the difficulty of human-computer interaction.

The goal of speech recognition is for a computer to automatically convert human speech into text. With further advances in artificial intelligence and deep learning, speech recognition has made significant progress, and most existing speech recognition technologies are based on deep learning. Distinctive features are properties, grounded in the natural characteristics of speech sounds, that distinguish linguistic units. For a phoneme, each feature is typically labeled "+" or "-" according to whether the phoneme has it, so a vector representation of the phoneme can be constructed from these labels.

Transliteration is a well-known concept with wide application in daily life. Current transliteration research focuses mainly on text, usually the transliteration of place names, personal names and some proper nouns in machine translation. Transliteration of entire texts, and speech-to-text transliteration in particular, is very rare. Current transliteration methods are mainly rule-based: English sequences are mapped to Chinese character sequences by comparing English phonemes with Chinese pinyin, or English words are mapped directly to Chinese characters according to hand-written rules. This kind of mapping, however, is rigid, and some of the generated text reads stiffly.

Although speech transliteration has fewer application scenarios than text transliteration, the technology can bring convenience to daily life. For example, phonetic transliteration is often an effective mnemonic in English learning, and transliterating teaching audio can aid memorization. Speech transliteration can also help people with temporary cross-border communication. In addition, it can bring some fun to daily life.

Disclosure of Invention

To address the shortcomings of the prior art, and in particular the lack of transliteration methods in the speech domain, the invention proposes a neural speech transliteration method based on distinctive features, which also offers a new technical approach for research on text transliteration.

The innovation of the invention is that it combines neural speech recognition with distinctive features for the first time, realizing transliteration from English speech to Chinese pinyin text. The bridge between neural speech recognition and the rule processing based on distinctive features is the IPA (International Phonetic Alphabet). During rule processing, the similarity between IPA characters is calculated and each character is replaced with its most similar counterpart; finally, the IPA sequence is converted into a pinyin sequence according to the IPA-to-pinyin mapping rules.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

A neural speech transliteration method based on distinctive features, comprising the following steps:

First, the English text sequences corresponding to the training set are converted into IPA sequences, and a neural speech recognition model is trained on the converted data set. The neural speech recognition model may be a Transformer model.

Then, the speech features to be transliterated are input into the trained neural speech recognition model to obtain the IPA sequence corresponding to the English speech.

Next, the vector representation of each English IPA character in the output sequence is determined according to the distinctive features. At the same time, the IPA corresponding to each initial/final of Chinese pinyin is looked up, and the vector representation of each initial/final is determined according to the distinctive features.

Then, for each IPA vector of the output sequence, the Euclidean distance to the vector of each Chinese pinyin initial and final is calculated, which measures the similarity between two IPA characters. Each English IPA character in the output sequence is replaced with the IPA character of the closest pinyin initial/final, yielding an IPA sequence of pinyin initials/finals.

Finally, according to the mapping rules between pinyin initials/finals and their IPA, each IPA is replaced with the corresponding initial or final to obtain a pinyin initial/final sequence. The initials and finals in this sequence are then merged to obtain the final output pinyin sequence.

Specifically, if an initial is followed by a final, the space between them is removed and the two are merged, for example "h ao" becomes "hao"; if an initial is followed by another initial, the space between them is likewise removed, for example "k s" becomes "ks".

Advantageous effects

The method combines advanced speech recognition technology with distinctive features from linguistics, and fills a gap in the field of speech transliteration. Converting the transliteration into a pinyin sequence also alleviates the problem of stiff-sounding transliterated output. In addition, a sequence such as "k s" is not output as "ke si" but merged into "ks", which native Chinese speakers can still read easily and which greatly shortens the transliterated sequence. Using the pinyin sequence rather than the IPA sequence as the final output makes the result easy for native Chinese speakers to understand without learning the International Phonetic Alphabet.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention.

FIG. 2 is a partial screenshot of the IPA distinctive-feature annotation table used in the distinctive-feature rule part of the method.

FIG. 3 is a screenshot of the initial (right)/final (left) tables used in the distinctive-feature rule part of the method.

FIG. 4 shows the result of data processing before training the neural speech recognition model according to the method.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

As shown in FIG. 1, a neural speech transliteration method based on distinctive features comprises the following steps:

Step 1: Process the data set, converting the English text sequences of the text labels into IPA sequences. The sentence representation before and after conversion is shown in FIG. 4. The data set is used to train the neural speech recognition model, for which a Transformer model may be chosen.
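The label conversion in step 1 could be sketched as a lexicon lookup. This is only an illustration: the source does not say which tool or dictionary performs the conversion, so the `TOY_LEXICON` entries, the `<unk>` placeholder and the use of "/" between words here are assumptions.

```python
# Hypothetical pronunciation lexicon; the entries are illustrative only and
# are not taken from the patent's data.
TOY_LEXICON = {
    "hello": "h ə l oʊ",
    "world": "w ɝ l d",
}


def text_to_ipa(sentence, lexicon, unk="<unk>"):
    """Replace each word of a transcript with its space-separated IPA symbols.

    Words are joined with " / " (an assumed separator); out-of-lexicon words
    are replaced by the `unk` placeholder.
    """
    words = sentence.lower().split()
    return " / ".join(lexicon.get(word, unk) for word in words)


label = text_to_ipa("Hello world", TOY_LEXICON)
print(label)
```

In practice the lookup would be backed by a full pronunciation dictionary or a grapheme-to-phoneme tool; the toy table only shows the shape of the training labels.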

Features are extracted from the waveform file of each speech clip's audio signal; MFCC (Mel-frequency cepstral coefficients) or another audio feature extraction method may be used.

Feature extraction yields a two-dimensional array of feature values, in which the first dimension represents time and the second represents frequency. Along the time dimension, the feature at each time step t is x_t, where x_t is a vector.

Step 2: Input the extracted feature sequence into the neural speech recognition model, which recognizes the corresponding IPA sequence.

Step 3: Determine the vector representation of each IPA character in the output sequence according to the distinctive features. As shown in FIG. 2, each IPA character has a label for each distinctive feature. "+" is represented by the number 1, "-" by the number -1, and "0" by the number 0. Each IPA character is thus represented as a vector containing only 0, -1 and 1.
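The label-to-number mapping in step 3 can be illustrated with a minimal sketch. The feature list and the annotations below are a small hypothetical excerpt for illustration, not the table of FIG. 2.

```python
# Mapping stated in the text: "+" -> 1, "-" -> -1, "0" -> 0.
LABEL_TO_NUMBER = {"+": 1, "-": -1, "0": 0}

# Hypothetical feature order and annotations, for illustration only.
FEATURES = ["syllabic", "consonantal", "voice", "nasal", "distributed"]
ANNOTATIONS = {
    "p": ["-", "+", "-", "-", "0"],
    "m": ["-", "+", "+", "+", "0"],
    "a": ["+", "-", "+", "-", "0"],
}


def to_vector(labels):
    """Turn a list of "+", "-", "0" labels into a numeric feature vector."""
    return [LABEL_TO_NUMBER[label] for label in labels]


vectors = {ipa: to_vector(labels) for ipa, labels in ANNOTATIONS.items()}
print(vectors["m"])
```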

At the same time, the vector representation of each Chinese pinyin initial/final is determined according to the distinctive features, in the same way as for the IPA characters above. As shown in FIG. 3, some initials and finals are represented by several IPA characters; for these, the vectors of the individual IPA characters are averaged to obtain the vector of the initial/final.
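The averaging for initials/finals written with several IPA characters can be sketched as an element-wise mean. The two four-dimensional vectors below are made-up values used only to show the arithmetic.

```python
def average_vectors(vectors):
    """Element-wise mean of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical feature vectors for the two IPA characters of one final.
first_char = [1, -1, 1, -1]
second_char = [1, -1, 1, 1]

final_vector = average_vectors([first_char, second_char])
print(final_vector)  # [1.0, -1.0, 1.0, 0.0]
```

Note that an averaged vector may contain values other than 0, -1 and 1, for instance 0.5 when one character has 1 and the other has 0 on a feature.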

Step 4: Calculate the distance between each IPA character in the output sequence and each initial/final, and replace the IPA character in the output sequence with the IPA of the closest initial/final.

The distinctive features do not all contribute equally, so a weighted Euclidean distance is used when calculating the distance between vectors: the weights of the syllabic, consonantal, sonorant, continuant, voicing, spread-glottis, constricted-glottis and strident features are reduced, while the weights of the tongue-tip, tongue-surface and tongue-body features are increased. Each IPA character in the output sequence is then replaced with the IPA of the closest initial or final according to the computed distances, yielding the IPA sequence of Chinese pinyin.
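A minimal sketch of the weighted nearest-neighbour replacement follows, assuming the common form d(u, v) = sqrt(sum_i w_i (u_i - v_i)^2). The source gives no numeric weights, so the weights, the three-feature vectors and the candidate set are all hypothetical.

```python
import math


def weighted_euclidean(u, v, weights):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (u_i - v_i)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))


def nearest_unit(vector, candidates, weights):
    """Return the name of the candidate whose vector is closest to `vector`."""
    return min(
        candidates,
        key=lambda name: weighted_euclidean(vector, candidates[name], weights),
    )


# Hypothetical feature order: [syllabic, voice, tongue tip].
# Assumed weights: the first two are reduced, the tongue-tip weight is raised.
WEIGHTS = [0.5, 0.5, 2.0]

# Hypothetical vectors for two pinyin units and one English IPA character.
PINYIN_UNITS = {"s": [-1, -1, 1], "m": [-1, 1, -1]}
english_eth = [-1, 1, 1]  # "ð": voiced, tongue-tip articulation

closest = nearest_unit(english_eth, PINYIN_UNITS, WEIGHTS)
print(closest)  # s
```

With equal weights the two candidates in this toy example are equidistant from the query; the increased tongue-tip weight is what makes "s" the closer match.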

Step 5: Using the pinyin IPA sequence from step 4, represent the output sequence as an initial/final sequence.

After the conversion is completed, the words can be separated by "/", and each IPA character in the words can be separated by a space. The Chinese pinyin and its IPA correspondence is shown in FIG. 3.
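The conversion from pinyin IPA to initials/finals amounts to a table lookup. The sketch below assumes the "/" and space separators described above; `IPA_TO_PINYIN` is a small hypothetical excerpt and does not reproduce FIG. 3.

```python
# Hypothetical excerpt of an IPA -> pinyin initial/final mapping.
IPA_TO_PINYIN = {
    "x": "h",
    "au": "ao",
    "kʰ": "k",
    "s": "s",
}


def ipa_to_pinyin_units(ipa_sequence, mapping):
    """Map each IPA unit to its pinyin initial/final.

    Words are separated by "/" and units within a word by spaces, both in
    the input and in the output.
    """
    words = [word.split() for word in ipa_sequence.split("/")]
    return " / ".join(" ".join(mapping[unit] for unit in word) for word in words)


units = ipa_to_pinyin_units("x au / kʰ s", IPA_TO_PINYIN)
print(units)  # h ao / k s
```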

Step 6: Post-process the resulting initial/final sequence.

Within each word, an initial is merged with the final that follows it, and an initial is merged with the initial that follows it; the result is the final pinyin sequence.
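One way to read the merging rule is that any unit following an initial is attached to it, while a unit following a final starts a new syllable. The sketch below implements that reading; the `INITIALS` set is the standard list of pinyin initials and the "/" word separator is assumed.

```python
# Standard pinyin initials; used to decide whether the previous unit
# should absorb the next one.
INITIALS = {
    "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
    "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s",
}


def merge_units(units):
    """Merge each initial with the unit (initial or final) that follows it."""
    merged = []
    previous_is_initial = False
    for unit in units:
        if merged and previous_is_initial:
            merged[-1] += unit  # remove the space after an initial
        else:
            merged.append(unit)
        previous_is_initial = unit in INITIALS
    return merged


def post_process(sequence):
    """Apply the merge within each "/"-separated word."""
    words = [word.split() for word in sequence.split("/")]
    return " / ".join(" ".join(merge_units(word)) for word in words)


print(post_process("h ao / k s"))  # hao / ks
```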

Result verification

The method was applied to English speech transliteration using the TED-LIUM v2 English data set; the transliteration results are shown in Table 1.

TABLE 1 English phonetic transliteration results

The results show that the method can perform basic transliteration and that the transliterated sentences are of moderate length. Merging initials and finals can also produce non-standard pinyin, such as "rei", which is more flexible than pronunciations expressed through standard pinyin or specific Chinese characters. Because the output is pinyin, it is closer to Chinese and does not require learning the International Phonetic Alphabet.

Claims

1. A neural speech transliteration method based on distinctive features, characterized in that it comprises the following steps:

Step 1: Convert the English text sequence corresponding to the training set into the International Phonetic Alphabet (IPA) sequence, and use the converted data set to train the neural speech recognition model;

Step 2: Input the speech features to be transliterated into the trained neural speech recognition model to obtain the IPA sequence corresponding to the English speech;

Step 3: Determine the vector representation of each English IPA character in the output sequence based on the distinctive features; at the same time, find the corresponding IPA for the initial consonant/final vowel of the Chinese pinyin, and determine the vector representation of the initial consonant/final vowel based on the distinctive features;

Step 4: For each IPA vector in the output sequence, calculate the Euclidean distance between it and each Chinese Pinyin initial consonant and final vowel to measure the similarity between two IPA characters; replace the Chinese Pinyin initial consonant/final vowel IPA character that is closest to the English IPA character in the output sequence to obtain the IPA sequence of Chinese Pinyin initial consonants/final vowels;

Step 5: According to the initial consonant/final vowel of Pinyin and its IPA mapping rule, IPA is replaced with the corresponding initial consonant or final vowel to obtain the initial consonant/final vowel sequence of Pinyin; the initial consonant and final vowel sequences are combined to obtain the final output Pinyin sequence;

Among them, if the initial consonant is followed by a final, the space between the two is removed and they are combined together; if the initial consonant is followed by an initial consonant, the space between the two is removed and they are combined together.

2. A neural speech transliteration method based on distinctive features as described in claim 1, characterized in that the neural speech recognition model adopts a transformer model.

3. A neural speech transliteration method based on distinctive features as described in claim 1, characterized in that in step 3, each IPA character corresponds to a label in each distinctive feature, and "+" is represented by the number 1, "-" is represented by the number -1, and "0" is represented by the number 0, and each IPA character is represented by a vector containing only 0, -1, and 1.

4. A neural speech transliteration method based on distinctive features as described in claim 1, characterized in that in step 3, when the initial consonant and the final are represented by multiple IPA characters, when calculating the vector representation of the initial consonant/final, the vector of each IPA character is averaged to obtain the vector of the initial consonant/final.

5. A neural speech transliteration method based on distinctive features as described in claim 1, characterized in that in step 4, when calculating the distance between vectors, a weighted Euclidean distance is used, and the weights of syllables, consonants, sonorants, consonants, voiced sounds, glottal extension, glottal contraction and harshness are reduced; and the weight of the tip of the tongue or the surface of the tongue is increased.