Neural voice transliteration method based on distinguishing characteristics
Technical Field
The invention relates to a neural voice transliteration method based on distinguishing characteristics, and belongs to the technical field of voice recognition processing.
Background Art
With the rapid development of computer technology and its application across many fields of society, problems such as the difficulty of processing massive amounts of voice data and the difficulty of man-machine interaction have emerged.
The goal of speech recognition is to have a computer automatically convert the content of human speech into text. With further advances in artificial intelligence and deep learning, speech recognition has made significant progress, and most existing speech recognition technologies adopt deep-learning-based methods. Distinguishing features are features, based on the natural properties of speech, that can distinguish linguistic units from one another. Typically, a vector representation can be constructed for a phoneme according to whether the phoneme is labeled "+" or "-" on each feature.
Transliteration is a well-known concept with wide application in daily life. Current transliteration research focuses mainly on text, usually the transliteration of place names, personal names and certain proper nouns in machine translation. Transliteration of entire texts, and of speech to text in particular, is very rare. Current transliteration methods are mainly rule-based: English sequences are mapped to Chinese character sequences by comparing English phonemes with Chinese pinyin, or English words are mapped directly to Chinese characters according to written rules. However, the text generated by such mappings can sound stiff in some cases, and these methods process text only.
Although voice transliteration has fewer application scenarios than text transliteration, the technology can still make daily life more convenient. For example, in English learning, phonetic transliteration is often an effective mnemonic, and voice transliteration of teaching audio can aid memory. Voice transliteration can also facilitate short-term communication between people from different countries. In addition, it can bring some fun to everyday life.
Disclosure of Invention
In view of the shortcomings of the prior art, and in order to remedy the lack of transliteration methods in the voice field, the invention provides a new technical approach for the research and development of text transliteration and proposes a neural voice transliteration method based on distinguishing features.
The innovation of the invention is that neural speech recognition is combined with distinguishing features for the first time, realizing transliteration from English speech to Chinese pinyin text. The bridge between neural speech recognition and the rule processing based on distinguishing features is the International Phonetic Alphabet (IPA). During rule processing, the similarity between IPA characters is calculated and each IPA character is replaced with its most similar counterpart; finally, the IPA sequence is converted into a pinyin sequence according to the IPA-to-pinyin mapping rules.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A neural speech transliteration method based on distinguishing features, comprising the following steps:
First, the English text sequences corresponding to the training set are converted into IPA sequences, and a neural speech recognition model is trained on the converted data set. The neural speech recognition model may be a transducer model.
Then, the speech features to be transliterated are input into the trained neural speech recognition model to obtain the IPA sequence corresponding to the English speech.
At the same time, the IPA corresponding to each initial/final of Chinese pinyin is identified, and the vector representation of each initial/final is determined according to the distinguishing features.
Then, for each IPA vector of the output sequence, the Euclidean distance between that vector and the vector of each pinyin initial and final is calculated as a measure of the similarity between the two IPA characters. Each English IPA character in the output sequence is replaced with the IPA character of the closest pinyin initial/final, yielding an IPA sequence of pinyin initials/finals.
Finally, according to the mapping rules between pinyin initials/finals and IPA, each IPA character is replaced with the corresponding initial or final to obtain a sequence of pinyin initials/finals. The initials and finals in this sequence are then merged to obtain the final output pinyin sequence.
If an initial is followed by a final, the space between them is removed and the two are merged; for example, "h ao" is merged into "hao". If an initial is followed by another initial, the space between them is likewise removed and the two are merged; for example, "k s" is merged into "ks".
Advantageous effects
The method combines advanced speech recognition technology with distinguishing features from linguistics and fills a gap in the field of speech transliteration. Because the speech is transliterated into a pinyin sequence, the problem of stiff pronunciation after transliteration is alleviated. There are other improvements as well: for example, "k s" is not output as "ke si" but merged into "ks", which native Chinese speakers can still easily understand and which greatly shortens the transliterated sequence. Using the pinyin sequence rather than the IPA sequence as the final output makes the transliteration results easier for native Chinese speakers to understand, with no need to learn the International Phonetic Alphabet.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a partial screenshot of the IPA distinguishing-feature annotation table used in the distinguishing-feature rule part of the method.
FIG. 3 is a screenshot of the initial (right)/final (left) table used in the distinguishing-feature rule part of the method.
FIG. 4 shows the result of the data processing performed before training the neural speech recognition model.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, a neural voice transliteration method based on distinguishing characteristics comprises the following steps:
Step 1: The data set is processed, and the English text sequences of the text labels are converted into IPA sequences. Sentence representations before and after conversion are shown in FIG. 4. The data set is used to train the neural speech recognition model, for which a transducer model may be selected.
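The transcript conversion in step 1 can be illustrated with a minimal sketch, assuming a word-level pronunciation lexicon is available. The lexicon entries, the function name `text_to_ipa`, the `<unk>` placeholder and the " / " word separator are all hypothetical choices made for illustration; the actual representation used by the method is the one shown in FIG. 4.

```python
# Toy pronunciation lexicon (hypothetical entries, for illustration only).
TOY_LEXICON = {
    "hello": "h ə l oʊ",
    "world": "w ɜ r l d",
}


def text_to_ipa(sentence, lexicon=TOY_LEXICON, unk="<unk>"):
    """Replace each word of a transcript with its space-separated IPA symbols.

    Words are joined with " / " (an assumed convention); words missing from
    the lexicon are replaced by the `unk` placeholder.
    """
    words = sentence.lower().split()
    return " / ".join(lexicon.get(word, unk) for word in words)


print(text_to_ipa("Hello world"))
```

In practice the lexicon would be a full pronunciation dictionary or a grapheme-to-phoneme tool; the source does not say which is used.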
Features are extracted from the waveform file of the audio signal of each voice clip. The feature extraction method may be MFCC (Mel-frequency cepstral coefficients) or another audio feature extraction method.
Feature extraction from the audio signal yields a two-dimensional array of feature values, in which the first dimension represents time and the second dimension represents frequency. In the time dimension, the feature at each time step t is x_t, where x_t is a vector.
Step 2: The extracted feature sequence is input into the neural speech recognition model, which recognizes the corresponding IPA sequence.
Step 3: The vector representation of each IPA character of the output sequence is determined according to the distinguishing features. As shown in FIG. 2, each IPA character has a label for each distinguishing feature. "+" is represented by the number 1, "-" by the number -1, and "0" by the number 0. Each IPA character is thus represented as a vector containing only the values 0, -1 and 1.
Meanwhile, the vector representation of each initial/final of Chinese pinyin is determined according to the distinguishing features, in the same way as described above for the IPA characters of the output sequence. As shown in FIG. 3, some initials and finals are represented by more than one IPA character; in that case, the vectors of the individual IPA characters are averaged to obtain the vector of the initial/final.
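The encoding and averaging in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the five feature names and the "+"/"-"/"0" labels in `TOY_TABLE` are a hypothetical subset standing in for the full annotation table of FIG. 2, and treating a final as the two IPA characters "a" and "i" is only an example of a multi-character unit.

```python
# Hypothetical subset of distinguishing features; FIG. 2 defines the real table.
FEATURE_NAMES = ["syllabic", "consonantal", "sonorant", "voice", "high"]

# Toy "+"/"-"/"0" annotations for a few IPA characters (illustrative values).
TOY_TABLE = {
    "p": ["-", "+", "-", "-", "0"],
    "m": ["-", "+", "+", "+", "0"],
    "a": ["+", "-", "+", "+", "-"],
    "i": ["+", "-", "+", "+", "+"],
}

# "+" -> 1, "-" -> -1, "0" -> 0, as stated in step 3.
SYMBOL_TO_NUMBER = {"+": 1, "-": -1, "0": 0}


def encode(labels):
    """Turn a list of "+"/"-"/"0" labels into a numeric feature vector."""
    return [SYMBOL_TO_NUMBER[label] for label in labels]


IPA_VECTORS = {ipa: encode(labels) for ipa, labels in TOY_TABLE.items()}


def unit_vector(ipa_chars):
    """Vector of an initial/final: the mean of its IPA characters' vectors."""
    vectors = [IPA_VECTORS[char] for char in ipa_chars]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(IPA_VECTORS["p"])         # single IPA character
print(unit_vector(["a", "i"]))  # unit written with two IPA characters
```

Averaging leaves features on which the component characters agree unchanged and moves features on which they disagree toward 0.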
Step 4: The distance between each IPA character in the output sequence and each initial/final is calculated, and each IPA character in the output sequence is replaced with the IPA of the closest initial/final.
Because the distinguishing features do not contribute equally, a weighted Euclidean distance is used when calculating the distance between vectors: the weights of the syllabic, consonantal, sonorant, continuant, voice, spread glottis, constricted glottis and strident features are reduced, and the weights of the tongue-tip, tongue-surface and tongue-body features are increased. According to the resulting distances, each IPA character in the output sequence is replaced with the IPA of the closest initial or final, yielding the IPA sequence of Chinese pinyin.
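The nearest-unit replacement of step 4 can be sketched as below, assuming the usual definition of weighted Euclidean distance, d(u, v) = sqrt(sum_i w_i (u_i - v_i)^2). The weight values, the two-dimensional vectors and the candidate names "x" and "y" are hypothetical; the source does not state the actual weights.

```python
import math


def weighted_euclidean(u, v, weights):
    """Weighted Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))


def nearest_unit(vector, candidates, weights):
    """Name of the candidate initial/final whose vector is closest to `vector`."""
    return min(
        candidates,
        key=lambda name: weighted_euclidean(vector, candidates[name], weights),
    )


# Hypothetical weights: the first feature is down-weighted, the second up-weighted.
weights = [0.5, 2.0]
# Hypothetical candidate units with two-dimensional feature vectors.
candidates = {"x": [1, 1], "y": [-1, -1]}
query = [1, -1]

# Unweighted, both candidates are equally far from the query; the weights make
# a mismatch on the second feature cost more, so "y" is chosen.
print(nearest_unit(query, candidates, weights))
```

The tie-breaking effect in this toy case shows why the choice of weights matters: a feature with a larger weight dominates the choice of replacement.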
Step 5: Based on the pinyin IPA sequence obtained in step 4, the output sequence is represented as a sequence of initials/finals.
After the conversion is completed, words can be separated by "/", and the IPA characters within each word can be separated by spaces. The correspondence between Chinese pinyin and IPA is shown in FIG. 3.
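The lookup in step 5 can be sketched as a table-driven replacement. The four entries in `IPA_TO_PINYIN` are a hypothetical excerpt chosen for illustration; the complete correspondence is the one given in FIG. 3. The input format follows the convention described above: words separated by "/" and symbols separated by spaces.

```python
# Hypothetical excerpt of an IPA-to-pinyin table; FIG. 3 gives the full mapping.
IPA_TO_PINYIN = {"x": "h", "ɑu": "ao", "kʰ": "k", "s": "s"}


def ipa_to_pinyin_units(sequence, table=IPA_TO_PINYIN):
    """Map each IPA symbol to its pinyin initial/final, keeping word boundaries."""
    words = [word.split() for word in sequence.split("/")]
    return " / ".join(" ".join(table[symbol] for symbol in word) for word in words)


print(ipa_to_pinyin_units("x ɑu / kʰ s"))
```

Because step 4 has already replaced every symbol with the IPA of a pinyin initial or final, a plain dictionary lookup is sufficient in this sketch; a symbol missing from the table would raise a `KeyError`.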
Step 6: The resulting sequence of initials/finals is post-processed.
Within each word, each initial is merged with the final that follows it, and the final pinyin sequence is thereby obtained.
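The post-processing of step 6 can be sketched as follows, using the merging rule stated in the disclosure: an initial is joined to a following final (as in "h ao" becoming "hao") and to a following initial (as in "k s" becoming "ks"). The function names are hypothetical, and `INITIALS` lists the 21 standard pinyin initials as an assumption about which units count as initials here.

```python
# The 21 standard pinyin initials (assumed to be the units treated as initials).
INITIALS = {
    "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
    "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s",
}


def merge_units(word):
    """Merge space-separated initials/finals of one word into pinyin chunks.

    Any unit that directly follows an initial is attached to it; a unit that
    follows a final starts a new chunk.
    """
    merged = []
    previous_is_initial = False
    for unit in word.split():
        if previous_is_initial:
            merged[-1] += unit
        else:
            merged.append(unit)
        previous_is_initial = unit in INITIALS
    return " ".join(merged)


def postprocess(sequence):
    """Apply merge_units to every "/"-separated word of the sequence."""
    return " / ".join(merge_units(word) for word in sequence.split("/"))


print(postprocess("h ao / k s"))
```

Since units are only ever attached to a preceding initial, two consecutive finals stay separate in this sketch; the source does not specify that case.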
Result verification
The method was applied to English speech transliteration. The English data set is TEDLIUM v2, and the transliteration results are shown in Table 1.
TABLE 1 English phonetic transliteration results
Thus the method can complete basic transliteration, and the sentence length of the transliteration result is moderate. Combining initials and finals can produce non-standard pinyin, such as "rei", which is more flexible than pronunciations expressed by standard pinyin and specific Chinese characters. Because the output is pinyin, it is closer to Chinese, and the International Phonetic Alphabet does not need to be learned.