CN116484806A - Chinese character pronunciation conversion method, electronic equipment and storage medium - Google Patents
- Publication number: CN116484806A
- Application number: CN202310570517.3A
- Authority: CN (China)
- Prior art keywords: word, training, semantic, sound, target
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/151 — Electric digital data processing; handling natural language data; text processing; use of codes for handling textual entities; transformation
- G06F40/289 — Electric digital data processing; handling natural language data; natural language analysis; recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Electric digital data processing; handling natural language data; natural language analysis; semantic analysis
Abstract
The application relates to the technical field of artificial intelligence, and in particular to a Chinese character pronunciation conversion method, an electronic device, and a storage medium. In the method of the first aspect, a target text and the polyphone position information in the target text are first acquired. Based on the polyphone position information, the target text is divided into a first conversion set and a second conversion set. The monophones in the first conversion set undergo pronunciation conversion processing to obtain a first pronunciation set. A pre-trained semantic recognition model extracts semantic features from the target text to obtain a semantic feature sequence; based on the polyphone position information, the target polyphone semantic features corresponding to the second conversion set are extracted from this sequence, and a pre-trained pronunciation classifier parses them to obtain a second pronunciation set. Finally, the target pronunciation set corresponding to the target text is obtained from the first pronunciation set and the second pronunciation set, improving the accuracy of Chinese character pronunciation conversion.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a Chinese character pronunciation conversion method, electronic equipment and a storage medium.
Background
Chinese character pronunciation conversion refers to the process of converting Chinese text into the corresponding Chinese pinyin, which marks the pronunciation of the text. It can be applied in many scenarios, for example text-to-speech (TTS) synthesis. Unlike English letters, Chinese characters represent semantics rather than pronunciation, and the accuracy of the conversion directly affects the intelligibility of synthesized speech. Chinese also contains polyphones, characters whose pronunciation must be determined from the context semantics.
In the related art, two families of methods are used to eliminate the pronunciation ambiguity of polyphones during conversion. Some methods select the pronunciation of a polyphone through predefined complex rules and dictionaries, but these require a large number of hand-crafted linguistic rules and are inflexible in use. Other methods first segment the text into words and then select the pronunciation of the polyphone using decision trees and maximum entropy; these depend on the quality of the word segmentation, and if the segmentation result does not match the content of the decision tree, cascading errors are easily produced. How to improve the accuracy of Chinese character pronunciation conversion has therefore become a significant problem to be solved in the industry.
Disclosure of Invention
The present application aims to solve at least one of the technical problems in the prior art. To this end, it provides a Chinese character pronunciation conversion method, an electronic device, and a storage medium that can improve the accuracy of Chinese character pronunciation conversion.
According to an embodiment of the first aspect of the present application, a Chinese character pronunciation conversion method comprises:
acquiring a target text and polyphone position information in the target text;
dividing the target text into a first conversion set and a second conversion set based on the polyphone position information, wherein the first conversion set comprises the monophones in the target text and the second conversion set comprises the polyphones in the target text;
performing pronunciation conversion processing on the monophones in the first conversion set to obtain a first pronunciation set;
extracting semantic features from the target text based on a pre-trained semantic recognition model to obtain a semantic feature sequence;
extracting target polyphone semantic features corresponding to the second conversion set from the semantic feature sequence based on the polyphone position information;
parsing the target polyphone semantic features based on a pre-trained pronunciation classifier to obtain a second pronunciation set; and
obtaining a target pronunciation set corresponding to the target text from the first pronunciation set and the second pronunciation set.
According to some embodiments of the present application, acquiring the target text and the polyphone position information in the target text comprises:
acquiring the target text and performing word segmentation on it to obtain a plurality of target text characters;
arranging the target text characters according to the running order of the target text to obtain a target character sequence; and
determining the polyphone position information from the target character sequence based on a Chinese character pronunciation specification.
According to some embodiments of the present application, the semantic feature sequence comprises a plurality of semantic feature elements whose arrangement corresponds one-to-one with the target text characters in the target character sequence, and extracting the target polyphone semantic features corresponding to the second conversion set from the semantic feature sequence based on the polyphone position information comprises:
determining a first sequence position of a polyphone in the target character sequence based on the polyphone position information;
determining the semantic feature element corresponding to the polyphone in the semantic feature sequence based on the first sequence position; and
determining the semantic feature elements corresponding to the polyphones in the semantic feature sequence as the target polyphone semantic features.
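Because of the one-to-one alignment between characters and semantic feature elements, the extraction above reduces to indexing the feature sequence at the polyphone positions. A minimal sketch, with purely illustrative characters and feature values:

```python
# Toy illustration of the one-to-one position alignment: semantic feature
# element i corresponds to target character i, so the target polyphone
# semantic features are obtained by plain indexing. All values illustrative.
target_chars = ["今", "天", "地", "好"]
semantic_features = [[0.1, 0.0], [0.2, 0.1], [0.3, 0.2], [0.4, 0.3]]
polyphone_positions = [2]                 # polyphone position information

target_polyphone_features = [semantic_features[i] for i in polyphone_positions]
print(target_polyphone_features)          # [[0.3, 0.2]]
```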
According to some embodiments of the present application, before the pre-trained pronunciation classifier parses the target polyphone semantic features to obtain the second pronunciation set, the method further comprises pre-training the pronunciation classifier, specifically:
acquiring a training semantic feature set and a first training label set, wherein the training semantic feature set comprises a plurality of sample polyphone semantic features and the first training label set comprises sample pronunciation labels in one-to-one correspondence with the sample polyphone semantic features; and
training a preset original classifier based on the sample polyphone semantic features and the sample pronunciation labels to obtain the pronunciation classifier.
According to some embodiments of the present application, training the preset original classifier based on the sample polyphone semantic features and the sample pronunciation labels to obtain the pronunciation classifier comprises:
recognizing the sample polyphone semantic features with the original classifier to obtain first training recognition data;
comparing the first training recognition data with the sample pronunciation labels to obtain pronunciation classification probability data;
if the pronunciation classification probability data is below a preset first accuracy threshold, updating the original classifier based on the pronunciation classification probability data, the training semantic feature set and the first training label set;
performing first iterative training on the updated original classifier based on the sample polyphone semantic features and the sample pronunciation labels; and
after the first iterative training, when the pronunciation classification probability data is greater than or equal to the first accuracy threshold, obtaining the pronunciation classifier.
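The recognize-compare-update-iterate loop above, gated by an accuracy threshold, can be sketched generically; `evaluate` and `update` here are illustrative placeholders for the classifier-specific recognition, comparison and update steps, not the application's actual training procedure:

```python
# Hedged sketch of threshold-gated iterative training: evaluate the model,
# stop once accuracy meets the threshold, otherwise update and try again.
def train_until_threshold(model, evaluate, update, threshold, max_iters=100):
    for _ in range(max_iters):
        accuracy = evaluate(model)      # recognize samples, compare with labels
        if accuracy >= threshold:       # classification probability meets threshold
            return model
        model = update(model)           # update the classifier, then iterate
    return model

# Toy demonstration: the "model" is a number whose accuracy rises by 1 per update.
trained = train_until_threshold(0, evaluate=lambda m: m, update=lambda m: m + 1, threshold=3)
print(trained)   # 3
```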
According to some embodiments of the present application, the original classifier comprises a first fully connected layer and a second fully connected layer connected in sequence, and updating the original classifier based on the pronunciation classification probability data, the training semantic feature set and the first training label set comprises:
constructing a cross-entropy loss function based on the pronunciation classification probability data and the first training label set;
inputting the training semantic feature set into the first fully connected layer to obtain pronunciation hidden variables, and obtaining pronunciation weight values from the second fully connected layer;
obtaining a classification angle parameter from the pronunciation hidden variables and the pronunciation weight values;
optimizing the cross-entropy loss function based on the Chinese pronunciation set and the classification angle parameter to obtain a classification loss function; and
updating the original classifier based on the classification loss function.
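A cross-entropy loss optimized with a "classification angle parameter" between hidden variables and class weights reads like an angular-margin softmax (in the style of ArcFace-type losses). The following is a minimal pure-Python sketch under that assumption; the margin and scale values are illustrative choices, not taken from the application:

```python
import math

def angular_margin_ce(hidden, weights, label, margin=0.5, scale=8.0):
    """Cross-entropy over cosine logits with an additive angular margin on the
    target class -- one plausible reading of the classification angle parameter.
    `hidden` is the pronunciation hidden variable; `weights` holds one weight
    vector per pronunciation class (all values illustrative)."""
    def cos_sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # classification angle: angle between the hidden variable and each class weight
    angles = [math.acos(max(-1.0, min(1.0, cos_sim(hidden, w)))) for w in weights]
    # margin is added only to the true class, making the task harder in training
    logits = [scale * math.cos(a + (margin if k == label else 0.0))
              for k, a in enumerate(angles)]
    log_z = math.log(sum(math.exp(l) for l in logits))
    return log_z - logits[label]    # negative log-probability of the true class
```

Adding the margin increases the loss for the same inputs, which is the intended effect: the classifier must separate pronunciation classes by an extra angular gap.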
According to some embodiments of the present application, acquiring the training semantic feature set and the first training label set comprises:
acquiring the training semantic feature set; and
querying labels for the sample polyphone semantic features in the training semantic feature set against a preset Chinese pronunciation set to obtain the sample pronunciation label corresponding to each sample polyphone semantic feature.
According to some embodiments of the present application, before extracting semantic features from the target text based on the pre-trained semantic recognition model to obtain the semantic feature sequence, the method further comprises pre-training the semantic recognition model, specifically:
acquiring a training text set and a second training label set, wherein the training text set comprises a plurality of training texts and the second training label set comprises sample semantic labels in one-to-one correspondence with the training texts;
recognizing the training texts with an original recognition model to obtain second training recognition data;
comparing the second training recognition data with the sample semantic labels to obtain a recognition accuracy;
updating the original recognition model based on the recognition accuracy when the recognition accuracy is below a preset second accuracy threshold;
performing second iterative training on the updated original recognition model based on the training texts and the sample semantic labels; and
after the second iterative training, when the recognition accuracy reaches the second accuracy threshold, obtaining the semantic recognition model.
In a second aspect, an embodiment of the present application provides an electronic device comprising a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, the Chinese character pronunciation conversion method according to any embodiment of the first aspect of the present application is implemented.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the Chinese character pronunciation conversion method according to any embodiment of the first aspect of the present application.
The Chinese character pronunciation conversion method, the electronic device and the storage medium according to the present application have at least the following beneficial effects:
The method first acquires a target text and the polyphone position information in it, then divides the target text, based on that position information, into a first conversion set containing the monophones and a second conversion set containing the polyphones. The monophones in the first conversion set undergo pronunciation conversion processing to obtain a first pronunciation set. A pre-trained semantic recognition model extracts a semantic feature sequence from the target text; based on the polyphone position information, the target polyphone semantic features corresponding to the second conversion set are extracted from this sequence, and a pre-trained pronunciation classifier parses them to obtain a second pronunciation set. Finally, the target pronunciation set corresponding to the target text is obtained from the first and second pronunciation sets. This improves the accuracy of Chinese character pronunciation conversion.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, wherein:
Fig. 1 is a flow chart of a Chinese character pronunciation conversion method according to an embodiment of the present application;
Fig. 2 is a flow chart of step S101 of the embodiment shown in Fig. 1;
Fig. 3 is a flow chart of step S104 of the embodiment shown in Fig. 1;
Fig. 4 is a flow chart of step S302 in Fig. 3;
Fig. 5 is a flow chart of step S105 of the embodiment shown in Fig. 1;
Fig. 6 is a flow chart of step S106 of the embodiment shown in Fig. 1;
Fig. 7 is a flow chart of step S601 of the embodiment shown in Fig. 6;
Fig. 8 is a flow chart of step S602 of the embodiment shown in Fig. 6;
Fig. 9 is another flow chart of step S803 in Fig. 8;
Fig. 10 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In the description of the present application, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding" and the like are understood to exclude the stated number, while "above", "below", "within" and the like are understood to include it. The terms "first" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present application, any reference to orientation or positional relationship, such as up, down, left, right, front or rear, is based on the orientation or positional relationship shown in the drawings. It is used only for convenience and simplicity of description and does not indicate or imply that the referenced apparatus or element must have a specific orientation or be constructed and operated in a specific orientation, and should therefore not be construed as limiting the present application.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
In the description of the present application, unless explicitly defined otherwise, terms such as "arrange", "install" and "connect" should be construed broadly, and their specific meaning can reasonably be determined by a person skilled in the art from the content of the technical solution. In addition, the description of specific steps below does not limit the order of the steps or the logic performed; the order of the steps and the logic between them should be understood with reference to what is described in the embodiments.
Chinese character pronunciation conversion refers to the process of converting Chinese text into the corresponding Chinese pinyin, which marks the pronunciation of the text. It can be applied in many scenarios, for example text-to-speech (TTS) synthesis. Unlike English letters, Chinese characters represent semantics rather than pronunciation, and the accuracy of the conversion directly affects the intelligibility of synthesized speech. Chinese also contains polyphones, characters whose pronunciation must be determined from the context semantics.
In the related art, two families of methods are used to eliminate the pronunciation ambiguity of polyphones during conversion. Some methods select the pronunciation of a polyphone through predefined complex rules and dictionaries, but these require a large number of hand-crafted linguistic rules and are inflexible in use. Other methods first segment the text into words and then select the pronunciation of the polyphone using decision trees and maximum entropy; these depend on the quality of the word segmentation, and if the segmentation result does not match the content of the decision tree, cascading errors are easily produced. How to improve the accuracy of Chinese character pronunciation conversion has therefore become a significant problem to be solved in the industry.
The present application aims to solve at least one of the technical problems in the prior art. To this end, it provides a Chinese character pronunciation conversion method, an electronic device, and a storage medium that can improve the accuracy of Chinese character pronunciation conversion.
The following is a further description based on the accompanying drawings.
Fig. 1 shows an optional flow chart of the Chinese character pronunciation conversion method, which may include, but is not limited to, steps S101 to S107 described below.
Step S101: acquiring a target text and the polyphone position information in the target text;
Step S102: dividing the target text into a first conversion set and a second conversion set based on the polyphone position information, wherein the first conversion set comprises the monophones in the target text and the second conversion set comprises the polyphones in the target text;
Step S103: performing pronunciation conversion processing on the monophones in the first conversion set to obtain a first pronunciation set;
Step S104: extracting semantic features from the target text based on a pre-trained semantic recognition model to obtain a semantic feature sequence;
Step S105: extracting target polyphone semantic features corresponding to the second conversion set from the semantic feature sequence based on the polyphone position information;
Step S106: parsing the target polyphone semantic features based on a pre-trained pronunciation classifier to obtain a second pronunciation set;
Step S107: obtaining a target pronunciation set corresponding to the target text from the first pronunciation set and the second pronunciation set.
In this method, the target text and the polyphone position information in it are first acquired, and the target text is divided, based on that position information, into a first conversion set containing the monophones and a second conversion set containing the polyphones. The monophones in the first conversion set undergo pronunciation conversion processing to obtain a first pronunciation set. A pre-trained semantic recognition model then extracts a semantic feature sequence from the target text; based on the polyphone position information, the target polyphone semantic features corresponding to the second conversion set are extracted from this sequence and parsed by a pre-trained pronunciation classifier to obtain a second pronunciation set. Finally, the target pronunciation set corresponding to the target text is obtained from the first and second pronunciation sets, which improves the accuracy of Chinese character pronunciation conversion.
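The sequence S101 to S107 can be sketched end-to-end as a toy pipeline. The dictionaries and the stub classifier below are illustrative assumptions standing in for the Chinese character pronunciation specification, the semantic recognition model and the trained pronunciation classifier of the application; the tone-number pinyin notation is likewise illustrative:

```python
# Minimal sketch of steps S101-S107 with placeholder data.
MONO_PINYIN = {"水": "shui3", "好": "hao3"}   # monophones: one fixed reading
POLY_PINYIN = {"地": ["de5", "di4"]}          # polyphones: several candidate readings

def polyphone_positions(chars):
    # S101: polyphone position information, per the pronunciation specification
    return [i for i, c in enumerate(chars) if c in POLY_PINYIN]

def classify_polyphone(chars, i):
    # S104-S106 stand-in: a real system extracts semantic features and applies
    # a trained classifier; here we just pick the first candidate reading.
    return POLY_PINYIN[chars[i]][0]

def text_to_pinyin(text):
    chars = list(text)                          # character-level segmentation
    poly = set(polyphone_positions(chars))      # S102: split into the two sets
    out = []
    for i, c in enumerate(chars):
        if i in poly:
            out.append(classify_polyphone(chars, i))   # S105-S106
        else:
            out.append(MONO_PINYIN.get(c, c))          # S103: direct conversion
    return out                                  # S107: merged target set

print(text_to_pinyin("水地好"))   # ['shui3', 'de5', 'hao3']
```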
In step S101 of some embodiments of the present application, the target text and the polyphone position information in the target text are obtained. The target text consists of Chinese characters conforming to the Chinese grammar specification, has a definite running order, and is composed of polyphones and monophones. The pronunciation conversion method converts the target text into the corresponding target pronunciation set, and the conversion of the polyphones is its focus. Thus, in some exemplary embodiments, the positions of the polyphones in the running order of the target text are determined first, yielding the polyphone position information.
Referring to Fig. 2, step S101 may, according to some embodiments of the present application, include, but is not limited to, the following steps S201 to S203:
Step S201: acquiring the target text and performing word segmentation on it to obtain a plurality of target text characters;
Step S202: arranging the target text characters according to the running order of the target text to obtain a target character sequence;
Step S203: determining the polyphone position information from the target character sequence based on the Chinese character pronunciation specification.
In step S201 of some embodiments of the present application, the target text is obtained and segmented into a plurality of target text characters. The purpose of segmenting the target text is to facilitate determining the polyphone position information. Word segmentation is the basis of natural language processing, and its accuracy directly determines the quality of the semantic features. Because English sentences separate words with spaces, segmentation rarely needs to be considered in English except for certain multi-word expressions (e.g., "how many", "New York"). Chinese text, however, naturally lacks separators, and readers must segment words and sentences themselves, so Chinese natural language processing must segment the text first. Current Chinese word segmentation methods fall into two main families: dictionary-based rule matching and statistics-based machine learning. A dictionary-based segmentation algorithm is essentially string matching: the string to be matched is compared against a sufficiently large dictionary under some matching strategy, and a hit yields a segmentation. Depending on the strategy, these methods include forward maximum matching, reverse maximum matching, bidirectional matching, full-segmentation path selection, and so on. A statistics-based segmentation algorithm, by contrast, treats segmentation as a sequence labelling problem: each character in a sentence is tagged according to its position within a word.
The labels are mainly: B (the first character of a word), E (the last character of a word), M (a middle character of a word, of which there may be several), and S (a single-character word). For example, for a sentence segmented as "today/weather/true/good" (今天/天气/真/好), the label sequence is "BEBESS". It should be understood that many segmentation schemes exist: text may be split character by character, by Chinese vocabulary, or in other ways, so the segmentation of the target text is not limited to the specific embodiments described above.
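The B/M/E/S labelling described above can be derived mechanically from a given segmentation. A small sketch (the example words are illustrative):

```python
# Derive the B/M/E/S tag string for a list of segmented words:
# single-character words get S; longer words get B, zero or more M, then E.
def bmes_tags(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("M" * (len(w) - 2))
            tags.append("E")
    return "".join(tags)

print(bmes_tags(["今天", "天气", "真", "好"]))   # BEBESS
```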
In step S202 of some embodiments of the present application, the target text characters are arranged according to the running order of the target text to obtain the target character sequence. Because the target text consists of Chinese characters in a definite running order, after segmentation each target text character can be arranged in that order, yielding a target character sequence that makes the positions of the polyphones and monophones in the target text explicit.
In step S203 of some embodiments of the present application, the polyphone position information is determined from the target character sequence based on the Chinese character pronunciation specification. The Chinese character pronunciation specification refers to the pronunciation annotation specification for Chinese. Chinese pinyin is the standard phoneticization of modern Chinese, i.e. the phonetic syllables of Mandarin spelled with the letters and rules specified in the Scheme for the Chinese Phonetic Alphabet. As the Mandarin phonetic notation of Chinese characters, pinyin is a tool that assists the pronunciation of Chinese characters. The Scheme for the Chinese Phonetic Alphabet is also the unified specification for spelling Chinese personal names, place names and Chinese documents in Roman letters, and is used in fields where Chinese characters are inconvenient or unusable; symbols written according to this specification are called pinyin. Therefore, in some embodiments of the present application, the pronunciation annotation specification based on the Scheme for the Chinese Phonetic Alphabet is the Chinese character pronunciation specification. Under this scheme, each monophone has a single pronunciation label, such as 水 (shuǐ), 孔 (kǒng) and 黑 (hēi), while each polyphone has a group of pronunciation labels, such as 少 (shǎo, shào), 难 (nán, nàn) and 的 (de, dí, dì). The characters configured with a group of pronunciation labels can therefore be identified from the specification, which determines the polyphones in Chinese.
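Under this reading, detecting polyphones is a lookup against the pronunciation specification: a character is a polyphone exactly when its entry lists more than one reading. A toy sketch (the entries and the tone-number notation are illustrative assumptions, not the full specification):

```python
# Toy pronunciation specification: monophones map to one reading,
# polyphones to a group of readings (all entries illustrative).
PRONUNCIATION_SPEC = {
    "水": ["shui3"],            # monophone: single pronunciation label
    "黑": ["hei1"],             # monophone
    "难": ["nan2", "nan4"],     # polyphone: a group of pronunciation labels
    "的": ["de5", "di2", "di4"],
}

def is_polyphone(char):
    return len(PRONUNCIATION_SPEC.get(char, [])) > 1

def polyphone_position_info(char_sequence):
    # Step S203: positions of the polyphones in the target character sequence
    return [i for i, c in enumerate(char_sequence) if is_polyphone(c)]

print(polyphone_position_info(list("水难的黑")))   # [1, 2]
```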
It should be pointed out that, according to the "Table of Commonly Used Characters in Modern Chinese", there are 3,500 commonly used Chinese characters, of which more than 250 are polyphones; the Xinhua Dictionary lists more than 600 polyphones; the "Table of General Standard Chinese Characters" is the standard character set for recording modern Chinese, reflecting the norms of modern general-purpose Chinese characters in terms of character count, character level, character form, and so on, and it contains more than 920 polyphones. In some embodiments of the present application, the Chinese word-sound specification is obtained from a Chinese character specification; different character specifications record different numbers of characters, so the corresponding word-sound specifications differ accordingly. However, once the characters recorded in the character specification are determined, the word-sound categories recorded in the corresponding word-sound specification are determined as well. In some specific embodiments, in a Chinese word-sound specification covering more than 18,000 characters, if 762 characters are polyphones, those 762 polyphones may correspond to 875 word-sound categories; it should be clear that different Chinese characters may share the same word sound, for example, "地" corresponds to the word sound "de".
Through steps S201 to S203, it is made clear that the text is first segmented into words, the characters are then arranged according to the running-text order of the target text to obtain the target character sequence, and the polyphone position information is finally determined from the target character sequence based on the Chinese word-sound specification. This method conveniently and rapidly determines the polyphone position information, locating the polyphones within the target character sequence.
In step S102 of some embodiments of the present application, the target text is divided into a first conversion corpus and a second conversion corpus based on the polyphone position information, the first conversion corpus comprising the monophones in the target text and the second conversion corpus comprising the polyphones in the target text. It should be noted that once the polyphone position information in the target text is obtained, the positions of the monophones making up the rest of the text are also determined, so the target text can be divided into the two corpora based on the polyphone position information. Since this division follows directly from the polyphone position information, in some exemplary embodiments of the present application the second conversion corpus effectively combines the polyphones in the target text with their position information. For example, if the target text contains N Chinese characters in total, it may be expressed as Y0 = {y_1, y_2, ..., y_N}; based on the polyphone position information, the positions of the polyphones in Y0 can be specified as {i, j, ..., n}, where 1 ≤ i, j, ..., n ≤ N. The first conversion corpus may then be represented as Y1 = {..., y_{i-1}, y_{i+1}, ..., y_{j-1}, y_{j+1}, ..., y_N} and the second conversion corpus as Y2 = {y_i, y_j, ..., y_n}. It should be appreciated that the target text, the first conversion corpus, and the second conversion corpus can be represented in a wide variety of ways, including but not limited to the specific embodiments set forth above.
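As a minimal illustrative sketch (all names hypothetical, using 0-based positions rather than the 1-based indices of the notation above), the division into the two conversion corpora can be written as:

```python
def split_corpus(chars, poly_positions):
    """Split a character sequence Y0 into a monophone corpus Y1 and a
    polyphone corpus Y2, given the (0-based) polyphone positions."""
    poly = set(poly_positions)
    y1 = [c for i, c in enumerate(chars) if i not in poly]  # monophones
    y2 = [c for i, c in enumerate(chars) if i in poly]      # polyphones
    return y1, y2

# Toy example: positions 1 and 3 hold polyphones.
chars = ["我", "将", "去", "少", "林"]
y1, y2 = split_corpus(chars, [1, 3])
# y1 == ["我", "去", "林"], y2 == ["将", "少"]
```

Keeping the polyphone positions alongside Y2 is what later allows the converted word sounds to be merged back into their original places.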
In step S103 of some embodiments of the present application, word-to-sound conversion is performed on the monophones in the first conversion corpus to obtain a first word-sound set. It should be emphasized that word-to-sound conversion maps the target text to its corresponding target word-sound set; because the target text comprises both polyphones and monophones, and the two kinds of text differ in conversion difficulty, some exemplary embodiments of the present application convert them separately. There is no fixed order between converting the monophones and converting the polyphones: the two steps may be performed simultaneously, the monophones may be converted first, or the polyphones may be converted first. According to the embodiments provided by the present application, each monophone in the first conversion corpus has a unique mapping to a word sound in the Chinese pinyin specification (for example, in various dictionaries), so the monophones can be converted in various ways, including but not limited to: completing the conversion directly through the unique monophone-to-word-sound mapping in a preset word-sound table, completing the conversion through an artificial intelligence model, and the like.
In some exemplary embodiments, if the target text contains N Chinese characters in total, it may be represented as Y0 = {y_1, y_2, ..., y_N}; after dividing the target text based on the polyphone position information, the first conversion corpus may be represented as Y1 = {..., y_{i-1}, y_{i+1}, ..., y_{j-1}, y_{j+1}, ..., y_N}, and performing word-to-sound conversion on the monophones in the first conversion corpus yields the first word-sound set X1 = {..., x_{i-1}, x_{i+1}, ..., x_{j-1}, x_{j+1}, ..., x_N}. It should be appreciated that the first word-sound set may be represented in a wide variety of ways, including but not limited to the specific embodiments set forth above.
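A minimal sketch of the table-lookup conversion route for monophones, assuming a hypothetical toy word-sound table (tone numbers stand in for tone marks; a real system would load the full table derived from the Chinese pinyin specification):

```python
# Hypothetical fragment of a preset monophone word-sound table.
MONO_TABLE = {"水": "shui3", "孔": "kong3", "黑": "hei1"}

def convert_monophones(mono_chars):
    """Each monophone maps uniquely to one word sound, so the conversion
    of step S103 reduces to a direct table lookup."""
    return [MONO_TABLE[c] for c in mono_chars]
```

Because the mapping is one-to-one for monophones, no context or model inference is needed on this branch.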
Referring to fig. 3, some embodiments of the present application further include, before step S104, pre-training the semantic recognition model, specifically through the following steps S301 to S302.
Step S301, a training text set and a second training label set are obtained, wherein the training text set comprises a plurality of training texts, and the second training label set comprises sample semantic labels corresponding to the training texts one by one;
step S302, training a preset original recognition model based on the training text and the sample semantic tags to obtain a semantic recognition model.
In steps S301 to S302 of some embodiments of the present application, a training text set and a second training label set are first obtained; the training text set comprises a plurality of training texts, and the second training label set comprises sample semantic labels in one-to-one correspondence with the training texts. A preset original recognition model is then trained based on the training texts and the sample semantic labels to obtain the semantic recognition model. It should be noted that the semantic recognition model is a natural language model for extracting text semantic features, which acquires this capability through pre-training; the preset original recognition model is a preset natural language model that has not yet been pre-trained. During training, the training texts are input into the original recognition model for recognition, the model is adjusted and corrected using the sample semantic labels corresponding one-to-one to the training texts, its ability to extract text semantic features is gradually built up, and the semantic recognition model is obtained once pre-training is complete.
Through the above steps S301 to S302, the embodiments of the present application provide a pre-training method for the semantic recognition model, giving it the capability of extracting text semantic features.
Referring to fig. 4, step S302 according to some embodiments of the present application may include, but is not limited to, the following steps S401 to S405.
step S401, recognizing training texts through an original recognition model to obtain second training recognition data;
step S402, comparing the second training identification data with the sample semantic tags to obtain identification accuracy;
step S403, when the recognition accuracy is lower than the accuracy threshold, updating the original recognition model based on the recognition accuracy;
step S404, performing second iteration training on the updated original recognition model based on the training text and the semantic tags;
and step S405, after the second iteration training, when the recognition accuracy reaches a preset second accuracy threshold, obtaining a semantic recognition model.
In step S401 of some embodiments of the present application, training text is identified by the original identification model, so as to obtain second training identification data. It should be noted that, the second training recognition data refers to result data obtained after the original recognition model recognizes the training text, that is, the training text semantics extracted by the original recognition model with respect to the training text.
In steps S402 to S405 of some embodiments of the present application, the second training recognition data is compared with the sample semantic labels to obtain a recognition accuracy; when the recognition accuracy is lower than the accuracy threshold, the original recognition model is updated based on the recognition accuracy; the updated original recognition model is then subjected to second iterative training based on the training texts and the semantic labels; and after the second iterative training, the semantic recognition model is obtained once the recognition accuracy reaches the preset second accuracy threshold. After the second training recognition data is obtained, it must be compared with the sample semantic labels to obtain the recognition accuracy, the purpose being to test the semantic extraction capability of the original recognition model through that accuracy. If the second training recognition data is closer to the sample semantic labels, the recognition accuracy is higher and the semantic extraction capability of the original recognition model is stronger; conversely, if the second training recognition data is farther from the sample semantic labels, the recognition accuracy is lower and the semantic extraction capability of the original recognition model is weaker.
It should be noted that each round of the second iterative training adjusts the model parameters according to the trend of the recognition accuracy before the next round proceeds. For example, if the recognition accuracy of the current round improves markedly over the previous round, the previous round's parameter tuning was beneficial to model performance, and the current round can follow the same tuning direction; if the recognition accuracy does not improve markedly, the previous tuning direction may be flawed, or the model performance may already have converged. It should also be noted that the preset original recognition model can be trained on the training texts and sample semantic labels in various ways, including but not limited to the specific embodiments mentioned above.
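The accuracy-threshold loop of steps S401 to S405 can be sketched generically; the model, trainer, and evaluator below are toy stand-ins, not the application's actual networks:

```python
def pretrain(model, train_fn, eval_fn, acc_threshold=0.95, max_epochs=100):
    """Generic sketch of steps S401-S405: evaluate, and keep updating the
    model until the recognition accuracy reaches the preset threshold
    (or an epoch cap is hit)."""
    for _ in range(max_epochs):
        accuracy = eval_fn(model)          # S402: compare predictions to labels
        if accuracy >= acc_threshold:      # S405: threshold reached, stop
            return model, accuracy
        model = train_fn(model, accuracy)  # S403/S404: update and iterate
    return model, eval_fn(model)

# Toy stand-ins: the "model" is a scalar and each update nudges accuracy up.
model, acc = pretrain(
    model=0.5,
    train_fn=lambda m, a: m + 0.1,
    eval_fn=lambda m: m,
    acc_threshold=0.9,
)
```

In a real setting `train_fn` would run one round of gradient updates and `eval_fn` would score the model on held-out labeled text.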
In step S104 of some embodiments of the present application, semantic feature extraction is performed on the target text based on the pre-trained semantic recognition model to obtain a semantic feature sequence. It should be noted that the semantic recognition model is a natural language model for extracting text semantic features, which acquires this capability through pre-training, and that a semantic feature sequence is a sequence of feature vectors characterizing the semantics of Chinese text. It must be emphasized that the target text is composed of a plurality of Chinese characters conforming to the Chinese grammar specification and has a definite running-text order, so in some embodiments of the present application the semantic feature sequence extracted from the target text by the semantic recognition model also corresponds to that order. For example, if the target text contains N Chinese characters in total and is expressed as Y0 = {y_1, y_2, ..., y_N}, then extracting semantic features from the target text with the pre-trained semantic recognition model yields the semantic feature sequence H0 = {h_1, h_2, ..., h_N}. Compared with traditional hand-designed features for characterizing text semantics, extracting the semantic feature sequence with a semantic recognition model avoids a complex and delicate feature-design process, migrates flexibly to different application scenarios, and avoids the cascading errors introduced by word segmentation, part-of-speech tagging, and the like.
In some more specific embodiments, the selectable semantic recognition models are varied, such as the BERT pre-training model (Bidirectional Encoder Representations from Transformers, BERT), the Multi-Task Deep Neural Networks (MT-DNN) model, the XLNet model, and the like. It should be noted that BERT is a Transformer-based bidirectional encoder designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context, so a pre-trained BERT model needs only one extra output layer of fine-tuning to produce a model for a wide range of natural language processing tasks. Therefore, in some preferred embodiments of the present application, pre-training is performed on the basis of the BERT pre-training model to obtain the semantic recognition model for extracting the semantic features of the target text.
In step S105 of some embodiments of the present application, the target polyphone semantic features corresponding to the second conversion corpus are extracted from the semantic feature sequence based on the polyphone position information. In some exemplary embodiments of the present application, the semantic feature sequence extracted from the target text by the semantic recognition model corresponds to the running-text order of the target text, and the target text is composed of polyphones and monophones, so the target polyphone semantic features corresponding to the second conversion corpus can be extracted from the semantic feature sequence based on the polyphone position information. For example, if the target text contains N Chinese characters in total and is expressed as Y0 = {y_1, y_2, ..., y_N}, extracting semantic features with the pre-trained semantic recognition model yields the semantic feature sequence H0 = {h_1, h_2, ..., h_N}; the polyphone position information further specifies the polyphone positions in Y0 as {i, j, ..., n}, where 1 ≤ i, j, ..., n ≤ N, and the target polyphone semantic features H2 = {h_i, h_j, ..., h_n} corresponding to the second conversion corpus are then extracted from H0. The target polyphone semantic features may be expressed in a variety of ways, including but not limited to the specific embodiments described above.
Referring to fig. 5, the semantic feature sequence includes a plurality of semantic feature elements, and the target text characters in the target character sequence correspond one-to-one, by position, to the semantic feature elements in the semantic feature sequence; step S105 according to some embodiments of the present application may include, but is not limited to, the following steps S501 to S503.
Step S501, determining a first sequence position of a polyphone in a target character sequence based on polyphone position information;
step S502, determining semantic feature elements corresponding to polyphones in a semantic feature sequence based on the first sequence position;
step S503, determining semantic feature elements corresponding to the polyphones in the semantic feature sequence as target polyphone semantic features.
In steps S501 to S502 of some embodiments of the present application, a first sequence position of a polyphone in a target character sequence is determined based on polyphone position information, and then a semantic feature element corresponding to the polyphone in a semantic feature sequence is determined based on the first sequence position. It should be emphasized that in some exemplary embodiments of the present application, the semantic feature sequence extracted from the target text by the semantic recognition model corresponds to the line text sequence of the target text, and the target text is composed of polyphones and monophones, so that the semantic feature sequence includes semantic feature elements corresponding to the polyphones and also includes semantic feature elements corresponding to the monophones. Based on the polyphone position information in the target character sequence, the first sequence position of the polyphones in the target character sequence can be determined, and the target text characters in the target character sequence and the semantic feature elements in the semantic feature sequence are in one-to-one correspondence in the arrangement position relation, so that the semantic feature elements corresponding to the polyphones in the semantic feature sequence can be determined further based on the first sequence position.
In step S503, the semantic feature elements corresponding to the polyphones in the semantic feature sequence are determined to be the target polyphone semantic features: once those elements have been located within the semantic feature sequence, they can be taken directly as the target polyphone semantic features.
Through the embodiment shown in step S501 to step S503, the semantic features of the target polyphones corresponding to the second conversion corpus can be determined efficiently by taking the target character sequences and the semantic feature sequences which are in one-to-one correspondence in the arrangement position relationship as clues, so that the sentence meaning of the polyphones in the target text set can be clarified, and the word sounds corresponding to the polyphones can be further clarified in the subsequent steps.
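Since the character sequence and the feature sequence are aligned position-for-position, steps S501 to S503 reduce to index selection; a minimal sketch with hypothetical names and 0-based positions:

```python
def extract_polyphone_features(H0, poly_positions):
    """Steps S501-S503: pick out, from the full semantic feature sequence
    H0, the elements aligned with the polyphone positions (0-based)."""
    return [H0[i] for i in poly_positions]

# H0 for a 5-character text; characters at positions 1 and 3 are polyphones.
H0 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
H2 = extract_polyphone_features(H0, [1, 3])
# H2 == [[0.3, 0.4], [0.7, 0.8]]
```

The one-to-one positional correspondence is what makes this a constant-time lookup per polyphone rather than a search.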
Referring to fig. 6, some embodiments provided herein further include, before step S106, pre-training the word-sound classifier, specifically through steps S601 to S602 described below.
Step S601, a training semantic feature set and a first training label set are obtained, wherein the training semantic feature set comprises a plurality of sample polyphone semantic features, and the first training label set comprises sample word sound labels which are in one-to-one correspondence with the sample polyphone semantic features;
Step S602, training a preset original classifier based on the semantic features of the sample polyphones and the sample word-sound labels to obtain the word-sound classifier.
In steps S601 to S602 of some embodiments of the present application, a training semantic feature set and a first training label set are first obtained; the training semantic feature set comprises a plurality of sample polyphone semantic features, and the first training label set comprises sample word-sound labels in one-to-one correspondence with those features. A preset original classifier is then trained based on the sample polyphone semantic features and the sample word-sound labels to obtain the word-sound classifier. It should be noted that the word-sound classifier is a natural language model that determines word sounds from text semantic features, a capability acquired through pre-training; the preset original classifier is a preset natural language model that has not yet been pre-trained. During training, the sample polyphone semantic features are input into the original classifier for recognition, the classifier is adjusted and corrected using the sample word-sound labels corresponding one-to-one to those features, its ability to determine word sounds from text semantic features is gradually built up, and the word-sound classifier is obtained once pre-training is complete.
Through the above steps S601 to S602, the embodiments of the present application provide a pre-training method for the word-sound classifier, giving it the capability of determining word sounds from text semantic features.
Referring to fig. 7, step S601 according to some embodiments of the present application may include, but is not limited to, the following steps S701 to S702.
Step S701, acquiring a training semantic feature set;
step S702, performing label query on a plurality of sample polyphone semantic features in a training semantic feature set based on a preset Chinese word sound set to obtain sample word sound labels corresponding to each sample polyphone semantic feature.
In steps S701 to S702 of some embodiments of the present application, a training semantic feature set is first obtained, and label queries are then performed on the plurality of sample polyphone semantic features in that set based on a preset Chinese word-sound set, yielding the sample word-sound label corresponding to each sample polyphone semantic feature. It should be noted that the word sound of a polyphone is closely tied to its meaning in the target text, and word sounds serve the important function of distinguishing part of speech and word sense. For example, "将" is read jiāng when it acts as a verb (e.g., "to bring") or the adverb "about to", jiàng when it means the noun "general", and qiāng in certain classical usages; likewise, "少" is read shǎo when it means "small in number" or "insufficient", and shào when it means "young". To be clear, the preset Chinese word-sound set records the mapping between each polyphone's word sounds and its semantic features. Therefore, performing label queries on the sample polyphone semantic features in the training semantic feature set against the preset Chinese word-sound set yields the sample word-sound label actually corresponding to each sample polyphone semantic feature.
Through the embodiments shown in steps S701 to S702, the sample word-sound labels can be queried from the sample polyphone semantic features according to the mapping between a polyphone's word sounds and its semantic features, so that the original classifier can be adjusted and corrected with the sample word-sound labels corresponding one-to-one to the sample polyphone semantic features, training the word-sound classifier's ability to determine word sounds from text semantic features.
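A minimal sketch of the label query in step S702, with a hypothetical fragment of the Chinese word-sound set keyed by (character, sense); the sense keys and tone-number spellings are illustrative assumptions:

```python
# Hypothetical fragment of the preset Chinese word-sound set: for each
# polyphone sense (semantic key), the word sound it maps to.
WORD_SOUND_SET = {
    ("将", "about to"): "jiang1",
    ("将", "general"):  "jiang4",
    ("少", "few"):      "shao3",
    ("少", "young"):    "shao4",
}

def query_label(char, sense):
    """Look up the sample word-sound label for one polyphone sense."""
    return WORD_SOUND_SET[(char, sense)]
```

In practice the semantic key would be derived from the sample polyphone semantic feature rather than a hand-written sense string.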
Referring to fig. 8, step S602 according to some embodiments of the present application may include, but is not limited to, steps S801 to S805 described below.
Step S801, recognizing sample polyphone semantic features through an original classifier to obtain first training recognition data;
step S802, comparing the first training identification data with a sample word sound label to obtain word sound classification probability data;
step S803, if the word-tone classification probability data is lower than a preset first accuracy threshold, updating the original classifier based on the word-tone classification probability data, the training semantic feature set and the first training label set;
step S804, based on the sample multi-tone word semantic features and the sample word-tone labels, performing a first iterative training on the updated original classifier;
step S805, after the first iterative training, when the word-tone classification probability data is greater than or equal to the first accuracy threshold, obtaining a word-tone classifier.
In step S801 of some embodiments of the present application, sample polyphonic semantic features are identified by an original classifier, so as to obtain first training identification data. It should be noted that, the first training recognition data refers to result data obtained after the original classifier recognizes the sample polyphone semantic features, that is, word sound categories correspondingly divided by the original classifier with respect to the sample polyphone semantic features.
In steps S802 to S805 of some embodiments of the present application, the first training recognition data is compared with the sample word-sound labels to obtain word-sound classification probability data; if that data is lower than the preset first accuracy threshold, the original classifier is updated based on the word-sound classification probability data, the training semantic feature set, and the first training label set; first iterative training is then performed on the updated original classifier based on the sample polyphone semantic features and the sample word-sound labels; and after the first iterative training, the word-sound classifier is obtained once the word-sound classification probability data is greater than or equal to the first accuracy threshold. Note that the word-sound classification probability data is the probability distribution over word-sound categories produced after the original classifier recognizes the sample polyphone semantic features. For example, if the data is "P(A) = 0.228, P(B) = 0.619, P(C) = 0.153", the probability that the sample polyphone semantic feature corresponds to word sound A is 22.8%, to word sound B is 61.9%, and to word sound C is 15.3%. It should be appreciated that the word-sound classification probability data may be represented in a wide variety of ways, including but not limited to the specific embodiments set forth above.
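Once the classifier outputs word-sound classification probability data, the predicted reading is simply the highest-probability category; a sketch over the example distribution above (hypothetical helper name):

```python
def pick_reading(prob_data):
    """Return the word-sound category with the highest probability."""
    return max(prob_data, key=prob_data.get)

probs = {"A": 0.228, "B": 0.619, "C": 0.153}
predicted = pick_reading(probs)  # "B", since 61.9% is the largest probability
```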
After the first training recognition data is obtained, it must be compared with the sample word-sound labels to obtain the word-sound classification probability data, the purpose being to test the word-sound classification capability of the original classifier. If the first training recognition data is closer to the sample word-sound labels, the probability distribution reflected by the classification probability data is closer to the labels and the classification capability of the original classifier is stronger; conversely, if the data is farther from the labels, the capability is weaker. It should also be noted that each round of the first iterative training adjusts the model parameters according to the trend of the word-sound classification probability data before the next round proceeds: if the current round's probability data is markedly closer to the sample word-sound labels than the previous round's, the previous round's parameter tuning was beneficial to model performance and the current round can follow the same tuning direction; if it is not markedly closer, the previous tuning direction may be flawed, or the model performance may already have converged.
It should be noted that the training of the preset original classifier based on the sample polyphonic semantic features and the sample word-tone labels may be performed in various ways, and may include, but is not limited to, the specific embodiments described above.
Referring to fig. 9, the original classifier includes a first fully connected layer and a second fully connected layer connected in sequence, and according to step S803 of some embodiments of the present application, may include, but is not limited to, the following steps S901 to S905.
Step S901, constructing a cross entropy loss function based on word-tone classification probability data and a first training tag set;
step S902, inputting a training semantic feature set into a first full-connection layer, obtaining a word sound hidden variable, and obtaining a word sound weight value from a second full-connection layer;
step S903, obtaining a classified included angle parameter according to the hidden word sound variable and the weight of the word sound;
step S904, optimizing the cross entropy loss function based on the Chinese character sound set and the classification included angle parameter to obtain a classification loss function;
in step S905, the original classifier is updated based on the classification loss function.
In step S901 of some embodiments of the present application, a cross-entropy loss function is constructed based on the word-sound classification probability data and the first training label set. It should be noted that the cross-entropy loss (Cross Entropy) is the most commonly used loss function in classification tasks; cross entropy measures the difference between two probability distributions, and is used here to measure the gap between the distribution learned by the original classifier and the true distribution. In some exemplary embodiments of the present application, the cross-entropy loss function is constructed as follows. If the training semantic feature set is denoted T0 = {t_1, t_2, ..., t_M}, in which the polyphones occupy the position sequence {g, q, ..., m} with 1 ≤ g, q, ..., m ≤ M, then the sample polyphone semantic features can be expressed as T1 = {t_g, t_q, ..., t_m}. Recognizing the sample polyphone semantic features with the original classifier yields the first training recognition data, and comparing that data with the sample word-sound labels yields the word-sound classification probability data P1 = {p_g, p_q, ..., p_m}. If the first training label set is denoted Z0 = {z_1, z_2, ..., z_M}, in which the labels corresponding one-to-one to the sample polyphone semantic features are denoted Z1 = {z_g, z_q, ..., z_m}, then the corresponding cross-entropy loss function is:

L_CE = -Σ_{k ∈ {g, q, ..., m}} z_k · log(p_k)
it should be appreciated that the training semantic feature set, the sample polyphone semantic features, and the first training label set may be represented in ways not limited to the specific embodiments set forth above.
In some embodiments of the present application, in steps S902 to S904, the training semantic feature set is input into the first full-connection layer to obtain the word-sound hidden variable, the word-sound weight is obtained from the second full-connection layer, the classification included angle parameter is then obtained from the word-sound hidden variable and the word-sound weight, and further, the cross entropy loss function is optimized based on the Chinese character sound set and the classification included angle parameter to obtain the classification loss function. In some exemplary embodiments of the present application, in order to further improve classification accuracy, the word-sound hidden variable obtained after the training semantic feature set is processed by the first full-connection layer is, on the basis of the cross entropy loss function, further calculated against the word-sound weight of the second full-connection layer, where the value range of the classification included angle parameter is related to the total number of word-sound categories in the first training label set. Specifically, if the total number of word-sound categories in the first training label set is M, the classification included angle parameter between the word-sound hidden variable f_k and the word-sound weight W_k of the second full-connection layer is θ_x, and the classification loss function obtained by improving the cross entropy loss function is combined with the adjustable model parameters s and r in the original classifier.
in step S905, the original classifier is updated based on the classification loss function.
Through some embodiments shown in the above steps S901 to S905, with the word-sound hidden variable f_k, the word-sound weight W_k of the second full-connection layer, the classification included angle parameter θ_x between them, and the M word-sound categories of the first training label set uniformly distributed in the value space of θ_x, the improved classification loss function can converge the original classifier better, so that the word-sound classification probability data can be accurately matched to the first training labels. The adjustable model parameters s and r in the original classifier are confined to a reasonable interval, and a word-sound classifier with higher classification precision can be obtained after the original classifier is updated for multiple rounds.
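The multi-round update described above can be sketched as a simple train-until-threshold loop. The predict/update interface and the toy memorizing classifier are purely hypothetical stand-ins; in the patent, the update is driven by the classification loss function rather than direct memorization.

```python
class ToyClassifier:
    """Stand-in for the original classifier: a lookup table with a
    predict/update interface. The interface is hypothetical."""
    def __init__(self):
        self.table = {}

    def predict(self, feature):
        return self.table.get(feature, 0)

    def update(self, features, labels):
        # One "round" of updating, done here by memorization; in the
        # patent this step is driven by the classification loss function.
        for f, z in zip(features, labels):
            self.table[f] = z

def train_word_tone_classifier(classifier, features, labels, threshold, max_rounds=100):
    """Iterate steps S901-S905: score the classifier against the first
    training labels and keep updating until accuracy reaches the threshold."""
    for _ in range(max_rounds):
        preds = [classifier.predict(f) for f in features]
        accuracy = sum(p == z for p, z in zip(preds, labels)) / len(labels)
        if accuracy >= threshold:
            break
        classifier.update(features, labels)
    return classifier
```

The loop terminates either when the accuracy threshold is reached or after a capped number of rounds, mirroring the iterative-training-until-threshold condition of claim 5.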
In step S106 of some embodiments of the present application, the target polyphone semantic features are parsed based on a pre-trained word-tone classifier to obtain a second word-tone set. It should be emphasized that the word tone of a polyphone is closely related to its semantics in the target text, and the word tone plays an important role in distinguishing part of speech and meaning. Thus, in some exemplary embodiments of the present application, the target polyphone semantic features are parsed based on the pre-trained word-tone classifier to obtain the second word-tone set. It should be noted that the word-tone classifier refers to a natural language model that determines word tones from text semantic features; this model is trained in advance and has the capability of determining word tones through text semantic feature recognition. It should be appreciated that if the target polyphone semantic features are H_2 = {h_i, h_j, ..., h_n}, then after the word-tone classifier correspondingly classifies the semantic feature of each Chinese character in H_2, a second word-tone set X_2 = {x_i, x_j, ..., x_n} expressing the word tone of each Chinese character of the second conversion corpus can be obtained. It should be appreciated that the second word-tone set may be represented in a wide variety of ways, including but not limited to the specific embodiments set forth above.
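Step S106 can be sketched as selecting H_2 from the semantic feature sequence by position and classifying each element. Representing semantic features as strings and the classifier as a dictionary lookup is a toy assumption for illustration.

```python
def second_word_tone_set(semantic_sequence, polyphone_positions, word_tone_classifier):
    """Sketch of step S106: pick the target polyphone semantic features
    H_2 = {h_i, h_j, ..., h_n} out of the semantic feature sequence by
    position, then classify each to obtain the second word-tone set
    X_2 = {x_i, x_j, ..., x_n}. `word_tone_classifier` is any callable
    mapping one semantic feature to a word tone (hypothetical interface)."""
    h2 = [semantic_sequence[i] for i in polyphone_positions]
    return [word_tone_classifier(h) for h in h2]

# Toy stand-in: semantic features are strings, the classifier a lookup.
features = ["feat_zhong", "feat_guo", "feat_le"]
toy_classifier = {"feat_zhong": "zhong4", "feat_le": "le5"}.get
x2 = second_word_tone_set(features, [0, 2], toy_classifier)
```

Because the selection is by position, the one-to-one correspondence between characters in the target character sequence and elements of the semantic feature sequence (claim 3) is what guarantees each polyphone is paired with its own semantic feature.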
In step S107 of some embodiments of the present application, a target word-tone set corresponding to the target text is obtained from the first word-tone set and the second word-tone set. It should be emphasized that the word-tone conversion method converts the target text into a corresponding target word-tone set, the target text comprises both polyphone words and single-tone words, and the difficulty of word-tone conversion differs between these two types of text. Thus, in some exemplary embodiments of the present application, the two types of text are divided and converted separately. For single-tone words, the embodiments of the present application rely on the unique mapping between a single-tone word and its word tone that can be found in the Chinese phonetic specification (such as various dictionaries and lexicons), so word-tone conversion processing is performed on the single-tone words in the first conversion corpus to obtain the first word-tone set. For polyphones, since the word tone of a polyphone is closely related to its semantics in the target text, semantic feature extraction is performed on the target text based on a pre-trained semantic recognition model to obtain a semantic feature sequence; the target polyphone semantic features corresponding to the second conversion corpus are then extracted from the semantic feature sequence based on the polyphone position information; and further, the target polyphone semantic features are parsed based on a pre-trained word-tone classifier to obtain the second word-tone set. Finally, the target word-tone set corresponding to the target text is obtained from the first word-tone set and the second word-tone set, which improves the accuracy of Chinese character word-tone conversion.
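The divide-convert-merge flow above can be sketched end to end. All identifiers, the dictionary, and the toy classifier are illustrative assumptions, not the patent's own implementation.

```python
def chinese_word_tone_conversion(chars, polyphone_positions, monophone_dict,
                                 word_tone_classifier, semantic_features):
    """End-to-end sketch of the division described above: single-tone words
    (the first conversion corpus) go through the unique dictionary mapping,
    polyphones (the second conversion corpus) through the word-tone
    classifier, and the two partial results are merged back in text order."""
    poly = set(polyphone_positions)
    target = []
    for i, ch in enumerate(chars):
        if i in poly:
            # Second conversion corpus: classify by semantic feature.
            target.append(word_tone_classifier(semantic_features[i]))
        else:
            # First conversion corpus: unique dictionary mapping.
            target.append(monophone_dict[ch])
    return target

tones = chinese_word_tone_conversion(
    chars=["我", "们", "行"],
    polyphone_positions=[2],                      # "行" treated as the polyphone
    monophone_dict={"我": "wo3", "们": "men5"},   # unique mappings (toy)
    word_tone_classifier=lambda f: "xing2",       # toy classifier
    semantic_features=["f0", "f1", "f2"],
)
```

Merging by original text position keeps the target word-tone set aligned with the target text, which is what downstream speech synthesis needs.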
Fig. 10 shows an electronic device 1000 provided in an embodiment of the present application. The electronic device 1000 includes: the processor 1001, the memory 1002, and a computer program stored in the memory 1002 and executable on the processor 1001, the computer program being operative to perform the chinese character pronunciation conversion method as described above.
The processor 1001 and the memory 1002 may be connected by a bus or other means.
The memory 1002 is used as a non-transitory computer-readable storage medium for storing non-transitory software programs and non-transitory computer-executable programs, such as those implementing the Chinese character pronunciation conversion method described in the embodiments of the present application. The processor 1001 implements the above-described Chinese character pronunciation conversion method by running the non-transitory software programs and instructions stored in the memory 1002.
The memory 1002 may include a storage program area and a storage data area; the storage program area may store an operating system and at least one application program required for functions, and the storage data area may store data generated when the Chinese character pronunciation conversion method is executed. In addition, the memory 1002 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state memory device. In some implementations, the memory 1002 optionally includes memory remotely located relative to the processor 1001, and this remote memory can be connected to the electronic device 1000 over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the above-described Chinese character pronunciation conversion method are stored in the memory 1002; when executed by the one or more processors 1001, they perform the above-described Chinese character pronunciation conversion method, for example, method steps S101 to S107 in fig. 1, method steps S201 to S203 in fig. 2, method steps S301 to S302 in fig. 3, method steps S401 to S405 in fig. 4, method steps S501 to S503 in fig. 5, method steps S601 to S603 in fig. 6, method steps S701 to S702 in fig. 7, method steps S801 to S805 in fig. 8, and method steps S901 to S905 in fig. 9.
The embodiment of the application also provides a computer readable storage medium which stores computer executable instructions for executing the Chinese character pronunciation conversion method.
In an embodiment, the computer-readable storage medium stores computer-executable instructions that are executed by one or more control processors, for example, to perform method steps S101 through S107 in fig. 1, method steps S201 through S203 in fig. 2, method steps S301 through S302 in fig. 3, method steps S401 through S405 in fig. 4, method steps S501 through S503 in fig. 5, method steps S601 through S603 in fig. 6, method steps S701 through S702 in fig. 7, method steps S801 through S805 in fig. 8, and method steps S901 through S905 in fig. 9.
The above-described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically include computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. It should also be appreciated that the various embodiments provided in the embodiments of the present application may be arbitrarily combined to achieve different technical effects.
While the preferred embodiments of the present application have been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit and scope of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.
Claims (10)
1. A Chinese character pronunciation conversion method is characterized by comprising the following steps:
acquiring a target text and polyphone position information in the target text;
dividing the target text into a first conversion corpus and a second conversion corpus based on the polyphone position information, wherein the first conversion corpus comprises single-tone words in the target text, and the second conversion corpus comprises polyphones in the target text;
performing word-to-sound conversion processing on the single-tone words in the first conversion corpus to obtain a first word-to-sound set;
extracting semantic features of the target text based on a pre-trained semantic recognition model to obtain a semantic feature sequence;
extracting target polyphone semantic features corresponding to the second conversion corpus from the semantic feature sequence based on the polyphone location information;
analyzing the semantic features of the target polyphones based on a pre-trained word-sound classifier to obtain a second word-sound set;
and obtaining a target word sound set corresponding to the target text according to the first word sound set and the second word sound set.
2. The method of claim 1, wherein the obtaining the target text and the polyphone location information in the target text comprises:
acquiring the target text, and performing word segmentation on the target text to obtain a plurality of target text characters;
arranging a plurality of target text characters according to the line text sequence of the target text to obtain a target character sequence;
the polyphone location information is determined from the target character sequence based on a chinese word pronunciation specification.
3. The method according to claim 2, wherein the semantic feature sequence includes a plurality of semantic feature elements, the target text character in the target character sequence corresponds to the semantic feature elements in the semantic feature sequence one-to-one in an arrangement position relationship, and the extracting, based on the polyphone position information, the target polyphone semantic feature corresponding to the second conversion corpus from the semantic feature sequence includes:
determining a first sequence position of a polyphone in the target character sequence based on the polyphone position information;
determining the semantic feature elements corresponding to the polyphones in the semantic feature sequence based on the first sequence position;
and determining the semantic feature elements corresponding to the polyphones in the semantic feature sequence as the target polyphone semantic features.
4. A method according to any one of claims 1 to 3, wherein before the target polyphone semantic features are parsed based on the pre-trained word-tone classifier to obtain the second word-tone set, the method further comprises pre-training the word-tone classifier, specifically comprising:
acquiring a training semantic feature set and a first training label set, wherein the training semantic feature set comprises a plurality of sample polyphone semantic features, and the first training label set comprises sample word sound labels which are in one-to-one correspondence with the sample polyphone semantic features;
training a preset original classifier based on the sample polyphone semantic features and the sample word sound labels to obtain the word sound classifier.
5. The method of claim 4, wherein training the pre-set original classifier based on the sample polyphonic semantic features and the sample phonetic labels to obtain the phonetic classifier comprises:
identifying the sample polyphone semantic features through the original classifier to obtain first training identification data;
comparing the first training identification data with the sample word sound label to obtain word sound classification probability data;
if the word-tone classification probability data is lower than a preset first accuracy threshold, updating the original classifier based on the word-tone classification probability data, the training semantic feature set and the first training label set;
performing first iterative training on the updated original classifier based on the sample polyphone semantic features and the sample word sound labels;
and after the first iterative training, when the word-tone classification probability data is greater than or equal to the first accuracy threshold value, obtaining the word-tone classifier.
6. The method of claim 5, wherein the original classifier comprises a first fully connected layer and a second fully connected layer connected in sequence, the updating the original classifier based on the word-tone classification probability data, the training semantic feature set, and the first training label set comprising:
constructing a cross entropy loss function based on the word-tone classification probability data and the first training tag set;
inputting the training semantic feature set into the first full-connection layer to obtain word-sound hidden variables, and obtaining word-sound weight values from the second full-connection layer;
obtaining a classification included angle parameter according to the word sound hidden variable and the word sound weight;
optimizing the cross entropy loss function based on the Chinese character sound set and the classification included angle parameter to obtain a classification loss function;
updating the original classifier based on the classification loss function.
7. The method of claim 4, wherein the obtaining the training semantic feature set and the first training tag set comprises:
acquiring the training semantic feature set;
and carrying out label inquiry on a plurality of sample polyphone semantic features in the training semantic feature set based on a preset Chinese word sound set to obtain the sample word sound labels corresponding to each sample polyphone semantic feature.
8. The method according to claim 1, wherein before extracting semantic features from the target text based on a pre-trained semantic recognition model to obtain a semantic feature sequence, the method further comprises pre-training the semantic recognition model, specifically comprising:
acquiring a training text set and a second training label set, wherein the training text set comprises a plurality of training texts, and the second training label set comprises sample semantic labels which are in one-to-one correspondence with the training texts;
identifying the training text through a preset original recognition model to obtain second training identification data;
comparing the second training identification data with the sample semantic tags to obtain identification accuracy;
updating the original recognition model based on the recognition accuracy when the recognition accuracy is lower than a preset second accuracy threshold;
performing second iterative training on the updated original recognition model based on the training text and the sample semantic labels;
and after the second iterative training, when the recognition accuracy is greater than or equal to the second accuracy threshold, obtaining the semantic recognition model.
9. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the Chinese character pronunciation conversion method as claimed in any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the Chinese character pronunciation conversion method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310570517.3A CN116484806A (en) | 2023-05-19 | 2023-05-19 | Chinese character pronunciation conversion method, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116484806A true CN116484806A (en) | 2023-07-25 |
Family
ID=87225195
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310570517.3A Pending CN116484806A (en) | 2023-05-19 | 2023-05-19 | Chinese character pronunciation conversion method, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116484806A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118136014A (en) * | 2024-04-07 | 2024-06-04 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and computer readable storage medium |
CN118136014B (en) * | 2024-04-07 | 2025-07-01 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |