CN112347789A

CN112347789A - Punctuation prediction method, device, equipment and storage medium

Info

Publication number: CN112347789A
Application number: CN202011230897.9A
Authority: CN
Inventors: 李小喜; 李亚; 张为泰; 刘俊华
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2021-02-09
Anticipated expiration: 2040-11-06
Also published as: CN112347789B

Abstract

The application provides a punctuation prediction method, apparatus, device and storage medium. The punctuation prediction method comprises the following steps: acquiring a text to be predicted, wherein the text to be predicted is the current recognition result of a current voice segment; acquiring historical prediction information according to whether the text to be predicted is the first intermediate recognition result of the current voice segment, wherein the historical prediction information is intermediate information which is generated in the process of punctuation prediction of a historical recognition result and is used for determining the punctuation prediction result; and predicting punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted. The punctuation prediction method has high prediction accuracy and prediction efficiency, and these advantages make it suitable for machine simultaneous interpretation scenarios.

Description

Punctuation prediction method, device, equipment and storage medium

Technical Field

The present application relates to the field of natural language processing technologies, and in particular, to a punctuation prediction method, apparatus, device, and storage medium.

Background

In recent years, with the application of deep learning in the fields of speech, natural language processing and the like, the accuracy of speech recognition is increasing, and the translation effect of machine translation is also improving, wherein the machine translation basically reaches the level of manual translation. Meanwhile, the progress of automatic speech recognition and machine translation also promotes the development of simultaneous machine interpretation.

Standard automatic speech recognition systems usually convert audio into text without any punctuation marks. Such text is poorly readable and hampers the processing of subsequent tasks (e.g. machine simultaneous interpretation). This problem can be solved by inserting appropriate punctuation marks into the recognized text.

It can be understood that, to insert proper punctuation into a recognized text, the punctuation information of each word in the recognized text must first be predicted (whether a punctuation mark needs to be inserted after the word and, if so, which punctuation mark). How to predict the punctuation information of each word in the recognized text is a problem that urgently needs to be solved.

Disclosure of Invention

In view of the above, the present application provides a punctuation prediction method, apparatus, device and storage medium, which are used to predict punctuation information of each word in a speech recognition text, and the technical solution is as follows:

a punctuation prediction method comprising:

acquiring a text to be predicted, wherein the text to be predicted is a current recognition result of a current voice segment, and the recognition result of one voice segment comprises a plurality of intermediate recognition results and a final recognition result;

acquiring historical prediction information according to whether the text to be predicted is the first recognition result of the current voice segment, wherein the historical prediction information is intermediate information which is generated in the process of punctuation prediction of the historical recognition result and is used for determining the punctuation prediction result;

and predicting punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted.

Optionally, the punctuation prediction method further includes:

determining an updating type corresponding to the text to be predicted according to the updating condition of the text to be predicted relative to the previous recognition result;

if the updating type corresponding to the text to be predicted is modified, acquiring historical prediction information according to whether the text to be predicted is the first recognition result of the current voice segment;

and if the updating type corresponding to the text to be predicted is increased, counting the number of words of the text to be predicted, which are increased compared with the previous recognition result, and if the number of the increased words is greater than a first preset number, executing the step of acquiring the historical prediction information according to whether the text to be predicted is the first recognition result of the current voice segment.

Optionally, the obtaining historical prediction information based on whether the text to be predicted is the first recognition result of the current speech segment includes:

if the text to be predicted is the first recognition result of the current voice segment, punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment is obtained and used as historical prediction information;

if the text to be predicted is the non-first recognition result of the current voice segment, punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment and punctuation prediction information corresponding to the previous recognition result of the current voice segment are obtained and used as historical prediction information;

the punctuation prediction information is intermediate information which is generated in the punctuation prediction process of the corresponding recognition result and is used for determining the punctuation prediction result.

Optionally, the predicting punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted includes:

predicting punctuation information after words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted;

the punctuation prediction model is obtained by training on a training text with punctuation, the training text is formed by splicing texts representing recognition results of a plurality of voice segments, and when the punctuation prediction model is trained by using the training text, for each word in the training text, the punctuation prediction model predicts punctuation information after the word according to the words before the word and a second preset number of words after the word.
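The restricted context described above (all words before a word plus a fixed number of words after it) can be pictured as a visibility mask. The following is only a minimal sketch, assuming an attention-style model; the function name `lookahead_mask` and the parameter `k` (standing in for the second preset number) are illustrative and are not taken from the application.

```python
def lookahead_mask(num_words, k):
    """Return mask[i][j] == True when word j may inform the prediction after word i.

    Assumption for illustration: word i sees every earlier word, itself,
    and at most k following words (k plays the role of the second preset number).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return [[j <= i + k for j in range(num_words)] for i in range(num_words)]


# Visualise the mask for a 5-word text with a look-ahead of 2 words.
for row in lookahead_mask(5, k=2):
    print("".join("x" if visible else "." for visible in row))
```

With such a mask, the prediction after a word never depends on words more than `k` positions ahead, which is what allows earlier predictions to be reused when new words arrive.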

Optionally, the predicting punctuation information of words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted includes:

removing punctuation prediction information of a last second preset number of words in the historical recognition result from the historical prediction information, and taking the information obtained after removal as prediction reference information;

splicing the last second preset number of words in the historical recognition result with the part of the text to be predicted, which does not participate in punctuation prediction, and taking the spliced text as an input text;

and inputting the prediction reference information and the input text into the punctuation prediction model for punctuation prediction to obtain punctuation information of words in the text to be predicted.

Optionally, removing punctuation prediction information of a last second preset number of words in the history recognition result from the history prediction information includes:

if the text to be predicted is the first recognition result of the current voice segment, removing punctuation prediction information of the last second preset number of words in the final recognition result of the forward adjacent voice segment of the current voice segment from the historical prediction information;

and if the text to be predicted is the non-first recognition result of the current voice segment, removing punctuation prediction information of the last second preset number of words in the previous recognition result of the current voice segment from the historical prediction information.

Optionally, the splicing the last second preset number of words in the historical recognition result with the part of the text to be predicted, which does not participate in punctuation prediction, includes:

if the text to be predicted is the first recognition result of the current voice segment, splicing the last second preset number of words in the historical recognition result with the whole text to be predicted;

and if the text to be predicted is the non-first recognition result of the current voice segment, splicing the last second preset number of words in the historical recognition result with the part of the text to be predicted, which is increased compared with the previous intermediate recognition result of the current voice segment.
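The removal and splicing steps above can be sketched as follows. This is a hedged illustration rather than the implementation of the application: the helper `build_model_inputs`, the per-word list `history_info` and the parameter `k` (the second preset number) are assumed names, and the prediction information is represented by opaque placeholder values.

```python
def build_model_inputs(history_words, history_info, new_words, k):
    """Split the history into reusable information and words to re-predict.

    history_words: words of the historical recognition result.
    history_info:  one piece of cached prediction information per history word
                   (assumed layout, for illustration only).
    new_words:     the part of the text to be predicted that has not yet
                   taken part in punctuation prediction.
    k:             the second preset number (look-ahead size).
    """
    if len(history_words) != len(history_info):
        raise ValueError("history_words and history_info must be aligned")
    keep = max(len(history_words) - k, 0)
    # The last k history words lacked full look-ahead, so their cached
    # information is dropped and they are fed to the model again.
    reference_info = list(history_info[:keep])
    input_words = list(history_words[keep:]) + list(new_words)
    return reference_info, input_words


# First recognition result of a segment: the whole text is new.
ref, text = build_model_inputs(
    ["good", "morning", "everyone"], ["s0", "s1", "s2"], ["in", "this"], k=2
)
print(ref, text)
```

For a non-first recognition result, `history_words` would be the previous recognition result of the current voice segment and `new_words` would be only the words added since that result.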

Optionally, the inputting the prediction reference information and the input text into the punctuation prediction model for punctuation prediction to obtain punctuation information of words in the text to be predicted includes:

determining a representation vector of each word in the input text by using the punctuation prediction model;

determining a target vector corresponding to each word in the input text by using the punctuation prediction model, the representation vector of each word in the input text and the prediction reference information, wherein the target vector corresponding to a word in the input text represents the degree of correlation between that word and both the words before it and the second preset number of words after it in the input text;

and determining punctuation information of words in the text to be predicted by using the punctuation prediction model, the representation vector of each word in the input text and the target vector corresponding to each word in the input text.

Optionally, the determining punctuation information of the words in the text to be predicted by using the punctuation prediction model, the representation vector of each word in the input text, and the target vector corresponding to each word in the input text includes:

for each word in the input text, predicting, by using the punctuation prediction model, the representation vector of the word and the target vector corresponding to the word, punctuation information after each of the second preset number of words preceding the word;

for each word in the input text, determining punctuation information after the word by using the punctuation prediction model and the punctuation information predicted for the word based on the second preset number of words following the word;

and acquiring punctuation information of words in the text to be predicted from punctuation information of all words in the input text.

A punctuation prediction apparatus comprising: a to-be-predicted text acquisition module, a historical prediction information acquisition module and a punctuation prediction module;

the to-be-predicted text acquisition module is used for obtaining a text to be predicted, wherein the text to be predicted is a current recognition result of a current voice segment, and a recognition result of a voice segment comprises a plurality of intermediate recognition results and a final recognition result;

the historical prediction information acquisition module is used for acquiring historical prediction information according to whether the text to be predicted is the first recognition result of the current voice segment, wherein the historical prediction information is intermediate information which is generated in the process of punctuation prediction of the historical recognition result and is used for determining the punctuation prediction result;

and the punctuation prediction module is used for predicting punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted.

Optionally, the punctuation prediction device further comprises: an update type determining module and a quantity counting module;

the updating type determining module is used for determining the updating type corresponding to the text to be predicted according to the updating condition of the text to be predicted compared with the previous recognition result;

the historical prediction information acquisition module is specifically used for acquiring historical prediction information according to whether the text to be predicted is the first recognition result of the current voice segment when the updating type corresponding to the text to be predicted is modification;

the number counting module is used for counting the number of words of the text to be predicted, which are increased compared with a previous recognition result, when the updating type corresponding to the text to be predicted is increased;

the historical prediction information obtaining module is specifically configured to, when the number of the added words is greater than a first preset number, obtain historical prediction information based on whether the text to be predicted is the first recognition result of the current speech segment.

Optionally, the punctuation prediction module is specifically configured to predict punctuation information of words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted.

A punctuation prediction device comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of any one of the punctuation prediction methods described above.

A readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of any of the punctuation prediction methods described above.

According to the above scheme, after the text to be predicted is obtained, historical prediction information is obtained according to whether the text to be predicted is the first recognition result of the current voice segment, and punctuation information of words in the text to be predicted is predicted according to the historical prediction information and the text to be predicted. First, because the prediction combines the historical prediction information with the text to be predicted, more semantic information is available and a more accurate punctuation prediction result can be obtained. Second, the historical prediction information is the intermediate information which is generated in the punctuation prediction process of the historical recognition result and is used for determining the punctuation prediction result, rather than the historical recognition result itself; compared with predicting directly from the historical recognition result, this greatly reduces the amount of computation and therefore improves the punctuation prediction efficiency. Third, the historical prediction information is used whether the text to be predicted is an intermediate recognition result or the final recognition result of a voice segment, so a more accurate prediction result can be obtained in both cases. Owing to these advantages, the punctuation prediction method provided by the application is suitable for machine simultaneous interpretation scenarios.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flow chart of a punctuation prediction method according to an embodiment of the present application;

Fig. 2 is another schematic flow chart of a punctuation prediction method provided in an embodiment of the present application;

Fig. 3 is a schematic flow chart of predicting punctuation information of words in a text to be predicted by using a pre-established punctuation prediction model based on historical prediction information and the text to be predicted according to an embodiment of the present application;

Fig. 4 is a schematic flow chart of inputting the prediction reference information and the input text into the punctuation prediction model for punctuation prediction according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a punctuation prediction apparatus according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a punctuation prediction device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to realize punctuation prediction, especially punctuation prediction in a machine simultaneous interpretation scenario, the inventors of the present application conducted research. The initial idea was to perform punctuation prediction by adopting a punctuation prediction scheme based on a statistical language model.

The purpose of a statistical language model is to obtain the probability with which a given text sequence appears in the corresponding corpus. In the basic framework of a statistical language model, the probability of a text sequence $w_1, w_2, \ldots, w_T$ can be expressed as:

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$

According to the Markov assumption, the probability of any word appearing is related only to the N words that appear before it, that is, $P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-N}, \ldots, w_{t-1})$, where N is an integer greater than 1. The smaller N is, the more often each word combination occurs in the training set, so the more reliable the statistics become. The larger N is, the more information can be used for predicting the next word and the higher the accuracy is, but at the same time the more parameters there are and the longer the calculation takes. Generally, the larger the value of N is, the better the performance of the model is, but according to practical experience, the value of N is generally set to an integer not more than 4, which is a result obtained through extensive practice and is the value that best balances efficiency and accuracy. However, it is difficult for the N-gram language model to use history information, because it can attend only to the N words preceding the current word. Most language models use N = 3, that is, the model can attend only to the preceding 3 words, which is far from sufficient for punctuation prediction. In other words, the prediction accuracy of the punctuation prediction scheme based on the statistical language model is low.
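The limitation described above can be seen in a toy count-based model. The following sketch is illustrative only: the corpus, the function names `train_ngram` and `next_word_probability`, and the choice of treating punctuation marks as ordinary tokens are assumptions, and the estimate is a plain relative frequency without smoothing.

```python
from collections import Counter


def train_ngram(tokens, n_prev):
    """Count each run of n_prev context words followed by one next word."""
    grams = Counter()
    contexts = Counter()
    for i in range(len(tokens) - n_prev):
        context = tuple(tokens[i:i + n_prev])
        grams[context + (tokens[i + n_prev],)] += 1
        contexts[context] += 1
    return grams, contexts


def next_word_probability(grams, contexts, context, word):
    """Relative-frequency estimate of P(word | the n_prev preceding words)."""
    context = tuple(context)
    if contexts[context] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return grams[context + (word,)] / contexts[context]


# Toy corpus in which punctuation marks are treated as tokens.
corpus = "we meet in spring , we meet in beijing .".split()
grams, contexts = train_ngram(corpus, n_prev=2)
print(next_word_probability(grams, contexts, ("meet", "in"), "spring"))
print(next_word_probability(grams, contexts, ("in", "spring"), ","))
```

Whatever precedes the last `n_prev` words has no influence on the estimate, which is the reason given above for the low accuracy of this scheme in punctuation prediction.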

In view of the above problems of the punctuation prediction scheme based on the statistical language model, the inventors of the present application studied further and arrived at a punctuation prediction scheme based on a bidirectional long short-term memory network. The general idea of this scheme is as follows: when punctuation prediction is performed on the current recognition result of the current voice segment, the final recognition results of the previous m voice segments of the current voice segment are spliced with the current recognition result of the current voice segment, and the spliced text is input into the bidirectional long short-term memory network for punctuation prediction.

However, for a speech segment, n intermediate recognition results are usually obtained before the final recognition result is obtained. If the recognition results of the previous m speech segments are spliced each time, they are repeatedly computed n times, so the punctuation prediction efficiency is very low. Moreover, because machine simultaneous interpretation must keep latency low, the translation system can translate only by using the intermediate recognition results, so machine simultaneous interpretation places high requirements on the punctuation prediction accuracy of the intermediate recognition results.

In view of the above problems of the punctuation prediction scheme based on the bidirectional long short-term memory network, the inventors of the present application conducted further in-depth research and finally arrived at a punctuation prediction method that has a good prediction effect and can be applied to machine simultaneous interpretation scenarios. The punctuation prediction method can be applied to a terminal having data processing capability, and can also be applied to a single server or to a server cluster composed of a plurality of servers. The punctuation prediction scheme provided by the present application is introduced through the following embodiments.

First embodiment

Referring to fig. 1, a schematic flow chart of a punctuation prediction method provided in an embodiment of the present application is shown, where the method may include:

Step S101: acquire the text to be predicted.

The text to be predicted is the current recognition result of the current voice segment.

It should be noted that, in the process of recognizing a speech segment to obtain a final recognition result, intermediate recognition results are usually obtained first. That is, the recognition result of a speech segment includes intermediate recognition results and a final recognition result, and each recognition result contains at least some added content compared with the previous recognition result.

Illustratively, the current speech segment is the second speech segment VAD2, and the following recognition results are generated in the process of recognizing VAD2:

Recognition result 1: in this

Recognition result 2: in this spring

Recognition result 3: in this spring fly

Recognition result 4: in this fine season of fragrant spring

Recognition result 5: in this fine season of fragrant spring we gather

Recognition result 6: in this fine season of fragrant spring we gather at Tiananmen in Beijing

The recognition results 1 to 5 are intermediate recognition results of the voice segment VAD2, and the recognition result 6 is the final recognition result of the voice segment VAD2.

Step S102: acquire historical prediction information according to whether the text to be predicted is the first recognition result of the current voice segment.

The historical prediction information is intermediate information which is generated in the process of punctuation prediction of the historical recognition result and is used for determining the punctuation prediction result. Note that the historical recognition result is a recognition result before the current recognition result.

Specifically, the process of obtaining the historical prediction information based on whether the text to be predicted is the first recognition result of the current speech segment may include:

Step S102a: if the text to be predicted is the first recognition result of the current voice segment, the punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment is obtained as the historical prediction information.

Assuming that the current speech segment is VAD2 in the above example and the text to be predicted is "in this", the text to be predicted is the first recognition result of the current speech segment.

It should be noted that the first recognition result of a speech segment usually has an identifier indicating that it is the first recognition result, and if the obtained recognition result has the identifier indicating the first recognition result, it can be determined that the recognition result is the first recognition result of a speech segment.

In this embodiment, the punctuation prediction information corresponding to the final recognition result of the speech segment before the current speech segment is intermediate information generated in the process of punctuation prediction on the final recognition result of the speech segment before the current speech segment and used for determining the punctuation prediction result.

There are various ways to obtain the punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment as the historical prediction information. In one possible implementation, the punctuation prediction information corresponding to the final recognition results of all speech segments before the current speech segment is obtained as the historical prediction information. Considering that the semantics of the current speech segment are generally more strongly correlated with the one or more speech segments closest to it, in another possible implementation, the punctuation prediction information corresponding to the final recognition results of a preset number (e.g., 3) of speech segments before the current speech segment is obtained as the historical prediction information. For example, if the current speech segment is the 5th speech segment VAD5, the punctuation prediction information corresponding to the final recognition results of VAD2 to VAD4 can be obtained as the historical prediction information.

Step S102b: if the text to be predicted is not the first recognition result of the current voice segment, the punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment and the punctuation prediction information corresponding to the previous recognition result of the current voice segment are obtained as the historical prediction information.

The punctuation prediction information corresponding to the previous recognition result of the current voice segment is intermediate information which is generated in the process of punctuation prediction of the previous recognition result of the current voice segment and is used for determining the punctuation prediction result.

It should be noted that the previous recognition result of the current speech segment refers to the recognition result that is located before the current recognition result among the recognition results of the current speech segment. Assuming that the current speech segment is VAD2 in the above example and the text to be predicted is recognition result 5 ("in this fine season of fragrant spring we gather"), the punctuation prediction information corresponding to the final recognition result of VAD1 and the punctuation prediction information corresponding to recognition result 4 of VAD2 ("in this fine season of fragrant spring") are obtained as the historical prediction information.
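Steps S102a and S102b can be sketched as a small cache of prediction information. This is a minimal sketch under stated assumptions: the class `PredictionCache`, its fields and the default limit of 3 earlier segments are illustrative, and the cached prediction information is represented by opaque placeholder values.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class PredictionCache:
    """Holds prediction information produced during earlier punctuation predictions."""
    max_segments: int = 3                                 # how many earlier segments to reuse
    final_infos: List[Any] = field(default_factory=list)  # one entry per finished segment
    previous_info: Any = None                             # previous result of the current segment

    def get_history(self, is_first: bool) -> List[Any]:
        """Return the historical prediction information for the text to be predicted."""
        if self.max_segments > 0:
            history = list(self.final_infos[-self.max_segments:])
        else:
            history = []
        # Step S102b: a non-first result also reuses the previous result's information.
        if not is_first and self.previous_info is not None:
            history.append(self.previous_info)
        return history

    def store(self, info: Any, is_final: bool) -> None:
        """Record the information produced for the result that was just predicted."""
        if is_final:
            self.final_infos.append(info)
            self.previous_info = None
        else:
            self.previous_info = info


cache = PredictionCache(max_segments=3)
for name in ("VAD1", "VAD2", "VAD3", "VAD4"):
    cache.store(name + "-final", is_final=True)
print(cache.get_history(is_first=True))   # first result of VAD5
cache.store("VAD5-result1", is_final=False)
print(cache.get_history(is_first=False))  # a later result of VAD5
```

The first call mirrors the VAD5 example above, in which the information of VAD2 to VAD4 is reused.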

Step S103: predict punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted.

When punctuation prediction is performed on the text to be predicted, the historical prediction information is taken into account. Doing so provides more semantic information, and therefore a more accurate punctuation prediction result can be obtained.

According to the punctuation prediction method provided by this embodiment, after the text to be predicted is obtained, historical prediction information is obtained, and punctuation information of words in the text to be predicted is predicted according to the historical prediction information and the text to be predicted. First, because the prediction combines the historical prediction information with the text to be predicted, more semantic information is available and a more accurate punctuation prediction result can be obtained. Second, the historical prediction information in this embodiment is the intermediate information which is generated in the punctuation prediction process of the historical recognition result and is used for determining the punctuation prediction result, rather than the historical recognition result itself; compared with predicting directly from the historical recognition result, this greatly reduces the amount of computation and therefore improves the punctuation prediction efficiency. Third, the historical prediction information is used whether the text to be predicted is an intermediate recognition result or the final recognition result of a voice segment, so a more accurate prediction result can be obtained in both cases. Owing to these advantages, the punctuation prediction method provided by this embodiment is suitable for machine simultaneous interpretation scenarios.

Second embodiment

In order to improve the punctuation prediction efficiency, this embodiment provides another punctuation prediction method. Referring to Fig. 2, which shows a schematic flow chart of the punctuation prediction method, the method may include:

Step S201: acquire the text to be predicted.

The text to be predicted is a current recognition result of the current speech segment, which is a recognition result obtained in the process of recognizing the current speech segment, and may be an intermediate recognition result of the current speech segment or a final recognition result of the current speech segment.

Step S202: determine the update type corresponding to the text to be predicted according to how the text to be predicted is updated relative to the previous recognition result; if the update type corresponding to the text to be predicted is increase, execute step S203, and if the update type corresponding to the text to be predicted is modification, execute step S205.

The updating conditions of the text to be predicted relative to the previous recognition result include two types, one is only increasing, the other is modifying and increasing, if the text to be predicted only increases words relative to the previous recognition result, the updating type corresponding to the text to be predicted is determined to be increasing, at this moment, step S203 is executed, if the text to be predicted not only increases words but also modifies words relative to the previous recognition result, the updating type corresponding to the text to be predicted is determined to be modifying, at this moment, step S205 is executed.

Assuming that the current speech segment is VAD2 mentioned in the above embodiment, the text to be predicted is recognition result 3 ("in this spring day"), since recognition result 3 is increased by only one word "spring day" with respect to recognition result 1 ("in this"), it can be determined that the update type corresponding to the text to be predicted is increased; assuming that the current speech segment is VAD2 mentioned in the above embodiment, and the text to be predicted is recognition result 4 ("in the nice season of fragrant in this spring"), since recognition result 4 is not only added to the nice season "as compared with recognition result 3 (" in this spring fly "), but also" fly "is modified to be fragrant, and therefore, it is determined that the update type corresponding to the text to be predicted is modified.

In addition, it should be noted that, if the text to be predicted is the first recognition result of the current speech segment, it is determined that the update type corresponding to the text to be predicted is increased.
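The update-type decision of step S202 can be sketched as follows. This is a minimal illustration only: it assumes that recognition results are already segmented into word lists and that "only adding words" means the previous result is kept unchanged as a prefix of the current one. The function name `update_type` and the string labels are hypothetical and do not come from the application.

```python
from typing import List, Optional


def update_type(current: List[str], previous: Optional[List[str]]) -> str:
    """Classify how `current` differs from `previous` (hypothetical helper)."""
    # The first recognition result of a speech segment has no previous
    # result and is treated as "increased".
    if previous is None:
        return "increased"
    # Assumption: "only adds words" means the previous result survives
    # unchanged as a prefix and the new words are appended after it.
    if current[:len(previous)] == previous:
        return "increased"
    return "modified"
```

With this sketch, a result that rewrites an earlier word is classified as modified even if it also appends new words, which matches the two cases described above.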

Step S203: count the number of words that the text to be predicted adds compared with the previous recognition result.

Optionally, the number of words of the text to be predicted, which is increased compared with the previous recognition result, may be determined by taking the word as a unit and using a Levenshtein edit distance calculation algorithm.

Step S204: judge whether the number of added words is greater than or equal to a first preset number (for example, 2); if so, execute step S205; if the number of added words is less than the first preset number, acquire the next recognition result of the current speech segment as the text to be predicted.

Step S205: judge whether the text to be predicted is the first recognition result of the current voice segment; if so, execute step S206a; if not, execute step S206b.
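Steps S203 and S204 can be illustrated with a word-level Levenshtein distance. The sketch below is an assumption-laden example rather than the application's implementation: the names `word_edit_distance` and `should_predict` are hypothetical, and the shortcut of using the distance as the added-word count is valid only for the "increased" case, where every edit is an insertion.

```python
from typing import List, Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences (unit cost per edit)."""
    prev_row = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        row = [i]
        for j, word_b in enumerate(b, start=1):
            substitution = prev_row[j - 1] + (0 if word_a == word_b else 1)
            row.append(min(prev_row[j] + 1, row[j - 1] + 1, substitution))
        prev_row = row
    return prev_row[-1]


def should_predict(previous: List[str], current: List[str],
                   first_preset_number: int = 2) -> bool:
    """Gate of step S204 for the "increased" case (hypothetical helper)."""
    # When `previous` is a prefix of `current`, every edit is an insertion,
    # so the distance equals the number of added words.
    added = word_edit_distance(previous, current)
    return added >= first_preset_number
```

Skipping results that add fewer words than the threshold avoids running the model on every small update of the recognition result.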

Specifically, whether the text to be predicted has an indication mark of a first recognition result or not can be judged, if the text to be predicted has the indication mark of the first recognition result, the text to be predicted is judged to be the first recognition result of the current voice segment, and if the text to be predicted does not have the indication mark of the first recognition result, the text to be predicted is judged to be a non-first recognition result of the current voice segment.

Step S206a: acquire punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment as historical prediction information.

The punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment is intermediate information which is generated in the process of punctuation prediction of the final recognition result of the voice segment before the current voice segment and is used for determining the punctuation prediction result.

Step S206b: acquire punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment and punctuation prediction information corresponding to the previous recognition result of the current voice segment as historical prediction information.

For the specific implementation process and the related explanation of step S206a and step S206b, reference may be made to the specific implementation process and the related explanation of step S102a and step S102b in the first embodiment, which is not described herein again.
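The branch between steps S206a and S206b can be sketched as a small lookup over cached information. The `PredictionCache` class, its field names and `get_history` are hypothetical; the cached values stand in for whatever intermediate information the model actually keeps, which this section does not specify.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class PredictionCache:
    """Hypothetical store for cached punctuation prediction information."""
    # Information from the final result of the preceding speech segment.
    previous_segment_final: Optional[Any] = None
    # Information from the previous result of the current speech segment.
    current_segment_previous: Optional[Any] = None


def get_history(cache: PredictionCache, is_first_result: bool) -> List[Any]:
    """Select historical prediction information as in steps S206a / S206b."""
    history: List[Any] = []
    if cache.previous_segment_final is not None:
        history.append(cache.previous_segment_final)
    # S206b: a non-first result also reuses the previous result of this segment.
    if not is_first_result and cache.current_segment_previous is not None:
        history.append(cache.current_segment_previous)
    return history
```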

Step S207: predict punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted.

The punctuation prediction method provided by this embodiment has the following advantages: (1) when punctuation information is predicted for the text to be predicted, the historical prediction information is used in addition to the text to be predicted itself, so more semantic information is available and a more accurate punctuation prediction result can be obtained; (2) the historical prediction information in this embodiment is the intermediate information that is generated in the process of punctuation prediction of the historical recognition result and is used for determining the punctuation prediction result, rather than the historical recognition result itself, which greatly reduces the amount of computation compared with predicting directly from the historical recognition result and thereby improves punctuation prediction efficiency; (3) the historical prediction information is combined regardless of whether the text to be predicted is an intermediate recognition result or the final recognition result of a voice segment, so a more accurate prediction result can be obtained in both cases; (4) when the update type corresponding to the text to be predicted is increased, punctuation prediction is performed on the text to be predicted only when the number of added words is greater than or equal to the first preset number, which improves the efficiency of punctuation prediction for the recognition results of a voice segment.

Third embodiment

The present embodiment describes a specific implementation process of "predicting punctuation information of words in a text to be predicted according to historical prediction information and the text to be predicted" in the above embodiments.

The process of predicting punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted may include: predicting punctuation information of words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted.

The punctuation prediction model is obtained by training with punctuated training text, where the training text is formed by splicing texts representing recognition results of a plurality of voice segments. During training, for each word in the training text, the punctuation prediction model predicts the punctuation information after the word according to the words before the word and a second preset number of words after the word.

Next, a process of predicting punctuation information of words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted is introduced.

Referring to fig. 3, which shows a schematic flow chart of predicting punctuation information of words in a text to be predicted by using a pre-established punctuation prediction model based on historical prediction information and the text to be predicted, the process may include:

Step S301: remove punctuation prediction information of the last second preset number of words in the historical recognition result from the historical prediction information, and take the information obtained after removal as prediction reference information.

Specifically, if the text to be predicted is the first recognition result of the current voice segment, the punctuation prediction information of the last second preset number of words in the final recognition result of the forward adjacent voice segment of the current voice segment is removed from the historical prediction information; and if the text to be predicted is the non-first recognition result of the current voice segment, removing punctuation prediction information of the last second preset number of words in the previous recognition result of the current voice segment from the historical prediction information.

It should be noted that, the forward adjacent speech segment of the current speech segment refers to a speech segment located before and adjacent to the current speech segment, and the previous intermediate recognition result of the current speech segment is preferably a recognition result located before and adjacent to the text to be predicted in the recognition result of the current speech segment.

Illustratively, the second preset number is 4, and the current voice segment is the 2nd voice segment VAD2. If the text to be predicted is the first recognition result of VAD2, the historical prediction information is the punctuation prediction information corresponding to the final recognition result of VAD1, and step S301 removes the punctuation prediction information of the last 4 words in the final recognition result of VAD1 from the historical prediction information; for example, if the final recognition result of VAD1 is "good afternoon of each woman of ancestors", the punctuation prediction information of the 4 words "good afternoon of women everybody in honored" is removed from the historical prediction information. If the text to be predicted is a non-first recognition result of VAD2, for example the 3rd intermediate recognition result of VAD2, the historical prediction information is the punctuation prediction information corresponding to the final recognition result of VAD1 together with the punctuation prediction information of the 2nd recognition result of VAD2, and step S301 removes the punctuation prediction information of the last 4 words of the 2nd intermediate recognition result of VAD2 from the historical prediction information.

As mentioned above, when the punctuation prediction model is trained with the training text, for each word in the training text the model predicts the punctuation after the word according to the words before the word and the second preset number of words after the word. Similarly, when the punctuation prediction model is used to predict punctuation for the recognition result of the speech segment before the current speech segment, i.e. the historical recognition result, the punctuation after each word is predicted according to the words before that word and the second preset number of words after it. However, each of the last second preset number of words of the historical recognition result is followed by fewer than the second preset number of words, which means that the punctuation prediction information of these words may be inaccurate. In order to prevent this information from adversely affecting the punctuation prediction of the text to be predicted, this embodiment removes the punctuation prediction information of the last second preset number of words in the historical recognition result from the historical prediction information.
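The removal in step S301 can be sketched as trimming the tail of a per-word cache. This is an illustration under a stated assumption: the punctuation prediction information of one recognition result is stored as one entry per word, in word order. The function name `trim_history` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def trim_history(per_word_info: Sequence[T], second_preset_number: int) -> List[T]:
    """Drop the cached entries of the last `second_preset_number` words."""
    if second_preset_number <= 0:
        return list(per_word_info)
    return list(per_word_info[:-second_preset_number])
```

For a non-first recognition result, the trimming would apply to the cache of the previous recognition result of the current voice segment, while the cache of the preceding segment's final result is left as it was stored.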

Step S302: splice the last second preset number of words in the historical recognition result with the part of the text to be predicted that has not yet participated in punctuation prediction, and take the spliced text as the input text.

Because the prediction information of the last second preset number of words in the historical recognition result is removed in step S301, the prediction reference information lacks the information of those words. Therefore, in this step, the last second preset number of words in the historical recognition result are spliced with the part of the text to be predicted that has not yet participated in punctuation prediction, and the spliced text is used as the input text of the punctuation prediction model; that is, the last second preset number of words in the historical recognition result also participate in the current punctuation prediction.

It should be noted that, if the text to be predicted is the first recognition result of the current speech segment, none of the words in the text to be predicted has participated in punctuation prediction, and in this case the last second preset number of words in the historical recognition result are spliced with the whole text to be predicted; if the text to be predicted is a non-first recognition result of the current voice segment, the part of the text to be predicted that has not participated in punctuation prediction consists of the words that the text to be predicted adds compared with the previous recognition result of the current voice segment, and in this case the last second preset number of words in the historical recognition result are spliced with those added words.

Illustratively, the second preset number is 4, and the current voice segment is the 2nd voice segment VAD2. If the text to be predicted is the first recognition result of VAD2, assume that it is "at this" and that the final recognition result of the forward adjacent speech segment of the current speech segment is "good afternoon of each woman of each mr. respecting"; step S302 then splices the last 4 words of that final recognition result with the text to be predicted "at this" to obtain "women are good in afternoon <SEP> at this", and the spliced text is used as the input text of the punctuation prediction model. If the text to be predicted is a non-first recognition result of VAD2, assume that it is the 5th recognition result of VAD2, "we are gathering in the nice time of the tamarix in this spring", which adds "we are gathering" compared with the 4th recognition result of VAD2 ("the nice time of the tamarix in this spring"); the last 4 words "the nice time of the tamarix" of the 4th recognition result of VAD2 are spliced with "we are gathering" to obtain the spliced text "the nice time of the tamarix we are gathering", which is used as the input text of the punctuation prediction model.

Note that the above-mentioned symbol "<SEP>" is used to distinguish the recognition results of different speech segments; that is, the texts before and after "<SEP>" are recognition results of different speech segments.
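The splicing of step S302 can be sketched as follows, assuming word lists as input. The function `build_input_text`, its parameters and the placeholder words are hypothetical; the only element taken from the text is the separator symbol, which is inserted when the two parts belong to different speech segments.

```python
from typing import List

SEP = "<SEP>"


def build_input_text(history_words: List[str], new_words: List[str],
                     second_preset_number: int, crosses_segment: bool) -> List[str]:
    """Splice the tail of the historical result with the not-yet-predicted words."""
    tail = history_words[-second_preset_number:] if second_preset_number > 0 else []
    # <SEP> is inserted only when the two parts come from different speech segments.
    separator = [SEP] if crosses_segment else []
    return tail + separator + new_words
```

For a first recognition result, `new_words` would be the whole text to be predicted and `crosses_segment` would be true; for a non-first result, `new_words` would hold only the added words.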

Step S303: input the prediction reference information and the input text into the punctuation prediction model for punctuation prediction to obtain punctuation information of words in the text to be predicted.

Referring to fig. 4, which shows a schematic flow chart of performing punctuation prediction by inputting the prediction reference information and the input text into the punctuation prediction model, the process may include:

Step S401: determine a representation vector of each word in the input text by using the punctuation prediction model.

Wherein, the representation vector of a word is a vector for representing the semantic meaning of the word.

Specifically, each word in the input text is input into a word vector determination module of the punctuation prediction model, and a representation vector of each word in the input text is obtained.

Step S402: determine a target vector corresponding to each word in the input text by using the punctuation prediction model, the prediction reference information and the representation vector of each word in the input text.

The target vector corresponding to a word in the input text represents the degree of correlation between that word and the words before it and the second preset number of words after it in the input text.

Specifically, for each word in the input text, the attention module of the punctuation prediction model, the prediction reference information, the representation vector of the word, the representation vectors of the third preset number of words before the word, and the representation vectors of the second preset number of words after the word may be used to determine the target vector corresponding to the word, which indicates the degree of correlation between the word and the third preset number of words before it and the second preset number of words after it.

It should be noted that when determining the target vector corresponding to each word in the input text, a mask may be used to block information of words except for the word, a third preset number of words before the word, and a second preset number of words after the word.
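The masking described above can be illustrated with a boolean window mask over the positions of the input text. This is a simplified sketch: it covers only the words of the input text (not the cached prediction reference information), and the function name `window_mask` is hypothetical.

```python
from typing import List


def window_mask(num_words: int, third_preset_number: int,
                second_preset_number: int) -> List[List[bool]]:
    """mask[i][j] is True when word i may use information from word j."""
    return [
        [i - third_preset_number <= j <= i + second_preset_number
         for j in range(num_words)]
        for i in range(num_words)
    ]
```

In an attention module, positions where the mask is False would typically have their scores suppressed before normalization, so that each word only sees its own window.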

Step S403: determine punctuation information of the words in the text to be predicted by using the punctuation prediction model, the representation vector of each word in the input text and the target vector corresponding to each word in the input text.

Specifically, the process of determining punctuation information of words in a text to be predicted by using a punctuation prediction model, a representation vector of each word in an input text and a target vector corresponding to each word in the input text comprises the following steps:

Step S4031: for each word in the input text, predict punctuation information after each word in the second preset number of words before the word by using a punctuation information determination module of the punctuation prediction model, the representation vector of the word and the target vector corresponding to the word.

Specifically, for each word in the input text, the probability that the punctuation after each word in the second preset number of words before the word is in each punctuation category in the preset multiple punctuation categories is determined by using the punctuation prediction model, the representation vector of the word and the target vector corresponding to the word, and the punctuation information after each word in the second preset number of words before the word is determined according to the determined probability. Wherein, the preset multiple punctuation categories may include: no punctuation, pause, comma, period, question mark, exclamation mark.
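Turning per-category probabilities into punctuation information can be sketched as below. The use of a softmax over raw scores and the choice of the most probable category are assumptions made for illustration; the text only states that a probability is determined for each preset category. The function names are hypothetical.

```python
import math
from typing import List, Sequence

CATEGORIES = ["no punctuation", "pause", "comma", "period",
              "question mark", "exclamation mark"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def most_probable_category(scores: Sequence[float]) -> str:
    """Return the punctuation category with the highest probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best]
```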

For example, the second preset number is 4, the input text includes the word "big person", and the 4 words before "big person" are "each mr each woman"; step S4031 then predicts the punctuation information after each of the four words "each mr each woman" by using the punctuation prediction model, the representation vector of the word "big person" and the target vector corresponding to the word "big person".

Step S4032: for each word in the input text, determine the punctuation information after the word by using the punctuation information determination module of the punctuation prediction model and the punctuation information predicted for the word by the second preset number of words after the word.

The punctuation of a word in the input text may be determined by punctuation information predicted for the word by a second preset number of words after the word, and optionally, the punctuation of a word in the input text may be determined by punctuation information predicted for the word by a word that is farthest from the word in the second preset number of words after the word.

Illustratively, the second preset number is 4, the input text includes the word "mr", and the 4 words after "mr" are "each woman afternoon"; the punctuation information predicted for "mr" by the word "afternoon" is determined as the punctuation information of the word "mr".

It should be noted that, for each word among the last second preset number of words of the input text, the number of words after it is less than the second preset number; in this embodiment, the punctuation information predicted for such a word by the word farthest from it among the words after it is determined as the punctuation information of the word. Illustratively, the second preset number is 4 and the last 4 words in the input text are "this spring spidery"; the number of words after "this" is only 3, and the punctuation information predicted for "this" by the farthest of these 3 words is determined as the punctuation information of "this". It should also be noted that no prediction result is obtained in the current round for the last word in the input text; the punctuation information of the last word in the input text is predicted together with the next text to be predicted.

Step S4033: acquire punctuation information of words in the text to be predicted from the punctuation information of the words in the input text.
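The selection rule of steps S4031 and S4032, including the handling of the last words, can be sketched as follows. The data layout is an assumption: `lookback[j][d - 1]` holds the label that word j predicted for the word d positions before it. The function name `resolve_punctuation` is hypothetical.

```python
from typing import List, Optional, Sequence


def resolve_punctuation(lookback: Sequence[Sequence[str]],
                        second_preset_number: int) -> List[Optional[str]]:
    """Pick, for each word, the label predicted by the farthest later word.

    lookback[j][d - 1] is the label that word j predicted for the word
    d positions before it (1 <= d <= second_preset_number).
    """
    n = len(lookback)
    resolved: List[Optional[str]] = []
    for i in range(n):
        # Farthest later word that exists within the look-ahead window.
        j = min(i + second_preset_number, n - 1)
        if j == i:
            resolved.append(None)  # last word: deferred to the next round
        else:
            resolved.append(lookback[j][j - i - 1])
    return resolved
```

Words near the end of the input text fall back to the last available word, and the final word receives no label in the current round, as described above.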

In this embodiment, punctuation information of a word without punctuation information in the text to be predicted may be obtained from punctuation information of each word in the input text. Specifically, if the text to be predicted is the first recognition result of the current voice segment, punctuation information of each word in the text to be predicted is obtained from punctuation information of each word in the input text; and if the text to be predicted is the non-first recognition result of the current voice segment, acquiring punctuation information of words of the text to be predicted, which is increased compared with the previous recognition result of the current voice segment, from punctuation information of all words in the input text.
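Step S4033 can be sketched as taking the tail of the per-word results of the input text. The sketch assumes that the words still lacking punctuation information sit at the end of the input text, after the spliced historical words and any separator, as step S302 arranges; the function name `labels_for_new_words` is hypothetical.

```python
from typing import List, Optional


def labels_for_new_words(input_labels: List[Optional[str]],
                         num_new_words: int) -> List[Optional[str]]:
    """Return the labels of the last `num_new_words` entries of the input text."""
    if num_new_words <= 0:
        return []
    return input_labels[-num_new_words:]
```

For a first recognition result, `num_new_words` would be the length of the whole text to be predicted; for a non-first result, it would be the number of added words.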

The above describes determining the punctuation information of the text to be predicted by using a pre-established punctuation prediction model; next, the process of establishing the punctuation prediction model is introduced.

The process of establishing the punctuation prediction model comprises the following steps:

Step a1: pre-train the initial punctuation prediction model by using the training data in the first training data set to obtain a pre-trained punctuation prediction model.

Wherein the first training data set comprises a plurality of pieces (typically on the order of billions) of training data, each piece of training data being sentence-level monolingual text data having punctuation.

Step a2: screen out the training data of better quality from the first training data set.

Specifically, a portion of data of better quality and a portion of data of poorer quality may be manually screened out from the first training data set to obtain two classes of training data; a binary classification model is trained by using the two classes of screened training data, and the training data in the first training data set is then classified by using the binary classification model, so that training data of better quality, for example two million pieces, can be obtained from the first training data set according to the classification result.

Step a3: construct new training data by using the screened training data, form a second training data set from the constructed training data, and fine-tune the pre-trained punctuation prediction model by using the training data in the second training data set; the fine-tuned punctuation prediction model is the final punctuation prediction model.

Specifically, the screened training data can be used to construct new training data in three ways:

the first way: randomly insert the symbol "<SEP>" into a sentence after word segmentation;

the second way: insert the symbol "<SEP>" before the commas and pauses of a sentence;

the third way: splice two sentences and insert the symbol "<SEP>" between them.

The ratio of the numbers of training data constructed in the above three ways may be set to 1:1:1.
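The three construction ways can be sketched on tokenized sentences as follows. This is an illustrative example only: the function names, the set of marks treated as comma or pause, and the choice of a uniformly random insertion position are assumptions not stated in the text.

```python
import random
from typing import List

SEP = "<SEP>"
PAUSE_MARKS = {",", "，", "、"}  # comma and pause marks (assumed token forms)


def insert_random_sep(words: List[str], rng: random.Random) -> List[str]:
    """Way 1: insert <SEP> at a random position of a segmented sentence."""
    position = rng.randrange(len(words) + 1)
    return words[:position] + [SEP] + words[position:]


def insert_sep_before_pauses(tokens: List[str]) -> List[str]:
    """Way 2: insert <SEP> before every comma or pause mark."""
    result: List[str] = []
    for token in tokens:
        if token in PAUSE_MARKS:
            result.append(SEP)
        result.append(token)
    return result


def join_sentences(first: List[str], second: List[str]) -> List[str]:
    """Way 3: splice two sentences with <SEP> between them."""
    return first + [SEP] + second
```

To approximate the 1:1:1 ratio, each of the three functions could be applied to an equally sized share of the screened training data.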

And further training the pre-trained punctuation prediction model by utilizing the training data constructed in the way until the model converges, wherein the model obtained after the training is the final punctuation prediction model.

It should be noted that, in order to improve the training efficiency of the model, a plurality of pieces of training data are input to the model each time for parallel training; accordingly, the punctuation prediction model obtained by training can perform punctuation prediction on a plurality of texts at the same time.

Fourth embodiment

The embodiments of the present application further provide a punctuation prediction device, which is described below; the punctuation prediction device described below and the punctuation prediction method described above may be referred to in correspondence with each other.

Referring to fig. 5, a schematic structural diagram of a punctuation prediction apparatus provided in an embodiment of the present application is shown, which may include: a text to be predicted acquisition module 501, a history prediction information acquisition module 502 and a punctuation prediction module 503.

A to-be-predicted text obtaining module 501, configured to obtain a to-be-predicted text, where the to-be-predicted text is a current recognition result of a current speech segment, and a recognition result of a speech segment includes a plurality of intermediate recognition results and a final recognition result;

a historical prediction information obtaining module 502, configured to obtain historical prediction information based on whether the text to be predicted is the first recognition result of the current speech segment, where the historical prediction information is intermediate information that is generated in a process of performing punctuation prediction on a historical recognition result and is used for determining a punctuation prediction result.

And a punctuation prediction module 503, configured to predict punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted.

Optionally, the punctuation prediction apparatus provided in the embodiment of the present application may further include: an update type determining module and a quantity counting module.

And the updating type determining module is used for determining the updating type corresponding to the text to be predicted according to the updating condition of the text to be predicted compared with the previous recognition result.

The historical prediction information obtaining module is specifically configured to, when the update type corresponding to the text to be predicted is modified, obtain historical prediction information based on whether the text to be predicted is the first recognition result of the current speech segment.

And the number counting module is used for counting the number of words of the text to be predicted, which are increased compared with the previous recognition result, when the updating type corresponding to the text to be predicted is increased.

Optionally, the punctuation prediction module is specifically configured to predict punctuation information of words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted.

Optionally, the historical prediction information obtaining module 502 is specifically configured to, if the text to be predicted is the first recognition result of the current speech segment, obtain punctuation prediction information corresponding to the final recognition result of the speech segment before the current speech segment, as historical prediction information; and if the text to be predicted is the non-first recognition result of the current voice segment, obtaining punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment and punctuation prediction information corresponding to the previous recognition result of the current voice segment as historical prediction information. The punctuation prediction information is intermediate information which is generated in the punctuation prediction process of the corresponding recognition result and is used for determining the punctuation prediction result.

Optionally, the punctuation prediction module 503 comprises: a prediction reference information obtaining sub-module, an input text obtaining sub-module and a punctuation prediction sub-module.

And the prediction reference information acquisition submodule is used for removing punctuation prediction information of the last second preset number of words in the history recognition result from the history prediction information, and the information obtained after removal is used as prediction reference information.

And the input text acquisition submodule is used for splicing the last second preset number of words in the historical recognition result with the part which does not participate in the punctuation prediction in the text to be predicted, and the spliced text is used as the input text.

And the punctuation prediction submodule is used for inputting the prediction reference information and the input text into the punctuation prediction model to carry out punctuation prediction so as to obtain punctuation information of words in the text to be predicted.

Optionally, the prediction reference information obtaining sub-module is specifically configured to, if the text to be predicted is the first recognition result of the current speech segment, remove, from the historical prediction information, punctuation prediction information of a last second preset number of words in the final recognition result of a forward adjacent speech segment of the current speech segment; and if the text to be predicted is the non-first recognition result of the current voice segment, removing punctuation prediction information of the last second preset number of words in the previous recognition result of the current voice segment from the historical prediction information.

Optionally, the input text obtaining sub-module is specifically configured to splice the last second preset number of words in the historical recognition result with the whole text to be predicted if the text to be predicted is the first recognition result of the current speech segment; and if the text to be predicted is a non-first recognition result of the current voice segment, splice the last second preset number of words in the historical recognition result with the part of the text to be predicted that is added compared with the previous recognition result of the current voice segment.

Optionally, the punctuation prediction sub-module is specifically configured to: determine, by using the punctuation prediction model, a representation vector of each word in the input text; determine, by using the punctuation prediction model, the representation vector of each word in the input text and the prediction reference information, a target vector corresponding to each word in the input text, where the target vector corresponding to a word in the input text represents the degree of correlation between that word and the words before it and the second preset number of words after it in the input text; and determine punctuation information of words in the text to be predicted by using the punctuation prediction model, the representation vector of each word in the input text and the target vector corresponding to each word in the input text.

Optionally, when determining punctuation information of the words in the text to be predicted by using the punctuation prediction model, the representation vector of each word in the input text, and the target vector corresponding to each word in the input text, the punctuation prediction sub-module is specifically configured to: for each word in the input text, predict, by using the punctuation prediction model, the representation vector of the word, and the target vector corresponding to the word, the punctuation information following each of the second preset number of words before the word; for each word in the input text, determine the punctuation information of the word by using the punctuation prediction model and the punctuation information predicted for the word based on the second preset number of words after the word; and acquire the punctuation information of the words in the text to be predicted from the punctuation information of all words in the input text.
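As an illustration of the second step, in which several later words each contribute a prediction for the same earlier word, the sketch below merges those predictions by summing scores per label and taking the highest total. The passage above leaves the combination to the punctuation prediction model, so this summing rule, the label names, and the function name are all assumptions made only for illustration.

```python
from typing import Dict, Sequence


def combine_votes(votes: Sequence[Dict[str, float]]) -> str:
    """Merge the label scores that later words predicted for one word.

    Each element of `votes` maps a punctuation label to a score, as
    predicted from one of the following words. Summing is a stand-in
    for whatever combination the model actually learns.
    """
    if not votes:
        raise ValueError("at least one prediction is required")
    totals: Dict[str, float] = {}
    for scores in votes:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    # Label with the highest summed score; ties go to the label seen first.
    return max(totals, key=lambda label: totals[label])
```

For instance, if one following word scores a comma at 0.6 and another scores it at 0.3, while "no punctuation" receives 0.4 and 0.7, the combined decision is "no punctuation".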

The punctuation prediction device provided by this embodiment can predict the punctuation information of the recognition result of a speech segment more accurately and efficiently.

Fifth embodiment

An embodiment of the present application further provides a punctuation prediction device. Referring to fig. 6, which shows a schematic structural diagram of the punctuation prediction device, the punctuation prediction device may include: at least one processor 601, at least one communication interface 602, at least one memory 603, and at least one communication bus 604;

in the embodiment of the present application, the number of the processor 601, the communication interface 602, the memory 603, and the communication bus 604 is at least one, and the processor 601, the communication interface 602, and the memory 603 complete communication with each other through the communication bus 604;

the processor 601 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, or the like;

the memory 603 may include a high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory;

wherein the memory stores a program and the processor can call the program stored in the memory, the program for:

Optionally, for the detailed functions and extended functions of the program, refer to the description above.

Sixth embodiment

Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A punctuation prediction method comprising:

2. The punctuation prediction method of claim 1, further comprising:

3. The punctuation prediction method of claim 1, wherein the obtaining of historical prediction information based on whether the text to be predicted is the first recognition result of the current speech segment comprises:

4. The punctuation prediction method of claim 1, wherein the predicting punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted comprises:

predicting punctuation information of words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted;

the punctuation prediction model is obtained by training on a punctuated training text, the training text is formed by splicing texts representing recognition results of a plurality of speech segments, and when the punctuation prediction model is trained by using the training text, for each word in the training text, the punctuation prediction model predicts punctuation information of the word according to the words before the word and a second preset number of words after the word.

5. The punctuation prediction method of claim 4, wherein the predicting punctuation information of words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted comprises:

6. The punctuation prediction method of claim 5 wherein said removing punctuation prediction information of a last second preset number of words in the historical recognition result from said historical prediction information comprises:

7. The punctuation prediction method of claim 5, wherein the splicing the last second preset number of words in the historical recognition result with the part of the text to be predicted which does not participate in punctuation prediction comprises:

8. The punctuation prediction method of claim 5, wherein the inputting the prediction reference information and the input text into the punctuation prediction model for punctuation prediction, to obtain punctuation information of words in the text to be predicted, comprises:

9. The punctuation prediction method of claim 8, wherein the determining punctuation information of words in the text to be predicted by using the punctuation prediction model, the representation vector of each word in the input text and the target vector corresponding to each word in the input text comprises:

for each word in the input text, determining punctuation information of the word by using the punctuation prediction model and punctuation information predicted for the word based on a second preset number of words after the word;

10. A punctuation prediction apparatus, comprising: a to-be-predicted text acquisition module, a historical prediction information acquisition module, and a punctuation prediction module;

11. A punctuation prediction device characterized by comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of the punctuation prediction method according to any one of claims 1 to 9.

12. A readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the punctuation prediction method as claimed in any one of claims 1 to 9.