
CN109979435B - Data processing method and device for data processing - Google Patents


Info

Publication number
CN109979435B
Authority
CN
China
Prior art keywords
corpus
punctuation
text
language model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711464113.7A
Other languages
Chinese (zh)
Other versions
CN109979435A (en)
Inventor
姜里羊
王宇光
阳家俊
施亮亮
卫林钰
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201711464113.7A
Publication of CN109979435A
Application granted
Publication of CN109979435B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04: Segmentation; Word boundary detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a data processing method and apparatus, and a device for data processing. The method includes: acquiring a training corpus, where the training corpus includes a first corpus corresponding to an incomplete sentence; performing feature extraction on the training corpus, where the training features corresponding to the language model include the position, within the complete sentence, of a language unit of the first corpus and the punctuation condition behind the language unit; and training a language model on the training data according to the training features. Embodiments of the present invention can improve the accuracy of punctuation addition.

Description

Data processing method and device for data processing
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a data processing method and apparatus, and an apparatus for data processing.
Background
In information processing fields such as communications and the Internet, certain application scenarios require punctuation to be added to text that lacks it. For example, to facilitate reading, punctuation is added to the text corresponding to a speech recognition result.
Existing solutions may use a language model to add punctuation to text. A language model describes the probability distribution of a given sequence of character units in a language, where a character unit may be a word and/or a punctuation mark; the output of the language model may be a probability score for a character unit sequence. The punctuation addition result for a text is then determined according to the probability scores that the language model outputs for candidate character unit sequences.
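To make the scoring idea concrete, here is a minimal, hypothetical sketch of how such a probability score could be computed, using a toy add-alpha-smoothed bigram model over character units (words and punctuation marks). The corpus, the smoothing scheme, and all names are illustrative assumptions, not the patent's actual model:

```python
import math
from collections import Counter

def train_bigram_counts(sequences):
    """Count unigrams and bigrams over sequences of character units
    (words and punctuation marks), with sentence-boundary padding."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + seq + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def score(seq, unigrams, bigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed log-probability score of a character unit sequence."""
    padded = ["<s>"] + seq + ["</s>"]
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        logp += math.log((bigrams[(prev, cur)] + alpha)
                         / (unigrams[prev] + alpha * vocab_size))
    return logp

# Hypothetical training sequences; words and punctuation are both character units.
corpus = [
    ["hello", ",", "today", "the", "weather", "is", "nice", "."],
    ["hello", ",", "today", "we", "meet", "."],
]
uni, bi = train_bigram_counts(corpus)
s_good = score(["hello", ",", "today"], uni, bi, len(uni))
s_bad = score(["hello", ".", "today"], uni, bi, len(uni))
print(s_good > s_bad)  # the punctuation pattern seen in training scores higher
```

The comparison between the two candidate sequences mirrors how a punctuation-addition system would rank alternatives: the sequence whose unit transitions were observed in training receives the higher score.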
In the course of implementing the embodiments of the present invention, the inventors found that existing language model training methods typically use training corpora corresponding to complete sentences. A language model trained on such corpora therefore acquires the punctuation-adding capability for complete sentences only: it tends to add punctuation at the tail of a text, and when the text is an incomplete sentence it often produces an erroneous punctuation addition result, leading to low punctuation-addition accuracy.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a data processing method, a data processing apparatus, and a device for data processing that overcome, or at least partially solve, the above problems and can improve the accuracy of punctuation addition.
In order to solve the above problem, an embodiment of the present invention discloses a data processing method, including:
acquiring a training corpus, where the training corpus includes: a first corpus corresponding to an incomplete sentence;
performing feature extraction on the training corpus, where the training features corresponding to the language model include: the position, within the complete sentence, of a language unit of the first corpus, and the punctuation condition behind the language unit; and
training a language model on the training data according to the training features.
Optionally, the first corpus corresponding to the incomplete sentence is obtained by truncating the second corpus corresponding to the complete sentence.
Optionally, acquiring the training corpus includes:
performing word segmentation on a second corpus corresponding to a complete sentence to obtain the words included in the second corpus;
determining a truncation position in the second corpus according to the words included in the second corpus; and
cutting out, from front to back, the character string up to the truncation position from the second corpus to serve as the first corpus corresponding to the incomplete sentence.
Optionally, the truncation position is located between two adjacent words.
Optionally, the truncation position is not adjacent to a punctuation included in the second corpus.
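The optional constraints above (a truncation position between two adjacent words, never adjacent to punctuation) can be sketched as follows. The word list, punctuation set, and function name are illustrative assumptions:

```python
PUNCT = {",", ".", "?", "!", "，", "。", "？", "！"}

def truncation_prefixes(words):
    """Generate first-corpus candidates by truncating a segmented complete
    sentence (the second corpus). A truncation position lies between two
    adjacent words and must not be adjacent to a punctuation mark."""
    positions = [
        i for i in range(1, len(words))
        if words[i - 1] not in PUNCT and words[i] not in PUNCT
    ]
    return [words[:i] for i in positions]

# Hypothetical segmented second corpus (a complete sentence).
sentence = ["hello", ",", "today", "the", "weather", "is", "nice", "."]
prefixes = truncation_prefixes(sentence)
for p in prefixes:
    print(" ".join(p))
```

Because truncation positions adjacent to punctuation are excluded, no generated prefix starts or ends at a punctuation mark, which keeps the incomplete-sentence samples natural.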
Optionally, the training corpus corresponding to the language model further includes: the second corpus corresponding to the complete sentence.
On the other hand, the embodiment of the invention discloses a data processing method, which comprises the following steps:
acquiring a text to be processed;
adding punctuation to the text to be processed by using a language model to obtain a punctuation addition result corresponding to the text to be processed; where the training corpus corresponding to the language model includes: a first corpus corresponding to an incomplete sentence; and the training features corresponding to the language model include: the position, within the complete sentence, of a language unit of the first corpus, and the punctuation condition behind the language unit;
and outputting a punctuation addition result corresponding to the text to be processed.
Optionally, the first corpus corresponding to the incomplete sentence is obtained by truncating the second corpus corresponding to the complete sentence.
Optionally, adding punctuation to the text to be processed by using the language model includes:
performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
adding punctuation marks between adjacent words in the global word sequence to obtain multiple candidate punctuation addition results corresponding to the global word sequence;
determining, according to the language model, a probability score corresponding to each of the multiple candidate punctuation addition results; and
selecting, from the multiple candidate punctuation addition results, the one with the highest probability score as the punctuation addition result corresponding to the text to be processed.
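The four optional steps above (segment, enumerate candidates, score, select the best) can be sketched as follows. The toy scoring function merely stands in for the trained language model, and every name here is an illustrative assumption:

```python
from itertools import product

PUNCT_OPTIONS = ["", ","]  # marks considered at each gap; "" means no punctuation

def candidate_additions(words):
    """Enumerate candidate punctuation addition results for a global word
    sequence: each gap after a word (including the tail) gets one option."""
    for choice in product(PUNCT_OPTIONS, repeat=len(words)):
        out = []
        for w, p in zip(words, choice):
            out.append(w)
            if p:
                out.append(p)
        yield out

def best_addition(words, score):
    """Select the candidate with the highest probability score."""
    return max(candidate_additions(words), key=score)

def toy_score(seq):
    """Stand-in for the trained language model: reward a comma after 'hello',
    penalize any punctuation at the tail (as for an incomplete sentence)."""
    s = 1.0 if any(a == "hello" and b == "," for a, b in zip(seq, seq[1:])) else 0.0
    if seq[-1] in {",", "."}:
        s -= 2.0
    return s

best = best_addition(["hello", "today"], toy_score)
print(best)
```

In a real system the gap-wise enumeration would be pruned (for example with beam search), since the number of candidates grows exponentially with the number of gaps.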
In another aspect, an embodiment of the present invention discloses a data processing apparatus, including:
the corpus acquiring module is used for acquiring training corpuses; the corpus comprises: a first corpus corresponding to the incomplete sentence;
a feature extraction module, configured to perform feature extraction on the training corpus, where the training features corresponding to the language model include: the position, within the complete sentence, of a language unit of the first corpus, and the punctuation condition behind the language unit; and
a model training module, configured to train a language model on the training data according to the training features.
Optionally, the first corpus corresponding to the incomplete sentence is obtained by truncating the second corpus corresponding to the complete sentence.
Optionally, the corpus acquiring module includes:
the word segmentation sub-module is used for segmenting words of a second corpus corresponding to the complete sentence to obtain words included in the second corpus;
a truncation position determining submodule, configured to determine a truncation position corresponding to the second corpus according to a vocabulary included in the second corpus; and
a truncation submodule, configured to cut out, from front to back, the character string up to the truncation position from the second corpus to serve as the first corpus corresponding to the incomplete sentence.
Optionally, the truncation position is located between two adjacent words.
Optionally, the truncation position is not adjacent to a punctuation included in the second corpus.
Optionally, the training corpus corresponding to the language model further includes: the second corpus corresponding to the complete sentence.
In another aspect, an embodiment of the present invention discloses a data processing apparatus, including:
the text acquisition module is used for acquiring a text to be processed;
a punctuation adding module, configured to add punctuation to the text to be processed by using a language model to obtain a punctuation addition result corresponding to the text to be processed; where the training corpus corresponding to the language model includes: a first corpus corresponding to an incomplete sentence; and the training features corresponding to the language model include: the position, within the complete sentence, of a language unit of the first corpus, and the punctuation condition behind the language unit; and
and the result output module is used for outputting the punctuation addition result corresponding to the text to be processed.
Optionally, the first corpus corresponding to the incomplete sentence is obtained by truncating the second corpus corresponding to the complete sentence.
Optionally, the punctuation adding module includes:
the word segmentation sub-module is used for performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
the punctuation adding submodule is used for adding punctuation marks between adjacent vocabularies in the global word sequence so as to obtain a plurality of alternative punctuation adding results corresponding to the global word sequence;
a probability score determining submodule for determining a probability score corresponding to each of the candidate punctuation addition results according to a language model; and
a selection submodule, configured to select, from the multiple candidate punctuation addition results, the one with the highest probability score as the punctuation addition result corresponding to the text to be processed.
In yet another aspect, an embodiment of the present invention discloses a device for data processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring a training corpus, where the training corpus includes: a first corpus corresponding to an incomplete sentence;
performing feature extraction on the training corpus, where the training features corresponding to the language model include: the position, within the complete sentence, of a language unit of the first corpus, and the punctuation condition behind the language unit; and
training a language model on the training data according to the training features.
In yet another aspect, an embodiment of the present invention discloses a device for data processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring a text to be processed;
adding punctuation to the text to be processed by using a language model to obtain a punctuation addition result corresponding to the text to be processed; where the training corpus corresponding to the language model includes: a first corpus corresponding to an incomplete sentence; and the training features corresponding to the language model include: the position, within the complete sentence, of a language unit of the first corpus, and the punctuation condition behind the language unit;
and outputting a punctuation addition result corresponding to the text to be processed.
In yet another aspect, an embodiment of the present invention discloses a machine-readable medium having stored thereon instructions, which, when executed by one or more processors, cause an apparatus to perform the aforementioned data processing method.
The embodiment of the invention has the following advantages:
according to the embodiment of the invention, the language model is trained according to the first corpus corresponding to the incomplete sentences, so that the trained language model has punctuation adding capability of the incomplete sentences, and thus the accuracy of punctuation adding can be improved.
Moreover, the training features corresponding to the language model in embodiments of the present invention may include: the position, within the complete sentence, of a language unit of the first corpus, and the punctuation condition behind the language unit. A language unit may be an independent unit in the corpus, such as a single character or a word. The first corpus and these training features enable the trained language model to refrain from adding punctuation at the end of a text that is an incomplete sentence, so that embodiments of the present invention obtain more accurate punctuation addition results for incomplete sentences; that is, the accuracy of punctuation addition is improved.
Drawings
FIG. 1 is a schematic diagram of an exemplary architecture of a speech recognition system of the present invention;
FIG. 2 is a flow chart of the steps of a method for training a language model according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating steps of a method for acquiring a first corpus corresponding to an incomplete statement according to an embodiment of the present invention;
FIG. 4 is a flow chart of the steps of a data processing method embodiment of the present invention;
FIG. 5 is a diagram illustrating a punctuation addition process for a global word sequence according to an embodiment of the present invention;
FIG. 6 is a block diagram of an embodiment of a data processing apparatus according to the present invention;
FIG. 7 is a block diagram of another data processing apparatus embodiment of the present invention;
FIG. 8 is a block diagram illustrating an apparatus for data processing as a terminal in accordance with an example embodiment; and
fig. 9 is a block diagram illustrating an apparatus for data processing as a server according to an example embodiment.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
In the course of implementing the embodiments of the present invention, the inventors found that the training corpus used by existing language model training methods is typically the corpus corresponding to complete sentences, and the features used may include: the position of a single character or word of the corpus within the complete sentence, and the punctuation condition behind that character or word (whether punctuation follows it). As a result, the existing trained language model tends to add punctuation at the tail position of a text; that is, it considers the probability corresponding to a first character unit sequence better than the probability corresponding to a second character unit sequence (where better or worse can be represented by a probability score; generally, the higher the score, the better). Here, the first character unit sequence is the sequence in which the text is an incomplete sentence and punctuation is added at the tail position of the text, and the second character unit sequence is the sequence in which the text is an incomplete sentence and no punctuation is added at the tail position. Suppose the text is "hello today": the first character unit sequence may be "hello, today." and the second character unit sequence may be "hello, today". The existing language model would then consider the probability of "hello, today." better than that of "hello, today". Since "hello, today." contains an obvious punctuation error, using the existing language model to add punctuation to text corresponding to an incomplete sentence often yields an erroneous punctuation addition result, and the accuracy of punctuation addition is low.
To address the technical problem of low punctuation-addition accuracy in existing schemes, an embodiment of the present invention provides a data processing scheme that can: acquire a training corpus, where the training corpus includes a first corpus corresponding to an incomplete sentence; perform feature extraction on the training corpus, where the training features corresponding to the language model include the position, within the complete sentence, of a language unit of the first corpus and the punctuation condition behind the language unit; and train a language model on the training data according to the training features.
In embodiments of the present invention, a sentence is the basic unit of language use. It is composed of words and phrases and can express a complete meaning, such as telling someone about a matter, raising a question, expressing a request or prohibition, expressing a certain feeling, or indicating the continuation or omission of a passage of speech. There is a relatively large pause between sentences, and the end of a sentence should be marked with a period, question mark, ellipsis, or exclamation point. A complete sentence can express a complete meaning, while an incomplete sentence cannot. A corpus may refer to an instance of language use in natural language processing.
According to the embodiments of the present invention, the language model is trained on the first corpus corresponding to incomplete sentences, so that the trained language model has the punctuation-adding capability for incomplete sentences, and the accuracy of punctuation addition can thus be improved.
Moreover, the training features corresponding to the language model in embodiments of the present invention may include: the position, within the complete sentence, of a language unit of the first corpus, and the punctuation condition behind the language unit. A language unit may be an independent unit in the corpus, such as a single character or a word. The first corpus and these training features enable the trained language model to refrain from adding punctuation at the end of a text that is an incomplete sentence, so that embodiments of the present invention obtain more accurate punctuation addition results for incomplete sentences; that is, the accuracy of punctuation addition is improved.
In embodiments of the present invention, the language model is trained according to the first corpus corresponding to incomplete sentences and the training features described above, so the trained language model considers the probability corresponding to the second character unit sequence better than the probability corresponding to the first character unit sequence. Here, the first character unit sequence is the sequence in which the text is an incomplete sentence and punctuation is added at the tail position of the text, and the second character unit sequence is the sequence in which the text is an incomplete sentence and no punctuation is added at the tail position; that is, the probability the language model outputs for the second character unit sequence is better than the probability it outputs for the first. Suppose the text to be processed is "hello today": the first character unit sequence may be "hello, today." and the second may be "hello, today". The language model of the present invention then considers the probability of "hello, today" better than that of "hello, today.", so the punctuation addition result obtained for "hello today" can be "hello, today". Since "hello, today" contains no obvious punctuation error, using the language model of the present invention to add punctuation to text corresponding to an incomplete sentence improves the accuracy of punctuation addition.
An embodiment of the present invention further provides a data processing scheme that can: acquire a text to be processed; add punctuation to the text to be processed by using a language model to obtain a punctuation addition result corresponding to the text to be processed, where the training corpus corresponding to the language model includes a first corpus corresponding to an incomplete sentence, and the training features corresponding to the language model include the position, within the complete sentence, of a language unit of the first corpus and the punctuation condition behind the language unit; and output the punctuation addition result corresponding to the text to be processed.
Embodiments of the present invention can be applied in any scenario that requires punctuation addition, such as speech recognition and speech translation. Speech translation scenarios include simultaneous interpretation, an interpretation mode in which an interpreter continuously renders a speaker's speech content to an audience without interrupting the speaker. At present, simultaneous interpretation is widely used in scenarios such as large conferences, lectures, exhibitions, and scenic spots. Taking a conference as an example, during the conference a simultaneous interpreter sits in a soundproof booth and, using professional equipment, synchronously interprets the content heard through earphones into the target language, outputting it through a microphone; meanwhile, participants who need the simultaneous interpretation service can hear the interpreted information through their earphones.
In the simultaneous interpretation scenario, it is usually necessary to obtain a text corresponding to an incomplete sentence through speech recognition and add punctuation to the text corresponding to the incomplete sentence when a speaking user does not input the complete sentence. In a speech recognition scenario, if the speech speed of the speaking user is slow, it is also necessary to obtain a text corresponding to an incomplete sentence through speech recognition and add punctuation to the text corresponding to the incomplete sentence under the condition that the speaking user does not input the complete sentence. It is to be understood that the embodiments of the present invention are not limited to specific application scenarios.
The data processing method provided by the embodiment of the invention can be applied to the application environment of devices such as a terminal or a server. Optionally, the terminal may include, but is not limited to: smart phones, tablets, laptop portable computers, in-vehicle computers, desktop computers, smart televisions, wearable devices, and the like. The server can be a cloud server or a common server and is used for providing punctuation addition service for the client.
The data processing method provided by the embodiment of the invention can be suitable for processing Chinese, Japanese, Korean and other languages, and is used for improving the accuracy of punctuation addition. It is understood that any language in which punctuation is required is within the scope of the data processing method of the embodiments of the present invention.
Referring to FIG. 1, an exemplary structural diagram of a speech recognition system of the present invention is shown, which may include: a speech recognition apparatus 101 and a punctuation adding apparatus 102. The speech recognition apparatus 101 and the punctuation adding apparatus 102 may be separate devices (each a server or a terminal), or may be disposed together in the same device; it is understood that the embodiments of the present invention do not limit the specific arrangement of the speech recognition apparatus 101 and the punctuation adding apparatus 102.
The speech recognition apparatus 101 may be configured to convert a speech signal of a speaking user into text information, and specifically, the speech recognition apparatus 101 may output a speech recognition result. In practical applications, a speaking user may speak in a speech recognition scene, a speech translation scene, and other scenes and send out a speech signal, and then the speech signal of the speaking user may be received by a microphone or other speech acquisition devices, and the received speech signal is sent to the speech recognition device 101; alternatively, the voice recognition apparatus 101 may have a function of receiving a voice signal of a speaking user.
Optionally, the speech recognition apparatus 101 may employ speech recognition technology to convert the speech signal of the speaking user into text information. Denote the speech signal of the speaking user as S. Processing S yields a corresponding speech feature sequence O = {O1, O2, …, Ok, …, OT}, where Ok is the k-th speech feature and T is the total number of speech features. A sentence corresponding to the speech signal S can be regarded as a word string composed of many words, denoted W = {w1, w2, …, wn}. The process of speech recognition is to find the most probable word string W given the known speech feature sequence O, where k, T, and n are positive integers.
Specifically, speech recognition is a model-matching process. A speech model is first established according to the speech characteristics of a person, and the templates required for speech recognition are built by analyzing the input speech signal and extracting the required features. Recognizing the speech input by a user is the process of comparing the features of that input with the templates and finally determining the best-matching template so as to obtain the speech recognition result. The specific speech recognition algorithm may be a training and recognition algorithm based on statistical hidden Markov models, or another algorithm such as a training and recognition algorithm based on neural networks or a recognition algorithm based on dynamic time warping.
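The "most probable word string" formulation above can be illustrated with a minimal, hypothetical decoding sketch: combining an acoustic score log P(O|W) with a language-model score log P(W) and picking the best candidate word string. The candidate strings and score values are invented purely for illustration:

```python
import math

def recognize(candidates, acoustic_logp, lm_logp):
    """Noisy-channel decoding: pick the word string W maximizing
    log P(O|W) + log P(W), i.e. argmax over W of P(W|O) by Bayes' rule."""
    return max(candidates, key=lambda w: acoustic_logp[w] + lm_logp[w])

# Hypothetical scores for two candidate transcriptions of one utterance.
acoustic = {"hello today": math.log(0.6), "hollow today": math.log(0.4)}
lm = {"hello today": math.log(0.05), "hollow today": math.log(0.001)}
best_w = recognize(list(acoustic), acoustic, lm)
print(best_w)
```

In practice the candidate set is not enumerated explicitly; HMM- or neural-network-based decoders search it implicitly, but the objective being maximized is the same combined score.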
The punctuation adding device 102 may be connected to the speech recognition device 101, and may receive the speech recognition result sent by the speech recognition device 101 and add punctuation to the received speech recognition result. Specifically, the received voice recognition result can be used as a text to be processed, and punctuation is added to the text to be processed by using a language model, so as to obtain a punctuation addition result corresponding to the text to be processed; the corpus corresponding to the language model may include: a first corpus corresponding to the incomplete sentence; and outputting a punctuation addition result corresponding to the text to be processed.
Optionally, in a speech recognition scenario, the punctuation adding apparatus 102 may output the punctuation addition result to the user or to a client corresponding to the user. In a speech translation scenario, the punctuation adding apparatus 102 may output the punctuation addition result to a machine translation apparatus, so that the machine translation apparatus translates it into text in the target language. The machine translation apparatus may use machine translation technology, that is, the process of using a computer to convert text in one natural language (the source language) into text in another natural language (the target language); for example, the source and target languages may be Chinese and English, or English and Chinese, respectively. Optionally, the machine translation apparatus may be of a statistical type and/or a neural network type; it is understood that embodiments of the present invention are not limited to particular types of machine translation apparatus.
It can be understood that, according to an actual application scenario, a person skilled in the art may determine an output manner corresponding to the punctuation addition result corresponding to the text to be processed, and the embodiment of the present invention does not limit a specific output manner corresponding to the punctuation addition result corresponding to the text to be processed.
Method embodiment
Referring to fig. 2, a flowchart illustrating steps of a method for training a language model according to an embodiment of the present invention is shown, which may specifically include the following steps:
step 201, obtaining a training corpus; the corpus may include: a first corpus corresponding to the incomplete sentence;
step 202, performing feature extraction on the training corpus, where the training features corresponding to the language model may include: the position in the complete sentence of a language unit in the first corpus, and the punctuation following the language unit;
and step 203, training the language model on the training corpus according to the training features.
According to the embodiment of the invention, the language model is trained according to the first corpus corresponding to the incomplete sentences, so that the trained language model has punctuation adding capability of the incomplete sentences, and thus the accuracy of punctuation adding can be improved.
In an optional embodiment of the present invention, the first corpus corresponding to the incomplete sentence may be obtained by interception from the second corpus corresponding to the complete sentence. The specific source of the second corpus corresponding to the complete sentence is not limited in the embodiment of the present invention; for example, sources of the second corpus may include: an existing corpus, an Internet corpus (such as a web-page corpus or a microblog corpus), or a corpus of user input behavior provided by an input method.
Referring to fig. 3, a flowchart illustrating a step of a method for acquiring a first corpus corresponding to an incomplete statement according to an embodiment of the present invention is shown, which may specifically include the following steps:
step 301, performing word segmentation on a second corpus corresponding to a complete sentence to obtain the words included in the second corpus;
step 302, determining a truncation position corresponding to the second corpus according to the vocabulary included in the second corpus;
and step 303, intercepting, in front-to-back order, the character string corresponding to the truncation position from the second corpus as the first corpus corresponding to the incomplete sentence.
A sentence is typically a continuous character string made up of words. In order to understand its semantics, the sentence first needs to be divided into word strings taking words as basic units, i.e., word segmentation. A word is the smallest meaningful language component capable of independent use. English words are delimited by spaces, which serve as natural separators; Chinese, however, is written with the character as the basic unit and has no explicit delimiter between words, so Chinese word segmentation is the basis and key of Chinese information processing. Common word segmentation methods may include: methods based on string matching, methods based on rules, and the like; it can be understood that the embodiment of the present invention does not limit the specific word segmentation method.
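As a concrete illustration of the string-matching-based approach mentioned above, a forward maximum matching segmenter can be sketched as follows; the tiny dictionary is hypothetical and stands in for the large lexicons used in practice:

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment `text` greedily: at each position, take the longest
    dictionary word that matches; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical mini-dictionary; real systems use large lexicons.
dictionary = {"今天", "天气", "怎么样", "你好"}
print(forward_max_match("你好今天天气怎么样", dictionary))
# → ['你好', '今天', '天气', '怎么样']
```

Greedy forward matching is only one of the string-matching family; backward matching and bidirectional matching follow the same pattern with a different scan direction.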
The step 302 of determining the truncation position corresponding to the second corpus may include: determining the truncation position corresponding to the second corpus in units of words, so as to prevent the truncation position from falling in the middle of a word and breaking that word, and thus prevent the first corpus from including incomplete words.
In the embodiment of the invention, the truncation position may separate two adjacent words in the second corpus, so that the first corpus includes complete words and incomplete words are avoided. According to one embodiment, the truncation position may be the end position of the preceding word; in this case, the process of intercepting the character string corresponding to the truncation position from the second corpus may include: taking the truncation position and the character string before it as the character string corresponding to the truncation position. According to another embodiment, the truncation position may be a position between two adjacent words in the second corpus; in this case, the process of intercepting the character string corresponding to the truncation position from the second corpus may include: taking the character string before the truncation position as the character string corresponding to the truncation position.
In an alternative embodiment of the present invention, the truncation position may be located between two adjacent words, so that the first corpus includes complete words and does not include incomplete words. In an application example of the present invention, assume that the second corpus corresponding to the complete sentence is "Hello, how is the weather today?"; its word segmentation result may be: "Hello/,/today/weather/how/?". The "/" here is a symbol provided for convenience in describing the application and indicates boundaries between words and/or between words and punctuation marks; in practical applications, "/" may not have any meaning. Thus, the truncation position of the embodiment of the present invention may be located between two adjacent words, such as between "Hello" and "today", between "today" and "weather", or between "weather" and "how"; accordingly, the intercepted first corpus may be obtained: "Hello", "Hello, today weather", and the like.
For one second corpus, there may be one or more truncation positions. According to the embodiment of the invention, for each truncation position, the character string corresponding to that truncation position may be intercepted from the second corpus as a first corpus corresponding to an incomplete sentence, so that a plurality of first corpora corresponding to the plurality of truncation positions can be obtained.
In another optional embodiment of the present invention, the truncation position may be required not to be adjacent to a punctuation mark included in the second corpus. If the truncation position is adjacent to a punctuation mark included in the second corpus, a punctuation mark already exists at the truncation position, which indicates that the character string corresponding to the truncation position can already express a relatively complete meaning; to avoid repetition between the intercepted first corpus and a second corpus capable of expressing a complete meaning, such a truncation position may be discarded. Taking the word segmentation result "Hello/,/today/weather/how/?" as an example, the truncation position between "Hello" and "today" is adjacent to the comma included in the second corpus; the first corpus obtained from this truncation position would be "Hello", which can express a relatively complete meaning, so in this case the truncation position may be discarded. Therefore, in an embodiment of the present invention, before performing step 303, it may be further determined whether the truncation position obtained in step 302 is adjacent to a punctuation mark included in the second corpus; if so, the truncation position is discarded, and otherwise it is retained.
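The flow of steps 301 to 303, together with the discarding of truncation positions adjacent to punctuation, can be sketched as follows; the token list and punctuation inventory are illustrative, not prescribed by the embodiment:

```python
PUNCTUATION = {",", ".", "?", "!", "，", "。", "？", "！"}

def make_incomplete_corpora(tokens, punctuation=frozenset(PUNCTUATION)):
    """Given the word-segmentation result of a complete sentence (the
    second corpus), emit one first corpus per truncation position between
    adjacent tokens, in front-to-back order. Positions adjacent to an
    existing punctuation mark are discarded, since the prefix there
    already expresses a relatively complete meaning."""
    corpora = []
    for cut in range(1, len(tokens)):
        # Discard truncation positions adjacent to punctuation.
        if tokens[cut - 1] in punctuation or tokens[cut] in punctuation:
            continue
        corpora.append(tokens[:cut])
    return corpora

# Segmentation of "Hello, how is the weather today?": Hello/,/today/weather/how/?
print(make_incomplete_corpora(["Hello", ",", "today", "weather", "how", "?"]))
# keeps the prefixes ending after "today" and after "weather"
```

Note that the cuts adjacent to the comma and the question mark are skipped, exactly as described for the "Hello" example above.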
The training features corresponding to the language model in the embodiment of the invention may include: the position in the complete sentence of a language unit in the first corpus, and the punctuation following the language unit. A language unit may be an independent unit in the corpus, such as a single character or a word. The first corpus and the training features enable the trained language model to refrain from adding punctuation at the end of the text for incomplete sentences, so the embodiment of the invention can obtain more accurate punctuation addition results for incomplete sentences, that is, can improve the accuracy of punctuation addition.
In an embodiment of the present invention, the language model is used to describe a probability of occurrence of a sequence of character units in a language, where the character units include: words and/or punctuation.
Language models may include, but are not limited to: an N-gram (N-gram) language model, and/or a neural network language model, wherein the neural network language model may further include: RNNLM (Recurrent Neural Network Language Model), CNNLM (Convolutional Neural Network Language Model), DNNLM (Deep Neural Network Language Model), and the like.
The N-gram language model is based on the assumption that the occurrence of the Nth word is related only to the preceding N-1 words and not to any other words, so the probability of a complete sentence is the product of the occurrence probabilities of its words.
Since the N-gram language model predicts the Nth word from a limited history of N-1 preceding words, it can describe the probability score of a character unit sequence of length N, where N may be a positive integer with a fixed value less than a first length threshold, such as 3 or 5. One advantage of neural network language models such as RNNLM over N-gram language models is that the entire preceding context can be used to predict the next word, so RNNLM can describe the probability score of a character unit sequence of variable length; that is, RNNLM is suitable for character unit sequences with a wide length range. For example, the length range of the character unit sequence corresponding to RNNLM may be from 1 to a second length threshold, where the second length threshold is greater than the first length threshold.
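Under the N-gram assumption stated above, the probability of a sentence factorizes into a product of conditional probabilities, each conditioned on at most N-1 preceding units. A minimal trigram sketch follows; the add-alpha smoothing and the vocabulary size are illustrative assumptions, not part of the embodiment:

```python
import math
from collections import defaultdict

def train_ngram(sentences, n=3):
    """Count n-grams and their (n-1)-gram histories over tokenized sentences."""
    counts, history = defaultdict(int), defaultdict(int)
    for s in sentences:
        padded = ["<s>"] * (n - 1) + s + ["</s>"]
        for i in range(n - 1, len(padded)):
            counts[tuple(padded[i - n + 1:i + 1])] += 1
            history[tuple(padded[i - n + 1:i])] += 1
    return counts, history

def log_prob(sentence, counts, history, n=3, alpha=1.0, vocab=1000):
    """Log probability of a sentence: sum of smoothed conditional log probs,
    i.e. the log of the product over all positions."""
    padded = ["<s>"] * (n - 1) + sentence + ["</s>"]
    lp = 0.0
    for i in range(n - 1, len(padded)):
        c = counts[tuple(padded[i - n + 1:i + 1])]
        h = history[tuple(padded[i - n + 1:i])]
        lp += math.log((c + alpha) / (h + alpha * vocab))
    return lp

counts, hist = train_ngram([["today", "weather", "good"]])
# A sequence seen in training scores higher than an unseen reordering.
print(log_prob(["today", "weather", "good"], counts, hist) >
      log_prob(["good", "weather", "today"], counts, hist))
# → True
```

Summing log probabilities is numerically safer than multiplying raw probabilities, which underflow quickly for long sequences.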
In an optional embodiment of the present invention, the corpus corresponding to the language model may further include: and the second corpus corresponds to the complete sentence. The second corpus and the training characteristics can enable the trained language model of the invention to have the capability of adding punctuation at the end of the text aiming at the complete sentence, so that the embodiment of the invention can obtain a more accurate punctuation adding result aiming at the complete sentence.
In summary, the training method of the language model according to the embodiment of the present invention performs the training of the language model according to the first corpus corresponding to the incomplete sentence, so that the trained language model of the present invention has the punctuation adding capability of the incomplete sentence, thereby improving the accuracy of punctuation addition.
Moreover, the training features corresponding to the language model in the embodiment of the present invention may include: the position in the complete sentence of a language unit in the first corpus, and the punctuation following the language unit. A language unit may be an independent unit in the corpus, such as a single character or a word. The first corpus and the training features enable the trained language model to refrain from adding punctuation at the end of the text for incomplete sentences, so the embodiment of the present invention can obtain more accurate punctuation addition results for incomplete sentences, that is, can improve the accuracy of punctuation addition.
Referring to fig. 4, a flowchart illustrating steps of an embodiment of a data processing method according to the present invention is shown, which may specifically include the following steps:
step 401, acquiring a text to be processed;
step 402, adding punctuation to the text to be processed by using a language model, to obtain a punctuation addition result corresponding to the text to be processed; the corpus corresponding to the language model may include: a first corpus corresponding to the incomplete sentence; the training features corresponding to the language model may include: the position in the complete sentence of a language unit in the first corpus, and the punctuation following the language unit;
and 403, outputting a punctuation addition result corresponding to the text to be processed.
In the embodiment of the invention, the text to be processed represents text to which punctuation needs to be added; the text to be processed may originate from text or speech input by a user through a device, or from other devices. It should be noted that the text to be processed may include one language or more than one language; for example, the text to be processed may include Chinese, or may include a mixture of Chinese and other languages such as English. The embodiment of the present invention does not limit the specific text to be processed.
In practical applications, the embodiment of the present invention may execute the data processing method of the embodiment of the present invention through a client APP (Application program). The client APP may run on the terminal, for example, the client APP may be any APP running on the terminal, and then the client APP may obtain the text to be processed from other applications of the terminal. Or, in the embodiment of the present invention, the functional device of the client application may execute the data processing method flow in the embodiment of the present invention, and then the functional device may obtain the text to be processed from another functional device. Alternatively, the data processing method according to the embodiment of the present invention may be executed by a server according to the embodiment of the present invention.
In practical application, step 401 may obtain a text to be processed from a text corresponding to the voice signal or a text input by the user according to practical application requirements. For example, step 401 may obtain a text to be processed according to a voice signal of a speaking user, in this case, step 401 may convert the voice signal of the speaking user into text information, and obtain the text to be processed from the text information; alternatively, step 401 may directly receive text information corresponding to the voice signal of the user from the voice recognition apparatus, and obtain the text to be processed from the text information.
According to an embodiment, the process of obtaining the text to be processed from the text corresponding to the voice signal may include: obtaining the text to be processed from the text corresponding to a voice signal S according to the interval time of the voice signal S. For example, when the interval time of the voice signal S is greater than a time threshold, a first demarcation point corresponding to the voice signal S may be determined; the text corresponding to the voice signal S before the first demarcation point is used as the text to be processed, and the text corresponding to the voice signal S after the first demarcation point continues to be processed to obtain further text to be processed.
According to another embodiment, the process of obtaining the text to be processed from the text corresponding to the voice signal or the text input by the user may include: obtaining the text to be processed according to the number of words contained in the text corresponding to the voice signal or the text input by the user. For example, when the number of words is greater than a word-count threshold, a second demarcation point corresponding to the voice signal may be determined according to the word-count threshold; the text corresponding to the voice signal S before the second demarcation point is used as the text to be processed, and the text corresponding to the voice signal S after the second demarcation point continues to be processed to obtain further text to be processed. It can be understood that the embodiment of the present invention does not impose any limitation on the specific process of obtaining the text to be processed from the text corresponding to the voice signal or the text input by the user.
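The word-count criterion described above can be sketched as follows; the threshold value and the return convention are illustrative assumptions, not prescribed by the embodiment:

```python
def split_pending_text(tokens, word_threshold=8):
    """Emit a to-be-processed text whenever the buffered token count
    reaches the threshold (the second demarcation point); tokens after
    the last demarcation point stay buffered for further input."""
    chunks, buffer = [], []
    for token in tokens:
        buffer.append(token)
        if len(buffer) >= word_threshold:
            chunks.append(buffer)
            buffer = []
    return chunks, buffer  # buffer: text still awaiting more input

tokens = "today weather good we go out climb mountain what you think".split()
chunks, rest = split_pending_text(tokens, word_threshold=4)
print(chunks)  # completed to-be-processed texts
print(rest)    # tokens carried over past the last demarcation point
```

An interval-time criterion would have the same shape, with the flush condition testing the pause between signal segments instead of the buffer length.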
In an optional embodiment of the present invention, the step 402 of adding punctuation to the text to be processed by using the language model may include: performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed; adding punctuation marks between adjacent words in the global word sequence to obtain a plurality of alternative punctuation addition results corresponding to the global word sequence; determining a probability score corresponding to each of the multiple candidate punctuation addition results according to a language model; and acquiring one alternative punctuation addition result with the highest probability score from the multiple alternative punctuation addition results to serve as the punctuation addition result corresponding to the text to be processed.
The process of adding punctuation marks between adjacent words in the global word sequence may include: determining, according to actual application requirements, the candidate punctuation marks to be added between adjacent words in the global word sequence. Optionally, the candidate punctuation marks may include: a comma, a question mark, a period, an exclamation mark, a space, and the like, where the space may either play a word-separating role or play no role at all; for example, for English, the space may be used to separate different words, while for Chinese, the space may be a punctuation mark that plays no role.
In practical application, a path planning algorithm may be adopted to obtain the multiple alternative punctuation addition results corresponding to the global word sequence. The principle of a path planning algorithm is to find, according to a certain evaluation criterion, a feasible path from an initial state to a target state. Specifically, in the embodiment of the present invention, the states passed through by the path may represent the candidate punctuation marks added between adjacent words of the global word sequence corresponding to the text to be processed, and the initial state and the target state respectively represent the punctuation marks after the first word and after the last word of the global word sequence corresponding to the text to be processed.
Referring to fig. 5, a schematic diagram of a punctuation adding process for a global word sequence according to an embodiment of the present invention is shown, where the global word sequence is "Hello/I am/Xiaoming/happy to/meet you", and candidate punctuation marks may be added between adjacent words of "Hello/I am/Xiaoming/happy to/meet you". In fig. 5, words such as "Hello", "I am", "Xiaoming", "happy to", and "meet you" are each represented by a rectangle, and punctuation marks such as the comma, space, exclamation mark, question mark, and period are each represented by a circle, so there may be multiple paths between the punctuation after the first word "Hello" and the punctuation after the last word "meet you" of the global word sequence corresponding to the speech recognition result. It is understood that the global word sequence shown in fig. 5 is only an alternative embodiment; in fact, the data processing apparatus may periodically receive the speech recognition result sent by the speech recognition apparatus 101 and obtain the punctuation-added text corresponding to the speech recognition result according to a preset time period.
It can be understood that the path planning algorithm is only an optional embodiment of the present invention, and actually, a person skilled in the art may obtain multiple alternative punctuation addition results corresponding to the text to be processed by using other algorithms according to actual application requirements, and it can be understood that the embodiment of the present invention does not limit a specific obtaining algorithm of the multiple alternative punctuation addition results.
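For short global word sequences, the set of alternative punctuation addition results can also be enumerated directly, placing one candidate symbol in each slot between adjacent words; the symbol inventory here is illustrative:

```python
from itertools import product

def candidate_additions(words, symbols=(" ", ",", "?")):
    """Enumerate every way of placing one candidate symbol in each
    slot between adjacent words of the global word sequence."""
    slots = len(words) - 1
    results = []
    for combo in product(symbols, repeat=slots):
        pieces = [words[0]]
        for sym, word in zip(combo, words[1:]):
            pieces.append(sym)
            pieces.append(word)
        results.append(pieces)
    return results

cands = candidate_additions(["today", "weather", "good"])
print(len(cands))  # 3 symbols ** 2 slots = 9 candidates
```

Because the number of candidates grows exponentially with the number of slots, practical systems replace this exhaustive enumeration with a path search over the same candidate space, as described above.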
In practical application, the language model may directly output, for a candidate punctuation addition result, the corresponding probability score; alternatively, the language model may output sub-scores for the partial character unit sequences contained in the candidate punctuation addition result, and these sub-scores may then be fused to obtain the corresponding probability score.
In an optional embodiment of the present invention, determining, according to the language model, the probability score corresponding to each of the multiple candidate punctuation addition results may include: for each candidate punctuation addition result, determining a probability score corresponding to each semantic segment contained in that result; and fusing the probability scores corresponding to all semantic segments contained in each candidate punctuation addition result to obtain the corresponding overall probability score. The candidate punctuation addition result with the highest probability score can then be obtained from all the candidate punctuation addition results as the optimal punctuation addition result corresponding to the text to be processed.
Optionally, corresponding semantic segments may be obtained from the candidate punctuation addition result by moving in front-to-back order; the number of character units contained in different semantic segments may be the same, and adjacent semantic segments may share repeated character units, where a character unit may include: a word and/or a punctuation mark. In this case, the probability score corresponding to each semantic segment can be determined by an N-gram language model and/or a neural network language model. Assuming that N is 5 and the first character unit is numbered 1, semantic segments of length 5 may be obtained from the candidate punctuation addition result in the numbering order 1-5, 2-6, 3-7, 4-8, and so on, and the probability score corresponding to each semantic segment may be determined by the N-gram language model; for example, if each semantic segment is input into the N-gram model, the N-gram model can output the corresponding probability score.
Optionally, the process of fusing the probability scores corresponding to all semantic segments included in each candidate punctuation addition result may include: the probability scores corresponding to all semantic segments included in each alternative punctuation addition result are summed, or multiplied, or weighted average processed, and the like.
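The windowing and fusion just described can be sketched as follows; `score_fn` stands in for any language model that returns a log probability score for a segment of up to N character units, so that summing the log scores corresponds to the multiplicative fusion mentioned above (the toy scoring function is purely illustrative):

```python
def fused_score(char_units, score_fn, n=5):
    """Slide a window of n character units (numbered 1-5, 2-6, 3-7, ...)
    over a candidate punctuation addition result and sum the per-segment
    log scores returned by the language model."""
    if len(char_units) <= n:
        return score_fn(char_units)
    segments = [char_units[i:i + n] for i in range(len(char_units) - n + 1)]
    return sum(score_fn(seg) for seg in segments)

# Toy score_fn: penalizes each comma appearing in a segment.
toy = lambda seg: -float(seg.count(","))
units = ["today", " ", "weather", " ", "good", ",", "we"]
print(fused_score(units, toy, n=5))  # → -2.0
```

Replacing the sum with a product or a weighted average yields the other fusion variants mentioned above.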
In another optional embodiment of the present invention, the determining, according to the language model, a probability score corresponding to each of the multiple candidate punctuation addition results may include: determining probability scores corresponding to all semantic fragments of each alternative punctuation addition result by using a neural network language model; the candidate punctuation addition result with the highest probability score can be obtained from all the candidate punctuation addition results and used as the optimal candidate punctuation addition result corresponding to the text to be processed. Because RNNLM is suitable for semantic fragments with a wide length range, all semantic fragments of each candidate punctuation addition result can be taken as a whole, and probability scores corresponding to all semantic fragments of the candidate punctuation addition result are determined by RNNLM, for example, if all character units included in the candidate punctuation addition result are input into RNNLM, RNNLM can output corresponding probability scores.
In an application example of the present invention, assume that the preset time period is 1s, that punctuation addition processing is performed on the speech recognition result through the language model, and that N is less than or equal to 5. The text corresponding to the speech recognition result acquired in each preset time period may then include:

Second 1: today weather

Second 2: today weather is good we

Second 3: today weather is good we go out to climb the mountain

Second 4: today weather is good we go out to climb the mountain what do you think

First, "today weather" is received at second 1, and punctuation addition processing is performed on the global word sequence "today/weather". Assuming that the probability score output by the language model for "today/space/weather/space" is higher than that for "today/space/weather" followed by a comma, exclamation mark, question mark, period, or other punctuation mark, the punctuation addition result "today weather" can be obtained.

Then, "today weather is good we" is received at second 2, and punctuation addition processing may be performed on the global word sequence "today/weather/is good/we". Assuming that the probability score output by the language model for "today/space/weather/space/is good/,/we" is higher than that for other punctuation addition results such as "today/space/weather/space/is good/period/we", the punctuation addition result "today/space/weather/space/is good/,/we" can be obtained.

Then, "today weather is good we go out to climb the mountain" is received at second 3, and punctuation addition processing may be performed on the global word sequence "today/weather/is good/we/go out/climb the mountain". Assuming that the probability score output by the language model for "today/space/weather/space/is good/,/we/space/go out/space/climb the mountain" is higher than that for other punctuation addition results such as "today/space/weather/space/is good/,/we/space/go out/space/climb the mountain/period", the punctuation addition result "today/space/weather/space/is good/,/we/space/go out/space/climb the mountain" can be obtained.

Then, "today weather is good we go out to climb the mountain what do you think" is received at second 4, and punctuation addition processing may be performed on the global word sequence "today/weather/is good/we/go out/climb the mountain/what do you think". Assuming that the probability score output by the language model for "today/space/weather/space/is good/,/we/space/go out/space/climb the mountain/space/what do you think/question mark" is higher than that for other punctuation addition results, the punctuation addition result "today/space/weather/space/is good/,/we/space/go out/space/climb the mountain/space/what do you think/question mark" can be obtained.
In a speech recognition scenario, step 403 may output the punctuation addition result to the user or to a client corresponding to the user; alternatively, in a speech translation scenario, step 403 may output the punctuation addition result to a machine translation device, so that the machine translation device translates the punctuation addition result into text in a target language. It can be understood that the embodiment of the present invention does not limit the specific process by which step 403 outputs the punctuation addition result corresponding to the text to be processed.
In summary, the data processing method according to the embodiment of the present invention performs the training of the language model according to the first corpus corresponding to the incomplete sentence, so that the trained language model of the present invention has the punctuation adding capability of the incomplete sentence, thereby improving the accuracy of punctuation addition.
Moreover, the training features corresponding to the language model in the embodiment of the present invention may include: the position in the complete sentence of a language unit in the first corpus, and the punctuation following the language unit. A language unit may be an independent unit in the corpus, such as a single character or a word. The first corpus and the training features enable the trained language model to refrain from adding punctuation at the end of the text for incomplete sentences, so the embodiment of the present invention can obtain more accurate punctuation addition results for incomplete sentences, that is, can improve the accuracy of punctuation addition.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by every embodiment of the invention.
Device embodiment
Referring to fig. 6, a block diagram of a data processing apparatus according to an embodiment of the present invention is shown, which may specifically include:
a text obtaining module 601, configured to obtain a text to be processed;
a punctuation adding module 602, configured to add punctuation to the to-be-processed text by using a language model to obtain a punctuation addition result corresponding to the to-be-processed text; the corpus corresponding to the language model may include: a first corpus corresponding to the incomplete sentence; the training features corresponding to the language model may include: the position in the complete sentence of a language unit in the first corpus, and the punctuation following the language unit; and
and a result output module 603, configured to output a punctuation addition result corresponding to the text to be processed.
Optionally, the first corpus corresponding to the incomplete statement is obtained by intercepting the second corpus corresponding to the complete statement.
Optionally, the apparatus may further include:
a word segmentation module, configured to perform word segmentation on the second corpus corresponding to the complete sentence to obtain the words included in the second corpus;

a truncation position determining module, configured to determine a truncation position corresponding to the second corpus according to the words included in the second corpus;

and an interception module, configured to intercept, in front-to-back order, the character strings corresponding to the truncation positions from the second corpus as the first corpus corresponding to the incomplete sentences.
Optionally, the truncation position is located between two adjacent words.
Optionally, the truncation position is not adjacent to a punctuation mark included in the second corpus.
Optionally, the corpus corresponding to the language model may further include: and the second corpus corresponds to the complete sentence.
Optionally, the punctuation adding module 602 may include:
a word segmentation submodule, configured to perform word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
a punctuation adding submodule, configured to add punctuation marks between adjacent words in the global word sequence to obtain multiple candidate punctuation addition results corresponding to the global word sequence;
a probability score determining submodule, configured to determine a probability score corresponding to each of the candidate punctuation addition results according to the language model; and
a selection submodule, configured to select, from the multiple candidate punctuation addition results, the candidate punctuation addition result with the highest probability score as the punctuation addition result corresponding to the text to be processed.
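The submodule pipeline above — segment the text, enumerate candidate punctuation placements between adjacent words, score each candidate with the language model, and keep the highest-scoring one — can be sketched as follows. The bigram score table is a stand-in assumption for a trained language model; it is not the model the patent trains.

```python
from itertools import product

# Toy stand-in for a trained language model: bigram scores over words and
# punctuation marks (a real model would be learned from the corpora).
BIGRAM_SCORES = {
    ("你好", "，"): 0.9, ("，", "请"): 0.8,
    ("请", "坐"): 0.7, ("坐", "。"): 0.9,
}

def score(sequence):
    # sum bigram scores; unseen bigrams receive a small penalty
    return sum(BIGRAM_SCORES.get(pair, -0.1)
               for pair in zip(sequence, sequence[1:]))

def add_punctuation(words, marks=("", "，", "。")):
    """Enumerate every candidate punctuation addition result for the
    global word sequence and return the one with the highest score."""
    best_seq, best_score = None, float("-inf")
    # one punctuation slot after each word; "" means no mark in that slot
    for choice in product(marks, repeat=len(words)):
        seq = []
        for word, mark in zip(words, choice):
            seq.append(word)
            if mark:
                seq.append(mark)
        s = score(seq)
        if s > best_score:
            best_seq, best_score = seq, s
    return "".join(best_seq)
```

Exhaustive enumeration is exponential in sentence length; a production system would prune the candidate set, e.g. with beam search over the punctuation slots, rather than score every combination.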
Referring to fig. 7, a block diagram of a data processing apparatus according to another embodiment of the present invention is shown, which may specifically include:
a corpus obtaining module 701 configured to obtain a training corpus; the corpus comprises: a first corpus corresponding to the incomplete sentence;
a feature extraction module 702, configured to perform feature extraction on the training corpus, where the training features corresponding to the language model include: the position of a language unit in the first corpus in the complete sentence and the punctuation condition behind the language unit; and
a model training module 703, configured to train a language model on the training corpus according to the training features.
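The two training features named above — a language unit's position in the complete sentence and the punctuation condition behind it — could be extracted along the following lines. The dictionary layout and punctuation set are illustrative assumptions; the patent does not prescribe a feature encoding.

```python
PUNCTUATION = {"，", "。", "！", "？", "；"}

def extract_features(tokens):
    """For each language unit (word) in a segmented sentence, record its
    position among the words of the complete sentence and the punctuation
    mark, if any, immediately behind it."""
    features, pos = [], 0
    for i, tok in enumerate(tokens):
        if tok in PUNCTUATION:
            continue  # punctuation itself is not a language unit
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        features.append({
            "word": tok,
            "position": pos,
            "punct_after": nxt if nxt in PUNCTUATION else None,
        })
        pos += 1
    return features
```

Running this on a truncated first-corpus sample yields the same positional features as the complete sentence's prefix, which is how the model learns which positions can legitimately end without punctuation.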
Optionally, the first corpus corresponding to the incomplete sentence may be obtained by intercepting the second corpus corresponding to the complete sentence.
Optionally, the corpus acquiring module 701 may specifically include:
a word segmentation submodule, configured to perform word segmentation on the second corpus corresponding to the complete sentence to obtain the vocabulary included in the second corpus;
a truncation position determining submodule, configured to determine a truncation position corresponding to the second corpus according to the vocabulary included in the second corpus; and
an intercepting submodule, configured to intercept, in front-to-back order, the character string corresponding to the truncation position from the second corpus as the first corpus corresponding to the incomplete sentence.
Optionally, the truncation position may be located between two adjacent words.
Optionally, the truncation position may not be adjacent to a punctuation mark included in the second corpus.
Optionally, the corpus corresponding to the language model may further include: and the second corpus corresponds to the complete sentence.
Since the device embodiments are substantially similar to the method embodiments, they are described briefly here; for relevant details, reference may be made to the corresponding description of the method embodiments.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Embodiments of the present invention also provide a data processing apparatus, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include instructions for: acquiring a text to be processed; adding punctuation points for the text to be processed by utilizing a language model to obtain punctuation point adding results corresponding to the text to be processed; wherein, the training corpus corresponding to the language model comprises: a first corpus corresponding to the incomplete sentence; the training features corresponding to the language model comprise: the position of a language unit in the complete sentence in the first corpus and the punctuation condition behind the language unit; and outputting a punctuation addition result corresponding to the text to be processed.
Embodiments of the present invention also provide a data processing apparatus, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include instructions for: acquiring a training corpus; the corpus comprises: a first corpus corresponding to the incomplete sentence; and extracting features aiming at the training corpus, wherein the training features corresponding to the language model comprise: the position of a language unit in the complete sentence in the first corpus and the punctuation condition behind the language unit; and training a language model for the training data according to the training characteristics.
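The training step in the instructions above — learning the probability of token sequences from a corpus that mixes complete sentences (second corpus) with their truncated prefixes (first corpus) — can be sketched minimally as a maximum-likelihood bigram estimate. The bigram model family is an assumption for illustration; the patent does not commit to a specific language-model architecture.

```python
from collections import Counter

def train_bigram_lm(corpora):
    """Estimate P(next | current) by maximum likelihood over token
    sequences drawn from both the first and second corpora."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpora:
        unigrams.update(tokens[:-1])  # denominator: bigram left-side counts
        bigrams.update(zip(tokens, tokens[1:]))
    return {pair: count / unigrams[pair[0]]
            for pair, count in bigrams.items()}
```

Because the truncated prefixes end mid-clause, the model also observes contexts in which a word is legitimately followed by nothing, which is what lets the resulting model avoid over-inserting sentence-final punctuation on incomplete input.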
Fig. 8 is a block diagram illustrating an apparatus for data processing as a terminal according to an example embodiment. For example, terminal 900 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, and the like.
Referring to fig. 8, terminal 900 can include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
Processing component 902 generally controls overall operation of terminal 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
Memory 904 is configured to store various types of data to support operation at terminal 900. Examples of such data include instructions for any application or method operating on terminal 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power components 906 provide power to the various components of the terminal 900. The power components 906 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal 900.
The multimedia component 908 includes a screen providing an output interface between the terminal 900 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the terminal 900 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when terminal 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing various aspects of state assessment for the terminal 900. For example, sensor assembly 914 can detect an open/closed state of terminal 900, a relative positioning of components, such as a display and keypad of terminal 900, a change in position of terminal 900 or a component of terminal 900, the presence or absence of user contact with terminal 900, an orientation or acceleration/deceleration of terminal 900, and a change in temperature of terminal 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
Communication component 916 is configured to facilitate communications between terminal 900 and other devices in a wired or wireless manner. Terminal 900 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as memory 904 comprising instructions, executable by processor 920 of terminal 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 9 is a block diagram illustrating an apparatus for data processing as a server according to an example embodiment. The server 1900 may differ considerably depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930, to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided that includes instructions, such as memory 1932 that includes instructions executable by a processor of server 1900 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein which, when executed by a processor of an apparatus (terminal or server), enable the apparatus to perform the method of any of fig. 2-5.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (terminal or server), enable the apparatus to perform a data processing method, the method comprising: acquiring a text to be processed; adding punctuation points for the text to be processed by utilizing a language model to obtain punctuation point adding results corresponding to the text to be processed; wherein, the training corpus corresponding to the language model comprises: a first corpus corresponding to the incomplete sentence; the training features corresponding to the language model comprise: the position of a language unit in the complete sentence in the first corpus and the punctuation condition behind the language unit; and outputting a punctuation addition result corresponding to the text to be processed.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (terminal or server), enable the apparatus to perform a data processing method, the method comprising: acquiring a training corpus; the corpus comprises: a first corpus corresponding to the incomplete sentence; and extracting features aiming at the training corpus, wherein the training features corresponding to the language model comprise: the position of a language unit in the complete sentence in the first corpus and the punctuation condition behind the language unit; and training a language model for the training data according to the training characteristics.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The data processing method, the data processing apparatus, and the apparatus for data processing provided by the present invention are described in detail above. Specific examples are applied herein to illustrate the principles and implementations of the present invention, and the above description of the embodiments is intended only to help in understanding the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific implementation and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (29)

1. A data processing method, comprising:
acquiring a training corpus; the corpus comprises: a first corpus corresponding to the incomplete sentence; the first corpus comprises: at least one vocabulary with the first vocabulary in the second corpus corresponding to the complete sentence as the beginning;
and extracting features aiming at the training corpus, wherein the training features corresponding to the language model comprise: the position of a language unit in the complete sentence in the first corpus and the punctuation condition behind the language unit;
training a language model for the training data according to the training characteristics; the language model is used for learning the probability of the character unit sequence corresponding to the first corpus appearing in the language.
2. The method according to claim 1, wherein the first corpus corresponding to the incomplete sentence is extracted from the second corpus corresponding to the complete sentence.
3. The method of claim 1, wherein the obtaining the corpus comprises:
performing word segmentation on a second corpus corresponding to the complete sentence to obtain a vocabulary included in the second corpus;
determining a truncation position corresponding to the second corpus according to the vocabulary included in the second corpus;
and intercepting the character strings corresponding to the truncation positions from the second corpus according to the sequence from front to back to serve as the first corpus corresponding to the incomplete statement.
4. The method of claim 3, wherein the truncation position is between two adjacent words.
5. The method according to claim 3, wherein the truncation position is not adjacent to a punctuation included in the second corpus.
6. The method according to any one of claims 1 to 5, wherein the corpus corresponding to the language model further comprises: and the second corpus corresponds to the complete sentence.
7. A data processing method, comprising:
acquiring a text to be processed;
adding punctuation points for the text to be processed by utilizing a language model to obtain punctuation point adding results corresponding to the text to be processed; wherein, the training corpus corresponding to the language model comprises: a first corpus corresponding to the incomplete sentence; the training features corresponding to the language model comprise: the position of a language unit in the complete sentence in the first corpus and the punctuation condition behind the language unit; the first corpus comprises: at least one vocabulary with the first vocabulary in the second corpus corresponding to the complete sentence as the beginning; the language model is used for learning the probability of the character unit sequence corresponding to the first corpus appearing in the language;
and outputting a punctuation addition result corresponding to the text to be processed.
8. The method according to claim 7, wherein the first corpus corresponding to the incomplete sentence is extracted from the second corpus corresponding to the complete sentence.
9. The method according to claim 7 or 8, wherein the adding punctuation to the text to be processed by using the language model comprises:
performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
adding punctuation marks between adjacent words in the global word sequence to obtain a plurality of alternative punctuation addition results corresponding to the global word sequence;
determining a probability score corresponding to each of the multiple candidate punctuation addition results according to a language model;
and acquiring one alternative punctuation addition result with the highest probability score from the multiple alternative punctuation addition results to serve as the punctuation addition result corresponding to the text to be processed.
10. A data processing apparatus, comprising:
the corpus acquiring module is used for acquiring training corpuses; the corpus comprises: a first corpus corresponding to the incomplete sentence; the first corpus comprises: at least one vocabulary with the first vocabulary in the second corpus corresponding to the complete sentence as the beginning;
the feature extraction module is used for extracting features of the training corpus, and the training features corresponding to the language model comprise: the position of a language unit in the complete sentence in the first corpus and the punctuation condition behind the language unit; and
the model training module is used for training a language model for the training data according to the training characteristics; the language model is used for learning the probability of the character unit sequence corresponding to the first corpus appearing in the language.
11. The apparatus according to claim 10, wherein the first corpus corresponding to the incomplete sentence is extracted from the second corpus corresponding to the complete sentence.
12. The apparatus of claim 10, wherein the corpus obtaining module comprises:
the word segmentation sub-module is used for segmenting words of a second corpus corresponding to the complete sentence to obtain words included in the second corpus;
a truncation position determining submodule, configured to determine a truncation position corresponding to the second corpus according to a vocabulary included in the second corpus; and
and the intercepting submodule is used for intercepting the character strings corresponding to the intercepting positions from the second corpus as the first corpus corresponding to the incomplete sentence according to the sequence from front to back.
13. The apparatus of claim 12, wherein the truncation position is located between two adjacent words.
14. The apparatus according to claim 12, wherein the truncation position is not adjacent to a punctuation included in the second corpus.
15. The apparatus according to any one of claims 10 to 14, wherein the corpus corresponding to the language model further comprises: and the second corpus corresponds to the complete sentence.
16. A data processing apparatus, comprising:
the text acquisition module is used for acquiring a text to be processed;
the punctuation adding module is used for adding punctuation for the text to be processed by utilizing a language model so as to obtain a punctuation adding result corresponding to the text to be processed; wherein, the training corpus corresponding to the language model comprises: a first corpus corresponding to the incomplete sentence; the training features corresponding to the language model comprise: the position of a language unit in the complete sentence in the first corpus and the punctuation condition behind the language unit; the first corpus comprises: at least one vocabulary with the first vocabulary in the second corpus corresponding to the complete sentence as the beginning; the language model is used for learning the probability of the character unit sequence corresponding to the first corpus appearing in the language; and
and the result output module is used for outputting the punctuation addition result corresponding to the text to be processed.
17. The apparatus according to claim 16, wherein the first corpus corresponding to the incomplete sentence is extracted from the second corpus corresponding to the complete sentence.
18. The apparatus of claim 16 or 17, wherein the punctuation addition module comprises:
the word segmentation sub-module is used for performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
the punctuation adding submodule is used for adding punctuation marks between adjacent vocabularies in the global word sequence so as to obtain a plurality of alternative punctuation adding results corresponding to the global word sequence;
a probability score determining submodule for determining a probability score corresponding to each of the candidate punctuation addition results according to a language model; and
and the selection submodule is used for acquiring one alternative punctuation addition result with the highest probability score from the multiple alternative punctuation addition results to serve as the punctuation addition result corresponding to the text to be processed.
19. An apparatus for data processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and wherein execution of the one or more programs by one or more processors comprises instructions for:
acquiring a training corpus; the corpus comprises: a first corpus corresponding to the incomplete sentence; the first corpus comprises: at least one vocabulary with the first vocabulary in the second corpus corresponding to the complete sentence as the beginning;
and extracting features aiming at the training corpus, wherein the training features corresponding to the language model comprise: the position of a language unit in the complete sentence in the first corpus and the punctuation condition behind the language unit;
training a language model for the training data according to the training characteristics; the language model is used for learning the probability of the character unit sequence corresponding to the first corpus appearing in the language.
20. The apparatus of claim 19, wherein the first corpus corresponding to the incomplete sentence is extracted from the second corpus corresponding to the complete sentence.
21. The apparatus of claim 19, wherein the obtaining the corpus comprises:
performing word segmentation on a second corpus corresponding to the complete sentence to obtain a vocabulary included in the second corpus;
determining a truncation position corresponding to the second corpus according to the vocabulary included in the second corpus;
and intercepting the character strings corresponding to the truncation positions from the second corpus according to the sequence from front to back to serve as the first corpus corresponding to the incomplete statement.
22. The apparatus of claim 21 wherein the truncation position is between two adjacent words.
23. The apparatus according to claim 21, wherein the truncation position is not adjacent to a punctuation included in the second corpus.
24. The apparatus according to any one of claims 19 to 23, wherein the corpus corresponding to the language model further comprises: and the second corpus corresponds to the complete sentence.
25. An apparatus for data processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and wherein execution of the one or more programs by one or more processors comprises instructions for:
acquiring a text to be processed;
adding punctuation points for the text to be processed by utilizing a language model to obtain punctuation point adding results corresponding to the text to be processed; wherein, the training corpus corresponding to the language model comprises: a first corpus corresponding to the incomplete sentence; the training features corresponding to the language model comprise: the position of a language unit in the complete sentence in the first corpus and the punctuation condition behind the language unit; the first corpus comprises: at least one vocabulary with the first vocabulary in the second corpus corresponding to the complete sentence as the beginning; the language model is used for learning the probability of the character unit sequence corresponding to the first corpus appearing in the language;
and outputting a punctuation addition result corresponding to the text to be processed.
26. The apparatus according to claim 25, wherein the first corpus corresponding to the incomplete sentence is extracted from the second corpus corresponding to the complete sentence.
27. The apparatus according to claim 25 or 26, wherein said adding punctuation to the text to be processed by using the language model comprises:
performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
adding punctuation marks between adjacent words in the global word sequence to obtain a plurality of alternative punctuation addition results corresponding to the global word sequence;
determining a probability score corresponding to each of the multiple candidate punctuation addition results according to a language model;
and acquiring one alternative punctuation addition result with the highest probability score from the multiple alternative punctuation addition results to serve as the punctuation addition result corresponding to the text to be processed.
28. A machine-readable medium having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform a data processing method as claimed in one or more of claims 1 to 6.
29. A machine-readable medium having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform a data processing method as claimed in one or more of claims 7 to 9.
CN201711464113.7A 2017-12-28 2017-12-28 Data processing method and device for data processing Active CN109979435B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711464113.7A CN109979435B (en) 2017-12-28 2017-12-28 Data processing method and device for data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711464113.7A CN109979435B (en) 2017-12-28 2017-12-28 Data processing method and device for data processing

Publications (2)

Publication Number Publication Date
CN109979435A CN109979435A (en) 2019-07-05
CN109979435B true CN109979435B (en) 2021-10-22

Family

ID=67075093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711464113.7A Active CN109979435B (en) 2017-12-28 2017-12-28 Data processing method and device for data processing

Country Status (1)

Country Link
CN (1) CN109979435B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610001B (en) * 2019-08-12 2024-01-23 大箴(杭州)科技有限公司 Short text integrity recognition method, device, storage medium and computer equipment
CN111126068A (en) * 2019-12-25 2020-05-08 中电云脑(天津)科技有限公司 Chinese named entity recognition method and device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067514A (en) * 1998-06-23 2000-05-23 International Business Machines Corporation Method for automatically punctuating a speech utterance in a continuous speech recognition system
CN105609107A (en) * 2015-12-23 2016-05-25 北京奇虎科技有限公司 Text processing method and device based on speech recognition
CN106484134A (en) * 2016-09-20 2017-03-08 深圳Tcl数字技术有限公司 Method and device for voice input of punctuation marks based on the Android system
CN106653030A (en) * 2016-12-02 2017-05-10 北京云知声信息技术有限公司 Punctuation mark adding method and device
CN107221330A (en) * 2017-05-26 2017-09-29 北京搜狗科技发展有限公司 Punctuation adding method and device, and device for punctuation adding
CN107247706A (en) * 2017-06-16 2017-10-13 中国电子技术标准化研究院 Text punctuation model establishing method, punctuation method, device and computer equipment
CN107291690A (en) * 2017-05-26 2017-10-24 北京搜狗科技发展有限公司 Punctuation adding method and device, and device for punctuation adding

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7580838B2 (en) * 2002-11-22 2009-08-25 Scansoft, Inc. Automatic insertion of non-verbalized punctuation
EP1687807B1 (en) * 2003-11-21 2016-03-16 Nuance Communications, Inc. Topic specific models for text formatting and speech recognition
CN102231278B (en) * 2011-06-10 2013-08-21 安徽科大讯飞信息科技股份有限公司 Method and system for realizing automatic addition of punctuation marks in speech recognition
CN103971684B (en) * 2013-01-29 2015-12-09 腾讯科技(深圳)有限公司 Punctuation adding method and system, and language model establishing method and device
CN104143331B (en) * 2013-05-24 2015-12-09 腾讯科技(深圳)有限公司 Method and system for adding punctuation
CN106331893B (en) * 2016-08-31 2019-09-03 科大讯飞股份有限公司 Real-time caption presentation method and system

Also Published As

Publication number Publication date
CN109979435A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
CN107632980B (en) Voice translation method and device for voice translation
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN107221330B (en) Punctuation adding method and device and punctuation adding device
US11640503B2 (en) Input method, input device and apparatus for input
KR102628036B1 (en) Text editing apparatus and text editing method based on speech signal
CN107291704B (en) Processing method and device for processing
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN108628813B (en) Processing method and device for processing
CN111368541B (en) Named entity identification method and device
CN108304412B (en) Cross-language search method and device for cross-language search
CN111831806B (en) Semantic integrity determination method, device, electronic equipment and storage medium
CN107274903B (en) Text processing method and device for text processing
CN108073572B (en) Information processing method and device, simultaneous interpretation system
CN108628819B (en) Processing method and device for processing
CN111369978B (en) Data processing method, data processing device and device for data processing
CN111128183A (en) Speech recognition method, apparatus and medium
CN111160047A (en) Data processing method and device and data processing device
CN108538284A (en) Simultaneous interpretation result display method and device, and simultaneous interpretation method and device
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN111640452B (en) Data processing method and device for data processing
CN110069624B (en) Text processing method and device
CN110633017A (en) Input method, input device and device for input
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN109887492B (en) Data processing method and device and electronic equipment
CN107422872B (en) Input method, input device and device for input

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant