CN110032732A - Text punctuation prediction method, apparatus, computer device and storage medium - Google Patents
Text punctuation prediction method, apparatus, computer device and storage medium
- Publication number
- CN110032732A (application CN201910182506.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- punctuate
- target
- words
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a text punctuation prediction method, apparatus, computer device and storage medium, applied in the field of deep learning and used to solve the problem that transcript text obtained by speech recognition carries no punctuation. The method provided by the invention includes: obtaining target text without punctuation; performing word segmentation on the target text to obtain the target words it contains; vectorizing each target word to obtain the corresponding target vectors; inputting the target vectors into a network model one by one, in the order in which the target words appear in the target text, to obtain an output result sequence in which each value represents the punctuation corresponding to a target word; determining the punctuation mark corresponding to each value according to a preset value-to-punctuation correspondence; and, for each punctuation mark, inserting it into the target text immediately after its corresponding target word, yielding the punctuation-predicted transcript text.
Description
Technical Field
The present invention relates to the technical field of deep learning, and in particular to a text punctuation prediction method, apparatus, computer device and storage medium.
Background
With the rapid development of society and high technology, natural language processing applications such as smart home control, automatic question answering and voice assistants are receiving more and more attention. However, because spoken dialogue carries no punctuation marks, sentence boundaries and canonical language structure cannot be distinguished, which makes punctuation prediction an extremely important natural language processing task. In an intelligent telephone customer service scenario, speech recognition turns the user's speech into raw transcript text with no punctuation and no sentence breaks, which cannot be used directly; therefore, before the user's speech can be further exploited, punctuation prediction must first be performed on the raw transcript so that punctuation can be added to the unpunctuated text.
Therefore, finding a method that can accurately predict punctuation for transcript text has become an urgent problem for those skilled in the art.
Summary of the Invention
Embodiments of the present invention provide a text punctuation prediction method, apparatus, computer device and storage medium to solve the problem that transcript text obtained by speech recognition carries no punctuation.
A text punctuation prediction method, comprising:
obtaining target text without punctuation;
performing word segmentation on the target text to obtain the target words in the target text;
vectorizing each of the target words to obtain the target vector corresponding to each target word;
inputting the target vectors into a network model one by one, in the order in which the target words appear in the target text, to obtain a result sequence output in turn by the network model, wherein each value in the result sequence represents the punctuation corresponding to one of the target words, and the network model consists of a pre-trained LSTM network and a conditional random field;
determining the punctuation mark corresponding to each value according to a preset value-to-punctuation correspondence, the correspondence recording a one-to-one mapping between values and punctuation marks;
for each of the punctuation marks, inserting the mark into the target text at the position behind its corresponding target word to obtain the punctuation-predicted transcript text, the position behind referring to the position in the target text that immediately follows the target word.
A text punctuation prediction apparatus, comprising:
a text acquisition module, configured to obtain target text without punctuation;
a word segmentation module, configured to perform word segmentation on the target text to obtain the target words in the target text;
a word vectorization module, configured to vectorize each of the target words to obtain the target vector corresponding to each target word;
a vector input module, configured to input the target vectors into a network model one by one, in the order in which the target words appear in the target text, to obtain a result sequence output in turn by the network model, wherein each value in the result sequence represents the punctuation corresponding to one of the target words, and the network model consists of a pre-trained LSTM network and a conditional random field;
a punctuation determination module, configured to determine the punctuation mark corresponding to each value according to a preset value-to-punctuation correspondence, the correspondence recording a one-to-one mapping between values and punctuation marks;
a punctuation insertion module, configured to, for each of the punctuation marks, insert the mark into the target text at the position behind its corresponding target word to obtain the punctuation-predicted transcript text, the position behind referring to the position in the target text that immediately follows the target word.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above text punctuation prediction method when executing the computer program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above text punctuation prediction method.
In the above text punctuation prediction method, apparatus, computer device and storage medium, target text without punctuation is first obtained; word segmentation is then performed on the target text to obtain its target words; each target word is vectorized to obtain the corresponding target vectors; the target vectors are then input into the network model one by one, in the order in which the target words appear in the target text, to obtain a result sequence output in turn by the network model, each value in the result sequence representing the punctuation corresponding to a target word, the network model consisting of a pre-trained LSTM network and a conditional random field; next, the punctuation mark corresponding to each value is determined from the preset value-to-punctuation correspondence, which records a one-to-one mapping between values and punctuation marks; finally, each punctuation mark is inserted into the target text immediately after its corresponding target word, yielding the punctuation-predicted transcript text.
It can be seen that the invention can accurately predict punctuation for the target text through the pre-trained LSTM network and the conditional random field, completing the addition of punctuation to unpunctuated text and improving the efficiency of text punctuation prediction, so that the text can be used directly by subsequent natural language processing.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application environment of the text punctuation prediction method in an embodiment of the present invention;
Fig. 2 is a flowchart of the text punctuation prediction method in an embodiment of the present invention;
Fig. 3 is a schematic flowchart of step 103 of the text punctuation prediction method in one application scenario, in an embodiment of the present invention;
Fig. 4 is a schematic flowchart of training the network model in one application scenario of the text punctuation prediction method, in an embodiment of the present invention;
Fig. 5 is a schematic flowchart of step 106 of the text punctuation prediction method in one application scenario, in an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the text punctuation prediction apparatus in one application scenario, in an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the text punctuation prediction apparatus in another application scenario, in an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of the punctuation insertion module in an embodiment of the present invention;
Fig. 9 is a schematic diagram of a computer device in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
The text punctuation prediction method provided by this application can be applied in the application environment shown in Fig. 1, in which a client communicates with a server through a network. The client may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet or portable wearable device. The server may be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in Fig. 2, a text punctuation prediction method is provided. Taking the method as applied to the server in Fig. 1 as an example, it includes the following steps:
101. Obtain target text without punctuation.
In this embodiment, the server may obtain target text without punctuation according to actual usage needs or the application scenario. For example, the server may be connected to a client that lets users at a venue ask questions: the user speaks a question into the client's microphone, the client uploads the spoken question to the server, and the server converts the speech into text, generally yielding target text without punctuation. Alternatively, the server may perform punctuation recognition on transcript texts in bulk: a database collects a large number of transcript texts in advance and transmits them to the server over the network, and the server performs punctuation prediction on each of them, so that each transcript text is a target text without punctuation awaiting prediction. It can be understood that the server may also obtain the target texts to be punctuated in many other ways, which are not elaborated here.
It should be noted that the text in this embodiment generally refers to transcript text, i.e., text content obtained by converting a person's speech into writing.
102. Perform word segmentation on the target text to obtain the target words in the target text.
It can be understood that punctuation prediction requires an accurate grasp of the positions where punctuation may appear, and these positions are closely related to the words in the target text; the server therefore performs word segmentation on the target text to obtain the target words it contains. For example, for the target text "你好我明天回复你" ("Hello, I will reply to you tomorrow"), segmentation yields five words: "你好", "我", "明天", "回复" and "你". These five words are the target words.
In particular, when segmenting the target text, third-party software such as the jieba segmenter can be used to obtain the target words.
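jieba itself is an external library, but the segmentation step can be illustrated self-containedly. The sketch below uses forward maximum matching against a tiny hypothetical lexicon (a stand-in for a real segmenter's dictionary), reproducing the example above:

```python
# Forward maximum-matching word segmentation sketch. LEXICON is a
# hypothetical mini-dictionary standing in for a real lexicon such as
# the one shipped with a segmenter like jieba.
LEXICON = {"你好", "明天", "回复", "我", "你"}
MAX_WORD_LEN = 2  # longest entry in this toy lexicon

def segment(text):
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, backing off one character at a time;
        # a single character is always accepted so unknown text still splits.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in LEXICON:
                words.append(candidate)
                i += length
                break
    return words

print(segment("你好我明天回复你"))  # → ['你好', '我', '明天', '回复', '你']
```

A production system would use the third-party segmenter directly; this sketch only shows why segmentation recovers the five target words of the example.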
To reduce interference in the target text and ensure the accuracy of subsequent segmentation and of recognition by the network model, the method further includes, before step 102: deleting specified text from the target text, the specified text including at least stop words. The stop words here may be single Chinese characters used with particularly high frequency, such as "的" and "了", which carry no real linguistic meaning. Before executing step 102, the server may delete the specified text from the target text. For example, assuming the specified text includes stop words and the target text contains "我今天来上班了" ("I came to work today"), the server may first delete meaningless stop words such as "了", obtaining the cleaned text "我今天来上班".
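Since the stop words described above are single high-frequency characters, the deletion step reduces to a character filter. A minimal sketch (the stop-word set is a hypothetical sample, not the patent's list):

```python
# Stop-word removal before segmentation. STOP_WORDS is a tiny
# hypothetical sample; a real system would load a full list of
# high-frequency function characters.
STOP_WORDS = {"的", "了", "呢", "吧"}

def remove_stop_words(text):
    # The patent's stop words are single characters, so filter per character.
    return "".join(ch for ch in text if ch not in STOP_WORDS)

print(remove_stop_words("我今天来上班了"))  # → 我今天来上班
```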
103. Vectorize each of the target words to obtain the target vector corresponding to each target word.
After the target words are obtained, to facilitate recognition and learning by the network model, the server vectorizes each target word, i.e., represents each word as a vector, thereby obtaining the target vector corresponding to each target word. Specifically, the server may store each target word in the form of a one-dimensional matrix (a one-dimensional vector).
For ease of understanding, in a specific application scenario, as shown in Fig. 3, step 103 may specifically include:
201. For each of the target words, check whether the word is recorded in a preset dictionary; if so, execute step 202, otherwise execute step 203. The dictionary records correspondences between words and one-dimensional vectors.
202. Obtain the one-dimensional vector corresponding to the target word.
203. Convert the target word into a first vector by loading word vectors from a first third-party platform.
204. Convert the target word into a second vector by loading word vectors from a second third-party platform.
205. Concatenate the first vector and the second vector to obtain a one-dimensional vector as the one corresponding to the target word.
206. Record the concatenated one-dimensional vector and its corresponding target word in the dictionary.
Regarding step 201: when converting the target words into vectors, the server may convert them one by one, or convert multiple target words simultaneously in a multi-threaded manner, with each thread converting one target word at a time. Specifically, during vector conversion for each target word, the server first checks whether the word is recorded in the preset dictionary. It should be noted here that, to facilitate word-to-vector conversion, the server may be provided with a dictionary in advance that records one-to-one correspondences between words and one-dimensional vectors. For example, "你好" may correspond to vector No. 1, "我" to vector No. 2, "明天" to vector No. 3, "回复" to vector No. 4, "你" to vector No. 5, and so on. The dictionary is completed by covering as many words as possible, so that when the target words of a target text need to be converted, the server can use the preset dictionary to convert each of them into a one-dimensional vector.
Therefore, if the server finds the target word recorded in the dictionary, the dictionary also records the one-dimensional vector corresponding to that word; otherwise, no corresponding one-dimensional vector is recorded.
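The multi-threaded conversion described above, with each thread converting one word at a time, can be sketched with a thread pool. The per-word `to_vector` function here is a hypothetical placeholder; a real implementation would perform the dictionary lookup of steps 201-206:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-word vectorization; a real implementation would look the
# word up in the dictionary or query the loaded word-vector tables.
def to_vector(word):
    return [float(len(word))]  # placeholder one-dimensional vector

def vectorize_all(words, max_workers=4):
    # pool.map preserves input order, so the returned vectors line up
    # with the order of the words in the target text.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(to_vector, words))

print(vectorize_all(["你好", "我", "明天"]))  # → [[2.0], [1.0], [2.0]]
```

Order preservation matters here because step 104 requires feeding the vectors to the network model in the order the words appear in the text.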
Regarding step 202: it can be understood that if the target word is found in the preset dictionary, the dictionary records the one-dimensional vector corresponding to that word, so the server can obtain the vector from the dictionary.
Regarding step 203: it can be understood that if the target word is not found in the preset dictionary, the dictionary does not record a corresponding one-dimensional vector. This is because it is usually impossible to exhaust all words when presetting the dictionary; even at great cost, new words appear almost daily as the amount of information in society grows rapidly, Internet slang for example, so a preset dictionary will inevitably miss some words. To handle this, this embodiment converts target words into vectors during use while supplementing the dictionary with the newly encountered words. Specifically, the server first converts the target word into a first vector by loading word vectors from a first third-party platform. Since third-party platforms tend to be updated promptly, the word vectors loaded from them generally cover all words currently in use, so the target word can be converted into the first vector.
Regarding step 204: to increase the accuracy of vector conversion and reduce the error rate, this embodiment also converts the target word into a second vector by loading word vectors from a second third-party platform. The second third-party platform is a different platform from the first, and the word vectors loaded from each also differ.
Regarding step 205: after obtaining the first and second vectors, the server concatenates them into a single one-dimensional vector as the one corresponding to the target word. Specifically, the first and second vectors for the same word may be spliced head to tail, with the tail of the first vector immediately followed by the head of the second, yielding a new one-dimensional vector. Since the first and second vectors come from word vectors of two different platforms, they differ; by integrating the conversion rules of the two platforms, this embodiment reduces the overall conversion error and also ensures that each one-dimensional vector is long enough, improving the accuracy of subsequent use.
Regarding step 206: it can be understood that the concatenated one-dimensional vector is new relative to the preset dictionary. Therefore, to improve the dictionary and raise the word lookup success rate in subsequent use, the server may record the concatenated one-dimensional vector and its corresponding target word in the dictionary.
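Steps 201-206 amount to a lookup-or-build-and-cache scheme. The sketch below illustrates it with hypothetical stand-in tables for the two third-party word-vector platforms; the vectors and words are illustrative only:

```python
# Sketch of steps 201-206: look the word up in the dictionary; on a miss,
# build a vector by concatenating word vectors from two different sources
# and cache the result. PLATFORM_A and PLATFORM_B are hypothetical
# stand-ins for word vectors loaded from two third-party platforms.
PLATFORM_A = {"新词": [0.1, 0.2]}
PLATFORM_B = {"新词": [0.3, 0.4, 0.5]}
dictionary = {"你好": [1.0, 0.0, 0.0, 0.0, 0.0]}

def lookup_vector(word):
    if word in dictionary:                           # steps 201-202: hit
        return dictionary[word]
    first = PLATFORM_A.get(word, [0.0, 0.0])          # step 203: first platform
    second = PLATFORM_B.get(word, [0.0, 0.0, 0.0])    # step 204: second platform
    vector = first + second                           # step 205: head-to-tail splice
    dictionary[word] = vector                         # step 206: record in dictionary
    return vector

print(lookup_vector("新词"))  # → [0.1, 0.2, 0.3, 0.4, 0.5]
```

After the first miss, "新词" is cached, so the next lookup for the same word hits the dictionary directly, which is the lookup-success-rate improvement step 206 describes.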
104、按照所述各个目标字词在所述目标文本中的次序,将所述各个目标向量依次输入至网络模型,得到所述网络模型依次输出的结果序列,所述结果序列中的各个数值分别表征了所述各个目标字词对应的标点,所述网络模型由预先训练好的LSTM网络和条件随机场组成;104. According to the order of the target words in the target text, the target vectors are sequentially input into the network model to obtain a sequence of results output by the network model in turn, and each value in the sequence of results is respectively Characterizing the punctuation corresponding to each target word, and the network model is composed of a pre-trained LSTM network and a conditional random field;
在得到各个目标字词对应的各个目标向量之后,服务器可以按照所述各个目标字词在所述目标文本中的次序,将所述各个目标向量依次输入至预先训练好的网络模型,得到所述网络模型依次输出的结果序列,其中,所述结果序列中的各个数值分别表征了所述各个目标字词对应的标点。例如,假设该目标文本对应的目标向量共5个,分别为1-5号向量,则在执行步骤104时,先将1号向量输入至该网络模型,然后将2号向量输入至该网络模型,随后是3号向量、4号向量和5号向量;同时,可知在1号向量输入至该网络模型没多久,该网络模型会输出与该1号向量对应的数值,随后会输出与该2号向量对应的数值,以及输出与该3号向量对应的数值、4号向量对应的数值、5号向量对应的数值。因此,该网络模型依次输出的5个数值组成了该结果序列。After obtaining each target vector corresponding to each target word, the server may sequentially input each target vector into the pre-trained network model according to the order of each target word in the target text, and obtain the A sequence of results sequentially output by the network model, wherein each numerical value in the sequence of results respectively represents the punctuation corresponding to each of the target words. For example, assuming that there are 5 target vectors corresponding to the target text, which are vectors 1-5 respectively, then when step 104 is executed, first vector 1 is input into the network model, and then vector 2 is input into the network model , followed by vector 3, vector 4 and vector 5; at the same time, it can be seen that shortly after vector 1 is input to the network model, the network model will output the value corresponding to vector 1, and then output the value corresponding to vector 2. The value corresponding to the number vector, and the value corresponding to the number 3 vector, the number corresponding to the number 4 vector, and the number corresponding to the number 5 vector are output. Therefore, the 5 values output by the network model in turn constitute the result sequence.
It should be noted that the server presets the correspondence between each value and a punctuation mark, which can be configured according to actual needs. For example, in one application scenario, the correspondence between values and punctuation marks may be set as shown in Table 1 below:
Table 1
It can be seen that the kinds of punctuation listed above may be increased or reduced as the actual situation requires, and which value corresponds to which punctuation mark can be set as needed, as long as the same set of correspondences is used both when the network model is trained and when it is used.
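Table 1 itself is not reproduced in this extraction; a minimal sketch of such a correspondence, with value assignments inferred from the examples later in this description (0 for no punctuation, 1 for '。', 2 for '，', 3 for '？' — these particular assignments are an assumption), might look like:

```python
# Value-punctuation correspondence inferred from the examples in this
# description: 0 = no punctuation, 1 = '。', 2 = '，', 3 = '？'.
VALUE_TO_PUNCT = {0: "", 1: "。", 2: "，", 3: "？"}
PUNCT_TO_VALUE = {p: v for v, p in VALUE_TO_PUNCT.items()}

def sequence_to_puncts(result_sequence):
    """Map a result sequence such as [2, 0, 0, 0, 1] back to punctuation marks."""
    return [VALUE_TO_PUNCT[v] for v in result_sequence]

print(sequence_to_puncts([2, 0, 0, 0, 1]))  # ['，', '', '', '', '。']
```

Both directions of the mapping are kept so that the same table serves training (punctuation to value) and inference (value to punctuation).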
In this embodiment, the network model consists of two parts: the first half is an LSTM network and the second half is a conditional random field. It should be understood that, in the application scenario of predicting punctuation for unpunctuated text, the LSTM network is good at handling long-range dependencies and is suited to processing and predicting important events separated by relatively long intervals and delays in a time series, so it can capture the dependencies between the words of the unpunctuated text well and produce predictions. However, the LSTM network lacks the ability to model the output label sequence. Therefore, instead of following the LSTM network with a fully connected layer, this method follows the LSTM network with a conditional random field (CRF), which compensates well for this shortcoming of the LSTM network; the two complement each other and improve the accuracy of punctuation prediction for unpunctuated text.
For ease of understanding, the training process of the network model is described in detail below. As shown in Figure 4, the network model may be pre-trained through the following steps:
301. Collect multiple punctuated dialogue texts;
302. Separate the punctuation from the text in each collected dialogue text to obtain sample texts and the punctuation set corresponding to each sample text;
303. For each punctuation set, determine the first value corresponding to each punctuation mark in that set according to a preset value-punctuation correspondence, and form the standard sequence corresponding to that punctuation set from those first values, where the value-punctuation correspondence records a one-to-one correspondence between values and punctuation marks;
304. Perform word segmentation on the sample texts to obtain the sample words of each sample text;
305. Vectorize each sample word of each sample text to obtain the sample vector corresponding to each sample word;
306. For each sample text, input the sample vectors one by one into the LSTM network of the network model, following the order of the sample words in that sample text, to obtain the intermediate vectors output by the LSTM network in turn;
307. Input each intermediate vector into the conditional random field of the network model to obtain a sample sequence output by the conditional random field, where each value in the sample sequence represents the punctuation mark corresponding to one of the sample words;
308. Taking the output sample sequence as the adjustment target, adjust the parameters of the LSTM network and the weight coefficients of the conditional random field so as to minimize the error between the obtained sample sequence and the standard sequence corresponding to that sample text;
309. If the error between the sample sequence and the standard sequence corresponding to each sample text satisfies a preset training termination condition, determine that the network model has been trained.
Regarding step 301 above, in this embodiment the staff may collect a large number of dialogue texts in different application scenarios: for example, dialogue texts of users asking questions, dialogue texts of users making complaints, dialogue texts of users making small talk, and so on. When collecting dialogue texts, the server may gather a large amount of original dialogue text through channels such as professional knowledge bases and online databases. It should be noted that these dialogue texts must carry punctuation; if a collected original dialogue text carries no punctuation, punctuation can be added to it manually.
Regarding step 302 above, during training the model input is dialogue text without punctuation, so the server may separate the punctuation from the text in each collected dialogue text to obtain the sample texts and the punctuation set corresponding to each sample text. For example, if a collected dialogue text is "你们有什么产品？" ("What products do you have?"), separating it yields the sample text "你们有什么产品" and the punctuation set "    ？" (four spaces before the question mark).
Regarding step 303 above, it should be understood that, to facilitate the subsequent steps, after the punctuation sets are separated from the dialogue texts in step 302 the server may further convert these punctuation sets into sequences of values, i.e. standard sequences. Specifically, each punctuation mark in each punctuation set is converted into a first value according to the value-punctuation correspondence described above, and these first values are then arranged to form the standard sequence. For example, for the punctuation set "    ？" above, the correspondence shown in Table 1 yields the standard sequence "00003".
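As a rough illustration of steps 302-303, assuming word-level tokens and the value assignments suggested by the examples in this description (one value slot per word; the extra leading slot in the original "00003" example is omitted here):

```python
# Hypothetical value assignments inferred from this description's examples.
PUNCT_TO_VALUE = {" ": 0, "。": 1, "，": 2, "？": 3}

def separate(tokens_with_punct):
    """Split a tokenized punctuated text into sample words and a punctuation
    set, one mark (or a space meaning 'none') per word."""
    words, punct_set = [], []
    for word, punct in tokens_with_punct:
        words.append(word)
        punct_set.append(punct)
    return words, punct_set

def to_standard_sequence(punct_set):
    """Encode a punctuation set as a standard sequence of values."""
    return [PUNCT_TO_VALUE[p] for p in punct_set]

words, puncts = separate([("你们", " "), ("有", " "), ("什么", " "), ("产品", "？")])
print(to_standard_sequence(puncts))  # [0, 0, 0, 3]
```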
Regarding step 304 above, as in step 102, these sample texts likewise need word segmentation before the network model is trained. The server may therefore perform word segmentation on each sample text to obtain its sample words. For example, segmenting the sample text "你们有什么产品" yields four sample words: "你们", "有", "什么" and "产品".
In particular, when segmenting the sample text, third-party software such as the Jieba segmenter may be used to perform the segmentation and obtain the sample words.
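Jieba's own API is not reproduced here; as a stand-in, a minimal forward-maximum-matching segmenter over a toy dictionary illustrates the idea (both the algorithm choice and the dictionary entries are assumptions for this example, not the patent's method):

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

toy_dict = {"你们", "什么", "产品"}
print(segment("你们有什么产品", toy_dict))  # ['你们', '有', '什么', '产品']
```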
To reduce interference in the sample text and ensure the accuracy of the subsequent segmentation and of the input used for training the network model, the method may further include, before step 304: deleting specified text from the sample text, the specified text including at least stop words. The stop words here may be single Chinese characters used with particularly high frequency, such as "的" and "了", which carry no real linguistic meaning. Before executing step 304, the server may delete the specified text from the sample text. For example, suppose the specified text includes stop words and the sample text contains "我今天来上班了" ("I came to work today"); the server may first delete meaningless stop words such as "了", obtaining the cleaned text "我今天来上班".
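A minimal sketch of the stop-word deletion just described (the stop-word list is an assumption for illustration; a real list would be much larger):

```python
# Hypothetical stop-word list of high-frequency single characters.
STOP_WORDS = {"的", "了"}

def remove_stop_words(text):
    """Delete stop-word characters from the sample text before segmentation."""
    return "".join(ch for ch in text if ch not in STOP_WORDS)

print(remove_stop_words("我今天来上班了"))  # 我今天来上班
```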
Regarding step 305 above, as in step 103, after the sample words are obtained the server needs to vectorize each of them, i.e. represent each word as a vector, to facilitate recognition and learning by the subsequent network model, thereby obtaining the sample vector corresponding to each sample word. Specifically, the server may store each sample word in the form of a one-dimensional matrix (a one-dimensional vector).
Regarding step 306 above, it should be understood that when training the network model, training is performed separately on each sample text. The server may input the sample vectors one by one into the LSTM network of the network model, following the order of the sample words in that sample text, to obtain the intermediate vectors output by the LSTM network in turn. For example, suppose a sample text has four sample vectors, numbered 1 to 4. When step 306 is executed, vector 1 is fed into the LSTM network first, followed by vectors 2, 3 and 4; shortly after vector 1 is fed in, the LSTM network outputs the intermediate vector corresponding to vector 1, and subsequently outputs the intermediate vectors corresponding to vectors 2, 3 and 4. It should be understood that, because the LSTM network can keep a short-term memory of the text content, the intermediate vectors output by the LSTM network contain more textual information than the input sample vectors, which is the basis on which the present invention predicts punctuation for unpunctuated text.
As for the LSTM network, it overcomes the inability of a traditional RNN (recurrent neural network) to handle long-range dependencies. An LSTM has three gates: a forget gate, an input gate and an output gate. First, the forget gate determines the information to be discarded from the previous cell state, taking values from 0 to 1, where a smaller value means more information is discarded. Next, the input gate determines how much new information is added to the cell state. Finally, the output gate produces the corresponding output from the current cell state and the new information, and the cell state is updated. For the specific network structure of the LSTM, reference may be made to existing materials, which will not be repeated here.
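The gate computations just described can be sketched for a single scalar LSTM cell step (the weights are made-up toy values purely for illustration; real implementations operate on vectors with learned parameters):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell. w holds per-gate (input-weight,
    hidden-weight, bias) triples for the forget (f), input (i), candidate (g)
    and output (o) computations."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])   # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])   # input gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])   # output gate
    c = f * c_prev + i * g        # discard part of the old state, add new info
    h = o * math.tanh(c)          # output (the "intermediate vector", scalar here)
    return h, c

w = {k: (0.5, 0.5, 0.0) for k in ("f", "i", "g", "o")}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):        # feed inputs in order, as in step 306
    h, c = lstm_cell_step(x, h, c, w)
```

Each call both emits an output and carries the updated cell state forward, which is how the sequential feeding of vectors in steps 104 and 306 accumulates context.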
Regarding step 307 above, after obtaining the intermediate vectors output by the LSTM network, the server may input each intermediate vector into the conditional random field of the network model to obtain the sample sequence output by the conditional random field, where each value in the sample sequence represents the punctuation mark corresponding to one of the sample words.
It should be noted that a CRF (conditional random field) is a conditional probability distribution model of one set of output random variables given another set of input random variables; it is a discriminative undirected probabilistic graphical model, and being discriminative means it models the conditional probability distribution directly. Therefore, in this embodiment, the CRF can select, from all possible output sequences, the most probable sequence as the sample sequence, based on the intermediate vectors provided by the LSTM network. A CRF is usually composed of multiple feature functions, each with its own weight coefficient; training the CRF amounts to determining these weight coefficients.
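Selecting the highest-scoring output sequence in a linear-chain CRF is typically done with Viterbi decoding; a minimal sketch over assumed per-position label scores (the emission and transition numbers below are made up for illustration):

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.
    emissions[t][y]   : score of label y at position t (from the LSTM side)
    transitions[a][b] : score of moving from label a to label b
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])      # best score ending in each label so far
    back = []                        # backpointers per position
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda a: score[a] + transitions[a][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            ptr.append(best_prev)
        score, back = new_score, back + [ptr]
    best = max(range(n_labels), key=lambda y: score[y])
    path = [best]
    for ptr in reversed(back):       # trace the backpointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Two labels (0 = no punctuation, 1 = '。'); all scores are toy values.
emissions = [[2.0, 0.1], [1.5, 0.2], [0.1, 2.5]]
transitions = [[0.5, 0.0], [0.0, 0.0]]
print(viterbi(emissions, transitions))  # [0, 0, 1]
```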
Regarding step 308 above, it should be understood that training the network model in this embodiment means training the LSTM network and the conditional random field, which requires adjusting the parameters of the LSTM network and the weight coefficients of the conditional random field. For example, suppose that for the sample text "你们有什么产品", after the sample vectors of its four sample words are fed in turn into the LSTM network followed by the conditional random field, the sample sequence finally output by the conditional random field is [00104], while the standard sequence corresponding to this sample text is [00003]. The server can detect the error between the two, and may accordingly adjust the parameters of the LSTM network and the weight coefficients of the conditional random field so that the output of the network model approaches [00003] as closely as possible.
When adjusting the parameters of the LSTM network and the weight coefficients of the conditional random field in step 308, an existing back-propagation algorithm may also be used; this is not described further here.
Regarding step 309 above, the server may judge whether the error between the sample sequence and the standard sequence corresponding to each sample text satisfies a preset training termination condition. If it does, the parameters and weight coefficients of the network model have been adjusted into place and the network model can be determined to be fully trained; otherwise, the network model needs further training. The training termination condition may be preset according to the actual usage situation. Specifically, it may be set as follows: the preset training termination condition is considered satisfied if the error between the sample sequence and the standard sequence corresponding to each sample text is smaller than a specified error value. Alternatively, it may be set as follows: steps 306-308 above are executed using the dialogue texts of a validation set, and the preset training termination condition is considered satisfied if the error between the sample sequence output by the network model and the standard sequence lies within a certain range. The dialogue texts of the validation set are collected in the same way as in step 301 above. Specifically, after a large number of dialogue texts have been collected by executing step 301, a certain proportion of them may be assigned to the training set and the remainder to the validation set. For example, a random 80% of the collected dialogue texts may be used as the samples of the training set for subsequently training the network model, and the other 20% as the samples of the validation set for subsequently verifying whether the training of the network model is complete, i.e. whether the preset training termination condition is satisfied.
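The 80/20 split just described can be sketched as follows (the `seed` parameter and helper name are assumptions added for reproducibility in this example):

```python
import random

def split_dataset(texts, train_ratio=0.8, seed=42):
    """Randomly assign collected dialogue texts to a training set and a
    validation set (80/20 by default, as described above)."""
    shuffled = list(texts)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, valid = split_dataset([f"text_{i}" for i in range(10)])
print(len(train), len(valid))  # 8 2
```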
105. Determine the punctuation mark corresponding to each value according to the preset value-punctuation correspondence, where the value-punctuation correspondence records a one-to-one correspondence between values and punctuation marks;
After obtaining the result sequence output by the network model, the server may determine the punctuation mark corresponding to each value according to the preset value-punctuation correspondence. For example, suppose that after the target vectors corresponding to "你好我明天回复你" ("Hello I will reply to you tomorrow") are fed into the network model, the result sequence obtained is [20001]; then, according to the correspondence in Table 1 above, the five punctuation marks corresponding to this result sequence are "，", space, space, space and "。".
106. For each of the punctuation marks, insert that punctuation mark into the target text at the position after the target word corresponding to it, to obtain the punctuation-predicted dialogue text, where the position after the target word refers to the position in the target text that immediately follows that target word.
It should be understood that, after the punctuation marks are determined, the server inserts them at the corresponding positions of the target text to obtain the punctuation-predicted dialogue text, completing the addition of punctuation to the target text. Continuing the example above, after the five punctuation marks "，   。" are obtained, they are added to the target text "你好我明天回复你", yielding the dialogue text "你好，我明天回复你。" ("Hello, I will reply to you tomorrow.").
For ease of understanding, as shown in Figure 5, step 106 above may specifically include:
401. Determine the first punctuation mark in the result sequence as the current punctuation mark;
402. Determine the first target word in the target text as the current word;
403. Insert the current punctuation mark into the target text at the position between the current word and the next word, where the next word refers to the word following the current word in the target text;
404. If the current punctuation mark is not the last one in the result sequence, determine the punctuation mark following the current one in the result sequence as the new current punctuation mark and the word following the current word in the target text as the new current word, then return to step 403;
405. If the current punctuation mark is the last one in the result sequence, determine that the target text is the punctuation-predicted dialogue text.
Regarding step 401 above, continuing the example, the result sequence is [20001] and its first punctuation mark is "，", so "，" is determined as the current punctuation mark.
Regarding step 402 above, the target text is "你好我明天回复你" and its first target word is "你好", so "你好" is determined as the current word.
Regarding step 403 above, "，" is inserted after "你好", so the target text is updated to "你好，我明天回复你". At this point, the next word is the "我" following "你好".
Regarding step 404 above, the server determines that "，" is not the last punctuation mark of the result sequence, so it determines the space as the new current punctuation mark and "我" as the new current word, and returns to step 403. When step 403 is executed, the space is inserted after "我", so the target text is updated to "你好，我明天回复你". The server then determines that this space is not the last mark of the result sequence either, so it determines the second space as the new current punctuation mark and "明天" as the new current word, and so on, until the current punctuation mark is "。", at which point the server determines that "。" is the last punctuation mark of the result sequence and therefore executes step 405.
Regarding step 405 above, when the current punctuation mark is "。", all punctuation marks in the result sequence have been added to the target text, and the target text has been updated to "你好，我明天回复你。". The target text has thus completed punctuation prediction and insertion, so the server can determine that it is the punctuation-predicted dialogue text.
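Steps 401-405 amount to walking the words and the predicted marks in parallel and appending each mark immediately after its word. A minimal sketch (value mapping inferred from this description's examples; the value 0 is rendered as no character here rather than a literal space, which is an assumption):

```python
# Hypothetical value mapping inferred from the [20001] example above.
VALUE_TO_PUNCT = {0: "", 1: "。", 2: "，", 3: "？"}

def insert_punctuation(target_words, result_sequence):
    """Append each predicted punctuation mark immediately after its target
    word, in order (steps 401-405)."""
    pieces = []
    for word, value in zip(target_words, result_sequence):
        pieces.append(word + VALUE_TO_PUNCT[value])
    return "".join(pieces)

print(insert_punctuation(["你好", "我", "明天", "回复", "你"], [2, 0, 0, 0, 1]))
# 你好，我明天回复你。
```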
In the embodiment of the present invention, first, unpunctuated target text is obtained; then, word segmentation is performed on the target text to obtain the target words of the target text; next, each target word is vectorized to obtain the target vector corresponding to each target word; then, the target vectors are input one by one into a network model, following the order of the target words in the target text, to obtain a result sequence output by the network model, where each value in the result sequence represents the punctuation mark corresponding to one of the target words, and the network model consists of a pre-trained LSTM network and a conditional random field; next, the punctuation mark corresponding to each value is determined according to a preset value-punctuation correspondence, which records a one-to-one correspondence between values and punctuation marks; finally, for each punctuation mark, that punctuation mark is inserted into the target text at the position immediately after the target word corresponding to it, yielding the punctuation-predicted dialogue text. It can be seen that the present invention can accurately predict the punctuation of the target text through the pre-trained LSTM network and the preset conditional random field, completing the addition of punctuation to unpunctuated text and improving the efficiency of text punctuation prediction, so that the text can be used directly in subsequent natural language processing.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the order in which the processes are executed should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, a text punctuation prediction apparatus is provided, corresponding one-to-one to the text punctuation prediction method of the above embodiment. As shown in Figure 6, the text punctuation prediction apparatus includes a text acquisition module 501, a word segmentation module 502, a word vectorization module 503, a vector input module 504, a punctuation determination module 505 and a punctuation insertion module 506. Each functional module is described in detail as follows:
The text acquisition module 501 is configured to obtain unpunctuated target text;
The word segmentation module 502 is configured to perform word segmentation on the target text to obtain the target words of the target text;
The word vectorization module 503 is configured to vectorize each target word to obtain the target vector corresponding to each target word;
The vector input module 504 is configured to input the target vectors one by one into a network model, following the order of the target words in the target text, to obtain a result sequence output by the network model, where each value in the result sequence represents the punctuation mark corresponding to one of the target words, and the network model consists of a pre-trained LSTM network and a conditional random field;
The punctuation determination module 505 is configured to determine the punctuation mark corresponding to each value according to a preset value-punctuation correspondence, which records a one-to-one correspondence between values and punctuation marks;
The punctuation insertion module 506 is configured to, for each of the punctuation marks, insert that punctuation mark into the target text at the position after the target word corresponding to it, to obtain the punctuation-predicted dialogue text, where the position after the target word refers to the position in the target text that immediately follows that target word.
As shown in Figure 7, the network model may further be pre-trained by the following modules:
The dialogue text collection module 507 is configured to collect multiple punctuated dialogue texts;
The punctuation-text separation module 508 is configured to separate the punctuation from the text in each collected dialogue text to obtain sample texts and the punctuation set corresponding to each sample text;
The first value determination module 509 is configured to, for each punctuation set, determine the first value corresponding to each punctuation mark in that set according to a preset value-punctuation correspondence, and form the standard sequence corresponding to that punctuation set from those first values, where the value-punctuation correspondence records a one-to-one correspondence between values and punctuation marks;
The sample word segmentation module 510 is configured to perform word segmentation on the sample texts to obtain the sample words of each sample text;
The sample vectorization module 511 is configured to vectorize each sample word of each sample text to obtain the sample vector corresponding to each sample word;
The sample vector input module 512 is configured to, for each sample text, input the sample vectors one by one into the LSTM network of the network model, following the order of the sample words in that sample text, to obtain the intermediate vectors output by the LSTM network in turn;
The random field module 513 is configured to input each intermediate vector into the conditional random field of the network model to obtain a sample sequence output by the conditional random field, where each value in the sample sequence represents the punctuation mark corresponding to one of the sample words;
The parameter coefficient adjustment module 514 is configured to, taking the output sample sequence as the adjustment target, adjust the parameters of the LSTM network and the weight coefficients of the conditional random field so as to minimize the error between the obtained sample sequence and the standard sequence corresponding to each sample text;
训练完成确定模块515,用于若所述样本序列与所述每个样本文本对应的标准序列之间的误差满足预设的训练终止条件,则确定所述网络模型已训练好。The training completion determination module 515 is configured to determine that the network model has been trained if the error between the sample sequence and the standard sequence corresponding to each sample text satisfies a preset training termination condition.
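The decoding step of the LSTM-plus-conditional-random-field pipeline above can be illustrated with a minimal Viterbi search: per-step emission scores (standing in for the LSTM's intermediate vectors projected onto the label space) are combined with pairwise transition weights to pick the best label sequence. This is a generic linear-chain CRF sketch under assumed toy scores, not the patented implementation:

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions:   list of per-step score lists, one score per label
                 (assumed to come from the LSTM's intermediate vectors).
    transitions: transitions[i][j] = learned score of moving from
                 label i to label j (the CRF's weight coefficients).
    """
    n_labels = len(emissions[0])
    scores = list(emissions[0])        # best path score ending in each label
    backpointers = []
    for step in emissions[1:]:
        new_scores = []
        back = []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + step[j])
            back.append(best_i)
        scores = new_scores
        backpointers.append(back)
    # Trace the best path backwards from the highest final score.
    best = max(range(n_labels), key=scores.__getitem__)
    path = [best]
    for back in reversed(backpointers):
        best = back[best]
        path.append(best)
    return path[::-1]

# Emissions dominate here, so the decoded labels follow them: [0, 1, 0].
print(viterbi_decode([[2, 0], [0, 2], [2, 0]], [[0, 0], [0, 0]]))
```

During training, the module-514 step would adjust both the LSTM parameters behind the emission scores and the `transitions` weights to minimize the error against the standard sequence.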
As shown in FIG. 8, the punctuation insertion module 506 may further include:
The current punctuation determination unit 5061, configured to determine the first punctuation mark in the result sequence as the current punctuation mark;
The current word determination unit 5062, configured to determine the first target word in the target text as the current word;
The insertion unit 5063, configured to insert the current punctuation mark into the target text at the position between the current word and the next word, where the next word is the word immediately following the current word in the target text;
The new punctuation determination unit 5064, configured to, when the current punctuation mark is not the last punctuation mark in the result sequence, determine the punctuation mark following the current one in the result sequence as the new current punctuation mark, determine the word following the current word in the target text as the new current word, and return to the step of inserting the current punctuation mark between the current word and the next word in the target text;
The prediction completion determination unit 5065, configured to determine, when the current punctuation mark is the last punctuation mark in the result sequence, that the target text is the script text after punctuation prediction.
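The loop performed by units 5061 to 5065 can be sketched as walking the result sequence and the target words in step, inserting each predicted punctuation mark after the word it follows. Representing "no punctuation" as an empty string is an assumed convention:

```python
def insert_punctuation(target_words, result_punctuation):
    """Interleave predicted punctuation marks with the target words.
    Empty strings in result_punctuation mean "no mark here" (assumption)."""
    pieces = []
    for word, punct in zip(target_words, result_punctuation):
        pieces.append(word)
        if punct:                      # skip the "no punctuation" label
            pieces.append(punct)
    return "".join(pieces)

print(insert_punctuation(["你好", "请问", "有什么", "可以帮您"],
                         ["，", "", "", "？"]))
# 你好，请问有什么可以帮您？
```

The explicit current-word/current-punctuation bookkeeping in units 5061-5064 collapses into the paired iteration of `zip` here; the loop ending on the last punctuation mark corresponds to unit 5065.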
Further, the word vectorization module may include:
A word retrieval unit, configured to check, for each target word, whether the target word is recorded in a preset dictionary, where the dictionary records correspondences between words and one-dimensional vectors;
A vector acquisition unit, configured to acquire the one-dimensional vector corresponding to the target word when the target word is recorded in the preset dictionary;
A first word conversion unit, configured to convert the target word into a first vector by loading the word vectors of a first third-party platform when the target word is not recorded in the preset dictionary;
A second word conversion unit, configured to convert the target word into a second vector by loading the word vectors of a second third-party platform;
A vector concatenation unit, configured to concatenate the first vector and the second vector into a single one-dimensional vector, which serves as the one-dimensional vector corresponding to the target word;
A word recording unit, configured to record the concatenated one-dimensional vector and its corresponding target word in the dictionary.
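The flow of these units amounts to a dictionary-backed cache in front of two embedding sources. The toy "platforms" below are assumptions standing in for real pretrained word vectors; only the structure (lookup, miss, concatenate, record) mirrors the units above:

```python
def make_vectorizer(platform_a, platform_b):
    """platform_a / platform_b: callables standing in for the word-vector
    lookups of two third-party platforms (hypothetical here)."""
    dictionary = {}                    # word -> cached 1-D vector

    def vectorize(word):
        if word in dictionary:         # word retrieval + vector acquisition
            return dictionary[word]
        first = platform_a(word)       # first word conversion unit
        second = platform_b(word)      # second word conversion unit
        combined = first + second      # vector concatenation unit
        dictionary[word] = combined    # word recording unit
        return combined

    return vectorize

# Toy embedding sources: word length, and a digit derived from the first
# character. Real platforms would return learned dense vectors.
vectorize = make_vectorizer(lambda w: [float(len(w))],
                            lambda w: [float(ord(w[0]) % 10)])
print(vectorize("text"))  # [4.0, 6.0]
```

A second call for the same word returns the cached vector without touching either platform, which is the point of the preset dictionary.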
Further, the text punctuation prediction apparatus may also include:
A specified-text deletion module, configured to delete specified text from the target text, where the specified text includes at least stop words.
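The specified-text deletion module can be sketched as filtering stop words out of the segmented target text. The stop-word list here is a tiny illustrative assumption, not the list used by the apparatus:

```python
# Hypothetical stop-word list (filler words common in spoken Chinese).
STOP_WORDS = {"嗯", "啊", "那个"}

def delete_specified_text(target_words, specified=STOP_WORDS):
    """Drop any target word that appears in the specified-text set."""
    return [w for w in target_words if w not in specified]

print(delete_specified_text(["嗯", "今天", "那个", "天气", "不错"]))
# ['今天', '天气', '不错']
```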
For specific limitations on the text punctuation prediction apparatus, reference may be made to the limitations on the text punctuation prediction method above, which are not repeated here. Each module in the above text punctuation prediction apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores the data involved in the text punctuation prediction method. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer program implements a text punctuation prediction method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps of the text punctuation prediction method in the above embodiments are implemented, for example, steps 101 to 106 shown in FIG. 2. Alternatively, when the processor executes the computer program, the functions of the modules/units of the text punctuation prediction apparatus in the above embodiments are implemented, for example, the functions of modules 501 to 506 shown in FIG. 6. To avoid repetition, details are not repeated here.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the text punctuation prediction method in the above embodiments, for example, steps 101 to 106 shown in FIG. 2. Alternatively, when executed by the processor, the computer program implements the functions of the modules/units of the text punctuation prediction apparatus in the above embodiments, for example, the functions of modules 501 to 506 shown in FIG. 6. To avoid repetition, details are not repeated here.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
A person skilled in the art can clearly understand that, for convenience and brevity of description, the division into the above functional units and modules is merely illustrative. In practical applications, the above functions may be allocated to different functional units or modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910182506.1A CN110032732A (en) | 2019-03-12 | 2019-03-12 | A kind of text punctuate prediction technique, device, computer equipment and storage medium |
PCT/CN2019/117303 WO2020181808A1 (en) | 2019-03-12 | 2019-11-12 | Text punctuation prediction method and apparatus, and computer device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110032732A true CN110032732A (en) | 2019-07-19 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103164399A (en) * | 2013-02-26 | 2013-06-19 | 北京捷通华声语音技术有限公司 | Punctuation addition method and device in speech recognition |
CN104143331A (en) * | 2013-05-24 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and system for adding punctuations |
US20170032789A1 (en) * | 2015-07-31 | 2017-02-02 | Lenovo (Singapore) Pte. Ltd. | Insertion of characters in speech recognition |
CN106653030A (en) * | 2016-12-02 | 2017-05-10 | 北京云知声信息技术有限公司 | Punctuation mark adding method and device |
CN107221330A (en) * | 2017-05-26 | 2017-09-29 | 北京搜狗科技发展有限公司 | Punctuate adding method and device, the device added for punctuate |
CN107767870A (en) * | 2017-09-29 | 2018-03-06 | 百度在线网络技术(北京)有限公司 | Adding method, device and the computer equipment of punctuation mark |
CN108932226A (en) * | 2018-05-29 | 2018-12-04 | 华东师范大学 | A kind of pair of method without punctuate text addition punctuation mark |
Also Published As
Publication number | Publication date |
---|---|
WO2020181808A1 (en) | 2020-09-17 |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190719 |