CN110222654A

CN110222654A - Text segmenting method, device, equipment and storage medium

Info

Publication number: CN110222654A
Application number: CN201910499932.8A
Authority: CN
Inventors: 丁宇辰; 刘凯
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-06-10
Filing date: 2019-06-10
Publication date: 2019-09-10

Abstract

Embodiments of the present invention provide a text segmentation method, apparatus, device, and storage medium. The method includes: for each sentence interval in a first text, determining the degree of association between the sentence preceding the interval and the sentence following it; determining, according to the degree of association, whether the sentence interval is a text segmentation point; and, if it is, segmenting the first text at the position of the sentence interval. The text segmentation method provided by the embodiments is applicable to many types of text and therefore has a wider range of application.

Description

Text segmentation method, device, equipment and storage medium

Technical Field

The present invention relates to the technical field of text segmentation, and in particular to a text segmentation method, apparatus, device, and storage medium.

Background

Existing text segmentation methods generally take one of the following two approaches.

The first approach relies on external structural information. For example, when segmenting the HyperText Markup Language (HTML) text of a web page, the HTML tag information can be consulted. The content of a <head> tag is usually a title and needs to be separated from the body text under <p> tags; the content under a <list> tag is displayed as a list, differs clearly from ordinary text, and needs to be extracted from the text separately; bold text marked with <strong> may indicate a summary or emphasis, so segmentation can be performed after that passage where appropriate.

The second approach relies on semantic relevance. In the field of text summarization, some methods determine segmentation points from the relationship between each sentence and the title or topic of the article. A relevance score between each sentence and the title or topic is computed first, a relevance threshold is then set, and each run of consecutive sentences whose scores all lie above, or all lie below, the threshold is treated as one short text segment.
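The grouping step of this second prior-art approach can be illustrated with a minimal sketch. The function name `group_by_threshold`, the example scores, and the threshold value are hypothetical; how the relevance scores are computed is left open here, as in the description above.

```python
from itertools import groupby
from typing import List


def group_by_threshold(sentences: List[str], scores: List[float],
                       threshold: float) -> List[List[str]]:
    """Group consecutive sentences whose scores lie on the same side of the threshold."""
    paired = zip(sentences, scores)
    # groupby starts a new group whenever the key (above / below threshold) changes.
    return [
        [sentence for sentence, _ in group]
        for _, group in groupby(paired, key=lambda pair: pair[1] >= threshold)
    ]


segments = group_by_threshold(["a", "b", "c", "d"], [0.9, 0.8, 0.1, 0.7], 0.5)
```

In this toy run the first two sentences form one segment, and the third and fourth each form their own, because the score crosses the threshold twice.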

The application scenarios of the first approach are therefore limited by the data format: when the data format changes, or when there is no external structural information to rely on, the approach no longer works. The second approach depends on the title or topic of the article: when the article has no title, or a specific and correct topic cannot be obtained, its effectiveness drops sharply. The fields of application of both approaches are thus limited.

Summary of the Invention

Embodiments of the present invention provide a text segmentation method and apparatus to solve at least the above technical problems in the prior art.

In a first aspect, an embodiment of the present invention provides a text segmentation method, including:

for each sentence interval in a first text, determining the degree of association between the preceding sentence and the following sentence of the sentence interval;

determining, according to the degree of association, whether the sentence interval is a text segmentation point; and

where the sentence interval is a text segmentation point, segmenting the first text at the position of the sentence interval.

In one embodiment, determining the degree of association between the preceding sentence and the following sentence of the sentence interval includes:

determining the degree of association according to at least one of: the semantic association between the preceding sentence and the following sentence, the sentence patterns of the preceding sentence and the following sentence, and the leading words of the following sentence.

In one embodiment, determining the degree of association according to at least one of these items includes:

determining a semantic association matrix corresponding to the semantic association between the preceding sentence and the following sentence, a sentence pattern matrix corresponding to the sentence patterns of the preceding sentence and the following sentence, and a leading word matrix corresponding to the leading words of the following sentence;

applying a linear transformation to each of the semantic association matrix, the sentence pattern matrix, and the leading word matrix;

combining the results of the linear transformations into an association information vector of the preceding sentence and the following sentence; and

inputting the association information vector into a pre-trained association degree prediction model to obtain the degree of association between the preceding sentence and the following sentence.

In one embodiment, determining the semantic association matrix corresponding to the semantic association between the preceding sentence and the following sentence includes:

performing a calculation on the word vectors of the words in the preceding sentence to obtain a semantic representation matrix of the preceding sentence, and performing a calculation on the word vectors of the words in the following sentence to obtain a semantic representation matrix of the following sentence; and

multiplying the semantic representation matrix of the preceding sentence by the semantic representation matrix of the following sentence to obtain the semantic association matrix.

In one embodiment, the calculation is performed using a bidirectional long short-term memory (LSTM) model, a bag-of-words (BOW) model, or a Bidirectional Encoder Representations from Transformers (BERT) model.

In one embodiment, determining the sentence pattern matrix corresponding to the sentence patterns of the preceding sentence and the following sentence includes:

determining sentence pattern information of the preceding sentence and of the following sentence, respectively, using pre-designed sentence pattern templates;

generating a sentence pattern vector of the preceding sentence from its sentence pattern information, and generating a sentence pattern vector of the following sentence from its sentence pattern information; and

combining the sentence pattern vector of the preceding sentence with the sentence pattern vector of the following sentence to obtain the sentence pattern matrix.

In one embodiment, determining the leading word matrix corresponding to the leading words of the following sentence includes:

determining the word vectors corresponding to the first N words of the following sentence, where N is an integer; and

concatenating the determined word vectors into the leading word matrix corresponding to the leading words of the following sentence.
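The construction of the leading word matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `leading_word_matrix`, the toy embeddings, and the choice to use zero vectors for unknown words and for padding sentences shorter than N words are not taken from the source.

```python
from typing import Dict, List

Matrix = List[List[float]]


def leading_word_matrix(words: List[str], embeddings: Dict[str, List[float]],
                        n: int = 3, dim: int = 2) -> Matrix:
    """Stack the word vectors of the first n words into an n x dim matrix."""
    zero = [0.0] * dim
    # Assumption: words without an embedding are represented by a zero vector.
    rows = [list(embeddings.get(word, zero)) for word in words[:n]]
    # Assumption: sentences shorter than n words are padded with zero rows.
    rows += [list(zero) for _ in range(n - len(rows))]
    return rows


# Hypothetical two-dimensional word vectors, for illustration only.
toy_embeddings = {"however": [0.5, -0.5], "the": [0.1, 0.1]}
matrix = leading_word_matrix(["however", "the", "result"], toy_embeddings, n=2)
```

Padding keeps the matrix shape fixed at N rows, which is convenient if a later step expects inputs of constant size.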

In one embodiment, before determining, for each sentence interval in the first text, the degree of association between the preceding sentence and the following sentence of the sentence interval, the method further includes:

identifying list text in an original text using preset list templates; and

separating the list text from the original text and taking the remaining part of the original text as the first text.

In a second aspect, an embodiment of the present invention further provides a training method for an association degree prediction model, including:

generating an association information vector of two adjacent sample sentences, and obtaining the actual degree of association of the two adjacent sample sentences;

inputting the association information vector into the association degree prediction model; and

comparing the predicted degree of association output by the association degree prediction model with the actual degree of association, and adjusting the parameters of the association degree prediction model according to the comparison result.
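The compare-and-adjust loop described above can be sketched as follows. The source does not state the model architecture or the loss, so this sketch assumes a single logistic unit trained by stochastic gradient descent on a cross-entropy loss; the names `predict` and `train`, the learning rate, and the toy samples are all hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def predict(x: Vector, w: Vector, b: float) -> float:
    """Predicted degree of association in (0, 1) from an association information vector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train(samples: List[Tuple[Vector, float]], epochs: int = 200,
          lr: float = 0.5) -> Tuple[Vector, float]:
    """Adjust the parameters from the gap between predicted and actual association."""
    w, b = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, actual in samples:
            error = predict(x, w, b) - actual  # comparison result
            # Gradient step for the logistic cross-entropy loss.
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b


# Toy association information vectors with an actual association of 1 or 0.
toy_samples = [([1.0, 0.0], 1.0), ([0.9, 0.1], 1.0),
               ([0.0, 1.0], 0.0), ([0.1, 0.9], 0.0)]
weights, bias = train(toy_samples)
```

After training on the toy data, `predict([1.0, 0.0], weights, bias)` lies above 0.5 and `predict([0.0, 1.0], weights, bias)` lies below it.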

In one embodiment, generating the association information vector of the two adjacent sample sentences includes:

determining a semantic association matrix corresponding to the semantic association between a sample preceding sentence and a sample following sentence, a sentence pattern matrix corresponding to the sentence patterns of the sample preceding sentence and the sample following sentence, and a leading word matrix corresponding to the leading words of the sample following sentence, where the sample preceding sentence is the earlier of the two adjacent sample sentences and the sample following sentence is the later of the two;

applying a linear transformation to each of the semantic association matrix, the sentence pattern matrix, and the leading word matrix; and

combining the results of the linear transformations into the association information vector of the two adjacent sample sentences.

In one embodiment, the method further includes:

selecting text fragments from different documents, splicing the selected text fragments into a first text fragment, and taking the splicing positions in the first text fragment as weak positive examples;

splicing consecutive text fragments from the same document into a second text fragment, and taking the splicing positions in the second text fragment as strong positive examples;

taking the sentence intervals in the first text fragment other than the weak positive examples, and/or the sentence intervals in the second text fragment other than the strong positive examples, as negative examples; and

determining the two sentences immediately before and after a weak positive example, a strong positive example, or a negative example as the two adjacent sample sentences.
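The sample construction above can be sketched as follows. The function name `label_spliced`, the string labels, and the toy documents are hypothetical; the source does not say how the labels are encoded as actual degrees of association, so the sketch leaves them as plain strings.

```python
from typing import List, Tuple

Sample = Tuple[str, str, str]  # (preceding sentence, following sentence, label)


def label_spliced(fragments: List[List[str]], splice_label: str) -> List[Sample]:
    """Splice fragments together and label every sentence interval of the result."""
    sentences: List[str] = []
    splice_points = set()
    for fragment in fragments:
        if sentences:
            # The interval just before this fragment is a splicing position.
            splice_points.add(len(sentences))
        sentences.extend(fragment)
    samples = []
    for i in range(1, len(sentences)):
        label = splice_label if i in splice_points else "negative"
        samples.append((sentences[i - 1], sentences[i], label))
    return samples


doc_a = ["A1.", "A2.", "A3.", "A4."]
doc_b = ["B1.", "B2."]
# First text fragment: pieces from different documents (weak positive at the splice).
first = label_spliced([doc_a[:2], doc_b[:2]], "weak_positive")
# Second text fragment: consecutive pieces of one document (strong positive at the splice).
second = label_spliced([doc_a[:2], doc_a[2:4]], "strong_positive")
```

Each returned tuple holds the two adjacent sample sentences around one sentence interval together with the label of that interval.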

In a third aspect, an embodiment of the present invention further provides a text segmentation apparatus, including:

an association degree determination module, configured to determine, for each sentence interval in a first text, the degree of association between the preceding sentence and the following sentence of the sentence interval;

a segmentation point determination module, configured to determine, according to the degree of association, whether the sentence interval is a text segmentation point; and

a text segmentation module, configured to segment the first text at the position of the sentence interval where the sentence interval is a text segmentation point.

In one embodiment, the association degree determination module is configured to determine the degree of association between the preceding sentence and the following sentence of the sentence interval according to at least one of: the semantic association between the preceding sentence and the following sentence, the sentence patterns of the preceding sentence and the following sentence, and the leading words of the following sentence.

In a fourth aspect, an embodiment of the present invention further provides a training apparatus for an association degree prediction model, including:

a sample determination module, configured to generate an association information vector of two adjacent sample sentences and to obtain the actual degree of association of the two adjacent sample sentences;

an input module, configured to input the association information vector into the association degree prediction model; and

a parameter adjustment module, configured to compare the predicted degree of association output by the association degree prediction model with the actual degree of association, and to adjust the parameters of the association degree prediction model according to the comparison result.

In a fifth aspect, an embodiment of the present invention provides a text segmentation device. The functions of the device may be implemented in hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions.

In one possible design, the text segmentation device includes a processor and a memory. The memory stores a program that supports the text segmentation device in executing the above text segmentation method, and the processor is configured to execute the program stored in the memory. The text segmentation device may further include a communication interface for communicating with other devices or with a communication network.

In a sixth aspect, an embodiment of the present invention provides a training device for an association degree prediction model. The functions of the device may be implemented in hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions.

In one possible design, the training device includes a processor and a memory. The memory stores a program that supports the training device in executing the above training method for the association degree prediction model, and the processor is configured to execute the program stored in the memory. The training device may further include a communication interface for communicating with other devices or with a communication network.

In a seventh aspect, an embodiment of the present invention provides a computer-readable storage medium for storing computer software instructions used by the text segmentation device, including a program for executing the above text segmentation method or the above training method for the association degree prediction model.

One of the above technical solutions has the following advantages or beneficial effects:

An embodiment of the present invention provides a text segmentation method that determines, for each sentence interval, the degree of association between the preceding sentence and the following sentence of the interval. Whether the sentence interval is a text segmentation point is determined according to the degree of association, and if it is, the text is segmented at the position of that sentence interval. Because segmentation does not require external information such as external structural information or title information, the text segmentation method of the embodiments has a wider range of application. An embodiment of the present invention also provides a training method for an association degree prediction model, which can train a model for predicting the degree of association of two adjacent sentences.

The above summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent from the drawings and the following detailed description.

Brief Description of the Drawings

In the drawings, unless otherwise specified, the same reference numerals denote the same or similar parts or elements throughout the several figures. The drawings are not necessarily drawn to scale. It should be understood that the drawings depict only some embodiments disclosed in accordance with the present invention and should not be regarded as limiting its scope.

FIG. 1 is a first flowchart of a text segmentation method according to an embodiment of the present invention;

FIG. 2 is a second flowchart of a text segmentation method according to an embodiment of the present invention;

FIG. 3 is a flowchart of determining the degree of association between the preceding sentence and the following sentence of a sentence interval in step S11 of a text segmentation method according to an embodiment of the present invention;

FIG. 4 is a flowchart of determining the semantic association matrix corresponding to the semantic association between the preceding sentence and the following sentence in step S111 of a text segmentation method according to an embodiment of the present invention;

FIG. 5 is a flowchart of determining the sentence pattern matrix corresponding to the sentence patterns of the preceding sentence and the following sentence in step S111 of a text segmentation method according to an embodiment of the present invention;

FIG. 6 is a flowchart of determining the leading word matrix corresponding to the leading words of the following sentence in step S111 of a text segmentation method according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an implementation framework of a text segmentation method according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of an implementation framework for determining whether a sentence interval is a text segmentation point in a text segmentation method according to an embodiment of the present invention;

FIG. 9 is a flowchart of a training method for an association degree prediction model according to an embodiment of the present invention;

FIG. 10 is a first schematic structural diagram of a text segmentation apparatus according to an embodiment of the present invention;

FIG. 11 is a second schematic structural diagram of a text segmentation apparatus according to an embodiment of the present invention;

FIG. 12 is a schematic structural diagram of a text segmentation device or a training device for an association degree prediction model according to an embodiment of the present invention.

Detailed Description

In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various ways without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not restrictive.

Embodiments of the present invention mainly provide a text segmentation method and apparatus. The technical solutions are described in detail through the following embodiments.

FIG. 1 is a first flowchart of a text segmentation method according to an embodiment of the present invention, including:

S11: for each sentence interval in a first text, determining the degree of association between the preceding sentence and the following sentence of the sentence interval;

S12: determining, according to the degree of association, whether the sentence interval is a text segmentation point;

S13: where the sentence interval is a text segmentation point, segmenting the first text at the position of the sentence interval.

Embodiments of the present invention may treat a period as the end of a sentence. Accordingly, the position between the period of one sentence and the first character of the next sentence may serve as a sentence interval. As the above process shows, the embodiments treat text segmentation as the segmentation of a sequence of sentences: each sentence is visited in turn, and a decision is made as to whether the text should be segmented at the end of that sentence. Judging each sentence interval in turn, and segmenting the text wherever the interval is a text segmentation point, segments the whole text. A first text may contain several text segmentation points, that is, it may be divided into several pieces of text.

The embodiments therefore do not need external information such as external structural information or title information when segmenting text, and so have a wider range of application.
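Steps S11 to S13 can be sketched as a single pass over the sentence intervals. This is a minimal illustration under stated assumptions: the names `split_sentences`, `segment_text`, and `word_overlap` are hypothetical, the rule that a score below a threshold marks a segmentation point is assumed (the source only says the decision is made according to the degree of association), and the word-overlap score is a toy stand-in for the trained association degree prediction model.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split on sentence-ending periods (Chinese or Western), keeping the period."""
    parts = re.findall(r"[^。.]+[。.]?", text)
    return [part.strip() for part in parts if part.strip()]


def segment_text(text: str, relevance: Callable[[str, str], float],
                 threshold: float = 0.5) -> List[List[str]]:
    """Return segments, each a list of sentences; threshold is a hypothetical cut-off."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, nxt in zip(sentences, sentences[1:]):
        score = relevance(prev, nxt)      # S11: degree of association
        if score < threshold:             # S12: assumed decision rule
            segments.append([nxt])        # S13: segment at this interval
        else:
            segments[-1].append(nxt)
    return segments


def word_overlap(a: str, b: str) -> float:
    """Toy stand-in for the trained model: Jaccard overlap of lowercase words."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


result = segment_text(
    "Cats sleep a lot. Cats sleep all day. Stocks fell sharply.",
    word_overlap,
    threshold=0.3,
)
```

The simple period-based splitter would also break on decimal points and abbreviations; a production splitter would need to handle those cases.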

The first text may be ordinary text that contains no lists. Embodiments of the present invention may first detect and extract the text in list format (list text) from the original text and then segment the remaining first text. FIG. 2 is a second flowchart of a text segmentation method according to an embodiment of the present invention, including:

S201: identifying list text in the original text using preset list templates;

S202: separating the list text from the original text and taking the remaining part of the original text as the first text;

In one possible implementation, a plurality of list templates are set up through Internet data mining and manual annotation. Each list template contains the format of one commonly used kind of list, such as "第n步：" ("step n:"), "其n，" ("item n,"), or "n、" ("n,"), where n stands for a number. The list format may also use various forms of numerals, such as "1", "一", or "I".

Accordingly, step S201 may be carried out as follows: traverse each sentence in the original text, check whether the first few words of the sentence match a list template, and record the match if so. If several sentences located close to one another match the same list template and their numbers increase consecutively, the text spanned by those sentences is regarded as list text. The list text is separated from the original text, leaving the first text.

If the list text is short, it may be split off as a single paragraph. If the list text is long, each of its items (an item being the text that follows a numeral such as "first" or "second" in the list) may be split off as a separate paragraph.
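The list detection of steps S201 and S202 can be sketched with regular expressions. The templates below are hypothetical, modelled on the examples in the text plus one English pattern for illustration; the sketch also simplifies "located close to one another" to "directly adjacent" and handles only the single Chinese numerals 一 to 十.

```python
import re
from typing import List, Optional, Tuple

CN_DIGITS = {c: i for i, c in enumerate("一二三四五六七八九十", start=1)}

# Hypothetical list templates; group 1 captures the item number n.
LIST_TEMPLATES = [
    re.compile(r"^第([0-9一二三四五六七八九十]+)步[：:]"),
    re.compile(r"^其([0-9一二三四五六七八九十]+)[，,]"),
    re.compile(r"^([0-9一二三四五六七八九十]+)、"),
    re.compile(r"^Step ([0-9]+):"),
]


def to_int(token: str) -> Optional[int]:
    if token.isdigit():
        return int(token)
    return CN_DIGITS.get(token)  # single numerals only in this sketch


def match_template(sentence: str) -> Optional[Tuple[int, int]]:
    """Return (template index, item number) if the sentence opens like a list item."""
    for idx, pattern in enumerate(LIST_TEMPLATES):
        m = pattern.match(sentence)
        if m and to_int(m.group(1)) is not None:
            return idx, to_int(m.group(1))
    return None


def find_list_runs(sentences: List[str], min_items: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of adjacent consecutively numbered items."""
    runs, start, prev = [], None, None
    for i, sentence in enumerate(sentences + [""]):  # sentinel flushes the last run
        hit = match_template(sentence)
        continues = (hit is not None and prev is not None
                     and hit[0] == prev[0] and hit[1] == prev[1] + 1)
        if not continues:
            if start is not None and i - start >= min_items:
                runs.append((start, i))
            start = i if hit is not None else None
        prev = hit
    return runs


def remove_lists(sentences: List[str]) -> Tuple[List[str], List[List[str]]]:
    """S202: return (first text sentences, extracted list texts)."""
    runs = find_list_runs(sentences)
    in_list = {i for a, b in runs for i in range(a, b)}
    first_text = [s for i, s in enumerate(sentences) if i not in in_list]
    return first_text, [sentences[a:b] for a, b in runs]


first_text, lists = remove_lists(
    ["Intro.", "Step 1: mix.", "Step 2: bake.", "Step 3: serve.", "Done."]
)
```

A single sentence that happens to match a template is not treated as a list here, since at least `min_items` consecutively numbered items are required.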

In one possible implementation, the degree of association between the preceding sentence and the following sentence of the sentence interval in step S11 is determined as shown in FIG. 3, a flowchart of that determination in a text segmentation method according to an embodiment of the present invention, including:

S111: determining a semantic association matrix corresponding to the semantic association between the preceding sentence and the following sentence, a sentence pattern matrix corresponding to the sentence patterns of the preceding sentence and the following sentence, and a leading word matrix corresponding to the leading words of the following sentence;

S112: applying a linear transformation to each of the semantic association matrix, the sentence pattern matrix, and the leading word matrix;

S113: combining the results of the linear transformations into an association information vector of the preceding sentence and the following sentence;

S114: inputting the association information vector into a pre-trained association degree prediction model to obtain the degree of association between the preceding sentence and the following sentence.
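Steps S112 to S114 can be sketched as follows. The source does not specify the shape of the linear transformations or the architecture of the prediction model, so this sketch assumes that each matrix is flattened and multiplied by its own weight matrix, that the three results are concatenated, and that the model is a single logistic unit. All names and the randomly initialised placeholder parameters are hypothetical.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def flatten(m: Matrix) -> Vector:
    return [x for row in m for x in row]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def random_layer(out_dim: int, in_dim: int, rng: random.Random) -> Tuple[Matrix, Vector]:
    """Untrained placeholder parameters; a real system would learn these."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return w, [0.0] * out_dim


def association_vector(semantic: Matrix, pattern: Matrix, lead: Matrix,
                       layers: List[Tuple[Matrix, Vector]]) -> Vector:
    parts: Vector = []
    for matrix, (w, b) in zip((semantic, pattern, lead), layers):
        parts.extend(linear(w, b, flatten(matrix)))  # S112, then S113 by concatenation
    return parts


def predict_relevance(vec: Vector, w: Vector, b: float) -> float:
    """S114 with an assumed logistic output; returns a value in (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, vec)) + b
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
semantic = [[0.2, 0.1], [0.0, 0.5]]      # toy semantic association matrix
pattern = [[1, 0, 0], [0, 1, 0]]         # toy 2 x T sentence pattern matrix
lead = [[0.3, 0.3], [0.1, 0.2]]          # toy leading word matrix
layers = [random_layer(4, 4, rng), random_layer(4, 6, rng), random_layer(4, 4, rng)]
vec = association_vector(semantic, pattern, lead, layers)
score = predict_relevance(vec, [0.1] * len(vec), 0.0)
```

Projecting each matrix to the same output size before concatenation is one simple way to let the three information sources contribute vectors of equal length.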

FIG. 4 is a flowchart of determining the semantic association matrix corresponding to the semantic association between the preceding sentence and the following sentence in step S111 of a text segmentation method according to an embodiment of the present invention, including:

S11141: performing a calculation on the word vectors of the words in the preceding sentence to obtain a semantic representation matrix of the preceding sentence, and performing a calculation on the word vectors of the words in the following sentence to obtain a semantic representation matrix of the following sentence;

S11142: multiplying the semantic representation matrix of the preceding sentence by the semantic representation matrix of the following sentence to obtain the semantic association matrix corresponding to the semantic association between the preceding sentence and the following sentence.

In one possible implementation, the calculation in step S11141 may be performed using a bidirectional long short-term memory (LSTM) model, a bag-of-words (BOW) model, or a Bidirectional Encoder Representations from Transformers (BERT) model.

In step S11141, the semantic representation matrix may be calculated from the word vectors of all the words in the preceding or following sentence, or only from the word vectors of its content words. Content words are words that carry both lexical and grammatical meaning, including nouns, verbs, adjectives, numerals, quantifiers, and pronouns.
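The matrix product of step S11142 can be sketched as follows. This is a minimal illustration under stated assumptions: the word vectors themselves are stacked as the semantic representation matrix (the simplest stand-in for the LSTM, BOW, or BERT calculation), the product is taken as the preceding matrix times the transpose of the following matrix so that the shapes agree, and the embeddings are made up for the example.

```python
from typing import Dict, List

Matrix = List[List[float]]

# Hypothetical 3-dimensional word vectors; real ones would come from a trained embedding.
EMBEDDINGS: Dict[str, List[float]] = {
    "cats": [1.0, 0.0, 0.0],
    "sleep": [0.0, 1.0, 0.0],
    "dogs": [0.9, 0.1, 0.0],
    "bark": [0.0, 0.2, 0.8],
}


def representation(words: List[str]) -> Matrix:
    """S11141 stand-in: one row per known word; unknown words are skipped."""
    return [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]


def semantic_association(prev: Matrix, nxt: Matrix) -> Matrix:
    """S11142 as prev times nxt transposed: entry [i][j] is the dot product of word i and word j."""
    return [[sum(a * b for a, b in zip(row_p, row_n)) for row_n in nxt]
            for row_p in prev]


association = semantic_association(
    representation(["cats", "sleep"]),
    representation(["dogs", "bark"]),
)
```

With these toy vectors the largest entry pairs "cats" with "dogs", which is the kind of word-level affinity the association matrix is meant to expose.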

通过观察统计发现，文本中的疑问句、感叹句、列举句等句式，以及句子长度、短句数量对文本分割有较大影响。特别是列举句式、多短句句式，很难从语义角度利用机器学习模型捕捉。因此，本发明实施例设计了基本的句式模板用于提取句式信息。得到句式信息后，可以将其向量化以便于计算。Through observation and statistics, it is found that sentence patterns such as interrogative sentences, exclamatory sentences, and enumerated sentences in the text, as well as the length of sentences and the number of short sentences have a greater impact on text segmentation. In particular, enumerating sentence patterns and multi-sentence sentence patterns are difficult to capture using machine learning models from a semantic perspective. Therefore, the embodiment of the present invention designs a basic sentence pattern template for extracting sentence pattern information. After getting the sentence pattern information, it can be vectorized for easy calculation.

FIG. 5 is a flowchart of determining, in step S111 of a text segmentation method according to an embodiment of the present invention, the sentence pattern matrix corresponding to the sentence structures of the preceding sentence and the following sentence, including:

S11151: Using pre-designed sentence pattern templates, determine the sentence pattern information of the preceding sentence and the sentence pattern information of the following sentence respectively;

S11152: Generate a sentence pattern vector of the preceding sentence according to the sentence pattern information of the preceding sentence, and generate a sentence pattern vector of the following sentence according to the sentence pattern information of the following sentence;

S11153: Combine the sentence pattern vector of the preceding sentence with the sentence pattern vector of the following sentence to obtain the sentence pattern matrix corresponding to the sentence structures of the preceding sentence and the following sentence.

In the embodiment of the present invention, the dimension of the sentence pattern vector depends on the number of sentence pattern templates. For example, if T sentence pattern templates are designed in advance, a T-dimensional sentence pattern vector can represent the sentence pattern information, with each element of the vector corresponding to one template. For a given sentence, if its sentence pattern is determined to be the t-th pattern (that is, it matches the t-th sentence pattern template), the t-th element of its sentence pattern vector can be set to 1 and the other elements set to 0. In a possible implementation, combining the sentence pattern vector of the preceding sentence with that of the following sentence yields a sentence pattern matrix with 2 rows and T columns, which contains the sentence structure information of both sentences. The first row of the sentence pattern matrix may be the sentence pattern vector of the preceding sentence, and the second row may be the sentence pattern vector of the following sentence.
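The construction of the 2×T sentence pattern matrix can be sketched as follows. The three regular-expression templates are hypothetical stand-ins for the pre-designed sentence pattern templates, which the text does not enumerate; leaving the vector all zeros when no template matches is likewise an assumption.

```python
import re

# Hypothetical sentence pattern templates (T = 3); the actual templates
# are not specified in the text.
TEMPLATES = [
    ("question", re.compile(r"[?？]\s*$")),
    ("exclamation", re.compile(r"[!！]\s*$")),
    ("enumeration", re.compile(r"^\s*(?:\d+[.)]|(?:first|second|third)\b)", re.IGNORECASE)),
]


def pattern_vector(sentence):
    """Return a T-dimensional vector with a 1 at the first matching template."""
    vec = [0] * len(TEMPLATES)
    for t, (_name, regex) in enumerate(TEMPLATES):
        if regex.search(sentence):
            vec[t] = 1
            break  # one pattern per sentence, as in the one-hot description
    return vec


def pattern_matrix(prev_sentence, next_sentence):
    """Row 1 is the preceding sentence's vector, row 2 the following sentence's."""
    return [pattern_vector(prev_sentence), pattern_vector(next_sentence)]


matrix = pattern_matrix("What causes this?", "First, check the input.")
print(matrix)
```

Stopping at the first match keeps the vector one-hot; a multi-hot variant would be a different design than the one described.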

In addition to sentence patterns, the inventors found that certain guide words expressing logical relations, such as "however", "for example", and "the above", can act as markers for text segmentation. Therefore, the embodiment of the present invention can use guide words as one basis for text segmentation.

FIG. 6 is a flowchart of determining, in step S111 of a text segmentation method according to an embodiment of the present invention, the guide word matrix corresponding to the guide words of the following sentence, including:

S11161: Determine the word vectors corresponding to the first N words of the following sentence respectively, where N is an integer;

S11162: Concatenate the determined word vectors into the guide word matrix corresponding to the guide words of the following sentence.

Each row of the guide word matrix may be one of the word vectors determined in step S11161.
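Steps S11161 and S11162 can be sketched as follows. The embedding table, the zero vector for unknown words, and the zero padding for sentences shorter than N are all assumptions made for illustration; the text specifies only that the first N word vectors are concatenated row by row.

```python
def guide_word_matrix(words, embeddings, n, dim):
    """Stack the word vectors of the first n words as the rows of a matrix.

    Assumptions: unknown words map to a zero vector, and sentences with
    fewer than n words are padded with zero rows so the shape is n x dim.
    """
    zero = [0.0] * dim
    rows = [list(embeddings.get(word, zero)) for word in words[:n]]
    rows += [list(zero) for _ in range(n - len(rows))]
    return rows


# Hypothetical 2-dimensional word vectors.
EMBEDDINGS = {
    "however": [0.9, 0.1],
    "the": [0.2, 0.2],
    "result": [0.4, 0.7],
}

tokens = ["however", "the", "result", "differs"]  # tokenization is assumed done
guide = guide_word_matrix(tokens, EMBEDDINGS, n=2, dim=2)
print(guide)
```

Fixing the matrix at N rows keeps the later linear transformation well defined for every sentence pair.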

In a possible implementation, the linear transformation in step S112 may refer to multiplying a preset matrix by the semantic association matrix, the sentence pattern matrix, and the guide word matrix respectively, obtaining one vector from each. For example:

If the semantic association matrix (denoted A) is an L×P matrix (that is, L rows and P columns), a preset 1×L matrix B can be used to calculate the product of B and A, yielding a new vector v1 of dimension P.

If the sentence pattern matrix (denoted C) is a 2×T matrix, a preset 1×2 matrix D can be used to calculate the product of D and C, yielding a new vector v2 of dimension T.

If the guide word matrix (denoted E) is an X×Y matrix, a preset 1×X matrix F can be used to calculate the product of F and E, yielding a new vector v3 of dimension Y.

In step S113, the results of the linear transformation may be combined into the association information vector of the preceding sentence and the following sentence by concatenating the three vectors obtained above in order.

For example, concatenating v1, v2, and v3 in order yields an association information vector of dimension (P+T+Y).
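The linear transformations of step S112 and the concatenation of step S113 can be sketched together. The matrices A, C, E and the preset row matrices B, D, F below hold hypothetical toy values; only the shapes (L×P, 2×T, X×Y and 1×L, 1×2, 1×X) follow the text.

```python
def row_times_matrix(row, matrix):
    """Multiply a 1 x n row by an n x m matrix, returning a length-m vector."""
    if len(row) != len(matrix):
        raise ValueError("row length must equal the number of matrix rows")
    return [sum(r * m for r, m in zip(row, col)) for col in zip(*matrix)]


def association_vector(A, B, C, D, E, F):
    """Compute v1 = B*A, v2 = D*C, v3 = F*E and concatenate them in order."""
    v1 = row_times_matrix(B, A)  # dimension P
    v2 = row_times_matrix(D, C)  # dimension T
    v3 = row_times_matrix(F, E)  # dimension Y
    return v1 + v2 + v3          # dimension P + T + Y


A = [[1.0, 3.0], [2.0, 4.0], [3.0, 7.0]]  # semantic association matrix, L=3, P=2
B = [1.0, 0.0, 1.0]                       # preset 1 x L matrix
C = [[1, 0, 0], [0, 0, 1]]                # sentence pattern matrix, 2 x T with T=3
D = [0.5, 0.5]                            # preset 1 x 2 matrix
E = [[0.5, 0.25], [0.25, 0.5]]            # guide word matrix, X=2, Y=2
F = [1.0, 1.0]                            # preset 1 x X matrix

vec = association_vector(A, B, C, D, E, F)
print(vec, len(vec))
```

With P = 2, T = 3, and Y = 2 the resulting vector has 7 elements, matching the (P+T+Y) dimension stated in the text.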

FIG. 7 is a schematic diagram of an implementation framework of a text segmentation method according to an embodiment of the present invention. In FIG. 7, the text segmentation method includes two steps. Step 1 is the list extraction stage, in which the embodiment of the present invention can use preset list templates to extract the list text from the original text and split it off. Step 2 is the segmentation stage, in which the embodiment of the present invention traverses the sentences and, for every two consecutive sentences, generates the association information vector of the preceding sentence and the following sentence according to the semantic association (also called similarity) between the two sentences, the sentence structure of the preceding sentence, the sentence structure of the following sentence, and the guide words of the following sentence. The association information vector is input into the association degree prediction model to obtain the association degree between the preceding sentence and the following sentence. Based on the association degree, it can be determined whether the sentence interval between the preceding sentence and the following sentence is used as a text segmentation point.
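The list extraction stage (step 1) can be sketched as follows. The regular expression used as the list template is a hypothetical example, since the text does not specify the preset templates, and the input is assumed to be already split into lines.

```python
import re

# Hypothetical list template: a line starting with a bullet or "1." / "1)".
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def split_lists(lines):
    """Separate runs of list lines from the remaining text.

    Returns (list_blocks, remaining), where remaining plays the role of
    the first text passed on to the segmentation stage.
    """
    list_blocks, remaining, current = [], [], []
    for line in lines:
        if LIST_ITEM.match(line):
            current.append(line)
        else:
            if current:  # a run of list lines has just ended
                list_blocks.append(current)
                current = []
            remaining.append(line)
    if current:
        list_blocks.append(current)
    return list_blocks, remaining


lines = ["Intro sentence.", "1. first item", "2. second item", "Closing sentence."]
list_blocks, first_text = split_lists(lines)
print(list_blocks)
print(first_text)
```

Removing list text first means the sentence-pair model in step 2 only sees ordinary prose.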

FIG. 8 is a schematic diagram of an implementation framework for determining whether a sentence interval is a text segmentation point in a text segmentation method according to an embodiment of the present invention. FIG. 8 determines whether the sentence interval between two consecutive sentences (sentence 1 and sentence 2) can serve as a text segmentation point. The word vector of each word in sentence 1 is input into a forward LSTM model and a backward LSTM model respectively, where the forward and backward LSTM models together constitute a bidirectional LSTM model. The outputs of the forward and backward LSTM models are processed to obtain the semantic representation matrix of sentence 1. Sentence 2 is processed in the same way to obtain the semantic representation matrix of sentence 2. The semantic representation matrix of sentence 1 is multiplied by that of sentence 2 to obtain the semantic association matrix of sentence 1 and sentence 2. The semantic association matrix of sentence 1 and sentence 2, the sentence pattern matrix of sentence 1 and sentence 2, and the guide word matrix of sentence 2 are then combined to obtain the association information vector of sentence 1 and sentence 2. This association information vector is input into the association degree prediction model, which outputs the association degree of sentence 1 and sentence 2. The association degree prediction model may be a Softmax classification model, and its output may be a value in the interval (0, 1). The embodiment of the present invention can preset a threshold. When the association degree of sentence 1 and sentence 2 output by the model exceeds the threshold, the end position of sentence 1 can be used as a text segmentation point and the text is segmented there. If the association degree does not exceed the threshold, the same operation is performed on sentence 2 and sentence 3, and so on until all sentences in the text have been traversed.
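The traversal and threshold decision described for FIG. 8 can be sketched as follows. The function toy_score is a hypothetical stand-in for the trained association degree prediction model; it follows the convention stated in the text, in which an output above the threshold marks a segmentation point. The threshold value 0.9 is likewise only an example.

```python
def toy_score(prev_sentence, next_sentence):
    """Hypothetical scorer returning a value in [0, 1]; not the trained model.

    It returns 1 minus the word overlap, so unrelated sentences score high.
    """
    a = set(prev_sentence.lower().split())
    b = set(next_sentence.lower().split())
    if not (a | b):
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def segment(sentences, score, threshold=0.5):
    """Split the sentence list wherever score(prev, next) exceeds the threshold."""
    segments, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        is_last = i == len(sentences) - 1
        if is_last or score(sentence, sentences[i + 1]) > threshold:
            segments.append(current)  # cut at the end of this sentence
            current = []
    return segments


sentences = ["cats purr", "cats sleep", "stocks fell"]
result = segment(sentences, toy_score, threshold=0.9)
print(result)
```

Because every sentence interval is scored independently, the traversal is a single pass over adjacent pairs.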

The embodiment of the present invention also proposes a training method for an association degree prediction model, which is used to predict the association degree of two adjacent sentences. FIG. 9 is a flowchart of a training method for an association degree prediction model according to an embodiment of the present invention, including:

S91: Generate the association information vector of two adjacent sample sentences, and obtain the actual association degree of the two adjacent sample sentences;

S92: Input the association information vector into the association degree prediction model;

S93: Compare the predicted association degree output by the association degree prediction model with the actual association degree, and adjust the parameters of the association degree prediction model according to the comparison result.
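Steps S91 to S93 can be sketched with a small stand-in model. The code below uses a binary logistic model trained by gradient descent on cross-entropy loss, which is an assumption chosen for brevity: a two-class Softmax classifier reduces to this sigmoid form, but the actual model architecture, learning rate, and epoch count are not given in the text. The sample vectors and labels are invented toy data.

```python
import math


def predict(weights, bias, x):
    """Predicted association degree in (0, 1) for association vector x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, dim, lr=0.5, epochs=200):
    """samples: list of (association_vector, actual_degree) pairs (S91)."""
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, actual in samples:
            predicted = predict(weights, bias, x)  # S92: run the model
            error = predicted - actual             # S93: compare with the label
            # S93: adjust parameters along the cross-entropy gradient
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy data: label 1.0 marks a segmentation point, 0.0 a non-segmentation point.
samples = [
    ([1.0, 0.0], 1.0),
    ([0.9, 0.1], 1.0),
    ([0.0, 1.0], 0.0),
    ([0.1, 0.9], 0.0),
]
weights, bias = train(samples, dim=2)
print(predict(weights, bias, [1.0, 0.0]), predict(weights, bias, [0.0, 1.0]))
```

The comparison in S93 is realized here as the difference between predicted and actual values; other loss functions would fit the same three-step outline.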

In a possible implementation, in step S91, generating the association information vector of two adjacent sample sentences includes:

Performing linear transformation on the semantic association matrix, the sentence pattern matrix, and the guide word matrix respectively;

In this embodiment, the specific manner of generating the association information vector of two adjacent sample sentences is the same as the corresponding manner in the text segmentation method above, and is not described again here.

In a possible implementation, in step S91, the actual association degree of two adjacent sample sentences may be set manually.

In a possible implementation, a training data generation process may also be included. The training data in the embodiment of the present invention may include positive examples and negative examples, where the positive examples may include weak positive examples and strong positive examples. Positive examples are points that should be segmented, and negative examples are points that should not be segmented. Weak positive examples guarantee the model's baseline capability, namely separating text that is obviously unrelated. Strong positive examples help the model achieve better results, in the hope that it learns a semantics-based segmentation method.

In a possible implementation, the embodiment of the present invention selects text fragments from different documents, splices the selected fragments into a first text fragment, and uses the splicing positions in the first text fragment as weak positive examples. For instance, short fragments can be randomly selected from different documents and spliced together.

In a possible implementation, the embodiment of the present invention splices consecutive text fragments from the same document into a second text fragment, and uses the splicing positions in the second text fragment as strong positive examples.

In a possible implementation, the embodiment of the present invention uses the sentence intervals in the first text fragment other than the weak positive examples, and/or the sentence intervals in the second text fragment other than the strong positive examples, as negative examples.
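The generation of weak positive, strong positive, and negative examples can be sketched as follows. The data layout is an assumption: each document is a list of fragments (for example paragraphs), and each fragment is a non-empty list of sentences. The function names and label strings are hypothetical.

```python
import random


def splice(fragments, boundary_label):
    """Concatenate fragments and label every sentence interval.

    labels[i] describes the interval between sentence i and sentence i + 1:
    boundary_label at a splicing position, "negative" everywhere else.
    """
    sentences, labels = [], []
    for fragment in fragments:
        for j, sentence in enumerate(fragment):
            if sentences:
                labels.append(boundary_label if j == 0 else "negative")
            sentences.append(sentence)
    return sentences, labels


def weak_positive_sample(documents, rng, count=2):
    """Splice one random fragment from each of `count` different documents."""
    chosen_docs = rng.sample(documents, count)
    fragments = [rng.choice(doc) for doc in chosen_docs]
    return splice(fragments, "weak_positive")


def strong_positive_sample(document, start, count=2):
    """Splice `count` consecutive fragments of the same document."""
    return splice(document[start:start + count], "strong_positive")


doc_a = [["A1.", "A2."], ["A3."]]
doc_b = [["B1.", "B2."]]

strong_sentences, strong_labels = strong_positive_sample(doc_a, 0)
weak_sentences, weak_labels = weak_positive_sample([doc_a, doc_b], random.Random(0))
print(strong_sentences, strong_labels)
print(weak_sentences, weak_labels)
```

Each returned label list has one entry per sentence interval, so it can be paired directly with the association information vectors of adjacent sentences.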

The embodiment of the present invention also proposes a text segmentation device. Referring to FIG. 10, FIG. 10 is a first schematic structural diagram of a text segmentation device according to an embodiment of the present invention, including:

An association degree determination module 1010, configured to determine, for each sentence interval in the first text, the association degree between the preceding sentence and the following sentence of the sentence interval;

A segmentation point determination module 1020, configured to determine whether the sentence interval is a text segmentation point according to the association degree;

A text segmentation module 1030, configured to segment the first text at the position of the sentence interval when the sentence interval is a text segmentation point.

In a possible implementation, the association degree determination module 1010 is configured to determine the association degree between the preceding sentence and the following sentence of the sentence interval according to at least one of: the semantic association between the preceding sentence and the following sentence, the sentence structures of the preceding sentence and the following sentence, and the guide words of the following sentence.

In a possible implementation, the association degree determination module 1010 is configured to determine the semantic association matrix corresponding to the semantic association between the preceding sentence and the following sentence, determine the sentence pattern matrix corresponding to the sentence structures of the preceding sentence and the following sentence, and determine the guide word matrix corresponding to the guide words of the following sentence;

Input the association information vector into a pre-trained association degree prediction model to obtain the association degree between the preceding sentence and the following sentence.

In a possible implementation, the association degree determination module 1010 is configured to perform calculation on the word vectors corresponding to the words in the preceding sentence to obtain the semantic representation matrix of the preceding sentence, and to perform calculation on the word vectors corresponding to the words in the following sentence to obtain the semantic representation matrix of the following sentence;

In a possible implementation, the association degree determination module 1010 is configured to perform the calculation using a bidirectional long short-term memory model, a bag-of-words model, or a Bidirectional Encoder Representations from Transformers model.

In a possible implementation, the association degree determination module 1010 is configured to use pre-designed sentence pattern templates to determine the sentence pattern information of the preceding sentence and the sentence pattern information of the following sentence respectively;

In a possible implementation, the association degree determination module 1010 is configured to determine the word vectors corresponding to the first N words of the following sentence respectively, where N is an integer, and to concatenate the determined word vectors into the guide word matrix corresponding to the guide words of the following sentence.

FIG. 11 is a second schematic structural diagram of a text segmentation device according to an embodiment of the present invention, including:

The association degree determination module 1010, the segmentation point determination module 1020, the text segmentation module 1030, and a list segmentation module 1140, where the association degree determination module 1010, the segmentation point determination module 1020, and the text segmentation module 1030 are the same as the corresponding modules in the foregoing embodiment and are not described again here.

The list segmentation module 1140 is configured to use preset list templates to identify the list text in the original text, split the list text off from the original text, and use the remaining part of the original text as the first text.

For the functions of the modules in the devices of the embodiment of the present invention, refer to the corresponding descriptions in the method above; they are not described again here.

The embodiment of the present invention also proposes text segmentation equipment and training equipment for an association degree prediction model. FIG. 12 is a schematic structural diagram of the text segmentation equipment or the training equipment for the association degree prediction model according to an embodiment of the present invention, including:

A memory 11 and a processor 12, where the memory 11 stores a computer program that can run on the processor 12. When the processor 12 executes the computer program, it implements the text segmentation method or the training method for the association degree prediction model in the foregoing embodiments. There may be one or more memories 11 and one or more processors 12.

The equipment may also include:

A communication interface 13, configured to communicate with external devices for data exchange and transmission.

The memory 11 may include high-speed RAM and may also include non-volatile memory, such as at least one magnetic disk memory.

If the memory 11, the processor 12, and the communication interface 13 are implemented independently, they may be connected to each other through a bus and communicate with each other over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, FIG. 12 shows only one thick line, which does not mean that there is only one bus or one type of bus.

Optionally, in a specific implementation, if the memory 11, the processor 12, and the communication interface 13 are integrated on one chip, they may communicate with each other through internal interfaces.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly specifying the number of the technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means two or more, unless otherwise expressly and specifically defined.

Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the foregoing embodiments, multiple steps or methods may be implemented by software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logical functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will understand that all or part of the steps of the methods in the foregoing embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, may each exist physically on their own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various changes or substitutions within the technical scope disclosed by the present invention, and these shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A text segmentation method, characterized by comprising:

For each sentence interval in the first text, determining the degree of association between the preceding sentence and the following sentence of the sentence interval respectively;

Determine whether the sentence interval is a text segmentation point according to the degree of association;

In case the sentence interval is a text segmentation point, the first text is segmented at the position of the sentence interval.

2. The method according to claim 1, characterized in that determining the degree of association between the preceding sentence and the following sentence of the sentence interval comprises:

Determining the degree of association between the preceding sentence and the following sentence of the sentence interval according to at least one of: the semantic association between the preceding sentence and the following sentence, the sentence structures of the preceding sentence and the following sentence, and the guide words of the following sentence.

3. The method according to claim 2, characterized in that determining the degree of association between the preceding sentence and the following sentence of the sentence interval according to at least one of the semantic association between the preceding sentence and the following sentence, the sentence structures of the preceding sentence and the following sentence, and the guide words of the following sentence comprises:

Determining the semantic association matrix corresponding to the semantic association between the preceding sentence and the following sentence, determining the sentence pattern matrix corresponding to the sentence structures of the preceding sentence and the following sentence, and determining the guide word matrix corresponding to the guide words of the following sentence;

Performing linear transformation on the semantic association matrix, the sentence pattern matrix, and the guide word matrix respectively;

Combining the result of the linear transformation into the associated information vector of the preceding sentence and the following sentence;

Inputting the association information vector into a pre-trained association degree prediction model to obtain the association degree between the preceding sentence and the following sentence.

4. The method according to claim 3, characterized in that determining the semantic association matrix corresponding to the semantic association between the preceding sentence and the following sentence comprises:

Performing calculation on the word vectors corresponding to the words in the preceding sentence to obtain the semantic representation matrix of the preceding sentence; and performing calculation on the word vectors corresponding to the words in the following sentence to obtain the semantic representation matrix of the following sentence;

multiplying the semantic representation matrix of the preceding sentence by the semantic representation matrix of the following sentence to obtain a semantic correlation matrix corresponding to the semantic correlation between the preceding sentence and the following sentence.

5. The method according to claim 4, wherein the calculation uses a bidirectional long short-term memory model, a bag-of-words model, or a Bidirectional Encoder Representations from Transformers model.

6. The method according to claim 3, wherein determining the sentence pattern matrix corresponding to the sentence structures of the preceding sentence and the following sentence comprises:

Using a pre-designed sentence pattern template, respectively determining the sentence pattern information of the preceding sentence and the sentence pattern information of the following sentence;

Generating a sentence pattern vector of the preceding sentence according to the sentence pattern information of the preceding sentence, and generating a sentence pattern vector of the following sentence according to the sentence pattern information of the following sentence;

Combining the sentence pattern vector of the preceding sentence with the sentence pattern vector of the following sentence to obtain a sentence pattern matrix corresponding to the sentence pattern structure of the preceding sentence and the following sentence.
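A possible reading of claim 6 is sketched below. The claim does not specify what the pre-designed sentence pattern templates look like, so the three regular-expression templates here are purely hypothetical, as is the choice to encode each sentence as a 0/1 vector with one position per template and to stack the two vectors as rows.

```python
import re

# Hypothetical templates; the patent does not enumerate its own.
TEMPLATES = [
    ("question", re.compile(r"\?\s*$")),
    ("enumeration", re.compile(r"^(first|second|third|next|finally)\b", re.I)),
    ("summary", re.compile(r"^(in summary|in conclusion|overall)\b", re.I)),
]

def pattern_vector(sentence):
    """One 0/1 entry per template, set when the template matches the sentence."""
    text = sentence.strip()
    return [1 if regex.search(text) else 0 for _, regex in TEMPLATES]

def sentence_pattern_matrix(preceding, following):
    """Stack the two sentence pattern vectors as the rows of a 2 x T matrix."""
    return [pattern_vector(preceding), pattern_vector(following)]

matrix = sentence_pattern_matrix("Why does this matter?",
                                 "In summary, it saves time.")
```

For this pair the first row marks a question and the second row marks a summary opener, a combination that could reasonably signal a boundary between the two sentences.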

7. The method according to claim 3, wherein said determining the leading word matrix corresponding to the leading words of the following sentence comprises:

Determining the word vectors respectively corresponding to the first N words of the following sentence, where N is an integer;

Splicing the determined word vectors into the leading word matrix corresponding to the leading words of the following sentence.
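The construction in claim 7 can be sketched as stacking the vectors of the first N words into an N-row matrix. The zero padding for sentences shorter than N and the zero vector for unknown words are assumptions added so that the matrix has a fixed shape; the function name and the toy embedding table are hypothetical.

```python
def leading_word_matrix(sentence, embeddings, n, dim):
    """Stack the word vectors of the first `n` words into an n x dim matrix.

    Unknown words and missing positions are filled with zero vectors
    (an assumption; the claim does not address either case).
    """
    words = sentence.lower().split()[:n]
    rows = [list(embeddings.get(word, [0.0] * dim)) for word in words]
    rows += [[0.0] * dim for _ in range(n - len(rows))]
    return rows

toy_embeddings = {"however": [1.0, 0.0], "the": [0.0, 1.0]}
full = leading_word_matrix("However the results differ", toy_embeddings, 2, 2)
padded = leading_word_matrix("However", toy_embeddings, 3, 2)
```

Words such as "however" at the start of a sentence often indicate how the sentence relates to the one before it, which is a plausible motivation for examining only the leading words.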

8. The method according to any one of claims 1 to 7, wherein, before determining, for each sentence interval in the first text, the degree of association between the preceding sentence and the following sentence of the sentence interval, the method further comprises:

Identifying list text in an original text by using a preset list template;

Separating the list text from the original text, and using the remaining part of the original text as the first text.
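The preprocessing in claim 8 can be illustrated as follows. The claim does not define the preset list template, so the regular expression below, which treats lines starting with a bullet, a number followed by "." or ")", or a parenthesized letter as list items, is a hypothetical stand-in; line-by-line processing is likewise an assumption.

```python
import re

# Hypothetical list template: bullets, "1." / "1)" numbering, or "(a)" labels.
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)]|\([a-z]\))\s+")

def split_list_text(original_text):
    """Return (list lines, remaining text); the remainder plays the role of the first text."""
    list_lines, remaining = [], []
    for line in original_text.splitlines():
        (list_lines if LIST_ITEM.match(line) else remaining).append(line)
    return list_lines, "\n".join(remaining)

original = "Intro sentence.\n1. first item\n2. second item\nClosing sentence."
list_text, first_text = split_list_text(original)
```

Here `list_text` holds the two numbered items and `first_text` holds the introductory and closing sentences, which would then be passed to the sentence-interval analysis.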

9. A training method for a degree of association prediction model, characterized in that the method comprises:

Generating an association information vector of two adjacent sample sentences, and obtaining the actual degree of association of the two adjacent sample sentences;

Inputting the association information vector into the association degree prediction model;

Comparing the predicted association degree output by the association degree prediction model with the actual association degree, and adjusting the parameters of the association degree prediction model according to the comparison result.
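The training procedure of claim 9 can be sketched with a deliberately simple model. The claim does not fix the model architecture, the comparison measure, or the update rule; this sketch assumes a single sigmoid unit, uses the difference between predicted and actual degree of association as the comparison result, and applies plain stochastic gradient descent. The sample vectors are invented toy values.

```python
import math

def predict(weights, bias, vector):
    """Predicted degree of association in (0, 1) from a single sigmoid unit."""
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, dim, epochs=200, lr=0.5):
    """Adjust parameters from the gap between predicted and actual association."""
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for vector, actual in samples:
            error = predict(weights, bias, vector) - actual  # comparison result
            weights = [w - lr * error * x for w, x in zip(weights, vector)]
            bias -= lr * error
    return weights, bias

# Toy association information vectors with their actual degrees of association.
samples = [([1.0, 0.0], 1.0), ([0.9, 0.1], 1.0),
           ([0.0, 1.0], 0.0), ([0.1, 0.9], 0.0)]
weights, bias = train(samples, dim=2)
```

After training, vectors resembling the first two samples receive a predicted degree of association above 0.5 and those resembling the last two fall below it.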

10. The method according to claim 9, wherein said generating the association information vector of two adjacent sample sentences comprises:

Determining a semantic association matrix corresponding to the semantic association between a sample preceding sentence and a sample following sentence, determining a sentence pattern matrix corresponding to the sentence pattern structures of the sample preceding sentence and the sample following sentence, and determining a leading word matrix corresponding to the leading words of the sample following sentence, wherein the sample preceding sentence is the earlier of the two adjacent sample sentences and the sample following sentence is the later of the two adjacent sample sentences;

Combining the results of the linear transformation into an associated information vector of the two adjacent sample sentences.

11. The method according to claim 9 or 10, further comprising:

Selecting text fragments from different documents, splicing the selected text fragments into a first text fragment; using the splicing position in the first text fragment as a weak positive example;

Splicing continuous text fragments in the same document into a second text fragment; using the splicing position in the second text fragment as a strong positive example;

Using sentence intervals other than the weak positive example in the first text segment, and/or sentence intervals other than the strong positive example in the second text segment as negative examples;

Determining the two sentences before and after the weak positive example, the strong positive example, or the negative example as the two adjacent sample sentences.
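The sample construction of claim 11 can be sketched as below, following the labels exactly as the claim words them. The representation of a fragment as a list of sentences, the label strings, and the function name `build_samples` are assumptions made for illustration.

```python
def build_samples(fragments, positive_label):
    """Splice fragments and label every sentence interval.

    Each fragment is a list of sentences. The interval at a splice position
    gets `positive_label`; every other interval is a negative example.
    Returns (preceding sentence, following sentence, label) triples.
    """
    sentences, splice_positions = [], set()
    for fragment in fragments:
        if sentences:
            # Interval i lies between sentences[i] and sentences[i + 1].
            splice_positions.add(len(sentences) - 1)
        sentences.extend(fragment)
    samples = []
    for i in range(len(sentences) - 1):
        label = positive_label if i in splice_positions else "negative"
        samples.append((sentences[i], sentences[i + 1], label))
    return samples

# First text fragment: pieces taken from different documents.
first = build_samples([["A1.", "A2."], ["B1.", "B2."]], "weak_positive")
# Second text fragment: consecutive pieces of the same document.
second = build_samples([["C1.", "C2."], ["C3."]], "strong_positive")
```

Each returned triple corresponds to one pair of adjacent sample sentences, from which an association information vector and an actual degree of association can then be derived.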

12. A text segmentation device, characterized in that, comprising:

A degree of association determination module, configured to determine, for each sentence interval in the first text, the degree of association between the preceding sentence and the following sentence of the sentence interval;

A segmentation point determination module, configured to determine whether the sentence interval is a text segmentation point according to the degree of association;

A text segmentation module, configured to segment the first text at the position of the sentence interval when the sentence interval is a text segmentation point.
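The cooperation of the three modules can be sketched in a few lines. The decision rule is an assumption: a sentence interval is treated as a text segmentation point when the degree of association falls below a fixed threshold, which the claim does not specify. The `word_overlap` function is only a toy stand-in for the degree of association determination module, not the trained prediction model.

```python
def segment(sentences, association, threshold=0.5):
    """Split `sentences` wherever the association of a sentence interval is low."""
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for preceding, following in zip(sentences, sentences[1:]):
        if association(preceding, following) < threshold:
            segments.append([following])      # interval is a segmentation point
        else:
            segments[-1].append(following)    # interval stays inside the segment
    return segments

def word_overlap(a, b):
    """Toy association score: Jaccard overlap of the two sentences' word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

text = ["the cat sat", "the cat slept", "stocks fell today"]
parts = segment(text, word_overlap, threshold=0.3)
```

With these inputs the first two sentences share half of their combined vocabulary and stay together, while the third shares none and starts a new segment.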

13. The device according to claim 12, wherein the degree of association determination module is configured to determine the degree of association between the preceding sentence and the following sentence of the sentence interval according to at least one of: the semantic association between the preceding sentence and the following sentence, the sentence pattern structures of the preceding sentence and the following sentence, and the leading words of the following sentence.

14. A training device for a degree of association prediction model, characterized in that the device comprises:

A sample determination module, configured to generate an association information vector of two adjacent sample sentences, and obtain the actual degree of association of the two adjacent sample sentences;

An input module, configured to input the association information vector into the association degree prediction model;

A parameter adjustment module, configured to compare the predicted degree of association output by the degree of association prediction model with the actual degree of association, and adjust the parameters of the degree of association prediction model according to the comparison result.

15. A text segmentation device, characterized in that the device comprises:

one or more processors;

storage means for storing one or more programs;

Wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-8.

16. A training device for a degree of association prediction model, characterized in that the device comprises:

one or more processors;

storage means for storing one or more programs;

Wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 9-11.

17. A computer-readable storage medium storing a computer program, characterized in that, when the program is executed by a processor, the method according to any one of claims 1-11 is implemented.