CN114330294B - A method for extracting character speech based on text syntactic analysis - Google Patents
Abstract
The present invention provides a method for extracting character speech based on text syntactic analysis, which extracts the speech of given characters from a given text. The method comprises: constructing a trigger word dictionary, i.e., expanding an initial set of speech trigger words with synonym techniques to build a comprehensive trigger word dictionary; text sentence segmentation, i.e., splitting the whole text into complete sentences; sentence filtering, i.e., screening candidate sentences according to whether a sentence hits the provided character information and the trigger word information; and speech determination, i.e., using syntactic analysis to judge the relationship between the character and the trigger word and decide whether the sentence is the character's speech. The character speech extraction method of the present invention extracts character speech information from text simply, efficiently, and accurately.
Description
Technical Field
The present invention relates to the technical field of text information extraction, and in particular to a method for extracting character speech based on text syntactic analysis.
Background Art
As the times evolve, the economy, society, production, and daily life depend increasingly on the Internet, and obtaining information online has become an indispensable part of people's daily life and work. Faced with massive amounts of information and lengthy reports, quickly and effectively obtaining the statements of the main people involved has become an essential requirement for improving reading efficiency. Research on character speech extraction is therefore of great significance.
Current methods for character speech extraction fall mainly into rule-based methods and methods based on machine learning or deep learning. Rule-based methods mostly rely on trigger words, but few study the construction of the trigger word dictionary in depth; moreover, most of them ignore the relationship between the trigger word and the character word within a sentence, which lowers extraction precision. Machine learning and deep learning methods, although not constrained by a trigger dictionary, require substantial human effort to annotate training corpora in advance, and their performance is often mediocre when the target text deviates significantly from the training samples. In addition, deep learning methods tend to have high algorithmic complexity and, in practice, depend on better hardware resources.
Summary of the Invention
Purpose of the invention: the main purpose of the present invention is to overcome the defects of the prior art with a rule-based method that automatically expands trigger words and, as a complement, introduces syntactic analysis to judge the relationship between trigger words and characters, thereby providing an accurate and efficient character speech extraction method.
The present invention provides a method for extracting character speech based on text syntactic analysis, comprising the following steps:
Step S1, constructing a speech trigger word dictionary: starting from the initial speech trigger words, expand the trigger words with synonym techniques to build the trigger word dictionary;
Step S2, text sentence segmentation: split the whole text into complete sentences;
Step S3, sentence filtering;
Step S4, speech determination.
In step S1, the trigger word dictionary is built from an initial trigger word list L: [W1, W2, …, Wn-1, Wn], where W1, W2, …, Wn-1, Wn are the 1st, 2nd, …, nth initial trigger words. The initial trigger words are speech-type trigger words obtained by a preliminary screening of news-type public opinion data. There are at most 20 initial trigger words, including curated words such as "说" (say), "表示" (state), "告诉" (tell), "指出" (point out), "透露" (reveal), "坦言" (admit frankly), and "声明" (declare). The initial speech trigger words are then expanded through multiple synonym methods.
In step S1, the expansion through multiple synonym methods includes expansion by searching for synonyms in the Cilin synonym thesaurus (同义词词林) and expansion by searching for synonyms with word2vec word vectors. The training corpus used for word2vec is self-annotated news business data from the public opinion domain, which makes it better suited to expanding the synonyms of character speech trigger words.
For the first initial trigger word W1, the procedure is as follows:
Step a1: with W1 as input, search the Cilin synonym thesaurus for synonyms of W1 and return the synonym set L1 of W1, computed as:
L1 = { W1^i | sim(W1, W1^i) > 0.6 }
For uniformity, a list is used instead of a set, and L1 is written as [W1^1, W1^2, W1^3, W1^4, …, W1^k], where W1^i denotes the i-th synonym of W1 found in the thesaurus. The minimum value of k is set to 20, and its maximum equals the number of synonyms of W1 whose similarity to W1 exceeds 60%. In particular, when fewer than 20 synonyms of W1 exceed the 60% similarity threshold, synonyms of W1 are selected in descending order of similarity to make up 20.
Step a2: with W1 as input, search for synonyms of W1 with word2vec and return the synonym set L'1 of W1, computed as:
L'1 = { W1^i | sim_word2vec(W1, W1^i) > 0.6 }
For uniformity, a list is used instead of a set, and L'1 is written as [W'1^1, W'1^2, W'1^3, W'1^4, …, W'1^k], where W'1^i denotes the i-th synonym of W1 found by word2vec. The minimum value of k is again set to 20, and its maximum equals the number of word2vec synonyms of W1 whose similarity to W1 exceeds 60%. In particular, when fewer than 20 word2vec synonyms exceed the 60% similarity threshold, synonyms of W1 are selected in descending order of similarity to make up 20.
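The selection rule shared by steps a1 and a2 (keep every word above the 0.6 similarity threshold, but at least 20 words, topping up by the most similar remaining candidates) can be sketched as follows. This is a minimal illustration: the function name is our own, and `sim` stands in for either the Cilin similarity or the word2vec cosine similarity.

```python
def select_synonyms(word, candidates, sim, threshold=0.6, min_count=20):
    """Rank candidate words by similarity to `word`; keep every candidate
    above the threshold, and when fewer than `min_count` qualify, top up
    with the most similar remaining candidates."""
    ranked = sorted(candidates, key=lambda w: sim(word, w), reverse=True)
    above = [w for w in ranked if sim(word, w) > threshold]
    if len(above) >= min_count:
        return above
    return ranked[:min_count]
```

Either similarity backend can be plugged in through `sim` without changing the selection logic.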
Step a3: apply the operation of step a2 to each word in the list L1 obtained in step a1, yielding the synonym list L1_total for all words in L1.
Step a4: apply the operation of step a1 to each word in the list L'1 obtained in step a2, yielding the synonym list L'1_total for all words in L'1.
Step a5: merge and deduplicate L1, L'1, L1_total, and L'1_total to obtain the candidate lexicon of W1, then screen it further to obtain all synonymous trigger words of the trigger word W1.
Steps a1 to a5 are applied to every trigger word in the initial trigger word list L, yielding all synonymous trigger words of W1, W2, …, Wn-1, Wn. Finally, all trigger words corresponding to W1, W2, …, Wn-1, Wn are merged and deduplicated to build the trigger word dictionary.
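Steps a1 to a5 can be sketched as below, assuming two black-box expanders (`cilin_expand` and `w2v_expand`, names ours) that each return a thresholded synonym list as described above:

```python
def expand_trigger_words(seeds, cilin_expand, w2v_expand):
    """Steps a1-a5: expand every seed trigger word with both synonym
    sources, cross-expand each first-round result with the other
    source, then merge and deduplicate everything into one set.
    The merged set would still be screened (manually) afterwards."""
    dictionary = set()
    for w in seeds:
        l1 = cilin_expand(w)                                             # step a1
        l1_prime = w2v_expand(w)                                         # step a2
        l1_total = [x for s in l1 for x in w2v_expand(s)]                # step a3
        l1_prime_total = [x for s in l1_prime for x in cilin_expand(s)]  # step a4
        dictionary.update(l1, l1_prime, l1_total, l1_prime_total)        # step a5
    return dictionary
```

Cross-applying each expander to the other's output is what lets a single seed reach synonyms that neither source returns directly.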
In step S2, text sentence segmentation first locates the positions of the sentence separators to obtain an ordered list of separator positions, and then performs a preliminary split according to that list; sentence separators include punctuation marks, line breaks, and spaces.
In step S2, a double quotation mark recognition method is used to determine whether a separator can serve as an actual separator.
Specifically: the text is first split at the separators to obtain the ordered separator position list, written as P: [x1, x2, x3, …, xm-1, xm], where x1, x2, x3, …, xm-1, xm are the positions at which separators occur, and [1, x1] gives the start and end positions of the first sentence. The separator position list is traversed from left to right. On the first iteration, the sentence start L is fixed at position 1 and the sentence end R is at position x1; the span from L to R forms the pre-split sentence S. It is then checked whether the double quotation marks in S satisfy:
1) the number of double quotation marks is even;
2) the first is a left (opening) quotation mark and the last is a right (closing) quotation mark.
If both conditions hold, S is output as a complete sentence; in the next iteration, x1 + 1 (the position after the end of this sentence) becomes the sentence start L and the next element x2 after x1 becomes the sentence end R, forming a new pre-split sentence S, on which the same check is performed.
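The quote-aware splitting loop can be sketched as follows. This is a simplification under stated assumptions: the function name is ours, only Chinese curly double quotes are handled, and the separator set is abbreviated to a few common marks.

```python
def split_sentences(text, seps="。？！；\n"):
    """Split at separator positions, but only emit a sentence when its
    double quotes are balanced: an even count, with the first being an
    opening quote and the last a closing quote. Separators inside an
    unclosed quotation are thereby skipped."""
    positions = [i for i, ch in enumerate(text) if ch in seps]
    sentences, start = [], 0
    for pos in positions:
        candidate = text[start:pos + 1]
        quotes = [ch for ch in candidate if ch in "“”"]
        balanced = (len(quotes) % 2 == 0 and
                    (not quotes or (quotes[0] == "“" and quotes[-1] == "”")))
        if balanced:
            sentences.append(candidate)
            start = pos + 1
    if start < len(text):
        sentences.append(text[start:])
    return sentences
```

When a period falls inside an open quotation, the quote count over the candidate span is odd, so the loop simply extends the candidate to the next separator instead of splitting.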
Step S3 comprises: segmenting each sentence into words and checking whether the segmented sentence contains both the known character information and the trigger word information; if it contains both, the sentence is taken as a candidate speech sentence.
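The filtering step reduces to a set-membership test over each sentence's tokens. A minimal sketch (function name ours; the tokenizer is passed in, since the patent describes its segmenter separately below):

```python
def filter_candidates(sentences, persons, triggers, tokenize):
    """Step S3: keep only sentences whose token set contains at least
    one known person name and at least one trigger word."""
    kept = []
    for s in sentences:
        tokens = set(tokenize(s))
        if tokens & persons and tokens & triggers:
            kept.append(s)
    return kept
```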
In step S3, segmenting each sentence into words specifically comprises: analyzing a large-scale corpus to extract the part of speech and occurrence count of each entry, thereby constructing a word segmentation dictionary; using this dictionary, a word-graph scan of the sentence generates a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in the sentence. For this DAG, the initial segmentation probability of each candidate segmentation is computed from the occurrence probabilities of its entries; a segmentation penalty factor C is then introduced to obtain the final segmentation probability, and the candidate with the maximum final segmentation probability is taken as the result.
In step S3, the initial segmentation probability P_ini is computed as:
P_ini_i = P(w_i_1) × P(w_i_2) × … × P(w_i_n)
where w_i_n denotes the n-th word of the i-th candidate segmentation, and P(w_i_n) is the segmentation probability of the word w_i_n.
For example, the sentence "在南京大学玩" ("play at Nanjing University") admits multiple paths, i.e., multiple segmentation results:
Path 1: 在 / 南 / 京 / 大 / 学 / 玩
Path 2: 在 / 南京 / 大 / 学 / 玩
Path 3: 在 / 南京 / 大学 / 玩
Path 4: 在 / 南京大学 / 玩
The initial segmentation probabilities of paths 1 to 4 are:
P_ini_1 = P(在) × P(南) × P(京) × P(大) × P(学) × P(玩)
P_ini_2 = P(在) × P(南京) × P(大) × P(学) × P(玩)
P_ini_3 = P(在) × P(南京) × P(大学) × P(玩)
P_ini_4 = P(在) × P(南京大学) × P(玩)
The word segmentation dictionary comprises a common word dictionary and a domain word dictionary. The common word dictionary meets segmentation needs in most situations; it contains the frequencies and parts of speech of nearly 400,000 words, mainly obtained by training on a public corpus of news, WeChat, and forum text and then manually reviewed. The domain word dictionary focuses on vocabulary related to specific people, places, institutions, and specialized fields.
The final segmentation probability P_final is computed as:
P_final_i = C^n × P_ini_i
where P_final_i is the final probability of the i-th candidate segmentation; C takes a value in [0.5, 0.9], typically 0.8; and n is the number of domain words that the candidate fails to segment out as whole words. For the sentence "在南京大学玩", if "南京大学" is a domain word and the penalty factor C is set to 0.5, then the final probabilities of the four segmentation paths above are:
P_final_1 = 0.5^1 × P_ini_1
P_final_2 = 0.5^1 × P_ini_2
P_final_3 = 0.5^1 × P_ini_3
P_final_4 = 0.5^0 × P_ini_4
Finally, the segmentation path with the largest P_final_i is taken as the segmentation result.
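The penalty-weighted path selection above can be sketched as follows. Candidate paths are assumed to be enumerated beforehand (e.g., from the word-graph DAG), and the unknown-word floor probability is an illustrative choice of ours, not part of the patent:

```python
import math

def best_segmentation(paths, word_prob, domain_words, c=0.5):
    """Score each candidate path as P_final_i = C**n * P_ini_i, where
    P_ini_i is the product of the word probabilities and n counts the
    domain words that the path failed to keep as whole tokens."""
    def score(path):
        p_ini = math.prod(word_prob.get(w, 1e-8) for w in path)
        joined = "".join(path)
        n = sum(1 for d in domain_words if d in joined and d not in path)
        return p_ini * c ** n
    return max(paths, key=score)
```

Because a whole domain word is usually rarer (lower raw probability) than its fragments combined, the C**n penalty is what tips the score toward the path that keeps "南京大学" intact.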
Step S4 comprises: applying syntactic analysis to each candidate speech sentence to determine whether the character and the trigger word stand in a subject-predicate relation; if they do, the sentence is judged to be the character's speech.
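A sketch of the subject-predicate check, assuming the dependency parse is available as (token, head index, relation) triples with 1-based heads and an LTP-style "SBV" (subject-of-verb) label; the parser itself is outside this sketch and the representation is an assumption of ours:

```python
def is_speech(parse, persons, triggers):
    """Given a dependency parse as (token, head_index, relation) triples
    (heads are 1-based; 0 marks the root), return True when a known
    person is the SBV (subject) of a trigger verb."""
    tokens = [t for t, _, _ in parse]
    for token, head, rel in parse:
        if token in persons and rel == "SBV":
            if 1 <= head <= len(tokens) and tokens[head - 1] in triggers:
                return True
    return False
```

This is the step that rejects sentences where the person and the trigger word merely co-occur without the person actually being the speaker.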
Beneficial effects: the character speech extraction method of the present invention performs extraction through comprehensive trigger word construction, precise sentence segmentation, fast sentence filtering, and sound speech determination, thereby providing an accurate and efficient character speech extraction method.
Brief Description of the Drawings
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments; the above and/or other advantages of the present invention will become clearer.
FIG. 1 is a block diagram of the overall character speech extraction process of the present invention.
FIG. 2 is a flowchart of trigger word dictionary construction of the invention.
FIG. 3 is a flowchart of text sentence segmentation of the present invention.
FIG. 4 is a flowchart of sentence filtering of the present invention.
FIG. 5 is a flowchart of speech determination of the present invention.
Detailed Description
As shown in FIG. 1, the overall process of a method for extracting character speech based on text syntactic analysis according to an embodiment of the present invention comprises trigger word dictionary construction, text sentence segmentation, sentence filtering, and speech determination.
As shown in FIG. 2, the trigger word dictionary construction step of this embodiment expands the trigger words in two ways, Cilin thesaurus lookup and word2vec lookup. Specifically, each initial trigger word is used as input to both the thesaurus lookup and the word2vec lookup, yielding two candidate word sets; each candidate set is then used as input to the other expansion method, yielding further candidate sets; finally, the results are reviewed manually to form the trigger word dictionary.
As shown in FIG. 3, the text sentence segmentation step of this embodiment uses punctuation marks, spaces, line breaks, and the like as sentence separators, and matches them to obtain the separator position list, written as:
P = [x1, x2, x3, …, xm-1, xm]
The separator position list P is traversed from left to right. On the first iteration, the sentence start L is fixed at position 1 and the sentence end R is at position x1; the span from the start position L to the end position R forms the pre-split sentence S.
It is then checked whether the double quotation marks in S satisfy:
1) the number of double quotation marks is even, ensuring that they occur in pairs;
2) the first is a left (opening) quotation mark and the last is a right (closing) quotation mark.
If both conditions hold, S is output as a complete sentence; in the next iteration, the position x1 + 1 following the end position x1 of this sentence becomes the sentence start L, and the next element x2 after x1 in the separator list P becomes the sentence end R; the start position L and the end position R form a new pre-split sentence S, on which the same check is performed.
If either condition fails, S cannot be output as a complete sentence; the next element x2 after x1 in the separator list P is instead taken as the sentence end R, forming a new pre-split sentence S, on which the same check is performed.
This repeats until the traversal is complete.
As shown in FIG. 4, the sentence filtering step of the present invention traverses the sentence list produced by text segmentation, segments each sentence into words, and checks whether the segmented sentence contains both the known character information and the trigger word information; if it contains both, the sentence is taken as a candidate speech sentence.
As shown in FIG. 5, the speech determination step of the present invention traverses the filtered sentence list, segments each sentence into words, and applies syntactic analysis to the segmentation result to determine whether the character and the trigger word appearing in it stand in a subject-predicate relation; if they do, the sentence is judged to be a speech sentence.
Embodiment
This embodiment provides a method for extracting character speech based on text syntactic analysis, comprising the following steps:
Step S1, constructing a speech trigger word dictionary: starting from the initial speech trigger words, expand the trigger words with synonym techniques to build the trigger word dictionary;
Step S2, text sentence segmentation: split the whole text into complete sentences;
Step S3, sentence filtering;
Step S4, speech determination.
In step S1, the trigger word dictionary is built from the initial trigger word list L: [W1, W2, …, Wn-1, Wn], expanded through multiple synonym methods, where W1, W2, …, Wn-1, Wn are the initial speech trigger words obtained by manual preliminary screening of news-type public opinion data, such as "说" (say), "表示" (state), "告诉" (tell), "指出" (point out), "透露" (reveal), "坦言" (admit frankly), and "声明" (declare).
In step S1, the expansion through multiple synonym methods includes expansion by Cilin thesaurus lookup and expansion by word2vec lookup. For an initial trigger word W1, the procedure is as follows:
Step a1: with W1 as input, search the Cilin thesaurus for synonyms of W1 and return the synonym list L1: [W1^1, W1^2, W1^3, W1^4, …, W1^k];
Step a2: with W1 as input, search word2vec for synonyms of W1 and return the synonym list L'1: [W'1^1, W'1^2, W'1^3, W'1^4, …, W'1^k]; the training corpus used for word2vec is self-annotated news business data from the public opinion domain and is therefore better suited to expanding the synonyms of character speech trigger words;
Step a3: apply the operation of step a2 to each word in the list L1 obtained in step a1, yielding the synonym list L1_total for all words in L1;
Step a4: apply the operation of step a1 to each word in the list L'1 obtained in step a2, yielding the synonym list L'1_total for all words in L'1;
Step a5: merge and deduplicate L1, L'1, L1_total, and L'1_total to obtain the candidate lexicon of W1, then screen it manually to obtain all synonymous trigger words of the trigger word W1.
Steps a1 to a5 are applied to every trigger word in the initial trigger word list L, yielding all synonymous trigger words of W1, W2, …, Wn-1, Wn; finally, all trigger words corresponding to W1, W2, …, Wn-1, Wn are merged and deduplicated to build the trigger word dictionary.
In step S2, text sentence segmentation first locates the positions of the sentence separators to obtain an ordered separator position list and then performs a preliminary split according to it. Sentence separators comprise punctuation marks, line breaks, and spaces; the separator set is: {。, ？, ；, ！, !, ., ?, \n, \r, \r\n, ……, space, ~}. Consider the following paragraph.
“As far as I know, by early July the national (coverage rate) should reach 40%, and by the end of this year it can reach 80%. Calculated with a vaccine protection rate of 70%, China's COVID-19 vaccine coverage needs to reach nearly 80% before herd immunity becomes possible.” Zhong Nanshan told a China News Service reporter. On a voluntary basis, Chinese citizens aged 18 to 60 who meet the physical requirements can all receive the COVID-19 vaccine free of charge. Residents A and B plan to get vaccinated; near their homes and workplaces there are two large hospitals and two community health service centers, all offering free vaccination.
After splitting this passage at the separators, the resulting sentence list is:
(1) “As far as I know, by early July the national (coverage rate) should reach 40%, and by the end of this year it can reach 80%.
(2) Calculated with a vaccine protection rate of 70%, China's COVID-19 vaccine coverage needs to reach nearly 80% before herd immunity becomes possible.
(3) ” Zhong Nanshan told a China News Service reporter.
(4) On a voluntary basis, Chinese citizens aged 18 to 60 who meet the physical requirements can all receive the COVID-19 vaccine free of charge. Residents A and B plan to get vaccinated; near their homes and workplaces there are two large hospitals and two community health service centers, all offering free vaccination.
Character speech is expressed within double quotation marks, and a separator inside double quotation marks does not mark the end of a sentence. A double quotation mark recognition method is therefore used to decide whether a separator lies inside double quotation marks; if it does, no split is made at that separator. In the example passage above, the first and second periods both occur inside the double quotation marks and thus do not serve as separators, whereas the third period occurs outside the double quotation marks and can serve as a separator.
In step S2, the double quotation mark recognition method for deciding whether a separator can serve as an actual separator is implemented as follows: the text is first split at the separators to obtain the ordered separator position list, written as P: [x1, x2, x3, …, xm-1, xm], where x1, x2, x3, …, xm-1, xm are the positions at which separators occur. Thus [1, x1] gives the start and end positions of the first candidate sentence, and [x1+1, x2] those of the second. The separator position list is traversed from left to right. On the first iteration, position 1 is chosen as the sentence start L and x1 from the separator position list as the sentence end R, forming the pre-split sentence S, and it is checked whether the double quotation marks in S satisfy:
1) the number of double quotation marks is even;
2) the first is a left (opening) quotation mark and the last is a right (closing) quotation mark.
If both conditions hold, S is output as a complete sentence; in the next iteration, the position x1 + 1 following the end position x1 of this sentence becomes the sentence start L, and the next element x2 after x1 in the separator list P becomes the sentence end R, forming the next pre-split sentence S, on which the same check is performed.
以如下段文字作为案例。Take the following paragraph as an example.
“据我所知,在7月初,全国应该达到(覆盖率)40%,今年年底能够达到80%。按照疫苗保护率达到70%计算,中国的新冠疫苗覆盖率需要达到近80%,才有可能形成群体免疫。”钟南山对中新社记者说。本着自愿的原则,18至60周岁符合身体条件的中国公民均可免费接种新冠疫苗.居民甲、乙准备接种疫苗,其居住地及工作单位附近有两个大型医院和两个社区卫生服务中心均可免费接种疫苗。"As far as I know, the national coverage rate should reach 40% by the beginning of July and 80% by the end of this year. Based on the 70% vaccine protection rate, China's COVID-19 vaccine coverage rate needs to reach nearly 80% to form herd immunity," Zhong Nanshan told China News Service. Based on the principle of voluntariness, all Chinese citizens aged 18 to 60 who meet the physical conditions can receive the COVID-19 vaccine free of charge. Residents A and B are preparing to receive the vaccine. There are two large hospitals and two community health service centers near their residence and workplace where they can receive the vaccine free of charge.
Without the method of the present invention, the segmentation result is:
(1) “As far as I know, by early July the nationwide (coverage) rate should reach 40%, and by the end of this year it can reach 80%.
(2) Assuming a vaccine protection rate of 70%, China's COVID-19 vaccine coverage needs to reach nearly 80% before herd immunity becomes possible.
(3) ” Zhong Nanshan told a China News Service reporter.
(4) On a voluntary basis, all Chinese citizens aged 18 to 60 who meet the physical requirements may receive the COVID-19 vaccine free of charge. Residents A and B plan to be vaccinated; near their homes and workplaces there are two large hospitals and two community health service centers, all offering the vaccine free of charge.
With the method of the present invention, the segmentation result is:
(1) “As far as I know, by early July the nationwide (coverage) rate should reach 40%, and by the end of this year it can reach 80%. Assuming a vaccine protection rate of 70%, China's COVID-19 vaccine coverage needs to reach nearly 80% before herd immunity becomes possible.” Zhong Nanshan told a China News Service reporter.
(2) On a voluntary basis, all Chinese citizens aged 18 to 60 who meet the physical requirements may receive the COVID-19 vaccine free of charge. Residents A and B plan to be vaccinated; near their homes and workplaces there are two large hospitals and two community health service centers, all offering the vaccine free of charge.
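The left-to-right traversal of step S2 can be sketched in Python as follows. The delimiter set and the curly quotation marks “ ” are assumptions (the patent does not fix a character set), and 0-based string indices replace the 1-based positions used in the text:

```python
import re

def split_sentences(text, delimiters="。！？"):
    """Split text at sentence-final punctuation, closing a sentence only
    when its double quotation marks are balanced: an even count, with an
    opening quote first and a closing quote last among those present."""
    # Ordered delimiter position list P (0-based here)
    positions = [m.start() for m in re.finditer("[%s]" % delimiters, text)]
    sentences = []
    start = 0  # sentence start L
    for pos in positions:  # candidate sentence end R
        candidate = text[start:pos + 1]  # pre-segmented sentence S
        quotes = [c for c in candidate if c in "“”"]
        balanced = (len(quotes) % 2 == 0 and
                    (not quotes or (quotes[0] == "“" and quotes[-1] == "”")))
        if balanced:
            sentences.append(candidate)
            start = pos + 1  # next L is the position after this end
        # otherwise keep L and let R advance to the next delimiter,
        # merging fragments inside an unclosed quotation
    if start < len(text):
        sentences.append(text[start:])  # trailing fragment, if any
    return sentences
```

Run on the example passage, the quoted material and its attribution come out as one sentence rather than three fragments, matching the segmentation shown above.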
Step S3 comprises: performing word segmentation on each sentence and determining whether the segmented sentence contains both the given character information and a trigger word; if it contains both, the sentence is taken as a candidate speech sentence. Word segmentation uses jieba, augmented with a custom vocabulary to improve segmentation accuracy in the business scenario; the custom vocabulary includes trigger words, key figures in the public-opinion domain, public-opinion domain keywords, and similar entries.
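A minimal sketch of this filtering step is below. The jieba calls shown in the comment (jieba.add_word, jieba.lcut) are the library's actual API, but a trivial whitespace tokenizer stands in by default so the sketch is self-contained; the character and trigger sets passed in are illustrative:

```python
# In practice each sentence would be tokenized with jieba, with the custom
# vocabulary registered first, e.g.:
#   import jieba
#   jieba.add_word("钟南山")   # key figure in the public-opinion domain
#   jieba.add_word("表示")     # trigger word
#   tokens = jieba.lcut(sentence)
def filter_candidates(sentences, persons, triggers, tokenize=str.split):
    """Step S3: keep only sentences whose token set contains both a known
    character word and a trigger word."""
    candidates = []
    for sentence in sentences:
        tokens = set(tokenize(sentence))
        if tokens & persons and tokens & triggers:
            candidates.append(sentence)
    return candidates
```

Sentences that hit only the character, or only the trigger word, are discarded before the more expensive syntactic analysis of step S4.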
Step S4 comprises: applying syntactic analysis to each candidate speech sentence to determine whether the character and the trigger word stand in a subject-predicate relation; if they do, the sentence is judged to be character speech. In sentence (1) below, "Zhong Nanshan" is the character word and "said" (说) is the trigger word; syntactic analysis finds a subject-predicate relation between them, so the sentence is judged a speech sentence. In sentence (2), although the character "Zhong Nanshan" and the trigger word "said" both occur, they are not in a subject-predicate relation, so the sentence is judged a non-speech sentence.
(1) “As far as I know, by early July the nationwide (coverage) rate should reach 40%, and by the end of this year it can reach 80%. Assuming a vaccine protection rate of 70%, China's COVID-19 vaccine coverage needs to reach nearly 80% before herd immunity becomes possible.” Zhong Nanshan told a China News Service reporter. (2) Recently, Academician Zhong Nanshan said something very endearing in an interview, which many people have delighted in repeating; this 84-year-old man also has an endearing side.
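The subject-predicate test of step S4 can be sketched as follows. The patent does not name a parser; here the dependency parse is assumed to be supplied as (word, head_index, relation) triples with an LTP-style "SBV" (subject-of-verb) label, so both the parser and this representation are assumptions:

```python
def is_speech_sentence(parse, person, trigger):
    """Step S4: return True only when `person` depends on `trigger`
    through a subject-of-verb ("SBV") arc, i.e. the character word is
    the grammatical subject of the speech trigger word."""
    words = [word for word, _, _ in parse]
    for word, head, rel in parse:
        if word == person and rel == "SBV" and 0 <= head < len(words):
            if words[head] == trigger:
                return True
    return False
```

On a simplified parse of sentence (1) the arc 钟南山 →SBV→ 说 exists, so the sentence is accepted; in sentence (2) 钟南山 merely modifies 院士, which is the actual subject, so the sentence is rejected.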
The present invention provides a method for extracting character speech based on text syntactic analysis. There are many ways and means to implement this technical solution; the above is merely a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention. Any component not specified in this embodiment may be implemented with existing technology.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111651242.3A CN114330294B (en) | 2021-12-30 | 2021-12-30 | A method for extracting character speech based on text syntactic analysis |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114330294A CN114330294A (en) | 2022-04-12 |
| CN114330294B true CN114330294B (en) | 2024-09-17 |
Family
ID=81019262
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111651242.3A Active CN114330294B (en) | 2021-12-30 | 2021-12-30 | A method for extracting character speech based on text syntactic analysis |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114330294B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115033571A (en) * | 2022-06-07 | 2022-09-09 | Beijing OceanBase Technology Co., Ltd. | Data splitting method and device |
| CN119248843B (en) * | 2024-11-29 | 2025-04-04 | 河北省气象信息中心 | Meteorological data recommendation method, device, equipment and storage medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104268160A (en) * | 2014-09-05 | 2015-01-07 | Beijing Institute of Technology | Evaluation object extraction method based on domain dictionary and semantic roles |
| CN105760439A (en) * | 2016-02-02 | 2016-07-13 | Xi'an Jiaotong University | Figure cooccurrence relation graph establishing method based on specific behavior cooccurrence network |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2008203717A (en) * | 2007-02-22 | 2008-09-04 | Oki Electric Ind Co Ltd | Text sentence selecting method for corpus-based speech synthesis, and program thereof and device thereof |
| CN111950273B (en) * | 2020-07-31 | 2023-09-01 | Nanjing LES Network Information Technology Research Institute Co., Ltd. | Automatic identification method of network public opinion emergencies based on emotional information extraction and analysis |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |