CN108491512A - Method and device for summarizing news headlines

Info

Publication number: CN108491512A
Application number: CN201810247766.8A
Authority: CN
Inventors: 邬小鹏; 余晓龙; 张华泉; 王浩; 张向征
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2018-03-23
Filing date: 2018-03-23
Publication date: 2018-09-04

Abstract

The invention provides a method and device for summarizing news headlines. The method includes: obtaining the original headline of a news item and performing lexical and syntactic analysis on it to obtain an analysis result; based on the analysis result, extracting the main sentence content of the original headline and using it as a news candidate headline; and evaluating the quality of the news candidate headline with a summary quality evaluation strategy for news headlines, then determining the news summary headline according to the evaluation result. Embodiments of the invention use lexical and syntactic analysis to produce a compressive summary of a news headline, so that the main content of the headline is extracted while the key information of the original headline is retained as far as possible, yielding a more accurate and rigorous news headline.

Description

Method and device for summarizing news headlines

Technical field

The invention relates to the technical field of Internet applications, and in particular to a method and device for summarizing news headlines.

Background

On today's Internet, with its enormous volume of information, users who search for news with a search engine generally pick out what they need based on the content and description of the news headlines, and then click. How well a news headline summarizes the corresponding news, how accurate it is, and how well it covers the key information therefore largely determine the user's experience of the search engine.

Most current search engine products, news search in particular, directly use the original headline of a news item as the title of the displayed search result. To attract attention and increase clicks, however, original headlines are often filled with redundant information, or overemphasize one aspect and generalize from it, so that the headline is imprecise and inaccurate and may even mislead users. In products that actively push news, such headlines directly prevent users from quickly obtaining the key information, which harms the user experience, reduces users' desire to obtain information from the pushed content, and reduces their stickiness to the push product.

Therefore, removing redundant information from original news headlines to obtain more accurate and rigorous headlines has become a technical problem that urgently needs to be solved.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method and device for summarizing news headlines that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a method for summarizing news headlines is provided, including:

obtaining the original headline of a news item, and performing lexical and syntactic analysis on the original headline to obtain an analysis result;

based on the analysis result, extracting the main sentence content of the original headline, and using the extracted main sentence content as a news candidate headline;

evaluating the quality of the news candidate headline with a summary quality evaluation strategy for news headlines, and then determining the news summary headline according to the evaluation result.

Optionally, obtaining the original headline of the news item includes:

obtaining the crawl log of news resources fetched by a web crawler;

extracting the original headline of the news item from the crawl log.

Optionally, extracting the original headline of the news item from the crawl log includes:

for each record about a news resource in the crawl log, extracting the value of a specified field of the record as the original headline of the news item.

Optionally, performing lexical and syntactic analysis on the original headline to obtain an analysis result includes:

performing word segmentation on the original headline to obtain a plurality of word segments;

performing part-of-speech tagging and entity category tagging on each of the word segments;

based on the part-of-speech tag and entity category tag of each word segment, performing dependency parsing on the original headline to identify the dependency node index and dependency type of each word segment.

Optionally, the method for performing word segmentation on the original headline includes at least one of the following:

a word segmentation method based on string matching;

a word segmentation method based on semantic understanding;

a word segmentation method based on statistics.

Optionally, performing entity category tagging on each of the word segments includes:

using a sequence labeling model to recognize the entity words among the word segments and tag their entity categories.

Optionally, the entity category includes any one of the following:

person name, place name, organization name, brand name, software name.

Optionally, performing dependency parsing on the original headline based on the part-of-speech tag and entity category tag of each word segment, to identify the dependency node index and dependency type of each word segment, includes:

identifying the grammatical components of the original headline from the part-of-speech tag and entity category tag of each word segment;

analyzing the dependency relations between the identified grammatical components to obtain the dependency node index and dependency type of each word segment.

Optionally, extracting the main sentence content of the original headline based on the analysis result includes:

generating a syntax tree from the part-of-speech tag, entity category tag, dependency node index, and dependency type of each word segment, and then generating the main sentence content of the original headline by filtering and pruning the syntax tree.

Optionally, generating the syntax tree from the part-of-speech tag, entity category tag, dependency node index, and dependency type of each word segment, and then generating the main sentence content of the original headline by filtering and pruning the syntax tree, includes:

selecting the head node corresponding to the core relation among the dependency types as the trunk predicate;

if the part of speech of the head node after word segmentation is a noun, merging all shallowly dependent nouns of specific categories and updating the predicate;

if the part of speech of the head node after word segmentation is a verb, setting the head node as the predicate verb;

identifying negation modifiers and merging them into the predicate.

Optionally, the method further includes:

identifying the subject-predicate relation node, merging the nodes surrounding the subject, keeping the noun parts of coordinate-relation nodes according to the subject rule, pruning the remaining nodes, and setting the subject node.

Optionally, the method further includes:

according to the object type, if the object is a noun, identifying the object, removing all coordinate-relation nodes, and setting the object node.
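
As a non-limiting illustration, the predicate, subject, and object selection described above can be sketched in Python as follows. The sketch covers only the verb-headed case and omits noun-head merging and the handling of coordinate-relation nodes. The dependency labels HED, SBV, and VOB, the tag prefix "n" for nouns, the NEGATION_WORDS list, and the function name extract_trunk are hypothetical and are not specified by the invention.

```python
# Hypothetical label set and negation list; the source does not fix either.
CORE, SUBJECT, OBJECT = "HED", "SBV", "VOB"
NEGATION_WORDS = {"不", "没", "没有", "未"}


def extract_trunk(tokens):
    """Return the main sentence content of a dependency-parsed headline.

    tokens: list of dicts with keys word, pos, head (index, -1 for root), deprel.
    """
    # The head node of the core relation becomes the trunk predicate.
    root = next(i for i, t in enumerate(tokens) if t["deprel"] == CORE)
    keep = {root}
    for i, t in enumerate(tokens):
        if t["head"] != root:
            continue  # only direct dependents of the predicate are considered
        if t["deprel"] == SUBJECT:
            keep.add(i)  # subject node
        elif t["deprel"] == OBJECT and t["pos"].startswith("n"):
            keep.add(i)  # noun object node
        elif t["word"] in NEGATION_WORDS:
            keep.add(i)  # negation word merged into the predicate
    # Everything else is pruned; kept words are emitted in original order.
    return "".join(tokens[i]["word"] for i in sorted(keep))


# Hand-annotated example headline (annotation is illustrative only).
tokens = [
    {"word": "某公司", "pos": "n", "head": 3, "deprel": "SBV"},
    {"word": "昨天", "pos": "nt", "head": 3, "deprel": "ADV"},
    {"word": "没有", "pos": "d", "head": 3, "deprel": "ADV"},
    {"word": "发布", "pos": "v", "head": -1, "deprel": "HED"},
    {"word": "新款", "pos": "a", "head": 5, "deprel": "ATT"},
    {"word": "手机", "pos": "n", "head": 3, "deprel": "VOB"},
]
trunk = extract_trunk(tokens)
```

For the hand-annotated example, trunk is "某公司没有发布手机": the time adverbial and the attribute of the object are pruned, while the negation word stays with the predicate.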

Optionally, evaluating the quality of the news candidate headline with the summary quality evaluation strategy for news headlines includes:

compressing the original headline with a neural machine translation model to obtain a news measurement headline;

for the news measurement headline and the news candidate headline, using a language model to calculate the quality score of each sentence under that language model;

using the calculated quality scores as the evaluation result of evaluating the quality of the news candidate headline.
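
As a non-limiting illustration of scoring a sentence under a language model, the following Python sketch uses a toy bigram model with add-one smoothing and returns a length-normalized log-probability. The invention does not specify the type of language model; the bigram model, the smoothing, the normalization, the class name BigramModel, and the three-sentence training corpus are all assumptions made for this example.

```python
import math
from collections import Counter


class BigramModel:
    """Toy add-one-smoothed bigram language model (illustrative assumption)."""

    def __init__(self, corpus):
        self.contexts = Counter()
        self.bigrams = Counter()
        vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            vocab.update(tokens)
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab)

    def score(self, sentence):
        """Average log-probability per bigram; higher means more fluent."""
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.contexts[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return log_prob / (len(tokens) - 1)


corpus = [
    ["company", "releases", "phone"],
    ["company", "releases", "report"],
    ["city", "opens", "museum"],
]
model = BigramModel(corpus)
fluent_score = model.score(["company", "releases", "phone"])
scrambled_score = model.score(["phone", "company", "releases"])
```

Under this toy model the word order seen in training scores higher than the scrambled order, which is the property a quality score of this kind relies on.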

Optionally, determining the news summary headline according to the evaluation result includes:

among the news measurement headline and the news candidate headline, determining the headline with the highest calculated quality score as the headline to be selected;

if the quality score of that headline is greater than a quality score threshold, judging whether the headline to be selected satisfies a preset review condition, and if so, determining the headline to be selected as the news summary headline.

Optionally, whether the headline to be selected satisfies the preset review condition includes at least one of the following:

whether the headline to be selected has a subject-predicate grammatical structure;

whether the headline to be selected has a subject-predicate grammatical structure and its predicate contains a verb component;

whether the edit distance between the headline to be selected and the original headline is less than an edit distance threshold;

whether the semantic distance between the headline to be selected and the original headline is less than a semantic distance threshold.
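
As a non-limiting illustration, the selection by highest quality score, the quality score threshold, and the edit distance review condition can be sketched in Python as follows. Only the edit distance condition is shown; the grammatical and semantic distance conditions are omitted. The function names, the example scores, and the threshold values are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def choose_summary(original, scored_headlines, score_threshold, distance_threshold):
    """Return the accepted headline, or None if it fails the checks.

    scored_headlines: list of (headline, quality_score) pairs.
    """
    best, best_score = max(scored_headlines, key=lambda pair: pair[1])
    if best_score <= score_threshold:
        return None  # quality score not above the threshold
    if edit_distance(best, original) >= distance_threshold:
        return None  # review condition on edit distance not met
    return best


original = "某公司昨天没有发布新款手机"
scored = [("某公司没有发布手机", -1.2), ("某公司发布", -2.5)]
accepted = choose_summary(original, scored, score_threshold=-2.0, distance_threshold=6)
rejected = choose_summary(original, scored, score_threshold=-1.0, distance_threshold=6)
```

A headline that returns None here would fall outside automatic review in this sketch; how such cases are handled is not shown.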

Optionally, after the news summary headline is determined according to the evaluation result, the method further includes:

providing the news summary headline to a real-time hotspot product module, so that the real-time hotspot product module displays the news summary headline as a real-time hotspot.

According to another aspect of the present invention, a device for summarizing news headlines is also provided, including:

an obtaining module, adapted to obtain the original headline of a news item;

an analysis module, adapted to perform lexical and syntactic analysis on the original headline to obtain an analysis result;

an extraction module, adapted to extract, based on the analysis result, the main sentence content of the original headline, and to use the extracted main sentence content as a news candidate headline;

a determining module, adapted to evaluate the quality of the news candidate headline with a summary quality evaluation strategy for news headlines, and then to determine the news summary headline according to the evaluation result.

Optionally, the obtaining module is further adapted to:

Optionally, the analysis module includes:

a word segmentation unit, adapted to perform word segmentation on the original headline to obtain a plurality of word segments;

a tagging unit, adapted to perform part-of-speech tagging and entity category tagging on each of the word segments;

a recognition unit, adapted to perform dependency parsing on the original headline based on the part-of-speech tag and entity category tag of each word segment, and to identify the dependency node index and dependency type of each word segment.

a word segmentation method based on statistics.

Optionally, the tagging unit is further adapted to:

Optionally, the recognition unit is further adapted to:

Optionally, the extraction module is further adapted to:

Optionally, the determining module is further adapted to:

Optionally, the device further includes: a providing module, adapted to provide the news summary headline to a real-time hotspot product module after the determining module determines the news summary headline according to the evaluation result, so that the real-time hotspot product module displays the news summary headline as a real-time hotspot.

According to yet another aspect of the present invention, a computer storage medium is also provided. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to execute the method for summarizing news headlines described above.

According to a further aspect of the present invention, a computing device is also provided, including: a processor; and a memory storing computer program code. When the computer program code is run by the processor, it causes the computing device to execute the method for summarizing news headlines described above.

An embodiment of the present invention provides a method for summarizing news headlines. The method first obtains the original headline of a news item and performs lexical and syntactic analysis on it to obtain an analysis result; then, based on the analysis result, it extracts the main sentence content of the original headline and uses the extracted content as a news candidate headline; finally, it evaluates the quality of the news candidate headline with a summary quality evaluation strategy for news headlines and determines the news summary headline according to the evaluation result. The embodiment thus uses lexical and syntactic analysis to produce a compressive summary of the news headline, so that the main content is extracted while the key information of the original headline is retained as far as possible, yielding a more accurate and rigorous headline. At the same time, a summary quality evaluation strategy is introduced to evaluate the quality of news candidate headlines, and results with good summary quality are reviewed automatically, which reduces the cost of manual operational review and greatly reduces the summary push delay caused by manual review.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.

From the following detailed description of specific embodiments of the invention, taken together with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the invention.

Brief description of the drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference numerals denote the same components. In the drawings:

Fig. 1 shows a flowchart of a method for summarizing news headlines according to an embodiment of the present invention;

Fig. 2 shows a flowchart of a method for performing lexical and syntactic analysis on the original headline of a news item according to an embodiment of the present invention;

Fig. 3 shows a flowchart of a method for extracting the main sentence content of the original headline of a news item according to an embodiment of the present invention;

Fig. 4 shows a flowchart of a method for evaluating the quality of a news candidate headline according to an embodiment of the present invention;

Fig. 5 shows a flowchart of a method for determining the news summary headline according to the evaluation result according to an embodiment of the present invention;

Fig. 6 shows news summary headlines displayed on a search result page according to an embodiment of the present invention;

Fig. 7 shows a flowchart of a method for summarizing news headlines according to another embodiment of the present invention;

Fig. 8 shows a structural diagram of a device for summarizing news headlines according to an embodiment of the present invention; and

Fig. 9 shows a structural diagram of a device for summarizing news headlines according to another embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.

In the related art, the main methods used for sentence compression are deleting words from the sentence, and replacing, reordering, or inserting words in the sentence. Word deletion has become the mainstream method because of its lower complexity; the techniques used mainly include noisy channel models, structured discriminative models, tree-to-tree transformation, and integer linear programming. In terms of overall effect, current mainstream methods delete only a limited number of words from a sentence, and the compression effect is not obvious, as in the following example:

Original sentence: But they are still continuing to search the area try and see if there were, in fact, any further shooting incidents.

Compressed sentence: They are continuing to search the area to see if there were any further incidents.

With the related art mentioned above, which is based on deleting, replacing, reordering, or inserting words in a sentence, it is on the one hand difficult to capture all the content and information in a headline, and on the other hand the rewritten headlines are generally too long. As a result, neither the accuracy nor the length of the rewritten headline meets users' needs and expectations of the product. In addition, given the effect and current state of these solutions, the summarized results have to be reviewed manually and are pushed online only after passing review, in order to meet the high accuracy demanded of user-facing products. These solutions therefore still incur considerable manual operating cost, and the manual process leads to low coverage and poor timeliness of the summary results.

To solve the above technical problems, an embodiment of the present invention provides a method for summarizing news headlines. As shown in Fig. 1, the method may include the following steps S102 to S106.

Step S102: obtain the original headline of a news item, and perform lexical and syntactic analysis on the original headline to obtain an analysis result.

Step S104: based on the analysis result, extract the main sentence content of the original headline, and use the extracted main sentence content as a news candidate headline.

Step S106: evaluate the quality of the news candidate headline with a summary quality evaluation strategy for news headlines, and then determine the news summary headline according to the evaluation result.

For obtaining the original headline in step S102 above, an embodiment of the invention provides an optional solution in which the crawl log of news resources fetched by a web crawler is obtained, and the original headline is then extracted from the crawl log.

A web crawler here is a program or script that automatically fetches information from the World Wide Web according to certain rules. When downloading Internet resources, a crawler may, for example, start from the home page of a portal site: it first downloads the home page, and by analyzing that page it finds all the hyperlinks in it, which amounts to knowing every page directly linked from the portal's home page, such as mail, finance, and news. It then visits, downloads, and analyzes those pages, such as the portal's mail page, and finds further linked pages. By letting the computer continue in this way, the whole Internet can in principle be downloaded. The crawler must also record which pages have already been downloaded to avoid repetition; for this it uses a hash table rather than a notepad-style list to record whether a page has been downloaded.
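
As a non-limiting illustration of the crawl order and the hash table of downloaded pages described above, the following Python sketch walks a small in-memory link graph instead of the real Web. The function name crawl, the breadth-first order, and the LINKS table are assumptions made for this example; a Python set is used as the hash table.

```python
from collections import deque


def crawl(start, get_links):
    """Visit every page reachable from start exactly once, breadth first.

    get_links: hypothetical callable returning the links found on a page.
    """
    visited = {start}       # hash table of pages already scheduled or downloaded
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # stands in for "download and analyze the page"
        for link in get_links(page):
            if link not in visited:  # average O(1) membership test
                visited.add(link)
                queue.append(link)
    return order


# Hypothetical portal: the home page links to mail, finance, and news.
LINKS = {
    "home": ["mail", "finance", "news"],
    "mail": ["home"],
    "finance": ["news", "home"],
    "news": ["home", "finance", "story-1"],
    "story-1": [],
}
order = crawl("home", lambda page: LINKS.get(page, []))
```

Each page appears once in order even though several pages link back to the home page, because the membership test on the set filters repeated links.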

In the above solution of extracting the original headline from the crawl log, specifically, for each record about a news resource in the crawl log, the value of a specified field of the record may be extracted as the original headline. For example, if the records about news resources in the crawler's crawl log have the format url_id+\t+url_title+\t+crawl_time, the value of the url_title field is extracted as the original headline. It should be noted that this example is only illustrative and does not limit the embodiments of the present invention.
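
As a non-limiting illustration, extracting the url_title field from tab-separated records of the form url_id, url_title, crawl_time can be sketched in Python as follows. The function name extract_titles, the handling of malformed records, and the sample records are assumptions made for this example.

```python
def extract_titles(log_lines, title_index=1, num_fields=3):
    """Yield the url_title value of each well-formed crawl-log record."""
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != num_fields:
            continue  # skip malformed records (an assumption of this sketch)
        title = fields[title_index].strip()
        if title:
            yield title


# Hypothetical records in the url_id \t url_title \t crawl_time layout.
sample_log = [
    "1001\tExample headline A\t2018-03-23 10:00:00\n",
    "broken record without tabs\n",
    "1002\tExample headline B\t2018-03-23 10:05:00\n",
]
titles = list(extract_titles(sample_log))
```

The field position is a parameter so that a log with a different column layout can be read without changing the function.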

Further, for performing lexical and syntactic analysis on the original headline in step S102 above to obtain the analysis result, an embodiment of the invention provides an optional solution. Fig. 2 shows a flowchart of a method for performing lexical and syntactic analysis on the original headline of a news item according to an embodiment of the invention. As shown in Fig. 2, the method may include the following steps S202 to S206.

Step S202: perform word segmentation on the original headline to obtain a plurality of word segments.

Step S204: perform part-of-speech tagging and entity category tagging on each of the word segments.

Step S206: based on the part-of-speech tag and entity category tag of each word segment, perform dependency parsing on the original headline to identify the dependency node index and dependency type of each word segment.
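
As a non-limiting illustration, the analysis result produced by steps S202 to S206 can be represented as one record per word segment. In the following Python sketch, the class name Token, the field names, the tag values, the dependency labels, and the hand-written example are all hypothetical; a real analysis result would come from a segmenter, a tagger, and a dependency parser.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    word: str               # word segment from step S202
    pos: str                # part-of-speech tag from step S204
    entity: Optional[str]   # entity category from step S204, or None
    head: int               # dependency node index from step S206, -1 for the root
    deprel: str             # dependency type from step S206


def find_root(tokens: List[Token]) -> int:
    """Return the index of the single root token of the parse."""
    roots = [i for i, t in enumerate(tokens) if t.head == -1]
    if len(roots) != 1:
        raise ValueError("a dependency parse must have exactly one root")
    return roots[0]


def children_of(tokens: List[Token], index: int) -> List[int]:
    """Return the indices of the tokens that depend directly on tokens[index]."""
    return [i for i, t in enumerate(tokens) if t.head == index]


analysis = [
    Token("Acme", "n", "organization", 1, "SBV"),
    Token("releases", "v", None, -1, "HED"),
    Token("new", "a", None, 3, "ATT"),
    Token("phone", "n", None, 1, "VOB"),
]
root = find_root(analysis)
dependents = children_of(analysis, root)
```

Storing the head index on every token is enough to rebuild the syntax tree, since the children of any node can be listed by scanning for tokens that point to it.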

In step S202, the method for performing word segmentation on the original headline may include a method based on string matching, a method based on semantic understanding, a method based on statistics, and so on; the embodiments of the present invention place no limitation on this.

The word segmentation method based on string matching, also called mechanical word segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). By scanning direction, string matching methods are divided into forward matching and reverse matching; by which length is matched first, they are divided into maximum (longest) matching and minimum (shortest) matching. Several commonly used mechanical word segmentation methods are:

1) forward maximum matching (scanning from left to right);

2) reverse maximum matching (scanning from right to left);

3) minimum segmentation (minimizing the number of words cut from each sentence);

4) bidirectional maximum matching (two scans, one from left to right and one from right to left).
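
As a non-limiting illustration, forward, reverse, and bidirectional maximum matching can be sketched in Python as follows. The function names, the max_len limit, the tiny dictionary, and the tie-breaking rule of the bidirectional variant (fewer words, then fewer single-character words) are assumptions made for this example and are not specified by the invention.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # single character as fallback
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_max_match(text, dictionary, max_len=4):
    """Run both scans and keep the result preferred by a simple heuristic."""
    forward = forward_max_match(text, dictionary, max_len)
    reverse = reverse_max_match(text, dictionary, max_len)
    if len(forward) != len(reverse):
        return min(forward, reverse, key=len)

    def singles(words):
        return sum(1 for w in words if len(w) == 1)

    return forward if singles(forward) < singles(reverse) else reverse


dictionary = {"研究", "研究生", "生命", "的", "起源"}
text = "研究生命的起源"
forward = forward_max_match(text, dictionary)
reverse = reverse_max_match(text, dictionary)
chosen = bidirectional_max_match(text, dictionary)
```

On this example the forward scan greedily takes "研究生" and is left with the stray character "命", while the reverse scan yields "研究 / 生命 / 的 / 起源"; the bidirectional variant keeps the reverse result because it contains fewer single-character words.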

In actual word segmentation, the above methods can also be combined; for example, forward maximum matching and reverse maximum matching can be combined into a bidirectional matching method. Because single Chinese characters can themselves be words, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching segments slightly more accurately than forward matching and runs into fewer ambiguities. Statistics show an error rate of 1/169 for forward maximum matching alone and 1/245 for reverse maximum matching alone. This precision is still far from sufficient for practical needs. Word segmentation systems in actual use treat mechanical segmentation as an initial segmentation step and rely on various other linguistic information to further improve accuracy. One approach is to improve the scanning method, known as feature scanning or marker segmentation: words with obvious characteristics are first identified and cut out of the string to be analyzed, and with these words as breakpoints the original string is divided into smaller strings for mechanical segmentation, which reduces the matching error rate. Another approach is to combine word segmentation with part-of-speech tagging, using the rich part-of-speech information to help segmentation decisions and, in turn, checking and adjusting the segmentation result during tagging, which greatly improves segmentation accuracy.

The word segmentation method based on semantic understanding recognizes words by having the computer simulate human understanding of a sentence. Its basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and to use syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Coordinated by the general control part, the word segmentation subsystem obtains syntactic and semantic information about words, sentences, and so on to resolve segmentation ambiguity; that is, it simulates the human process of understanding a sentence. This method requires a large amount of linguistic knowledge and information.

Segmentation based on statistics starts from the observation that, formally, a word is a stable combination of characters: the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects how credible a word is. The frequency of each combination of adjacent characters in a corpus can be counted and their mutual information computed. The mutual information of two Chinese characters X and Y is defined from their adjacent co-occurrence probability, and it reflects how tightly the two characters are bound. When this tightness exceeds a threshold, the character group is considered a possible word. The method only needs character-group frequencies from the corpus and no segmentation dictionary, so it is also called dictionary-free segmentation or statistical word extraction. It has limitations, however: it often extracts frequent character groups that are not words, such as "这一" (this one), "之一" (one of), "有的" (some), "我的" (my), and "许多的" (many); its recognition accuracy for common words is poor; and its time and space costs are high. Practical statistical segmentation systems therefore use a basic segmentation dictionary (a dictionary of common words) for string-matching segmentation and use statistical methods to identify new words. Combining string-frequency statistics with string matching keeps the speed and efficiency of matching-based segmentation while gaining the ability of dictionary-free segmentation to recognize new words in context and resolve ambiguity automatically.
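The mutual information calculation can be sketched as follows. This is a minimal illustration that assumes pointwise mutual information, log(P(XY) / (P(X)P(Y))), estimated from raw counts; the description does not give the exact formula or the threshold value, and the function names are hypothetical.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Pointwise mutual information for every adjacent character pair."""
    char_counts, pair_counts = Counter(), Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(zip(sentence, sentence[1:]))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi = {}
    for (x, y), n in pair_counts.items():
        p_xy = n / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi


def candidate_words(corpus, threshold):
    """Character pairs whose tightness exceeds the (assumed) threshold."""
    return {x + y for (x, y), value in adjacent_pmi(corpus).items() if value > threshold}
```

A real system would estimate these counts on a large corpus and, as the paragraph above notes, combine the result with a dictionary, because frequent non-word pairs also score highly.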

Another class of methods is based on statistical machine learning. A large amount of already segmented text is provided, and a statistical machine learning model learns the regularities of word segmentation from it (training) so that unseen text can then be segmented. Chinese characters differ in how readily they stand alone as words; some often appear as prefixes and others as suffixes. Combined with information on whether two adjacent characters form a word, this yields a great deal of segmentation-related knowledge. The method segments text by making full use of the regularities of Chinese word formation.

In step S204 above, each of the word segments is tagged with its part of speech. The part-of-speech categories may be noun, verb, adjective, adverb, conjunction, interjection, quantifier, and so on; the embodiment of the present invention does not limit this.

For the entity category labeling of each word segment in step S204, the embodiment of the present invention provides an optional solution: a sequence labeling model may be used to recognize the entity words among the word segments and label their entity categories. The entity category may be a person name, place name, organization name, brand name, software name, and so on; the embodiment of the present invention is not limited to these.

In practice, the sequence labeling model may be an HMM (Hidden Markov Model), an MEMM (Maximum Entropy Markov Model), a CRF (Conditional Random Field), and so on. Unlike a general classification problem, a sequence labeling model outputs a sequence of labels. The labels are usually interrelated, and these relations constitute structural information. By exploiting this structural information, sequence labeling models often outperform traditional classification methods on sequence labeling problems.
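How a sequence labeling model uses the dependence between labels can be illustrated with Viterbi decoding for an HMM. The sketch below is illustrative only: the tag set (B, E, O) and all probabilities are made-up toy values, and the description does not state which model or parameters the embodiment uses.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for obs under a toy HMM."""
    def lp(table, key):
        # Log-probability with a small floor for unseen events.
        return math.log(max(table.get(key, 0.0), floor))

    score = [{s: lp(start_p, s) + lp(emit_p[s], obs[0]) for s in states}]
    back = []
    for symbol in obs[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + lp(trans_p[p], s))
            cur[s] = prev[best] + lp(trans_p[best], s) + lp(emit_p[s], symbol)
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical toy parameters: B/E mark the first/last character of an entity.
STATES = ["B", "E", "O"]
START_P = {"B": 0.5, "O": 0.5}
TRANS_P = {"B": {"E": 1.0}, "E": {"B": 0.3, "O": 0.7}, "O": {"B": 0.3, "O": 0.7}}
EMIT_P = {
    "B": {"湖": 0.5, "安": 0.5},
    "E": {"北": 0.5, "陆": 0.5},
    "O": {"突": 0.5, "降": 0.5},
}
```

The transition table is what distinguishes this from per-character classification: a B tag can only be followed by E here, so the decoder scores whole label sequences rather than isolated labels.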

Examples of the dependency types mentioned in step S206 above are given in Table 1. The dependency types and examples in Table 1 are illustrative only and do not limit the embodiment of the present invention.

Table 1

Dependency type | Tag | Description | Example
Subject-verb relationship | SBV | subject-verb | 我送她一束花 (I give her a bunch of flowers): 我 <-- 送
Verb-object relationship | VOB | direct object, verb-object | 我送她一束花 (I give her a bunch of flowers): 送 --> 花
Indirect-object relationship | IOB | indirect object, indirect-object | 我送她一束花 (I give her a bunch of flowers): 送 --> 她
Fronted object | FOB | fronted object, fronting-object | 他什么书都读 (He reads every kind of book): 书 <-- 读
Attribute-head relationship | ATT | attribute | 红苹果 (red apple): 红 <-- 苹果
Adverbial-head structure | ADV | adverbial | 非常美丽 (very beautiful): 非常 <-- 美丽
Verb-complement structure | CMP | complement | 做完了作业 (finished the homework): 做 --> 完
Coordinate relationship | COO | coordinate | 大山和大海 (mountains and sea): 大山 --> 大海
Preposition-object relationship | POB | preposition-object | 在贸易区内 (within the trade zone): 在 --> 内
Independent structure | IS | independent structure | Two clauses that are structurally independent of each other
Core relationship | HED | head | The core of the whole sentence
Pivotal construction | DBL | double | 他请我吃饭 (He invites me to dinner): 请 --> 我

In step S206 above, dependency parsing is performed on the original news headline based on the part-of-speech tag and entity category label of each word segment, and the dependency node index and dependency type of each word segment are identified. The embodiment of the present invention provides an optional solution for this: the grammatical components of the original news headline are identified from the part-of-speech tags and entity category labels of the word segments, and the dependency relations among the identified components are then analyzed to obtain the dependency node index and dependency type of each word segment.

Building on this dependency parsing, when step S104 above extracts the sentence stem content of the original news headline from the analysis result, it may generate a syntax tree from the part-of-speech tag, entity category label, dependency node index, and dependency type of each word segment, and then produce the sentence stem content of the original news headline by filtering and pruning the syntax tree.

Fig. 3 is a flowchart of a method for extracting the sentence stem content of an original news headline according to an embodiment of the present invention. As shown in Fig. 3, the method may include the following steps S302 to S306.

Step S302: select the head node corresponding to the core relationship among the dependency types as the stem predicate.

Step S304: if the part of speech of the head node is a noun, merge all shallowly dependent nouns of the specified categories into it to update the predicate; if the part of speech of the head node is a verb, set the head node as the predicate verb.

Step S306: identify negation modifiers and merge them into the predicate.

In an optional embodiment of the present invention, the subject-predicate relationship node may also be identified: the nodes surrounding the subject are merged, coordinate nodes keep only their noun parts according to the subject rules while the remaining nodes are pruned, and the subject node is set. In addition, depending on the object type, if the object is a noun, the object is identified, all coordinate nodes are removed, and the object node is set.

The embodiment of the present invention uses lexical and syntactic analysis to produce a compressive summary of a news headline, extracting the stem content of the headline while retaining as much of the key information in the original headline as possible, so that a more accurate and rigorous news headline is obtained.

In step S106 above, the quality of the news candidate headline is evaluated using the summary quality evaluation strategy for news headlines. The embodiment of the present invention provides an optional solution for this. Fig. 4 is a flowchart of a method for evaluating the quality of a news candidate headline according to an embodiment of the present invention. As shown in Fig. 4, the method may include the following steps S402 to S406.

Step S402: apply a neural machine translation model to the original news headline to compress it, obtaining a news measurement headline.

In this step, the neural machine translation model may be trained in advance. For example, it may be trained with Seq2Seq combined with an attention mechanism on historical data pairs that passed review before going online and on manually labeled data sets.

Step S404: for the news measurement headline and the news candidate headline, use a language model to compute the quality score of each sentence under that language model.

Step S406: use the computed quality scores as the evaluation result for the quality of the news candidate headline.

After the computed quality scores have been taken as the evaluation result in steps S402 to S406, the news summary headline can be determined from the evaluation result.

Fig. 5 is a flowchart of a method for determining the news summary headline from the evaluation result according to an embodiment of the present invention. As shown in Fig. 5, the method may include the following steps S502 to S504.

Step S502: among the news measurement headline and the news candidate headline, determine the headline with the highest computed quality score as the headline to be selected.

Step S504: if the quality score of the headline to be selected is greater than the quality score threshold, judge whether the headline to be selected satisfies the preset review conditions; if so, determine the headline to be selected as the news summary headline.

Here, whether the headline to be selected satisfies the preset review conditions may include at least one of the following:

In practice, the headline to be selected may be determined as the news summary headline when it satisfies only one of the preset review conditions, when it satisfies any combination of two or more of them, or when it satisfies all of them. For example, it may first be judged whether the headline to be selected has a subject-predicate structure. If so, it is judged whether the predicate contains a verb component. If it does, it is judged whether the edit distance between the headline to be selected and the original news headline is less than the edit distance threshold. If it is, it is judged whether the semantic distance between the headline to be selected and the original news headline is less than the semantic distance threshold. If it is, the headline to be selected is determined as the news summary headline.

In an optional embodiment of the present invention, after the news summary headline is determined from the evaluation result, it may also be provided to a real-time hotspot product module, which displays it as a real-time hotspot. In practice, the real-time hotspot product module may display the news summary headline as a real-time hotspot on the search results page, which can improve the user's search experience and raise the click-through rate of the search result items generated by the search engine. As shown in Fig. 6, the news summary headline is displayed as a real-time hotspot on the search results page for the query "乡村振兴" (rural revitalization).

Various implementations of each stage of the embodiment shown in Fig. 1 have been described above. The implementation of the news headline summarization method of the present invention is described in detail below through a specific embodiment.

Fig. 7 is a flowchart of a method for summarizing news headlines according to another embodiment of the present invention. As shown in Fig. 7, the method may include the following steps S702 to S708.

Step S702: crawl news resources on the Internet and extract the original headline of each news item.

Step S704: apply word segmentation, lexical analysis, syntactic analysis, and entity recognition to the original news headline, and extract its sentence stem content.

Step S706: use a neural machine translation model to generate the corresponding rewrite candidates.

Step S708: use a language model and semantic features to evaluate rewrite quality, and automatically review the high-quality rewrites.

The embodiment of the present invention uses syntactic analysis to produce a compressive summary of the original news headline, extracting its stem content while retaining as much of the key information in the original news as possible. It also introduces a rewrite-summary quality score model to evaluate the rewritten summaries and automatically reviews those of good quality, which lowers the cost of manual editorial review and greatly reduces the delay in pushing summaries that manual review causes.

The implementation of each part is described in detail below using a concrete example in which the original news headline is "湖北安陆突降大雪压垮菜市场已救出13人" (Sudden heavy snow in Anlu, Hubei crushes a vegetable market; 13 people rescued).

(1) Model pre-training and acquisition of existing models

A neural machine translation model is trained with Seq2Seq combined with an attention mechanism on historical data pairs that passed review before going online and on manually labeled data sets. The training tool is 360's existing neural machine translation toolkit.

The training data is a parallel corpus in the following format:

Ori: 银行客户经理违规放贷160万其中138万未能收回 (Bank account manager illegally lent 1.6 million, of which 1.38 million was not recovered)

Sum: 银行客户经理违规放贷 (Bank account manager illegally lent money)

360's existing language model is obtained and used to compute the rewrite quality score.

(2) Headline acquisition and lexical analysis of the headline

Original news headlines are obtained from the crawl logs of the web crawler.

The format is: url_id+\t+url_title+\t+crawl_time.
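Reading one record in this tab-separated format can be sketched in Python as follows. The field names come from the format string above; the `CrawlRecord` type, the function name, and the sample values used to exercise it are hypothetical, and `crawl_time` is kept as a string because its format is not specified.

```python
from typing import NamedTuple


class CrawlRecord(NamedTuple):
    url_id: str
    url_title: str
    crawl_time: str  # format not specified in the description, so kept as text


def parse_crawl_line(line):
    """Split one crawl-log line of the form url_id<TAB>url_title<TAB>crawl_time."""
    url_id, url_title, crawl_time = line.rstrip("\n").split("\t")
    return CrawlRecord(url_id, url_title, crawl_time)
```

A line with a different number of tab-separated fields raises `ValueError`, which makes malformed log lines visible instead of silently misaligning the headline field.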

Lexical analysis is a basic step in natural language processing. The part-of-speech tags, dependency relations, and entity label types it produces are the basic features on which later techniques such as sentence stem extraction and compressive summarization depend. Calling the existing 360 word segmentation module yields:

Example: 湖北安陆突降大雪压垮菜市场已救出13人 (Sudden heavy snow in Anlu, Hubei crushes a vegetable market; 13 people rescued)

After word segmentation: 湖北/ns 安陆/ns 突/d 降/v 大雪/n 压垮/v 菜市场/n 已/d 救出/v 13人/mq

Here the text before each "/" is the coarse-grained segmentation result, and the text after it is the part-of-speech tag.

Based on the segmentation result, proper names and entity words are recognized by sequence labeling.

The format of the raw data to be labeled is shown in the first column of Table 2, and the labels produced by the sequence labeling model are shown in the second and third columns. In Table 2, B marks the first character of an entity, E marks its last character, and LOC denotes a location. The examples listed here are illustrative only and do not limit the embodiment of the present invention.

Table 2

Character | Tag | Entity type
湖 | B | LOC
北 | E | LOC
安 | B | LOC
陆 | E | LOC
突 | O |
降 | O |
大 | O |
雪 | O |
压 | O |
垮 | O |
菜 | O |
市 | O |
场 | O |
已 | O |
救 | O |
出 | O |
13 | O |
人 | O |

The results in Table 2 are merged with the segmentation result.

After word segmentation and entity recognition:

湖北/ns/LOC 安陆/ns/LOC 突/d/ 降/v/ 大雪/n/ 压垮/v/ 菜市场/n/ 已/d/ 救出/v/ 13人/mq/
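The merge of character-level entity tags into the word-level segmentation can be sketched as follows. This is a minimal illustration under stated assumptions: tags are given one per character in text order, a word counts as an entity only when its first character is tagged B, its last is tagged E, and all its characters share one entity type, and the function name is hypothetical. Single-character entities are not handled because the B/E scheme described above does not define them.

```python
def merge_entities(tokens, char_tags):
    """Attach an entity type to each (word, pos) pair from per-character tags.

    tokens:    list of (word, pos) in sentence order
    char_tags: list of (char, tag, entity_type), one entry per character
    """
    merged, offset = [], 0
    for word, pos in tokens:
        span = char_tags[offset:offset + len(word)]
        offset += len(word)
        types = {entity_type for _, _, entity_type in span}
        is_entity = (
            len(types) == 1
            and "" not in types
            and span[0][1] == "B"
            and span[-1][1] == "E"
        )
        merged.append((word, pos, types.pop() if is_entity else ""))
    return merged
```

Joining each resulting triple with "/" gives strings such as `湖北/ns/LOC` and `突/d/`, the three-field form shown above.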

Here, split on "/", the first field is the coarse-grained segmentation result, the second is the part-of-speech tag, and the third is the entity category label.

Based on the segmentation and recognition results, the 360 basic syntactic analysis module is called to complete the syntactic analysis. The final lexical analysis result is:

湖北/ns/LOC/2/ATT (Hubei)

安陆/ns/LOC/4/SBV (Anlu)

突/d//4/ADV (suddenly)

降/v//0/HEAD (fall)

大雪/n//4/VOB (heavy snow)

压垮/v//4/COO (crush)

菜市场/n//6/VOB (vegetable market)

已/d//9/ADV (already)

救出/v//6/COO (rescue)

13人/mq//9/VOB (13 people)

Here, split on "/", the first field is the coarse-grained segmentation result, the second is the part-of-speech tag, the third is the entity category label, the fourth is the dependency node index from dependency parsing, and the fifth is the dependency type.
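The five-field records can be loaded into a small tree structure, sketched below in Python. The 1-based word positions and the use of index 0 for the root are inferred from the example (the word at position 4 carries index 0, and its dependents carry index 4); the `Token` type and function names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    index: int      # 1-based position in the headline (inferred from the example)
    word: str
    pos: str
    entity: str     # empty string when the word is not an entity
    head: int       # dependency node index; 0 marks the root
    relation: str


def parse_analysis(lines):
    """Parse records of the form word/pos/entity/head/relation."""
    tokens = []
    for i, line in enumerate(lines, start=1):
        word, pos, entity, head, relation = line.split("/")
        tokens.append(Token(i, word, pos, entity, int(head), relation))
    return tokens


def children_of(tokens):
    """Map each head index to its dependents, in sentence order."""
    children = defaultdict(list)
    for tok in tokens:
        children[tok.head].append(tok)
    return children


LINES = [
    "湖北/ns/LOC/2/ATT", "安陆/ns/LOC/4/SBV", "突/d//4/ADV", "降/v//0/HEAD",
    "大雪/n//4/VOB", "压垮/v//4/COO", "菜市场/n//6/VOB", "已/d//9/ADV",
    "救出/v//6/COO", "13人/mq//9/VOB",
]
TOKENS = parse_analysis(LINES)
TREE = children_of(TOKENS)
ROOT = TREE[0][0]
```

With this representation the root verb and its direct dependents can be read off directly, which is the starting point for the filtering and pruning rules that follow.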

(3) Extraction of the sentence stem content

A syntax tree is generated from the lexical analysis features produced in (2) above, and the sentence stem is generated by filtering and pruning that tree. The rules and algorithm are as follows:

Select the head node of the dependency parse as the stem predicate;

If the part of speech of the head node is a noun:

Merge all shallowly dependent nouns of the specified categories into it to update the predicate;

If the part of speech of the head node is a verb:

Set the head node as the predicate verb;

Identify negation modifiers and merge them into the predicate;

Identify the subject-predicate relationship node:

Merge the nodes surrounding the subject; for coordinate nodes, keep the noun parts according to the subject rules and prune the remaining nodes; then set the subject node;

According to the object type: if the object is a noun, identify the object, remove all coordinate nodes, and set the object node.

Original sentence: 湖北安陆突降大雪压垮菜市场已救出13人 (Sudden heavy snow in Anlu, Hubei crushes a vegetable market; 13 people rescued)

Sentence stem: 湖北安陆降大雪压垮菜市场 (Heavy snow in Anlu, Hubei crushes a vegetable market)
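One way to see how pruning the dependency tree can yield this stem is the simplified Python sketch below. It is not the rule set of the embodiment: it assumes a much smaller rule (keep the root, keep SBV, VOB, and ATT dependents at any depth, keep COO dependents only directly under the root, and drop everything else, including ADV), chosen because it reproduces the stem shown for this one example.

```python
from collections import defaultdict

# (word, head index, dependency type), taken from the analysis result of the example.
ROWS = [
    ("湖北", 2, "ATT"), ("安陆", 4, "SBV"), ("突", 4, "ADV"), ("降", 0, "HEAD"),
    ("大雪", 4, "VOB"), ("压垮", 4, "COO"), ("菜市场", 6, "VOB"), ("已", 9, "ADV"),
    ("救出", 6, "COO"), ("13人", 9, "VOB"),
]

KEEP_ANYWHERE = {"SBV", "VOB", "ATT"}  # assumed simplification


def extract_stem(rows):
    """Prune the dependency tree and return the kept words in sentence order."""
    children = defaultdict(list)
    for idx, (_, head, _) in enumerate(rows, start=1):
        children[head].append(idx)

    kept = set()

    def visit(idx, depth):
        kept.add(idx)
        for child in children[idx]:
            relation = rows[child - 1][2]
            if relation in KEEP_ANYWHERE or (relation == "COO" and depth == 0):
                visit(child, depth + 1)

    visit(children[0][0], 0)  # start from the root (head index 0)
    return "".join(rows[i - 1][0] for i in sorted(kept))
```

Under these assumed rules the adverbials "突" and "已" are dropped, and the second coordinate clause "救出13人" is pruned because it hangs from a coordinate node rather than from the root.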

(4) Rewriting and generalization with the neural machine translation model

For each original news headline, after word segmentation, the pre-trained neural machine translation model performs compressive summarization to generate candidates, and the sentence stem is added to the candidate set as well. Neural machine translation can compress and summarize sentences and articles end to end.

Input example: 湖北安陆突降大雪压垮菜市场已救出13人 (Sudden heavy snow in Anlu, Hubei crushes a vegetable market; 13 people rescued)

Output candidate set:

Sentence stem of the original: 湖北安陆降大雪压垮菜市场 (Heavy snow in Anlu, Hubei crushes a vegetable market)

Neural machine translation result: 湖北大雪压垮菜市场 (Heavy snow in Hubei crushes a vegetable market)

(5) Language-model-based review of headline rewrites

For the candidates produced for each headline, the language model is used to compute the score of each sentence under that model; the score is named quality_score.
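The embodiment uses an existing 360 language model whose details are not given. As a stand-in for illustration only, the sketch below computes a quality_score as the average log-probability of a token sequence under a toy bigram model with add-one smoothing; the model type, the smoothing, the length normalization, and the training sentences are all assumptions.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count bigrams over tokenized sentences padded with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    vocab = {t for tokens in sentences for t in tokens} | {"</s>"}
    return contexts, bigrams, len(vocab)


def quality_score(tokens, model):
    """Average add-one-smoothed bigram log-probability (higher is better)."""
    contexts, bigrams, vocab_size = model
    padded = ["<s>"] + list(tokens) + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        log_prob += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return log_prob / (len(padded) - 1)


# Hypothetical training data, far too small for real use.
TOY_MODEL = train_bigram_lm([["大雪", "压垮", "菜市场"], ["大雪", "压垮", "房屋"]])
```

Dividing by the number of scored transitions keeps short and long candidates comparable, which matters here because compressed headlines differ in length.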

(6) Rule-based filtering of high-quality headlines for automatic release

Initialize the following parameters:

quality_threshold,

jaccard_semantic_gap_threshold,

ed_semantic_gap_threshold;

For the rewrite candidates of each original headline:

final_candidate = the candidate with the highest quality score after all candidates are sorted by quality score.

For final_candidate, if its quality score is greater than quality_threshold:

if it has a subject-predicate structure and its predicate contains a verb component:

and both its edit distance and its Jaccard semantic distance from the original headline are smaller than the corresponding semantic_gap_threshold:

then final_candidate is the automatically reviewed compressive summary of that headline.
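The rules above can be sketched in Python as follows. The parameter names are taken from the text; everything else is an assumption made for illustration: the edit distance is plain Levenshtein distance over characters, the Jaccard distance is computed over character sets, and the grammar check is passed in as a caller-supplied predicate because it depends on the dependency parser.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over character sets (assumed granularity)."""
    sa, sb = set(a), set(b)
    if not sa | sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def auto_review(original, scored_candidates, quality_threshold,
                ed_semantic_gap_threshold, jaccard_semantic_gap_threshold,
                has_subject_predicate_with_verb):
    """Return the auto-approved headline, or None if any rule fails.

    scored_candidates: list of (headline, quality_score) pairs.
    has_subject_predicate_with_verb: hypothetical callable wrapping the parser.
    """
    final_candidate, score = max(scored_candidates, key=lambda pair: pair[1])
    if score <= quality_threshold:
        return None
    if not has_subject_predicate_with_verb(final_candidate):
        return None
    if edit_distance(final_candidate, original) >= ed_semantic_gap_threshold:
        return None
    if jaccard_distance(final_candidate, original) >= jaccard_semantic_gap_threshold:
        return None
    return final_candidate
```

For the running example, the stem differs from the original headline by seven character deletions, so it passes an edit-distance threshold of, say, 10; the threshold values themselves are not given in the description and would need tuning.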

The embodiment of the present invention greatly reduces the manual effort that traditional headline rewriting requires and resolves the inconsistent rewriting results caused by editors applying different subjective standards. Once obtained, the news summary headline can also be provided to the 360 Search real-time hotspot product, which can be shown on the search home page, on the right side of the search results page, on the browser home page, in 360 Navigation, and so on. After the product used this method to rewrite original news headlines and release them automatically, its click-through rate rose noticeably compared with the previous manual editing approach.

It should be noted that, in practice, all the optional implementations above may be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.

Based on the news headline summarization methods provided in the embodiments above, and on the same inventive concept, an embodiment of the present invention also provides a device for summarizing news headlines.

Fig. 8 is a structural diagram of a device for summarizing news headlines according to an embodiment of the present invention. As shown in Fig. 8, the device may include an acquisition module 810, an analysis module 820, an extraction module 830, and a determination module 840.

The functions of the components of the device and the connections between them are described below:

The acquisition module 810 is adapted to acquire the original headline of the news;

The analysis module 820, coupled to the acquisition module 810, is adapted to perform lexical and syntactic analysis on the original news headline to obtain an analysis result;

The extraction module 830, coupled to the analysis module 820, is adapted to extract the sentence stem content of the original news headline based on the analysis result and to use the extracted sentence stem content as the news candidate headline;

The determination module 840, coupled to the extraction module 830, is adapted to evaluate the quality of the news candidate headline using the summary quality evaluation strategy for news headlines, and then to determine the news summary headline from the evaluation result.

In an optional embodiment of the present invention, the acquisition module 810 is further adapted to:

In an optional embodiment of the present invention, as shown in Fig. 9, the analysis module 820 shown in Fig. 8 above may include:

A word segmentation unit 821, adapted to segment the original news headline into multiple word segments;

A labeling unit 822, coupled to the word segmentation unit 821, adapted to perform part-of-speech tagging and entity category labeling on each of the word segments;

An identification unit 823, coupled to the labeling unit 822, adapted to perform dependency parsing on the original news headline based on the part-of-speech tag and entity category label of each word segment, and to identify the dependency node index and dependency type of each word segment.

In an optional embodiment of the present invention, the method for segmenting the original news headline includes at least one of the following:

A statistics-based word segmentation method.

In an optional embodiment of the present invention, the labeling unit 822 is further adapted to:

In an optional embodiment of the present invention, the entity category includes any one of the following:

In an optional embodiment of the present invention, the identification unit 823 is further adapted to:

In an optional embodiment of the present invention, the extraction module 830 is further adapted to:

If the part of speech of the head node is a verb, set the head node as the predicate verb;

In an optional embodiment of the present invention, the determination module 840 is further adapted to:

In an optional embodiment of the present invention, whether the headline to be selected satisfies the preset review conditions includes at least one of the following:

In an optional embodiment of the present invention, as shown in Fig. 9, the device shown in Fig. 8 above may further include:

A providing module 910, adapted to provide the news summary headline to the real-time hotspot product module after the determination module 840 determines the news summary headline from the evaluation result, so that the real-time hotspot product module displays the news summary headline as a real-time hotspot.

Based on the same inventive concept, an embodiment of the present invention further provides a computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the news headline summarization method described above.

Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including: a processor; and a memory storing computer program code; when the computer program code is run by the processor, the computing device is caused to execute the news headline summarization method described above.

According to any one of the above optional embodiments or a combination of multiple optional embodiments, the embodiments of the present invention can achieve the following beneficial effects:

Those skilled in the art will clearly understand that, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; for brevity, they are not repeated here.

In addition, the functional units in the embodiments of the present invention may be physically independent of one another, two or more functional units may be integrated together, or all functional units may be integrated into one processing unit. The integrated functional units may be implemented in hardware, or in software or firmware.

Those of ordinary skill in the art will understand that, if the integrated functional unit is implemented in software and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied as a software product that is stored in a storage medium and includes a number of instructions which cause a computing device (for example, a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention when running the instructions. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Alternatively, all or part of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions (a computing device such as a personal computer, a server, or a network device). The program instructions may be stored in a computer-readable storage medium, and when they are executed by the processor of the computing device, the computing device executes all or part of the steps of the methods described in the embodiments of the present invention.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that, within the spirit and principles of the present invention, the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the corresponding technical solutions to depart from the protection scope of the present invention.

Claims

1. A method for summarizing a news headline, comprising:

obtaining the original headline of a news item, and performing lexical and syntactic analysis on the original headline to obtain an analysis result;

based on the analysis result, extracting the main content of the sentence in the original headline, and using the extracted main content as a candidate news headline;

evaluating the quality of the candidate news headline using a summary quality evaluation strategy for news headlines, and then determining the news summary headline according to the evaluation result.

2. The method according to claim 1, wherein obtaining the original headline of the news item comprises:

obtaining a crawl log, concerning news resources, captured by a web crawler;

extracting the original headline of the news item from the crawl log.

3. The method according to claim 1 or 2, wherein extracting the original headline of the news item from the crawl log comprises:

for each record concerning news resources in the crawl log, extracting the field value of a specific field of that record as the original headline of the news item.

4. The method according to any one of claims 1-3, wherein performing lexical and syntactic analysis on the original headline to obtain an analysis result comprises:

performing word segmentation on the original headline to obtain multiple segmented words;

performing part-of-speech tagging and entity category tagging on each of the multiple segmented words;

performing dependency syntax analysis on the original headline based on the part-of-speech tag and entity category tag of each segmented word, and identifying the dependency node index and dependency type of each segmented word.

5. The method according to any one of claims 1-4, wherein the method for performing word segmentation on the original headline includes at least one of the following:

a word segmentation method based on string matching;

a word segmentation method based on semantic understanding;

a word segmentation method based on statistics.

6. The method according to any one of claims 1-5, wherein performing entity category tagging on each of the multiple segmented words comprises:

using a sequence labeling model to identify the entity words among the multiple segmented words and to tag their entity categories.

7. The method according to any one of claims 1-6, wherein the entity category includes any one of the following:

person name, place name, organization name, brand name, software name.

8. A device for summarizing a news headline, comprising:

an acquisition module, adapted to obtain the original headline of a news item;

an analysis module, adapted to perform lexical and syntactic analysis on the original headline to obtain an analysis result;

an extraction module, adapted to extract, based on the analysis result, the main content of the sentence in the original headline, and to use the extracted main content as a candidate news headline;

a determining module, adapted to evaluate the quality of the candidate news headline using a summary quality evaluation strategy for news headlines, and then to determine the news summary headline according to the evaluation result.

9. A computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the method for summarizing a news headline according to any one of claims 1-7.

10. A computing device, comprising:

a processor; and

a memory storing computer program code;

wherein, when the computer program code is run by the processor, the computing device is caused to execute the method for summarizing a news headline according to any one of claims 1-7.