
CN102254014B - Adaptive information extraction method for webpage characteristics


Info

Publication number
CN102254014B
Authority
CN
China
Prior art keywords
name
result
information
text unit
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110205137
Other languages
Chinese (zh)
Other versions
CN102254014A (en)
Inventor
金海
李毅
赵峰
严奉伟
陈恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN 201110205137 priority Critical patent/CN102254014B/en
Publication of CN102254014A publication Critical patent/CN102254014A/en
Application granted granted Critical
Publication of CN102254014B publication Critical patent/CN102254014B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting information from academic homepages. Its steps are: (1) discover academic homepages on the Internet; (2) crawl and parse the academic homepages, using heuristic strategies to reduce the crawling of irrelevant pages and speed up parsing; (3) parse each page into a DOM tree and divide it according to the attributes and contents of its elements to obtain a list of cohesive text units; (4) use information recognizers to identify the text units, where each kind of recognizer identifies only one type of information, with additional subfield extraction for article information; (5) perform association analysis on the extraction results, using the correlations among information items to eliminate ambiguity and complete missing fields; (6) match the extraction results against the database to eliminate redundant data, and save the results as semantic data in a semantic database. By combining heuristic rules, machine learning methods, and conditional probability models, the invention can extract academic information from academic homepages efficiently and accurately.

Description

A Webpage-Feature-Adaptive Information Extraction Method

Technical Field

The invention belongs to the field of information extraction systems, and in particular relates to an information extraction method that adapts to webpage features. The method is especially suitable for extracting information such as author names, email addresses, institutional affiliations, and published articles from academic homepages.

Background Art

With the advent of the information age, the Internet has gradually become the main channel through which people share and obtain information; all kinds of information are published on the Internet in the form of web pages for people to read. However, with the explosive growth of online information, finding the desired information has become increasingly difficult: on the one hand, the volume of information is enormous; on the other hand, information is presented in very flexible and free-form ways, which raises the cost of identifying target information. Web page information extraction has therefore become a field worthy of research in the information age.

Web page information extraction technology developed out of traditional text information extraction. Unlike plain text, web page content is expressed in the Hypertext Markup Language (HTML); it contains text, images, and other multimedia, and tags may be nested within one another to form a tree structure. The main goal of a web page information extraction task is to extract target information from semi-structured web page text. Web page information usually has the following characteristics: (1) dispersion: information is not concentrated on a single site, but is published by different people on different sites; (2) heterogeneity: even the same kind of information is presented in different ways on different websites; (3) redundancy: the same information may appear repeatedly on multiple sites. Given these characteristics, a web page information extraction system needs strong adaptability and discrimination ability.

Early research on web page information extraction focused on rule-based methods, from scripted extraction based on regular expressions to the proprietary extraction languages developed later; the core idea is to extract specific patterns that contain the target information. How the patterns are obtained is the main difference among such systems. Some systems extract patterns manually, which yields more accurate patterns, but complex extraction tasks require very many patterns, so the labor cost is high. To reduce this cost, pattern learning systems based on automatic training were proposed: the system accepts a set of training examples in which the target information blocks are marked by hand, automatically induces candidate matching patterns from the examples, and applies the patterns to actual extraction tasks after verification and filtering. This approach has some automatic extraction ability, but because it still relies on rule matching underneath, it cannot achieve high accuracy on complex extraction tasks. In recent years, extraction methods have gradually shifted toward machine learning models, and methods originally developed for natural language understanding have been applied to information extraction with good results.

An academic homepage is a site used by researchers to present their basic personal information and research results. Different authors build different page templates to present personal information according to their preferences. Although page styles vary, academic homepages usually contain similar information, such as the author's name, institutional affiliation, contact information, projects, and article information. Collecting this information with an information extraction system is very valuable.

Summary of the Invention

The purpose of the present invention is to provide a webpage-feature-adaptive information extraction method that can extract the required information from academic homepages of different styles, with strong adaptability, high accuracy, and good extensibility.

The webpage-feature-adaptive information extraction method provided by the present invention is characterized in that it comprises the following steps:

Step 1: search the Internet for sites of the academic-homepage type;

Step 2: analyze each academic homepage found, treating its pages as sets of two-tuples (L, C), where L is a link's URL and C is the link's context; check whether L and C contain keywords, and if so proceed to Step 3, otherwise filter out the link;

Step 3: parse the link to obtain the document tree structure of the page, divide the page into text units T according to the attributes and contents of the tree nodes, and form the text unit set {T1, T2, ..., Tn};

Step 4: extract from the text unit set {T1, T2, ..., Tn} the four target fields: the author name N, email address M, institution information U, and article information set {P1, P2, ..., Pn}, as the preliminary extraction result;

Step 5: perform association analysis on the preliminary extraction result of Step 4, using the correlations among information items to eliminate ambiguity and complete missing fields, and store the resulting extraction result in the result database;

Step 6: match the elements of the article information set {P1, P2, ..., Pn} against the records in the result database to eliminate redundant data;

Step 7: output the extraction result.

The method combines machine learning algorithms, probabilistic models, and rule-based methods, and can extract the author's name, email address, institution information, published articles, and similar information from academic homepages of different styles. Specifically, the invention has the following effects and advantages:

(1) Strong adaptability

Academic homepages are written by many different researchers, with widely varying content and layout. The invention handles non-uniform page formats well and automatically adapts to the variations;

(2) High accuracy

The core algorithms of the invention are based on machine learning and probabilistic models, combined with heuristic rules, and achieve high accuracy when extracting each target field;

(3) Good extensibility

The invention can be extended to extract other fields from a page, and its recognition process can be applied to similar problems; extension is simple and the method is general.

Brief Description of the Drawings

Fig. 1 is the overall flowchart of the extraction process of the present invention;

Fig. 2 is the flowchart of author-name extraction in the present invention;

Fig. 3 is the flowchart of email extraction in the present invention;

Fig. 4 is the flowchart of institution-information extraction in the present invention;

Fig. 5 is the flowchart of article-information extraction in the present invention.

Detailed Description of the Embodiments

The present invention is described in detail below with reference to the accompanying drawings and an example.

The webpage-feature-adaptive information extraction method provided by the present invention comprises the following steps:

(1) Search the Internet for sites of the academic-homepage type. This process is divided into two stages: a search stage and a judgment stage.

In the search stage, a data set of author names is first exported from existing bibliographic data as seed data. Each author name in the data set is then used as a keyword in a search engine, which returns results as a list. Each search result usually consists of a title, a link, and a short snippet of summary text. The search engine usually returns multiple pages of results; the links and summary text of the first page of results are stored in a candidate result list.

In the judgment stage, the search results in the candidate list are first filtered according to their links and summary text. The filtering uses a database of confusing sites that frequently appear in search results, called the blocked-link database. The filtering strategy has two steps: first, check whether each result appears in the blocked-link database, and directly exclude any that do; then, for the remaining results, check whether the link follows the pattern of "~" plus the author's name, keeping it if so and excluding it otherwise. After these two filtering steps, each remaining result is processed in turn as follows: issue a page request for its link, and use a support vector machine classifier to judge whether the returned page is the author's academic homepage. If it is, save it as the author's academic homepage and end the judgment; otherwise continue with the next result.
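
The two-step filtering of the judgment stage can be sketched as follows; the blocked-host set, function names, and the (url, summary) result structure are illustrative assumptions, not part of the patent text:

```python
import re

# Hypothetical blocked-link database: sites that often pollute the results.
BLOCKED_HOSTS = {"en.wikipedia.org", "dblp.org"}

def filter_candidates(results, author_name):
    """Apply the two heuristic filtering steps to search results.

    `results` is a list of (url, summary) pairs; the structure is
    illustrative, not taken from the patent.
    """
    kept = []
    # Pattern: a URL path segment of the form "~" + author name (spaces dropped).
    tilde = re.compile(r"/~" + re.escape(author_name.replace(" ", "").lower()))
    for url, summary in results:
        host = url.split("/")[2] if "://" in url else url.split("/")[0]
        if host in BLOCKED_HOSTS:          # step 1: blocked-link database
            continue
        if tilde.search(url.lower()):      # step 2: "~" + author-name pattern
            kept.append((url, summary))
    return kept
```

The surviving candidates would then be fetched and passed to the SVM homepage classifier described above.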

(2) Analyze the author's academic homepage. The homepage is usually a complete site containing many subpages, some of which contain the target information the system needs while others are completely irrelevant. To improve crawling efficiency and avoid wasting computing resources by having subsequent modules deeply parse useless pages, the invention uses a filtering algorithm based on a heuristic strategy. The algorithm treats the page as a set of two-tuples (L, C), where L is a link's URL and C is the link's context, and checks whether L and C contain keywords such as "publication", "paper", or "research". If so, the link is parsed further (proceed to step (3)); otherwise the link is filtered out.
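
A minimal sketch of this heuristic (L, C) filter, assuming the three keywords named above:

```python
KEYWORDS = ("publication", "paper", "research")

def keep_link(url, context):
    """Heuristic link filter over the (L, C) two-tuple: keep the link
    only if the URL or its surrounding anchor text mentions one of the
    target keywords."""
    text = (url + " " + context).lower()
    return any(k in text for k in KEYWORDS)
```

In a real crawler the keyword list would be configurable; these three are simply the ones the description names.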

(3) Analyze the page to be parsed to obtain its document tree, and divide the page into small units, called text units T, according to the attributes and contents of the document tree nodes. The division result is the text unit set {T1, T2, ..., Tn}. The steps are as follows.

(a) First parse the page with an HTML parser to obtain the page's document tree. The nodes of the document tree correspond to the HTML tags in the page, and the tree presents the relationships among the tags in a tree structure.

(b) Then divide the page. HTML tags can be classified as block-level elements and inline elements; common block-level elements include BR, DIV, H1, H2, LI, UL, TH, TD, TR, and TABLE, and common inline elements include SPAN, BOLD, A, FONT, and IMG. An HTML page can be viewed as a collection of block-level elements, between which there are two kinds of relationships: parent-child and sibling. Block-level and inline elements can be nested within each other. The document tree presents these relationships as tree nodes: nodes containing block-level elements are called block-level nodes, and the others are called non-block-level nodes. The nodes of the document tree are traversed, and the page is divided by judging each node's category. The division steps are as follows:

(b1) Initially, the text unit set is empty;

(b2) Perform a depth-first traversal of the document tree to find all block-level nodes; for each block-level node Ni, generate a text unit Ti and assign Ni's corresponding content in the page to Ti;

(b3) For each block-level node Ni, judge whether it has non-block-level child nodes in the document tree; if so, assign the corresponding page content of all of its non-block-level children to Ti;

(b4) Add Ti to the text unit set;

(b5) End.

(c) After the traversal, the division of the page is complete and the text unit set {T1, T2, ..., Tn} is obtained.
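
A simplified version of the division into text units can be sketched with Python's standard HTML parser; the tag list and class names are illustrative, and unlike the method described above this sketch works on a tag stream rather than a full document tree:

```python
from html.parser import HTMLParser

# Illustrative subset of the block-level tags listed in the description.
BLOCK_TAGS = {"br", "div", "h1", "h2", "li", "ul", "th", "td", "tr", "table", "p"}

class TextUnitSplitter(HTMLParser):
    """Minimal sketch of the division step: text belonging to each
    block-level element is collected into its own text unit; inline
    elements do not open a new unit."""

    def __init__(self):
        super().__init__()
        self.units = []      # the text unit set {T1, ..., Tn}
        self._current = []   # text fragments of the unit being built

    def _flush(self):
        text = " ".join(self._current).strip()
        if text:
            self.units.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:   # a block-level node starts a new text unit
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self._current.append(data.strip())

def split_text_units(html):
    parser = TextUnitSplitter()
    parser.feed(html)
    parser._flush()
    return parser.units
```

Inline tags such as B, A, or FONT leave the current unit open, so their text stays attached to the enclosing block, matching step (b3).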

(4) Extract from the text unit set {T1, T2, ..., Tn} the four target fields: the author name N, email address M, institution information U, and article information set {P1, P2, ..., Pn}, as the preliminary extraction result.

The extraction methods for the different types of target fields are described below:

The extraction process for the author name N is shown in Fig. 2; its basic steps are as follows:

(a1) Use a support vector machine classifier to classify the text units in the set {T1, T2, ..., Tn}, keeping the set Tname of text units classified as author names;

(a2) Use the author name database to match the author-name parts of Tname. The author name database is prepared in advance; it collects and organizes common English given names and some Chinese pinyin names, and is used to match a set of candidate author names from Tname;

(a3) Extract the text of the title of the author's academic homepage. The title usually contains the author's name XXX in the form "XXX's Homepage"; extract the author name XXX from the title;

(a4) Match the author name XXX obtained in (a3) against the candidate author names obtained in (a2), and output the candidate that best matches XXX as the author name N.
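
Steps (a3) and (a4) can be sketched as follows; the title regex and the use of difflib similarity as the "matching degree" are assumptions for illustration, not the patent's exact matching rule:

```python
import re
from difflib import SequenceMatcher

def name_from_title(title):
    """Extract the author name from a homepage title of the form
    "XXX's Homepage" (a hypothetical regex; a real system would need
    several such patterns)."""
    m = re.match(r"(.+?)'s\s+Home\s?page", title, re.IGNORECASE)
    return m.group(1).strip() if m else None

def pick_author_name(candidates, title_name):
    """Step (a4): choose the candidate most similar to the title name."""
    score = lambda c: SequenceMatcher(None, c.lower(), title_name.lower()).ratio()
    return max(candidates, key=score)
```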

The extraction process for the email address M is shown in Fig. 3; its basic steps are as follows:

(b1) First use a support vector machine classifier to find the set TEmail of possible email candidate text units in {T1, T2, ..., Tn}. The classifier's input features include symbols commonly found in email information, such as "Email", "@", and "."; these feature symbols are located in the candidate text units to generate feature vectors. The classifier judges each candidate in TEmail from its feature vector; if the classification result is positive, proceed to (b2), otherwise filter the candidate out.

(b2) Remove the redundant parts of the email candidate text unit, such as the leading prefix "Email:"; removing such information helps the subsequent steps obtain a valid email address.

(b3) Next, use a fuzzy-matching state machine algorithm to match the email candidate text units. A standard email address has the following fields: a user name, the "@" symbol, one or more provider domain labels separated by dots, and a top-level domain. The algorithm builds a matching node for each field and uses a state machine to enumerate the possible matching forms, generating many different matching results, usually dozens.

(b4) Compare the fields of the email candidate text unit with the matching results, select the result with the highest matching degree as the final result, and convert it to a canonical, valid email format according to the standard email fields before output.
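
As a rough stand-in for steps (b2) to (b4), the sketch below uses regular expressions instead of the fuzzy-matching state machine; the handling of bracketed "(at)"/"(dot)" obfuscations is an assumed simplification of the kind of variation the state machine enumerates:

```python
import re

def normalize_email(text):
    """Simplified stand-in for the fuzzy-matching step: strip the
    'Email:' prefix, undo common textual obfuscations, then pull out
    a canonical user@domain.tld address. (A regex sketch, not the
    patent's state-machine algorithm.)"""
    t = re.sub(r"^\s*e-?mail\s*:\s*", "", text, flags=re.IGNORECASE)
    t = re.sub(r"\s*[\(\[]\s*at\s*[\)\]]\s*", "@", t, flags=re.IGNORECASE)
    t = re.sub(r"\s*[\(\[]\s*dot\s*[\)\]]\s*", ".", t, flags=re.IGNORECASE)
    m = re.search(r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+", t)
    return m.group(0) if m else None
```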

The extraction process for the institution information U is shown in Fig. 4; its basic steps are as follows:

(c1) First collect data on universities and research institutes worldwide from the Internet, including each institution's name and the link to its homepage, and build an institution homepage database. Build an inverted index over the database; the inverted index supports fast keyword lookup and can quickly determine the entries that contain a given set of keywords.

(c2) Use a support vector machine classifier to find the set TU of possible institution-information text units in {T1, T2, ..., Tn}. Convert each institution-information text unit in TU to plain text, look it up in the index as keywords, and take the top three retrieval results. Fuzzily match the three results against the corresponding text unit; if one can be matched, the text is determined to correspond to that institution, and the result with the highest matching degree is output. If none can be matched, proceed to (c3).

(c3) Search using the URL of the homepage. Academic sites are usually subsites of institutional sites, so match the homepage's domain name against the institution homepage database; if a matching record exists, the author is considered to belong to that institution, and the matching record is output as the result.
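
Step (c3)'s domain matching can be sketched as follows; the institution table and the suffix-walking strategy are illustrative assumptions about how the database lookup might work:

```python
from urllib.parse import urlparse

# Hypothetical institution homepage database: homepage domain -> name.
INSTITUTIONS = {
    "cs.uiuc.edu": "University of Illinois at Urbana-Champaign",
    "hust.edu.cn": "Huazhong University of Science and Technology",
}

def institution_from_url(homepage_url):
    """Fallback step (c3): walk up the homepage's domain suffixes and
    look each one up in the institution database."""
    host = urlparse(homepage_url).netloc
    labels = host.split(".")
    for i in range(len(labels) - 1):          # try host, then parent domains
        suffix = ".".join(labels[i:])
        if suffix in INSTITUTIONS:
            return INSTITUTIONS[suffix]
    return None
```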

The extraction process for the article information {P1, P2, ..., Pn} is shown in Fig. 5; its basic steps are as follows:

(a) First use a support vector machine classifier to classify the text units and screen out those that may contain article information. The classifier's accuracy is closely tied to the final recognition accuracy for article information, since it must filter out easily confused similar information such as course information, patents, and projects. Its accuracy depends mainly on two aspects: the training examples and the choice of features. The training examples are built iteratively, continually adding misclassified examples to the training set to correct the model; the feature vector consists of a set of discriminative word features. After classification, irrelevant text units are excluded, yielding the candidate article-information text units.

(b) Then perform sequence labeling on the candidate article-information text units to extract their subfields, including the author names, title, conference or journal name, and year. The sequence labeling algorithm is based on a conditional random field model that uses the following features:

① Text features

a) The token itself, in both its original and stemmed forms

b) Capitalization features: initial capital, all capitals, a single capital letter

c) Digit features: all digits, mixed digits and letters, Roman numerals

d) Punctuation features: commas, quotation marks, periods, etc.

e) HTML tag features: tag start, middle, and end

② Pattern features

a) Year patterns: 19XX or 20XX

b) Page-number pattern: XXX-XXX

③ Dictionary features

Author names, geographic locations, publishers, dates, conference and journal names, institution names

④ Terminology features

Vocabulary common in bibliographic data, such as pp, editor, volume, etc.

The above features are extracted from the candidate article-information text units; the feature functions in the conditional random field model are boolean, i.e. each function outputs yes or no. The model then computes the most likely labeling of the candidate text unit. Tokens with the same label are merged into the corresponding subfields, such as the author-name field, title field, conference/journal field, and year field, and these fields then receive their respective follow-up processing.

(c) The author-name field contains the entire author list and must be split into individual authors. The splitting algorithm is based on heuristic rules, relying mainly on name length, abbreviated forms, and punctuation. The split results are stored in an array.
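
A minimal sketch of such a heuristic splitter, using only delimiter rules (a subset of the rules described; the length and abbreviation heuristics are omitted):

```python
import re

def split_authors(author_field):
    """Heuristic author-list splitter: break on commas, semicolons,
    and the word 'and'."""
    parts = re.split(r",|;|\band\b", author_field)
    return [p.strip() for p in parts if p.strip()]
```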

The title field must be normalized and trimmed before it can serve as a final result. The main purpose of trimming is to remove invalid leading and trailing characters, such as stray punctuation and boundary errors.

In practice, a conference or journal name has many forms of expression, such as capitalized abbreviations and common informal names. The directly extracted conference/journal field cannot serve as the final result; it must be matched against the database. The bibliographic journal database collects common conference and journal names together with their abbreviations. First extract the capitalized abbreviation from the field to be recognized and look it up in the database; on a hit, fuzzily match the matched full name against the input field to guard against errors caused by abbreviation collisions. If it matches, output the result directly. Otherwise, build an index over the conference/journal names, retrieve the field to be matched from the index, and fuzzily match the retrieval results against the field. If a match is found, output the result.

The year field uses a rule-based method: regular expressions find valid year patterns in the input text. A valid year pattern has two forms: the first starts with 19 or 20 and is a four-digit number; the second starts with the capitalized abbreviation of a conference or journal name, followed by an apostrophe and the year. These two patterns handle the vast majority of real cases, with a recognition accuracy above 99%.
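
The two year patterns can be sketched with regular expressions; the two-digit pivot used to expand an abbreviated year is an assumed detail:

```python
import re

# Two illustrative patterns: a bare four-digit year (19XX/20XX), and a
# venue abbreviation followed by an apostrophe and a two-digit year,
# e.g. "SIGMOD'06".
YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")
VENUE_YEAR_RE = re.compile(r"\b[A-Z]{2,}\s*'(\d{2})\b")

def extract_year(text):
    m = YEAR_RE.search(text)
    if m:
        return int(m.group(0))
    m = VENUE_YEAR_RE.search(text)
    if m:
        yy = int(m.group(1))
        return 1900 + yy if yy >= 30 else 2000 + yy  # hypothetical pivot
    return None
```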

(5) Perform missing-field completion and disambiguation on the preliminary extraction result from step (4) (the author name N, email address M, institution information U, and article information set {P1, P2, ..., Pn}), obtain the final extraction result, and store it in the result database.

The information contained in actual pages may be missing or irregular to some degree, and multiple results may be recognized for the same information item, requiring further judgment. This process uses the correlations among information items to complete the extraction results and to adjudicate ambiguous results. The information correlations include the following:

(a)作者名和邮箱用户名之间的关联;(a) the association between the author's name and the email username;

(b)机构信息与主页域名之间的关联;(b) The association between institutional information and the domain name of the homepage;

(c) the association between the author name and the author lists in the article information.

Based on these associations, the extraction results can be completed: for example, when the institution information is missing, the homepage link can be queried in the database to obtain the corresponding institution information. For disambiguation, when multiple mailboxes are present, the correspondence between the author name and the email username can be used to rule out incorrect results.
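Association (a) can be sketched as follows; the name-variant rules (surname plus first initial, and so on) are illustrative guesses rather than an exhaustive reproduction of the patent's rules:

```python
def name_tokens(author):
    """Lowercased name parts plus simple initial-based username variants."""
    parts = author.lower().split()
    variants = set(parts)
    if len(parts) >= 2:
        first, last = parts[0], parts[-1]
        variants.add(last + first[0])   # e.g. "hanj"
        variants.add(first[0] + last)   # e.g. "jhan"
        variants.add(first + last)      # e.g. "jiaweihan"
    return variants

def pick_mailbox(author, candidates):
    """Prefer the mailbox whose username matches a variant of the author name."""
    for mail in candidates:
        user = mail.split("@", 1)[0].lower()
        if user in name_tokens(author):
            return mail
    return candidates[0] if candidates else None
```
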

(6) Match the elements of the article information set {P1, P2, ..., Pn} against the records in the result database to eliminate redundant data.

Although the extraction process is complete after the association analysis, the result may still contain redundant duplicate information. This step matches the extraction result against the records in the result database. When a matching record is found, the two are fuzzy-compared, and if the record in the result database is missing a relevant field, that field is completed. If no matching record is found in the result database, the extraction result is added to the result database.
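A minimal sketch of this match-then-complete merge, assuming records are dictionaries and using a character-level title similarity with an illustrative 0.85 threshold in place of the patent's fuzzy comparison:

```python
from difflib import SequenceMatcher

def same_record(a, b, threshold=0.85):
    """Fuzzy comparison on the title field (threshold is illustrative)."""
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio() >= threshold

def merge_into(database, record):
    """Complete missing fields of a matching record, or append a new one."""
    for row in database:
        if same_record(row, record):
            for key, value in record.items():
                if not row.get(key):   # fill only fields the stored record lacks
                    row[key] = value
            return database
    database.append(record)
    return database
```
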

(7) Output the extraction result.

Example:

Take the process of extracting information from the academic homepage http://www.cs.uiuc.edu/~hanj/ as an example. First, Jiawei Han is used as the search keyword in a search engine; the Wikipedia and DBLP results are excluded according to the blocked-link database, and page requests are sent for the top three remaining results. After the classifier's judgment, the first search result is selected as the author's academic homepage.

The page is parsed with an HTML parser to obtain its sub-links, and the following sub-pages are selected for further analysis according to the link keywords and context:

http://www.cs.uiuc.edu/homes/hanj/pubs/index.htm

https://agora.cs.illinois.edu/display/cs591han/Research+Publications+-+Data+Mining+Research+Group+at+CS%2C+UIUC

Each page to be analyzed is divided into text units. Taking the home page as an example, the following results are obtained:

[Figures BDA0000077464020000101 and BDA0000077464020000111: the text units obtained by dividing the home page]

A support vector machine classifies the above text units, determining them to be author name, irrelevant data, university information, mailbox, and article information respectively. According to the determined class, further extraction proceeds along different extraction flows; irrelevant data is discarded directly.

The author-name extraction process finds the name in the homepage title (Jiawei Han), the author name in the body text (Jiawei Han), and the author names contained in the article information (Jiawei Han, Xiaofei He, Deng Cai). After cross-matching, Jiawei Han is determined to be the final result.

The extraction of the mailbox information first removes the prefix part (E-mail:), then uses the fuzzy matching automaton to enumerate all possible mailbox matching results, such as:

hanj (username) at (separator) cs (domain name) . (dot) uiuc (domain name) . (dot) edu (domain name)

The results are scored by their degree of match, the best result is chosen as the legal form of the mailbox, and it is then converted into the legal form for output.
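A simplified sketch of recovering a legal address from an obfuscated one like the example above; accepting only "at"/"(at)" and "dot"/"(dot)" as separators is an assumption, and the patent's automaton enumerates and scores candidates more generally:

```python
import re

AT = {"at", "(at)", "@"}
DOT = {"dot", "(dot)", "."}

def normalize_email(text):
    """Rebuild user@domain from an obfuscated address, or return None."""
    tokens = re.findall(r"\(?[\w.-]+\)?|@|\.", text.lower())
    try:
        # Position of the username/domain separator.
        i = next(k for k, t in enumerate(tokens) if t in AT)
    except StopIteration:
        return None
    user = "".join(t for t in tokens[:i] if t not in DOT)
    domain = [t for t in tokens[i + 1:] if t not in DOT]
    return user + "@" + ".".join(domain) if user and domain else None
```
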

In the institution-information extraction process, text units classified as institution information are searched in the institution index. In this example, "Univ. of Illinois at Urbana-Champaign" is used as the search keyword; the first record in the retrieval results is "University of Illinois at Urbana-Champaign", and fuzzy matching determines that the two agree, so the result can be output directly.
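Selecting among the top retrieval results can be sketched with a character-level similarity score; the 0.75 threshold is an illustrative assumption standing in for the patent's fuzzy matcher:

```python
from difflib import SequenceMatcher

def best_institution(candidate, top_results, threshold=0.75):
    """Pick the best of the top retrieval results, or None if none match."""
    scored = [(SequenceMatcher(None, candidate.lower(), r.lower()).ratio(), r)
              for r in top_results]
    score, name = max(scored)
    return name if score >= threshold else None
```

When no result clears the threshold, the method falls back to matching the homepage's domain name against the institution database, as described in the claims.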

The article information needs to be labeled with a sequence labeling algorithm to recognize the author names in it. For example, the article information found above is labeled in the following form:

<author>Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han,</author><title>Graph Cube: On Warehousing and OLAP Multidimensional Networks,</title><conference>Proc. of 2011 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'11),</conference><location>Athens, Greece,</location><time>June 2011</time>

Recognizing each sub-field separately completes the recognition of the article information. The missing and ambiguous results are then completed and judged according to the associations between the pieces of information, and the results are merged with the result database.
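Once the sequence labeler has produced a labeled record, the sub-fields can be recovered with a small tag parser; this sketch assumes well-nested, consistently cased tag names:

```python
import re

def parse_tagged(record):
    """Turn a sequence-labeled string into a {label: text} dict."""
    fields = {}
    # \1 backreference requires the closing tag to match the opening one.
    for label, text in re.findall(r"<(\w+)>(.*?)</\1>", record):
        fields[label] = text.strip().rstrip(",")
    return fields
```
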

The present invention is not limited to the specific embodiments described above; based on the disclosure of the present invention, those skilled in the art can implement the present invention in various other specific embodiments. Therefore, any design that makes simple changes or modifications while adopting the design structure and ideas of the present invention falls within the protection scope of the present invention.

Claims (2)

1. An adaptive information extraction method for webpage characteristics, characterized in that the method comprises the following steps:
Step 1: search the Internet for websites whose type is academic homepage;
Step 2: analyze the academic homepage found; regard the academic homepage as a set of two-tuples (L, C), where L is a URL linked from the academic homepage and C is the context of the link in the academic homepage; check whether L and C contain keywords; if so, proceed to Step 3, otherwise filter out the link;
Step 3: analyze the linked page, obtain the document tree structure of the page, and divide the page into text units T according to the attributes and content of the tree nodes, forming the text unit set {T1, T2, ..., Tn}; the steps are as follows:
(a) first parse the page with an HTML parser to obtain the document tree of the page; the nodes of the document tree correspond to the HTML tags in the page, and the document tree presents the relations between the HTML tags of the page in a tree structure;
(b) then divide the page; HTML tags are divided into block-level elements and inline elements, and an HTML page can be regarded as a set of block-level elements; two relations exist between block-level elements: parent-child and sibling; block-level and inline elements can be nested within each other, and the document tree presents these relations in the form of tree nodes; nodes containing block-level elements in the document tree are called block-level nodes, and the other nodes are called non-block-level nodes; traverse the nodes of the document tree and divide the page according to the classification of the nodes; the division steps are as follows:
(b1) initially, the text unit set is empty;
(b2) perform a depth-first traversal of the document tree to find all block-level nodes; for each block-level node Ni, generate a text unit Ti and assign the page content corresponding to Ni to Ti;
(b3) for each block-level node Ni, determine whether it has non-block-level child nodes in the document tree; if so, assign the page content corresponding to all of its non-block-level child nodes to Ti;
(b4) add Ti to the text unit set;
(b5) end;
(c) after the traversal finishes, the division of the page is complete, yielding the text unit set {T1, T2, ..., Tn};
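The division steps (b1) to (b5) can be sketched as follows, with the document-tree traversal simplified to Python's streaming HTML parser and a small illustrative set of block-level tags:

```python
from html.parser import HTMLParser

# Illustrative subset of block-level HTML tags.
BLOCK = {"div", "p", "table", "ul", "ol", "li", "h1", "h2", "h3", "blockquote"}

class TextUnitDivider(HTMLParser):
    """Opens a new text unit at every block-level tag and pours inline/text
    content into the current unit, a simplification of steps (b1)-(b4)."""
    def __init__(self):
        super().__init__()
        self.units = []              # (b1): the text unit set starts empty

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            self.units.append("")    # (b2): new unit Ti for block node Ni

    def handle_data(self, data):
        if self.units and data.strip():
            self.units[-1] += data.strip() + " "  # (b3): content goes to Ti

def divide(html):
    d = TextUnitDivider()
    d.feed(html)
    return [u.strip() for u in d.units if u.strip()]
```
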
Step 4: extract four target fields, namely the author name N, the mailbox M, the institution information U, and the article information set {P1, P2, ..., Pm}, from the text unit set {T1, T2, ..., Tn} as the preliminary extraction result; for different types of target fields, the extraction methods are as follows:
The extraction process of the author name N is as follows:
(a1) classify the text units in the text unit set {T1, T2, ..., Tn} using the support vector machine classification algorithm, retaining the set TName of text units whose class is author name;
(a2) use an author name database to match author name character segments in TName, obtaining the set of candidate author names from TName;
(a3) extract the words in the title of the author's academic homepage, obtaining the author name XXX in the title of the author's academic homepage;
(a4) match the author name XXX obtained in (a3) against the candidate author names obtained in (a2), and output the name with the highest degree of match to XXX as the author name N;
The extraction process of the mailbox M is as follows:
(b1) first use the support vector machine classifier to find the set TEmail of possible mailbox candidate text units in the text unit set {T1, T2, ..., Tn}; the support vector machine algorithm judges each mailbox candidate text unit in TEmail according to its feature vector; if the classification result is positive, proceed to (b2), otherwise filter it out directly;
(b2) remove the redundant parts of the mailbox candidate text unit;
(b3) use the fuzzy matching state machine algorithm to match the mailbox candidate text unit against the standard mailbox; a standard mailbox has the fields username @ (provider domain name .)+ top-level domain, generating different matching results;
(b4) compare each field of the standard mailbox with the matching results of the mailbox candidate text unit, choose the result with the highest degree of match as the final result, and convert it to the standard legal mailbox format for output according to the standard mailbox fields;
The extraction process of the institution information U is as follows:
(c1) first collect data on universities and research institutes around the world from the Internet, including institution names and the links of their corresponding homepages, to build an institution homepage database; build an inverted index for the database;
(c2) use the support vector machine classifier to find the set TU of possible institution information text units in the text unit set {T1, T2, ..., Tn}; convert the institution information text units in TU to text form, search them as keywords in the index, and obtain the top three retrieval results; fuzzy-match the top three retrieval results against the corresponding institution information text unit; if one matches, determine that the text corresponds to that institution and output the matching result with the highest degree of match; otherwise, if none can be matched, proceed to (c3);
(c3) use the URL of the academic homepage to search; an academic website is usually a sub-site of an institution's website, so match the domain name of the academic homepage against the institution homepage database; if a matching record exists, consider the author to belong to that institution and output the matching record as the result;
The extraction process of the article information set {P1, P2, ..., Pm} is as follows:
(a) first classify the text units with the support vector machine classification algorithm, selecting the text units that may contain article information;
(b) then perform sequence labeling on the candidate article information text units to extract each sub-field of the candidate text: extract text category features, pattern features, dictionary features and term features from the candidate article information text units for use in a conditional random field model; the feature functions in the conditional random field model use Boolean expressions, that is, the function output is yes or no; through the computation of the conditional random field model, the most probable labeling of the candidate article information text unit is produced; symbols with the same label can be merged into the corresponding sub-field, and each of these fields is then subjected to the corresponding subsequent processing;
(c) the author name field contains the whole author list and needs to be split into single-author form; the splitting algorithm is based on heuristic rules, according to the length of names, abbreviated forms, and punctuation marks; the split results are stored in an array;
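The author-list splitting of step (c) can be sketched with punctuation and connective heuristics; the patent's additional cues (name length, abbreviated forms) are not reproduced here:

```python
import re

def split_authors(field):
    """Split a full author list into single names on commas and 'and'."""
    # ", and X" collapses to a single separator; "X and Y" also splits.
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", field)
    return [p.strip() for p in parts if p.strip()]
```
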
The title field is standardized and segmented as the final result;
The conference or journal name is matched against the bibliographic database: first extract the uppercase abbreviation part of the field to be identified and look it up in the database; if it matches, fuzzy-match the matched full name against the field to be identified, preventing errors caused by conflicting abbreviations; if they match, output the result directly; otherwise build an index of conference and journal names, retrieve the field to be identified in the index, and fuzzy-match the retrieval results against the field to be identified; if a match is found, output the result;
The time field uses a rule-based method, using regular expressions to find legal time patterns in the input text;
Step 5: perform association analysis on the preliminary extraction result obtained in Step 4, using the associations between pieces of information to disambiguate, and complete the missing fields, obtaining the extraction result and storing it in the result database;
Step 6: match the elements of the article information set {P1, P2, ..., Pm} against the records in the result database to eliminate redundant data;
Step 7: output the extraction result.
2. The information extraction method according to claim 1, characterized in that Step 1 is divided into two stages: a search stage and a decision stage;
In the search stage, a data set of author names is first derived as seed data from existing literature data, and each author name in the data set is then used as a keyword to query a search engine; the search engine returns the retrieval results as a list, each retrieval result consisting of a title, a link feature and an abstract text, and the link features and abstract texts of the retrieval results on the first returned page are stored in a candidate result list;
In the decision stage, the candidate result list is first filtered according to the link features and abstract texts of the retrieval results in the following manner: first check whether the link is present in the blocked-link database and directly exclude any result found in that database; then, for the remaining retrieval results, check whether the link feature presents the pattern "~" + author name; if so, keep the result, otherwise exclude it directly; for each retrieval result remaining after these two filtering steps, the following operations are then performed in turn: send a page request according to the link feature, and use the support vector machine classification algorithm to judge whether the returned page is the author's academic homepage; if so, save it directly as the author's academic homepage and end the judgment; otherwise continue to perform the same operations on the next retrieval result.
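The two filtering steps of the decision stage can be sketched as follows; matching the "~" + author-name pattern by looking for the surname after a tilde is a simplifying guess at the claim's rule, not its exact formulation:

```python
import re

def candidate_filter(results, author, blocked):
    """Drop blocked links, then keep links showing a '~' + name pattern."""
    key = author.split()[-1].lower()     # e.g. 'Han' -> look for ~...han...
    kept = []
    for url, snippet in results:
        if any(b in url for b in blocked):
            continue                     # step 1: blocked-link database
        if re.search(r"~\w*" + re.escape(key) + r"\w*", url.lower()):
            kept.append((url, snippet))  # step 2: tilde + author-name pattern
    return kept
```

The surviving candidates would then be fetched and passed to the SVM homepage classifier, as the claim describes.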
CN 201110205137 2011-07-21 2011-07-21 Adaptive information extraction method for webpage characteristics Expired - Fee Related CN102254014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110205137 CN102254014B (en) 2011-07-21 2011-07-21 Adaptive information extraction method for webpage characteristics


Publications (2)

Publication Number Publication Date
CN102254014A CN102254014A (en) 2011-11-23
CN102254014B true CN102254014B (en) 2013-06-05

Family

ID=44981278



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620608A (en) * 2008-07-04 2010-01-06 全国组织机构代码管理中心 Information collection method and system
CN101727498A (en) * 2010-01-15 2010-06-09 西安交通大学 Automatic extraction method of web page information based on WEB structure

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004046312A (en) * 2002-07-09 2004-02-12 Nippon Telegr & Teleph Corp <Ntt> Site manager information extraction method and device, site manager information extraction program, and recording medium with the program recorded
US20100083095A1 (en) * 2008-09-29 2010-04-01 Nikovski Daniel N Method for Extracting Data from Web Pages




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Jin Hai

Inventor after: Li Yi

Inventor after: Zhao Feng

Inventor after: Yan Fengwei

Inventor after: Chen Heng

Inventor before: Jin Hai

Inventor before: Li Yi

Inventor before: Zhao Feng

Inventor before: Yan Fengwei

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: JIN HAI LI YI ZHAO FENG YAN FENGWEI TO: JIN HAI LI YI ZHAO FENG YAN FENGWEI CHEN HENG

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130605

Termination date: 20200721

CF01 Termination of patent right due to non-payment of annual fee