CN100447793C - Extraction Method of Page Query Interface Based on Visual Feature - Google Patents

Info

Publication number
CN100447793C
Authority
CN
China
Prior art keywords
query interface
block
blocks
label
class
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2007100195438A
Other languages
Chinese (zh)
Other versions
CN101004760A (en)
Inventor
崔志明
赵朋朋
方巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Application filed by Suzhou University
Priority to CNB2007100195438A
Publication of CN101004760A
Application granted
Publication of CN100447793C
Current legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting a page query interface based on visual features. A page document containing a query interface is first obtained; a visual block tree is built for the page document using a vision-based document segmentation method; the query interface area is located; label blocks are identified using visual features; and visual features are then used to group control blocks with label blocks, which determines the controls in the query interface and their corresponding attribute labels and achieves automatic extraction of the query interface. Automatic extraction of query interfaces provides a basis for integrated search over the deep web. Experiments show that the method is feasible and achieves relatively high precision. Applied to integrated deep web search, the invention can improve search accuracy and thereby improve users' working efficiency.

Figure 200710019543

Description

Extraction Method of Page Query Interface Based on Visual Feature

Technical Field

The invention relates to an information retrieval method, and in particular to a search method based on information extraction technology, used to automatically extract deep web query interfaces.

Background Art

The Internet holds a vast number of information pages. Search engines normally reach these pages through web crawlers, so that visitors can retrieve the pages they need by keyword. However, with the widespread use of Web databases, the Internet is rapidly "deepening": a large share of its pages are generated dynamically by back-end databases. The information on these pages cannot be reached directly through static links; it can only be obtained by filling in a form and submitting a query. Because traditional web crawlers cannot fill in forms, they cannot reach these pages, and existing search engines therefore cannot provide information from database-generated pages. This information is hidden from, and invisible to, search engine users, and is called the deep web (Deep Web, also known as the Invisible Web or Hidden Web). The deep web is the counterpart of the surface web (Surface Web). The concept was first proposed by Dr. Jill Ellsworth in 1994 and refers to web pages whose content is difficult for ordinary search engines to discover. Deep web information is generally stored in databases and, compared with static web pages, usually offers a larger volume of information, more focused topics, better quality, better structure, and faster growth. Studies indicate that the deep web holds 500 times as much information as the surface web and comprises nearly 450,000 deep web sites. Large-scale deep web data integration is therefore an effective way to make deep web information convenient for users.

Large-scale integrated deep web search requires solving five key problems: 1) data source discovery (Deep Web Discovery); 2) query interface extraction (Query Interface Extraction); 3) data source classification (Source Classification); 4) query translation (Query Transfer); and 5) result merging (Result Merging). Of these, query interface extraction is the most basic and most important for obtaining information from Web databases. Automatic extraction of query interfaces is a key step toward large-scale integrated retrieval and underlies Web database modeling, deep web classification, query interface schema matching, and the construction of unified query interfaces.

In the prior art, a web query interface usually resides in a searchable surface web page and is built by hand in HTML. As a result, query interfaces on different pages follow different patterns, semantically related content is scattered across the HTML text, and the content, presentation, and query capabilities of interfaces all vary. Automatically obtaining the attributes of a query interface and understanding the information semantically related to those attributes is therefore very challenging.

Although automatic query interface extraction matters to many applications, few methods address it specifically. Examples include the following works: B. Chidlovskii and A. Bergholz, Crawling for domain-specific hidden web resources, in Proceedings of the 4th International Conference on Web Information Systems Engineering, 2003; H. He, W. Meng, C. Yu, and Z. Wu, Wise-integrator: An automatic integrator of web search interfaces for e-commerce, in VLDB Conference, 2003; S. Raghavan and H. Garcia-Molina, Crawling the hidden web, in VLDB Conference, 2001. Some of these methods address automatic form filling, but they handle only query forms with simple keywords, or interact only through drop-down selection lists. For example, one proposal uses simple heuristic rules such as proximity and alignment to associate the controls (elements) and labels in a form. Such methods cannot cope with the wide variety of query interfaces found in HTML documents and tend to extract incomplete interfaces, which lowers extraction accuracy and hampers the construction of integrated query interfaces.

Document segmentation based on visual features can identify one or more portions of a document's semantic content. For example, Chinese invention patent application CN1577328A discloses a vision-based document segmentation method that segments a Web page document according to visual features. It identifies multiple visual blocks in the document and detects one or more separators between them. A content structure is constructed for the document based at least in part on the visual blocks and separators, and this content structure identifies one or more portions of the document's semantic content. The content structure can optionally be used during document retrieval. This solution partitions semantic content and thereby improves the accuracy of Web page search. However, the Web page search it targets is still the prior-art search of surface pages, and it cannot be applied directly to deep page search.

Summary of the Invention

The object of the invention is to provide a method for extracting page query interfaces based on visual features, for use in deep page search, that automatically extracts the query interfaces needed for such search and thereby improves its recall, degree of automation, and query efficiency.

To achieve this object, the invention adopts the following technical solution: a method for extracting a page query interface based on visual features, comprising the following steps:

(1) Obtain a page document containing a query interface;

(2) Build a visual block tree for the page document using a vision-based document segmentation method;

(3) Locate the query interface area;

(4) Identify the label blocks, including:

4-1) Arrange the text blocks in the query interface area into a list, and assign the first text block to the first class;

4-2) Take the next text block and compute its similarity to each existing class. The similarity between two text blocks is

Sim(B1, B2) = w1 × wfs(B1, B2) + w2 × was(B1, B2) + w3 × wcs(B1, B2) + w4 × wss(B1, B2)

where wfs(B1, B2) indicates whether B1 and B2 have the same font and background color: 1 if they are the same, otherwise 0;

was(B1, B2) indicates whether the text of B1 and B2 is left-aligned or right-aligned: 1 if aligned, otherwise 0;

wcs(B1, B2) indicates whether B1 and B2 both contain a colon or both lack one: 1 if so, otherwise 0;

wss(B1, B2) indicates whether the text of B1 and B2 is on the same line: 1 if they are on different lines, otherwise 0;

w1 is 3.5 to 4.5, w2 is 1.5 to 2.5, w3 is 1.5 to 2.5, w4 is 1.5 to 2.5, and w1 + w2 + w3 + w4 = 10;

The similarity between a text block and a class is the average of its similarities to all text blocks in that class. If the similarity to some class exceeds the similarity threshold (Tas), the text block is assigned to that class; if its similarity to every existing class is no greater than the threshold, a new class is created and the text block is assigned to it. The similarity threshold is 6;

4-3) Repeat step 4-2) until all text blocks are classified;

4-4) According to display features, select as the label class the class of text blocks that conforms best. The display features are: labels are usually not on the same line, and when several text blocks appear on one line the first is the label; labels are usually left-aligned or right-aligned; labels share the same font size, color, and background color;

(5) Group the control blocks with the label blocks:

5-1) Build a list of control blocks and delete the submit, reset, and image control blocks from it;

5-2) Compare each control block with the label blocks obtained in step (4), and group control blocks and label blocks displayed on the same line;

5-3) According to display features, group each remaining control block with the nearest label block above it, completing the grouping of control blocks and label blocks;

The controls in the query interface and their corresponding attribute labels are thereby determined, achieving automatic extraction of the query interface.
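
As a worked illustration of the similarity measure in step 4-2), take w1 = 4 and w2 = w3 = w4 = 2, one choice that lies within the stated ranges and sums to 10. The blocks B1, B2, and B3 are hypothetical. B1 and B2 share font and background, are left-aligned, both contain a colon, and sit on different lines. B3 matches B1 in font, background, and colon, but is not aligned with B1 and sits on the same line.

```latex
\begin{align*}
\mathrm{Sim}(B_1, B_2) &= 4 \cdot 1 + 2 \cdot 1 + 2 \cdot 1 + 2 \cdot 1 = 10 \\
\mathrm{Sim}(B_1, B_3) &= 4 \cdot 1 + 2 \cdot 0 + 2 \cdot 1 + 2 \cdot 0 = 6
\end{align*}
```

With the threshold of 6, B2 joins a class that contains only B1 because 10 is greater than 6, while B3 does not, because 6 is not greater than 6, and so B3 starts a new class.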

In the above technical solution, the page document is a document in HTML format.

Building a visual block tree for the page document with a vision-based document segmentation method is prior art; for example, the document segmentation method disclosed in Chinese invention patent application CN1577328A can be used.

In the above technical solution, the query interface area is usually the smallest block containing a <form> tag. However, a web page often contains not only a query form but also forms such as member login or e-mail subscription that are meaningless for deep page search and cause interference. For better results, a further technical solution locates the query interface area as follows:

3-1) Set a control-count threshold;

3-2) In the visual block tree obtained in step (2), take the smallest block containing a <form> tag as the candidate query interface area and count the controls it contains;

3-3) If the control count exceeds the threshold, mark the block as the query interface area; otherwise examine the next smallest block containing a <form> tag;

3-4) Repeat steps 3-2) and 3-3) until the query interface area is located.

The control-count threshold is between 2 and 4.

Alternatively, the query interface area is located as follows: in the visual block tree obtained in step (2), take the smallest block containing a <form> tag as the candidate query interface area; if the area contains a PASSWORD control, discard the candidate and examine the next smallest block containing a <form> tag, until the query interface area is determined or all visual blocks have been examined.

In the above technical solution, the display features in step 5-3) further include: a control and its label are vertically aligned on different lines; a control and its label are adjacent; and if there is no label on the same line as a control, the control is associated with the nearest label above it.

By using the above technical solution, the invention has the following advantages over the prior art:

1. The invention combines document segmentation based on visual features with the visual features of labels and controls to group the controls in the query interface area with their attribute labels. This makes automatic extraction of the query interface possible and provides a basis for integrated search over the deep web;

2. Experiments show that the visual-feature-based automatic extraction method is feasible and achieves relatively high precision;

3. Applied to integrated deep web search, the invention can improve search accuracy and thereby improve users' working efficiency.

Brief Description of the Drawings

Figure 1 is a system diagram of automatic query interface extraction based on visual features according to an embodiment of the invention;

Figure 2 is a schematic view of a query interface page according to an embodiment of the invention;

Figure 3 is a schematic view of the page content structure of Figure 2;

Figure 4 is a schematic view of the visual block tree corresponding to Figure 3;

Figure 5 is a schematic view of the automatic extraction process of the invention.

Detailed Description

The invention is further described below with reference to the drawings and an embodiment:

Embodiment 1: As shown in Figures 1 to 5, a method for extracting a page query interface based on visual features comprises the following steps:

(1) Obtain a page document containing a query interface;

See Figure 2. A query interface contains form controls through which the user enters query information, such as text boxes (Textbox), radio buttons (Radio Button), check boxes (Check box), and drop-down lists (Selection List). Each control is usually associated with a label, a piece of descriptive text, and each control can have one or more values: a drop-down list offers a list of values to choose from, while radio buttons and check boxes usually have a single value. Logically, a control and its associated label form an attribute, which corresponds to a field in the back-end database of the deep web site. An attribute usually comprises one label and one or more form controls. For example, the Author attribute in Figure 2 has four form controls: one text box and three radio buttons. The label can be regarded as the attribute name, and the form controls as the attribute domain. If an attribute includes several form controls, the controls are related in some way. For the four controls of the Author attribute in Figure 2, the text box can be regarded as the domain element, and each of the three radio buttons defines a constraint on the domain element, so they can be regarded as constraint elements.

To extract the query interface from such a page document, the query interface area is determined first. Labels and form controls are then extracted from that area and reassembled, according to their logical relationships, into attributes, each of which is one logical unit of a query condition. Abstractly, a query interface QI can be defined as QI = (N, {A_1, A_2, ..., A_n}), where N is the name of the form and each A_i is a logical attribute of the query interface, with A_i = {L_i, E_{i...k}}, where L_i is the attribute label and each E_i is a form control.
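
The definition QI = (N, {A_1, ..., A_n}) maps naturally onto a small data model. The sketch below is only an illustration: the class names, field names, control names, and radio values are assumptions introduced here and do not come from the patent. It models the Author attribute described for Figure 2, with one text box and three radio buttons.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormControl:
    """A form control E_i, such as a text box or a radio button."""
    kind: str                      # "text", "radio", "checkbox", "select", ...
    name: str                      # hypothetical control name
    values: List[str] = field(default_factory=list)


@dataclass
class Attribute:
    """A logical attribute A_i: one label L_i plus one or more controls."""
    label: str
    controls: List[FormControl]


@dataclass
class QueryInterface:
    """QI = (N, {A_1, ..., A_n}): form name N and its attributes."""
    name: str
    attributes: List[Attribute]


# The Author attribute described for Figure 2: one text box, three radio buttons.
# Control names and radio values are placeholders, not taken from the figure.
author = Attribute(
    label="Author",
    controls=[
        FormControl("text", "author"),
        FormControl("radio", "author_mode", ["option_1"]),
        FormControl("radio", "author_mode", ["option_2"]),
        FormControl("radio", "author_mode", ["option_3"]),
    ],
)
qi = QueryInterface(name="search_form", attributes=[author])
```

In this model the text box plays the role of the domain element and the three radio buttons the role of constraint elements.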

(2) Build a visual block tree for the page document using a vision-based document segmentation method;

The vision-based page segmentation algorithm (VIPs) aims to extract the content structure of a Web page from its visual presentation. The content structure is a tree in which each node corresponds to a rectangular region of the page. A leaf node is a block that cannot be divided further and represents a minimal semantic unit. Parent and child nodes are related by containment: the rectangular region of a child node, such as a leaf node, lies inside the rectangular region of its parent. We call this tree the visual block tree. We use the VIPs algorithm to convert the page containing the query interface into a visual block tree. Figure 3 shows the content structure of the page in Figure 2, with the query interface area marked by a dashed box, and Figure 4 shows a schematic of the corresponding visual block tree. A real visual block tree is more complex than Figure 4 and usually contains hundreds of blocks.
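
The containment property of the visual block tree, and the search for the smallest block containing a <form> tag that the next step relies on, can be sketched as follows. This is not the VIPs algorithm itself: the tree is built by hand, and the Block class, its has_form flag (assumed to be set on every block whose region contains a <form> tag), and the example coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A node of the visual block tree: a rectangle plus its child blocks."""
    left: float
    top: float
    width: float
    height: float
    has_form: bool = False         # assumed flag: region contains a <form> tag
    children: List["Block"] = field(default_factory=list)

    def contains(self, other: "Block") -> bool:
        """True if other's rectangle lies inside this block's rectangle."""
        return (self.left <= other.left
                and self.top <= other.top
                and other.left + other.width <= self.left + self.width
                and other.top + other.height <= self.top + self.height)


def is_well_formed(block: Block) -> bool:
    """Check that every child rectangle lies inside its parent, recursively."""
    return all(block.contains(c) and is_well_formed(c) for c in block.children)


def smallest_form_blocks(root: Block) -> List[Block]:
    """Return blocks that contain a form while none of their children do."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if not node.has_form:
            continue
        form_children = [c for c in node.children if c.has_form]
        if form_children:
            stack.extend(form_children)
        else:
            found.append(node)
    return found


# Hand-built example page: a header and a body holding a search form and news.
header = Block(left=0, top=0, width=800, height=100)
search = Block(left=50, top=150, width=400, height=200, has_form=True)
news = Block(left=500, top=150, width=250, height=300)
body = Block(left=0, top=100, width=800, height=500, has_form=True,
             children=[search, news])
page = Block(left=0, top=0, width=800, height=600, has_form=True,
             children=[header, body])
```

Calling smallest_form_blocks(page) on this example yields the single search block, which would then serve as the candidate query interface area.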

(3) Locate the query interface area;

Generally, the smallest block containing a <form> tag can be taken as the query interface area. However, a web page often contains not only a query form but also forms such as member login or e-mail subscription that are irrelevant to the invention and cause interference. This embodiment therefore refines the step by applying heuristic rules. For example, some web forms contain TEXTAREA and PASSWORD controls; from practical experience, such forms can be judged directly not to be query interfaces. In addition, a threshold can be set on the number of controls in a web form: a form with fewer controls than the threshold is considered not to be a query interface. Some site-search forms, for instance, have very few elements, only a text box and a submit button; too little information can be obtained from them, so they are classified as non-query interfaces. The threshold is empirical; values of 2, 3, or 4 are generally suitable.
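
The heuristics above can be sketched as a simple filter over candidate forms. The sketch makes several assumptions that the text leaves open: each candidate is reduced to a list of lowercase control type names, the control count includes every control in the form (submit buttons too), and a form qualifies only when its count is strictly greater than the threshold, following step 3-3). The CandidateForm class and the example forms are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateForm:
    """A candidate area: the smallest block containing one <form> tag."""
    form_name: str
    control_types: List[str]       # lowercase type of each control in the form


# Control types that, per the heuristic, mark a form as a non-query interface.
EXCLUDED_TYPES = {"password", "textarea"}


def is_query_interface(form: CandidateForm, threshold: int = 3) -> bool:
    """Apply the exclusion rule, then the control-count rule."""
    if EXCLUDED_TYPES.intersection(form.control_types):
        return False
    return len(form.control_types) > threshold


def locate_query_interface(candidates: List[CandidateForm],
                           threshold: int = 3) -> Optional[CandidateForm]:
    """Return the first candidate that passes both rules, or None."""
    for form in candidates:
        if is_query_interface(form, threshold):
            return form
    return None


# Hypothetical forms found on one page.
login = CandidateForm("login", ["text", "password", "submit"])
site_search = CandidateForm("site_search", ["text", "submit"])
books = CandidateForm("books", ["text", "radio", "radio", "radio",
                                "select", "submit"])
```

Here the login form is rejected for its password control, the site-search form for having only two controls, and the books form is selected.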

(4) Identify the label blocks

To identify label blocks, the visual features of query interfaces must first be understood.

A query interface lets users enter query information and submit queries. To make it easier to understand and use, designers usually build in several kinds of visual features, such as fonts, colors, and layout. Visual features are therefore very important to query interface extraction. The main visual features used by the method are described below.

Position features (PF, Position Features): these describe the relative positions of labels and form controls in the form.

PF1: Labels are usually not on the same line. When several text blocks appear on one line, the first is the label and the others are explanatory text, such as "between ... and".

PF2: A form control and its attribute label are usually on the same line, or vertically aligned on different lines.

Layout features (LF, Layout Features): these describe how attributes and form controls are arranged in the query interface.

LF1: Labels are usually left-aligned or right-aligned.

LF2: The label and the form controls of an attribute are adjacent.

LF3: If there is no label on the same line as a form control, the control is associated with the nearest label above it.

Appearance features (AF, Appearance Features): these describe the visual appearance of labels and form controls.

AF1: Labels share the same font size, font color, and background.

AF2: Labels usually end with a colon.

A query interface contains multiple text blocks, and the label blocks must be picked out from them using the PF, LF, and AF features, as follows:

4-1) Arrange the text blocks in the query interface area into a list in any order, and assign the first text block to the first class;

4-2) Take the next text block and compute its similarity to each existing class. The similarity between two text blocks is

Sim(B1, B2) = w1 × wfs(B1, B2) + w2 × was(B1, B2) + w3 × wcs(B1, B2) + w4 × wss(B1, B2)

where wfs(B1, B2) indicates whether B1 and B2 have the same font and background color: 1 if they are the same, otherwise 0;

was(B1, B2) indicates whether the text of B1 and B2 is left-aligned or right-aligned: 1 if aligned, otherwise 0;

wcs(B1, B2) indicates whether B1 and B2 both contain a colon or both lack one: 1 if so, otherwise 0;

wss(B1, B2) indicates whether the text of B1 and B2 is on the same line: 1 if they are on different lines, otherwise 0;

w1 is 3.5 to 4.5, w2 is 1.5 to 2.5, w3 is 1.5 to 2.5, w4 is 1.5 to 2.5, and w1 + w2 + w3 + w4 = 10; the preferred values are 4, 2, 2, and 2 respectively;

The similarity between a text block and a class is the average of its similarities to all text blocks in that class. If the similarity to some class is greater than 6, the text block is assigned to that class; if its similarity to every existing class is no greater than 6, a new class is created and the text block is assigned to it;

4-3) Repeat step 4-2) until all text blocks are classified;

4-4) According to display features, select as the label class the class of text blocks that conforms best. The display features are: labels are usually not on the same line, and when several text blocks appear on one line the first is the label; labels are usually left-aligned or right-aligned; labels share the same font size, color, and background color.
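
Steps 4-1) to 4-3) amount to a single-pass incremental clustering, which can be sketched as below using the preferred weights 4, 2, 2, 2 and the threshold 6. The TextBlock fields are assumptions: font is taken to be one string covering family, size, and color; alignment is tested by equal left or right pixel edges; row is a precomputed display-line index; and only the ASCII colon is checked. When several classes exceed the threshold, this sketch picks the first one, which the text does not specify. Step 4-4), choosing the label class, is not covered.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TextBlock:
    text: str
    font: str          # assumed: one string for family, size, and color
    background: str
    left: int          # left edge in pixels
    right: int         # right edge in pixels
    row: int           # index of the display line (assumed precomputed)


WEIGHTS = (4.0, 2.0, 2.0, 2.0)     # preferred w1..w4, summing to 10
THRESHOLD = 6.0


def similarity(b1: TextBlock, b2: TextBlock,
               weights: Sequence[float] = WEIGHTS) -> float:
    """Sim(B1, B2) = w1*wfs + w2*was + w3*wcs + w4*wss."""
    w1, w2, w3, w4 = weights
    wfs = int(b1.font == b2.font and b1.background == b2.background)
    was = int(b1.left == b2.left or b1.right == b2.right)
    wcs = int((":" in b1.text) == (":" in b2.text))
    wss = int(b1.row != b2.row)    # 1 when the blocks are on different lines
    return w1 * wfs + w2 * was + w3 * wcs + w4 * wss


def class_similarity(block: TextBlock, members: List[TextBlock]) -> float:
    """Average similarity between a block and every member of a class."""
    return sum(similarity(block, m) for m in members) / len(members)


def cluster(blocks: List[TextBlock]) -> List[List[TextBlock]]:
    """Assign each block to the first class above threshold, else a new class."""
    classes: List[List[TextBlock]] = []
    for block in blocks:
        for members in classes:
            if class_similarity(block, members) > THRESHOLD:
                members.append(block)
                break
        else:
            classes.append([block])
    return classes


# Hypothetical text blocks: three labels and one explanatory hint.
title = TextBlock("Title:", "Arial 12 bold", "white", 10, 60, 0)
author = TextBlock("Author:", "Arial 12 bold", "white", 10, 70, 1)
hint = TextBlock("exact name", "Arial 10", "white", 200, 280, 1)
publisher = TextBlock("Publisher:", "Arial 12 bold", "white", 10, 90, 2)
classes = cluster([title, author, hint, publisher])
```

In this example the three label-like blocks end up in one class and the hint in a class of its own.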

(5)控件块与标签块的分组,(5) Grouping of control blocks and label blocks,

上一步骤发现并将标签块划分为一类,这就相当于找到属性的名称,确定了属性的个数。接着通过3个步骤将控件块与相关标签块组成逻辑属性。The previous step found and divided the label blocks into one category, which is equivalent to finding the name of the attribute and determining the number of attributes. Then through three steps, the control block and the related label block are composed of logic attributes.

5-1)建立控件块列表,由于查询接口中含有用于用户提交查询的表单控件,通常有submit、reset、image三种类型,这些控件并不具备逻辑属性,因此首先要将这些控件block删除;5-1) Create a list of control blocks. Since the query interface contains form controls for users to submit queries, there are usually three types of submit, reset, and image. These controls do not have logical attributes, so these control blocks must be deleted first. ;

5-2)对每一控件块与步骤(4)中获得的标签块进行比较,将显示于同一行的控件块与标签块归为一组;5-2) Comparing each control block with the label block obtained in step (4), grouping the control blocks and label blocks displayed on the same row into one group;

5-3)根据显示特征PF2、LF2、LF3,将剩余的控件块和其上方最毗邻的标签块归为一组,完成控件块与标签块的分组。5-3) According to the display features PF2, LF2, and LF3, group the remaining control blocks and the adjacent label blocks above them into one group to complete the grouping of control blocks and label blocks.

This determines the controls in the query interface and their corresponding attribute labels, achieving automatic extraction of the query interface.
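Steps 5-1) to 5-3) might look like the following Python sketch. The `Block` type, its `row` and `left` fields, and the tie-breaking by horizontal distance are assumptions made for illustration; the vertical-alignment check mentioned among the display features is simplified to "nearest label on a line above".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Block:
    """Hypothetical control or label block with a coarse layout position."""
    name: str
    kind: str  # "label", or a control type such as "text", "select", "submit"
    row: int   # index of the display line
    left: int  # x coordinate of the left edge


# Control types that only submit or reset the form and carry no attribute.
NON_ATTRIBUTE_TYPES = {"submit", "reset", "image"}


def group_controls(controls: List[Block], labels: List[Block]) -> Dict[str, List[str]]:
    groups: Dict[str, List[str]] = {label.name: [] for label in labels}

    # 5-1) drop submit, reset and image control blocks.
    candidates = [c for c in controls if c.kind not in NON_ATTRIBUTE_TYPES]

    # 5-2) group a control with a label displayed on the same line.
    remaining = []
    for control in candidates:
        same_row = [l for l in labels if l.row == control.row]
        if same_row:
            # Assumption: with several labels on the line, take the closest one.
            label = min(same_row, key=lambda l: abs(l.left - control.left))
            groups[label.name].append(control.name)
        else:
            remaining.append(control)

    # 5-3) group each remaining control with the nearest label above it.
    for control in remaining:
        above = [l for l in labels if l.row < control.row]
        if above:
            label = min(
                above,
                key=lambda l: (control.row - l.row, abs(l.left - control.left)),
            )
            groups[label.name].append(control.name)
    return groups
```

For example, a "Date" label followed by one select box on its own line and a second select box on the next line yields a single attribute "Date" with both controls.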

Claims (5)

1. A method for extracting a page query interface based on visual features, comprising the following steps:

(1) obtaining a page document that contains a query interface, the page document being an HTML document;

(2) constructing a visual block tree for the page document using a vision-based document segmentation method;

(3) locating the query interface region;

(4) identifying label blocks, including:

4-1) arranging the text blocks in the query interface region into a list, and placing the first text block in a first class;

4-2) taking the next text block and computing its similarity to each existing class, the similarity between two text blocks being

Sim(B1,B2) = w1×wfs(B1,B2) + w2×was(B1,B2) + w3×wcs(B1,B2) + w4×wss(B1,B2)

where wfs(B1,B2) indicates whether B1 and B2 have the same font and background color (1 if the same, otherwise 0); was(B1,B2) indicates whether the texts of B1 and B2 are left-aligned or right-aligned with each other (1 if aligned, otherwise 0); wcs(B1,B2) indicates whether B1 and B2 both contain a colon or both lack one (1 if so, otherwise 0); wss(B1,B2) indicates whether the texts of B1 and B2 are on the same line (1 if they are not on the same line, otherwise 0); w1 is 3.5 to 4.5, w2 is 1.5 to 2.5, w3 is 1.5 to 2.5, w4 is 1.5 to 2.5, and w1+w2+w3+w4=10;

the similarity between a text block and a class being the average of the similarities between that text block and every text block in the class; if the similarity between the text block and some class is greater than a similarity threshold, the text block is assigned to that class; if its similarity to every existing class is not greater than the similarity threshold, a new class is created and the text block is assigned to it; the similarity threshold being 6;

4-3) repeating step 4-2) until all text blocks have been classified;

4-4) based on display features, selecting the text block class that best matches them as the label class, the display features being: labels are usually not on the same line as one another, and when several text blocks appear on the same line, the first block is the label; labels are usually left-aligned or right-aligned; labels share the same font size, color, and background color;

(5) grouping control blocks with label blocks:

5-1) building a list of control blocks and deleting the submit, reset, and image control blocks from it;

5-2) comparing each control block with the label blocks obtained in step (4), and grouping each control block with the label block displayed on the same line;

5-3) based on display features, grouping each remaining control block with the nearest label block above it, completing the grouping of control blocks and label blocks;

thereby determining the controls in the query interface and their corresponding attribute labels, and achieving automatic extraction of the query interface.

2. The method for extracting a page query interface based on visual features according to claim 1, characterized in that the query interface region is located by:

3-1) setting a control count threshold;

3-2) in the visual block tree obtained in step (2), taking the smallest block that contains a <form> tag as a candidate query interface region, and counting the controls it contains;

3-3) if the number of controls is greater than the control count threshold, marking the block as the query interface region; otherwise examining the next smallest block that contains a <form> tag;

3-4) repeating steps 3-2) and 3-3) until the query interface region has been located.

3. The method for extracting a page query interface based on visual features according to claim 2, characterized in that the control count threshold is between 2 and 4.

4. The method for extracting a page query interface based on visual features according to claim 1, characterized in that the query interface region is located by: in the visual block tree obtained in step (2), taking the smallest block that contains a <form> tag as a candidate query interface region; if the region contains a PASSWORD control, discarding that candidate region and examining the next smallest block that contains a <form> tag, until the query interface region is determined or all visual blocks have been examined.

5. The method for extracting a page query interface based on visual features according to claim 1, characterized in that, in step 5-3), the display features further include: controls and labels on different lines are vertically aligned; controls and labels are adjacent to each other; if a control has no label on its own line, the control is grouped with the nearest label above it.
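The region-locating rules of claims 2 to 4 can be illustrated with a short Python sketch. Claims 2 and 4 describe alternative tests; applying both in one pass is a choice made here for brevity. The input format (one list of control type names per candidate block, in the order the blocks are visited) and the function name are hypothetical, and the threshold of 3 is simply one value from the claimed range of 2 to 4.

```python
from typing import List, Optional

CONTROL_THRESHOLD = 3  # assumed value within the claimed range 2 to 4


def locate_query_interface(
    form_blocks: List[List[str]],
    threshold: int = CONTROL_THRESHOLD,
) -> Optional[int]:
    """Return the index of the first candidate block accepted as the query
    interface region, or None if every candidate is rejected.

    Each candidate is the list of control types found in the smallest visual
    block that contains a <form> tag.
    """
    for index, control_types in enumerate(form_blocks):
        # Claim 4: a block with a password control is treated as a login
        # form, not a query interface.
        if "password" in control_types:
            continue
        # Claims 2 and 3: accept the block only if it holds more controls
        # than the threshold.
        if len(control_types) > threshold:
            return index
    return None
```

A login form and a two-control site search box would both be skipped under these rules, while a four-control advanced search form would be accepted.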
CNB2007100195438A 2007-01-10 2007-01-10 Extraction Method of Page Query Interface Based on Visual Feature Expired - Fee Related CN100447793C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100195438A CN100447793C (en) 2007-01-10 2007-01-10 Extraction Method of Page Query Interface Based on Visual Feature

Publications (2)

Publication Number Publication Date
CN101004760A CN101004760A (en) 2007-07-25
CN100447793C true CN100447793C (en) 2008-12-31

Family

ID=38703897

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100195438A Expired - Fee Related CN100447793C (en) 2007-01-10 2007-01-10 Extraction Method of Page Query Interface Based on Visual Feature

Country Status (1)

Country Link
CN (1) CN100447793C (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101515287B (en) * 2009-03-24 2011-01-12 苏州普达新信息技术有限公司 Automatic generating method of wrapper of complex page
JP5862260B2 (en) * 2011-12-09 2016-02-16 富士ゼロックス株式会社 Information processing apparatus and information processing program
CN103440107A (en) * 2013-09-04 2013-12-11 北京奇虎科技有限公司 Method and device for processing touch operation of electronic device
CN104281693A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Semantic search method and semantic search system
CN105577684B (en) * 2016-01-25 2018-09-28 北京京东尚科信息技术有限公司 Method, server-side, client and the system of anti-crawler capturing
CN108664535B (en) * 2017-04-01 2022-08-12 北京京东尚科信息技术有限公司 Information output method and device
CN109948019B (en) * 2019-01-10 2021-10-08 中央财经大学 Deep network data acquisition method
CN110222251B (en) 2019-05-27 2022-04-01 浙江大学 Service packaging method based on webpage segmentation and search algorithm
CN113485782B (en) * 2021-07-29 2024-08-06 北京百度网讯科技有限公司 Page data acquisition method and device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1399228A (en) * 2002-08-29 2003-02-26 北京北大方正技术研究院有限公司 Text excavating method of semi-structural document set
CN1797399A (en) * 2004-11-11 2006-07-05 微软公司 Application programming interface for text mining and search
CN1801166A (en) * 2006-01-09 2006-07-12 西安交通大学 Method for implementing data resource platform of fluid physical and chemical properties based on network
US20060294199A1 (en) * 2005-06-24 2006-12-28 The Zeppo Network, Inc. Systems and Methods for Providing A Foundational Web Platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Web-based Text Mining (基于Web的文本挖掘研究). Cui Zhiming, Xie Chunli. Microelectronics & Computer (微电子学与计算机), No. 10, 2002 *

Also Published As

Publication number Publication date
CN101004760A (en) 2007-07-25

Similar Documents

Publication Publication Date Title
CN100447793C (en) Extraction Method of Page Query Interface Based on Visual Feature
Liu et al. Vide: A vision-based approach for deep web data extraction
Gatterbauer et al. Towards domain-independent information extraction from web tables
US7739257B2 (en) Search engine
US20090248707A1 (en) Site-specific information-type detection methods and systems
CN108446368A (en) A kind of construction method and equipment of Packaging Industry big data knowledge mapping
Khare et al. Understanding deep web search interfaces: A survey
Sleiman et al. Tex: An efficient and effective unsupervised web information extractor
CN102163213B (en) Voice browsing method and browser
JP2002297602A (en) Method and device for structured document retrieval, structured document managing device, program, and recording medium
CN102254014A (en) Adaptive information extraction method for webpage characteristics
US20150026159A1 (en) Digital Resource Set Integration Methods, Interfaces and Outputs
CN105426529A (en) Image retrieval method and system based on user search intention positioning
JP2005063432A (en) Multimedia object search device and multimedia object search method
CN105677638A (en) Web information extraction method
CN101996190B (en) Method and device for extracting information from webpage
Ji et al. Tag tree template for Web information and schema extraction
CN110083760B (en) Multi-recording dynamic webpage information extraction method based on visual block
CN109948015B (en) Meta search list result extraction method and system
Su et al. Understanding query interfaces by statistical parsing
Nie et al. Webpage understanding: beyond page-level search
Lam et al. Web information extraction
Zhou et al. Automatically constructing multi-dimensional resource space by extracting class trees from texts for operating and analyzing texts from multiple abstraction dimensions
Cui et al. From wrapping to knowledge: Domain ontology learning from deep Web
Kale A Review on Enabling Document Annotation Based on Content Value

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081231

Termination date: 20170110
