KR100503148B1

KR100503148B1 - System for processing web documents based style and content information and method thereof

Info

Publication number: KR100503148B1
Application number: KR10-2002-0020840A
Authority: KR
Inventors: 송종철; 문병주; 홍기채; 이동일; 손소현; 오병택
Original assignee: 정보통신연구진흥원
Priority date: 2002-04-17
Filing date: 2002-04-17
Publication date: 2005-07-25
Anticipated expiration: 2022-04-17
Also published as: KR20030082214A

Abstract

The present invention relates to a system and method for processing web documents based on style and content information. A web robot 10 collects web documents according to site information and stores them in a temporary storage 20 (S10). A parser 30 parses each web document and extracts its URL information, structure information, and style information (S20). A filtering engine 50 matches the structure information and style information against pattern information and extracts the content contained in the lowest node of the pattern-matched structure information (S30), and a morpheme analyzer 40 extracts nouns from that content (S40). The filtering engine 50 then checks whether the extracted nouns appear among the stop keywords. If they do, it treats the web document as an unwanted document and deletes it from the temporary storage 20 (S50); if they do not, it extracts the required information objects using the structure information, style information, and usable keywords of the web document (S60).

Accordingly, the time and cost that collecting unwanted documents adds to document management and processing can be reduced, and an efficient search system can be built.

Description

Web document processing system based on style and content information and method thereof {SYSTEM FOR PROCESSING WEB DOCUMENTS BASED STYLE AND CONTENT INFORMATION AND METHOD THEREOF}

The present invention relates to a web document processing system and method. More particularly, it relates to a style- and content-information-based web document processing system and method that remove the unwanted documents produced while a web robot collects web documents, namely duplicate documents and documents outside the collection target, by using the structure and style information of each web document and the content contained in its tags, and that extract the required information contained in the web documents.

As is well known, conventional portal sites collect web information with a web robot, and administrators assigned to each subject area process and filter the collected information to provide users with a web information search service.

However, this conventional approach consumes considerable cost and labor in managing the collected web information, so web information is not distributed quickly. Web documents collected more than once can be removed to some extent by comparing URLs, but duplicates among web documents generated by CGI programs that work with a database are rarely removed, and web documents containing unwanted information, such as pages that require a login, must also be filtered by an administrator.

In particular, a conventional web document processing system that collects web documents with a web robot handles duplicates by simply comparing URLs and excluding any web document whose URL has already been seen. This is effective for web documents that reside in a single directory structure behind a web service, but it becomes a problem when documents are collected from a system that keeps a large amount of information in a database and serves it through CGI, PHP, and the like.

For example, when web documents are served through CGI, a single web document can have several URLs, that is, the CGI program uses dynamic parameters. Most conventional websites that provide web services through CGI use such dynamic CGI.
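The limitation of URL-only comparison can be illustrated with a short Python sketch. The URLs and the `session` parameter are hypothetical and chosen only to show one document reachable under two addresses; the function is a stand-in for the conventional check, not code from the patent.

```python
def is_duplicate_by_url(url, seen_urls):
    """Conventional check: a document is a duplicate only if its URL was seen verbatim."""
    return url in seen_urls


# Hypothetical example: the same article served by a CGI program
# with a different dynamic (session) parameter on each visit.
seen_urls = {"http://example.com/cgi-bin/view.cgi?id=42&session=abc"}
candidate = "http://example.com/cgi-bin/view.cgi?id=42&session=xyz"

# The URL strings differ, so the duplicate goes undetected.
detected = is_duplicate_by_url(candidate, seen_urls)
```

Because the two strings differ only in a parameter that does not change the served content, a verbatim URL comparison treats them as two documents.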

In addition, unwanted documents that fall outside the collection target are handled by having an administrator inspect the document content directly and remove them.

Meanwhile, conventional web document processing systems extract the required information (information objects) contained in a web document by using HTML tags or the document's structure and style information. However, the organization providing the web service must use a wrapper to learn the structure and style information of the web documents, and this learning consumes considerable time and cost.

Accordingly, the present invention is intended to overcome the problems described above. Its object is to provide a style- and content-information-based web document processing system and method that define in advance the structure and style information and the stop keywords for the web documents to be collected, and match them against the structure information, style information, and nouns in the content of each collected web document to remove unwanted documents; and that use the structure and style information of the web document to select a regular document structure and style, and use the nouns contained in the content of the selected lowest node to accurately extract the information objects contained in the web document.

To achieve the above object, the style- and content-information-based web document processing system according to the present invention comprises: a web robot that collects web documents according to site information and stores the collected web documents in a temporary storage; a parser that parses the web documents collected by the web robot and stored in the temporary storage and extracts their URL information, structure information, and style information; a morpheme analyzer that extracts nouns from the content contained in the lowest node of the structure information extracted by the parser; and a filtering engine that extracts content by matching the structure information and style information against previously entered pattern information, treats the web document as an unwanted document and deletes it if the nouns extracted from that content appear among the stop keywords, and otherwise extracts the required information objects using the structure information, style information, and usable keywords.

To achieve the above object, the style- and content-information-based web document processing method according to the present invention comprises the steps of: collecting web documents according to site information and storing the collected web documents in a temporary storage; parsing the web documents collected by the web robot and stored in the temporary storage and extracting their URL information, structure information, and style information; matching the structure information and style information against pattern information and extracting the content contained in the lowest node of the pattern-matched structure information; extracting nouns from the extracted content; checking whether the nouns extracted from the pattern-matched content appear among the stop keywords and, if they do, treating the web document as an unwanted document and deleting it from the temporary storage; and, if the nouns do not appear among the stop keywords, extracting the required information objects using the structure information, style information, and usable keywords of the web document.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

Referring to FIG. 1, the web robot 10 collects web documents according to site information and stores the collected web documents in the temporary storage 20.

The parser 30 parses the web documents collected by the web robot 10 and stored in the temporary storage 20, and extracts their URL information, structure information, and style information.
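A minimal sketch of such a parser, written in Python with the standard `html.parser` module, is shown below. The patent does not specify a data format, so everything here is an assumption made for illustration: structure information is represented as the tag path to each text-bearing node, style information as a subset of presentational attributes (`STYLE_ATTRS`), and the names `StructureStyleParser` and `parse_document` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these attributes stand in for "style information".
STYLE_ATTRS = {"class", "style", "align", "bgcolor", "color", "size", "face"}
# Elements that never have a closing tag and so must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class StructureStyleParser(HTMLParser):
    """Records the tag path, style attributes, and text of every text-bearing node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, style) pairs
        self.nodes = []  # one dict per text-bearing node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = {name: value for name, value in attrs if name in STYLE_ATTRS}
        self.stack.append((tag, style))

    def handle_endtag(self, tag):
        # Close the nearest matching open element; tolerate unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(tag for tag, _ in self.stack)
            self.nodes.append(
                {"path": path, "style": dict(self.stack[-1][1]), "text": text}
            )


def parse_document(url, html):
    """Return the URL together with the structure/style/text of each leaf node."""
    parser = StructureStyleParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "nodes": parser.nodes}
```

Flattening the tree into path strings keeps later pattern matching simple, at the cost of losing sibling order within a parent; a real implementation might keep the full tree.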

The morpheme analyzer 40 extracts nouns from the content contained in the lowest node of the structure information that the parser 30 extracted from the web document.

The filtering engine 50 matches the structure information and style information against previously entered pattern information, extracts the content contained in the lowest node of the pattern-matched structure information, and passes it to the morpheme analyzer 40.

The filtering engine 50 checks whether the nouns that the morpheme analyzer 40 extracted from the pattern-matched content appear among the stop keywords. If they do, it treats the web document as an unwanted document and deletes it from the temporary storage 20; if the document is not an unwanted document, it extracts the required information objects using the structure information, style information, and usable keywords.

The pattern matching module 51 of the filtering engine 50 matches the structure information and style information extracted by the parser 30 against previously entered pattern information, extracts the content contained in the lowest node of the pattern-matched structure information, and passes it to the morpheme analyzer 40.

The unwanted document filtering module 52 of the filtering engine 50 checks whether the nouns that the morpheme analyzer 40 extracted from the pattern-matched content appear among the stop keywords and, if they do, treats the web document as an unwanted document and deletes it from the temporary storage 20.

The information object extraction module 53 of the filtering engine 50 extracts the required information objects using the structure information, style information, and usable keywords of the web document when the unwanted document filtering finds that the extracted nouns do not appear among the stop keywords.

The schedule manager 54 of the filtering engine 50 runs the pattern matching module 51, the unwanted document filtering module 52, and the information object extraction module 53 each as a thread and controls the input and output of each module.
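One way to picture a schedule manager of this kind is a chain of worker threads connected by queues, sketched below in Python. The patent states only that each module runs as a thread and that the manager controls their input and output; the queue-based hand-off, the sentinel shutdown, and the names `run_stage` and `run_pipeline` are assumptions for illustration.

```python
import queue
import threading

_STOP = object()  # sentinel that tells each stage to shut down


def run_stage(worker, inbox, outbox):
    """Run one module in its own thread: read an item, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            break
        result = worker(item)
        if result is not None:  # None means the item was filtered out
            outbox.put(result)


def run_pipeline(stages, items):
    """Start one thread per stage, feed the items through, and collect the output."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_STOP)

    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results
```

In this sketch `stages` would hold three callables standing in for pattern matching, unwanted document filtering, and information object extraction; a stage returns `None` to drop a document. Error handling is omitted, so a stage that raises would stall the chain.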

Referring to FIG. 2, the template generator 51a of the pattern matching module 51 checks the URL of the collected web document, receives pattern information that specifies the structure and style information to be mapped, and generates a template.

The pattern information input to the template generator 51a consists of unwanted document processing information, which comprises a URL, a mapping target document structure, and the style information contained in that document structure, and information object extraction information, which comprises a URL, a mapping target document structure, the style information contained in that document structure, and the range of the mapping target structure.
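The two parts of the pattern information can be pictured as plain records, as in the Python sketch below. The field names, the path-string encoding of the document structure, and the `(first, last)` encoding of the range are assumptions; the patent lists only which items each part contains.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class UnwantedDocumentInfo:
    """Unwanted document processing information."""
    url: str                      # URL (or URL prefix) the rule applies to
    structure: str                # mapping target document structure, e.g. a tag path
    style: Dict[str, str] = field(default_factory=dict)  # style info in that structure


@dataclass
class ExtractionInfo:
    """Information object extraction information."""
    url: str
    structure: str
    style: Dict[str, str] = field(default_factory=dict)
    # Assumption: the "range of the mapping target structure" is modeled as
    # the first and last index of matching nodes to consider.
    node_range: Optional[Tuple[int, int]] = None


@dataclass
class PatternInfo:
    unwanted: UnwantedDocumentInfo
    extraction: ExtractionInfo


# Hypothetical pattern information for a news site.
example = PatternInfo(
    unwanted=UnwantedDocumentInfo(
        url="http://news.example.com/",
        structure="html/body/form",
        style={"class": "login"},
    ),
    extraction=ExtractionInfo(
        url="http://news.example.com/",
        structure="html/body/table/tr/td",
        style={"class": "title"},
        node_range=(0, 9),
    ),
)
```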

The document structure traverser 51b of the pattern matching module 51 traverses the structure of the web document parsed by the parser 30 and extracts the nodes that have the same structure as the template generated by the template generator 51a.

The style mapper 51d of the pattern matching module 51 compares the styles contained in each node of the document structure extracted by the document structure traverser 51b with the style information of the template and selects the nodes that have the same style. It then links the web document position information and content of each selected node to the template and stores them in the template storage 51c.

The content extractor 51e of the pattern matching module 51 extracts, as text, the content contained in the lowest node among the nodes selected by the style mapper 51d and passes it to the morpheme analyzer 40.
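The four parts of the pattern matching module can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: each parsed node is a dict with a tag `path`, a `style` dict, and its `text`; "same structure" is taken to mean an identical tag path; "same style" means every style attribute named in the template has the same value on the node; and all nodes are already text-bearing leaves, so the lowest-node selection reduces to reading their text. All function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    structure: str                                # tag path to match
    style: dict                                   # style attributes to match
    matches: list = field(default_factory=list)   # (position, content) pairs


def generate_template(doc_url, pattern):
    """Template generator: check the URL, then build a template from the pattern."""
    if not doc_url.startswith(pattern["url"]):
        return None
    return Template(pattern["structure"], dict(pattern["style"]))


def traverse_structure(nodes, template):
    """Document structure traverser: keep nodes whose structure equals the template's."""
    return [(pos, node) for pos, node in enumerate(nodes)
            if node["path"] == template.structure]


def map_styles(candidates, template):
    """Style mapper: keep nodes with the template's style; link position and content."""
    for pos, node in candidates:
        if all(node["style"].get(k) == v for k, v in template.style.items()):
            template.matches.append((pos, node["text"]))
    return template


def extract_content(template):
    """Content extractor: return the matched content as plain text."""
    return "\n".join(text for _, text in template.matches)
```

A caller would chain these as `map_styles(traverse_structure(nodes, t), t)` followed by `extract_content(t)`, and pass the resulting text to the morpheme analyzer.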

Referring to FIG. 3, the stop keyword checker 52a of the unwanted document filtering module 52 checks whether the nouns extracted by the morpheme analyzer 40 appear among the stop keywords.

If the stop keyword check finds that the extracted nouns appear among the stop keywords, the unwanted document remover 52b of the unwanted document filtering module 52 judges the collected web document to be an unwanted document and deletes both the web document held in the temporary storage 20 and the template stored in the template storage 51c. The web documents deleted by the unwanted document remover 52b are documents outside the collection target.

The duplicate document remover 52c of the unwanted document filtering module 52 checks the currently collected web document against previously collected web documents in order to remove duplicates. It compares the content of the templates already stored in the template storage 51c with the content of the newly input templates, recognizes a template with identical content as a duplicate, and removes the collected web document and the template stored in the template storage 51c.
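The stop keyword check and the content-based duplicate check can be combined in a short Python sketch. Several points are assumptions rather than details from the patent: a single matching noun is enough to mark a document as unwanted, content is compared after whitespace normalization, and both storages are modeled as dictionaries keyed by URL. The function names are hypothetical.

```python
def contains_stop_keyword(nouns, stop_keywords):
    """Stop keyword checker: assumed to fire if any extracted noun is a stop keyword."""
    return any(noun in stop_keywords for noun in nouns)


def normalize(content):
    """Collapse whitespace so trivially different renderings compare equal."""
    return " ".join(content.split())


def filter_document(url, content, nouns, stop_keywords, temp_storage, template_storage):
    """Return True if the document is kept, False if it was removed.

    temp_storage maps URL -> raw document; template_storage maps URL -> matched content.
    """
    key = normalize(content)
    unwanted = contains_stop_keyword(nouns, stop_keywords)
    duplicate = key in template_storage.values()  # same content under another URL
    if unwanted or duplicate:
        temp_storage.pop(url, None)       # delete the collected web document
        template_storage.pop(url, None)   # delete its template, if one was stored
        return False
    template_storage[url] = key
    return True
```

Because the comparison uses the extracted content rather than the URL, two dynamic CGI URLs that serve the same article are recognized as duplicates.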

Referring to FIG. 4, the information object inspector 53a of the information object extraction module 53 receives the information object extraction information from the template storage 51c, inspects the nodes that have the document structure and style information from which information objects are to be extracted, and then extracts the content at the end of those nodes and passes it to the morpheme analyzer 40.

When the morpheme analyzer 40 has extracted nouns from that content, the information object selector 53b of the information object extraction module 53 checks whether the extracted nouns appear among the usable keywords, calculates the hit rate of the check, and extracts the content as a required information object if the hit rate is at or above a given threshold.
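The patent does not give a formula for the hit rate. A minimal Python sketch, assuming the hit rate is the fraction of extracted nouns found among the usable keywords and using an arbitrary illustrative threshold of 0.5, might look like this:

```python
def hit_rate(nouns, usable_keywords):
    """Fraction of extracted nouns that appear among the usable keywords (assumed definition)."""
    if not nouns:
        return 0.0
    hits = sum(1 for noun in nouns if noun in usable_keywords)
    return hits / len(nouns)


def select_information_object(content, nouns, usable_keywords, threshold=0.5):
    """Return the content as an information object if the hit rate reaches the threshold."""
    if hit_rate(nouns, usable_keywords) >= threshold:
        return content
    return None
```

For example, with the nouns `["price", "laptop", "weather", "sale"]` and the usable keywords `{"price", "laptop", "sale"}`, three of four nouns match, giving a hit rate of 0.75, so the content would be selected at a threshold of 0.5.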

The style- and content-information-based web document processing system configured as described above operates as follows, according to the method shown in FIG. 5.

First, the web robot 10 collects web documents according to site information, stores them in the temporary storage 20, and then passes the collected web documents to the parser 30 (S10).

When a collected web document is input to the parser 30, the parser 30 parses it and extracts its URL information, structure information, and style information (S20).

Once the URL information, structure information, and style information of the collected web document have been extracted, the pattern matching module 51 of the filtering engine 50 first matches the extracted structure information and style information against the pattern information, extracts the content contained in the lowest node of the pattern-matched structure information, and passes the extracted content to the morpheme analyzer 40 (S30).

When the extracted content is input to the morpheme analyzer 40, the morpheme analyzer 40 extracts nouns from that content and passes them to the filtering engine 50 (S40).

Next, the unwanted document filtering module 52 of the filtering engine 50 checks whether the nouns extracted from the pattern-matched content appear among the stop keywords. If they do, it treats the web document as an unwanted document and deletes it from the temporary storage 20 (S50).

Here, the unwanted document filtering module 52 performs two removal operations together. In unwanted document removal, if the extracted nouns appear among the stop keywords, it judges the collected web document to be an unwanted document, that is, a document outside the collection target, and deletes both the web document held in the temporary storage 20 and the template stored in the template storage 51c. In duplicate document removal, it checks the currently collected web document against previously collected web documents by comparing the content of the templates already stored in the template storage 51c with the content of the newly input templates; a template with identical content is recognized as a duplicate, and the collected web document and the template stored in the template storage 51c are removed.

Conversely, if the check by the unwanted document filtering module 52 of the filtering engine 50 finds that the nouns extracted from the pattern-matched content do not appear among the stop keywords, the information object extraction module 53 extracts the required information objects using the structure information, style information, and usable keywords of the web document (S60).

Here, the information object extraction module 53 of the filtering engine 50 receives the information object extraction information from the template storage, first inspects the nodes that have the document structure and style information from which information objects are to be extracted, extracts the content at the end of those nodes as text, and passes the extracted content to the morpheme analyzer 40 so that the morpheme analyzer 40 extracts nouns from it.

When the morpheme analyzer 40 has extracted the nouns, the information object extraction module 53 checks whether they appear among the usable keywords and calculates the hit rate of the check. If the hit rate is at or above a preset threshold, the module extracts the content as a required information object, so that the required information object can be found through a search engine.
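The overall flow of steps S10 to S60 can be summarized in a compact Python sketch. Each component is passed in as a callable so that the control flow stands on its own; the parameter names and the dictionary used as temporary storage are assumptions, and the any-noun rule for the stop keyword check is likewise assumed.

```python
def process_document(url, fetch, parse, match, extract_nouns,
                     stop_keywords, extract_object, temp_storage):
    """Run one web document through steps S10 to S60 and return its information object.

    fetch, parse, match, extract_nouns and extract_object are hypothetical
    stand-ins for the web robot, parser, pattern matching module,
    morpheme analyzer and information object extraction module.
    """
    temp_storage[url] = fetch(url)                     # S10: collect and store
    parsed = parse(url, temp_storage[url])             # S20: URL, structure, style
    content = match(parsed)                            # S30: lowest-node content
    nouns = extract_nouns(content)                     # S40: noun extraction
    if any(noun in stop_keywords for noun in nouns):   # S50: unwanted document
        del temp_storage[url]
        return None
    return extract_object(parsed, nouns)               # S60: information object
```

A document that fails the stop keyword check leaves no trace in the temporary storage, which is the behavior step S50 describes.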

As described above, the style- and content-information-based web document processing system and method according to the present invention remove the unwanted documents produced while a web robot collects web documents at the web robot's collection stage, using the structure and style information of each web document and the content contained in its tags, and accurately extract the information objects contained in the web documents. This optimizes the management and maintenance of collected documents when building and operating a portal site or similar service, reduces the cost and labor of operating the site, and minimizes the network resources wasted by the web robot.

In particular, because unwanted documents are handled by the system itself, the time and cost that collecting unwanted documents adds to internal document management and processing can be reduced. Because only the required documents are collected, an efficient search system can be built, and services such as a latest news article service or a search for specific products in a web shopping mall can be built and operated more simply and smoothly.

What has been described above is only one embodiment for carrying out the style- and content-information-based web document processing system and method according to the present invention. The present invention is not limited to this embodiment; its technical spirit extends to the range within which anyone of ordinary skill in the art to which the invention pertains can make various modifications without departing from the gist of the invention as claimed in the following claims.

FIG. 1 is a block diagram of the style- and content-information-based web document processing system according to the present invention;

FIG. 2 is a block diagram of the pattern matching module;

FIG. 3 is a block diagram of the unwanted document filtering module;

FIG. 4 is a block diagram of the information object extraction module;

FIG. 5 is a flowchart of the style- and content-information-based web document processing method according to the present invention.

<Description of reference numerals for the main parts of the drawings>

10: web robot 20: temporary storage

30: parser 40: morpheme analyzer

50: filtering engine 51: pattern matching module

52: unwanted document filtering module 53: information object extraction module

Claims

A web robot that collects web documents according to site information and stores the collected web documents in a temporary storage;

A parser for parsing and extracting URL information, structure information, and style information of a web document collected by the web robot and stored in the temporary storage;

A morpheme analyzer for extracting nouns from the content included in the lowest node of the structure information of the web document extracted by the parser; And

A filtering engine that extracts content by matching the structure information and style information against previously entered pattern information, treats the web document as an unwanted document and deletes it if nouns extracted from the content appear among stop keywords, and otherwise extracts required information objects using the structure information, style information, and usable keywords,

wherein the filtering engine comprises: a pattern matching module that matches the structure information and style information extracted by the parser against the previously entered pattern information, extracts the content contained in the lowest node of the pattern-matched structure information, and inputs the content to the morpheme analyzer;

An unwanted document filtering module that checks whether the nouns extracted from the pattern-matched content appear among the stop keywords and, if they do, treats the web document as an unwanted document and deletes it from the temporary storage;

An information object extraction module that extracts the required information objects using the structure information, style information, and usable keywords of the web document if the unwanted document filtering finds that the extracted nouns do not appear among the stop keywords; And

A schedule manager that runs the pattern matching module, the unwanted document filtering module, and the information object extraction module each as a thread and controls the input and output of each module.

(Deleted)

The method of claim 1, wherein the pattern matching module

A template generator for checking a collected web document URL and receiving a pattern information for providing a mapping target structure and style information to generate a template;

A document structure traversal unit which traverses the structure of the web document parsed by the parser and extracts nodes having the same structure as the template generated by the template generator;

After comparing the styles included in each node of the document structure extracted by the document structure traversal with the style information of the template, the nodes having the same style are selected, and the web document position information and content of each of the selected nodes are selected. A style mapper that associates a template with a template and stores the template in a template storage; And

A content extractor that extracts, in text form, the content contained in the lowest node among the nodes selected by the style mapper and inputs the content to the morpheme analyzer.

A web document processing system based on style and content information, characterized by comprising the above elements.
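The traversal, style mapping, and content extraction steps of the pattern matching module can be sketched as follows. This is only an illustration under stated assumptions: the `Node` and `Template` classes, the representation of the mapping target structure as a tag path from the root, and the function names are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical parsed-document node (tag, style attributes, text, children)."""
    tag: str
    style: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


@dataclass
class Template:
    """Hypothetical template built from pattern information."""
    url: str
    path: tuple   # mapping target structure as a tag path, e.g. ("table", "tr", "td")
    style: dict   # style attributes a matching node must carry


def traverse(node, path):
    """Yield every node reached by following the tag sequence `path` from `node`."""
    if not path or node.tag != path[0]:
        return
    if len(path) == 1:
        yield node
        return
    for child in node.children:
        yield from traverse(child, path[1:])


def style_matches(node, template):
    """True if the node carries every style attribute required by the template."""
    return all(node.style.get(k) == v for k, v in template.style.items())


def leaf_text(node):
    """Collect the text of the lowest (leaf) nodes at or below `node`."""
    if not node.children:
        return [node.text] if node.text else []
    texts = []
    for child in node.children:
        texts.extend(leaf_text(child))
    return texts


def extract_content(root, template):
    """Structure match, then style match, then lowest-node text extraction."""
    texts = []
    for node in traverse(root, template.path):
        if style_matches(node, template):
            texts.extend(leaf_text(node))
    return texts
```

The text returned by `extract_content` corresponds to what the content extractor would pass on to the morpheme analyzer.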

The system of claim 3, wherein the pattern information input to the template generator comprises:

Unusable document processing information consisting of a URL, a mapping target document structure, and style information included in the document structure; and

Information object extraction information composed of URL, mapping target document structure, style information included in the document structure, and range of mapping target structure
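The two kinds of pattern information listed above differ only in whether a range of the mapping target structure is present, so they can be modeled as one record type. The sketch below is an assumption about representation: the field names, the use of a tag path for the document structure, the `(start, end)` form of the range, and URL-prefix lookup are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PatternInfo:
    url: str                  # site URL (used here as a prefix) the pattern applies to
    structure: tuple          # mapping target document structure, as a tag path
    style: tuple              # (attribute, value) pairs included in that structure
    target_range: Optional[tuple] = None  # present only for information object extraction

    @property
    def kind(self):
        """'extraction' if a target range is given, otherwise 'unusable'."""
        return "extraction" if self.target_range is not None else "unusable"


def patterns_for(url, patterns):
    """Return the pattern records whose URL prefix matches the collected document URL."""
    return [p for p in patterns if url.startswith(p.url)]
```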

The system of claim 1 or 3, wherein the unusable document filtering module comprises:

An unusable keyword checker that checks whether the nouns extracted by the morpheme analyzer exist in the unusable keywords;

An unusable document remover that, when the extracted nouns exist in the unusable keywords, determines the collected web document to be an unusable document and deletes the web document collected in the temporary storage and the template stored in the temporary storage; and

A duplicate document remover that, in order to check for duplication between previously collected web documents and the currently collected web document, compares the contents of the templates previously stored in the template storage with the contents of the currently input template, recognizes templates having the same content as duplicates, and removes the duplicate document and the template stored in the template storage.
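The unusable keyword check and the duplicate check can be sketched together. The claim only requires that template contents be compared; fingerprinting the content with SHA-256 is a technique swapped in here for convenience, and the function names and the `(url, nouns, template_contents)` tuple layout are hypothetical.

```python
import hashlib


def is_unusable(nouns, unusable_keywords):
    """True if any extracted noun appears in the unusable keyword set."""
    return any(n in unusable_keywords for n in nouns)


def content_key(template_contents):
    """Fingerprint of a template's content, used only for the duplicate check."""
    joined = "\n".join(template_contents)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def filter_documents(docs, unusable_keywords):
    """Return the URLs that survive both the unusable and the duplicate check.

    docs: iterable of (url, nouns, template_contents) tuples.
    """
    seen = set()   # fingerprints of templates already accepted
    kept = []
    for url, nouns, contents in docs:
        if is_unusable(nouns, unusable_keywords):
            continue  # unusable document: dropped
        key = content_key(contents)
        if key in seen:
            continue  # same template content as an earlier document: duplicate
        seen.add(key)
        kept.append(url)
    return kept
```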

The system of claim 1, wherein the information object extraction module comprises:

An information object inspector that receives the information object extraction information from the template storage, inspects the nodes having the document structure and style information for extracting the information object, extracts the content existing at the end of each node, and inputs the content to the morpheme analyzer; and

An information object selector that, when the morpheme analyzer extracts nouns from the extracted content, checks the extracted nouns against the available keywords, calculates the hit rate of the check result, and extracts the content as the required information object if the hit rate is above a certain threshold.
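The hit-rate test performed by the information object selector can be sketched as below. The claim does not define the hit rate or the threshold value, so this sketch assumes the hit rate is the fraction of extracted nouns found in the available keywords, treats reaching the threshold as sufficient, and uses a hypothetical default threshold of 0.5.

```python
def hit_rate(nouns, available_keywords):
    """Fraction of extracted nouns found in the available keywords (assumed definition)."""
    if not nouns:
        return 0.0
    hits = sum(1 for n in nouns if n in available_keywords)
    return hits / len(nouns)


def select_information_objects(candidates, available_keywords, threshold=0.5):
    """Keep the content of every candidate whose hit rate reaches the threshold.

    candidates: list of (content, nouns) pairs; threshold is a hypothetical value.
    """
    return [
        content
        for content, nouns in candidates
        if hit_rate(nouns, available_keywords) >= threshold
    ]
```

For example, a candidate with four nouns of which three are available keywords has a hit rate of 0.75 and would be selected at the default threshold.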

Collecting web documents according to site information and storing the collected web documents in a temporary storage;

Parsing and extracting URL information, structure information, and style information of the web document collected by a web robot and stored in the temporary storage;

Extracting the content included in the lowest node of the pattern-matched structure information by performing pattern matching between pattern information and the structure information and style information;

Extracting nouns from content included in the lowest node of the extracted web document structure information;

Checking whether the nouns extracted from the pattern-matched content exist in the unusable keywords and, if they do, performing a filtering operation that regards the web document as an unusable document and deletes it from the temporary storage, the filtering operation including: an unusable document removing step of determining the collected web document to be an unusable document when the extracted nouns exist in the unusable keywords, and removing the web document and the template stored in the temporary storage; and a duplicate document removing step of comparing the contents of the templates previously stored in the template storage with the contents of the currently input template to check for duplication between previously collected web documents and the currently collected web document, and removing the duplicate web document and the template stored in the template storage; and

Extracting necessary information objects using the structure information, style information, and available keywords of the web document if the nouns extracted from the pattern-matched content do not exist in the unusable keywords.

A web document processing method based on style and content information, characterized by comprising the above steps.

delete

The method of claim 7, wherein extracting the required information object comprises:

Inspecting nodes having document structure and style information for extracting the information object and extracting content existing at the end of the node;

Extracting nouns from the extracted content; and

Checking whether the extracted nouns exist in the available keywords, calculating the hit rate of the check result, and extracting the content as the required information object if the hit rate exceeds a certain threshold.