KR20000075203A

KR20000075203A - General-purpose robot agent and real-time search method

Info

Publication number: KR20000075203A
Application number: KR1019990019679A
Authority: KR
Inventors: 김소영
Original assignee: 홍오성; 주식회사 웹나라
Priority date: 1999-05-31
Filing date: 1999-05-31
Publication date: 2000-12-15
Also published as: WO2000074294A3; WO2000074294A2; AU4532599A

Abstract

The interconnections among the collector, indexer, and searcher, individual modules that were previously hard-coded inside the program, are not coded but are expressed as templates, that is, in an independent form. The modules can therefore be operated freely as independent systems or as an integrated system and can be integrated easily as components of other systems, which improves the flexibility of the system.

Indexing does not depend on specific terms; instead, it relies on a common format shared by the documents.

The searcher searches not only the database but also designated websites directly, so that the contents of websites with short update cycles can be searched effectively. The present invention relates to a general-purpose robot agent and a real-time search method having these features.

Description

General-purpose robot agent and real-time search method

Web search engines (or search sites) currently in service are generally built on three basic elements: a robot agent system (commonly called a robot agent), a database, and a search stage.

Such a conventional web search engine program (or web search system) is described in detail below with reference to FIG. 1.

It generally consists of a robot agent system (100), a search system (hereinafter, searcher (200)), and a database (300). The robot agent system (100) is in turn divided into a collection system (hereinafter, collector (10)) and an indexing system (hereinafter, indexer (20)).

The collector (10) in the robot agent system (100) collects data from preconfigured locations on the web. The indexer (20) performs indexing, typically through morphological analysis and stop-word and particle processing, in conjunction with electronic dictionaries such as a noun dictionary, a synonym dictionary, and a particle dictionary.

The web data indexed in this way is stored in the database (300), and the searcher (200) searches the stored data.

When users search for information with a web search system, they are in effect using the searcher (200) to search the database (300).

However, conventional Internet search engines of this kind have the following disadvantages.

First, in an existing web search system developed for a specific purpose, the collector (10), the indexer (20), and so on are coded as part of the overall program and are therefore strongly dependent on it.

As a result, such a system is very inflexible when its functions are to be extended for another purpose or when it is to be converted to a system for another use. Maintenance also requires software expertise, so its cost is inevitably high.

This disadvantage can be illustrated with the following example.

A web search system built to search shopping malls on the web consists of a collector that collects only shopping mall websites and an indexer that indexes the collected shopping mall data, and the program is built in conjunction with an electronic dictionary for the shopping mall domain.

If this web search system is to be used for anything other than shopping malls, the collector (10), the indexer (20), and the searcher (200), all of which are coded inside the program specifically for shopping malls, must be changed.

Such changes require considerable software expertise, which makes free and easy modification difficult.

Conventional indexing is performed through morphological analysis and stop-word and particle processing. This process makes the electronic dictionaries difficult to build, the dictionaries themselves differ from one service domain to another, and their interpretation and processing cannot avoid hard-coding within the program. The flexibility and extensibility of the whole system are therefore greatly reduced.

The conventional searcher (200) searches the contents of the database (300). If data on the web has changed but the change has not yet been reflected in the database (300), the searcher (200) cannot find the changed content and returns the old content instead.

If the data on the web changes frequently, an existing robot agent therefore cannot retrieve the changed data.

The present invention is intended to remedy the above disadvantages. The collector, indexer, and searcher, individual modules that were previously coded inside the program, are not coded but are expressed as templates, that is, in an independent form, so that they can be operated freely as independent systems or as an integrated system and can be integrated easily as components of other systems, which improves the flexibility of the system.

In addition, indexing relies on a common format rather than on specific terms.

FIG. 1 is a schematic diagram of a conventional web search system.

FIG. 2 is a schematic diagram of the general-purpose robot agent and real-time search method of the present invention.

The present invention relates to a general-purpose robot agent and a real-time search method. The whole system is developed in pure Java, so it can run on most existing platforms without a separate porting procedure.

The robot agent (100) consists of a collector (10), which carries out the collection step, and an indexer (20), which carries out the indexing step.

The data indexed by the indexer (20) in the indexing step forms the database (300), and in the search step the database (300) is searched through the searcher (200).

When the web search system is started, that is, initialized, its processing is controlled through initialization files supplied from outside the program. An initialization file is a plain text file that can be edited easily according to a set of rules, and the system is controlled through it.

The collector's initialization file specifies, for each site, the initial URL, the number of threads, the collection and exclusion patterns, and so on.

The indexer's initialization file specifies the index rules for each site, and the searcher's specifies the per-site collection information and index rules and the display format.
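As an illustration of how a collector might read such an initialization file, the following is a minimal Java sketch. The patent does not disclose the file grammar, so the Java properties syntax, the key names (`site.<name>.startUrl`, `site.<name>.threads`, `site.<name>.skipPattern`), and the class name `CollectorConfig` are all assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Per-site collector settings read from an external plain-text template.
// The key names (site.<name>.startUrl, ...) are hypothetical.
class CollectorConfig {
    final String startUrl;     // initial URL for the site
    final int threads;         // number of collection threads
    final String skipPattern;  // regex of URLs to exclude ("" means none)

    CollectorConfig(String startUrl, int threads, String skipPattern) {
        this.startUrl = startUrl;
        this.threads = threads;
        this.skipPattern = skipPattern;
    }

    // Parses the settings of one site out of the template text.
    static CollectorConfig parse(String templateText, String site) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(templateText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String prefix = "site." + site + ".";
        String startUrl = props.getProperty(prefix + "startUrl");
        if (startUrl == null) {
            throw new IllegalArgumentException("no startUrl for site: " + site);
        }
        // Missing optional keys fall back to one thread and no skip pattern.
        int threads = Integer.parseInt(props.getProperty(prefix + "threads", "1").trim());
        String skipPattern = props.getProperty(prefix + "skipPattern", "");
        return new CollectorConfig(startUrl, threads, skipPattern);
    }

    public static void main(String[] args) {
        String template = String.join("\n",
                "# collector template (illustrative)",
                "site.shop.startUrl=http://shop.example.com/",
                "site.shop.threads=4",
                "site.shop.skipPattern=.*[.](gif|jpg)$");
        CollectorConfig c = parse(template, "shop");
        System.out.println(c.startUrl + " threads=" + c.threads + " skip=" + c.skipPattern);
    }
}
```

Because the settings live in text rather than in code, retargeting the collector to another site in this sketch means editing the template only; no recompilation is involved.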

Within the program, the control information for the collector (10), the indexer (20), and the searcher (200) was conventionally written as code. In the present invention it is instead expressed as templates located outside the program. Because the functions located outside the system can be changed easily to suit the intended use, the system can serve as a general-purpose robot agent. In addition, each subsystem is designed to function as an independent system, so it can be applied freely to special purposes.

Conventional indexing works from specific phrases or words in a web document. The present invention instead indexes on the basis of the regular format (a regular pattern) in which the web document is presented. The conventional morphological analysis and stop-word and particle processing are therefore unnecessary, which reduces the difficulty of building electronic dictionaries and of having to adapt them to each service domain. As a result, the flexibility and extensibility of the whole system are greatly improved.

In the search step, the searcher (200) not only searches the database (300) in the conventional way but also directly searches websites designated in advance. Websites whose latest updates could not be found by the conventional method can therefore also be searched.
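One way to picture a search step that consults both the database and the designated websites is the Java sketch below. The patent does not say how the two result sets are combined; keying results by URL and letting the live result replace the stored one is an assumption of this example, as are the class name `HybridSearcher` and the two lookup functions.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Combines stored results with results fetched directly from designated sites.
// Both lookups map a query to (URL -> snippet); they are hypothetical stand-ins
// for the database search and the direct website search.
class HybridSearcher {
    private final Function<String, Map<String, String>> databaseSearch;
    private final Function<String, Map<String, String>> liveSiteSearch;

    HybridSearcher(Function<String, Map<String, String>> databaseSearch,
                   Function<String, Map<String, String>> liveSiteSearch) {
        this.databaseSearch = databaseSearch;
        this.liveSiteSearch = liveSiteSearch;
    }

    Map<String, String> search(String query) {
        // Start from the database hits, then let live hits for the same URL
        // overwrite them so that recently updated pages show current content.
        Map<String, String> results = new LinkedHashMap<>(databaseSearch.apply(query));
        results.putAll(liveSiteSearch.apply(query));
        return results;
    }

    public static void main(String[] args) {
        HybridSearcher searcher = new HybridSearcher(
                q -> Collections.singletonMap("http://shop.example.com/pen", "price 1000 (stored)"),
                q -> Collections.singletonMap("http://shop.example.com/pen", "price 1200 (live)"));
        System.out.println(searcher.search("pen"));
    }
}
```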

In this way, the general-purpose robot agent and real-time search method can be used for a wide range of purposes, such as a general search system, a specialized shopping mall search system, or a specialized book information search system.

The operation of the general-purpose robot agent with these characteristics is described below, module by module.

The collector (10) is a subsystem of the robot agent and, like other existing search systems, basically performs a web server mirroring function.

What distinguishes it from existing systems is, as described above, the flexibility gained through templates. For efficient collection control, the URL patterns to skip during collection can be specified precisely, and the set of collected documents can itself be restricted in conjunction with indexing. The degree of parallelism (number of threads) can also be set per site to take network traffic into account.
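The skip-pattern check described above could look roughly like the following Java sketch. It assumes the skip patterns are ordinary regular expressions matched against the whole URL; the patent does not specify the pattern syntax, and the class name `UrlFilter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Decides whether the collector should fetch a URL, based on skip patterns
// that would normally come from the site's template.
class UrlFilter {
    private final List<Pattern> skipPatterns = new ArrayList<>();

    UrlFilter(List<String> skipRegexes) {
        for (String regex : skipRegexes) {
            skipPatterns.add(Pattern.compile(regex));
        }
    }

    // Returns true when no skip pattern matches the whole URL.
    boolean shouldCollect(String url) {
        for (Pattern p : skipPatterns) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        UrlFilter filter = new UrlFilter(Arrays.asList(".*[.](gif|jpg)$", ".*/cart/.*"));
        for (String url : Arrays.asList(
                "http://shop.example.com/item/1.html",
                "http://shop.example.com/logo.gif",
                "http://shop.example.com/cart/view")) {
            System.out.println(url + " -> " + filter.shouldCollect(url));
        }
    }
}
```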

The indexer (20) is basically controlled by a template, that is, an initialization file.

Unlike existing indexing systems, it uses no service-dependent electronic dictionaries or program routines at all when interpreting and parsing documents or loading them into the database. Because every document is indexed solely according to the index rules written in the template, the indexer is general-purpose, and its fast indexing speed yields a high indexing rate.

The index rule mechanism starts from the idea that the web documents to be indexed follow a particular pattern in the way they present information.

For example, the pages of a web shopping mall usually describe product information (product name, price, manufacturer, and so on) by reusing one description format rather than using a separate format for each product. The use of such a fixed format is especially common when a large amount of information is served. To index the information of a shopping mall, the repeated pattern is therefore written as a single index rule according to a defined grammar, and the indexer performs indexing with this index rule and the document as input.

If several index rules are written for the same site, the indexer automatically selects the rule that matches the document type. For efficient indexing, index rules also support indexing only documents that contain specific words, processing of extracted data (for example, removing or replacing specific strings), and storing arbitrary information in the database (for example, inserting a currency unit or home page URL as an additional field).
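The idea of a format-based index rule can be sketched in Java as follows. The patent's rule grammar is not disclosed, so this example substitutes a regular expression with named groups for the rule; the class name `IndexRule`, the field names, and the sample HTML are all hypothetical. The sketch covers three of the features mentioned above: extraction of a repeated pattern, clean-up of extracted strings, and insertion of a fixed extra field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// An index rule that extracts one record per occurrence of a repeated format.
class IndexRule {
    private final Pattern recordPattern;          // the repeated description format
    private final List<String> fields;            // named groups to extract
    private final Map<String, String> constants;  // fixed fields added to every record

    IndexRule(String recordRegex, List<String> fields, Map<String, String> constants) {
        this.recordPattern = Pattern.compile(recordRegex);
        this.fields = fields;
        this.constants = constants;
    }

    List<Map<String, String>> apply(String html) {
        List<Map<String, String>> records = new ArrayList<>();
        Matcher m = recordPattern.matcher(html);
        while (m.find()) {
            Map<String, String> record = new LinkedHashMap<>();
            for (String field : fields) {
                // Clean-up step: trim whitespace and drop commas. Applying this to
                // every field is a simplification made for the example.
                record.put(field, m.group(field).trim().replace(",", ""));
            }
            record.putAll(constants);
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        IndexRule rule = new IndexRule(
                "<tr><td>(?<name>[^<]+)</td><td>(?<price>[^<]+)</td></tr>",
                Arrays.asList("name", "price"),
                Collections.singletonMap("currency", "KRW"));
        String page = "<table><tr><td>Pen</td><td>1,200</td></tr>"
                + "<tr><td>Ink</td><td>800</td></tr></table>";
        System.out.println(rule.apply(page));
    }
}
```

No dictionary or language analysis is involved: the rule only needs to know where the fields sit in the page layout, which is why the same code can serve a different site once it is given a different rule.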

The searcher (200) is implemented with Java Servlet technology and is functionally divided into a meta search system and a directory (subject) search system. The template concept is applied throughout in the same way as in the robot agent (100), so a search site can be added or a user interface component changed simply by editing the initialization file. This overcomes a drawback of existing searchers, which must repeat the cumbersome cycle of program modification, compilation, and deployment even for trivial changes.

The following techniques are used to overcome statelessness (HTTP by default does not assume a session), one of the biggest obstacles for web applications, and to optimize response time, which is the key to searching frequently updated websites in real time as mentioned above.

The initialization file contains the control information needed to search a target site, that is, to collect from and index it, as efficiently as possible. The control information includes the cache time, the type of information held (linked to the directory), the shortest search path, and the index rule (a subset of the indexer's index rules).

To minimize overhead while serving requests, resources are managed by creating and initializing spare class instances, creating threads, and allocating shared memory at startup.
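The thread part of this start-up preparation can be illustrated with the standard `java.util.concurrent.ThreadPoolExecutor`, which postdates the patent and is used here only as a modern stand-in. The class name `WarmPool` and the pool size are assumptions of the sketch; spare object creation and shared memory allocation are not shown.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Creates all worker threads at startup so that the first requests do not pay
// the cost of thread creation.
class WarmPool {
    static ThreadPoolExecutor start(int workers) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        // By default core threads are created lazily, one per submitted task;
        // this call starts them all immediately.
        pool.prestartAllCoreThreads();
        return pool;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = start(4);
        System.out.println("threads ready: " + pool.getPoolSize());
        pool.shutdown();
    }
}
```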

An internal cache is implemented to gain the advantages of a local database. The cache consists of a memory cache and a disk cache and is organized by keyword (search term) and by subject (directory). A service request for the same search term or directory is thus answered with the cached content. The cache manager runs as a separate thread and automatically sets, releases, and switches cache entries.
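A minimal Java sketch of the memory half of such a cache follows. It assumes that the cache time acts as a fixed time-to-live per entry and that expired entries are dropped on lookup; the patent instead describes a cache manager thread, and the disk cache is omitted here. The class name `QueryCache` and the injected clock are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// In-memory cache keyed by search term or directory name, with a fixed lifetime
// per entry. The clock is injected so that expiry can be exercised without waiting.
class QueryCache {
    private static final class Entry {
        final String content;
        final long expiresAt;

        Entry(String content, long expiresAt) {
            this.content = content;
            this.expiresAt = expiresAt;
        }
    }

    private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();
    private final long cacheTimeMillis;
    private final LongSupplier clock;

    QueryCache(long cacheTimeMillis, LongSupplier clock) {
        this.cacheTimeMillis = cacheTimeMillis;
        this.clock = clock;
    }

    void put(String key, String content) {
        entries.put(key, new Entry(content, clock.getAsLong() + cacheTimeMillis));
    }

    // Returns the cached content, or null when the key is absent or expired.
    String get(String key) {
        Entry e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() >= e.expiresAt) {
            entries.remove(key, e); // remove only if it is still the same entry
            return null;
        }
        return e.content;
    }

    public static void main(String[] args) {
        QueryCache cache = new QueryCache(60_000L, System::currentTimeMillis);
        cache.put("pen", "<results for pen>");
        System.out.println(cache.get("pen"));
        System.out.println(cache.get("ink"));
    }
}
```

A short cache time keeps answers close to the live site at the cost of more fetches, while a long one favors response time; in the patent's scheme this trade-off is set per site in the initialization file.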

Through the methods described above, the general-purpose robot agent and real-time search method allow a search engine that would otherwise serve only one specific field to be used in a general search system, a specialized shopping mall search system, a specialized book information search system, and the like.

In addition, because the searcher searches designated websites directly rather than only the database, websites with short update cycles can be searched in real time.

Claims

In a web search system for searching data on the Internet, a general-purpose robot agent and real-time search method comprising:

a collection step of collecting data from predetermined websites through a collector (10);

an indexing step of indexing the data collected in the collection step through an indexer (20);

a step of storing the data produced by the indexing step in a database (300); and

a search step of searching the data in the database through a searcher (200).

The general-purpose robot agent and real-time search method according to claim 1, wherein, in the collection step, the indexing step, and the search step, the service-dependent control information is implemented in templates separate from the program.

The general-purpose robot agent and real-time search method according to claim 1, wherein the collected data is indexed on the basis of the description format of the document rather than the content of the document.

The general-purpose robot agent and real-time search method according to claim 1, wherein, in the search step, a website can be searched directly in addition to searching the database data.

The general-purpose robot agent and real-time search method according to claim 1, wherein the web search system is developed entirely in the Java language.