JP5108660B2 - Information collection method, apparatus, and program

Info

Publication number: JP5108660B2
Application number: JP2008171883A
Authority: JP
Inventors: 健一山本
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2008-06-30
Filing date: 2008-06-30
Publication date: 2012-12-26
Anticipated expiration: 2028-06-30
Also published as: JP2010015202A

Description

The present invention relates to an information collection method, apparatus, and program for collecting information described in Web documents.

Conventionally, to obtain information about a specific item from Web documents, a user had to browse the Web documents one by one and collect the information.

For a user who wants to look up information about personal computers, the technique described in Non-Patent Document 1 makes it possible to search for personal computer information by manufacturer and by retailer, based on information displayed in lists and similar views.
[Online], [retrieved June 18, 2008], Internet <URL: http://kakaku.com/pc/desktop-pc/>

However, even with the technique described in Non-Patent Document 1, information about personal computers is collected, accumulated, and re-edited by hand, so the work takes an enormous amount of time and labor. Moreover, that technique is offered for specific products only. When a user tries to look up a product that is not covered, or something other than a product, the coverage of the accumulated information is limited.

The present invention is proposed in view of these circumstances. Its object is to provide an information collection method, apparatus, and program that automatically collect information that is scattered across Web pages and that stands in the relationship of a common item, its attributes, and their attribute values.

To achieve this object, the inventor found a mechanism for automatically collecting information about items from a plurality of Web documents, and thereby arrived at the present invention.

The information collection method according to the present invention automatically collects information about items from Web documents by extracting, based on tags contained in the Web documents, information having an item, attribute, and attribute value relationship.

(1) An information collection method characterized in that an information collection apparatus executes at least:
a step of extracting, from a Web document accessible via a communication network, information in table format or database format based on tags contained in the Web document;
a step of extracting, from the extracted table-format or database-format information and based on the dependency relationships among the pieces of information indicated by the tags, information having the relationship of an attribute that is subordinate to a given item and an attribute value indicating the content of that attribute; and
a step of storing, in storage means and in association with one another, the extracted information having the item, attribute, and attribute value relationship.

According to the configuration of the invention described in (1), table-format or database-format information is extracted based on the tags contained in a Web document. From that extracted information, and based on the dependency relationships among the pieces of information indicated by the tags, information is then extracted that has the relationship of an attribute subordinate to a given item and an attribute value indicating the content of that attribute.

This makes it possible to automatically collect information having an item, attribute, and attribute value relationship from a plurality of Web documents.

Here, "table-format information" also includes information expressed in plain text that represents the equivalent of a table, for example through comma or space delimiting.

In addition, information that is commonly subordinate to a given item in a larger number of Web documents may be given a larger weight when it is extracted as an attribute and attribute value. Conversely, information that is not commonly subordinate to the given item in more than a predetermined threshold number of Web documents need not be extracted as an attribute and attribute value.
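
The weighting and threshold rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by the invention: the function name `weight_attributes`, the `(document_id, item, attribute)` triple layout, and the choice of the raw document count as the weight are all hypothetical.

```python
from collections import defaultdict


def weight_attributes(observations, threshold=1):
    """Weight each (item, attribute) pair by the number of documents it occurs in.

    observations: iterable of (document_id, item, attribute) triples (assumed layout).
    Pairs seen in `threshold` documents or fewer are dropped.
    """
    documents_by_pair = defaultdict(set)
    for document_id, item, attribute in observations:
        documents_by_pair[(item, attribute)].add(document_id)

    weights = {}
    for pair, documents in documents_by_pair.items():
        if len(documents) > threshold:  # keep only pairs exceeding the threshold
            weights[pair] = len(documents)  # assumed weight: document count
    return weights


observations = [
    ("doc-a", "notebook computer", "CPU"),
    ("doc-b", "notebook computer", "CPU"),
    ("doc-c", "notebook computer", "CPU"),
    ("doc-a", "notebook computer", "clock"),
    ("doc-b", "notebook computer", "clock"),
    ("doc-c", "notebook computer", "free shipping"),
]
weights = weight_attributes(observations, threshold=1)
```

In this sample, "free shipping" appears under the item in only one document, so it does not exceed the threshold and is not kept as an attribute.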

The attributes and attribute values that are extracted and stored can be used, for example, to suggest search queries when searching for Web documents related to an item.

(2) The method according to (1), wherein the extracting step extracts, as the item, information located immediately above, immediately below, or immediately to the left of the table-format information.

According to the configuration of the invention described in (2), information located immediately above, immediately below, or immediately to the left of the table-format information is extracted as the item.

This makes it possible to extract, as the item for table-format information, the information found at the positions where the title of such information is often displayed in Web documents.

Here too, as described above, accuracy may be improved by applying a weight according to how frequently the relationship occurs, or by not performing the extraction until that frequency reaches a predetermined threshold.

(3) The method according to (1) or (2), wherein the extracting step extracts, in the table-format information, information located in the top row or the leftmost column as attributes, and information located below or to the right of them, respectively, as attribute values.

According to the configuration of the invention described in (3), in the table-format information, information located in the top row or the leftmost column is extracted as attributes, and information located below or to the right of them, respectively, is extracted as attribute values.

This makes it possible to extract, as attributes, the information found at the positions where the attributes of table-format information are often displayed in Web documents, and to extract, as attribute values, the information found at the positions where the content of those attributes is often displayed.
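
As a rough illustration of (3), the sketch below pairs attributes with attribute values for a table that has already been read into a list of rows. The function name `pairs_from_grid`, the `header` parameter, and the assumption of a non-empty rectangular grid are hypothetical and are not taken from the specification.

```python
def pairs_from_grid(grid, header="row"):
    """Return (attribute, value) pairs from a non-empty rectangular list of rows.

    header="row":    attributes are in the top row, values below them.
    header="column": attributes are in the leftmost column, values to the right.
    """
    if header == "column":
        # Transpose so that the leftmost column becomes the top row.
        grid = [list(column) for column in zip(*grid)]
    attributes, *value_rows = grid
    return [
        (attribute, value)
        for row in value_rows
        for attribute, value in zip(attributes, row)
    ]


top_header = [["CPU", "clock"], ["CPUxxx", "1.5GHz"], ["CPUyyy", "2.0GHz"]]
left_header = [["CPU", "CPUxxx", "CPUyyy"], ["clock", "1.5GHz", "2.0GHz"]]

top_pairs = pairs_from_grid(top_header)
left_pairs = pairs_from_grid(left_header, header="column")
```

Both layouts yield the same pairs, because the left-column case is reduced to the top-row case by transposing the grid.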

(4) The method according to any one of (1) to (3), wherein the extracting step extracts information located immediately to the left of the database-format information as an attribute, and the database-format information as its attribute value.

According to the configuration of the invention described in (4), information located immediately to the left of the database-format information is extracted as an attribute, and the database-format information is extracted as its attribute value.

This makes it possible to extract, as an attribute, the information found at the position where the attribute of database-format information is often displayed in Web documents, and to extract, as the attribute value, the database-format information in which the content of that attribute is often displayed.

(5) The method according to any one of (1) to (4), wherein the tag on which the extraction of the table-format or database-format information is based is an HTML (HyperText Markup Language) tag forming a pull-down list, a tag contained in an XML (Extensible Markup Language) document, or an HTML table tag.

According to the configuration of the invention described in (5), information is judged to be in table format or database format based on an HTML tag forming a pull-down list, a tag contained in an XML document, or an HTML table tag. Table-format or database-format information can therefore be extracted automatically by checking the tags contained in a Web document for a match.

(6) A program characterized by causing a computer to execute the method according to any one of (1) to (5).

(7) An information collection apparatus characterized by comprising:
information group extraction means for extracting, from a Web document accessible via a communication network, information in table format or database format based on tags contained in the Web document; and
attribute relationship extraction means for extracting, from the extracted table-format or database-format information and based on the dependency relationships among the pieces of information indicated by the tags, information having the relationship of an attribute that is subordinate to a given item and an attribute value indicating the content of that attribute, and for storing, in storage means and in association with one another, the extracted information having the item, attribute, and attribute value relationship.

According to the present invention, by extracting information having an item, attribute, and attribute value relationship based on the tags contained in Web documents, it is possible to automatically collect information that is scattered across Web pages and that stands in the relationship of a common item, its attributes, and their attribute values.

The best mode for carrying out the present invention is described below with reference to the drawings. This is merely an example, and the technical scope of the present invention is not limited to it.

[Overall configuration of the information collection apparatus and related elements]
In FIG. 1, the information collection apparatus 1 is connected to a plurality of Web server apparatuses 2 through the Internet N, which serves as the communication network. The connection to the Internet N may be wired or wireless.

The information collection apparatus 1 acquires Web documents from the plurality of Web server apparatuses 2. Each Web server apparatus 2 provides various Web documents in response to requests from the information collection apparatus 1.

[Functional configuration of the information collection apparatus]
FIG. 2 is a diagram showing an outline of the functional configuration of the information collection apparatus 1 according to this embodiment. The information collection apparatus 1 includes Web document storage means 11, information group extraction means 12, and attribute relationship extraction means 13. It also has a Web document DB 15, an information group storage unit 16, and an attribute relationship DB 17 (DB is an abbreviation of database).

The Web document storage means 11 acquires Web documents from the Web server apparatuses 2 and stores them in the Web document DB 15. The information group extraction means 12 reads the Web documents accumulated in the Web document DB 15, extracts table-format or database-format information based on the tags contained in each Web document read, and stores that information in the information group storage unit 16. The attribute relationship extraction means 13 reads the table-format or database-format information stored in the information group storage unit 16, extracts, based on the tags contained in that information, the information having an item, attribute, and attribute value relationship, and registers the extracted information in the attribute relationship DB 17. The Web document storage means 11, the information group extraction means 12, and the attribute relationship extraction means 13 are realized by a computer executing a program.

The Web document DB 15, the information group storage unit 16, and the attribute relationship DB 17 are provided in an area of the storage device 410 included in the hardware described later.

[Configuration of the databases and related elements]
FIG. 3 is a diagram showing an outline of the configuration of the Web document DB 15, the information group storage unit 16, the attribute relationship DB 17, and related elements.

As shown in FIG. 3(a), the Web document DB 15 stores, in association with one another, a document ID, the communication address on the Internet N (such as the URL) of a Web document distributed on the Internet N, and the source code that is the description of that Web document.

As shown in FIGS. 3(b) and 3(c), the information group storage unit 16 stores the table-format or database-format information extracted from the source code of a Web document, together with the tags that make up that table format or database format.

As shown in FIG. 3(e), the attribute relationship DB 17 stores the information having an item, attribute, and attribute value relationship that is extracted from the table-format or database-format information.

As shown in FIG. 3(d), this embodiment includes an attribute dictionary 14 that is used when extracting information having an item, attribute, and attribute value relationship from table-format or database-format information.

FIGS. 3(b) and 3(c) are examples of table-format or database-format information extracted from Web documents. For example, assume that a Web page (Web document) of PC retailer A contains the table-format or database-format information shown in FIG. 3(b), and that a Web page (Web document) of PC retailer B contains the table-format or database-format information shown in FIG. 3(c).

Such information can be described in various ways: as a group of HTML forming a pull-down list, as an XML document, as a group of HTML forming a table with table tags, and so on.

When the information is described as a group of HTML forming a pull-down list, one conceivable approach is to detect the start tag and end tag of a <select> element and extract those tags together with the element content between them.

When the information is described as an XML document, the XML instance has a hierarchical information structure, so one conceivable approach is to extract the element content between the topmost start tag and end tag of the XML instance.

When the information is described as a group of HTML forming a table with table tags, one conceivable approach is to detect the start tag and end tag of a <table> element and extract those tags together with the element content between them.
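
The start-tag and end-tag detection described above can be sketched with Python's standard `html.parser` module. This is a minimal sketch, not the extraction logic of the embodiment: the class name `BlockExtractor` and its `blocks` attribute are hypothetical, and the captured source is rebuilt from parser events, so character references inside a block come back decoded rather than byte-for-byte.

```python
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collects the source of every outermost <select> or <table> element."""

    TARGETS = {"select", "table"}

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished (tag, source) pairs
        self._open_tag = None  # target tag currently being captured
        self._depth = 0        # nesting depth of that same tag
        self._parts = []       # pieces of the block being rebuilt

    def handle_starttag(self, tag, attrs):
        if self._open_tag is None and tag in self.TARGETS:
            self._open_tag = tag
        if self._open_tag is not None:
            if tag == self._open_tag:
                self._depth += 1
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without an end tag.
        if self._open_tag is not None:
            self._parts.append(self.get_starttag_text())

    def handle_data(self, data):
        if self._open_tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._open_tag is None:
            return
        self._parts.append(f"</{tag}>")
        if tag == self._open_tag:
            self._depth -= 1
            if self._depth == 0:  # matching end tag of the outermost element
                self.blocks.append((self._open_tag, "".join(self._parts)))
                self._open_tag, self._parts = None, []


page = (
    "<p>Models</p>"
    '<select name="notebook PC"><option>CPUxxx 1.5GHz</option></select>'
    "<table><tr><td>CPU</td></tr></table>"
)
extractor = BlockExtractor()
extractor.feed(page)
extractor.close()
```

After the run, `extractor.blocks` holds one entry per pull-down list or table, each with its start tag, end tag, and the content between them; the surrounding `<p>` element is ignored.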

In this embodiment, the information in FIG. 3(b) is assumed to be built with table tags as follows: the item "notebook PC" is described in the first column; "CPU" and "clock" are described in the first row as attributes of the notebook PC; "CPUxxx" and "CPUyyy" are described as attribute values of the attribute "CPU"; and "1.5GHz" and "2.0GHz" are described as attribute values of the attribute "clock".

When the item, attribute, and attribute value relationships are held in a database server or the like that stores the data underlying these Web documents, and are structured to be retrieved through a program such as CGI, the information having these relationships is collected by executing that CGI or similar program.

Also in this embodiment, the information in FIG. 3(c) is assumed to be built with table tags as follows: the item "notebook computer" is described as part of the content of a <caption> element written inside the <table> element; "CPU" and "clock" are described in the first row of the table that immediately follows, as attributes of the item "notebook computer"; and in the subsequent rows, "CPUzzz" and "CPUppp" are described as attribute values of the attribute "CPU", and "800MHz" and "3.2GHz" as attribute values of the attribute "clock".

FIG. 3(d) is an example of the attribute dictionary 14 used by the attribute relationship extraction means 13. The attribute dictionary 14 is stored in the storage device 410 (see FIG. 4) included in the hardware described later. In this embodiment, the attribute dictionary 14 associates items with attributes. For example, the registered items include "notebook computer" as well as synonyms such as "notebook PC". Attributes related to the item, such as "CPU", "HDD", "battery", and "price", are registered alongside it. In addition, lower-level attributes (attribute 2) related to an attribute (attribute 1) are registered. For example, for attribute 1 "CPU", attribute 2 entries such as "clock" and "cache" are registered.
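
One way to picture the attribute dictionary 14 is as a nested mapping. The layout below is only an assumed in-memory representation (the specification shows the dictionary as a figure and does not prescribe a data structure), and the helper names `canonical_item` and `sub_attributes` are hypothetical.

```python
# Assumed layout: item -> synonyms and attribute 1 -> list of attribute 2 names.
ATTRIBUTE_DICTIONARY = {
    "notebook computer": {
        "synonyms": ["notebook PC"],
        "attributes": {
            "CPU": ["clock", "cache"],
            "HDD": [],
            "battery": [],
            "price": [],
        },
    },
}


def canonical_item(name, dictionary=ATTRIBUTE_DICTIONARY):
    """Map an item name or one of its synonyms to the registered item name."""
    for item, entry in dictionary.items():
        if name == item or name in entry["synonyms"]:
            return item
    return None


def sub_attributes(item, attribute, dictionary=ATTRIBUTE_DICTIONARY):
    """Return the lower-level attributes (attribute 2) of an attribute 1."""
    return dictionary[item]["attributes"].get(attribute, [])


item = canonical_item("notebook PC")
cpu_details = sub_attributes(item, "CPU")
```

Looking up the synonym "notebook PC" resolves to the registered item "notebook computer", whose attribute "CPU" carries the lower-level attributes "clock" and "cache".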

FIG. 3(e) shows an example of the information having an item, attribute, and attribute value relationship that is stored in the attribute relationship DB 17. An item, its attributes, and their attribute values are stored in association with one another under an attribute relationship ID, which is a unique number. For example, one "notebook computer" item is stored with the attribute value "CPUxxx" for attribute 1 "CPU" and the attribute value "1.5GHz" for attribute 2 "clock".
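
The record layout of FIG. 3(e) could be modeled, for illustration, as a single SQLite table. The table name, column names, and the `register` helper below are assumptions made for this sketch; the specification does not state which database product or schema is used.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE attribute_relation (
        relation_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique relationship ID
        item        TEXT NOT NULL,
        attribute1  TEXT NOT NULL,
        value1      TEXT NOT NULL,
        attribute2  TEXT,
        value2      TEXT
    )
    """
)


def register(item, attribute1, value1, attribute2=None, value2=None):
    """Store one item/attribute/attribute-value record and return its ID."""
    cursor = connection.execute(
        "INSERT INTO attribute_relation "
        "(item, attribute1, value1, attribute2, value2) VALUES (?, ?, ?, ?, ?)",
        (item, attribute1, value1, attribute2, value2),
    )
    connection.commit()
    return cursor.lastrowid


first_id = register("notebook computer", "CPU", "CPUxxx", "clock", "1.5GHz")
second_id = register("notebook computer", "CPU", "CPUyyy", "clock", "2.0GHz")
rows = connection.execute(
    "SELECT item, value1, value2 FROM attribute_relation ORDER BY relation_id"
).fetchall()
```

Each call to `register` yields a new unique ID, so two records for the same item remain distinguishable.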

[Hardware configuration of the information collection apparatus]
FIG. 4 is a diagram showing the hardware configuration of the information collection apparatus 1 according to this embodiment.
The information collection apparatus 1 includes a CPU (Central Processing Unit) 310 constituting a control unit 300 (in a multiprocessor configuration, further CPUs such as a CPU 320 may be added), a bus line 200, a communication I/F (interface) 330, a main memory 340, a BIOS (Basic Input Output System) 350, an I/O controller 360, a hard disk 370, an optical disk drive 380, and a semiconductor memory 390. The hard disk 370, the optical disk drive 380, and the semiconductor memory 390 are collectively referred to as the storage device 410.

The control unit 300 is the part that performs overall control of the information collection apparatus 1. By reading and executing, as appropriate, the various programs stored on the hard disk 370 (described later), it cooperates with the hardware described above to realize the various functions according to the present invention.

The communication I/F 330 is the network adapter used when the information collection apparatus 1 exchanges information with the Web server apparatuses 2 (#1) to 2 (#N) and the like (FIG. 1) via the Internet N (FIG. 1). The communication I/F 330 may include a modem, a cable modem, and an Ethernet (registered trademark) adapter.

The BIOS 350 stores the boot program that the CPU 310 executes when the information collection apparatus 1 starts up, programs that depend on the hardware of the information collection apparatus 1, and the like.

Storage devices 410 such as the hard disk 370, the optical disk drive 380, and the semiconductor memory 390 can be connected to the I/O controller 360.

The hard disk 370 stores the various programs for making this hardware function as the information collection apparatus 1, the program that executes the functions of the present invention, the DBs 15 and 17 described above, the information group storage unit 16, and the attribute dictionary 14. The information collection apparatus 1 can also use a separately provided external hard disk (not shown) as an external storage device.

As the optical disk drive 380, for example, a DVD-ROM drive, a CD-ROM drive, a DVD-RAM drive, or a CD-RAM drive can be used. In that case, an optical disk 400 matching the drive is used. A program or data can also be read from the optical disk 400 by the optical disk drive 380 and provided to the main memory 340 or the hard disk 370 via the I/O controller 360.

The term "computer" in the present invention refers to an information processing apparatus that includes a storage device, a control unit, and the like. The information collection apparatus 1 is constituted by an information processing apparatus that includes the storage device 410, the control unit 300, and the like, and this information processing apparatus falls within the concept of a computer in the present invention.

[Hardware configuration of the Web server apparatus]
The Web server apparatus 2 has a hardware configuration similar to that of the information collection apparatus 1 described above.

[Flowchart according to the embodiment of the present invention]
FIG. 5 shows a flowchart of the information collection processing according to the embodiment of the present invention.

S1: The Web document storage means 11 downloads a Web document distributed on the network N from an arbitrary Web server apparatus 2. Treating the URL of the Web document as its address on the network, it stores an automatically generated document ID and this communication address in the Web document DB 15 in association with each other. The Web document storage means 11 also stores the document ID of the Web document in association with the source code that is the description of the Web document.
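
Step S1 can be sketched as follows. This is a simplified stand-in rather than the embodiment's implementation: the class name `WebDocumentDB`, the in-memory dictionaries, the sequential integer IDs, and the injectable `fetch` parameter are all assumptions, and the URL in the usage example is a placeholder that is never actually contacted.

```python
import itertools
from urllib.request import urlopen


def fetch_source(url):
    """Download a Web document and return its source code as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class WebDocumentDB:
    """In-memory stand-in for the Web document DB 15 (assumed structure)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.address_by_id = {}  # document ID -> URL
        self.source_by_id = {}   # document ID -> source code

    def store(self, url, fetch=fetch_source):
        source = fetch(url)                   # download first, so a failed
        document_id = next(self._next_id)     # fetch does not consume an ID
        self.address_by_id[document_id] = url
        self.source_by_id[document_id] = source
        return document_id


# Usage with a stub in place of a real download (placeholder URL).
db = WebDocumentDB()
document_id = db.store(
    "http://example.com/notebooks", fetch=lambda url: "<table></table>"
)
```

Passing the download function as a parameter keeps the storage logic testable without network access; calling `store(url)` with the default uses `fetch_source`.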

S2: The information group extraction means 12 analyzes the source code of the Web documents accumulated in the Web document DB 15 and determines, based on the tags written in the source code, whether table-format or database-format information is present. When it finds such information, it extracts the information and stores it in the information group storage unit 16.

As described above, conceivable ways of identifying table-format or database-format information from tags include finding the tags that make up a pull-down list, detecting an XML declaration, and finding table tags.
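
The three checks listed above can be sketched as a small classifier. The function name `detect_formats` and the returned labels are hypothetical, and the pattern matching is deliberately naive: it does not skip comments or scripts, and it does not handle a byte order mark before the XML declaration.

```python
import re


def detect_formats(source):
    """Return the table-like formats suggested by the tags in a document."""
    formats = []
    if re.search(r"<select\b", source, re.IGNORECASE):
        formats.append("pull-down list")  # tags forming a pull-down list
    if source.lstrip().startswith("<?xml"):
        formats.append("XML")             # XML declaration at the start
    if re.search(r"<table\b", source, re.IGNORECASE):
        formats.append("table")           # HTML table tags
    return formats


xml_result = detect_formats('<?xml version="1.0"?><items/>')
table_result = detect_formats("<html><TABLE><tr></tr></TABLE></html>")
plain_result = detect_formats("<p>No tabular data here</p>")
```

A document for which the function returns an empty list would simply be skipped in step S2.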

In this embodiment, assume for example that the information published on the Web page of PC retailer A shown in FIG. 3(b) and the information published on the Web page of PC retailer B shown in FIG. 3(c) have been extracted. As described above, the information in FIGS. 3(b) and 3(c) is assumed to be built with table tags.

S3: The attribute relationship extraction means 13 analyzes the table-format or database-format information stored in the information group storage unit 16 and extracts the information it contains that stands in an item, attribute, and attribute value relationship.

抽出にあたり、属性関係抽出手段１３は、項目、属性及び属性値の関係にある情報の所在を推定する。推定の方法は幾つか考えられる。
（１）表形式又はデータベース形式の情報が、＜Ｓｅｌｅｃｔ＞タグによりプルダウンリストを形成している場合、例えば＜Ｓｅｌｅｃｔ＞タグのｎａｍｅ属性の値を項目であると推定し、同＜Ｓｅｌｅｃｔ＞タグの要素内容に列記された＜Ｏｐｔｉｏｎ＞タグの要素内容を当該項目に関する属性及び属性値であると推定することが可能である。例えば、
＜ｓｅｌｅｃｔｎａｍｅ＝”ノートＰＣ”＞
＜ｏｐｔｉｏｎ＞ＣＰＵｘｘｘ１．５ＧＨｚ＜／ｏｐｔｉｏｎ＞
＜ｏｐｔｉｏｎ＞ＣＰＵｙｙｙ２．０ＧＨｚ＜／ｏｐｔｉｏｎ＞
＜／ｓｅｌｅｃｔ＞
上記において、項目「ノートＰＣ」、属性「ＣＰＵ」の属性値「ＣＰＵｘｘｘ」と推定することができる。また、属性「ＣＰＵ」に続く属性は「クロック」と推定し、属性「クロック」の属性値「１．５ＧＨｚ」と推定することができる。同様に、項目「ノートパソコン」、属性「ＣＰＵ」の属性値「ＣＰＵｙｙｙ」、属性「クロック」の属性値「２．０ＧＨｚ」と推定することができる。
（２）表形式又はデータベース形式の情報が、ＸＭＬインスタンスの場合、階層構造を成している上位のタグ要素名を「項目」と推定し、その一つ下位のタグ要素名を「属性」と推定し、当該「属性」を示すタグ要素名の属性又は要素内容を「属性値」と推定することが可能である。例えば、
＜ノートパソコン＞
＜ＣＰＵｔｙｐｅ＝”ＣＰＵｚｚｚ”＞
＜クロック＞１．１ＧＨｚ＜／クロック＞
＜／ＣＰＵ＞
＜ＣＰＵｔｙｐｅ＝”ＣＰＵｐｐｐ”＞
＜クロック＞３．２ＧＨｚ＜／クロック＞
＜／ＣＰＵ＞
＜／ノートパソコン＞
上記において、項目「ノートパソコン」、属性「ＣＰＵ」の属性値「ＣＰＵｚｚｚ」、属性「クロック」の属性値「１．１ＧＨｚ」を推定することができる。同様に、項目「ノートパソコン」、属性「ＣＰＵ」の属性値「ＣＰＵｐｐｐ」、属性「クロック」の属性値「３．２ＧＨｚ」を推定することができる。
（３）表形式又はデータベース形式の情報がテーブルタグによって構成されている場合、例えば、表のタイトルを「項目」と推定し、１行目にある要素の列を各「属性」と推定し、２行目以降にある要素を同列の属性に対応する「属性値」と推定することが考えられる。例えば、
＜ｔａｂｌｅ＞
＜ｃａｐｔｉｏｎ＞ノートパソコン＜／ｃａｐｔｉｏｎ＞
＜ｔｒ＞
＜ｔｄ＞ＣＰＵ＜／ｔｄ＞
＜ｔｄ＞クロック＜／ｔｄ＞
＜／ｔｒ＞
＜ｔｒ＞
＜ｔｄ＞ＣＰＵｚｚｚ＜／ｔｄ＞
＜ｔｄ＞１．１ＧＨｚ＜／ｔｄ＞
＜／ｔｒ＞
＜ｔｒ＞
＜ｔｄ＞ＣＰＵｐｐｐ＜／ｔｄ＞
＜ｔｄ＞３．２ＧＨｚ＜／ｔｄ＞
＜／ｔｒ＞
＜／ｔａｂｌｅ＞
上記において、項目「ノートパソコン」、属性「ＣＰＵ」の属性値「ＣＰＵｚｚｚ」、属性「クロック」の属性値「１．１ＧＨｚ」を推定することができる。同様に、項目「ノートパソコン」、属性「ＣＰＵ」の属性値「ＣＰＵｐｐｐ」、属性「クロック」の属性値「３．２ＧＨｚ」を推定することができる。 At the time of extraction, the attribute relationship extracting unit 13 estimates the location of information having a relationship between items, attributes, and attribute values. Several estimation methods are conceivable.
(1) When the information in the table format or the database format forms a pull-down list by the <Select> tag, for example, the value of the name attribute of the <Select> tag is estimated as an item, and the <Select> tag It is possible to presume that the element contents of the <Option> tag listed in the element contents are attributes and attribute values related to the item. For example,
<Select name = "Note PC">
<Option> CPU xxx 1.5 GHz </ option>
<Option> CPU yyy 2.0GHz </ option>
</ Select>
In the above description, the attribute value “CPU xxx” of the item “notebook PC” and the attribute “CPU” can be estimated. The attribute following the attribute “CPU” can be estimated as “clock”, and the attribute value “1.5 GHz” of the attribute “clock” can be estimated. Similarly, the attribute value “CPU yy” of the item “notebook computer”, the attribute “CPU”, and the attribute value “2.0 GHz” of the attribute “clock” can be estimated.
(2) When the information in the table format or the database format is an XML instance, the name of the upper tag element in the hierarchical structure can be estimated as the “item”, the name of the tag element one level below it as an “attribute”, and the XML attribute or element content of the tag element representing that “attribute” as the “attribute value”. For example,
<NotebookComputer>
<CPU type="CPU zzz">
<Clock>1.1 GHz</Clock>
</CPU>
<CPU type="CPU ppp">
<Clock>3.2 GHz</Clock>
</CPU>
</NotebookComputer>
In the above, for the item “notebook computer”, the attribute value “CPU zzz” of the attribute “CPU” and the attribute value “1.1 GHz” of the attribute “clock” can be estimated. Similarly, for the item “notebook computer”, the attribute value “CPU ppp” of the attribute “CPU” and the attribute value “3.2 GHz” of the attribute “clock” can be estimated.
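A minimal Python sketch of method (2), using the standard library's `xml.etree.ElementTree`, might look like this. The function name `extract_xml` is hypothetical, and the root element name is written without a space because XML names cannot contain one. Treating a nested child element (such as the clock element) as a further attribute of the same item is an assumption made here to reproduce the example; the text only describes the first level explicitly.

```python
import xml.etree.ElementTree as ET


def extract_xml(xml_text):
    """Return (item, attribute, attribute value) triples from an XML instance."""
    root = ET.fromstring(xml_text)
    item = root.tag                      # upper tag element name -> item
    triples = []
    for attr_el in root:                 # one level below -> attribute
        # XML attributes of the element are taken as attribute values.
        for value in attr_el.attrib.values():
            triples.append((item, attr_el.tag, value))
        # Non-empty element content is also taken as an attribute value.
        text = (attr_el.text or "").strip()
        if text:
            triples.append((item, attr_el.tag, text))
        # Assumption: nested children are further attributes of the item.
        for child in attr_el:
            child_text = (child.text or "").strip()
            if child_text:
                triples.append((item, child.tag, child_text))
    return triples
```

The element names are returned exactly as written in the XML, so any mapping to a canonical item or attribute name would be a separate step.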
(3) When the information in the table format or the database format is built from table tags, the table title can, for example, be estimated as the “item”, each element in the first row as an “attribute”, and the elements in the second and subsequent rows as the “attribute values” corresponding to the attribute in the same column. For example,
<table>
<caption>notebook computer</caption>
<tr>
<td>CPU</td>
<td>clock</td>
</tr>
<tr>
<td>CPU zzz</td>
<td>1.1 GHz</td>
</tr>
<tr>
<td>CPU ppp</td>
<td>3.2 GHz</td>
</tr>
</table>
In the above, for the item “notebook computer”, the attribute value “CPU zzz” of the attribute “CPU” and the attribute value “1.1 GHz” of the attribute “clock” can be estimated. Similarly, for the item “notebook computer”, the attribute value “CPU ppp” of the attribute “CPU” and the attribute value “3.2 GHz” of the attribute “clock” can be estimated.
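Method (3) can be illustrated with the following Python sketch based on the standard library's `html.parser`. The names `extract_table` and `_TableParser` are hypothetical, and the sketch assumes a single, non-nested table whose first row holds the attribute names.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Collects the caption and the cell texts of a simple HTML table."""

    def __init__(self):
        super().__init__()
        self.caption = ""
        self.rows = []
        self._target = None   # "caption", "cell", or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:             # tolerate a cell without <tr>
                self.rows.append([])
            self._target, self._buffer = "cell", []
        elif tag == "caption":
            self._target, self._buffer = "caption", []

    def handle_data(self, data):
        if self._target:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer).strip()
        if tag in ("td", "th") and self._target == "cell":
            self.rows[-1].append(text)
            self._target = None
        elif tag == "caption" and self._target == "caption":
            self.caption = text
            self._target = None


def extract_table(html):
    """Caption -> item, first row -> attributes, later rows -> attribute values."""
    parser = _TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [(parser.caption, attr, value)
            for row in body
            for attr, value in zip(header, row)]
```

Pairing each cell with the header cell in the same column via `zip` mirrors the "same column" rule in the text; rows shorter than the header are simply truncated.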

(4) A method using the attribute dictionary 14 is also conceivable. The attribute relationship extraction unit 13 refers to the items registered in the attribute dictionary 14 and determines whether the information A stored in the information group storage unit 16 contains the same item. If it does, the unit refers to the attributes associated with that item in the attribute dictionary 14 and determines whether the information A contains the same attribute. If it does, the element content located, for example, immediately after that attribute in the information A is estimated to be the attribute value for that attribute. With the attribute dictionary 14, even when the text representing an item or attribute partly contains unnecessary words, the item name or attribute name can be obtained by ignoring those words.
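The dictionary-based method (4) could be sketched in Python as below. The data shapes are assumptions made for illustration: information A is modeled as a flat list of element contents in document order, and the attribute dictionary as a mapping from item name to a list of attribute names. Substring matching stands in for "ignoring unnecessary words", and only the first occurrence of each attribute is used.

```python
def extract_with_dictionary(elements, attribute_dict):
    """Return (item, attribute, attribute value) triples found via the dictionary.

    elements:        element contents of information A, in document order
    attribute_dict:  {item name: [attribute names]} (hypothetical layout)
    """
    triples = []
    for item, attributes in attribute_dict.items():
        # Is the registered item contained in information A?
        if not any(item in element for element in elements):
            continue
        for attribute in attributes:
            # Find the first element mentioning the attribute; the element
            # content immediately after it is taken as the attribute value.
            for index, element in enumerate(elements[:-1]):
                if attribute in element:
                    triples.append((item, attribute, elements[index + 1]))
                    break
    return triples
```

For example, with the elements `["Recommended notebook computer!", "CPU (processor)", "CPU zzz", "clock speed", "1.1 GHz"]` and a dictionary entry `{"notebook computer": ["CPU", "clock"]}`, the decorated labels still match the registered names.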

Note that the attribute relationship extraction unit 13 may extract, as the item, information located immediately above, immediately below, or immediately to the left of the tabular information. In tabular information, it may also extract the information in the top row or the leftmost column as attributes, and the information located below or to the right of each, respectively, as attribute values. Furthermore, it may extract the information located immediately to the left of database format information as an attribute, and the database format information itself as the corresponding attribute value.
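The positional variants described here can be illustrated with a small Python sketch. It assumes the tabular information has already been reduced to a grid (a list of rows of cell texts) and that the item has been found separately; the function name `extract_by_position` and the `header` parameter are hypothetical.

```python
def extract_by_position(item, grid, header="top"):
    """Pair attributes with attribute values according to their position.

    header="top":  the top row holds attributes, cells below are their values
    header="left": the leftmost column holds attributes, cells to the right
                   are their values
    """
    if not grid:
        return []
    if header == "top":
        attributes, *rows = grid
        return [(item, attribute, value)
                for row in rows
                for attribute, value in zip(attributes, row)]
    if header == "left":
        return [(item, row[0], value)
                for row in grid if row
                for value in row[1:]]
    raise ValueError("header must be 'top' or 'left'")
```

Both orientations produce the same triples for the same underlying data, which is what allows tables laid out differently on different sites to be merged later.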

S4: The attribute relationship extraction unit 13 registers the information having the item, attribute, and attribute value relationship extracted in S3 in the attribute relationship DB 17. In this embodiment, the attribute relationship extraction unit 13 refers to the attribute dictionary 14 during this registration and unifies synonyms of an item into a single form. For example, the item “notebook PC” obtained from the information shown in FIG. 3(b) is unified to “notebook computer” by referring to the items of the attribute dictionary shown in FIG. 3(d), and is then registered in the attribute relationship DB 17. Unifying items in this way makes the information in the generated attribute relationship DB 17 easier to use. Alternatively, a synonym dictionary may be provided separately from the attribute dictionary 14, and synonyms of items and attributes may be unified by referring to it. In the example of FIG. 3, the attribute relationship DB shown in FIG. 3(e) can be generated from the table format or database format information shown in FIGS. 3(b) and 3(c).
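The registration step S4 with synonym unification might be sketched as follows. The in-memory dictionary standing in for the attribute relationship DB 17, the `synonyms` mapping, and the function name `register` are all assumptions made for illustration; the actual storage layout of FIG. 3(e) is not reproduced here.

```python
def register(triples, synonyms, db=None):
    """Register (item, attribute, value) triples under a unified item name.

    synonyms: {variant item name: canonical item name} (hypothetical layout)
    db:       {item: {attribute: [values]}}, a stand-in for the
              attribute relationship DB
    """
    db = {} if db is None else db
    for item, attribute, value in triples:
        canonical = synonyms.get(item, item)   # unify synonyms of the item
        values = db.setdefault(canonical, {}).setdefault(attribute, [])
        if value not in values:                # avoid duplicate entries
            values.append(value)
    return db
```

With `synonyms = {"notebook PC": "notebook computer"}`, triples collected under either name end up under the single item "notebook computer", so values gathered from different Web documents are organized by attribute in one place.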

As described above, the information group extraction unit 12 and the attribute relationship extraction unit 13 extract information having an item, attribute, and attribute value relationship based on the tags contained in Web documents. Information about an item can therefore be collected automatically from a plurality of Web documents on different sites and obtained as information organized by attribute.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. For example, the “item” is not limited to the examples given in this embodiment; anything that has “attributes” and “attribute values” can be an “item”. Likewise, the method of extracting information in the table format or database format, and the method of extracting from that information the information having an item, attribute, and attribute value relationship, are not limited to the examples given in this embodiment. In addition, the steps shown in FIG. 6 need not all be executed in sequence each time one Web document is stored. Each step may instead run asynchronously as a batch process.

The effects described in the embodiment of the present invention merely list the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment.

A diagram showing the overall configuration of the information collection device according to this embodiment and related elements.
A diagram showing an outline of the functional configuration of the information collection device according to this embodiment.
A diagram showing an outline of the configuration of the Web document database, the information group extraction means, the attribute relationship database, and related elements according to this embodiment.
A diagram showing the hardware configuration of the information collection device according to this embodiment.
A flowchart of the information collection process according to the embodiment of the present invention.

Explanation of symbols

1 Information collection device
2 Web server device
11 Web document storage means
12 Information group extraction means
13 Attribute relationship extraction means
14 Attribute dictionary
15 Web document DB
16 Information group storage unit
17 Attribute relationship DB

Claims

1. An information collection method in which an information collection device performs at least:
a first step of extracting information in a table format or a database format from a plurality of Web documents accessible via a communication network, based on tags included in the Web documents;
a second step of extracting, from the extracted information in the table format or database format, information having a relationship between an attribute subordinate to a predetermined item and an attribute value indicating the content of the attribute, based on the dependency relationship between pieces of information indicated by the tags; and
a third step of associating the information having the item, attribute, and attribute value relationship extracted for each Web document, and storing the associated information for each Web document in a storage unit as unified information,
wherein the second step extracts, as attributes and attribute values, information that is commonly subordinate to the predetermined item in a number of Web documents exceeding a predetermined threshold.

2. The information collection method according to claim 1, wherein, in the extracting step, the item and the attribute are extracted from the information in the table format or the database format by referring to an attribute dictionary in which items and the attributes associated with each item are stored, and the attribute value is identified from the attribute.

3. The information collection method according to claim 2, wherein, in the storing step, when storing in the storage unit, items extracted from the information in the table format or the database format that are synonyms of the item are unified into a single item by referring to the attribute dictionary, which further stores synonyms of the item.

4. The information collection method according to claim 1, wherein, in the extracting step, information located immediately to the left of the database format information is extracted as an attribute, and the database format information is extracted as its attribute value.

5. The information collection method according to any one of claims 1 to 4, wherein the tag on which the extraction of the information in the table format or database format is based is an HTML tag forming a pull-down list, a tag included in an XML document, or an HTML table tag.

6. A program for causing a computer to execute the information collection method according to any one of claims 1 to 5.

7. An information collection device comprising:
information group extracting means for extracting information in a table format or a database format from a plurality of Web documents accessible via a communication network, based on tags included in the Web documents; and
attribute relationship extracting means for extracting, from the extracted information in the table format or database format, information having a relationship between an attribute subordinate to a predetermined item and an attribute value indicating the content of the attribute, based on the dependency relationship between pieces of information indicated by the tags, associating the information having the item, attribute, and attribute value relationship extracted for each Web document, and storing the associated information for each Web document in a storage unit as unified information,
wherein the attribute relationship extracting means extracts, as attributes and attribute values, information that is commonly subordinate to the predetermined item in a number of Web documents exceeding a predetermined threshold.