JP2004318746A

JP2004318746A - Information collection system, information collection method and information collection program

Info

Publication number: JP2004318746A
Application number: JP2003115226A
Authority: JP
Inventors: Hitoshi Otomi; 仁大富; Akiji Koizumi; 章治小泉; Yoshiyasu Kubota; 能泰久保田; Yuuki Kamata; 祐希鎌田
Original assignee: TOWARD Inc
Current assignee: TOWARD Inc
Priority date: 2003-04-21
Filing date: 2003-04-21
Publication date: 2004-11-11

Abstract

PROBLEM TO BE SOLVED: To provide a system and the like that efficiently and quickly collects publicly available information.

SOLUTION: This information collection system has: a data processor 10 having a plurality of data processing means 14, each of which accesses one page inside one site to collect and process various kinds of data on the plurality of contents constituting the site, and a management means that manages those means; and a storage device 20 that stores specific site information, including information identifying the site first accessed by the processor 10, and information about each of the contents inside the site. Each data processing means 14 accesses the page based on the specific site information, acquires the header information of the page, acquires the contents, acquires link information, and registers prescribed information derived from each of these in the storage device 20. The management means starts at least one of the data processing means 14 on the basis of the link information and the information inside the storage device 20, and ends data collection for the site on the basis of the link information. This yields a high-speed, high-quality information collection system, and provides an information collection system, an information collection method and an information collection program that efficiently and quickly collect the information in a specified site.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、情報収集システム、情報収集方法、及び情報収集プログラムに関し、詳しくは、主にネットワークに公開された情報を収集するシステム等、及び、収集した情報をクライアント側の各種コンピュータ機器に提供するシステム等に関するものであり、特に、指定されたサイト内における多数の文書やデータの高速な取得を実現する情報収集システム等に関するものである。
【０００２】
【従来の技術】
従来より、ネットワークに公開された情報を収集するシステムとしては、ロボット型サーチエンジンが知られている。このロボット型サーチエンジンでは、ロボットと呼ばれるプログラムがインターネットを巡回して、ＷＷＷ（ＷｏｒｌｄＷｉｄｅＷｅｂ）ページのデータを自動収集している。
【０００３】
従来のロボット型サーチエンジンは、主にインターネット全体を対象にしてＷＷＷページのデータ収集をしており、特定のサイトだけを対象にしたＷＷＷページのデータ収集ではなく、また、効率化に関しても、いかに大量のＷＷＷページのデータを収集するかに主眼が置かれており、収集の速さを目的としたものではなかった。
【０００４】
そのため、従来のロボット型サーチエンジンのシステムは、データ収集の対象がインターネット全体であるために、ロボットがインターネット全体を常に巡回しつづける構成となっており、該システム全体の制御を司るＣＰＵや、該システムが利用するネットワークに高い負荷をかけてしまう、という問題があった。
【０００５】
これに対し、複数のロボットを効率的に動作させ、各ロボットの同時稼動時間を短縮することで、ＣＰＵやネットワークの負荷を減らそうとするものもある（例えば特許文献１参照）。
【０００６】
しかしながら、この特許文献１のシステムでは、複数のロボットをページ（コンテンツ）毎に分散して動作させる構成であり、ＣＰＵやネットワークの負荷軽減を実現するために、一度に大量のロボットが動作して多量のページを収集しないように各々のロボットが時間をおいて順番的に起動してしまうことから、情報収集時間自体が延長してしまう問題を有していた。
【０００７】
また、従来のロボットは、仕様が公開されているために解読が容易なＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐＬａｎｇｕａｇｅ）の巡回・取得に限られることが多く、複数の形式の情報収集を行うプログラムを作成すると構造が複雑になることなどから、昨今広く普及しているＭａｃｒｏｍｅｄｉａ（登録商標）社（ｈｔｔｐ：／／ｗｗｗ．ｍａｃｒｏｍｅｄｉａ．ｃｏｍ／）のＦｌａｓｈに代表されるＨＴＭＬ以外のマルチメディアコンテンツに対応できないことが多い。
【０００８】
また、従来のロボットは、サーチエンジンの一部であり、かつ、インデクサと呼ばれるＷＷＷの全文検索用の専用プログラムと密な関係にある場合が多いため、ＷＷＷ検索以外の機能への応用が難しい、という問題があった。
【０００９】
また、ＷＷＷなどの普及に伴い、検索以外のさまざまなサービスのニーズがあるにもかかわらず、従来提供されているソフトウェアは、ＨＴＭＬ単独の取得ツールやライブラリか、または上述のような応用プログラムとのセットである場合が多かった。
【００１０】
【特許文献１】
特開２０００−７６２６４号公報
【００１１】
【発明が解決しようとする課題】
本発明は、上述の問題点を解決すべく提案されたものであり、高速で質の高い情報収集のシステムを構築し、指定されたサイト内における情報を効率良く高速に収集する、情報収集システム、情報収集方法、及び情報収集プログラムを提供することを第一の目的とする。
【００１２】
また、本発明は、情報収集に際してＣＰＵやネットワーク等のハードウェア資源の負荷を軽減することが可能な、情報収集システム、情報収集方法、及び情報収集プログラムを提供することを第二の目的とする。
【００１３】
また、本発明は、指定されたサイト内における情報収集後の情報提供に関するサービスの向上を実現した、情報収集システム、情報収集方法、及び情報収集プログラムを提供することを第三の目的とする。
【００１４】
【課題を解決するための手段】
本発明の情報収集システムの第一の構成は、一のサイト内の一のページにそれぞれアクセスして、当該サイトを構成する複数のコンテンツについての各種データを収集し、処理するための複数のデータ処理手段と、データ処理手段を管理する管理手段と、を備えたデータ処理装置と、データ処理装置が最初にアクセスするためのサイトを示す情報が含まれた特定サイト情報と、当該サイト内の各コンテンツについての情報と、を少なくとも記憶するための記憶装置と、を有し、データ処理装置のデータ処理手段は、予め記憶装置に記憶された特定サイト情報に基づくページ、及び、該ページにリンクされたリンクページにアクセスするためのページアクセス手段と、アクセスしたページのヘッダ情報を取得するヘッダ情報取得手段と、アクセスしたページのコンテンツを取得するためのコンテンツ取得手段と、アクセスしたページにリンクされたリンクページの場所を示すリンク情報を取得するリンク情報取得手段と、取得した各情報に基づく所定の情報を記憶装置に登録する情報登録手段とを有し、データ処理装置の管理手段は、記憶装置に記憶された情報及びリンク情報取得手段で取得したリンク情報に基づいて、データ処理手段をｎ（ｎは１以上）個起動させる起動管理手段と、リンク情報取得手段で取得したリンク情報に基づいて、当該サイトに対するデータ収集を終了させる終了管理手段と、を備えたことを特徴とする。
【００１５】
上記第一の構成を備えた情報収集システムにおいては、記憶装置に記憶された特定サイト情報に基づいて、起動管理手段によりデータ処理手段の起動が行われると、ページアクセス手段により一のページへのアクセス処理が行われ、続いてヘッダ情報取得手段による当該ページのヘッダ情報の取得処理、コンテンツ取得手段による当該ページのコンテンツの取得処理、リンク情報取得手段によるリンク情報の取得処理、が順次行われるとともに、取得した各情報に基づく所定の情報が情報登録手段により記憶装置に登録され、また、取得したリンク情報に基づいて、起動管理手段により１又は複数個のデータ処理手段の起動が行われることで、上述の各処理が同時並行的に、かつ繰り返し行われ、また、終了管理手段により当該サイトに対するデータ収集が終了させられる。
【００１６】
従って、第一の構成によれば、予め設定されたサイト内の各種データを、各ページのリンク（或いは階層）関係に基づいて、高速に収集することが可能となる。
【００１７】
本発明の情報収集システムの第二の構成は、上記第一の構成において、記憶装置には、特定サイト情報として、収集開始日時を示す収集開始情報が記憶され、データ処理装置の起動管理手段は、収集開始情報に基づいて、データ処理手段を起動させることを特徴とする。
【００１８】
上記第二の構成においては、記憶装置に記憶された特定サイト情報に基づくページひいては当該サイト内の各ページへのアクセス（巡回）が定期的に行われ、これにより、当該サイトを構成する複数のページのコンテンツについての情報が定期的に収集されるので、ハードウェア資源（ＣＰＵやネットワーク等）の負荷を軽減しつつ、各種コンテンツについての更新や新規追加等の情報を定期的に得ることが可能となる。
【００１９】
本発明の情報収集システムの第三の構成は、上記第一又は第二の構成において、記憶装置には、特定サイト情報として、当該一のサイト内の各ページの処理を行うデータ処理手段の最大同時併存数を示す併存数上限情報が記憶され、データ処理装置の起動管理手段は、併存数上限情報の最大同時併存数の範囲内で、データ処理手段を起動させることを特徴とする。
【００２０】
上記第三の構成においては、併存数上限情報により、ページアクセス手段の最大同時起動数が制限され、かつ、ヘッダ情報取得手段やコンテンツ取得手段、リンク情報取得手段、情報登録手段、等の各手段の動作も制限されるので、ハードウェア資源（ＣＰＵやネットワーク等）に一時的な高負荷をかけることがなくなり、ハードウェア資源を効率的に使用することが可能となる。
【００２１】
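A fourth configuration of the information collection system of the present invention is characterized in that, in any one of the first to third configurations, the data processing means notifies the management means that processing has finished once registration in the storage device by the information registration means is complete, and the management means, on the basis of that notification, either starts a data processing means via the start management means or ends data collection via the end management means.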
【００２２】
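In the fourth configuration, once the data processing means have been started, the management means only has to wait for a processing-completion notification from one of them and no longer needs to poll the state of the data processing means periodically, which reduces the load on hardware resources (the CPU and the like).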
【００２３】
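A fifth configuration of the information collection system of the present invention is characterized in that, in any one of the first to fourth configurations, the data processing device has determination means that determines, on the basis of the header information acquired by the header information acquisition means, whether the content has changed, and the content acquisition means does not acquire the content of the accessed page when the determination means finds no change.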
【００２４】
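In the fifth configuration, the content of an accessed page is acquired only when the content is determined to have changed, so collection and processing of the various data of the pages constituting the site can finish quickly, and the load on hardware resources (CPU, network, and the like) is reduced.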
【００２５】
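A sixth configuration of the information collection system of the present invention is characterized in that, in the fifth configuration, the storage device has a storage area that stores the header information acquired by the header information acquisition means, and the determination means compares the header information acquired this time with the previously acquired header information stored in the storage device and determines that the content has changed when the two do not match.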
【００２６】
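In the sixth configuration, the content is determined to be unchanged when the two sets of header information match, so the processing by the content acquisition means and the link information acquisition means can be skipped, which reduces the load on hardware resources (CPU, network, and the like).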
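The comparison in the sixth configuration can be illustrated with a minimal Python sketch. The `HeaderInfo` fields and the rule that a page with no stored header counts as changed are assumptions made for illustration; the text only states that the header information acquired this time is compared with the previously stored header information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HeaderInfo:
    """Header information kept per page (field names are illustrative)."""
    content_type: str
    size: int
    last_modified: str


def content_changed(previous: Optional[HeaderInfo], current: HeaderInfo) -> bool:
    """Return True when the page content should be acquired again.

    Assumption: a page with no stored header (first visit) is treated
    as changed. Otherwise any mismatch between the two headers means
    "changed", and a full match means the content fetch can be skipped.
    """
    if previous is None:
        return True
    return previous != current
```

When `content_changed` returns False, a caller would skip both content acquisition and link extraction for that page, which is where the load reduction described above comes from.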
【００２７】
本発明の情報収集システムの第七の構成は、上記第一乃至第六のいずれか１の構成において、データ処理装置は、ヘッダ情報取得手段で取得したヘッダ情報に基づいて、当該ページが当該サイト内のページであるか否かについて判定するサイト判定手段を有し、コンテンツ取得手段は、サイト判定手段で当該サイト内のページではないと判定された場合には、アクセスしたページのコンテンツを取得しないことを特徴とする。
【００２８】
上記第七の構成においては、データ処理手段で収集、処理する情報の無限連鎖により処理が終了しない事態が確実に防止され、当該サイトを構成する複数のページにおける各種データの収集、処理の迅速な終了が担保されるとともに、ハードウェア資源（ＣＰＵやネットワーク等）の負荷が軽減される。
【００２９】
本発明の情報収集システムの第八の構成は、上記第一乃至第七のいずれか１の構成において、コンテンツ取得手段は、ページのコンテンツを解析するための解析プログラムを複数種類具備し、リンク情報取得手段は、当該ページのコンテンツの種類を示すコンテンツ種類情報が含まれたリンク情報を取得し、起動管理手段は、データ処理手段の起動にあたり、当該リンク情報を含めたデータ処理要求を出力し、起動したデータ処理手段におけるページアクセス手段のアクセスしたページに関して、コンテンツ取得手段は、当該データ処理要求に含まれたコンテンツ種類情報に対応する解析プログラムを用いてコンテンツを解析することを特徴とする。
【００３０】
上記第八の構成においては、多様な種類のコンテンツを取得することが可能となる。
【００３１】
本発明の情報収集システムの第九の構成は、上記第一乃至第八のいずれか１の構成において、ページアクセス手段は、ページにアクセスするための通信規約についてのプログラムを複数種類具備し、リンク情報取得手段は、コンテンツ取得手段で取得したコンテンツから、当該リンクページへのアクセス方法についての情報が含まれたリンク情報を取得し、起動管理手段は、データ処理手段の起動にあたり、当該リンク情報を含めたデータ処理要求を出力し、起動したデータ処理手段のページアクセス手段は、当該データ処理要求に含まれたアクセス方法に対応するプログラムを用いて当該リンクページにアクセスすることを特徴とする。
【００３２】
上記第九の構成においては、アクセスするページに対して各種アクセス方法を用いてアクセスすることが出来るので、種々のサイトに対するアクセスが可能となる。
【００３３】
本発明の情報収集システムの第十の構成は、上記第一乃至第九のいずれか１の構成において、記憶装置には、特定サイト情報として、当該一のサイトについての情報を欲する他のシステムについての連携システム名が記憶され、データ処理装置は、データ処理手段で収集、処理する各情報に関し、他のシステムに提供するための提供データを生成する提供データ生成手段と、生成した提供データを前記連携システム名に基づく他のシステムに送信する提供データ送信手段と、を備えたことを特徴とする。
【００３４】
上記第十の構成においては、提供データ生成手段で生成された提供データが連携システム名に基づく他のシステムに送信されるので、当該他のシステムに対して種々の情報提供のサービスを行うことが可能となる。
【００３５】
本発明の情報収集システムの第十一の構成は、上記第十の構成において、提供データ生成手段は、サイトに対するデータ収集の開始の際、当該サイト内の一のページについてのデータ処理手段による処理終了の際、及び、当該サイトに対するデータ収集の終了の際、の各々の時点で、それぞれ同一のデータフォーマットによる提供データを生成することを特徴とする。
【００３６】
上記第十一の構成においては、提供データ生成手段で生成された提供データが上記各時点で同一のフォーマットで他のシステムに送信されるので、当該他のシステムでの提供データの受信及び解析処理を統一することが可能となり、当該他のシステムの開発効率を高めることが可能となる。
【００３７】
本発明の情報収集システムの第十二の構成は、上記第十又は第十一の構成において、提供データ送信手段は、コンテンツ取得手段で取得したコンテンツを、連携システム名に基づく他のシステムに送信することを特徴とする。
【００３８】
上記第十二の構成においては、コンテンツ取得手段で取得したコンテンツについて、記憶装置に記憶する処理を省くことが可能となり、記憶装置の記憶容量の節約及び負荷軽減が図られる。
【００３９】
本発明の情報収集方法の主たる構成は、一のサイト内の一のページにそれぞれアクセスして、当該サイトを構成する複数のコンテンツについての各種データを収集し、処理するための複数のデータ処理手段と、データ処理手段を管理する管理手段と、を備えたデータ処理装置と、データ処理装置が最初にアクセスするためのサイトを示す情報が含まれた特定サイト情報と、当該サイト内の各コンテンツについての情報と、を少なくとも記憶するための記憶装置と、を用いた情報収集方法であって、データ処理装置のデータ処理手段は、予め記憶装置に記憶された特定サイト情報に基づくページ、及び、該ページにリンクされたリンクページにアクセスするページアクセス処理と、アクセスしたページのヘッダ情報を取得するヘッダ情報取得処理と、アクセスしたページのコンテンツを取得するためのコンテンツ取得処理と、アクセスしたページにリンクされたリンクページの場所を示すリンク情報を取得するリンク情報取得処理と、取得した各情報に基づく所定の情報を記憶装置に登録する情報登録処理と、を実行し、データ処理装置の管理手段は、記憶装置に記憶された情報及びリンク情報取得処理で取得したリンク情報に基づいて、データ処理手段をｎ（ｎは１以上）個起動させる起動処理と、リンク情報取得処理で取得したリンク情報に基づいて、当該サイトに対するデータ収集を終了させる終了処理と、を実行することを特徴とする。
【００４０】
上記構成の情報収集方法においては、記憶装置に記憶された特定サイト情報に基づいて、起動処理によりデータ処理手段の起動が行われると、ページアクセス処理により一のページへのアクセスが行われ、続いてヘッダ情報取得処理による当該ページのヘッダ情報の取得、コンテンツ取得処理による当該ページのコンテンツの取得、リンク情報取得処理によるリンク情報の取得、が順次行われるとともに、取得した各情報に基づく所定の情報が情報登録処理により記憶装置に登録され、また、取得したリンク情報に基づいて、起動処理により１又は複数個のデータ処理手段の起動が行われることで、上述の各処理が同時並行的に、かつ繰り返し行われ、また、終了処理により当該サイトに対するデータ収集が終了させられる。
【００４１】
従って、本発明の情報収集方法によれば、予め設定されたサイト内の各種データを、各ページのリンク（或いは階層）関係に基づいて、高速に収集することが可能となる。
【００４２】
本発明の情報収集プログラムは、コンピュータを、一のサイト内の一のページにそれぞれアクセスして、当該サイトを構成する複数のコンテンツについての各種データを収集し、処理するための複数のデータ処理手段と、データ処理手段を管理する管理手段と、を備えたデータ処理装置と、データ処理装置が最初にアクセスするためのサイトを示す情報が含まれた特定サイト情報と、当該サイト内の各コンテンツについての情報と、を少なくとも記憶するための記憶装置と、して機能させるための情報収集プログラムであって、データ処理装置のデータ処理手段を、特定サイト情報に基づくページ、及び、該ページにリンクされたリンクページにアクセスするためのページアクセス手段、アクセスしたページのヘッダ情報を取得するヘッダ情報取得手段、アクセスしたページのコンテンツを取得するためのコンテンツ取得手段、アクセスしたページにリンクされたリンクページの場所を示すリンク情報を取得するリンク情報取得手段、取得した各情報に基づく所定の情報を記憶装置に登録する情報登録手段、として機能させるとともに、データ処理装置の管理手段を、記憶装置に記憶された情報及びリンク情報取得手段で取得したリンク情報に基づいて、データ処理手段をｎ（ｎは１以上）個起動させる起動管理手段、リンク情報取得手段で取得したリンク情報に基づいて、当該サイトに対するデータ収集を終了させる終了管理手段、として機能させるための情報収集プログラム、を要旨とする。
【００４３】
本発明の情報収集プログラムによれば、記憶装置に記憶された特定サイト情報に基づいて、起動管理手段によりデータ処理手段の起動が行われると、ページアクセス手段により一のページへのアクセス処理が行われ、続いてヘッダ情報取得手段による当該ページのヘッダ情報の取得処理、コンテンツ取得手段による当該ページのコンテンツの取得処理、リンク情報取得手段によるリンク情報の取得処理、が順次行われるとともに、取得した各情報に基づく所定の情報が情報登録手段により記憶装置に登録され、また、取得したリンク情報に基づいて、起動管理手段により１又は複数個のデータ処理手段の起動が行われることで、上述の各処理が同時並行的に、かつ繰り返し行われ、また、終了管理手段により当該サイトに対するデータ収集が終了させられるように、コンピュータが機能する。
【００４４】
従って、本発明の情報収集プログラムによれば、予め設定されたサイト内の各種データを、各ページのリンク（或いは階層）関係に基づいて高速に収集するように、コンピュータが機能する。
【００４５】
【発明の実施の形態】
本発明の実施の形態を、図面を参照しながら詳細に説明する。
【００４６】
図１及び図２に、本発明を適用した情報収集・提供システムの一実施形態の概略図を示す。本実施の形態では、１台のコンピュータ（例えばパーソナルコンピュータ）にソフトウェアプログラムをインストールすることにより、該コンピュータを情報収集・提供システムとして機能させた例について説明する。
【００４７】
（情報収集・提供システム全体の概略）
実施の形態の情報収集・提供システム１は、インターネット等の各種通信ネットワーク（以下、単にネットワークという）を介して参照できる、階層（リンク）関係を有する情報群（サイト）に対して、該階層（リンク）関係に基づいてアクセスして、コンテンツ自体を含む各種情報を収集する機能と、収集した各種情報を他のコンピュータシステムに提供する機能とを備えたシステムであって、図１に示すように、ネットワーク１００経由で参照できる各種データの収集及び提供を行うデータ処理装置１０と、システム自体を管理する情報とデータ処理装置１０が収集した情報とを記憶する記憶装置２０と、を備えている。
【００４８】
ここで、データ処理装置１０の収集対象となる情報（データ）は、ネットワーク１００で公開されうる全てのデータであり、ｗｅｂ上のページ、コンテンツ、ファイル、等の種々の概念のものが含まれる。なお、後述のように、ローカルコンピュータ内のデータを収集対象とすることも可能である。
【００４９】
また、データ処理装置１０の収集対象となる情報（データ）の種類としては、文書データ，動画データ，データベースなどから動的に作成されるデータ（例えば、商品販売やネット予約等に関するデータなど）、ディレクトリツリー型ファイルシステムに記憶されている個々のファイル、ドメイン参加型ネットワークに属しているネットワーク機器そのもの、などの、各種のデータが含まれる。
【００５０】
文書データの「文書」とは、ＨＴＭＬで記述された文書やＷＷＷ公開文書、Ｍａｃｒｏｍｅｄｉａ（登録商標）Ｆｌａｓｈなどを含む、ネットワーク１００で公開されうる文書全てを指す。同様に、動画データの「動画」についても、ＭＭＳ（Ｍｉｃｒｏｓｏｆｔ（登録商標）ＭｅｄｉａＳｅｒｖｅｒ）やＲＴＳＰ（ＲｅａｌＴｉｍｅＳｔｒｅａｍｉｎｇＰｒｏｔｏｃｏｌ）などで公開されたストリーム配信されているデータ全てを指し、これには音声データなども含まれることは勿論である。
【００５１】
同様に、「動的に作成される」データについても、ＣＧＩやＰｅｒｌなどのコンピュータ言語を用いて、アクセスされる都度ページを作成するデータを含む、ネットワーク１００で公開されうる動的データ全てを指す。
【００５２】
同様に、「ディレクトリツリー型ファイルシステムの個々のファイル」についても、Ｍｉｃｒｏｓｏｆｔ（登録商標）社のＷｏｒｄ文書やＥｘｃｅｌ文書などを含む、ハードディスク装置等で記録されるファイル全てを指す。
【００５３】
同様に、「ドメイン参加型ネットワークに属しているネットワーク機器」についても、パーソナルコンピュータやサーバコンピュータなどを含む、ドメインに参加できるネットワーク機器全てを指す。
【００５４】
また、データ処理装置１０のアクセス対象となる「情報群（サイト）」とは、例えば、１企業のＷｅｂサイト全て，１部署のイントラサイト全て，１台のファイルサーバに格納されているファイル全て，など、種々の形態のものが含まれる。
【００５５】
また、情報群（サイト）につき「参照できる」とは、公開ファイル全てを複写（ダウンロード）できることは勿論、全ての複写は出来なくても、ファイル情報（後述のサイズや日時などのヘッダ情報）を取得出来れば良い。
【００５６】
さらに、情報群の「階層（リンク）関係」とは、サイト内での階層関係のみならず、他のサイトへのリンク関係も含み、ディレクトリ階層や、ＨＴＭＬのＵＲＬなど、あるデータが別のデータへ関係付けられているものを指す。
【００５７】
また、「ディレクトリツリー型ファイルシステム」としては、ＭＳ−ＤＯＳ（登録商標）やＷｉｎｄｏｗｓ（登録商標）などＭｉｃｒｏｓｏｆｔ（登録商標）社製のＯＳで利用されるＦＡＴや、ＮＴＦＳ，ＵＮＩＸ（登録商標）で利用されているＵＦＳやＳＳＦＳなど、ディレクトリ（またはフォルダ）と呼ばれるものを持ち、階層構造を持つファイルシステム全てを指す。
【００５８】
また、「ドメイン参加型ネットワーク」としては、インターネットドメイン，Ｍｉｃｒｏｓｏｆｔ（登録商標）社のワークグループやＡｃｔｉｖｅＤｉｒｅｃｔｏｒｙ，Ｎｏｖｅｌｌ（登録商標）社のｅＤｉｒｅｃｔｏｒｙなど、ネットワーク１００に接続された複数の機器を、ある単位で管理し、その管理グループがサブグループを持つような、ネットワーク全てを指す。
【００５９】
また、「ヘッダ情報」とは、主にＨＴＴＰにおいて本文の送信に先立って送信される各種データを指すが、ＨＴＴＰ以外でのプロトコルでも同様であり、本文（データ内容そのもの）以外の、名前，種別，サイズ，更新日時などの情報全てを指す。
【００６０】
また、「コンテンツ」とは、主にＷｅｂサーバで公開されているテキストや画像などの文書の内容を指すが、Ｗｅｂサーバ以外のネットワーク１００で公開されている文書の内容や、ローカルコンピュータ内の文書自体、また対象をネットワーク機器にした場合などは、そのネットワーク機器そのものの情報（稼動情報など）を指す。
【００６１】
そして、この情報収集・提供システム１は、ローカルまたはネットワーク１００を介して接続される他のシステムによってアクセスされるとともに、記憶装置２０の参照情報記憶部２２とデータ情報記憶部２３に格納されたデータを、当該他のシステムに対して、後述するコンテンツボール３０の形態で供給するようになっている。
【００６２】
ここで、情報収集・提供システム１にアクセスしてくる他のシステムとしては、例えば、ネットワーク１００の利用者に検索情報を提供する各種サーチエンジンのシステム，Ｗｅｂページや共有フォルダに格納されたファイルなどの新着情報の取得を目的とするコンピュータシステム，Ｗｅｂページのリンク切れ情報の取得を目的とするコンピュータシステム，等が挙げられる（以下、これらの他のシステムを「連携システム」と呼ぶ）。
【００６３】
一般には、連携システムがサーチエンジンシステムの場合には、該連携システムは、ダウンロードしたコンテンツ（後述する一時ファイル）と、データ情報記憶部２３又は後述するコンテンツボール３０と、を参照して、何という言葉が入ったページがどのＵＲＬにあるか、についての情報を取得することになる。
【００６４】
また、連携システムが新着情報の取得を目的とするコンピュータシステムの場合には、該連携システムは、データ情報記憶部２３又は後述のコンテンツボール３０を参照して、ある日付以降に更新されたデータの存在箇所（例えばＵＲＬ）を取得することになる。
【００６５】
さらには、連携システムがＷｅｂページのリンク切れ情報の取得を目的とするコンピュータシステムの場合には、該連携システムは、参照情報記憶部２２又はコンテンツボール３０を参照して、何のＷｅｂページがどのＷｅｂページにリンクされて、何のＷｅｂページが存在しないか、等の情報を取得することになる。
【００６６】
なお、これら他の連携システムに対して情報収集・提供システム１が行うデータ提供の処理の詳細に関しては、図１０乃至図１２で後述する。
【００６７】
図１に示すように、データ処理装置１０は、システム管理部１１、サイト管理部１２、参照情報処理部１３、データ処理部１４、参照情報処理管理部１５、データ処理管理部１６、の６つの機能に大別され、本実施の形態では、これら各部１１〜１６につき１個のＣＰＵが共通のハードウェア資源を使って所謂マルチタスク方式で行うようになっている。
【００６８】
また、図示しないが、情報収集・提供システム１は、ネットワーク１００に接続してデータの送受信を行う送受信手段としてのモデム等、データ入力や各種の設定を行う手段としてのキーボードやマウス等、及び、表示手段としてのディスプレイ装置（ＣＲＴ或いはＬＣＤなど）を有している。
【００６９】
本実施の形態では１台のコンピュータを情報収集・提供システム１として機能させる場合について説明するが、本発明は、ハードウェア資源の配置等については特に限定されないものであり、例えばデータ処理装置１０と記憶装置２０とを別のコンピュータとする等、複数のコンピュータを用いて情報収集・提供システム１を構成することも可能である。
【００７０】
（データ処理装置１０の概略）
データ処理装置１０は、設定されたサイトについての情報を提供するサーバにアクセスし、ネットワーク１００に公開されている文書を収集してデータ情報記憶部２３に記憶する。
【００７１】
この情報収集・提供システム１では、システム管理者等の使用者がキーボード等での入力操作を行うことにより、データ処理装置１０でアクセスするサイト（特定サイト）を複数設定することが可能となっており、図２に示すように、複数の参照情報処理部１３と複数のデータ処理管理部１６と複数のデータ処理部１４とを並列処理することで、複数の特定サイトの情報収集を実施する。
【００７２】
また、データ処理装置１０は、一の特定サイトの情報を提供するサーバがドメイン毎に複数存在する場合に対しても、透過的に複数サーバにアクセスし、同一のサイト情報として記憶装置２０のデータ情報記憶部２３に情報を記憶するようになっており、この処理については後述する。
【００７３】
なお、「透過的にアクセスする」とは、使用者から見ると複数サーバのそれぞれがあたかも同じプロトコルで情報を提供しているかのようにアクセスすることを言う。
【００７４】
図１及び図２に示すように、データ処理装置１０は、情報収集・提供システム１全体を統括するシステム管理部１１と、複数の特定サイトに対する情報収集処理を管理するサイト管理部１２と、一の特定サイトの情報収集についての処理を行う参照情報処理部１３及びデータ処理管理部１６と、一の特定サイト内の一のページの情報収集についての処理を行うデータ処理部１４と、各参照情報処理部１３を管理する参照情報処理管理部１５と、を備えている。
【００７５】
このデータ処理装置１０では、図２に示すように、参照情報処理部１３とデータ処理管理部１６とデータ処理部１４とが、それぞれ複数個併存可能とされる。具体的には、参照情報処理部１３とデータ処理管理部１６とは、それぞれ、同時にアクセスする特定サイトの数だけ存在し、データ処理部１４は、同時にアクセスする特定サイト内のコンテンツの数だけ存在することになる。本実施の形態では、参照情報処理部１３とデータ処理管理部１６はそれぞれ最大で２５４個（すなわち２５４の特定サイトに同時アクセス可能）、データ処理部１４は最大で２５４×１０個が併存し、互いに並列して処理を行うようになされている。
【００７６】
これにより、データ処理装置１０では、最大で２５４の特定サイトに同時アクセス可能で、各特定サイト内で最大１０のページに同時アクセス可能となっている。
【００７７】
なお、参照情報処理部１３及びデータ処理管理部１６とデータ処理部１４における最大同時アクセス上限数は特に限定されるものではなく、ＣＰＵの処理速度、メインメモリの容量、ネットワーク１００の速度に応じて適宜設定することが可能であるが、一般にはそれぞれ１０程度までであり、また、例えばＣＰＵがインテル社のＰｅｎｔｉｕｍ（登録商標）３− １．４ＧＨｚで、メインメモリ容量が２５６ＭＢで、回線速度が１Ｍｂｐｓの場合には、それぞれ、参照情報処理部１３の最大同時アクセス上限数（すなわち同時にアクセスする特定サイトの上限数）を５程度、データ処理部１４の最大同時アクセス上限数（すなわち一の特定サイト内で同時アクセスするページの上限数）を１０程度、に設定することが好ましい。
【００７８】
データ処理装置１０内の各部１１〜１６は、１個のＣＰＵによって実現可能であるが、それぞれ独立したソフトウェアプログラム（すなわち、システム管理プログラム，サイト管理プログラム，参照情報処理プログラム，データ処理プログラム，参照情報処理管理プログラム，データ処理管理プログラム）に基づいて動作するようになっている。
【００７９】
ここで、上述のように、参照情報処理部１３とデータ処理管理部１６とはそれぞれ同時にアクセスする特定サイトの数だけ存在するため、参照情報処理プログラムとデータ処理管理プログラムは、同時にアクセスする特定サイトの数（この実施形態では最大で２５４個）だけ存在する。
【００８０】
また、上述のように、データ処理部１４は同時にアクセスするページの数だけ存在するため、データ処理プログラムも、同時にアクセスするページの数（この実施の形態では最大で２５４×１０個）だけ存在する。
【００８１】
また、各部１１〜１６におけるこれら各プログラムは、それぞれが連携するプログラムへの通知（呼び覚まし）機能を持っている。すなわち、処理が必要ないプログラムや他の連携するプログラムの処理を待っているプログラムは、ＣＰＵを使わない待機状態となり、連携している他のプログラムからの通知により待機状態が解除されるようになっている。
【００８２】
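(Overview of the storage device 20)
The storage device 20, for its part, comprises a site information storage unit 21 that stores management information for managing the information collection and provision system 1 as a whole, a reference information storage unit 22 that stores link relationship information within a specific site, and a data information storage unit 23 that stores information about the specific site. These storage units 21 to 23 can be implemented with, for example, a single hard disk device of sufficient capacity.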
【００８３】
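As shown in FIG. 1, the site information storage unit 21 mainly holds information referenced by the site management unit 12; what it stores is chiefly specific site information, that is, various information about the specific sites to be accessed. As the data table in FIG. 9 illustrates, specific site information includes, for example, site information indicating the location (URL or the like) of the specific site, collection start information indicating the date and time at which collection starts, the cooperating system name of the cooperating system to which information about the specific site is provided (that is, the system that wants that information), the maximum simultaneous access limit for the specific site, and information about the access method used when accessing the specific site.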
【００８４】
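As shown in FIG. 9, the site information storage unit 21 assigns a serial number to each specific site to be accessed, and each serial number is associated with the "site" column of the reference information storage unit 22 and the data information storage unit 23 described later. Although FIG. 9 shows an example in which specific site information for two specific sites is registered in the site information storage unit 21, in practice a large number of specific sites can be registered.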
【００８５】
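Although not illustrated, the site information storage unit 21 also stores, in addition to the specific site information described above, management information for managing the system as a whole. This management information includes, for example, the maximum simultaneous start limit for the reference information processing units 13 described later (the upper limit on the number of specific sites accessed at the same time).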
【００８６】
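This information is recorded in the site information storage unit 21 by the system management unit 11 when the user operates the operation input unit to start the system management program of the data processing device 10, displays the data table of FIG. 9 on a display unit (not shown), and performs the input, setting and other operations for site registration.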
【００８７】
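The "site information" entered here is information indicating the site (location) that the data processing device 10 accesses first. When the specific site is on the Web, a URL (usually the URL of the top page) is used. When the specific site and the system 1 reside on the same computer and data is acquired locally, a directory name (for example c:\data\001.doc or /data/001.txt) is used. When the specific site is a directory-tree shared file system, an IP address and directory name (for example 192.168.0.1/共有フォルダ/001.txt, where 共有フォルダ means "shared folder") are used. When the specific site is a domain-participation network, the notation that designates the domain is used. For convenience, the following description assumes that the specific site is on the Web and that the top page URL is used as the site information.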
【００８８】
The "serial number" is assigned automatically when each site is registered.
【００８９】
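One or more "cooperating system names" can be registered for a single specific site. When several cooperating system names are registered for one specific site, the information collected and acquired by the data processing device 10 of the information collection and provision system 1 is provided to each registered cooperating system; the details are described later with reference to FIG. 10 onward.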
【００９０】
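The "maximum simultaneous access limit" is a numerical value that specifies the maximum number of data processing units 14 that may coexist for one specific site.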
【００９１】
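The "collection start date and time" is set freely, for example to "0:00 on the 1st of every month", "23:30 every Sunday", or "6:00 every day", according to circumstances such as how often the specific site is updated or the hours during which the cooperating system can operate.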
【００９２】
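The "access method" information consists of proxy information, authentication information such as simple authentication (Basic authentication) or form authentication (CGI authentication), and the like.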
【００９３】
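Here, a "proxy" is a server or software installed to ensure security and achieve fast access when connecting from an internal network (here, the information collection and provision system 1) to an external network (here, the specific site).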
【００９４】
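"Simple authentication (Basic authentication)" is a scheme that restricts access to particular data by means of a user name (ID) and password, and is mainly used to restrict access to directories and files on the Web; here, however, it refers to every scheme that restricts access with a user name (ID) and password.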
【００９５】
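"Form authentication (CGI authentication)" mainly refers to a system that uses HTTP client-side redirection to redirect unauthenticated requests to an HTML form; here, however, it refers to every scheme in which access-restricted data becomes accessible by accessing its location on the network 100 (a URL or the like) with parameters attached.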
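As a rough illustration, the specific site information described above could be modelled as the following Python record. This is a sketch only: the class and field names, the free-text schedule strings, and the cooperating system names used in the usage note are hypothetical, and the actual layout of the FIG. 9 data table is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class AccessMethod:
    """Access-method information; every field is optional."""
    proxy: Optional[str] = None                    # proxy information
    basic_auth: Optional[Tuple[str, str]] = None   # (user name, password)
    form_auth_params: Dict[str, str] = field(default_factory=dict)


@dataclass
class SpecificSiteInfo:
    serial: int                     # assigned automatically on registration
    site_info: str                  # e.g. the top page URL
    collection_start: str           # e.g. "6:00 every day"
    cooperating_systems: List[str]  # one or more names may be registered
    max_simultaneous_access: int    # max coexisting data processing units
    access_method: AccessMethod = field(default_factory=AccessMethod)


class SiteInfoStore:
    """Toy stand-in for the site information storage unit 21."""

    def __init__(self) -> None:
        self._sites: Dict[int, SpecificSiteInfo] = {}

    def register(self, site_info: str, collection_start: str,
                 cooperating_systems: List[str],
                 max_simultaneous_access: int,
                 access_method: Optional[AccessMethod] = None) -> SpecificSiteInfo:
        if max_simultaneous_access < 1:
            raise ValueError("at least one data processing unit must be allowed")
        serial = len(self._sites) + 1   # serial numbers start at 1
        record = SpecificSiteInfo(serial, site_info, collection_start,
                                  list(cooperating_systems),
                                  max_simultaneous_access,
                                  access_method or AccessMethod())
        self._sites[serial] = record
        return record

    def get(self, serial: int) -> SpecificSiteInfo:
        return self._sites[serial]
```

Registering `http://abcd.co.jp/001.html` first gives it serial number 1, matching the example entry referred to in the description; entries in the other two storage units would then refer back to that serial number through their "site" column.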
【００９６】
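The various information registered in the site information storage unit 21 is referenced as needed by the units of the data processing device 10 when the device collects information from a specific site; the details are described later.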
【００９７】
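As shown in FIG. 1, the reference information storage unit 22 mainly holds information referenced by the reference information processing unit 13. As the data table in FIG. 9 illustrates, the stored reference information includes, for example, "site" information, "link information", "parameter" information, and "processing state" information, all of which are recorded and updated by the data processing unit 14 during information collection.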
【００９８】
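The "site" information (a numerical value) associates each entry with the specific site information in the site information storage unit 21 and with the "site" information in the data information storage unit 23 described later. FIG. 9 shows, as the contents of the reference information storage unit 22, only the entries associated with serial number 1 of the site information storage unit 21 (that is, the specific site "http://abcd.co.jp/001.html").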
【００９９】
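The "link information" indicates which content (data) is to be collected next, where that content (data) is described, and how it is described. In this embodiment, as shown in FIG. 9, the link information comprises the "link source", "link destination", "line number", and "tag name" for the specific site.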
【０１００】
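The link source is the location on the network 100 of the page in which the link information is described, and is stored as the "serial number" value of the corresponding entry in the data information storage unit 23.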
【０１０１】
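The link destination is the location on the network 100 of the content (data) or page linked from the link source page by the link information, and is likewise stored as the "serial number" value of the corresponding entry in the data information storage unit 23.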
【０１０２】
The line number indicates on which line of the link source content (data) the link information is described, and is stored as a numerical line number.
【０１０３】
The tag name indicates how the link is made; for HTML, the HTML tag name is stored (for example, HREF of an A tag or SRC of an IMG tag).
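For HTML content, the link destination, line number, and tag name could be gathered along the following lines. This is a minimal sketch using Python's standard `html.parser`; the `LinkInfo` record, the restriction to A HREF and IMG SRC, and storing the destination as a URL string (rather than as a serial number in the data information storage unit 23) are simplifications assumed for illustration.

```python
from html.parser import HTMLParser
from typing import List, NamedTuple


class LinkInfo(NamedTuple):
    destination: str   # link destination as written in the page
    line: int          # 1-based line on which the tag starts
    tag_name: str      # e.g. "A HREF" or "IMG SRC"


class LinkExtractor(HTMLParser):
    # Tags whose attribute points at another page or content item.
    _LINK_ATTRS = {"a": "href", "img": "src"}

    def __init__(self) -> None:
        super().__init__()
        self.links: List[LinkInfo] = []

    def handle_starttag(self, tag, attrs):
        wanted = self._LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                line, _offset = self.getpos()
                self.links.append(
                    LinkInfo(value, line, f"{tag.upper()} {name.upper()}"))


def extract_links(html: str) -> List[LinkInfo]:
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

A registration step would then map each destination to a serial number and write one link information row per `LinkInfo`, with the current page's serial number as the link source.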
【０１０４】
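The "parameter" information is the parameter that the data processing device 10 uses when accessing a page within the specific site (the content (data) of the page indicated by the link destination of the link information). For HTML, for example, the name anchor (URL fragment identifier) is stored; for dynamic programs such as CGI, the program's arguments are stored.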
【０１０５】
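The "processing state" information indicates whether the data processing device 10 has completed all information collection processing for one specific site; it is shown, for example, by a "processed" (処理済) flag when everything is complete and by an "unprocessed" (未処理) flag when it is not.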
【０１０６】
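As shown in FIG. 9, the storage area of the reference information storage unit 22 is divided into an area for new reference information and an area for previous reference information. The new reference information area is where results are written while the data processing device 10 is acquiring data, and the previous reference information area preserves the results of the previous data acquisition.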
【０１０７】
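The data information storage unit 23 mainly holds the information that the data processing unit 14 writes and compares as the result of acquiring information from a specific site. As the data table in FIG. 9 illustrates, the stored information includes, for example, "site", "serial number", "location on the network", "number of levels", "type", "size", "update date and time", and "collection state".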
【０１０８】
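The "site" information associates each entry with the specific site information in the site information storage unit 21 and with the site information in the reference information storage unit 22. Likewise, FIG. 9 shows, as the contents of the data information storage unit 23, only the entries associated with serial number 1 of the site information storage unit 21 (that is, the specific site "http://abcd.co.jp/001.html").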
【０１０９】
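The "serial number" information (a numerical value) in the data information storage unit 23 is a sequence number over the page present in the specific site (usually the top page) and the link pages linked from it; in this example, as shown in FIG. 9, it corresponds to the link source and link destination information in the reference information storage unit 22. In the data information storage unit 23, the page of the specific site registered in the site information storage unit 21 receives "serial number" 1, and the pages linked from that page are assigned "serial numbers" 2, 3, 4, ... in turn.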
【０１１０】
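The "location on the network" holds information equivalent to the "site information" of the site information storage unit 21 described above; in this embodiment it stores the location of the specific site and of the link pages linked to it directly or indirectly. The example in FIG. 9 shows the locations on the network 100 (URLs and the like) of the top page of the specific site and of the pages linked from that top page.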
【０１１１】
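Here, "linked directly or indirectly to the specific site" covers not only link pages linked directly from the specific site (top page or the like) registered in the site information storage unit 21, but also link pages that are not linked directly yet can ultimately be reached by starting from the specific site. However, to prevent information collection from never finishing because of an endless chain of sites to access, the information collection and provision system 1 restricts the link pages it accesses to a predetermined range; the configuration for this is described later with reference to FIG. 7 and other figures.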
【０１１２】
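"Number of levels", "type", "size", and "update date and time" respectively store the number of levels, type, size, and update date and time of the data collected by the data processing device 10 (each page visited).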
【０１１３】
The number of levels indicates how many links away the page is from the collection start data (in FIG. 9, the top page "http://abcd.co.jp/001.html").
【０１１４】
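The "collection state" information indicates whether the data processing device 10 has completed all information collection processing for a page in the specific site; it is shown, for example, by an "OK" flag when complete and by a "未" (not yet) flag when not. In this embodiment, when the collection state of every page of a site becomes "OK", the processing state information for that site in the reference information storage unit 22 described above changes from "unprocessed" (未処理) to "processed" (処理済).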
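The roll-up from per-page collection state to per-site processing state can be written as a one-line rule. The sketch below uses the flag values quoted in the description ("OK", "未", "未処理", "処理済"); treating a site with no page entries yet as 未処理 is an assumption.

```python
from typing import Iterable

PROCESSED = "処理済"      # site-level flag: all pages collected
UNPROCESSED = "未処理"    # site-level flag: collection still incomplete


def site_processing_state(page_collection_states: Iterable[str]) -> str:
    """Derive the site's processing state from its pages' collection states."""
    states = list(page_collection_states)
    # Assumption: a site with no page entries yet is still unprocessed.
    if states and all(state == "OK" for state in states):
        return PROCESSED
    return UNPROCESSED
```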
【０１１５】
These items in the data information storage unit 23 are recorded and updated by the data processing unit 14 during information collection.
【０１１６】
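As shown in FIG. 9, the storage area of the data information storage unit 23 is divided into an area for new data information and an area for previous data information. The new data information area is where results are written while the data processing device 10 is acquiring data, and the previous data information area preserves the results of the previous data acquisition.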
【０１１７】
また、本実施の形態では、各ページにおけるデータの本文すなわちコンテンツ自体については、記憶装置２０に記憶せずに、他の連携システムに一時ファイルとして提供するようになっているが、この処理については後述する。
【０１１８】
（データ処理装置１０における各部の機能の概要）
次に、データ処理装置１０の各部１１〜１６の有する機能について説明する。
【０１１９】
システム管理部１１は、情報収集・提供システム１全体を管理するためのシステム管理プログラムに基づいて動作することで、以下のような機能を発揮する。すなわち、システム管理部１１は、主電源投入時の起動処理、使用者の入力操作に基づいて行われる、サイト情報記憶部２１の各情報（アクセスする特定サイトの、サイト情報（ＵＲＬなど），収集開始日時，連携システム名，最大同時アクセス上限数，特定サイトへのアクセス方法、参照情報処理部１３の最大同時起動上限数、など）の設定処理、使用者の入力操作に基づく起動・終了処理、特定サイトへの情報収集の開始・停止、等について管理する機能を担っている。
【０１２０】
なお、システム管理部１１は、主として使用者の入力操作に基づいて動作するものであり、一旦起動してしまえば、使用者が入力操作を行うまでは待機状態となる。また、使用者の入力操作に当たっては、システム管理部１１は、不図示の表示部の表示画面に入力項目等を表示する処理や、入力された情報に基づいて、サイト情報記憶部２１内のデータについての、追加、変更、削除、等の処理を行う。
【０１２１】
サイト管理部１２は、収集する複数のサイトを管理するためのサイト管理プログラムに基づいて、サイト情報記憶部２１から特定サイトのサイト情報（ＵＲＬなど）を参照する処理、情報収集開始の際に現在日時とサイト情報記憶部２１の収集開始日時とを比較して、収集開始日時が到来すると、参照情報処理管理部１５を介して参照情報処理部１３を起動し、情報収集を開始する特定サイトのサイト情報（ＵＲＬなど）を各参照情報処理部１３に通知する処理を行う。
【０１２２】
なお、サイト情報記憶部２１に登録された特定サイトの内の複数が同じ収集開始日時に設定された場合には、サイト管理部１２は、該収集開始日時が到来すると、複数の参照情報処理部１３を起動する処理を行うことになる。
【０１２３】
参照情報処理部１３は、収集する一の特定サイトに関する全データを管理するための参照情報処理プログラムに基づいて動作することで、以下のような機能を発揮する。すなわち、参照情報処理部１３は、一の特定サイトに関する参照情報記憶部２２とデータ情報記憶部２３内のデータを初期化する初期化機能や各データを整合させる整合機能、参照情報記憶部２２から次に収集するデータについてのリンク情報を抽出してデータ処理管理部１６にデータ処理要求として通知する機能、一の特定サイト内の全データを収集したかどうかを判断する機能を担っている。
【０１２４】
データ処理部１４は、収集対象となる一のページ内の各種データ（ヘッダ情報やコンテンツ、など）の取得・解析等を実施するためのデータ処理プログラムに基づいて動作することで、当該一のページ内のデータを取得・解析する機能、及び、当該一のページ内のデータについての各種情報を記憶装置２０の参照情報記憶部２２とデータ情報記憶部２３に登録する機能を担っている。
【０１２５】
なお、本実施の形態では、データ処理部１４における特定サイト内のデータの取得機能については、アクセス対象をＷｅｂページとしているため、一のデータ処理部１４が一のＷｅｂページだけを取得対象としており、例えば１０個のＷｅｂページのデータを取得するには１０個のデータ処理部１４が動作することになる。これに対して、アクセス対象をファイルとする場合にも、同様に、一のデータ処理部１４が一のファイルだけを取得対象とすることになる。
【０１２６】
データ処理部１４は、このように一の特定サイトに対して複数（ｎ個）併存するが、個々のデータ処理部１４（１４ａ，１４ｂ，・・・１４ｎ）には、図４に示すように、それぞれ、複数種類のプロトコル（通信手順）と、複数のデータ形式解析プログラムが実装されており、これにより種々のデータに対する取得及び解析が可能となっている。
【０１２７】
すなわち、データ処理部１４の各機能に関しては、図４に示すように、データ取得部１４１とデータ解析部１４２とデータ登録部１４３とに大別されることになるが、これらの詳細については後述する。
【０１２８】
参照情報処理管理部１５は、複数の参照情報処理部１３を管理するための参照情報処理管理プログラムに基づいて動作することで、以下のような機能を発揮する。すなわち、参照情報処理管理部１５は、サイト管理部１２からのサイト処理要求を受信する機能、受信したサイト処理要求に基づいて１又は複数の参照情報処理部１３を起動する機能、参照情報処理部１３の起動後に、各特定サイトについて情報収集中か否かを判断する機能、を担っている（図２参照）。
【０１２９】
データ処理管理部１６は、複数のデータ処理部１４を管理するためのデータ処理管理プログラムに基づいて動作することで、以下のような機能を発揮する。すなわち、データ処理管理部１６は、参照情報処理部１３からのデータ処理要求を受信する機能、及び、受信したデータ処理要求に基づいて、記憶装置２０のサイト情報記憶部２１に登録した最大同時アクセス上限数（図９参照）内で、データ処理部１４を起動する機能を担っている。
【０１３０】
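Next, mainly with reference to FIG. 3, the operation of the data processing device 10 is described with regard to the notification (wake-up) function between the data processing management unit 16 and the data processing units 14 (14a, 14b). FIG. 3 shows the relationship between the data processing management unit 16 and the data processing units 14 in comparison with a conventional system.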
【０１３１】
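The following description assumes access to a specific site whose maximum simultaneous access limit in the site information storage unit 21 is set to "2", that is, a case in which at most two data processing units 14 may access the specific site at once.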
【０１３２】
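First, in accordance with the configured maximum of 2, the data processing management unit 16 outputs start commands to start two data processing units (data processing units 14a and 14b in FIG. 3), and sends each started unit 14a, 14b a data processing request to access the data (content, file, or the like) of a different, not-yet-collected page.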
【０１３３】
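In detail, in the information collection and provision system 1, a start command and data processing request are first issued to access the top page URL in the site information of the site information storage unit 21 (for example "http://abcd.co.jp/001.html" in FIG. 9), which starts data processing unit 14a. When the started data processing unit 14a acquires the information of a link destination page set in that top page (for example "http://abcd.co.jp/002.html"), a start command and data processing request are issued to access the URL of that link destination page, which starts data processing unit 14b. Data processing unit 14b in turn acquires the information of a further link destination page set in its page (for example "http://abcd.co.jp/003.html"), but because the maximum number of data processing units 14 that may access the specific site is set to "2", the next data processing unit 14c is started only after data acquisition by unit 14a or 14b has finished and one of them has closed.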
【０１３４】
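In the data processing device 10, before the started data processing units 14a, 14b access the data of their pages, the reference information processing unit 13 updates the processing state information in the reference information storage unit 22 so that the processing state of the data handled by each unit 14a, 14b is set to "processed" (処理済) in advance, and the data processing management unit 16 notifies each unit 14a, 14b individually of the location on the network (URL or the like), access method, and so on of the page to access.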
【０１３５】
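The processing state information in the reference information storage unit 22 is set to "processed" (処理済) in advance so that the reference information processing unit 13 does not issue a second data processing request for the same page while a data processing unit 14 is still processing it. In other words, this processing state information serves as a flag indicating whether the reference information processing unit 13 has already issued a data processing request to a data processing unit 14 (whether the request has been made).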
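The "mark before dispatch" rule can be sketched as follows. The dictionary standing in for the reference information storage unit 22 and the function name `claim_next_page` are hypothetical; the flag values are the ones quoted in the description.

```python
from typing import Dict, Optional


def claim_next_page(processing_state: Dict[str, str]) -> Optional[str]:
    """Pick the next page to request and flag it before dispatching.

    The flag is flipped to 処理済 *before* a data processing unit starts
    working, so a later call can never hand out the same page again,
    even while the first request is still in progress.
    """
    for location, state in processing_state.items():
        if state == "未処理":
            processing_state[location] = "処理済"   # request issued
            return location
    return None   # nothing left to request
```

In a design where several reference information processing units shared one table, this check-and-set would have to be atomic (for example, guarded by a lock); in the description each specific site has its own reference information processing unit 13, so a single caller per site is assumed here.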
【０１３６】
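As a result, in the data processing device 10, the started data processing units 14a, 14b each access the data of a different page within the same site (see FIG. 2) and begin collecting information immediately, while the data processing management unit 16 enters a waiting state once it has confirmed that units 14a, 14b have started collecting. In the data processing device 10, the CPU's hardware resources are then used more effectively by the amount freed while the data processing management unit 16 waits, so units 14a, 14b can run as fast as possible and information published on the network 100 can be collected at high speed.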
【０１３７】
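Next, for the page it accessed, each data processing unit 14a, 14b stores the information other than the content itself (for example, the size of the content and its update date and time) in the data information storage unit 23 of the storage device 20, and stores the reference information (link information on which page links to which page or content, parameters for accessing the page, and so on) in the reference information storage unit 22 of the storage device 20. The content itself is provided by each unit 14a, 14b to another cooperating system (for example, a search engine system) in the form of a temporary file and is not stored in the storage device 20. If the cooperating system ignores the temporary file, the provided temporary file is deleted automatically. This processing prevents the data stored in the storage device 20 from growing excessively and saves storage capacity.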
【０１３８】
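When the processing described later for its accessed page is complete, each data processing unit 14a, 14b sends a processing-completion notification to the data processing management unit 16.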
【０１３９】
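The example in FIG. 3 shows the case in which the content of the page accessed by data processing unit 14a was smaller, so unit 14a finished collecting first and sent its processing-completion notification to the data processing management unit 16.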
【０１４０】
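On receiving the processing-completion notification, the data processing management unit 16 leaves the waiting state, as shown in FIG. 3, and outputs a start command and data processing request to the data processing units 14 so as to start data processing unit 14c, which collects information from a further unprocessed page in the same site.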
【０１４１】
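As a result, as shown in FIG. 3, data processing unit 14c starts, and data processing unit 14b, which is still processing, and the newly started unit 14c collect information in parallel from different pages of the same site, while the data processing management unit 16 returns to the waiting state once it has confirmed that units 14b, 14c are processing in parallel.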
【０１４２】
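Thereafter, in the same way, whenever any data processing unit finishes, it passes a processing-completion notification to the data processing management unit 16, and a new data processing unit is started on a command from the data processing management unit 16 to collect information from another unprocessed page. Repeating this keeps the time during which only one data processing unit exists as short as possible and greatly improves the processing capacity of the system as a whole.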
【０１４３】
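In a conventional robot-type search engine system, as shown in the operation of the conventional system on the left of FIG. 3, a robot management unit (corresponding to the data processing management unit 16) periodically detected and judged the operating state of the robots 1, 2, 3, ... (corresponding to the data processing units 14a, 14b, 14c, ...) that collect site information. Consequently, when a robot finished just after a detection, time was wasted until the next detection; starting a new robot took time because the judgment had to be made after detection; and shortening the detection period (cycle) made processing heavier for the system as a whole.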
【０１４４】
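In the information collection and provision system 1 of the embodiment, by contrast, the data processing program and the data processing management program are configured so that, instead of the data processing management unit 16 detecting and judging the operating state of the data processing units 14 (14a, 14b, 14c, ...), a data processing unit 14 that has finished sends the data processing management unit 16 a signal reporting completion of its own accord, and the data processing management unit 16, on receiving the signal, immediately issues the command to start the next data processing unit. While the data processing units 14 are processing (collecting information), the only thing the data processing management unit 16 has to do is wait for a processing-completion notification, so the CPU's capacity can be devoted entirely to the other programs that need it. Compared with a conventional system that sleeps and wakes periodically, processing capacity improves greatly and information published on the network 100 can be collected faster.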
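The difference from periodic polling can be made concrete with a small Python sketch in which the manager blocks on a queue until a worker reports completion. This is an illustration under stated assumptions, not the patented implementation: threads stand in for data processing units 14, `fetch_links` is a hypothetical callback that processes one page and returns the link destinations found in it, the `seen` set replaces the storage-based bookkeeping, and a failed page is simply treated as having no links.

```python
import queue
import threading
from typing import Callable, Iterable, List


def crawl_site(start_url: str,
               fetch_links: Callable[[str], Iterable[str]],
               max_workers: int) -> List[str]:
    """Collect every reachable page, running at most max_workers at once."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")

    done = queue.Queue()      # carries processing-completion notifications
    pending = [start_url]     # pages requested but not yet started
    seen = {start_url}
    finished: List[str] = []
    running = 0

    def worker(url: str) -> None:
        try:
            links = list(fetch_links(url))
        except Exception:
            links = []        # assumption: a failed page yields no links
        done.put((url, links))   # notify the manager of our own accord

    while pending or running:
        # Start new workers up to the maximum simultaneous access limit.
        while pending and running < max_workers:
            url = pending.pop(0)
            threading.Thread(target=worker, args=(url,), daemon=True).start()
            running += 1
        # Block until some worker reports completion; no status polling.
        url, links = done.get()
        running -= 1
        finished.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return finished
```

Because `done.get()` blocks, the manager consumes no CPU time while workers are busy and reacts as soon as a notification arrives, which is the behaviour contrasted with the conventional polling loop above.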
【０１４５】
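The information collection and provision system 1 collects data from a group of information (the pages of a specific site) that has hierarchical (link) relationships but mixes content of mutually different formats, such as HTML, Macromedia (registered trademark) Flash, and directory-tree file systems, in such a way that the result forms a single uniform file system. This data collection processing is described below with reference to FIG. 4.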
【０１４６】
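As shown in FIG. 4, the data processing unit 14 comprises data acquisition units 141 (141a, 141b, 141c, ..., 141n) for transparently accessing servers over a plurality of different protocols; data analysis units 142 (142a, 142b, 142c, ..., 142n) for transparently analyzing data of a plurality of different formats, that is, analyzing it so that it appears to the user as data of the same format; and a data registration unit 143 that is connected to the data analysis units 142 and registers data in the storage device 20 in a unified data format.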
【０１４７】
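To allow access to sites that require mutually different access methods (protocols), such as HTTP, MMS (Microsoft (registered trademark) Media Server), RTSP (Real Time Streaming Protocol), SMB (Server Message Block), and WebDAV (Web-based Distributed Authoring and Versioning), the data acquisition units 141 implement programs for these access methods (protocols, communication conventions). As shown in FIG. 4, they are assigned so that, for example, data acquisition unit 141a is the access program that uses HTTP, 141b the one that uses MMS, 141c the one that uses RTSP, and so on.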
【０１４８】
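When the data processing management unit 16 sends a content acquisition command to the data acquisition unit 141, the data acquisition unit 141 selects the matching protocol from the content information, accesses the site with that protocol, acquires the commanded content, creates (most of) the content ball 30 described later, and saves the acquired data as a temporary file for analysis by the data analysis unit 142.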
【０１４９】
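In this embodiment the temporary file is saved on the hard disk of the cooperating system (not shown) in order to save storage capacity in the storage device 20; however, when the storage device 20 has capacity to spare, for example, a temporary file storage area may be provided in the data information storage unit 23 or elsewhere so that the file is saved within the storage device 20.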
【０１５０】
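The data analysis units 142 implement programs for analyzing content of different formats, such as HTML, Macromedia (registered trademark) Flash, SMIL (Synchronized Multimedia Integration Language), PDF (Portable Document Format) from Adobe (registered trademark) (http://www.adobe.com/), Windows Media (registered trademark), and Microsoft (registered trademark) Word and Excel. As shown in FIG. 4, they are assigned so that, for example, data analysis unit 142a is the HTML analysis program, 142b the Macromedia (registered trademark) Flash analysis program, 142c the SMIL analysis program, 142d the PDF analysis program, and so on.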
【０１５１】
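The data analysis unit 142 selects the content analysis program (142a, 142b, 142c, ..., 142n) that matches the temporary file acquired by the data acquisition unit 141, runs it to acquire the content and the link information within the content, converts what it acquired into the unified data format (filling in the missing parts of the content ball 30 described later, and producing link information corresponding to the reference information storage unit 22), and transfers the analyzed and converted data (the content ball 30 and the link information) to the data registration unit 143. This processing takes place in RAM (not shown), the CPU's working area. The data analysis unit 142 saves the acquired content itself as a temporary file on the hard disk (not shown) of the cooperating system.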
【０１５２】
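On receiving the analyzed and converted content ball 30 and link information from the data analysis unit 142, the data registration unit 143 stores the site information and page information from the content ball 30 in the data information storage unit 23 of the storage device 20, and stores the link information in the reference information storage unit 22 of the storage device 20.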
【０１５３】
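Because the data acquisition unit 141, data analysis unit 142, and data registration unit 143 in the data processing unit 14 each have a defined interface in this way, information can be collected even for combinations of different kinds, for example when PDF content is made available within an HTTP site (in this case by using data acquisition unit 141a together with data analysis unit 142d).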
【０１５４】
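Moreover, according to this embodiment, even when the protocols or data formats used for data acquisition increase, it suffices to add the corresponding data acquisition unit 141 and data analysis unit 142, so the system can cope without changes to the other programs.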
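One way to picture this plug-in structure is a pair of lookup tables, one keyed by protocol and one by content type, with a fixed interface between them. The sketch below is illustrative only: the registry names, the function signatures, and the use of the URL scheme and a MIME-style type string as keys are assumptions, and no real protocol client or format parser is included.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urlsplit

# Interface assumed for illustration:
#   acquirer: location -> (content type, raw bytes)
#   analyzer: raw bytes -> dict of analysis results in a unified shape
Acquirer = Callable[[str], Tuple[str, bytes]]
Analyzer = Callable[[bytes], dict]

ACQUIRERS: Dict[str, Acquirer] = {}   # keyed by protocol, e.g. "http", "rtsp"
ANALYZERS: Dict[str, Analyzer] = {}   # keyed by type, e.g. "text/html"


def register_acquirer(scheme: str, func: Acquirer) -> None:
    ACQUIRERS[scheme] = func


def register_analyzer(content_type: str, func: Analyzer) -> None:
    ANALYZERS[content_type] = func


def process_page(location: str) -> dict:
    """Acquire and analyze one page with whichever plug-ins match."""
    scheme = urlsplit(location).scheme
    content_type, raw = ACQUIRERS[scheme](location)   # KeyError if unsupported
    record = ANALYZERS[content_type](raw)
    # Fields common to every format are filled in here, once.
    record.update(location=location, type=content_type, size=len(raw))
    return record
```

Supporting a new protocol or format then means one more `register_acquirer` or `register_analyzer` call, with `process_page` left untouched, which mirrors the extensibility claimed for the data acquisition units 141 and data analysis units 142.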
【０１５５】
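Next, the overall operation of the information collection and provision system 1 of this embodiment is described in detail with reference to FIGS. 5 to 8.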
【０１５６】
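FIG. 5 is a flowchart outlining the overall operation of the information collection and provision system 1 when collecting information from specific sites, and mainly serves to explain the operation up to the start of collection. FIG. 6 is a flowchart outlining how information is acquired from one specific site, and shows the routine that branches from step S7 of FIG. 5. FIG. 7 is a flowchart detailing how information is acquired from one page within one specific site, and shows the routine that branches from step S75 of FIG. 6. FIG. 8 illustrates an example of information acquisition from one specific site in relation to the information stored in the storage device 20 of the information collection and provision system 1.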
【０１５７】
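In the information collection and provision system 1, information acquisition for one or more specific sites is started by the site management unit 12 and the reference information processing management unit 15; this processing is described below with reference to the flowchart of FIG. 5.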
【０１５８】
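To acquire information from specific sites, the site management unit 12 of the data processing device 10 first obtains the current time from the CPU's internal clock (step S1), and then obtains the URL or the like and the information collection start time of each specific site registered under "site information" and "collection start date and time" (see FIG. 9) in the site information storage unit 21 (step S2).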
【０１５９】
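Next, the site management unit 12 determines whether any information collection start time matches the current time (step S3). If No, that is, nothing matches, it waits for a fixed time (step S5) and returns to step S1. If Yes, that is, some specific site matches, it sends the reference information processing management unit 15 a site processing request containing the site information (URL or the like) of that specific site in order to start collecting information for it (step S4).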
【０１６０】
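As a result, in the data processing device 10, the reference information processing management unit 15 is woken and the site management unit 12 enters the waiting state (step S5).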
【０１６１】
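On receiving the site processing request, the reference information processing management unit 15 checks whether the specific site at the URL obtained by the site management unit 12 (that is, the site for which collection is about to start) is already being crawled, in other words whether a reference information processing unit 13 in charge of that specific site already exists (step S6). If No, that is, none exists, processing moves to step S7. If Yes, that is, one already exists, the unit concludes that there is no need to collect from the site again, discards the site processing request, waits for a fixed time (step S5), returns to step S1, and waits until it again receives a site processing request from the site management unit 12.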
【０１６２】
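In step S7, the reference information processing management unit 15 outputs a start command to start one reference information processing unit 13 and notifies the started unit of the site's URL, thereby starting information collection and information provision to the cooperating systems, and records in the CPU's RAM that the specific site is being processed.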
【０１６３】
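If step S3 finds several specific sites whose information collection start time matches the current time and step S6 finds that none of them is being crawled, then in step S7 the reference information processing management unit 15 outputs start commands for several reference information processing units, and as many reference information processing units 13 as there are such specific sites are started.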
【０１６４】
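The information collection and provision to cooperating systems in step S7 is carried out mainly by the reference information processing unit 13, the data processing management unit 16, and the data processing units 14 on the basis of the information in the site information storage unit 21 of the storage device 20, with various data being recorded and updated in the reference information storage unit 22 and data information storage unit 23 of the storage device 20 along the way; the outline and details are described later with reference to FIG. 6 onward.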
【０１６５】
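When the information collection and provision to cooperating systems in step S7 ends, the reference information processing unit 13 notifies the reference information processing management unit 15 (step S81 of FIG. 6, described later). Because the reference information processing management unit 15 and the reference information processing unit 13 exchange notifications in this way when a reference information processing unit 13 starts and ends (that is, when information collection starts and ends), the reference information processing management unit 15 can always keep data in the CPU's RAM showing which sites are currently being processed and can manage the reference information processing units 13 accordingly.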
【０１６６】
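When the information collection of step S7 ends, the site management unit 12 and the reference information processing management unit 15 determine whether the system management unit 11 has issued, on the basis of a user operation, an instruction to shut down the whole system (step S8). If No, that is, there is no such instruction, they wait for a fixed time (step S5) and return to step S1; if Yes, the processing of the entire information collection and provision system 1 ends.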
【０１６７】
なお、ステップＳ５における一定時間の待機は、常に動作してＣＰＵに負担をかけることを防ぐためのものであり、通常は１分間の待機を行なうようになっている。
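The check performed in steps S1 to S6 of Fig. 5 can be sketched roughly as follows. This is only an illustration: the names `SiteEntry` and `due_sites` are hypothetical, and comparing the times at one-minute granularity is an assumption suggested by the one-minute wait in step S5, not something the description states.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SiteEntry:
    """One row of the site information storage unit 21 (hypothetical shape)."""
    url: str
    start_time: datetime  # the "collection start date/time" column


def due_sites(entries, now, in_progress):
    """Return URLs whose start time matches `now` (step S3) and which are
    not already being crawled (step S6).

    Assumption: times are compared at one-minute granularity.
    """
    now_minute = now.replace(second=0, microsecond=0)
    due = []
    for entry in entries:
        start_minute = entry.start_time.replace(second=0, microsecond=0)
        if start_minute == now_minute and entry.url not in in_progress:
            due.append(entry.url)
    return due
```

A caller would invoke `due_sites` once per wake-up, launch one reference information processing unit per returned URL (step S7), and then sleep again (step S5).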
【０１６８】
このように、実施の形態の情報収集・提供システム１によれば、データ処理装置１０は、記憶装置２０のサイト情報記憶部２１に予め設定、記憶した情報収集開始時刻に基づいて管理され、指定された時刻が到来すれば（ステップＳ３でＹｅｓ）、指定されたネットワーク上のサイトにアクセスして情報収集及び連携システムへの情報提供の処理を行う構成となっており、従来のロボット型エンジンのシステムのように常に巡回し続けるのではなく、特定サイトへの巡回が定期的に行われ、収集した情報が記憶装置２０に登録されることになるので、ＣＰＵとネットワーク１００の使用量を抑えることができるようになり、ＣＰＵやネットワーク１００の負荷が低減する、という効果が得られる。
【０１６９】
また、この情報収集・提供システム１によれば、サイト管理部１２は、定期的に記憶装置２０から情報収集開始時刻を取得して（ステップＳ５，ステップＳ２）、現在時刻との比較を行い（ステップＳ３）、一致した場合には、参照情報処理管理部１５に、対象となる特定サイトのサイト情報（ＵＲＬ等）を含むサイト処理要求を通知する（ステップＳ４）構成となっているので、常に動作する従来のサーチエンジンのシステムと比較して、ＣＰＵとネットワーク１００の負荷が低減され、かつ、自動的に情報収集するため管理の手間が軽減する、という効果が得られる。
【０１７０】
(Overview of the information collection process)
次に、図６のフローチャートを参照して、図５のステップＳ７から派生したルーチン、すなわち、参照情報処理部１３の起動後に行われる情報収集の実行処理の概要について説明する。
【０１７１】
この情報収集・提供システム１においては、データ処理装置１０の参照情報処理部１３と、データ処理管理部１６と、データ処理部１４とで、特定サイト内のデータ（ページ）についての多数の情報を取得する処理を行うようになっており、以下は図６を参照して説明する。
【０１７２】
まず、ステップＳ７１において、データ処理装置１０は、上述の参照情報処理管理部１５からの起動命令に基づいて起動した参照情報処理部１３が、参照情報処理管理部１５から転送された特定サイトのＵＲＬ等（すなわちサイト情報記憶部２１のサイト情報）を取得する。なお、ステップＳ７１では、データ処理装置１０は、参照情報処理部１３が当該特定サイトの処理を開始する旨についてのコンテンツボールを作成して連携システムに送信する処理を行った後にステップＳ７２に移行することになるが、この処理については図１１及び図１２で後述する。
【０１７３】
次のステップＳ７２で、データ処理装置１０では、参照情報処理部１３がデータ処理管理部１６に対してデータ処理要求を送信することで、一のデータ処理管理部１６が起動する（図２参照）。このデータ処理要求には、特定サイトのＵＲＬ等（サイト情報記憶部２１のサイト情報）及び当該特定サイトについての処理を開始すべき命令が含まれる。
【０１７４】
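In the next step S73, the data processing management unit 16 of the data processing device 10 acquires, from the site information storage unit 21, the maximum simultaneous access limit for the specific site, based on the site information (URL, etc.) obtained in the previous step. Because it is the data processing management unit 16, not the reference information processing unit 13, that actually launches the data processing units 14, the data processing management unit 16 is the actor in step S73.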
【０１７５】
ここで、データ処理部１４が大量に起動するとＣＰＵや他のサーバ等に負荷をかけることになるため、データ処理管理部１６は、このような負荷を回避するために、特定サイトの処理にあたり、最大同時アクセス上限数に基づいて、起動するデータ処理部１４の数を以下のように管理する。
【０１７６】
次のステップＳ７４で、データ処理装置１０は、データ処理管理部１６が、当該特定サイトに現在アクセスしているデータ処理部１４の数をチェックして、アクセスしているデータ処理部１４が該特定サイトにおける最大同時アクセス上限数（図９参照）に達しているか否かについて判定し、Ｎｏすなわち達していないとの判定の場合にはステップＳ７５の処理を経てステップＳ７６に移行し、Ｙｅｓすなわち達しているとの判定の場合にはステップＳ７６に移行する。通常、特定サイトへのデータ処理部１４の最初の起動時には、このステップＳ７４でＮｏの判定が出ることになる。
【０１７７】
ステップＳ７５で、データ処理装置１０は、データ処理管理部１６によってデータ処理部１４を起動させて、データ処理部１４で特定サイトのデータを取得する処理を開始して、ステップＳ７６に移行する。具体的には、ステップＳ７５では、データ処理管理部１６からデータ処理部１４に対して、データ処理部を起動するように起動命令が出力されるとともに、参照情報処理部１３から出されたデータ処理要求が、データ処理管理部１６から、起動した一又は複数のデータ処理部１４に対して引き渡される。
【０１７８】
より詳細には、特定サイトの処理開始時の最初のステップＳ７５では、一のデータ処理部を起動するように起動命令が出力されるとともに、データ処理部１４に引き渡されるデータ処理要求には、記憶装置２０におけるサイト情報記憶部２１のサイト情報（この例ではトップページＵＲＬ）が含まれる。
【０１７９】
一方、特定サイトの処理の進行後の２度目以降のステップＳ７５では、当該特定サイトにおけるリンクの数及び最大同時アクセス上限数に応じて、一又は複数のデータ処理部を起動するように起動命令が出力されるとともに、データ処理部１４に引き渡されるデータ処理要求（後述するステップＳ８０）には、記憶装置２０におけるデータ情報記憶部２３の新データ情報記憶領域内に記憶した「ネットワーク上の場所」の情報（図９参照）が含まれる。
【０１８０】
このようなステップＳ７５の処理により、起動及びデータ処理要求を受信した各データ処理部１４は、特定サイト内のいずれか一のページにアクセスすることになる。
【０１８１】
なお、このステップＳ７５で起動したデータ処理部１４が行う処理の詳細については、図７の派生ルーチン（ステップＳ７５０１乃至ステップＳ７５１１）で表されるが、これについては後述する。
【０１８２】
ステップＳ７６で、データ処理管理部１６は、後述するデータ処理部１４からの終了通知を受信するまで（図３参照）、又は、次に参照情報処理部１３からのデータ処理要求を受信するまで待機する待機状態となり、これらのいずれかを受信するとステップＳ７４に戻る。
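The loop of steps S74 to S76 amounts to an event-driven limiter: requests are held, units are started only while the number of running units is below the maximum simultaneous access limit, and the check is repeated whenever a new request or a completion notice arrives. A minimal sketch follows; the class and method names are hypothetical, and real data processing units are replaced by simple bookkeeping.

```python
from collections import deque


class DataProcessingManager:
    """Illustrative stand-in for the data processing management unit 16."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent  # limit read in step S73
        self.pending = deque()                # requests held while waiting
        self.active = set()                   # URLs currently being processed

    def _start_units(self):
        """Steps S74/S75: start units while the limit is not reached."""
        started = []
        while self.pending and len(self.active) < self.max_concurrent:
            url = self.pending.popleft()
            self.active.add(url)
            started.append(url)
        return started

    def on_request(self, url):
        """A data processing request arrived (step S72 or S80)."""
        self.pending.append(url)
        return self._start_units()

    def on_finished(self, url):
        """A data processing unit sent its completion notice."""
        self.active.discard(url)
        return self._start_units()
```

Between calls the manager does nothing, which corresponds to the waiting state of step S76.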
【０１８３】
すなわち、例えば図３に示すデータ処理管理部１６の２回目の処理状態中の期間には、参照情報処理部１３からのデータ処理要求及びデータ処理部１４からの終了通知を受信する処理と、ステップＳ７４及びステップＳ７５の処理を行っていることになる。
【０１８４】
データ処理部１４の起動後のステップＳ７７で、データ処理装置１０は、起動した各データ処理部１４（１４ａ〜１４ｎ）の中で、処理（すなわち情報収集処理と連携システムへの情報提供処理）が終了したものがあるか否かを、参照情報処理部１３で判定して、処理が終了するものが現れるまでステップＳ７７で待機し、Ｙｅｓすなわち終了しているものありと判定した場合には、ステップＳ７８に移行する。
【０１８５】
なお、ステップＳ７７のこの判定は、データ処理管理部１６に送信したデータ処理要求のプール数（すなわち起動待ちの数）の値に基づいて行う。このプール数は、ステップＳ７２及び後述するステップＳ８０で送信したデータ処理要求の合計数からステップＳ７５で起動したデータ処理部１４の合計数を減算することで求められ、プール数の範囲としては０〜最大同時アクセス上限数の間を推移することになる。
【０１８６】
詳細には、参照情報処理部１３は、ステップＳ７２及び後述するステップＳ８０で、プール数の値に１を加えて記憶するとともに、ステップＳ７５でデータ処理管理部１６からデータ処理部１４にｎ個の起動命令が出された際に、記憶した数値からｎを減算して、減算値が記憶値より少なくなったときに、起動していたデータ処理部１４のいずれかが終了したものと判定する。これは、データ処理部１４（例えば１４ａ）の処理が終了すると、データ処理管理部１６が次のデータ処理部１４（例えば１４ｂ）を起動することからである。
【０１８７】
なお、いずれかのデータ処理部１４の情報収集処理が終了すると、記憶装置２０の参照情報記憶部２２及びデータ情報記憶部２３には、処理したデータについての情報が登録されている（図７のステップＳ７５０７又はステップＳ７５１０）状態となっているが、これについては後述する。
【０１８８】
ステップＳ７７でＹｅｓと判定された後のステップＳ７８で、データ処理装置１０は、参照情報処理部１３が、記憶装置２０の参照情報記憶部２２における新参照情報の「処理状態」欄をチェックして、処理状態のフラグが「未」であるリンク先の情報に基づいて、データ情報記憶部２３の新データ情報の連番を検索することで、当該特定サイトに関係する未処理（未収集）のデータの「ネットワーク上の場所」の情報を、データ情報記憶部２３の新データ情報の記憶領域から取得して、ステップＳ７９に移行する。ここで、当該特定サイトに関係する未処理（未収集）のデータとは、例えば、当該特定サイト内のページでデータ処理部１４によるアクセスがまだなされていないページについての情報、或いは、データ処理部１４がアクセスしたページからのリンクが張られた当該特定サイト外のページについてのリンク情報、などである。
【０１８９】
ステップＳ７９で、データ処理装置１０は、取得した未処理（未収集）の情報から、当該特定サイトに関係する全ての参照情報のデータ（ページ）を収集したか否かを参照情報処理部１３で判定して、Ｎｏすなわち当該特定サイトに関係する未処理（未収集）のデータがまだ存在するとの判定の場合には、ステップＳ８０に移行し、一方、Ｙｅｓすなわち当該特定サイトに関係する全ての参照情報のデータを収集し終え、未処理（未収集）の情報が無くなった、と判定した場合には、ステップＳ８１に移行する。なお、ステップＳ７９において、参照情報処理部１３は、記憶装置２０における参照情報記憶部２２の処理状態が全て「処理済」になっており（すなわち当該特定サイト内の全てのリンクページ等をデータ処理部１４でアクセスした状態であること）、また、上述したデータ処理要求のプール数がゼロの状態であり（すなわちデータ処理部１４の起動待ちが存在しないこと）、さらには、起動中のデータ処理部１４が存在しないこと、を条件に、Ｙｅｓの判定を行う。
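The three conditions for a Yes in step S79 can be written as a single predicate. The sketch below is illustrative only; the function name, the argument names, and the string "done" used for the processed flag are assumptions.

```python
def site_collection_finished(processing_states, pooled_requests, running_units):
    """Step S79: Yes only if every reference entry is processed, no data
    processing request is waiting to start, and no unit is still running."""
    all_processed = all(state == "done" for state in processing_states)
    return all_processed and pooled_requests == 0 and running_units == 0
```

Checking the pool count and the running units as well as the flags matters because step S80 sets the flag before the page has actually been fetched.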
【０１９０】
ステップＳ８０で、データ処理装置１０は、ステップＳ７８で取得した「ネットワーク上の場所」の情報を含めたデータ処理要求を参照情報処理部１３からデータ処理管理部１６に送信する。この際に、参照情報処理部１３は、当該未処理（未収集）のデータについての参照情報記憶部２２の処理状態欄に予め「済」のフラグを立てておくようにする。
【０１９１】
かくして、ステップＳ８０によるデータ処理要求を受信したデータ処理管理部１６は、上述したステップＳ７６の待機状態から脱してステップＳ７４に移行して、ステップＳ７４〜ステップＳ７６の処理及び待機を繰り返し行う。一方、参照情報処理部１３は、ステップＳ７９で未処理（未収集）の情報が無くなったと判定されるまでは、ステップＳ７７〜ステップＳ８０の処理を繰り返し行う。このように、各処理が繰り返し行われることで、データ処理装置１０では、一の特定サイトを構成する全てのページについてのデータの収集及び連携システムへの提供が行われることになる。
【０１９２】
なお、本実施形態では、特定サイト内のページ等にリンクされた特定サイト外のページ等については、そのデータを当該特定サイトに関するデータとして記憶装置２０に記憶するが、情報取得処理が終了しない事態を防止するため、データ処理部１４によるアクセスは行うが、コンテンツ等の取得、解析等は行わないようになっている。
【０１９３】
すなわち、図９の特定サイト「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／００１．ｈｔｍｌ」の例で言えば、例えば該特定サイト内のあるページ「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／００２．ｈｔｍｌ」に、全く別のサイトのページ「ｒｔｓｐ：／／ｈｉｊｋ．ｃｏ．ｊｐ／００１．ｈｔｍｌ」がリンクされていた場合でも、このぺージのデータを、特定サイト「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／００１．ｈｔｍｌ」にリンクされたデータとして記憶装置２０の参照情報記憶部２２及びデータ情報記憶部２３に記憶するが、データ処理部１４では、「ｒｔｓｐ：／／ｈｉｊｋ．ｃｏ．ｊｐ／００１．ｈｔｍｌ」にアクセスはするが、該ページのコンテンツを取得することはなく、データ処理部１４により特定サイト内の全てのページ等に対する情報取得が終了した時点で、一の特定サイトに関する情報収集が終了した（ステップＳ７９でＹｅｓ）と、参照情報処理部１３により判定される。
【０１９４】
情報収集・提供システム１では、上述のような一連の動作が繰り返し行われることにより、最初に取得したデータから次にリンクされているデータを取得し、更に次にリンクされているデータを取得するという、特定サイトのいわば階層関係に基づいてデータを収集することで、特定サイト内のデータ全て、さらには当該特定サイトのページ等にリンクされた特定サイト外のデータの一部（すなわちヘッダ情報）を取得し、さらには取得した各種データを連携システムに提供することが可能になる。
【０１９５】
ステップＳ８１で、データ処理装置１０は、参照情報処理部１３が参照情報処理管理部１５に終了通知を行なうことで、ステップＳ７の派生ルーチン処理を終了する。ステップＳ８１の終了通知が行なわれた後には、参照情報処理部１３及びデータ処理管理部１６は、処理を終了して、次に収集開始時刻（図９参照）が来て参照情報処理管理部１５から起動命令が出力される（図５のステップＳ７）までは、或いは、参照情報処理部１３からのデータ処理要求が出される（図６のステップＳ７２）までは、待機状態となる。
【０１９６】
なお、ステップＳ８１では、データ処理装置１０は、参照情報処理部１３が当該特定サイトのＵＲＬの処理を完了した旨についてのコンテンツボールを作成して連携システムに送信する処理を行った後にステップＳ８に移行することになるが、この処理については図１１及び図１２で後述する。
【０１９７】
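As described above, in the data processing device 10 of this embodiment, the data processing management unit 16 acquires the maximum simultaneous access limit for the site from the site information storage unit 21 when it starts (step S73), temporarily holds the data processing requests received from the reference information processing unit 13 (steps S72 and S80), and checks the number of data processing units 14 currently accessing the site (step S74). If the limit has been reached, it waits for a data processing unit 14 to finish or for another data processing request from the reference information processing unit 13 (step S76; see Fig. 3). If the limit has not been reached, it launches a data processing unit 14 (step S75). The data processing management unit 16 thus controls the number of data processing units 14 running while one specific site is being collected, so that more than the predetermined number of data processing units 14 never run at once, which reduces the load on the CPU and the network 100.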
【０１９８】
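Furthermore, in this data processing device 10, a plurality of data processing units 14 collect data from one specific site simultaneously, and while they are collecting, the data processing management unit 16 and the reference information processing unit 13 remain in a waiting state that does not use the CPU (Fig. 3, steps S76 and S77). In addition, as described above, the reference information processing unit 13, the data processing management unit 16, and the like do not periodically poll whether a data processing unit 14 has finished; instead, each data processing unit 14 that finishes individually sends a completion signal to the data processing management unit 16. This eliminates wasted operations and idle time, and makes it possible to collect the information of a specific site published on the network 100 at high speed while reducing the load on the CPU and the network 100.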
【０１９９】
(Details of the information collection process)
次に、図７のフローチャートを参照して、図６のステップＳ７５の派生ルーチン、すなわち特定サイト内の一のページのデータについて、一のデータ処理部１４が行う情報収集の処理の詳細について説明する。
【０２００】
この情報収集・提供システム１において、データ処理部１４は、その構成要素であるデータ取得部１４１と、データ解析部１４２と、データ登録部１４３とで、特定サイト内に存在する個々の情報を取得する処理を行なうようになっており、以下は図７を参照して説明する。
【０２０１】
データ処理部１４のデータ取得部１４１は、上述のデータ処理管理部１６からのデータ処理要求を受信し（ステップＳ７５０１）、このデータ処理要求に含まれるＵＲＬのプロトコル情報から、（図４の１４１ａ〜ｎで）対応可能か否かを判定し（ステップＳ７５０２）、Ｎｏすなわち対応不可能の場合はデータ処理管理部１６に終了を通知（ステップＳ７５１２）して終了し、Ｙｅｓすなわち対応可能の場合は、記憶装置２０のサイト情報記憶部２１から、該サイトへのアクセス方法についての情報（図９参照）を取得する（ステップＳ７５０３）。
【０２０２】
ステップＳ７５０３で取得するアクセス方法についての情報の具体例としては、上述のように、簡易認証（Ｂａｓｉｃ認証）情報や、フォーム認証（ＣＧＩ＝ＣｏｍｍｏｎＧａｔｅｗａｙＩｎｔｅｒｆａｃｅ認証）情報や、プロキシ情報などが含まれる。
【０２０３】
ここで、簡易認証情報を取得した場合には、データ取得部１４１は、アクセスするＵＲＬが簡易認証のＵＲＬと一致すれば、簡易認証情報としてのＩＤとパスワードを当該特定サイトのＵＲＬ（ページ）へのアクセス時に用いることになる。
【０２０４】
また、フォーム認証情報を取得した場合には、データ取得部１４１は、アクセスするＵＲＬがフォーム認証のＵＲＬと一致すれば、フォーム認証のパラメータをアクセス時に用いることになる。
【０２０５】
When proxy information has been acquired, the data acquisition unit 141 accesses the URL (page) of the specific site via the proxy.
【０２０６】
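Having acquired the information on how to access the site, the data acquisition unit 141 of the data processing unit 14, in the next step S7504, accesses the URL using HTTP if the protocol information of the URL is HTTP (that is, using the data acquisition unit 141a in Fig. 4) and otherwise using the protocol indicated, acquires the header information of the URL (information other than the body, such as name, type, size, and update date/time), and proceeds to step S7505. Here, the protocol information of a URL refers to the leading part of the URL: HTTP is used if the URL begins with "http:", and RTSP is used if it begins with "rtsp:".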
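Selecting an acquisition unit from the leading part of the URL (steps S7502 and S7504) could look like the sketch below. Only the HTTP case (unit 141a) is stated in the description; the table `ACQUIRERS`, the label for RTSP, and the function name are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from URL scheme to an acquisition unit label.
ACQUIRERS = {
    "http": "141a (HTTP)",
    "rtsp": "RTSP acquisition unit (label assumed)",
}


def select_acquirer(url):
    """Return the unit that can handle `url`, or None if the protocol is
    unsupported (the No branch of step S7502)."""
    scheme = urlsplit(url).scheme.lower()
    return ACQUIRERS.get(scheme)
```

Returning `None` corresponds to notifying the data processing management unit 16 of termination in step S7512 without accessing the page.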
【０２０７】
ステップＳ７５０５で、データ処理部１４のデータ取得部１４１は、ヘッダ情報を取得できたか否かについて判定し、Ｎｏすなわち取得できなかったと判定した場合にはデータ処理管理部１６に終了（この場合はページ処理異常：「ＥＲＲ」による終了）を通知して（ステップＳ７５１２）処理を終え、一方、Ｙｅｓすなわち取得できたと判定した場合には、本文（コンテンツ）の情報を取得する必要があるか否かを判断すべく、ステップＳ７５０６に移行する。
【０２０８】
ステップＳ７５０６で、データ処理部１４のデータ取得部１４１は、データ情報記憶部２３の前データ情報の記憶領域を参照して、今回取得したヘッダ情報が前回に登録したヘッダ情報と一致するか否かについて判定し、Ｙｅｓすなわち前回取得したヘッダ情報と一致するとの判定の場合には、名前，種別，サイズ，更新日時、等が同じであるため、本文（コンテンツ）についても変更なしとみなしてステップＳ７５０７に移行し、Ｎｏすなわち一致しないとの判定の場合には、本文（コンテンツ）についても変更されたもの（「ＵＰＤＡＴＥ」）若しくは新規（「ＮＥＷ」）に追加されたものであるとみなして、ステップＳ７５０８に移行する。
【０２０９】
なお、特定サイトについての初めてのアクセスの場合には、当該特定サイトに関してはデータ情報記憶部２３の前データ情報及び新データ情報のいずれの記憶領域にもヘッダ情報が登録されていないため、ステップＳ７５０６ではＮｏの判定が出ることになる。
【０２１０】
ステップＳ７５０７で、データ処理部１４は、データ取得部１４１からデータ登録部１４３にヘッダ情報を転送し、データ登録部１４３がこのヘッダ情報をデータ情報記憶部２３の新データ情報の記憶領域に記憶するとともに、参照情報記憶部２２の前参照情報の記憶領域に登録されたデータの内容をそのまま新参照情報の記憶領域に複写してステップＳ７５１２に移行し、ステップＳ７５１２でデータ処理管理部１６に終了（この場合はページ処理済：「ＩＮＦＯ」の更新されていないデータであることを示す「ＮＯＮＥ」）を通知して処理を終わる。
【０２１１】
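In step S7508, on the other hand, the data processing unit 14 determines whether the header information acquired in step S7504 belongs to an external site. If Yes (external site), it judges that the content need not be acquired and proceeds to step S7511. If No (not an external site), it judges that the content must be acquired and proceeds to step S7509.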
【０２１２】
ここで、ステップＳ７５０８における外部サイトの情報であるか否かの判定は、サイト情報記憶部２１の「サイト情報」欄に登録された情報に基づいて行われ、具体的には、登録された情報がｗｅｂサイトならば、当該特定サイトのドメイン名（図９連番１の例では「ａｂｃｄ．ｃｏ．ｊｐ」）が同一であるか否かが基準となる。なお、アクセス対象がディレクトリツリー型ファイルシステムの場合には、サイト情報記憶部２１の「サイト情報」欄にはディレクトリ名が登録されることになるので、ステップＳ７５０８では、当該特定サイトのディレクトリ名が同一であるか否かが基準となる。
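For a Web site, the external-site test of step S7508 reduces to comparing domain names. A minimal sketch, assuming the site information holds a full URL and that an exact host-name match is the criterion (the function name is hypothetical):

```python
from urllib.parse import urlsplit


def is_external(site_url, page_url):
    """Step S7508 for Web sites: a page is external when its host name
    differs from that of the registered specific site."""
    return urlsplit(site_url).hostname != urlsplit(page_url).hostname
```

For a directory-tree file system the same idea would apply with a directory-name comparison instead of a host-name comparison.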
【０２１３】
ステップＳ７５０９で、データ処理部１４は、データ取得部１４１で実際にコンテンツを取得し、その内容を一時ファイル（この例では連携システムのハードディスク）に保存して、ステップＳ７５１０に移行する。
【０２１４】
ステップＳ７５１０で、データ処理部１４は、一時ファイルに保存したコンテンツをデータ解析部１４２で解析し、必要な情報を抽出する処理を行う。具体的には、データ解析部１４２は、コンテンツがＨＴＭＬであればＨＴＭＬで（すなわち図４のデータ解析部１４２ａを用いて）、そうでなければその解析法を用いて一時ファイルにアクセスして、コンテンツを解析し、解析した内容をデータ登録部１４３にＲＡＭを用いて転送し、ステップＳ７５１１に移行する。
【０２１５】
なお、ステップＳ７５１０における「必要な情報」とは、連携システムに送信する後述のコンテンツボール３０（図１２参照）を構成する情報と、参照情報記憶部２２に記憶すべきリンク情報を構成する情報（この実施形態では、行数，タグ名）である。
【０２１６】
そして、ＲＡＭでコンテンツの解析内容を取得したデータ処理部１４のデータ登録部１４３は、ステップＳ７５１１で、この解析内容から、ネットワーク上の場所，階層数，及びヘッダ情報を構成するサイズ，更新日時、などの情報をデータ情報記憶部２３の新データ情報の記憶領域に登録するとともに、リンク情報、パラメータなどを参照情報記憶部２２の新参照情報の記憶領域に登録してステップＳ７５１２に移行し、ステップＳ７５１２でデータ処理管理部１６に終了通知（この場合はページ処理済：「ＩＮＦＯ」の新規のデータであることを示す「ＮＥＷ」または更新されたデータであることを示す「ＵＰＤＡＴＥ」）を通知して、処理を終える。
【０２１７】
一方、ステップＳ７５０８でＹｅｓすなわち外部サイトの情報であると判定された後のステップＳ７５１１では、データ登録部１４３は、この場合にはコンテンツの解析内容が存在しないので、ヘッダ情報を構成するサイズ，更新日時、などの情報をデータ情報記憶部２３の新データ情報の記憶領域に登録してステップＳ７５１２に移行し、ステップＳ７５１２でデータ処理管理部１６に終了通知（この場合はページ処理済：「ＩＮＦＯ」の新規のデータであることを示す「ＮＥＷ」または更新されたデータであることを示す「ＵＰＤＡＴＥ」）を通知して、処理を終える。
【０２１８】
このように、実施の形態の情報収集・提供システム１によれば、データ処理部１４は、アクセスしたページのコンテンツ等の概要を示すヘッダ情報を取得して（ステップＳ７５０４）、前回に取得したヘッダ情報との比較を行い（ステップＳ７５０６）、前回と異なるヘッダ情報の場合には、外部サイトでないこと（ステップＳ７５０８でＮｏ）を条件に該コンテンツの取得（ステップＳ７５０９）及び解析（ステップＳ７５１０）を行って記憶装置２０に記憶し、一方、前回と同一のヘッダ情報の場合には該コンテンツの取得及び解析を行なわずに、以前に記憶したヘッダ情報とそのコンテンツに含まれるリンク情報を利用して記憶装置２０に記憶する（ステップＳ７５０７）構成としたので、ＣＰＵとネットワーク１００の使用量を抑えることができるようになり、ＣＰＵやネットワーク１００の負荷が低減する、という効果が得られる。
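The branching among steps S7506, S7508, and S7509 to S7511 can be summarized as a small decision function. This is a sketch under assumptions: the `Header` fields follow the items listed in the description (name, type, size, update date/time), while the function name and the returned labels are hypothetical.

```python
from typing import NamedTuple, Optional


class Header(NamedTuple):
    name: str
    kind: str      # "type" in the description
    size: int
    updated: str   # update date/time


def decide(previous: Optional[Header], current: Header, external: bool) -> str:
    """Choose what to do with a page once its header has been acquired."""
    if previous is not None and previous == current:
        return "copy_previous"      # S7507: unchanged, page state NONE
    if external:
        return "register_header"    # S7511 only: no content is fetched
    return "fetch_and_analyze"      # S7509 to S7511: NEW or UPDATE
```

Only the last branch transfers the page body, which is where the savings in CPU and network usage come from.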
【０２１９】
すなわち、実施の形態の情報収集・提供システム１によれば、データ処理装置１０は、処理に時間及び負荷のかかるコンテンツの解析に先立って、当該コンテンツの固有情報であるヘッダ情報を取得する（ステップＳ７５０４）とともに、前回取得したヘッダ情報と比較（ステップＳ７５０６）しながら情報を収集管理して、記憶装置２０に登録し（ステップＳ７５０７，ステップＳ７５０８〜７５１１）、前回の記憶装置２０に登録されている内容よりも新しく更新された情報のみを対象にコンテンツを収集し（ステップＳ７５０９）、外部サイトのコンテンツや、前回の記憶装置２０に登録されている内容から更新されていないコンテンツについては取得（解析）しない（ステップＳ７５１１，ステップＳ７５０７）ので、一の特定サイト内の情報を高速かつ低負荷で取得することができるようになり、ひいては複数の特定サイトを同時並行的に情報収集する場合のＣＰＵ及びネットワーク１００の負担を著しく減少させ、各情報収集の処理を迅速に終了させることが可能となる。
【０２２０】
さらには、実施の形態の情報収集・提供システム１によれば、データ処理装置１０のデータ処理部１４は、特定サイトにつき収集を行うページのサイト情報（ＵＲＬ等）を受信し（ステップＳ７５０１）、アクセス方法の情報（簡易認証（Ｂａｓｉｃ認証）情報、フォーム認証（ＣＧＩ＝ＣｏｍｍｏｎＧａｔｅｗａｙＩｎｔｅｒｆａｃｅ認証）情報、プロキシ情報、等）を記憶装置２０のサイト情報記憶部２１から取得する（ステップＳ７５０３）とともに、取得した該情報に基づいてＵＲＬにアクセスしてそのヘッダ情報を取得する（ステップＳ７５０４）ので、使用者は、システム管理プログラムを起動させてサイト情報記憶部２１へ登録する情報の設定を一度行えば良く、各特定サイト毎に異なる複数のアクセス方法についての情報を毎回入力する必要がなくなるので、管理の手間が軽減する。
【０２２１】
次に、図８を参照して、上述した情報収集処理の実行にあたり、参照情報処理部１３とデータ処理部１４とが記憶装置２０の参照情報記憶部２２とデータ情報記憶部２３とに対して行う処理等について説明する。
【０２２２】
図８に概略的に示すように、情報収集処理の実行にあたっては、参照情報処理部１３及びデータ処理部１４は、記憶装置２０の参照情報記憶部２２とデータ情報記憶部２３を参照、更新しながら行うことになる。
【０２２３】
上述のように、参照情報記憶部２２とデータ情報記憶部２３は、それぞれ、データ処理装置１０による処理結果の記憶領域が二重化された領域を備えており、一方が前回の情報取得処理の結果を保存する領域とされ、他方が次に情報取得処理を実行したときに新たに作成する新規の情報の記憶領域とされる。
【０２２４】
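In the data processing device 10, the data processing unit 14 accesses, via the network 100, the URL indicated by the reference information processing unit 13, acquires the header information of that URL (step S7504 in Fig. 7) and registers it in the "new data information storage area" of the data information storage unit 23, and compares it with the header information of the corresponding page in the "previous data information storage area" of the data information storage unit 23 (step S7506 in Fig. 7).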
【０２２５】
なお、ヘッダ情報の比較に関しては、ここでは説明の便宜のため、更新日時を比較した例を挙げているが、他にも、種別、サイズなど、データ情報記憶部２３に記憶している内容を同様に比較することになる。
【０２２６】
ここで、データ情報記憶部２３の各記憶領域のヘッダ情報を比較した結果、両者が一致すれば（ステップＳ７５０６でＹｅｓ）、データ処理部１４は、参照情報記憶部２２の「前参照情報の記憶領域」に記憶しているＵＲＬの情報を、参照情報記憶部２２の「新参照情報の記憶領域」に、「未処理」として複写する（ステップＳ７５０７）。比較の結果、一致しない、またはデータ情報記憶部２３の「前データ情報の記憶領域」に存在しないのであれば（ステップＳ７５０６でＮｏ）、データ処理部１４は、外部サイトでないこと（ステップＳ７５０８でＮｏ）を条件にコンテンツを取得して、当該コンテンツを一時ファイルとして記憶し（ステップＳ７５０９）、コンテンツ内部に含まれるリンクＵＲＬ情報を抽出して（ステップＳ７５１０）、参照情報記憶部２２の「新参照情報の記憶領域」に「未処理」として登録する（ステップＳ７５１１）。
【０２２７】
このような処理を行うことで、前回取得してから変更のないＵＲＬのコンテンツを取得する時間と転送量を省き、ＣＰＵとネットワーク１００の負荷を軽減しながらも高速に情報を取得することが可能となる。
【０２２８】
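Meanwhile, during data collection, the reference information processing unit 13 keeps acquiring, from the new data information storage area of the data information storage unit 23, the "location on the network" information (see Fig. 9) for entries whose processing state in the new reference information storage area of the reference information storage unit 22 is "unprocessed" (step S78 in Fig. 6). If an "unprocessed" entry exists (No in step S79), it sends a data processing request to the data processing management unit 16 in order to launch a data processing unit 14, and changes the processing state in the reference information storage unit 22 to "processed" (step S80).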
【０２２９】
図８には、ネットワーク１００上のサーバ等の装置に特定サイトを含む３つのページ等（特定サイトとしてのデータ１、及びデータ１にリンクされたデータ２とデータ３）が存在し、前回（例えば１回目）の情報収集時にはデータ１からデータ２とデータ３に向けてリンクが張られていたが、今回（例えば２回目）の情報収集時には、新たにデータ２からデータ３に向けてのリンクが張られていた（すなわちデータ２が更新されており、更新日時は２００３年１月１日）という事例を示している。
【０２３０】
また、図８では、情報収集・提供システム１で前回に収集した情報が、参照情報記憶部２２の前参照情報の記憶領域及びデータ情報記憶部２３の前データ情報の記憶領域に格納されており、今回に収集した情報が、参照情報記憶部２２の新参照情報の記憶領域及びデータ情報記憶部２３の新データ情報の記憶領域に記録される状態を概略的に示している。
【０２３１】
この事例では、データ１とデータ３については、前回と今回とで変更が無いために、図８に示すように、今回の情報収集の結果、データ情報記憶部２３の新データ情報の記憶領域には、前データ情報の記憶領域と同じ情報が登録されるとともに、参照情報記憶部２２の前参照情報の記憶領域から新参照情報の記憶領域に、データ１とデータ３の情報が複写される。
【０２３２】
詳細には、データ１とデータ３については、今回の情報収集時には、各データ１，３のヘッダ情報をデータ処理部１４のデータ取得部１４１で収集し（図７のステップＳ７５０４）、データ情報記憶部２３の「新データ情報の記憶領域」に各情報を登録するが、この際に、データ情報記憶部２３の「前データ情報の記憶領域」に登録された各情報との比較を行い（ステップＳ７５０６）、比較の結果、前回と今回とで変更が無い（全ての内容が変わっていない）ことから、データ１とデータ３についてはデータ解析部１４２を用いた解析（ステップＳ７５１０）を行うことなく、参照情報記憶部２２の「新参照情報の記憶領域」には「前参照情報の記憶領域」の情報がそのまま複写される（ステップＳ７５０７）ことになる。
【０２３３】
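For data 2, on the other hand, there was a change (update) between the previous and current rounds, namely the link to data 3 that was set on January 1, 2003. In the current round, the update date/time and the updated content are therefore acquired; the update date/time is stored in the "new data information storage area" of the data information storage unit 23, and the updated content is stored in the "new reference information storage area" of the reference information storage unit 22.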
【０２３４】
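In detail, in the current round, the data acquisition unit 141 of the data processing unit 14 collects the header information of data 2 and registers it in the "new data information storage area" of the data information storage unit 23, comparing it at this point with the information registered in the "previous data information storage area" (step S7506). Because the "update date/time" differs, data 2 is acquired (step S7509) and analyzed by the data analysis unit 142 (step S7510), and the reference information from data 2 to data 3 is newly registered in the "new reference information storage area" of the reference information storage unit 22 (step S7511).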
【０２３５】
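Then, once all processing to be performed by the data processing units 14 has finished and every entry in the "new reference information storage area" of the reference information storage unit 22 is processed (in this case, when the "unprocessed" entry in the bottom row of the new reference information storage area shown in Fig. 8 has been changed to "processed" and all of the data processing units 14 (14a to 14c) have finished), all of the information in the specific site has been acquired, and the reference information processing unit 13 terminates.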
【０２３６】
次に、記憶装置２０におけるサイト情報記憶部２１と参照情報記憶部２２とデータ情報記憶部２３との関係を、図９に示す各データテーブルを参照してより具体的に説明する。なお、図９に示す参照情報記憶部２２及びデータ情報記憶部２３に登録された各データは、サイト情報記憶部２１に登録された一の特定サイト「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ」に関して対応付けられたデータのみを抽出して示している。
【０２３７】
図９に示すように、記憶装置２０では、参照情報記憶部２２の「サイト」欄にサイト情報記憶部２１の「連番」の数値（図９では１）が登録されることで、参照情報記憶部２２のデータがサイト情報記憶部２１に登録された一の特定サイト「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ」のデータに対応付けられており、同様に、データ情報記憶部２３の「サイト」欄にサイト情報記憶部２１の「連番」の数値（図９では１）が登録されることで、データ情報記憶部２３のデータがサイト情報記憶部２１の当該特定サイトのデータに対応付けられている。
【０２３８】
また、記憶装置２０では、参照情報記憶部２２の「リンク情報」の「リンク元」欄と「リンク先」欄にデータ情報記憶部２３の「連番」の番号が登録されることで、参照情報記憶部２２のデータがデータ情報記憶部２３のデータに対応付けられている。
【０２３９】
さらに、この記憶装置２０においては、参照情報記憶部２２とデータ情報記憶部２３とが、それぞれ二重化されており、具体的には、一方を前回の収集情報についての記憶領域、他方を最新の収集情報についての記憶領域として利用している。
【０２４０】
ここで、情報取得対象としての一の特定サイト「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ」につき、１度目の情報収集処理ではトップページの０００１．ｈｔｍｌとその下位ページの０００２．ｈｔｍｌの２ページで構成されていたが、２度目の情報収集処理の時点では、０００２．ｈｔｍｌに対してさらに下位となる０００３．ｈｔｍｌのページが新規で追加されていた、という事例を想定して、各記憶部２１〜２３におけるデータの更新等の処理について説明する。
【０２４１】
まず、１度目の情報収集処理では、上述した図７のステップＳ７５０１のデータ処理要求により、「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ」で示されるサイトを処理するように指示を受けたデータ処理部１４（１４ａとする。）は、サイト情報記憶部２１の「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ」に関する各情報（図９の▲１▼の段の各データ）から、「アクセス方法」の情報（この事例ではプロキシ情報）を取得して（ステップＳ７５０３）、ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌにアクセスする。
【０２４２】
そして、ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌにアクセスしたデータ処理部１４ａは、まずｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌのヘッダ情報をデータ取得部１４１で取得する（ステップＳ７５０４）が、この場合には前回取得した情報がないため（ステップＳ７５０６でＮｏ）、ステップＳ７５０９〜ステップＳ７５１２の処理を行う。
【０２４３】
具体的には、ステップＳ７５０９において、データ処理部１４ａは、データ取得部１４１でｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ（トップページ）内のコンテンツを取得し、後述するコンテンツボール３０の大部分を作成し、データ解析部１４２による解析のために、取得したデータを一時ファイルとして保存する。また、次のステップＳ７５１０において、データ処理部１４ａは、データ解析部１４２でコンテンツの解析と必要な情報を抽出してコンテンツボール３０の不足箇所の補充を行い、次のステップＳ７５１１では、データ登録部１４３が上述した各データを参照情報記憶部２２とデータ情報記憶部２３に新規登録するとともに、完成したコンテンツボール３０をステップＳ７５１２で連携システムに提供する。
【０２４４】
すなわち、ステップＳ７５１１では、データ処理部１４ａは、前ステップで抽出した情報に基づいて、該ページ（０００１．ｈｔｍｌ）の階層数（この例では１），種別（この例ではＨＴＭＬ），サイズ（この例では１０２４（バイト）），更新日時（この例では２００２年１２月１日零時零分），等のデータを、データ情報記憶部２３の新データ情報の記録領域（図９の▲５▼の段の該当する各欄）に記録するとともに、該ページ（０００１．ｈｔｍｌ）にリンクされたリンク先の情報があれば、該情報を参照情報記憶部２２の新参照情報の記憶領域（図９の▲２▼の段の該当する各欄）に記録して行く。
【０２４５】
そして、ステップＳ７５１１において、データ処理部１４ａのデータ登録部１４３は、該ページ（０００１．ｈｔｍｌ）に関する各情報を参照情報記憶部２２及びデータ情報記憶部２３に記録すると、該ページについての全てのデータ収集を完了したものとして、データ情報記憶部２３の新データ情報の記録領域（図９の▲５▼の段）の「収集状態」欄に、該ページのデータ収集が完了したことを示すフラグ（図９の「ＯＫ」）を記録して、データ処理管理部１６に終了を通知し（ステップＳ７５１２）、この際にコンテンツボール３０を連携システムに送信する。
【０２４６】
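In this case, when analyzing the content of page 0001.html (step S7510), the data processing unit 14a extracts the linked page "0002.html" described in the body (content) of 0001.html. In step S7511, the data registration unit 143 therefore secures (newly creates) a record for 0002.html (row ⑥ in Fig. 9) in the new data information storage area of the data information storage unit 23, and records in the "collection state" field of that record a flag (for example, "not yet") indicating that the data of the page has not yet been collected.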
【０２４７】
また、ステップＳ７５１０で０００２．ｈｔｍｌを抽出したデータ処理部１４ａは、当該リンク元のサイト（この場合は「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ」）に存在するリンク情報を記録するため、次のステップＳ７５１１では、データ登録部１４３により、参照情報記憶部２２の新参照情報の記憶領域に確保された、０００１．ｈｔｍｌから０００２．ｈｔｍｌへリンクされていることを示す記録欄（図９の▲２▼の段）に、リンク情報（リンク元，リンク先，行数，タグ名）を記録するとともに、その「処理状態」欄に、特定コンテンツ（この場合は「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００２．ｈｔｍｌ」のコンテンツ）の全部のデータを未だ収集していない状態であることを示す「未」のフラグを記録する。なお、図９の▲２▼の段に示す例は、ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌの本文の３２行目に０００２．ｈｔｍｌへリンクするためのリンク先情報があり、かつ、そのタグ名がＡタグのＨＲＥＦだった場合である。
【０２４８】
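Then, in the information collection/provision system 1, based on this recording by the data processing unit 14a, the reference information processing unit 13, which monitors the "processing state" field of the reference information storage unit 22, sends a data processing request containing "http://abcd.co.jp/0002.html" to the data processing management unit 16 in order to launch the next data processing unit 14 (14b) (step S80 in Fig. 6), and the data processing management unit 16 outputs a start command (step S75).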
【０２４９】
続いて、次に起動したデータ処理部１４ｂは、起動命令に含まれる「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００２．ｈｔｍｌ」に基づいて、当該ページへアクセスし、同様に、ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００２．ｈｔｍｌのヘッダ情報をデータ取得部１４１で取得し（ステップＳ７５０４）、ここでも前回取得した情報がなく（ステップＳ７５０６でＮｏ）、かつ外部サイトではないため（ステップＳ７５０８でＮｏ）、上述と同様にステップＳ７５０９〜ステップＳ７５１２の処理を行う。
【０２５０】
すなわち、ステップＳ７５０９において、データ処理部１４ｂは、データ取得部１４１でｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００２．ｈｔｍｌ内のコンテンツを取得し、後述するコンテンツボール３０の大部分を作成し、データ解析部１４２による解析のために、取得したデータを一時ファイルとして保存する。また、次のステップＳ７５１０において、データ処理部１４ｂは、データ解析部１４２でコンテンツの解析と必要な情報を抽出してコンテンツボール３０の不足箇所の補充を行い、次のステップＳ７５１１では、データ登録部１４３が上述した各データを参照情報記憶部２２とデータ情報記憶部２３に新規登録するとともに、完成したコンテンツボール３０をステップＳ７５１２で連携システムに提供する。
【０２５１】
すなわち、ステップＳ７５１１では、データ処理部１４ｂは、前ステップで抽出した情報に基づいて、該ページ（０００２．ｈｔｍｌ）の階層数（この例では２），種別（この例ではＨＴＭＬ），サイズ（この例では１０２４（バイト）），更新日時（この例では２００２年１２月１日零時零分），等のデータを、データ情報記憶部２３の新データ情報の記録領域（図９の▲６▼の段の該当する各欄）に記録するとともに、該ページ（０００２．ｈｔｍｌ）にリンクされたリンク先の情報があれば、該情報を参照情報記憶部２２の新参照情報の記憶領域に記録して行く。
【０２５２】
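In this case, 0002.html contained no link information (to 0003.html) at the time of this first collection (crawl). In step S7511, the data registration unit 143 of the data processing unit 14b therefore does not create (secure) any new record in the data information storage unit 23 or the reference information storage unit 22. Once it has recorded the information on the page (0002.html) of the specific site in the new data information recording area of the data information storage unit 23 (row ⑥ in Fig. 9), it regards all data collection for the page as complete, records in the "collection state" field the flag indicating that data collection for the page is complete ("OK" in Fig. 9), and changes the "not yet" flag in the "processing state" field of the reference information storage unit 22 (row ② in Fig. 9) to "done". As a result, every processing state in the reference information storage unit 22 is "done", which indicates that data collection has been completed for all pages of the specific site (that is, the pages under http://abcd.co.jp/), so the first round of information collection for this specific site can be completed.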
【０２５３】
すなわち、次のステップＳ７５１２でデータ処理部１４ｂのデータ登録部１４３からデータ処理管理部１６に処理の終了が通知されると、参照情報記憶部２２の「処理状態」欄を監視している参照情報処理部１３が「全ての参照情報のデータを収集した」（図６のステップＳ７９でＹｅｓ）と判定することで、この特定サイトの１度目の情報収集が終了する。
【０２５４】
なお、情報収集・提供システム１においては、当該特定サイトの次（２度目）の情報収集（巡回）の際には、データ処理部１４の起動に先立って、参照情報記憶部２２の「新」参照情報の記憶領域及びデータ情報記憶部２３の「新」データ情報の記憶領域を、それぞれ「前」参照情報の記憶領域及び「前」データ情報の記憶領域として扱うとともに、新たに新参照情報の記憶領域及び新データ情報の記憶領域を確保するように、参照情報処理部１３によって処理がなされる。
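The duplicated storage areas behave like a double buffer that is rotated before each round. A minimal sketch, assuming dictionaries stand in for the storage areas; the class and method names are hypothetical.

```python
class DoubleBufferedStore:
    """Illustrative stand-in for one duplicated storage unit (22 or 23)."""

    def __init__(self):
        self.previous = {}  # results of the previous round
        self.new = {}       # results being written in the current round

    def begin_round(self):
        """Before any data processing unit starts: treat the last round's
        "new" area as "previous" and secure an empty "new" area."""
        self.previous = self.new
        self.new = {}
```

After `begin_round`, header comparisons read from `previous` while all registrations go to `new`, so an interrupted round never corrupts the reference data used for comparison.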
【０２５５】
次に、特定サイト「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ」において新たなページ０００３．ｈｔｍｌが追加された後にデータ処理装置１０が行う２度目の情報収集の処理について説明する。
【０２５６】
２度目の情報収集処理においても、上述した図７のステップＳ７５０１のデータ処理要求により、「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ」で示されるサイトを処理するように指示を受けたデータ処理部１４（同様に１４ａとする。）は、上述と同様に、サイト情報記憶部２１の「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ」に関する各情報（図９の▲１▼の段の各データ）から、「アクセス方法」の情報（この事例ではプロキシ情報）を取得して（ステップＳ７５０３）、ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌにアクセスする。
【０２５７】
そして、ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌにアクセスしたデータ処理部１４ａは、同様に、まずｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌのヘッダ情報をデータ取得部１４１で取得する（ステップＳ７５０４）が、この事例では取得したヘッダ情報が前回の情報（すなわち図９の▲５▼の段の情報）と一致するため（ステップＳ７５０６でＹｅｓ）、該トップページ（０００１．ｈｔｍｌ）のコンテンツ（リンク情報を含む）についても同一とみなして、今回はデータ登録部１４３でステップＳ７５０７の処理を行う。すなわち、ステップＳ７５０７で、データ処理部１４ａのデータ登録部１４３は、データ情報記憶部２３の前データ情報の記録領域（図９の▲５▼及び▲６▼の段）の各欄のデータ（すなわちトップページ０００１．ｈｔｍｌ及びその下位ページ０００２．ｈｔｍｌについてのデータ）を、新データ情報の記録領域（図９の▲７▼及び▲８▼の段）にとりあえず全部複写して、下位ページ０００２．ｈｔｍｌについての「収集状態」欄だけ「未」のフラグを記録するとともに、参照情報記憶部２２の前参照情報の記録領域（図９の▲２▼の段）の情報を新参照情報の記録領域（図９の▲３▼の段）に複写して、「処理状態」欄だけ「未」のフラグを記録する。
【０２５８】
なお、コンテンツボール３０の作成については、データ取得部１４１及びデータ解析部１４２で作成・更新を行い、データ登録部１４３が参照情報記憶部２２及びデータ情報記憶部２３の記録情報を用いて最終的な更新を行う。これにより、ぺージ状態がＮＯＮＥ（更新なし）の前回と同一内容のコンテンツボール３０が、連携システムに提供されることになる。
【０２５９】
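Then, in the information collection/provision system 1, based on the recording by the data processing unit 14a, the reference information processing unit 13, which monitors the "processing state" field of the reference information storage unit 22, sends a data processing request containing "http://abcd.co.jp/0002.html" to the data processing management unit 16 in order to launch the next data processing unit 14 (14b) (step S80 in Fig. 6), and the data processing management unit 16 outputs a start command (step S75).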
【０２６０】
続いて、次に起動したデータ処理部１４ｂは、起動命令に含まれる「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００２．ｈｔｍｌ」に基づいて、当該ページへアクセスし、ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００２．ｈｔｍｌのヘッダ情報をデータ取得部１４１で取得して（ステップＳ７５０４）、取得したヘッダ情報を、前回取得したヘッダ情報（すなわち図９の▲６▼の段の情報）と比較する（ステップＳ７５０６）。
【０２６１】
この事例では、ページ０００２．ｈｔｍｌのヘッダ情報の「サイズ」及び「更新日時」が前回と異なるため（ステップＳ７５０６でＮｏ）、該ページ０００２．ｈｔｍｌのコンテンツについても異なるものとみなし、ステップＳ７５０９〜ステップＳ７５１２の処理を行うことになる。
【０２６２】
すなわち、ステップＳ７５０９において、データ処理部１４ｂは、データ取得部１４１でｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００２．ｈｔｍｌ内のコンテンツを取得し、後述するコンテンツボール３０の大部分を作成し、データ解析部１４２による解析のために、取得したデータを一時ファイルとして保存する。また、次のステップＳ７５１０において、データ処理部１４ｂは、データ解析部１４２でコンテンツの解析と必要な情報を抽出してコンテンツボール３０の不足箇所の補充を行い、次のステップＳ７５１１では、データ登録部１４３が上述した各データを参照情報記憶部２２とデータ情報記憶部２３に新規登録するとともに、完成したコンテンツボール３０をステップＳ７５１２で連携システムに提供する。
【０２６３】
すなわち、ステップＳ７５１１では、データ処理部１４ｂは、前ステップで抽出した情報に基づいて、該ページ（０００２．ｈｔｍｌ）の階層数（この例では２），種別（この例ではＨＴＭＬ），サイズ（この例では２０４８（バイト）），更新日時（この例では２００３年１月１日零時零分），等のデータを、データ情報記憶部２３の新データ情報の記録領域（図９の▲８▼の段の該当する各欄）に記録するとともに、該ページ（０００２．ｈｔｍｌ）にリンクされたリンク先の情報があれば、該情報を参照情報記憶部２２の新参照情報の記憶領域に記録して行く。
【０２６４】
そして、ステップＳ７５１１において、データ処理部１４ｂのデータ登録部１４３は、該ページ（０００２．ｈｔｍｌ）に関する各情報を参照情報記憶部２２及びデータ情報記憶部２３に記録すると、該ページについての全てのデータ収集を完了したものとして、データ情報記憶部２３の新データ情報の記録領域（図９の▲８▼の段）の「収集状態」欄に、該ページのデータ収集が完了したことを示すフラグ（図９の「ＯＫ」）を記録して、データ処理管理部１６に終了を通知する（ステップＳ７５１２）。
【０２６５】
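In this case, at the time of this second collection (crawl), page 0002.html contains link information for a further lower-level page, http://abcd.co.jp/0003.html. When analyzing the content of page 0002.html (step S7510), the data processing unit 14b therefore extracts the linked page "0003.html" described in the body (content) of 0002.html. In step S7511, the data registration unit 143 secures (newly creates) a record for 0003.html (row ⑨ in Fig. 9) in the new data information storage area of the data information storage unit 23, and records in the "collection state" field of that record the flag "not yet", indicating that the data of the page has not yet been collected.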
【０２６６】
また、ステップＳ７５１０で０００３．ｈｔｍｌを抽出したデータ処理部１４ｂは、当該リンク元のサイト（この場合は「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００２．ｈｔｍｌ」）に存在するリンク情報を記録するため、次のステップＳ７５１１では、データ登録部１４３により、参照情報記憶部２２の新参照情報の記憶領域に確保された、０００２．ｈｔｍｌから０００３．ｈｔｍｌへリンクされていることを示す記録欄（図９の▲４▼の段）に、リンク情報（リンク元，リンク先，行数，タグ名）を記録するとともに、その「処理状態」欄に、特定コンテンツ（この場合は下位ページ「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００３．ｈｔｍｌ」のコンテンツ）の全部のデータを未だ収集していない状態であることを示す「未」のフラグを記録する。なお、図９の▲４▼の段に示す例は、ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００２．ｈｔｍｌの本文の４８行目に０００３．ｈｔｍｌへリンクするためのリンク先情報があり、かつ、そのタグ名がＡタグのＨＲＥＦだった場合である。
【０２６７】
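Then, in the information collection/provision system 1, based on this recording by the data processing unit 14b, the reference information processing unit 13, which monitors the "processing state" field of the reference information storage unit 22, sends a data processing request containing "http://abcd.co.jp/0003.html" to the data processing management unit 16 in order to launch the next data processing unit 14 (14c) (step S80 in Fig. 6), and the data processing management unit 16 outputs a start command (step S75).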
【０２６８】
続いて、次に起動したデータ処理部１４ｃは、起動命令に含まれる「ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００３．ｈｔｍｌ」に基づいて、当該ページへアクセスし、同様に、ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００３．ｈｔｍｌのヘッダ情報をデータ取得部１４１で取得し（ステップＳ７５０４）、この場合は初めてのアクセスであり前回取得した情報がなく（ステップＳ７５０６でＮｏ）、かつ外部サイトではないため（ステップＳ７５０８でＮｏ）、上述と同様にステップＳ７５０９〜ステップＳ７５１２の処理を行う。
【０２６９】
そして、この事例では、該ページ０００３．ｈｔｍｌにはリンク情報が存在しなかったため、この場合には、データ処理部１４ｃのデータ登録部１４３は、ステップＳ７５１１では、データ情報記憶部２３及び参照情報記憶部２２に新しい欄を新設（確保）することなく、特定サイトの該ページ（０００３．ｈｔｍｌ）に関する各情報をデータ情報記憶部２３の新データ情報の記録領域（図９の▲９▼の段）に記録すると、該ページについての全てのデータ収集を完了したものとして、この「収集状態」欄に、該ページのデータ収集が完了したことを示すフラグ（図９の「ＯＫ」）を記録するとともに、参照情報記憶部２２の（図９の▲４▼の段）の「処理状態」欄の「未」フラグを、図９の「済」に変更する。この処理により、参照情報記憶部２２の処理状態が全て「済」になり、該特定サイト（ｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌ）の全データ収集が完了したことが示されるので、この特定サイトの２度目の情報収集を完了させることが可能となる。
【０２７０】
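That is, when the data registration unit 143 of the data processing unit 14c notifies the data processing management unit 16 of the end of processing in the next step S7512, the reference information processing unit 13, which monitors the "processing state" field of the reference information storage unit 22, determines that "the data of all reference information has been collected" (Yes in step S79 in Fig. 6), and the second round of information collection for this specific site ends.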
【０２７１】
なお、情報収集・提供システム１においては、当該特定サイトの次（３度目）の情報収集（巡回）の際には、データ処理部１４の起動に先立って、参照情報処理部１３により、参照情報記憶部２２の新参照情報の記憶領域内のデータ（図９の▲３▼及び▲４▼の段）は、前参照情報の記憶領域に上書きされる処理が行われ、かつ、データ情報記憶部２３の新データ情報の記憶領域内のデータ（図９の▲７▼，▲８▼，▲９▼の段）は、前データ情報の記憶領域に上書きされる処理が行われる。このような処理を行うことで、３度目以降の収集も、上述と同様の処理によりデータ収集が可能となる。
【０２７２】
この事例では、３ページしか存在しないサイトでのデータ収集等の動作を説明したが、それ以上のページを含むサイトでは、上述した処理を繰り返して、データ収集を行うことになる。
【０２７３】
また、この事例では、説明の複雑化を避けるために、一のページに他の一のページのみがリンクされているサイトでのデータ収集等の動作を説明したが、実際には一のページに他の複数のページがリンクされていることが多く、その場合には、上述のように、サイト情報記憶部２１に設定された最大同時アクセス数の範囲内で複数のデータ処理部１４が起動して（図３参照）、各データ処理部１４がそれぞれ図７（及び後述する図１１）の処理を行うことになる。
【０２７４】
さらには、ある特定サイト内のページに他の別のサイトのページがリンクされているような場合（例えば上述した特定サイトｈｔｔｐ：／／ａｂｃｄ．ｃｏ．ｊｐ／０００１．ｈｔｍｌの所定ページ０００３．ｈｔｍｌに他の別のサイトｈｔｔｐ：／／ｗｘｙｚ．ｃｏ．ｊｐのページ０００１．ｈｔｍｌがリンクされているような場合）であっても、上述した処理を繰り返すことで、一の特定サイトに関係する全てのデータ収集を行うことが可能である。但し、上述のように、この場合には、ｈｔｔｐ：／／ｗｘｙｚ．ｃｏ．ｊｐ／０００１．ｈｔｍｌについてのコンテンツ自体の取得が行われることはなく、また該ぺージからリンクされたページ等（例えばｈｔｔｐ：／／ｗｘｙｚ．ｃｏ．ｊｐ／０００２．ｈｔｍｌなど）についてアクセスされることもない。
【０２７５】
なお、図９では、サイト情報記憶部２１に別の特定サイト（特定サイト２）であるｈｔｔｐ：／／ｅｆｇｈ．ｃｏ．ｊｐ／０００１．ｈｔｍｌのサイト情報も登録された例を示しているが、情報収集・提供システム１では、このように複数の特定サイトが登録されて、複数のサイトの収集を行なう際にも、同様の処理を同時に並行して行なうことになる（図２参照）。但し、図９に示す例では、特定サイト１と特定サイト２とでは収集開始日時の設定が異なるため、情報収集等の処理は同時には行われない。
【０２７６】
さらに、この図９では、特定サイトをＷｅｂサイトとした場合の参照情報記憶部２２及びデータ情報記憶部２３に記憶する内容を示したが、特定サイトをディレクトリツリー型のファイルシステムや、ドメイン参加型ネットワーク機器群などとした場合でも、同様の処理により、情報収集及び連携システムへの情報提供の処理を実現できる。
【０２７７】
(Linking this system with other systems)
次に、情報収集・提供システム１と他システムとの連携、すなわち収集した情報を連携システムに提供する処理等について、図１０乃至図１２を参照して詳細に説明する。
【０２７８】
なお、図１０は、情報収集・提供システム１のデータ処理装置１０と他システムとの連携（結合形態）を概略的に示している。一方、図１１は、データ処理装置１０と他システムの実行動作を説明するためのフローチャートであり、図６のステップＳ７１のサブルーチンの動作を示している。また、図１２は、データ処理装置１０から他システムに送信される、統一したデータ形式としてのコンテンツボールの内容を示している。
【０２７９】
情報収集・提供システム１のデータ処理装置１０では、サイト全体の処理を管理する参照情報処理部１３と、サイトに含まれる個々のページのデータを処理するデータ処理部１４とにより、ローカルまたはネットワーク１００を介して該情報収集・提供システム１と接続される連携システムに対して、収集情報の提供を行う。
【０２８０】
ここで、情報収集・提供システム１から連携システムに対して提供（送信）される収集情報としては、ＣＰＵ内のＲＡＭを用いて転送するコンテンツボール３０（図１０及び図１２参照）と、コンテンツ自体を格納している不図示の一時ファイルとから成る。なお、これら収集情報のうちの一時ファイルについては、連携システムが必要とする情報の如何等によっては、提供しないようにしても構わない。
【０２８１】
また、情報収集・提供システム１は、接続された連携システムに対して、記憶装置２０における各記憶部２１，２２，２３へのアクセスを許可し、これらの登録データを連携システムが適宜参照可能とすることで、補助的な情報の提供も行う。
【０２８２】
情報収集・提供システム１から連携システムに対しての収集情報（コンテンツボール３０及び一時ファイル）の提供（送信）時期については、情報収集・提供システム１が特定サイトの情報収集を行なう直前の（ａ）サイト情報収集開始（ＳＴＡＲＴ）の際（すなわち図６のステップＳ７１の処理中）と、情報収集・提供システム１が特定サイトに含まれるページのデータ処理を終了した際（すなわち図７のステップＳ７５１１の処理中）と、情報収集・提供システム１が特定サイトの情報収集を全て完了した直後の（ｂ）サイト情報収集完了（ＥＮＤ）の際（すなわち図６のステップＳ８１の処理中）と、の３つに大別され、このうちページのデータ処理を終了した際においては、特定サイトに含まれる個々のページを正常に処理した（ｃ）ページ処理済（ＩＮＦＯ）の場合と、特定サイトに含まれる個々のページを処理する際に、異常等を検出したため該ページのデータを正常に処理できなかった（ｄ）ページ処理異常（ＥＲＲ）の場合とがある。
【０２８３】
なお、（ａ）サイト情報収集開始或いは（ｂ）サイト情報収集完了の際の提供情報には、当該特定サイトのサイト名が含まれ、（ｃ）ページ処理済の際の提供情報には、特定サイトに含まれる当該ページの処理結果を示す情報が含まれ、（ｄ）ページ処理異常の際の提供情報には、当該ページの処理時に検出された異常を示す情報が含まれることになる。
【０２８４】
また、コンテンツボール３０及び一時ファイルの作成主体については、（ａ）サイト情報収集開始、或いは（ｂ）サイト情報収集完了の場合には、データ処理装置１０の参照情報処理部１３が作成し、（ｃ）ページ処理済、或いは（ｄ）ページ処理異常の場合には、データ処理装置１０のデータ処理部１４が作成することになる。
【０２８５】
情報収集・提供システム１は、連携する他システムに対して、共通のインタフェースを提供する。ここで、共通のインタフェースとしては、図１１のステップＳ７１３１〜ステップＳ７１３６に示す処理、及び、情報収集・提供システム１への登録処理、さらには、情報収集・提供システム１からのイベント検知処理、等が挙げられる。
【０２８６】
各連携システムを情報収集・提供システム１に登録する（すなわち図９に示す記憶装置２０のサイト情報記憶部２１の「連携システム名」に登録する）処理においては、不図示の表示部の入力画面上で、当該連携システムを図１０のいずれの位置に結合するかについての設定を行うことが可能となっている。結合の設定を行う際には、当該連携システムのシステム名（図９参照）を用い、どの位置に結合したかは、情報収集・提供システム１のＣＰＵ内に記憶される。
【０２８７】
例えば、図１０に示す例では、連携システムＡが情報収集・提供システム１に直接結合した設定とされ、また、連携システムＢは、先に結合したシステムＡに対して結合した設定とされる。本実施形態では、このような結合態様とすることにより、連携システムＡへの情報提供処理を行なった後に連携システムＢへの情報提供処理を行なえる、という処理の連続性について保証している。
【０２８８】
一方、図１０に示す例では、連携システムＣについては、連携システムＡ及びＢと関わりなく、情報収集・提供システム１に直接結合した設定とされる。本実施形態では、このように、先に結合している連携システムＡ等が存在していても、情報収集・提供システム１自体に直接結合できる構成となすことにより、他の連携システムの動作に影響を受けない、という連携システム相互間での独立性についても保証している。
【０２８９】
本実施の形態の情報収集・提供システム１では、このように、各連携システムに対して、処理に応じたシステム連携（結合態様）を選択できるようにしているので、連携システム側でも、本システムを利用したネットワークアプリケーションの開発が容易になる。
【０２９０】
(Data structure of the content ball)
この情報収集・提供システム１に結合するシステムは、コンテンツボール３０を受信することで、データ処理装置１０の処理状況と、特定サイトの情報を全て知ることができる。コンテンツボール３０は、上述した共通インタフェースと同様に、（ａ）サイト情報収集開始と、（ｂ）サイト情報収集完了と、（ｃ）ページ処理済と、（ｄ）ページ処理異常と、でそれぞれ同じデータ構造（データフォーマット）となっており、以下は、図１２を参照してコンテンツボール３０のデータ構造について説明する。
【０２９１】
図１２に示すように、コンテンツボール３０には、メッセージステータスが含まれる。このメッセージステータスの種類としては、図１２に示すように、「ＳＴＡＲＴ」，「ＥＮＤ」，「ＩＮＦＯ」，「ＥＲＲ」があり、これらはそれぞれ、前述の（ａ）サイト情報収集開始，（ｂ）サイト情報収集完了，（ｃ）ページ処理済，（ｄ）ページ処理異常を示すものである。
【０２９２】
連携システムは、コンテンツボール３０のメッセージステータスのデータを参照することで、当該連携システム固有の処理を行なう。具体的には、例えばメッセージステータスがＳＴＡＲＴの場合には、前処理として、当該連携システムのデータベースの初期化処理を行ったり、ＩＮＦＯの場合は、後処理として、コンテンツボール３０の「一時ファイル名」から一時ファイルを取得して単語を検索する処理を行う、等である。このような連携システム固有の処理は、後述するステップＳ７１３２（前処理）又はステップＳ７１３６（後処理）で行われることになる。
【０２９３】
また、コンテンツボール３０には、サイト情報が含まれる。このサイト情報は、記憶装置２０のサイト情報記憶部２１のサイト情報と同じ情報である。連携システムは、コンテンツボール３０のサイト情報を参照することで、例えば当該連携システムの前処理や後処理の設定によっては、記憶装置２０のサイト情報記憶部２１にアクセスして、特定サイトの内容を参照することも可能である。
【０２９４】
また、コンテンツボール３０には、ページ情報が含まれる。ページ情報とは、図１２に示すように、そのデータのネットワーク上の場所（具体的にはＵＲＬなど），階層，種別，サイズ，更新日時，収集状態を指しており、これらは記憶装置２０のデータ情報記憶部２３の各々と同じ内容である。連携システムは、コンテンツボール３０のページ情報を参照することで、同様に、例えば当該連携システムの前処理や後処理の設定に基づいて、記憶装置２０のデータ情報記憶部２３にアクセスして、特定ページの内容を参照することも可能である。
【０２９５】
また、コンテンツボール３０には、ページ状態が含まれる。ページ状態とは、データ処理部１４が今回収集したページの状態について、前回収集時のデータ（すなわちデータ情報記憶部２３の前データ情報の記憶領域の登録データ）と比較した結果について示すものであり、図１２に示すように、今回収集したページが新規に追加されたものである場合にはＮＥＷが、前回の収集から更新があった場合にはＵＰＤＡＴＥが、前回の収集時と同じ場合にはＮＯＮＥが設定される。
【０２９６】
具体的には、特定サイトへの最初の収集のときには、全てのコンテンツボール３０のページ状態はＮＥＷとなり、２度目以降の収集のときには、前回のデータと比較した結果のＮＥＷ（新規ページ），ＵＰＤＡＴＥ（更新あり），ＮＯＮＥ（更新なし）のいずれかが設定されることになる。
【０２９７】
また、コンテンツボール３０には、一時ファイル名が含まれる。一時ファイル名とは、図４で説明した、データ処理部１４のデータ取得部１４１がネットワーク１００越しにサーバから取得したコンテンツを保存した一時ファイルの名称及び場所（ディレクトリ名等）を示すものである。したがって、連携システムは、コンテンツボール３０の一時ファイル名を参照することで、一時ファイルにアクセスして、コンテンツの内容を参照することが可能となる。
【０２９８】
コンテンツボール３０のページ情報とページ状態は、メッセージステータスがＩＮＦＯ（ページ処理済）またはＥＲＲ（ページ処理異常）の場合に設定され、ＳＴＡＲＴ（サイト情報収集開始）或いはＥＮＤ（サイト情報収集終了）の場合には設定されず、空の状態となる。また、コンテンツボール３０の一時ファイル名は、ページ状態がＮＥＷ（新規ページ）やＵＰＤＡＴＥ（更新あり）の場合など、データ処理装置１０がコンテンツ自体を取得し、一時ファイルを作成した場合に設定される。
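The content ball 30 of Fig. 12 can be modeled as a single record whose optional fields are filled in according to the message status. The sketch below is an assumed representation: the field names are English renderings of the items in Fig. 12, and `is_consistent` encodes only the rules stated in the description (page fields empty for START and END, temporary file only when content was actually fetched).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentBall:
    message_status: str                    # "START", "END", "INFO" or "ERR"
    site_info: str                         # same value as in storage unit 21
    page_info: Optional[dict] = None       # location, level, type, size, ...
    page_state: Optional[str] = None       # "NEW", "UPDATE" or "NONE"
    temp_file_name: Optional[str] = None   # set only when content was saved

    def is_consistent(self) -> bool:
        """Check the field rules described for each message status."""
        if self.message_status in ("START", "END"):
            return (self.page_info is None and self.page_state is None
                    and self.temp_file_name is None)
        if self.message_status not in ("INFO", "ERR"):
            return False
        if self.temp_file_name is not None:
            return (self.message_status == "INFO"
                    and self.page_state in ("NEW", "UPDATE"))
        return True
```

Because every notification uses this one shape, a linked system needs a single parser and can branch on `message_status` alone.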
【０２９９】
次に、図１１のフローチャートを参照して、図６のステップＳ７１のサブルーチン、すなわちデータ処理装置１０が特定サイトの情報収集の実行を開始する場合に、データ処理装置１０から連携システムに（ａ）サイト情報収集開始（ＳＴＡＲＴ）を通知する処理の詳細について説明する。
【０３００】
なお、図１１は（ａ）サイト情報収集開始（ＳＴＡＲＴ）の際についてのフローチャートであるが、上述した（ｂ）サイト情報収集完了（ＥＮＤ）の際，（ｃ）ページ処理済（ＩＮＦＯ）の際，（ｄ）ページ処理異常（ＥＲＲ）の際についても同様のフローで行われる。したがって、（ｂ）サイト情報収集完了（ＥＮＤ）の場合には図１１のフローチャートが図６のステップＳ８１のサブルーチンとなり、（ｃ）ページ処理済（ＩＮＦＯ）と（ｄ）ページ処理異常（ＥＲＲ）の場合には図１１のフローチャートが図７のステップＳ７５１１のサブルーチンとなる。
【０３０１】
また、図１１のステップＳ７１１乃至ステップＳ７１３の処理は、（ａ）サイト情報収集開始（ＳＴＡＲＴ）と（ｂ）サイト情報収集完了（ＥＮＤ）の場合には、参照情報処理部１３が主体となり、（ｃ）ページ処理済（ＩＮＦＯ）と（ｄ）ページ処理異常（ＥＲＲ）の場合には、データ処理部１４が主体となる。
【０３０２】
一方、図１１の右側に示すフローチャート（ステップＳ７１３１乃至ステップＳ７１３６）は、ステップＳ７１３でコンテンツボール３０を受信した連携システムが行う処理を示すものである。
【０３０３】
データ処理装置１０において、前述の起動命令に基づいて起動した参照情報処理部１３は、図６のステップＳ７１で特定サイトのサイト情報（ＵＲＬ等）を取得するが、その際に、取得したＵＲＬ等に基づいて、コンテンツボール３０を作成する（ステップＳ７１１）。
【０３０４】
この場合のコンテンツボール３０は、図１２のうち、メッセージステータス（＝ＳＴＡＲＴ）とサイト情報だけが含まれ、ページ情報等は含まれないものとなる。
【０３０５】
また、参照情報処理部１３は、（ｂ）サイト情報収集完了（ＥＮＤ）の場合には、図６のステップＳ８１内の処理として、コンテンツボール３０を作成する。
【０３０６】
この場合のコンテンツボール３０は、図１２のうち、メッセージステータス（＝ＥＮＤ）と、サイト情報だけが含まれ、ページ情報等は含まれないものとなる。
【０３０７】
一方、データ処理部１４は、（ｃ）ページ処理済（ＩＮＦＯ）と（ｄ）ページ処理異常（ＥＲＲ）の場合に、データ処理を終えた図７のステップＳ７５１１において、収集データからコンテンツボール３０を作成する。
【０３０８】
この場合のコンテンツボール３０は、図１２の全ての情報が含まれたものとなる。
【０３０９】
これらの場合、参照情報処理部１３或いはデータ処理部１４は、コンテンツを取得し一時ファイルを作成した場合には、その一時ファイル名もコンテンツボール３０に格納する。
【０３１０】
なお、本実施形態では、情報収集・提供システム１は、メッセージステータスがＳＴＡＲＴ、ＥＮＤ、ＥＲＲの場合と、ＩＮＦＯでかつページ状態がＮＯＮＥの場合には、一時ファイルを作成しないようになっている。換言すれば、本実施形態では、メッセージステータスがＩＮＦＯでかつページ状態がＮＥＷまたはＵＰＤＡＴＥの場合、すなわち、そのページのコンテンツ（更新されていた場合を含む）を一度も本システム１が収集していない場合にのみ、一時ファイルが作成されることになる。
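[0311]
In the next step S712, the reference information processing unit 13 (in cases (c) and (d) above, the data processing unit 14; the same applies hereinafter) acquires the cooperation system name (see FIG. 9) registered in the site information storage unit 21 of the storage device 20, and proceeds to step S713.
[0312]
In step S713, the reference information processing unit 13 transmits the content ball 30 created in step S711 to the cooperation system (here, cooperation system A in FIG. 10).
[0313]
Also in step S713, the reference information processing unit 13 waits until the cooperation system to which the content ball 30 was sent (here, cooperation system A) completes the processing of steps S7131 to S7136 described below.
[0314]
The processing performed by the cooperation system to which the content ball 30 is sent (cooperation system A) is described below with reference to the flowchart on the right side of FIG. 11.
[0315]
When cooperation system A receives the content ball 30 from the reference information processing unit 13 (or the data processing unit 14) in step S7131, it refers, in the next step S7132, to the message status of the received content ball 30, performs first processing specific to cooperation system A (pre-processing) as necessary, and proceeds to step S7133.
[0316]
In step S7133, cooperation system A accesses the site information storage unit 21 of the storage device 20 and acquires from it the cooperation system names related to itself (system A) that are registered in the “cooperation system name” field of the specific site (in this example, the names of cooperation system A and of cooperation system B, which is associated with cooperation system A), and proceeds to step S7134.
[0317]
In step S7134, cooperation system A determines whether the acquired cooperation system names include any other cooperation system stored in the CPU (not shown) of cooperation system A (that is, whether another cooperation system name is associated with its own system name). If Yes, that is, if it determines that a cooperation system other than cooperation system A takes part in the current cooperation, it proceeds to step S7135. If No, that is, if it determines that no cooperation system other than cooperation system A takes part, it proceeds to step S7136.
[0318]
In step S7135, cooperation system A forwards (that is, copies and transmits) the content ball 30 to the other cooperation system (in this example, cooperation system B in FIG. 10), and then proceeds to step S7136.
[0319]
The other cooperation system that receives the content ball 30 from cooperation system A (in this example, cooperation system B) performs steps S7131 to S7136 in the same way as cooperation system A. In this case, cooperation system A waits in step S7135 until cooperation system B finishes that processing, and proceeds to step S7136 once it has finished.
[0320]
In step S7136, cooperation system A refers to the message status of the content ball 30, performs second processing specific to cooperation system A (post-processing) as necessary, and then ends the series of processing.
[0321]
Thus, when cooperation system A finishes steps S7131 to S7136, the reference information processing unit 13 (in cases (c) and (d) above, the data processing unit 14) exits step S713 and proceeds to step S72 of FIG. 6.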
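The receive, pre-process, forward, post-process sequence of steps S7131 to S7136 can be sketched as follows. This is an illustrative model only: the class `CooperationSystem`, the `related` dictionary (standing in for the “cooperation system name” field of the site information storage unit 21), and the dictionary-shaped content ball are all assumptions, and a plain synchronous method call stands in for transmission and waiting.

```python
class CooperationSystem:
    """Hypothetical stand-in for one cooperation system (A, B, ...)."""

    def __init__(self, name, registry, log):
        self.name = name
        self.registry = registry  # maps system name -> CooperationSystem
        self.log = log            # shared list recording the order of processing

    def receive(self, ball, related):
        # S7131/S7132: receive the ball and run system-specific pre-processing.
        self.log.append((self.name, "pre", ball["status"]))
        # S7133/S7134: look up systems associated with this one, excluding itself.
        others = [n for n in related.get(self.name, []) if n != self.name]
        # S7135: copy and forward; the call returns only after the other
        # system has finished, which models the wait described above.
        for other in others:
            self.registry[other].receive(dict(ball), related)
        # S7136: system-specific post-processing.
        self.log.append((self.name, "post", ball["status"]))
```

With `related = {"A": ["A", "B"], "B": ["B"]}`, a ball sent to A is pre-processed by A, fully handled by B, and only then post-processed by A.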
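[0322]
The series of processing described with reference to FIG. 11 is performed once each when information collection for a specific site starts and ends (cases (a) and (b) above), and every time an item of data (a page, etc.) in the specific site is processed (cases (c) and (d) above). Specifically, if a certain Web site (specific site) has 100 pages, for example, the creation and transmission of a content ball 30 and the processing in the cooperation system are performed 1 + 100 + 1 = 102 times.
[0323]
It is therefore desirable for the cooperation system to estimate in advance the CPU usage and processing time of each specific processing in step S7132 or step S7136, to perform specific processing that uses much CPU and takes a long time at (a) site information collection start and/or (b) site information collection completion, and to perform specific processing that uses little CPU and takes little time at (c) page processed and/or (d) page processing error. The reason is that a plurality of data processing units 14 (14a to 14n) normally operate in the data processing device 10 of the information collecting/providing system 1, so that usage of the CPU and the network 100 is highest during (c) and (d).
[0324]
As described above, the information collecting/providing system 1 of the embodiment transmits provided data (content ball 30) in the same data format to the cooperation system at each of the following points: when data collection for the specific site starts ((a) site information collection start (START)), when data processing for one page in the specific site ends ((c) page processed or (d) page processing error), and when data collection for the specific site ends ((b) site information collection completion (END)). The cooperation system, which is the client side, can therefore refer to the multiple different forms of information in the specific site without being particularly aware of the differences in form, which reduces the effort of management and information reference. In addition, the cooperation system can unify its reception and analysis of content balls 30, which shortens the man-hours for writing the cooperation part of a program and improves the development efficiency of network applications that use this system.
[0325]
By analyzing each received content ball 30, the cooperation system can identify new and updated pages of the specific site from information such as the “page state”; can tell that a link is broken when the “message status” is “ERR”; can obtain the total number of pages in the specific site by counting the content balls 30 whose “message status” is “INFO”; and can obtain the total size of the specific site by summing the “size” in the “page information”. Furthermore, the cooperation system can obtain various information on the specific site by accessing the storage units 21, 22, and 23 of the storage device 20 and referring to the necessary information.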
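The analysis a cooperation system could perform on received content balls (new and updated pages, broken links, total page count, total size) is sketched below. The dictionary layout of a content ball, including the keys `status`, `page`, `url`, `size`, and `state`, is a hypothetical simplification of the structure of FIG. 12, not the format defined by the specification.

```python
def summarize(balls):
    """Summarize a sequence of content balls for one specific site.

    Assumption: each ball is a dict with a "status" key, and INFO/ERR
    balls also carry a "page" dict with "url", "size" and "state".
    """
    info = [b for b in balls if b["status"] == "INFO"]
    return {
        # Counting INFO balls gives the total number of pages.
        "total_pages": len(info),
        # Summing the page sizes gives the total size of the site.
        "total_size": sum(b["page"]["size"] for b in info),
        # An ERR status indicates a so-called broken link.
        "broken_links": [b["page"]["url"] for b in balls if b["status"] == "ERR"],
        # The page state identifies new and updated pages.
        "changed_pages": [b["page"]["url"] for b in info
                          if b["page"]["state"] in ("NEW", "UPDATE")],
    }
```

Because every ball uses one format, the same function handles START, INFO, ERR, and END messages without special cases for each content type.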
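[0326]
Therefore, the information collecting/providing system 1 of the present embodiment can provide various information services to each cooperation system and can meet the needs of various services other than search.
[0327]
In the embodiment described above, the data processing device 10 acquires data shared on the network 100 via the network 100, but the invention is not limited to this. For example, the information collecting/providing system 1 and another computer (not shown) may form a LAN so that the data processing device 10 acquires data in that other computer. The information collecting/providing system 1 may, of course, also be incorporated directly into the computer from which data is to be acquired, so as to acquire the data in that computer.
[0328]
Further, the above embodiment mainly describes the case where the information to be collected and provided is WWW data, but the invention is not limited to this. For example, a directory-tree file system can be the collection target, with the directory tree treated as the hierarchy (links) and the files stored in each directory as the collected data. All the data in the directory tree can then be acquired by acquiring every file in a directory, detecting its subdirectories, and continuing to acquire the files in those subdirectories.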
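The directory-tree case described here (files as collected data, subdirectories as links) can be illustrated with the Python standard library. This is a minimal sketch under the assumption of a local file system; the function name `collect_files` is hypothetical, and the temporary tree at the end exists only to make the example runnable.

```python
import os
import tempfile


def collect_files(root):
    """Return every file path under root, treating subdirectories as links."""
    collected = []
    pending = [root]                      # directories still to be visited
    while pending:
        directory = pending.pop()
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)    # detected subdirectory
                elif entry.is_file():
                    collected.append(entry.path)  # collected data
    return sorted(collected)


# Build a small throwaway tree and collect it.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("x")
    found = [os.path.relpath(p, root) for p in collect_files(root)]
```

Collection ends when no unvisited subdirectory remains, which mirrors how link information drives termination for a Web site.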
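[0329]
Likewise, a domain-participation network can be the collection target, with domains treated as the hierarchy (links) and the network devices belonging to a domain as the collected data. By acquiring the state of every network device in a domain, detecting its subdomains, and continuing to acquire the state of the network devices in those subdomains, information on the state of every network device in the domain (for example, “running”, “operating without problems”, or “hung up”) can be acquired.
[0330]
As described above, the information collecting/providing system 1 performs high-speed, high-quality information collection and can collect the information in a designated site efficiently and quickly.
[0331]
The information collecting/providing system 1 can also reduce the load on hardware resources such as the CPU and the network during information collection.
[0332]
Furthermore, the information collecting/providing system 1 improves the services for providing information after it has been collected from a designated site.
[0333]
Moreover, the information collecting/providing system 1 can treat groups of information that have a hierarchical (link) relationship but differ in format, whether data in a local computer or data shared on a network (for example, HTML (HyperText Markup Language), of course, as well as Macromedia Flash, directory-tree file systems, and groups of domain-participation network devices), as if they were groups of information in the same format forming a single file system. It can collect the data based on the hierarchical relationship, register it in the storage device, and provide it to cooperation systems. As a result, users of this system and of the cooperation systems can refer to the multiple different forms of information in a specific site without being particularly aware of the differences, which greatly reduces the effort of management and information reference.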
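[0334]
[Effects of the Invention]
As described in detail above, the present invention makes it possible to provide an information collection system, an information collection method, and an information collection program that build a high-speed, high-quality information collection system and collect the information in a designated site efficiently and quickly.
[0335]
The present invention also makes it possible to provide an information collection system, an information collection method, and an information collection program that can reduce the load on hardware resources such as the CPU and the network during information collection.
[0336]
Furthermore, the present invention makes it possible to provide an information collection system, an information collection method, and an information collection program that improve the services for providing information after it has been collected from a designated site.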
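[Brief description of the drawings]
FIG. 1 is a functional block diagram showing the schematic configuration of an information collecting/providing system to which the present invention is applied.
FIG. 2 is a diagram for explaining the multiple-instance configuration, in particular of the reference information processing units and the data processing units in the information collecting/providing system, used when information is collected concurrently from a plurality of sites and a plurality of contents.
FIG. 3 is a diagram comparing the operation of the information collecting/providing system when collecting information concurrently from a plurality of contents with the operation of a conventional robot-type engine system, and explaining the operation of the notification (wake-up) function between the data processing management unit and each data processing unit when two data processing units in the data processing device collect information from mutually different contents.
FIG. 4 is a diagram for explaining the configuration, in particular of the data processing unit in the information collecting/providing system, used to process a plurality of different protocols and data formats transparently.
FIG. 5 is a flowchart outlining the operation of the entire system when the information collecting/providing system acquires information from a specific site, mainly for explaining the operation up to the start of information acquisition.
FIG. 6 is a flowchart outlining the execution of information acquisition for one specific site, showing the processing of the routine derived from step S7 of FIG. 5.
FIG. 7 is a flowchart detailing the execution of information acquisition for one page in one specific site, showing the operation of the routine derived from step S75 of FIG. 6.
FIG. 8 is a diagram explaining the operation of the reference information processing unit and each data processing unit in the information collecting/providing system, and the data and the like stored in each storage unit of the storage device.
FIG. 9 is a diagram showing data tables of the information stored in each storage unit of the storage device in the information collecting/providing system.
FIG. 10 is a diagram schematically showing cooperation between the data processing device of the information collecting/providing system and other systems.
FIG. 11 is a flowchart showing the processing in which the information collecting/providing system creates a content ball and transmits it to a cooperation system, and the processing performed by the cooperation system that has received the content ball; it explains the subroutine of steps S71 and S81 of FIG. 6 and step S7511 of FIG. 7.
FIG. 12 is a diagram explaining the data structure of the content ball that the information collecting/providing system transmits to a cooperation system.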
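[Explanation of symbols]
1 Information collecting/providing system
10 Data processing device
11 System management unit
12 Site management unit (management means)
13 (13a, 13b, 13c, ...) Reference information processing unit (management means, provided data generation means)
14 (14a, 14b, 14c, ...) Data processing unit (data processing means, provided data generation means)
141 (141a, 141b, ...) Data acquisition unit (page access means, header information acquisition means, determination means, content acquisition means)
142 (142a, 142b, ...) Data analysis unit (content acquisition means)
143 Data registration unit (information registration means)
15 Reference information processing management unit (management means)
16 (16a, 16b, 16c, ...) Data processing management unit (management means)
20 Storage device
21 Site information storage unit
22 Reference information storage unit
23 Data information storage unit
30 Content ball (provided data)
100 Network
[0001]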
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an information collection system, an information collection method, and an information collection program. More specifically, it relates to a system and the like that collects information published on a network and provides the collected information to various client-side computer devices, and in particular to an information collection system and the like that acquires a large number of documents and data in a designated site at high speed.
[0002]
[Prior art]
Conventionally, a robot-type search engine has been known as a system for collecting information published on a network. In a robot-type search engine, a program called a robot circulates around the Internet and automatically collects WWW (World Wide Web) page data.
[0003]
Conventional robot-type search engines mainly collect WWW page data from the entire Internet and do not collect WWW page data from specific sites only. Their main focus is on collecting a large amount of WWW page data, not on the speed of collection.
[0004]
For this reason, because its data collection target is the entire Internet, the conventional robot-type search engine system is configured so that the robot constantly circulates the entire Internet, which places a high load on the CPU that controls the system and on the network that the system uses.
[0005]
On the other hand, there is a method of efficiently operating a plurality of robots and reducing the simultaneous operation time of each robot to reduce the load on a CPU and a network (for example, see Patent Document 1).
[0006]
However, in the system of Patent Document 1, a plurality of robots are operated in a distributed manner, one for each page (content), and, to reduce the load on the CPU and the network, each robot is started sequentially at time intervals so that a large number of robots do not run at once and collect a large number of pages at the same time. The information collection time itself is therefore extended.
[0007]
Further, conventional robots are often limited to patrolling and acquiring HTML (HyperText Markup Language), which is easy to parse because its specification is public. Because a program that collects information in several formats becomes structurally complex, such robots often cannot handle multimedia content other than HTML, typified by Macromedia (registered trademark) Flash (http://www.macromedia.com/), which has come into wide use in recent years.
[0008]
In addition, conventional robots are often part of a search engine and are closely tied to an indexer, a dedicated program for full-text search of the WWW, so that they are difficult to apply to functions other than WWW search.
[0009]
Moreover, although the spread of the WWW and the like has created a need for various services other than search, the software provided so far has often been either an HTML-only acquisition tool or library, or bundled as a set with the application program described above.
[0010]
[Patent Document 1]
JP 2000-76264 A
[0011]
[Problems to be solved by the invention]
The present invention has been proposed to solve the above problems. Its first object is to provide an information collection system, an information collection method, and an information collection program that build a high-speed, high-quality information collection system and collect the information in a designated site efficiently and quickly.
[0012]
A second object of the present invention is to provide an information collection system, an information collection method, and an information collection program that can reduce the load on hardware resources such as a CPU and a network when collecting information.
[0013]
Further, a third object of the present invention is to provide an information collection system, an information collection method, and an information collection program, which realize an improvement in service related to information provision after information collection in a designated site.
[0014]
[Means for Solving the Problems]
The first configuration of the information collection system according to the present invention comprises: a data processing device having a plurality of data processing means, each of which accesses one page in one site to collect and process various data on the plurality of contents constituting the site, and management means for managing the data processing means; and a storage device that stores at least specific site information, including information indicating the site that the data processing device accesses first, and information on each content in the site. The data processing means of the data processing device has: page access means for accessing a page based on the specific site information stored in advance in the storage device, and linked pages linked to that page; header information acquisition means for acquiring header information of the accessed page; content acquisition means for acquiring the content of the accessed page; link information acquisition means for acquiring link information indicating the location of linked pages linked to the accessed page; and information registration means for registering predetermined information based on each piece of acquired information in the storage device. The management means of the data processing device has: activation management means for activating n (n is 1 or more) data processing means based on the information stored in the storage device and the link information acquired by the link information acquisition means; and termination management means for terminating data collection for the site based on the link information acquired by the link information acquisition means.
[0015]
In the information collection system having the first configuration, when the activation management means activates a data processing means based on the specific site information stored in the storage device, the page access means accesses one page; the header information acquisition means then acquires the header information of the page, the content acquisition means acquires the content of the page, and the link information acquisition means acquires the link information, in that order; and the information registration means registers predetermined information based on the acquired information in the storage device. The activation management means then activates one or more data processing means based on the acquired link information, so that the above processes are performed concurrently and repeatedly, and the termination management means terminates data collection for the site.
[0016]
Therefore, according to the first configuration, it is possible to rapidly collect various data in the site based on the link (or hierarchy) relation of each page.
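The link-driven collection of the first configuration can be sketched as a traversal over link information. The sketch below is sequential for brevity, whereas the configuration described above runs data processing means concurrently; the callback `fetch`, which returns a header, a content string, and a list of links for a URL, is a hypothetical stand-in for the page access, header, content, and link acquisition means.

```python
from collections import deque


def collect_site(start_url, fetch):
    """Collect every page reachable from start_url by following links.

    Assumption: fetch(url) returns (header, content, links).
    """
    registered = {}                 # stand-in for the storage device
    pending = deque([start_url])    # pages waiting for a data processing means
    seen = {start_url}
    while pending:                  # collection ends when no new links remain
        url = pending.popleft()
        header, content, links = fetch(url)
        registered[url] = {"header": header, "size": len(content)}
        for link in links:
            if link not in seen:    # each page is processed once
                seen.add(link)
                pending.append(link)
    return registered
```

For a three-page site in which `/` links to `/a` and `/b`, the pages are registered in the order `/`, `/a`, `/b`, and the loop stops once no unprocessed link is left.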
[0017]
According to a second configuration of the information collection system of the present invention, in the first configuration, collection start information indicating a collection start date and time is stored in the storage device as specific site information, and the activation management means of the data processing device activates the data processing means based on the collection start information.
[0018]
In the second configuration, access (patrol) to the page based on the specific site information stored in the storage device, and in turn to each page in the site, is performed periodically, so that information on the plurality of contents constituting the site is collected periodically. Information such as updates and additions to the various contents can therefore be obtained regularly while the load on hardware resources (CPU, network, etc.) is reduced.
[0019]
In a third configuration of the information collection system according to the present invention, in the first or second configuration, the storage device stores, as specific site information, coexistence number upper limit information indicating the maximum number of data processing means that may coexist to process the pages in the one site, and the activation management means of the data processing device activates the data processing means within the maximum coexistence number given by the coexistence number upper limit information.
[0020]
In the third configuration, the coexistence number upper limit information limits the maximum number of page access means running at the same time, and therefore also limits the number of header information acquisition means, content acquisition means, link information acquisition means, and information registration means running at the same time. A temporary high load is thus not placed on hardware resources (CPU, network, etc.), and the hardware resources can be used efficiently.
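One way to picture the coexistence limit is a worker pool whose size equals the upper limit. The sketch below uses `concurrent.futures.ThreadPoolExecutor` as a stand-in for the activation management means; the function `process_pages`, the `time.sleep` placeholder for real page processing, and the peak counter are illustrative assumptions rather than the mechanism of the specification.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def process_pages(urls, max_coexistence):
    """Process urls with at most max_coexistence workers at once.

    Returns the processed urls and the peak number of workers observed.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def process(url):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)            # placeholder for access/acquire/register
        with lock:
            running -= 1
        return url

    # The pool size plays the role of the coexistence number upper limit.
    with ThreadPoolExecutor(max_workers=max_coexistence) as pool:
        done = list(pool.map(process, urls))
    return done, peak
```

Processing ten pages with a limit of three never observes more than three workers running together, however many links are waiting.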
[0021]
In a fourth configuration of the information collection system according to the present invention, in any one of the first to third configurations, the data processing means notifies the management means that processing has been completed when registration in the storage device by the information registration means is completed, and the management means, based on the notification, either activates a data processing means by the activation management means or terminates data collection by the termination management means.
[0022]
In the fourth configuration, after activating the data processing means, the management means only has to wait for a completion notification from any of the data processing means and does not need to check the state of the data processing means periodically. The load on hardware resources (CPU, etc.) is therefore reduced.
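The difference between waiting for a notification and polling can be shown with a blocking queue. In this sketch, which assumes threads stand in for data processing means, `queue.Queue.get` blocks the management side until a worker reports completion, so no periodic status check is needed; the function name `run_with_notifications` is hypothetical.

```python
import queue
import threading


def run_with_notifications(pages):
    """Start one worker per page and wait for completion notifications."""
    notifications = queue.Queue()

    def data_processor(page):
        # ... access the page, acquire data, register it (omitted) ...
        notifications.put(page)     # notify the management side of completion

    for page in pages:
        threading.Thread(target=data_processor, args=(page,)).start()

    finished = []
    while len(finished) < len(pages):
        # Blocks until some worker notifies; the manager never polls.
        finished.append(notifications.get())
    return finished
```

In a fuller model, the manager would start further workers or end collection each time a notification arrives.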
[0023]
According to a fifth configuration of the information collection system of the present invention, in any one of the first to fourth configurations, the data processing device has determination means for determining whether the content has been changed based on the header information acquired by the header information acquisition means, and the content acquisition means does not acquire the content of the accessed page when the determination means determines that there is no change.
[0024]
In the fifth configuration, the content of the accessed page is acquired only when it is determined that the content has been changed, so that the collection of various data from the plurality of pages constituting the site and the completion of that processing are sped up, and the load on hardware resources (CPU, network, etc.) is reduced.
[0025]
In a sixth configuration of the information collection system according to the present invention, in the fifth configuration, the storage device includes a storage area for storing the header information acquired by the header information acquisition means, and the determination means compares the header information acquired this time with the previously acquired header information stored in the storage device and determines that the content has been changed when they do not match.
[0026]
In the sixth configuration, when the two sets of header information match, it is determined that the content has not changed, and the processing by the content acquisition means and the link information acquisition means can be omitted, which reduces the load on hardware resources (CPU, network, etc.).
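The header comparison of the sixth configuration can be sketched as a field-by-field check against the previously stored header. The choice of `size` and `last_modified` as the compared fields is an assumption made for illustration, as is the function name `content_changed`.

```python
def content_changed(current_header, stored_header):
    """Return True when the page content should be fetched again.

    Assumption: headers are dicts and the compared fields are
    "size" and "last_modified".
    """
    if stored_header is None:
        return True                 # never collected before: treat as changed
    keys = ("size", "last_modified")
    # Any mismatch between this header and the stored one means a change.
    return any(current_header.get(k) != stored_header.get(k) for k in keys)
```

When this returns `False`, a data processing means could skip both content acquisition and link extraction for the page.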
[0027]
According to a seventh configuration of the information collection system of the present invention, in any one of the first to sixth configurations, the data processing device has site determination means for determining, based on the header information acquired by the header information acquisition means, whether the page is within the site, and the content acquisition means does not acquire the content of the accessed page when the site determination means determines that the page is not within the site.
[0028]
The seventh configuration reliably prevents processing from never ending because the information to be collected and processed by the data processing means chains on without limit. The collection and processing of various data from the plurality of pages constituting the site is therefore completed quickly, and the load on hardware resources (CPU, network, etc.) is reduced.
[0029]
In an eighth configuration of the information collection system according to the present invention, in any one of the first to seventh configurations, the content acquisition means includes a plurality of types of analysis programs for analyzing the content of a page; the link information acquisition means acquires link information including content type information indicating the type of content of the page; the activation management means outputs a data processing request including the link information when activating a data processing means; and, for the page accessed by the page access means of the activated data processing means, the content acquisition means analyzes the content using the analysis program corresponding to the content type information included in the data processing request.
[0030]
In the eighth configuration, it is possible to acquire various types of contents.
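Selecting an analysis program by content type amounts to a lookup table from type to analyzer. The sketch below assumes MIME-style type strings and two toy analyzers that extract links with regular expressions; the names `ANALYZERS`, `analyze_html`, `analyze_plain_text`, and the shape of the data processing request are hypothetical.

```python
import re


def analyze_html(content):
    """Toy HTML analyzer: extract href targets."""
    return re.findall(r'href="([^"]+)"', content)


def analyze_plain_text(content):
    """Toy plain-text analyzer: extract bare http(s) URLs."""
    return re.findall(r"https?://\S+", content)


# One analysis program per content type; further types can be added here.
ANALYZERS = {
    "text/html": analyze_html,
    "text/plain": analyze_plain_text,
}


def analyze(request, content):
    """Analyze content with the program matching the request's content type."""
    analyzer = ANALYZERS.get(request["content_type"])
    if analyzer is None:
        return []                   # unknown type: no links extracted
    return analyzer(content)
```

Supporting another format then means registering one more analyzer rather than changing the data processing flow.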
[0031]
In a ninth configuration of the information collection system according to the present invention, in any one of the first to eighth configurations, the page access means includes a plurality of types of programs for the communication protocols used to access pages; the link information acquisition means acquires, from the content acquired by the content acquisition means, link information including information on the access method for the linked page; the activation management means outputs a data processing request including the link information when activating a data processing means; and the page access means of the activated data processing means accesses the linked page using the program corresponding to the access method included in the data processing request.
[0032]
In the ninth configuration, since the page to be accessed can be accessed using various access methods, it is possible to access various sites.
[0033]
In a tenth configuration of the information collection system according to the present invention, in any one of the first to ninth configurations, the storage device stores, as specific site information, the cooperation system name of another system that wants information about the one site, and the data processing device has provided data generation means for generating, for each piece of information collected and processed by the data processing means, provided data to be provided to the other system, and provided data transmission means for transmitting the generated provided data to the other system based on the cooperation system name.
[0034]
In the tenth configuration, the provided data generated by the provided data generation means is transmitted to another system based on the cooperation system name, so that various information services can be provided to the other system.
[0035]
In an eleventh configuration of the information collection system according to the present invention, in the tenth configuration, the provided data generation means generates provided data in the same data format at each of the following points: when data collection for the site starts, when processing by the data processing means for one page in the site ends, and when data collection for the site ends.
[0036]
In the eleventh configuration, the provided data generated by the provided data generation means is transmitted to the other system in the same format at each of these points, so that the other system can unify its reception and analysis of the provided data, and the development efficiency of the other system can be improved.
[0037]
According to a twelfth configuration of the information collection system of the present invention, in the tenth or eleventh configuration, the provided data transmission means transmits the content acquired by the content acquisition means to another system based on the cooperation system name.
[0038]
In the twelfth configuration, the process of storing the content acquired by the content acquisition means in the storage device can be omitted, and the storage capacity of the storage device and the load can be reduced.
[0039]
The main configuration of the information collection method of the present invention is an information collection method using: a data processing device having a plurality of data processing means, each of which accesses one page in one site to collect and process various data on the plurality of contents constituting the site, and management means for managing the data processing means; and a storage device that stores at least specific site information, including information indicating the site that the data processing device accesses first, and information on each content in the site. The data processing means of the data processing device executes: page access processing for accessing a page based on the specific site information stored in advance in the storage device, and linked pages linked to that page; header information acquisition processing for acquiring header information of the accessed page; content acquisition processing for acquiring the content of the accessed page; link information acquisition processing for acquiring link information indicating the location of linked pages linked to the accessed page; and information registration processing for registering predetermined information based on each piece of acquired information in the storage device. The management means of the data processing device executes: activation processing for activating n (n is 1 or more) data processing means based on the information stored in the storage device and the link information acquired in the link information acquisition processing; and termination processing for terminating data collection for the site based on the link information acquired in the link information acquisition processing.
[0040]
In the information collection method having the above configuration, when a data processing means is activated by the activation processing based on the specific site information stored in the storage device, one page is accessed by the page access processing; the header information of the page is then acquired by the header information acquisition processing, the content of the page by the content acquisition processing, and the link information by the link information acquisition processing, in that order; and predetermined information based on the acquired information is registered in the storage device by the information registration processing. One or more data processing means are then activated by the activation processing based on the acquired link information, and data collection for the site is terminated by the termination processing.
[0041]
Therefore, according to the information collection method of the present invention, it is possible to collect various data in the site set in advance at high speed based on the link (or hierarchy) relation of each page.
[0042]
The information collection program of the present invention causes a computer to function as a data processing device in a system comprising: the data processing device, which has a plurality of data processing means, each of which accesses one page in one site to collect and process various data on the plurality of contents constituting the site, and management means for managing the data processing means; and a storage device that stores at least specific site information, including information indicating the site that the data processing device accesses first, and information on each content in the site. The program causes the data processing means of the data processing device to function as: page access means for accessing a page based on the specific site information, and linked pages linked to that page; header information acquisition means for acquiring header information of the accessed page; content acquisition means for acquiring the content of the accessed page; link information acquisition means for acquiring link information indicating the location of linked pages linked to the accessed page; and information registration means for registering predetermined information based on each piece of acquired information in the storage device. It also causes the management means of the data processing device to function as: activation management means for activating n (n is 1 or more) data processing means based on the information stored in the storage device and the link information acquired by the link information acquisition means; and termination management means for terminating data collection for the site based on the link information acquired by the link information acquisition means.
[0043]
According to the information collection program of the present invention, the computer functions as follows. When the activation management means activates a data processing means based on the specific site information stored in the storage device, the page access means accesses one page; the header information acquisition means then acquires the header information of the page, the content acquisition means acquires the content of the page, and the link information acquisition means acquires the link information; and the information registration means registers predetermined information based on the acquired information in the storage device. The activation management means activates one or more data processing means based on the acquired link information, so that the above processes are performed concurrently and repeatedly, and the termination management means terminates data collection for the site.
[0044]
Therefore, according to the information collection program of the present invention, the computer functions so as to rapidly collect various data in the site set in advance based on the link (or hierarchy) relation of each page.
[0045]
BEST MODE FOR CARRYING OUT THE INVENTION
Embodiments of the present invention will be described in detail with reference to the drawings.
[0046]
FIGS. 1 and 2 show schematic diagrams of an embodiment of an information collecting/providing system to which the present invention is applied. In the present embodiment, an example is described in which a software program is installed on one computer (for example, a personal computer) so that the computer functions as the information collecting/providing system.
[0047]
(Overview of the entire information collection and provision system)
The information collecting/providing system 1 according to the embodiment has a function of accessing an information group (site) that has a hierarchical (link) relationship and can be referred to via various communication networks such as the Internet (hereinafter simply referred to as networks), based on that hierarchical (link) relationship, and collecting various information including the content itself, and a function of providing the collected information to other computer systems. As shown in FIG. 1, it comprises a data processing device 10 for collecting and providing various data that can be referred to via the network 100, and a storage device 20 for storing information for managing the system itself and the information collected by the data processing device 10.
[0048]
Here, the information (data) to be collected by the data processing device 10 is all data that can be made public on the network 100, and includes various concepts such as pages, contents, files, etc. on the web. As will be described later, data in the local computer can be collected.
[0049]
The types of information (data) to be collected by the data processing device 10 include various data such as document data, moving image data, data dynamically created from a database or the like (for example, data related to product sales and online reservations), individual files stored in a directory-tree file system, and network devices belonging to a domain-participation network.
[0050]
The “document” of the document data refers to all documents that can be published on the network 100, including documents described in HTML, WWW published documents, Macromedia (registered trademark) Flash, and the like. Similarly, the “moving image” of the moving image data refers to all data distributed as a stream, such as by MMS (Microsoft (registered trademark) Media Server) or RTSP (Real Time Streaming Protocol), and naturally includes audio data as well.
[0051]
Similarly, “dynamically created” data refers to all dynamic data that can be published on the network 100, including pages that are created each time they are accessed by means of CGI or a computer language such as Perl.
[0052]
Similarly, “individual files of a directory tree file system” refers to all files recorded on a hard disk device or the like, including Microsoft (registered trademark) Word documents and Excel documents.
[0053]
Similarly, “network devices belonging to a domain-joined network” also refers to all network devices that can participate in a domain, including personal computers and server computers.
[0054]
The “information group (site)” to be accessed by the data processing device 10 takes various forms, for example, all of the websites of one company, all of the intranet sites of one department, or all of the files stored on one file server.
[0055]
In addition, an information group (site) that “can be referred to” is ideally one in which all public files can be copied (downloaded); however, even if not all of them can be copied, it is sufficient that file information (header information such as size and date and time, described later) can be acquired.
[0056]
Further, the “hierarchical (link) relationship” of the information group includes not only the hierarchical relationship within the site but also link relationships to other sites, and refers to any relationship in which one item of data is associated with another item of data, as in a directory hierarchy or an HTML URL.
[0057]
The "directory tree type file system" refers to all file systems that have so-called directories (or folders) and a hierarchical structure, such as FAT and NTFS used in Microsoft (registered trademark) operating systems such as MS-DOS (registered trademark) and Windows (registered trademark), and UFS and SSFS used in UNIX (registered trademark).
[0058]
In addition, the “domain participation type network” refers to all networks in which a plurality of devices connected to the network 100 are managed as a unit (group) and in which the management groups have subgroups, such as an Internet domain, a workgroup or Active Directory of Microsoft (registered trademark), and eDirectory of Novell (registered trademark).
[0059]
The “header information” mainly indicates the various data transmitted prior to transmission of the main body in HTTP. The same applies to protocols other than HTTP, where it refers to information other than the main body (the data content itself), such as name, type, size, and update date and time.
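For the HTTP case, the idea of header information can be illustrated with a minimal Python sketch that parses a header block and picks out the type, size, and update date and time. The sample header text and the function name `summarize_header` are assumptions made for illustration and do not appear in the described system.

```python
from email.parser import Parser
from email.utils import parsedate_to_datetime


def summarize_header(raw_header: str) -> dict:
    """Extract type, size, and update date/time from an HTTP-style header block."""
    msg = Parser().parsestr(raw_header, headersonly=True)
    size = msg.get("Content-Length")
    modified = msg.get("Last-Modified")
    return {
        "type": msg.get_content_type(),
        "size": int(size) if size is not None else None,
        "updated": parsedate_to_datetime(modified) if modified else None,
    }


# Hypothetical header block; no network access is performed.
SAMPLE = (
    "Content-Type: text/html\n"
    "Content-Length: 5120\n"
    "Last-Modified: Mon, 21 Apr 2003 09:00:00 GMT\n"
    "\n"
)
info = summarize_header(SAMPLE)
```

Because only the header block is read, the body of the page never needs to be transferred to obtain these values.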
[0060]
The term “content” mainly refers to the content of a document, such as text or an image, published on a Web server. However, it also refers to the content of a document published on the network 100 by something other than a Web server, or of a document in a local computer. When the target is a network device itself or the like, it indicates information about the network device itself (operation information or the like).
[0061]
The information collection / providing system 1 is accessed by another system connected locally or via the network 100, and supplies the data stored in the reference information storage unit 22 and the data information storage unit 23 of the storage device 20 to that other system in the form of a content ball 30 described later.
[0062]
Here, other systems that access the information collection / providing system 1 include, for example, various search engine systems that provide search information to users of the network 100, computer systems for acquiring newly arrived information on Web pages, files stored in shared folders, and the like, and computer systems for acquiring broken link information of Web pages (hereinafter, these other systems are referred to as "cooperation systems").
[0063]
In general, when the cooperation system is a search engine system, the cooperation system refers to the downloaded content (a temporary file described later) and to the data information storage unit 23 or the content ball 30 described later, and obtains information such as which URL holds a page containing which words.
[0064]
When the cooperation system is a computer system for acquiring new arrival information, the cooperation system refers to the data information storage unit 23 or the content ball 30 described later, and obtains the location (for example, URL) of data updated after a certain date and time.
[0065]
Further, when the cooperation system is a computer system for acquiring broken link information of Web pages, the cooperation system refers to the reference information storage unit 22 or the content ball 30, and obtains information such as which Web page is linked to which Web page and which Web pages do not exist.
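The new arrival and broken link lookups described above can be sketched in Python as simple filters over in-memory rows. The row layout, the `found` flag marking a page that does not exist, and the function names are all assumptions for illustration; the text does not specify how a cooperation system actually queries the stored data.

```python
from datetime import datetime

# Hypothetical rows modeled on the data information storage unit 23.
DATA_INFO = [
    {"serial": 1, "location": "http://abcd.co.jp/001.html",
     "updated": datetime(2003, 4, 1), "found": True},
    {"serial": 2, "location": "http://abcd.co.jp/002.html",
     "updated": datetime(2003, 4, 20), "found": True},
    {"serial": 3, "location": "http://abcd.co.jp/003.html",
     "updated": None, "found": False},  # assumed marker for a missing page
]
# Hypothetical rows modeled on the reference information storage unit 22.
REFERENCE_INFO = [
    {"link_source": 1, "link_destination": 2},
    {"link_source": 2, "link_destination": 3},
]


def new_arrivals(since):
    """Locations of data updated after the given date and time."""
    return [r["location"] for r in DATA_INFO
            if r["updated"] is not None and r["updated"] > since]


def broken_links():
    """(link source location, missing link destination location) pairs."""
    by_serial = {r["serial"]: r for r in DATA_INFO}
    return [
        (by_serial[ref["link_source"]]["location"],
         by_serial[ref["link_destination"]]["location"])
        for ref in REFERENCE_INFO
        if not by_serial[ref["link_destination"]]["found"]
    ]
```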
[0066]
The details of the data provision process performed by the information collection / provision system 1 for these other cooperative systems will be described later with reference to FIGS.
[0067]
As shown in FIG. 1, the data processing device 10 includes six units: a system management unit 11, a site management unit 12, a reference information processing unit 13, a data processing unit 14, a reference information processing management unit 15, and a data processing management unit 16. In the present embodiment, these units 11 to 16 are run by one CPU, which performs so-called multitasking processing using common hardware resources.
[0068]
Although not shown, the information collection / providing system 1 includes a keyboard and a mouse as means for performing data input and various settings, a modem or the like as transmitting and receiving means for connecting to the network 100 and transmitting and receiving data, and a display device (CRT, LCD, or the like) as display means.
[0069]
In the present embodiment, a case will be described in which one computer functions as the information collection / providing system 1. However, the present invention is not particularly limited with regard to the arrangement of hardware resources and the like. It is also possible to configure the information collection / providing system 1 using a plurality of computers, for example by placing the storage device 20 on a separate computer.
[0070]
(Outline of the data processing device 10)
The data processing device 10 accesses a server that provides information on the set site, collects documents published on the network 100, and stores the documents in the data information storage unit 23.
[0071]
In the information collection / providing system 1, a plurality of sites (specific sites) to be accessed by the data processing device 10 can be set by a user such as a system administrator performing an input operation using a keyboard or the like. As shown in FIG. 2, information collection of a plurality of specific sites is performed by performing parallel processing of a plurality of reference information processing units 13, a plurality of data processing management units 16, and a plurality of data processing units 14.
[0072]
Further, even when a plurality of servers, one for each domain, provide the information of one specific site, the data processing device 10 transparently accesses the plurality of servers and stores the information in the data information storage unit 23 of the storage device 20 as information of the same site. This processing will be described later.
[0073]
Note that “transparently accessing” means accessing a plurality of servers in such a way that, as viewed from the user, they all appear to provide information by the same protocol.
[0074]
As shown in FIGS. 1 and 2, the data processing device 10 includes a system management unit 11 that controls the entire information collection / providing system 1, a site management unit 12 that manages information collection processing for a plurality of specific sites, a reference information processing unit 13 and a data processing management unit 16 that perform processing for collecting information of one specific site, a data processing unit 14 that performs processing for collecting information of one page in one specific site, and a reference information processing management unit 15 that manages each reference information processing unit 13.
[0075]
In the data processing device 10, as shown in FIG. 2, a plurality of reference information processing units 13, a plurality of data processing management units 16, and a plurality of data processing units 14 can coexist. Specifically, the reference information processing units 13 and the data processing management units 16 each exist in the same number as the specific sites being accessed simultaneously, and the data processing units 14 exist in the same number as the contents being accessed simultaneously within the specific sites. In the present embodiment, a maximum of 254 reference information processing units 13 and 254 data processing management units 16 (that is, 254 specific sites can be accessed simultaneously) and a maximum of 254 × 10 data processing units 14 perform processing in parallel with one another.
[0076]
As a result, the data processing apparatus 10 can simultaneously access a maximum of 254 specific sites, and can simultaneously access a maximum of 10 pages within each specific site.
[0077]
Note that the maximum number of simultaneous accesses in the reference information processing unit 13, the data processing management unit 16, and the data processing unit 14 is not particularly limited and can be set as appropriate depending on the processing speed of the CPU, the capacity of the main memory, and the speed of the network 100, but is generally up to about 10 each. For example, when the CPU is an Intel Pentium (registered trademark) 3 at 1.4 GHz, the main memory capacity is 256 MB, and the line speed is 1 Mbps, it is preferable to set the maximum simultaneous access upper limit number of the reference information processing unit 13 (that is, the upper limit number of specific sites accessed simultaneously) to about 5, and the maximum simultaneous access upper limit number of the data processing unit 14 (that is, the upper limit number of pages accessed simultaneously within one specific site) to about 10.
[0078]
Each of the units 11 to 16 in the data processing device 10 can be realized by one CPU, but each operates based on an independent software program (that is, a system management program, a site management program, a reference information processing program, a data processing program, a reference information processing management program, and a data processing management program).
[0079]
Here, as described above, since the reference information processing units 13 and the data processing management units 16 each exist in the same number as the specific sites accessed simultaneously, the reference information processing programs and the data processing management programs also exist in that number (a maximum of 254 in this embodiment).
[0080]
Further, as described above, since the data processing units 14 exist in the number of pages to be accessed at the same time, the data processing programs also exist in the number of pages to be accessed at the same time (a maximum of 254 × 10 in this embodiment).
[0081]
In addition, each of these programs of the units 11 to 16 has a function of notifying (calling) the programs with which it cooperates. That is, a program that has no processing to do, or a program that is waiting for the processing of another cooperating program, enters a standby state in which it does not use the CPU, and the standby state is released by a notification from another cooperating program.
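The standby and notification behavior described above resembles a blocking wait on an event. The following Python sketch shows that pattern with the standard `threading.Event`; using threads and the names `waiting_program` and `wake` are assumptions for illustration, since the text describes independent programs rather than threads.

```python
import threading

results = []
wake = threading.Event()


def waiting_program():
    # Blocks without busy-waiting until a cooperating program notifies it.
    wake.wait()
    results.append("resumed")


worker = threading.Thread(target=waiting_program)
worker.start()
results.append("notifying")
wake.set()  # notification (call) from the cooperating program
worker.join()
```

While `wake.wait()` is blocked, the waiting thread consumes essentially no CPU time, which is the property the paragraph relies on.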
[0082]
(Outline of Storage Device 20)
On the other hand, the storage device 20 includes a site information storage unit 21 that stores management information for managing the entire information collection / providing system 1, a reference information storage unit 22 that stores link relation information within a specific site, and a data information storage unit 23 that stores information on the data within a specific site. Each of the storage units 21 to 23 of the storage device 20 can be realized by, for example, one hard disk device having a sufficient data storage capacity.
[0083]
Here, as shown in FIG. 1, the site information storage unit 21 mainly stores information to be referred to by the site management unit 12, and the stored information mainly includes specific site information, which is various information relating to a specific site to be accessed. As shown in the data table of FIG. 9, the specific site information includes, for example, site information indicating the location (URL or the like) of the specific site, collection start information indicating the date and time when collection starts, the cooperation system name of the cooperation system to which the information is provided (that is, which wants the information of the specific site), the maximum simultaneous access upper limit number for the specific site, and information on the access method used when accessing the specific site.
[0084]
As shown in FIG. 9, in the site information storage unit 21, a serial number is assigned to each specific site to be accessed, and each serial number is associated with the “site” columns of the reference information storage unit 22 and the data information storage unit 23 described later. Although FIG. 9 shows an example in which specific site information about two specific sites is registered in the site information storage unit 21, many more specific sites can actually be registered.
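The specific site information listed above can be pictured as one record per registered site. The following Python sketch models such a record as a dataclass; the field names, the free-text schedule format, and the sample values (including the cooperation system name) are assumptions for illustration and are not taken from FIG. 9.

```python
from dataclasses import dataclass, field


@dataclass
class SpecificSiteInfo:
    serial_number: int                 # assigned automatically at registration
    site_information: str              # location (URL or the like) of the specific site
    collection_start: str              # schedule text; the format is an assumption
    cooperation_systems: list = field(default_factory=list)
    max_simultaneous_access: int = 10  # upper limit of data processing units per site
    access_method: dict = field(default_factory=dict)  # proxy / authentication settings


site = SpecificSiteInfo(
    serial_number=1,
    site_information="http://abcd.co.jp/001.html",
    collection_start="every Sunday at 23:30",
    cooperation_systems=["search-engine"],  # hypothetical cooperation system name
    max_simultaneous_access=2,
)
```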
[0085]
Although not shown, the site information storage unit 21 stores management information for managing the entire system in addition to the specific site information described above. The management information includes, for example, information about the maximum simultaneous activation upper limit number (the maximum number of specific sites to be accessed simultaneously) of the reference information processing unit 13 described later.
[0086]
This information is recorded in the site information storage unit 21 by the system management unit 11 when the system management program of the data processing device 10 is started based on the user's operation of the operation input unit, the data table of FIG. 9 is displayed on a display unit (not shown), and the user performs various operations such as site registration input and setting.
[0087]
Here, the “site information” to be input is information indicating the site (location) to be accessed first by the data processing device 10. If the specific site is on the Web, a URL (usually that of the top page) is used. When the specific site and the system 1 are placed in the same computer to acquire data, a directory name (for example, c:\data\001.doc or /data/001.txt) is used. When the specific site is a directory tree type shared file system, an IP address and a directory name (for example, 192.168.0.1/shared folder/001.txt) are used, and when the specific site is a domain participation type network, a notation indicating the domain is used. Hereinafter, for convenience of explanation, a case where the specific site is on the Web and the top page URL is used as the site information will be described.
[0088]
The “serial number” is automatically added at the time of individual site registration.
[0089]
One or more “cooperation system names” can be registered for one specific site. When a plurality of cooperation system names are registered for one specific site, the information collected by the data processing device 10 of the information collection / providing system 1 is transmitted to each registered cooperation system. The details will be described later with reference to FIG.
[0090]
As the “maximum simultaneous access upper limit number”, a numerical value defining the maximum simultaneous coexistence number of the data processing unit 14 in one specific site is input.
[0091]
The “collection start date and time” can be set arbitrarily, for example “at 00:00 on the first day of every month”, “every Sunday at 23:30”, or “every day at 6:00”, depending on circumstances such as the update frequency of the specific site or the time zone during which the cooperation system can be operated.
[0092]
As information on the “access method”, proxy information, authentication information such as simple authentication (Basic authentication), form authentication (CGI authentication), and the like are input.
[0093]
Here, the “proxy” refers to a server or software installed to ensure security and achieve high-speed access when connecting from an internal network (in this case, the information collection / providing system 1) to an external network (in this case, a specific site).
[0094]
“Simple authentication (Basic authentication)” is a method in which access to specific data is restricted by a user name (ID) and a password, and is mainly used to restrict access to directories and files on the Web. Here, however, it refers to all methods in which access is restricted by a user name (ID) and a password.
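For the HTTP case, Basic authentication sends the user name and password as a Base64-encoded `Authorization` request header. The following Python sketch builds that header; the function name and the sample credentials are placeholders, and how the described system stores or applies its authentication information is not specified in the text.

```python
import base64


def basic_auth_header(user: str, password: str) -> dict:
    """Build the HTTP Authorization header used by Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration only.
header = basic_auth_header("user", "pass")
```

Base64 is an encoding rather than encryption, so this header protects nothing by itself unless the connection is otherwise secured.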
[0095]
Further, “form authentication (CGI authentication)” mainly refers to a method of redirecting an unauthenticated request to an HTML form using an HTTP client-side redirect. Here, however, it refers to all methods that allow access to access-restricted data by accessing the location (URL or the like) with parameters attached.
[0096]
The various types of information registered in the site information storage unit 21 are referred to as appropriate by each unit of the data processing device 10 when the data processing device 10 collects information on a specific site. The details will be described later.
[0097]
As shown in FIG. 1, the reference information storage unit 22 mainly stores information to be referred to by the reference information processing unit 13. The stored reference information includes, for example, as in the data table shown in FIG., “site” information, “link information”, “parameter” information, “processing state” information, and the like, and these items of information are recorded and updated by the data processing unit 14 during the information collection processing.
[0098]
Here, the “site” information (numerical value) is for association with the above-described specific site information of the site information storage unit 21 and with the “site” information of the data information storage unit 23 described later. As the information of the reference information storage unit 22, only the information associated with serial number 1 of the site information storage unit 21 (that is, the specific site “http://abcd.co.jp/001.html”) is extracted and shown.
[0099]
The “link information” is information indicating the content (data) to be collected next, where that content (data) is described, how it is described, and the like. In the present embodiment, as shown in FIG. 9, the link information includes information on the “link source”, “link destination”, “number of lines”, and “tag name” of the specific site.
[0100]
Here, the link source is a location on the network 100 of the page in which the link information is described, and the numerical value of the “serial number” in the data information storage unit 23 of the corresponding site is stored.
[0101]
The link destination refers to the location on the network 100 of the content (data) or page linked from the link source page by the link information, and the numerical value of the “serial number” in the data information storage unit 23 of the corresponding site is stored.
[0102]
The number of lines indicates on which line of the link source content (data) the link information is described, and the line number is stored as a numerical value.
[0103]
The tag name indicates how the link information is linked. In the case of HTML, the HTML tag name (such as the HREF of an A tag or the SRC of an IMG tag) is stored.
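The link destination, number of lines, and tag name described above can be gathered in one pass over an HTML page. The following Python sketch does this with the standard `html.parser` module; the set of tag and attribute pairs treated as links, the dictionary keys, and the sample page are assumptions for illustration, not the analysis program of the described system.

```python
from html.parser import HTMLParser

# Tag -> attribute pairs treated as links (an assumption; extend as needed).
LINK_ATTRS = {"a": "href", "img": "src"}


class LinkCollector(HTMLParser):
    """Collects link destination, line number, and tag name for each link."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                line, _column = self.getpos()  # line on which the tag starts
                self.links.append(
                    {"destination": value, "line": line, "tag": f"{tag}.{name}"}
                )


PAGE = """<html>
<body>
<a href="002.html">next</a>
<img src="logo.gif">
</body>
</html>"""

collector = LinkCollector()
collector.feed(PAGE)
```

In the described system the link source and link destination are stored as serial numbers rather than as URLs, so a real implementation would map each destination to its serial number before registering it.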
[0104]
Further, the “parameter” information is a parameter used when the data processing device 10 accesses a page in the specific site (the content (data) of the page indicated by the link destination of the link information). For example, a named anchor (URL fragment identifier) is stored or, in the case of a dynamic program such as CGI, the arguments of the program are stored.
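Separating a URL into its location and its parameters can be sketched with the standard `urllib.parse` module as follows. The function name `split_parameters` and the sample URLs are assumptions for illustration; the text does not state how the parameter column is actually encoded.

```python
from urllib.parse import urlsplit


def split_parameters(url: str):
    """Return (location without parameters, parameters found in the URL)."""
    parts = urlsplit(url)
    location = parts._replace(query="", fragment="").geturl()
    return location, {"anchor": parts.fragment, "arguments": parts.query}


# Hypothetical URLs: one with a named anchor, one with CGI arguments.
anchor_case = split_parameters("http://abcd.co.jp/002.html#section2")
cgi_case = split_parameters("http://abcd.co.jp/search.cgi?item=42")
```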
[0105]
The “processing state” information indicates whether or not the data processing device 10 has completed all the information collection processing for one specific site. If all of the information collection processing has been completed, this is indicated by, for example, a “processed” flag; if it has not yet been completed, by, for example, an “unprocessed” flag.
[0106]
As shown in FIG. 9, the reference information storage unit 22 has a storage area divided into a storage area for new reference information and a storage area for previous reference information. The storage area for new reference information is an area for writing processing results during data acquisition processing by the data processing device 10, and the storage area for previous reference information is an area for storing the processing results of the previous data acquisition.
[0107]
The data information storage unit 23 mainly stores information that the data processing unit 14 writes and compares with respect to the results of acquiring information of a specific site. The stored information includes, for example, as in the data table shown in FIG., information on "site", "serial number", "location on network", "number of layers", "type", "size", "update date", "collection status", and the like.
[0108]
Here, the “site” information is for association with the specific site information of the site information storage unit 21 and with the site information of the reference information storage unit 22. Similarly, in FIG., only the information associated with serial number 1 of the site information storage unit 21 (that is, the specific site “http://abcd.co.jp/001.html”) is extracted and shown as the information of the data information storage unit 23.
[0109]
The “serial number” information (numerical value) in the data information storage unit 23 is a serial number assigned to a page existing on the specific site (usually the top page) and to each of the link pages linked from that page. In this example, as shown in FIG. 9, it is associated with the link source and link destination information in the reference information storage unit 22. In the data information storage unit 23, the information on the page of the specific site registered in the site information storage unit 21 is given “serial number” 1, and the link pages linked from that page are given the serial numbers 2, 3, 4, and so on in order.
[0110]
The “location on the network” stores information equivalent to the “site information” of the above-described site information storage unit 21. In this embodiment, information indicating the location of the specific site and of the link pages directly or indirectly linked to the specific site is stored. In the example illustrated in FIG. 9, the locations (URLs or the like) on the network 100 of the top page existing on the specific site and of the pages linked from the top page are shown.
[0111]
Here, “directly or indirectly linked to a specific site” is intended to include not only link pages directly linked to the specific site (top page or the like) registered in the site information storage unit 21, but also various link pages that are not directly linked to the specific site and can nevertheless eventually be reached by starting from access to the specific site. However, in the information collection / providing system 1, the link pages to be accessed are restricted to a predetermined range in order to prevent the information collection processing from never terminating because the sites to be accessed chain on indefinitely. This configuration will be described later with reference to FIG.
[0112]
The “number of layers”, “type”, “size”, and “update date and time” store information indicating the number of layers, the type, the size, and the update date and time, respectively, of the data (each visited page) collected by the data processing device 10.
[0113]
Here, the information on the number of layers indicates at which level the page is located, counted from the collection start data (the top page “http://abcd.co.jp/001.html” in FIG. 9).
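One way to assign serial numbers and count layers from the collection start page is a breadth-first walk over the link relationships, sketched below in Python. The link graph is hypothetical, and both the breadth-first order and the choice to count the top page as layer 1 are assumptions; the text only states that the top page has serial number 1 and that linked pages are numbered in order.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "http://abcd.co.jp/001.html": ["http://abcd.co.jp/002.html",
                                   "http://abcd.co.jp/003.html"],
    "http://abcd.co.jp/002.html": ["http://abcd.co.jp/003.html",
                                   "http://abcd.co.jp/004.html"],
}


def number_pages(start: str) -> dict:
    """Assign a serial number and a layer count to every reachable page."""
    rows = {start: {"serial": 1, "layers": 1}}  # top page: layer 1 (assumption)
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for dest in LINKS.get(page, []):
            if dest not in rows:  # each page is numbered only once
                rows[dest] = {"serial": len(rows) + 1,
                              "layers": rows[page]["layers"] + 1}
                queue.append(dest)
    return rows


rows = number_pages("http://abcd.co.jp/001.html")
```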
[0114]
The “collection state” information indicates whether or not the data processing device 10 has completed the information collection processing for each page existing in the one specific site. If the processing has been completed, this is indicated by, for example, an “OK” flag; if it has not yet been completed, by, for example, a “not yet completed” flag. In this embodiment, when the collection states of all pages of one site are “OK”, the processing state information for that site in the reference information storage unit 22 changes from “unprocessed” to “processed”.
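The rule that a site becomes “processed” only when every page is “OK” can be written as a short check, sketched below in Python. The function name is hypothetical, and treating a site with no pages yet as “unprocessed” is an assumption not stated in the text.

```python
def site_processing_state(collection_states: list) -> str:
    """Derive the site-level processing state from per-page collection states."""
    if collection_states and all(state == "OK" for state in collection_states):
        return "processed"
    return "unprocessed"


done = site_processing_state(["OK", "OK", "OK"])
pending = site_processing_state(["OK", "not yet completed"])
```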
[0115]
These pieces of information in the data information storage unit 23 are recorded and updated by the data processing unit 14 during the information collection process.
[0116]
As shown in FIG. 9, the data information storage unit 23 has a storage area divided into a storage area for new data information and a storage area for previous data information. The storage area for new data information is an area for writing processing results during data acquisition processing by the data processing device 10, and the storage area for previous data information is an area for storing the processing results of the previous data acquisition.
[0117]
Further, in the present embodiment, the body of the data on each page, that is, the content itself, is not stored in the storage device 20 but is provided to the cooperation systems as a temporary file. This will be described later.
[0118]
(Overview of Functions of Each Unit in Data Processing Device 10)
Next, the functions of the units 11 to 16 of the data processing device 10 will be described.
[0119]
The system management unit 11 performs the following functions by operating based on a system management program for managing the entire information collection / providing system 1. That is, the system management unit 11 is responsible for managing each item of information on the specific sites to be accessed (site information (URL or the like), collection start date and time, cooperation system name, maximum simultaneous access upper limit number, access method to the specific site, maximum simultaneous activation upper limit number of the reference information processing unit 13, and the like), for start / end processing based on the user's input operation, and for managing the start and stop of information collection for the specific sites.
[0120]
Note that the system management unit 11 operates mainly based on the user's input operations. Once activated, the system management unit 11 remains in a standby state until the user performs an input operation. When the user performs an input operation, the system management unit 11 displays input items and the like on the display screen of a display unit (not shown) and, based on the input information, registers, adds, changes, or deletes data in the site information storage unit 21.
[0121]
The site management unit 12 operates based on a site management program for managing the plurality of sites to be collected. It refers to the site information (such as a URL) of each specific site in the site information storage unit 21 and compares the current date and time with the collection start date and time in the site information storage unit 21. When the collection start date and time arrives, it starts the reference information processing unit 13 via the reference information processing management unit 15 and notifies each reference information processing unit 13 of the site information (URL or the like) of the specific site for which information collection is to start.
[0122]
When a plurality of the specific sites registered in the site information storage unit 21 are set to the same collection start date and time, the site management unit 12 starts a plurality of reference information processing units 13 when that collection start date and time arrives.
[0123]
The reference information processing unit 13 performs the following functions by operating based on a reference information processing program for managing all data on one specific site to be collected. That is, the reference information processing unit 13 has an initialization function for initializing the data in the reference information storage unit 22 and the data information storage unit 23 relating to one specific site, a matching function for matching the respective data, a function of extracting from the reference information storage unit 22 the link information on the data to be collected next and notifying the data processing management unit 16 of it as a data processing request, and a function of determining whether or not all data in one specific site has been collected.
[0124]
The data processing unit 14 operates based on a data processing program for acquiring and analyzing the various data (header information, contents, and the like) in one page to be collected. It thereby has a function of acquiring and analyzing the data in the one page, and a function of registering various information on the data in the one page in the reference information storage unit 22 and the data information storage unit 23 of the storage device 20.
[0125]
In the present embodiment, with regard to the function of the data processing unit 14 for obtaining data in a specific site, when the access target is a Web page, one data processing unit 14 obtains only one Web page. For example, to acquire the data of ten Web pages, ten data processing units 14 operate. Likewise, when the access target is a file, one data processing unit 14 obtains only one file.
[0126]
As described above, a plurality (n) of the data processing units 14 coexist for one specific site. Each of the individual data processing units 14 (14a, 14b, ...) is equipped with a plurality of types of protocols (communication procedures) and a plurality of data format analysis programs, thereby enabling the acquisition and analysis of various data.
[0127]
That is, as shown in FIG., the functions of the data processing unit 14 are roughly divided into a data acquisition unit 141, a data analysis unit 142, and a data registration unit 143.
[0128]
The reference information processing management unit 15 performs the following functions by operating based on a reference information processing management program for managing the plurality of reference information processing units 13. That is, the reference information processing management unit 15 has a function of receiving a site processing request from the site management unit 12, a function of activating one or a plurality of reference information processing units 13 based on the received site processing request, and a function of determining, after the activation of the reference information processing units 13, whether or not information is being collected for each specific site (see FIG. 2).
[0129]
The data processing management unit 16 performs the following functions by operating based on a data processing management program for managing the plurality of data processing units 14. That is, the data processing management unit 16 has a function of receiving a data processing request from the reference information processing unit 13 and a maximum simultaneous access registered in the site information storage unit 21 of the storage device 20 based on the received data processing request. It has a function of activating the data processing unit 14 within the upper limit number (see FIG. 9).
[0130]
Next, with reference mainly to FIG. 3, an operation of the data processing device 10 relating to a notification (calling) function in the data processing management unit 16 and the plurality of data processing units 14 (14a, 14b) will be described. Here, FIG. 3 shows the relationship between the data processing management unit 16 and each data processing unit 14 in comparison with a conventional system.
[0131]
In the following, it is assumed that a specific site is accessed for which the maximum simultaneous access upper limit number in the site information storage unit 21 is set to “2”, that is, for which the maximum number of data processing units 14 that can access the specific site at the same time is set to “2”.
[0132]
First, the data processing management unit 16 outputs start commands to start two data processing units (the data processing unit 14a and the data processing unit 14b shown in FIG. 3) in accordance with the set maximum value of 2, and transmits data processing requests to the started data processing units 14a and 14b so that each accesses the data (contents, files, or the like) of a different page that has not yet been collected.
[0133]
Specifically, in the information collection / providing system 1, a start command and a data processing request are first issued to access the top page URL of the site information in the site information storage unit 21 (for example, “http://abcd.co.jp/001.html” in FIG. 9), and the data processing unit 14a is started. When the started data processing unit 14a obtains information on a linked page set in the top page (for example, “http://abcd.co.jp/002.html”), a start command and a data processing request are issued to access the URL of that linked page, and the data processing unit 14b is started. The data processing unit 14b likewise obtains information on a further linked page set in that linked page (for example, “http://abcd.co.jp/003.html”). However, since the maximum number of data processing units 14 that can access the site is set to “2”, the next data processing unit 14c is started only after the data acquisition processing by the data processing unit 14a or 14b has ended and that data processing unit has terminated.
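The effect of the upper limit of “2” can be sketched in Python with a bounded semaphore that lets at most two workers run at once. Note the differences from the text: here every worker thread is created up front and waits for a free slot, whereas the described data processing management unit 16 starts the next data processing unit only after one has finished, and the `time.sleep` call merely stands in for fetching a page. All names and URLs other than the limit itself are assumptions.

```python
import threading
import time

MAX_SIMULTANEOUS_ACCESS = 2  # maximum simultaneous access upper limit number
slots = threading.BoundedSemaphore(MAX_SIMULTANEOUS_ACCESS)
lock = threading.Lock()
active = 0
peak = 0


def data_processing_unit(url: str) -> None:
    """Stand-in for one data processing unit handling one page."""
    global active, peak
    with slots:  # at most two units hold a slot at any time
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # placeholder for acquiring and analyzing the page
        with lock:
            active -= 1


urls = [f"http://abcd.co.jp/{n:03d}.html" for n in range(1, 6)]
threads = [threading.Thread(target=data_processing_unit, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After all threads finish, `peak` records the largest number of units that were ever active together, which never exceeds the configured limit.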
[0134]
In the data processing apparatus 10, before the activated data processing units 14a and 14b access the data of each page, the reference information processing unit 13 updates the processing status information in the reference information storage unit 22 so that the processing status of the data to be handled by the data processing units 14a and 14b is set to “processed” in advance. The data processing management unit 16 then individually notifies the data processing units 14a and 14b of the location on the network (such as a URL) of the page to be accessed, the access method, and the like.
[0135]
Here, the processing status information in the reference information storage unit 22 is set to “processed” in advance, while the data processing unit 14 is still processing, in order to prevent the reference information processing unit 13 from issuing a data processing request for the same page again. In other words, the processing state information in the reference information storage unit 22 functions as a flag indicating whether the reference information processing unit 13 has already made a data processing request to the data processing unit 14.
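The role of the processing state as a request flag can be illustrated by the following minimal Python sketch. The dictionary `status_by_url` and the function name `claim_unprocessed` are hypothetical stand-ins for the reference information storage unit 22 and are not taken from the embodiment; the sketch only shows why marking a page “processed” before it is accessed prevents a second request for the same page.

```python
def claim_unprocessed(status_by_url):
    """Return URLs still marked 'unprocessed' and mark them 'processed'.

    The flag is flipped before any page is fetched, so a second call made
    while the fetch is still running does not return the same URL again.
    """
    claimed = []
    for url, status in status_by_url.items():
        if status == "unprocessed":
            status_by_url[url] = "processed"  # set in advance of the access
            claimed.append(url)
    return claimed
```

Calling the function twice in succession returns each pending URL only once, which is the behavior the flag is described as providing.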
[0136]
Thus, in the data processing apparatus 10, the activated data processing units 14a and 14b each access data of different pages in the same site (see FIG. 2) and immediately start collecting information. Once the data processing management unit 16 has confirmed that each of the data processing units 14a and 14b has started its information collection processing, it enters a standby state. While the data processing management unit 16 is waiting, the hardware resources of the CPU can be devoted to the data processing units 14a and 14b, so that their processing speed is raised as far as possible and information published on the network 100 can be collected at high speed.
[0137]
Subsequently, each of the data processing units 14a and 14b stores various information other than the content itself (for example, the size of the content, the update date and time of the content, etc.) in the data information storage unit 23 of the storage device 20 for the accessed page. Then, the reference information (link information indicating which page is linked to which page or content, parameters for accessing the page, etc.) is stored in the reference information storage unit 22 of the storage device 20. The data processing units 14a and 14b provide the content itself to another cooperation system (for example, a search engine system) in the form of a temporary file, and do not store the content in the storage device 20. In addition, if another cooperative system ignores the temporary file, the provided temporary file is automatically deleted. By performing such processing, enlargement of the data stored in the storage device 20 is prevented, and the storage capacity of the storage device 20 is saved.
[0138]
Then, when various processes described below for the accessed page are completed, each of the data processing units 14a and 14b transmits a processing completion notification to the data processing management unit 16.
[0139]
The example of FIG. 3 shows the case where the page accessed by the data processing unit 14a contains less information, so that the data processing unit 14a finishes the information collection process first and transmits a process completion notification to the data processing management unit 16.
[0140]
Here, on receiving the processing completion notification, the data processing management unit 16 leaves the standby state, as shown in FIG. 3, and outputs an activation command and a data processing request so as to activate the data processing unit 14c and have it collect information on an unprocessed page in the same site.
[0141]
As a result, as shown in FIG. 3, the data processing unit 3 (14c) is activated, and the data processing unit 2 (14b), which is still processing, and the newly activated data processing unit 3 (14c) process different pages of the same site in parallel. When the data processing management unit 16 confirms that the data processing units 14b and 14c are processing in parallel, it enters the standby state again.
[0142]
Then, in the information collection/providing system 1, whenever any of the data processing units completes its processing, a processing completion notification is transmitted in the same manner from that data processing unit to the data processing management unit 16, a new data processing unit is activated based on a command from the data processing management unit 16, and information on another unprocessed page is collected. By repeating this operation, the time during which only one data processing unit is running is shortened as far as possible, and the processing capacity of the entire system is greatly improved.
[0143]
That is, in the conventional robot-type search engine system, as shown in the operation of the conventional system on the left side of FIG. 3, the robot management unit (corresponding to the data processing management unit 16) periodically detects and judges the operating state of each robot 1, 2, 3, ... (corresponding to the data processing units 14a, 14b, 14c, ...). Consequently, if a robot finishes its processing immediately after a detection, idle time arises until the next detection, and starting a new robot takes time because the judgment processing must be performed after the detection. On the other hand, if the detection period (cycle) is shortened, the processing load on the entire system becomes heavy.
[0144]
On the other hand, in the information collection/providing system 1 of the embodiment, the data processing management unit 16 does not detect and judge the operating state of each data processing unit 14 (14a, 14b, 14c, ...). Instead, the data processing program and the data processing management program are configured so that a data processing unit 14 that has completed its processing sends a signal notifying the data processing management unit 16 of the completion, and the data processing management unit 16 that receives this signal immediately issues an instruction to activate the next new data processing unit. Consequently, while each data processing unit 14 (14a, 14b, 14c, ...) is processing (collecting information), the only thing the data processing management unit 16 has to do is wait for a processing completion notification, and the full capacity of the CPU can be given to other programs that need it. Compared with a system that periodically alternates between dormancy and activity, the processing capacity is significantly improved, and information published on the network 100 can be collected at higher speed.
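The notification-driven management described above can be sketched in Python as follows. This is only an illustration under stated assumptions: threads stand in for the data processing units 14, a `queue.Queue` stands in for the processing completion notification, and the names `crawl_with_notifications`, `fetch`, and `limit` are hypothetical. The manager loop blocks on the queue instead of periodically inspecting worker state, and it starts a new worker as soon as a notification arrives, never exceeding the maximum simultaneous access upper limit number.

```python
import queue
import threading


def crawl_with_notifications(pages, limit, fetch):
    """Process `pages` with at most `limit` concurrent workers.

    Returns (results, peak), where `peak` is the highest number of
    workers that were running at the same time.
    """
    done = queue.Queue()      # completion notifications from workers
    pending = list(pages)     # pages waiting for a worker
    results = {}
    running = 0
    peak = 0

    def worker(page):
        try:
            result = fetch(page)
        except Exception as exc:   # report failures instead of hanging
            result = exc
        done.put((page, result))   # notify the manager of completion

    while pending or running:
        # Start workers until the simultaneous-access limit is reached.
        while pending and running < limit:
            page = pending.pop(0)
            threading.Thread(target=worker, args=(page,)).start()
            running += 1
            peak = max(peak, running)
        # Standby: block until some worker reports completion (no polling).
        page, result = done.get()
        results[page] = result
        running -= 1
    return results, peak
```

With three pages and a limit of 2, the third worker is started only after one of the first two has sent its notification, which mirrors the sequence shown for the data processing units 14a, 14b, and 14c.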
[0145]
The information collection/providing system 1 collects data from an information group (the content of a specific site) that has a hierarchical (link) relationship but mixes contents of different formats, such as HTML, Macromedia (registered trademark) Flash, and a directory-tree file system, and it does so in a way that treats every page as if it belonged to the same file system. The processing for data collection is described below with reference to FIG. 4.
[0146]
As shown in FIG. 4, the data processing unit 14 includes a data acquisition unit 141 (141a, 141b, 141c, ..., 141n) for transparently accessing servers that use a plurality of different protocols; a data analysis unit 142 (142a, 142b, 142c, ..., 142n) for transparently analyzing data contents in a plurality of different formats, that is, analyzing them so that they appear to the user to be in the same format; and a data registration unit 143 that is connected to the data analysis unit 142 and registers data in the storage device 20 in a unified data format.
[0147]
The data acquisition unit 141 is equipped with programs for various access methods (communication protocols), such as HTTP, MMS (Microsoft (registered trademark) Media Server), RTSP (Real Time Streaming Protocol), SMB (Server Message Block), and WebDAV (Web-based Distributed Authoring and Versioning), so that sites requiring different access methods can be accessed. As shown in FIG. 4, for example, the data acquisition unit 141a is assigned an access program that uses HTTP, the data acquisition unit 141b an access program that uses MMS, the data acquisition unit 141c an access program that uses RTSP, and so on.
[0148]
Then, when a content acquisition command is sent from the data processing management unit 16 to the data acquisition unit 141, the data acquisition unit 141 selects the matching protocol from the content information, accesses the site using that protocol, and acquires the instructed content. It creates (most of) the content ball 30 described later and stores the acquired data as a temporary file for analysis by the data analysis unit 142.
[0149]
In this embodiment, the temporary file is saved on the hard disk (not shown) of the cooperative system in order to save the storage capacity of the storage device 20. Alternatively, a temporary file storage area may be provided in the data information storage unit 23 or the like, and the processing may be performed within the storage device 20.
[0150]
The data analysis unit 142 is equipped with various programs for analyzing contents in different formats, such as HTML, Macromedia (registered trademark) Flash, SMIL (Synchronized Multimedia Integration Language), Adobe (registered trademark) PDF, and Microsoft (registered trademark) Word and Excel. As shown in FIG. 4, for example, the data analysis unit 142a is assigned an HTML analysis program, the data analysis unit 142b a Macromedia (registered trademark) Flash analysis program, the data analysis unit 142c a SMIL analysis program, the data analysis unit 142d a PDF analysis program, and so on.
[0151]
Then, the data analysis unit 142 selects the content analysis program (142a, 142b, 142c, ..., 142n) that matches the temporary file acquired by the data acquisition unit 141, activates it, and performs a process of acquiring the content and the link information in the content and converting the acquired content into the unified data format (supplementing the missing portion of the content ball 30 described later and producing link information corresponding to the reference information storage unit 22). It then transfers the analyzed and converted data (content ball 30 and link information) to the data registration unit 143. These processes are performed in a RAM (not shown) that serves as the work area of the CPU. Meanwhile, the data analysis unit 142 stores the acquired content itself as a temporary file on a hard disk (not shown) of the cooperation system.
[0152]
Upon receiving the analyzed and converted content ball 30 and the link information from the data analysis unit 142, the data registration unit 143 stores the site information and page information contained in the content ball 30 in the data information storage unit 23 of the storage device 20, and stores the link information in the reference information storage unit 22 of the storage device 20.
[0153]
In the data processing unit 14, the data acquisition unit 141, the data analysis unit 142, and the data registration unit 143 each have a predetermined interface. The information collection process can therefore be performed even for a combination of different types, for example when PDF content is referenced from an HTTP site (handled in this case by using the data acquisition unit 141a and the data analysis unit 142d).
[0154]
Further, according to the present embodiment, even when the number of protocols or data formats used for data acquisition increases, it suffices to add the corresponding data acquisition unit 141 or data analysis unit 142; the system can respond without changing the existing programs.
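The separation into acquisition, analysis, and registration behind fixed interfaces can be sketched in Python as two lookup tables. Everything named here is an assumption made for illustration: the tables `ACQUIRERS` and `ANALYZERS`, the `register` decorator, the canned responses in `FAKE_SERVER` (including the hypothetical URL ending in `guide.pdf`), and the choice of the file extension as the format key. No network access is performed; the point is only that a new protocol or format is supported by registering one more function.

```python
import re
from urllib.parse import urlsplit

ACQUIRERS = {}  # protocol (URL scheme) -> function(url) -> bytes
ANALYZERS = {}  # format (file extension) -> function(raw) -> dict

# Canned responses standing in for real servers (illustrative only).
FAKE_SERVER = {
    "http://abcd.co.jp/001.html": b'<a href="002.html">next</a>',
    "http://abcd.co.jp/guide.pdf": b"%PDF-1.4 ...",
}


def register(table, key):
    """Decorator that adds a handler to one of the lookup tables."""
    def decorator(func):
        table[key] = func
        return func
    return decorator


@register(ACQUIRERS, "http")
def acquire_http(url):
    return FAKE_SERVER[url]


@register(ANALYZERS, "html")
def analyze_html(raw):
    links = [m.decode() for m in re.findall(rb'href="([^"]+)"', raw)]
    return {"format": "html", "links": links}


@register(ANALYZERS, "pdf")
def analyze_pdf(raw):
    # A real analyzer would parse the document; this stub extracts no links.
    return {"format": "pdf", "links": []}


def process(url):
    """Acquire by protocol, analyze by format, return one unified record."""
    parts = urlsplit(url)
    acquire = ACQUIRERS[parts.scheme]
    analyze = ANALYZERS[parts.path.rsplit(".", 1)[-1].lower()]
    raw = acquire(url)
    record = {"location": url, "size": len(raw)}
    record.update(analyze(raw))
    return record
```

In this sketch, PDF content served over HTTP is handled by pairing the `http` acquirer with the `pdf` analyzer, analogous to combining the data acquisition unit 141a with the data analysis unit 142d.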
[0155]
Next, the operation of the entire information collection/providing system 1 according to the present embodiment will be described in detail with reference to FIGS. 5 to 8.
[0156]
FIG. 5 is a flowchart showing the general outline of the operation of the information collection/providing system 1 when collecting information for a specific site, and mainly describes the operation up to the start of information collection. FIG. 6 is a flowchart for explaining an outline of the operation of acquiring information for one specific site in the information collection/providing system 1, and shows the operation of the routine derived from step S7 in FIG. 5. FIG. 7 is a flowchart for explaining details of the operation of acquiring information for one page in one specific site, and shows the operation of the routine derived from step S75 in FIG. 6. FIG. 8 is a diagram for explaining an example of the operation of acquiring information for one specific site in relation to the information stored in the storage device 20 of the information collection/providing system 1.
[0157]
In the information collection/providing system 1, information acquisition for one or a plurality of specific sites is started by the site management unit 12 and the reference information processing management unit 15. This processing will be explained with reference to the flowchart of FIG. 5.
[0158]
In executing the information acquisition for a specific site, in the information collection/providing system 1, the site management unit 12 of the data processing device 10 first acquires the current time based on the internal clock of the CPU (step S1), and then obtains the URL and the like of each specific site and its information collection start time, which are registered under “site information” and “collection start date and time” (see FIG. 9) in the site information storage unit 21 (step S2).
[0159]
Subsequently, the site management unit 12 determines whether any information collection start time matches the current time (step S3). If No, that is, if none matches, the site management unit 12 waits for a predetermined time (step S5) and the process returns to step S1. If Yes, that is, if there is a specific site whose start time matches, the site management unit 12 notifies the reference information processing management unit 15 of a site processing request including the site information (URL and the like) of that specific site so that information collection for the site is started (step S4).
[0160]
As a result, in the data processing device 10, the reference information processing management unit 15 is awakened, and the site management unit 12 enters a standby state (step S5).
[0161]
Upon receiving the site processing request, the reference information processing management unit 15 checks whether the specific site of the URL acquired by the site management unit 12 (that is, the site for which information collection is to be started) is already being patrolled, that is, whether a reference information processing unit 13 in charge of that site already exists (step S6). If No, that is, if no such reference information processing unit 13 exists, the process proceeds to step S7. If Yes, that is, if one already exists, it is judged unnecessary to patrol the site again to collect information; the site processing request is discarded, the process waits for a certain period of time (step S5), returns to step S1, and waits until a site processing request is received from the site management unit 12.
[0162]
In step S7, the reference information processing management unit 15 outputs an activation command to the reference information processing unit 13, activates one reference information processing unit 13, and notifies the activated reference information processing unit 13 of the URL of the site. As a result, the information collection and the provision of the information to the cooperation system are started, and the fact that the specific site is being processed is stored in the RAM of the CPU.
[0163]
If it is determined in step S3 that there are a plurality of specific sites whose information collection start time matches the current time, and in step S6 that none of them is being patrolled, then in step S7 the reference information processing management unit 15 outputs activation commands to activate a plurality of reference information processing units, and as many reference information processing units 13 as there are such specific sites are activated.
[0164]
In addition, the process of collecting information and providing it to the cooperation system in step S7 is performed mainly by the reference information processing unit 13, the data processing management unit 16, and the data processing unit 14, based on the information in the site information storage unit 21 of the storage device 20. In each process, various data are recorded and updated in the reference information storage unit 22 and the data information storage unit 23 of the storage device 20. The outline and details will be described later with reference to FIGS. 6 and 7.
[0165]
When the process of collecting information and providing it to the cooperative system in step S7 ends, the reference information processing unit 13 notifies the reference information processing management unit 15 that the site processing has ended (see step S81 in FIG. 6). In the present embodiment, a notification is exchanged between the reference information processing management unit 15 and the reference information processing unit 13 in this manner whenever a reference information processing unit 13 is activated or terminated (that is, at the start and end of information collection). The reference information processing management unit 15 can therefore always manage the reference information processing units 13 by storing, in the RAM of the CPU, data indicating which sites are currently being processed.
[0166]
When the information collection process in step S7 ends, the site management unit 12 and the reference information processing management unit 15 determine whether an instruction to end the entire system has been notified from the system management unit 11 based on an operation input by the user (step S8). If No, that is, if there is no end instruction, the process waits for a certain period of time (step S5) and then returns to step S1. If Yes, that is, if there is an end instruction, the process ends.
[0167]
Note that the waiting for a certain time in step S5 is for preventing the CPU from constantly operating and imposing a burden on the CPU, and is usually made to wait for one minute.
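The decisions made in steps S1 to S6 can be condensed into a small Python sketch. The function name `due_sites`, the `schedule` mapping, and the `in_progress` set are hypothetical; comparing times at one-minute granularity is an assumption suggested by the one-minute wait in step S5, and the dates used in the usage note are invented for illustration.

```python
from datetime import datetime


def due_sites(now, schedule, in_progress):
    """Return sites whose collection start time matches the current minute
    (step S3) and that are not already being patrolled (step S6)."""
    current = now.replace(second=0, microsecond=0)
    return [
        url
        for url, start in schedule.items()
        if start == current and url not in in_progress
    ]
```

A caller would invoke this once per waiting interval, for example `due_sites(datetime.now(), schedule, in_progress)`, issue a site processing request for each returned URL, and then sleep for the interval of step S5.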
[0168]
As described above, in the information collection/providing system 1 of the embodiment, the data processing device 10 accesses a designated site on the network to collect information and provide it to the cooperative system when the designated time arrives (Yes in step S3), based on the information collection start time preset and stored in the site information storage unit 21 of the storage device 20. Rather than patrolling constantly as a conventional system does, the system patrols a specific site periodically and registers the collected information in the storage device 20, so that usage of, and load on, the CPU and the network 100 are reduced.
[0169]
Further, in the information collection/providing system 1, the site management unit 12 periodically acquires the information collection start time from the storage device 20 (steps S5 and S2), compares it with the current time (step S3), and, when they match, notifies the reference information processing management unit 15 of a site processing request including the site information (URL and the like) of the target specific site (step S4). Compared with a conventional search engine system that operates constantly, this reduces the load on the CPU and the network 100, and it also reduces management labor because information is collected automatically.
[0170]
(Overview of information collection processing)
Next, a routine derived from step S7 in FIG. 5, that is, an outline of an information collection execution process performed after the activation of the reference information processing unit 13 will be described with reference to a flowchart in FIG.
[0171]
In the information collection/providing system 1, the reference information processing unit 13, the data processing management unit 16, and the data processing unit 14 of the data processing device 10 perform the process of acquiring a large amount of information about the data (pages) in a specific site. This process is described below with reference to FIG. 6.
[0172]
First, in step S71, in the data processing device 10, the reference information processing unit 13 started by the activation command from the reference information processing management unit 15 described above acquires the URL of the specific site (that is, the site information in the site information storage unit 21) passed from the reference information processing management unit 15. Also in step S71, the reference information processing unit 13 creates a content ball indicating that processing of the specific site is about to start and transmits it to the cooperation system, after which the process proceeds to step S72. This will be described later with reference to FIGS. 11 and 12.
[0173]
In the next step S72, in the data processing device 10, the reference information processing unit 13 transmits a data processing request to the data processing management unit 16, whereby one data processing management unit 16 is activated (see FIG. 2). This data processing request includes the URL of the specific site (site information in the site information storage unit 21) and an instruction to start processing for the specific site.
[0174]
In the next step S73, based on the site information (URL and the like) acquired in the previous step, the data processing management unit 16 obtains the maximum simultaneous access upper limit number of the specific site from the site information storage unit 21. In other words, the data processing units 14 are actually activated by the data processing management unit 16 rather than by the reference information processing unit 13, and thus the processing subject of step S72 is the data processing management unit 16.
[0175]
Here, if a large number of data processing units 14 were activated, a load would be placed on the CPU and on other servers. To avoid such a load, the data processing management unit 16 manages the number of data processing units 14 to be started for a specific site based on the maximum simultaneous access upper limit number, as follows.
[0176]
In the next step S74, the data processing management unit 16 checks the number of data processing units 14 currently accessing the specific site and determines whether that number has reached the maximum simultaneous access upper limit number for the site (see FIG. 9). If No, that is, if the limit has not been reached, the process proceeds to step S76 via step S75. If Yes, that is, if the limit has been reached, the process proceeds directly to step S76. Normally, when a data processing unit 14 is activated for a specific site for the first time, the determination in step S74 is No.
[0177]
In step S75, the data processing management unit 16 of the data processing device 10 activates the data processing unit 14, which starts the process of acquiring the data of the specific site, and the process proceeds to step S76. Specifically, in step S75, the data processing management unit 16 issues a start command to start the data processing unit 14 and passes the data processing request to the one or more activated data processing units 14.
[0178]
More specifically, in the first execution of step S75 at the start of processing of the specific site, a start command is output to start one data processing unit, and the data processing request passed to the data processing unit 14 includes the site information (the top page URL in this example) from the site information storage unit 21 in the storage device 20.
[0179]
On the other hand, in the second and subsequent executions of step S75, after processing of the specific site has progressed, an activation command is output to activate one or a plurality of data processing units according to the number of links in the specific site and the maximum simultaneous access upper limit number. The data processing request passed to the data processing unit 14 (see step S80 described later) includes the “location on the network” information stored in the new data information storage area of the data information storage unit 23 in the storage device 20 (see FIG. 9).
[0180]
By the processing in step S75, each data processing unit 14 that has received the activation and the data processing request accesses any one page in the specific site.
[0181]
Note that details of the processing performed by the data processing unit 14 started in step S75 will be described in a derived routine (steps S7501 to S7511) in FIG. 7, which will be described later.
[0182]
In step S76, the data processing management unit 16 waits until receiving an end notification from the data processing unit 14, which will be described later (see FIG. 3), or until receiving a data processing request from the reference information processing unit 13 next. When any of these is received, the process returns to step S74.
[0183]
That is, for example, during the second processing period of the data processing management unit 16 shown in FIG. 3, the data processing management unit 16 receives a data processing request from the reference information processing unit 13 and an end notification from the data processing unit 14, and performs the processes of steps S74 and S75.
[0184]
In step S77, after the data processing unit 14 has been activated, the reference information processing unit 13 of the data processing apparatus 10 determines whether any of the activated data processing units 14 (14a to 14n) has completed its processing (that is, the information collection processing and the processing of providing information to the cooperative system). It waits in step S77 until one of them has completed, and the process then moves to step S78.
[0185]
This determination in step S77 is made based on the number of pools of data processing requests transmitted to the data processing management unit 16 (that is, the number of requests waiting for activation). The number of pools is obtained by subtracting the total number of data processing units 14 started in step S75 from the total number of data processing requests transmitted in step S72 and in step S80 described later. Its value moves within a range bounded by the maximum simultaneous access upper limit number.
[0186]
In detail, the reference information processing unit 13 adds 1 to the stored number of pools each time a data processing request is transmitted in step S72 or in step S80 described later. When activation commands for n data processing units are issued in step S75, n is subtracted from the stored value. When the value after subtraction becomes smaller than the stored value, it is determined that one of the activated data processing units 14 has finished. This is because the data processing management unit 16 activates the next data processing unit 14 (for example, 14b) when the processing of a data processing unit 14 (for example, 14a) ends.
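One possible reading of this bookkeeping is sketched below in Python. The class name `PoolCounter` and its method names are hypothetical, and the sketch assumes that the number of pools is simply the count of transmitted requests minus the count of started units, as the preceding paragraph states; how the embodiment actually observes activations is not specified here.

```python
class PoolCounter:
    """Tracks data processing requests that are waiting for a unit to start."""

    def __init__(self):
        self.value = 0  # requests transmitted minus units started

    def request_sent(self):
        """Called when a data processing request is transmitted (S72, S80)."""
        self.value += 1

    def units_started(self, n):
        """Called when n units are activated (S75).

        Returns True when the count decreased, which this sketch treats as
        the signal that a slot was freed by a unit that finished.
        """
        before = self.value
        self.value -= n
        return self.value < before
```

For example, after three requests and the activation of two units, one request remains pooled, and a later activation of that remaining request reduces the count to zero.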
[0187]
When the information collection processing of any of the data processing units 14 is completed, information on the processed data has already been registered in the reference information storage unit 22 and the data information storage unit 23 of the storage device 20 (see step S7507 or step S7510 in FIG. 7), as will be described later.
[0188]
In step S78, after the determination of Yes in step S77, the reference information processing unit 13 of the data processing device 10 checks the “processing state” column of the new reference information in the reference information storage unit 22 of the storage device 20 and, based on the link destination information whose processing state flag is “unprocessed”, searches for the serial number of the new data information in the data information storage unit 23. It thereby acquires, from the storage area of the new data information in the data information storage unit 23, the “location on the network” of the unprocessed (uncollected) data related to the specific site, and the process proceeds to step S79. Here, the unprocessed (uncollected) data related to the specific site is, for example, information on a page in the specific site that has not yet been accessed by a data processing unit 14, or link information on a page outside the specific site that is linked from a page accessed by a data processing unit 14.
[0189]
In step S79, the reference information processing unit 13 of the data processing device 10 determines, from the obtained unprocessed (uncollected) information, whether the data (pages) of all reference information related to the specific site has been collected. If No, that is, if unprocessed (uncollected) data related to the specific site still exists, the process proceeds to step S80. If Yes, that is, if the data of all reference information related to the specific site has been collected and no unprocessed (uncollected) information remains, the process proceeds to step S81. The reference information processing unit 13 makes the determination of Yes in step S79 on condition that the processing states in the reference information storage unit 22 of the storage device 20 are all “processed” (that is, data processing requests have been made for all link pages and the like in the specific site), that the number of pools of data processing requests described above is zero (that is, no data processing unit 14 is waiting for activation), and that no data processing unit 14 is still processing.
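The three conditions for the Yes determination in step S79 can be written as a single predicate. The following Python sketch uses hypothetical names (`site_collection_complete`, `processing_states`, `pool_count`, `running_units`) and assumes the processing states are available as a mapping from page location to state.

```python
def site_collection_complete(processing_states, pool_count, running_units):
    """Return True only when every page has been requested, no request is
    waiting for activation, and no data processing unit is still running."""
    all_requested = all(
        state == "processed" for state in processing_states.values()
    )
    return all_requested and pool_count == 0 and running_units == 0
```

Checking all three conditions matters because a page is marked “processed” when its request is issued, not when its collection finishes, so the state column alone cannot show that collection has ended.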
[0190]
In step S80, the reference information processing unit 13 of the data processing device 10 transmits a data processing request including the “location on the network” information acquired in step S78 to the data processing management unit 16. At this time, the reference information processing unit 13 sets the “processed” flag in advance in the processing state column of the reference information storage unit 22 for the unprocessed (uncollected) data.
[0191]
Thus, upon receiving the data processing request in step S80, the data processing management unit 16 exits the standby state in step S76 described above, shifts to step S74, and repeats the processing and standby in steps S74 to S76. On the other hand, the reference information processing unit 13 repeats the processing of steps S77 to S80 until it is determined in step S79 that there is no unprocessed (uncollected) information. As described above, by repeating each process, the data processing apparatus 10 collects data on all pages constituting one specific site and provides the data to the cooperative system.
[0192]
In the present embodiment, a page or the like outside the specific site that is linked from a page or the like in the specific site is stored in the storage device 20 as data related to the specific site. For this purpose, the data processing unit 14 accesses such a page, but does not acquire or analyze its content.
[0193]
That is, taking the specific site “http://abcd.co.jp/001.html” in FIG. 9 as an example, if a page “http://abcd.co.jp/002.html” in the specific site links to a page “rtsp://hijk.co.jp/001.html” of a completely different site, the data of that page is stored in the reference information storage unit 22 and the data information storage unit 23 of the storage device 20 as data linked from the specific site “http://abcd.co.jp/001.html”. The data processing unit 14 accesses “rtsp://hijk.co.jp/001.html” but does not acquire the content of that page. When the data processing unit 14 has finished acquiring information for all pages in the specific site, the reference information processing unit 13 determines that the collection of information on the one specific site is complete (Yes in step S79).
[0194]
In the information collecting / providing system 1, the series of operations described above is performed repeatedly, so that the next linked data is obtained from the first obtained data, and the data linked from that data is then obtained in turn. That is, by collecting data along the so-called hierarchical relationship of a specific site, the system can acquire all data within the specific site together with part of the data (namely, the header information) of pages outside the specific site that are linked from pages of the specific site, and can furthermore provide the acquired data to the cooperation system.
[0195]
In step S81, the data processing device 10 terminates the derivative routine of step S7 by causing the reference information processing unit 13 to send an end notification to the reference information processing management unit 15. After the end notification of step S81, the reference information processing unit 13 and the data processing management unit 16 end their processing and wait until the next collection start time (see FIG. 9) arrives and the reference information processing management unit 15 outputs a start command (step S7 in FIG. 5), or until a data processing request is issued from the reference information processing unit 13 (step S72 in FIG. 6).
[0196]
In step S81, the data processing device 10 also causes the reference information processing unit 13 to create a content ball indicating that processing of the URL of the specific site has been completed and to transmit the content ball to the cooperation system, after which the process shifts to step S8. This processing will be described later with reference to FIGS.
[0197]
As described above, in the data processing device 10 of the present embodiment, the data processing management unit 16, when activated, acquires the maximum simultaneous access upper limit number for the site from the site information storage unit 21 (step S73). While temporarily holding the data processing requests received from the reference information processing unit 13 (steps S72 and S80), it checks the number of data processing units 14 currently performing access (step S74). If the maximum simultaneous access upper limit number has been reached, it waits for a data processing unit 14 to end or for a data processing request to be received from the reference information processing unit 13 (step S76, see FIG. 3); if the number of data processing units 14 has not reached the maximum simultaneous access upper limit number, it starts a data processing unit 14 (step S75). The number of data processing units 14 running while information of one specific site is being acquired is therefore managed by the data processing management unit 16, and a situation in which more than the predetermined number of data processing units 14 are started does not occur. As a result, the load on the CPU and the network 100 can be reduced.
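The management of steps S73 to S76 can be illustrated with the following minimal sketch, which is only one possible reading of the description and not the implementation of the embodiment. The class name, method names, and the use of a simple in-memory queue are assumptions made for illustration.

```python
from collections import deque
from typing import List


class DataProcessingManager:
    """Hypothetical model of the data processing management unit 16."""

    def __init__(self, max_simultaneous_access: int) -> None:
        # Step S73: upper limit read from the site information storage unit 21.
        self.max_simultaneous_access = max_simultaneous_access
        self.pending = deque()  # pooled data processing requests
        self.running = set()    # URLs handled by running units 14

    def request(self, url: str) -> List[str]:
        """Pool a request (steps S72/S80) and start units if allowed."""
        self.pending.append(url)
        return self._start_units()

    def unit_finished(self, url: str) -> List[str]:
        """Handle an end notification from one unit and reuse the free slot."""
        self.running.discard(url)
        return self._start_units()

    def _start_units(self) -> List[str]:
        started = []
        # Step S74: compare the running count with the upper limit.
        while self.pending and len(self.running) < self.max_simultaneous_access:
            url = self.pending.popleft()
            self.running.add(url)  # step S75: start one data processing unit
            started.append(url)
        return started             # otherwise remain waiting (step S76)
```

With an upper limit of 2, a third request stays pooled until one of the first two units reports its end, at which point it is started.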
[0198]
In the data processing device 10, a plurality of data processing units 14 collect data for one specific site at the same time, and while the data processing units 14 are collecting data, the data processing management unit 16 and the reference information processing unit 13 are in a standby state that does not use the CPU (FIG. 3, step S76, step S77). In addition, as described above, whether a data processing unit 14 has completed its collection is not checked periodically by the reference information processing unit 13, the data processing management unit 16, or the like; instead, each data processing unit 14 that has finished individually transmits a signal indicating the end to the data processing management unit 16. This eliminates unnecessary operation and unnecessary idle time, so that information on a specific site published on the network 100 can be collected at high speed while reducing the load on the CPU and the network 100.
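The difference between periodic checking and individual end notification can be sketched as follows. This is an illustration under assumptions: the use of threads and of a blocking queue is not stated in the description, and the function name `run_units` is hypothetical.

```python
import queue
import threading
from typing import List


def run_units(urls: List[str]) -> List[str]:
    """Start one worker per URL and wait for end notifications without polling."""
    notifications: "queue.Queue[str]" = queue.Queue()

    def unit(url: str) -> None:
        # ... the collection of one page would happen here ...
        notifications.put(url)  # end notification sent by the finished unit

    threads = [threading.Thread(target=unit, args=(u,)) for u in urls]
    for t in threads:
        t.start()

    finished = []
    for _ in urls:
        # Blocks until a notification arrives; no periodic status check is needed.
        finished.append(notifications.get())
    for t in threads:
        t.join()
    return finished
```

The waiting side sleeps inside `notifications.get()` and is woken only when a unit finishes, which corresponds to the standby state that does not use the CPU.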
[0199]
(Details of information collection processing)
Next, with reference to the flowchart of FIG. 7, the derivative routine of step S75 in FIG. 6, that is, the details of the information collection process performed by one data processing unit 14 on the data of one page in a specific site, will be described.
[0200]
In the information collecting / providing system 1, the data processing unit 14 obtains individual information existing in a specific site by using a data acquisition unit 141, a data analysis unit 142, and a data registration unit 143 as its components. The processing will be described below with reference to FIG.
[0201]
The data acquisition unit 141 of the data processing unit 14 receives the data processing request from the data processing management unit 16 described above (step S7501) and determines, based on the URL protocol information included in the data processing request, whether it can handle that protocol (141a to 141 in FIG. 4) (step S7502). If No, that is, if the protocol cannot be handled, it sends an end notification to the data processing management unit 16 (step S7512). If Yes, it obtains information on the method of accessing the site (see FIG. 9) from the site information storage unit 21 of the storage device 20 (step S7503).
[0202]
As described above, specific examples of the information on the access method acquired in step S7503 include simple authentication (Basic authentication) information, form authentication (CGI = Common Gateway Interface authentication) information, proxy information, and the like.
[0203]
Here, when the simple authentication information is obtained, the data acquisition unit 141 uses the ID and password of the simple authentication information when accessing the URL (page) of the specific site, provided that the URL to be accessed matches the URL of the simple authentication.
[0204]
When the form authentication information is obtained, the data acquisition unit 141 uses the form authentication parameters at the time of access if the URL to be accessed matches the URL of the form authentication.
[0205]
Further, when the proxy information is obtained, the data acquisition unit 141 accesses the URL (page) of the specific site via the proxy.
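The selection of access method information described above can be sketched as follows. The dictionary layout, the key names, and the use of a prefix comparison for "matches" are all assumptions made for illustration; the description does not specify how the URLs are compared.

```python
from typing import Any, Dict


def access_options(url: str, site_info: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the access settings that apply to one URL (hypothetical helper)."""
    options: Dict[str, Any] = {}

    basic = site_info.get("basic_auth")
    if basic and url.startswith(basic["url"]):  # assumed matching rule
        options["auth"] = (basic["id"], basic["password"])

    form = site_info.get("form_auth")
    if form and url.startswith(form["url"]):    # assumed matching rule
        options["form_params"] = dict(form["params"])

    if site_info.get("proxy"):
        options["proxy"] = site_info["proxy"]   # the access goes via the proxy
    return options
```

Because the settings are looked up from stored site information, they are entered once per site rather than once per access.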
[0206]
In the next step S7504, the data acquisition unit 141 of the data processing unit 14, having acquired the information on the method of accessing the site, accesses the URL using HTTP if the URL protocol information is HTTP (that is, using the data acquisition unit 141a in FIG.), and otherwise using the corresponding protocol, obtains the header information of the URL (information other than the body text, such as the name, type, size, and update date and time), and proceeds to step S7505. The protocol information of the URL is indicated by the head of the URL: if the URL starts with “http:”, HTTP is used, and if it starts with “rtsp:”, RTSP is used.
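As a small illustration of selecting the protocol from the head of the URL, the following sketch uses the standard `urllib.parse.urlsplit` function. The table of supported protocols and the function name are hypothetical; returning `None` stands for the case in which the protocol cannot be handled (No in step S7502).

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical table of protocols for which an acquisition unit exists.
SUPPORTED_PROTOCOLS = {"http": "HTTP", "rtsp": "RTSP"}


def select_protocol(url: str) -> Optional[str]:
    """Return the protocol name for the URL, or None if it is unsupported."""
    scheme = urlsplit(url).scheme.lower()
    return SUPPORTED_PROTOCOLS.get(scheme)
```

For “http://abcd.co.jp/0001.html” this yields "HTTP", and for “rtsp://hijk.co.jp/001.html” it yields "RTSP".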
[0207]
In step S7505, the data acquisition unit 141 of the data processing unit 14 determines whether the header information has been acquired. If No, that is, if the header information could not be acquired, it sends an end notification for the page to the data processing management unit 16 and terminates the processing (step S7512). If Yes, that is, if the header information has been acquired, the process moves to step S7506 to determine whether the body text (content) needs to be acquired.
[0208]
In step S7506, the data acquisition unit 141 of the data processing unit 14 refers to the storage area of the previous data information in the data information storage unit 23 and determines whether the header information acquired this time matches the header information registered last time. If Yes, that is, if the header information matches the previously acquired header information, the name, type, size, update date and time, and so on are the same, so the body text (content) is also regarded as unchanged and the process moves to step S7507. If No, that is, if they do not match, the body text (content) is regarded as changed (“UPDATE”) or newly added (“NEW”), and the process moves to step S7508.
[0209]
In the case of the first access to the specific site, no header information for the specific site is registered in either the previous data information storage area or the new data information storage area of the data information storage unit 23, so the determination in step S7506 is No.
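The comparison of step S7506 can be sketched as follows. The class `HeaderInfo`, its field names, and the mapping of "no previous entry" to “NEW” are assumptions for illustration; the labels “NONE”, “NEW”, and “UPDATE” are those used in the end notifications of this description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HeaderInfo:
    """Hypothetical container for the header items compared in step S7506."""
    name: str
    kind: str
    size: int
    updated: str


def classify(current: HeaderInfo, previous: Optional[HeaderInfo]) -> str:
    """Return "NONE" if unchanged, "NEW" if never seen, otherwise "UPDATE"."""
    if previous is None:     # first access: nothing registered last time
        return "NEW"
    if current == previous:  # every header item matches
        return "NONE"
    return "UPDATE"
```

Only a result other than "NONE" would lead on to step S7508 and possibly to acquisition of the content.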
[0210]
In step S7507, the data processing unit 14 passes the header information from the data acquisition unit 141 to the data registration unit 143, and the data registration unit 143 stores the header information in the new data information storage area of the data information storage unit 23. At the same time, the data registered in the storage area of the previous reference information in the reference information storage unit 22 is copied as it is to the storage area of the new reference information. The process then proceeds to step S7512, where an end notification (in this case, page processed: "NONE" in "INFO", indicating that the data has not been updated) is sent to the data processing management unit 16, and the process ends.
[0211]
On the other hand, in step S7508, the data processing unit 14 determines whether the header information acquired in step S7504 is information on an external site. If Yes, that is, if it is external site information, the content need not be acquired and the flow shifts to step S7511. If No, that is, if it is not external site information, the content needs to be acquired and the flow shifts to step S7509.
[0212]
Here, the determination in step S7508 as to whether the information belongs to an external site is made based on the information registered in the “site information” column of the site information storage unit 21. Specifically, if the registered information is a web site, the criterion is whether the domain name is the same as that of the specific site (“abcd.co.jp” in the example of serial number 1 in FIG. 9). If the access target is a directory-tree file system, a directory name is registered in the “site information” column of the site information storage unit 21, so the criterion in step S7508 is whether the directory name is the same as that of the specific site.
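The two criteria of step S7508 can be sketched as follows, assuming that "the same domain name" means an identical host name and that "the same directory name" means the target lies at or below the registered directory. Both interpretations, and the function names, are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def is_external_site(site_url: str, page_url: str) -> bool:
    """Web site case: external when the host names differ."""
    return urlsplit(site_url).hostname != urlsplit(page_url).hostname


def is_outside_directory(site_dir: str, path: str) -> bool:
    """File system case: external when the path is not under the site directory."""
    site = PurePosixPath(site_dir)
    target = PurePosixPath(path)
    return site != target and site not in target.parents
```

For the example in this description, “rtsp://hijk.co.jp/001.html” is external to the specific site “http://abcd.co.jp/001.html”, while “http://abcd.co.jp/002.html” is not.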
[0213]
In step S7509, the data processing unit 14 actually obtains the content using the data obtaining unit 141, saves the content in a temporary file (in this example, the hard disk of the cooperative system), and proceeds to step S7510.
[0214]
In step S7510, the data processing unit 14 analyzes the content stored in the temporary file with the data analysis unit 142 and extracts the necessary information. Specifically, the data analysis unit 142 analyzes the temporary file as HTML if the content is HTML (that is, using the data analysis unit 142a in FIG. 4), and otherwise analyzes it using the analysis method corresponding to the content; the analysis result is passed to the data registration unit 143 via the RAM, and the flow shifts to step S7511.
[0215]
Note that the “necessary information” in step S7510 includes the information forming a content ball 30 (see FIG. 12), which is described later and is transmitted to the cooperation system, and the information forming the link information to be stored in the reference information storage unit 22 (in this embodiment, the number of lines and the tag name).
[0216]
In step S7511, the data registration unit 143 of the data processing unit 14, having obtained the content analysis result in the RAM, uses it to register the location on the network, the number of layers, and the size, update date and time, and so on constituting the header information in the new data information storage area of the data information storage unit 23, and to register the link information, parameters, and the like in the new reference information storage area of the reference information storage unit 22; the process then proceeds to step S7512. In step S7512, an end notification (in this case, page processed: "NEW" in "INFO", indicating new data, or "UPDATE", indicating updated data) is sent to the data processing management unit 16, and the process ends.
[0217]
On the other hand, in step S7511 reached after it is determined in step S7508 that the information is external site information, there is no content analysis result, so the data registration unit 143 registers information such as the size and update date and time of the header information in the new data information storage area of the data information storage unit 23, and the process proceeds to step S7512. In step S7512, an end notification (in this case, page processed: “NEW” in "INFO", indicating new data, or “UPDATE”, indicating updated data) is sent to the data processing management unit 16, and the process ends.
[0218]
As described above, according to the information collecting / providing system 1 of the embodiment, the data processing unit 14 acquires the header information, which indicates the outline of the content or the like of the accessed page (step S7504), and compares it with the header information acquired last time (step S7506). If the header information differs from the previous time, the content is acquired (step S7509) and analyzed (step S7510), on condition that the site is not an external site (No in step S7508). If the header information is the same as the previous time, the content is neither acquired nor analyzed; instead, the previously stored header information and the link information included in the content are reused and stored in the storage device 20 (step S7507). This configuration reduces the usage of the CPU and the network 100, and thus has the effect of reducing the load on the CPU and the network 100.
[0219]
That is, according to the information collecting / providing system 1 of the embodiment, the data processing device 10 acquires the header information, which is information unique to the content, before analyzing the content, which takes time and imposes a processing load (step S7504), and collects and manages information by comparing it with the previously acquired header information (step S7506) and registering it in the storage device 20 (step S7507, steps S7508 to S7511). Content is collected only for information that has been updated since the content previously registered in the storage device 20 (step S7509), while content of an external site, or content that has not been updated since the content previously registered in the storage device 20, is not acquired (analyzed) (step S7511, step S7507). Information can therefore be acquired at high speed and with a low load, so that the load on the CPU and the network 100 when collecting information on a plurality of specific sites simultaneously and in parallel is significantly reduced and the processing can be finished quickly.
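The branching of FIG. 7 summarized above can be condensed into the following sketch, which returns the sequence of steps that one page would pass through after header acquisition. The function name and the representation of header information as an arbitrary comparable object are assumptions; the step labels are those of this description.

```python
from typing import List, Optional


def plan_page(header: Optional[object],
              previous_header: Optional[object],
              external: bool) -> List[str]:
    """Return the steps executed for one page after step S7504 (sketch)."""
    if header is None:             # No in S7505: header not acquired
        return ["S7512"]
    if header == previous_header:  # Yes in S7506: unchanged since last time
        return ["S7507", "S7512"]  # copy previous reference info, then notify
    if external:                   # Yes in S7508: register header only
        return ["S7511", "S7512"]
    # Changed or new page inside the site: acquire, analyze, register, notify.
    return ["S7509", "S7510", "S7511", "S7512"]
```

Content acquisition (S7509) and analysis (S7510) appear only in the last branch, which is where the reduction in CPU and network load described above comes from.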
[0220]
Further, according to the information collecting / providing system 1 of the embodiment, the data processing unit 14 of the data processing device 10 receives the site information (URL or the like) of the page to be collected for a specific site (step S7501), obtains the access method information (simple authentication (Basic authentication) information, form authentication (CGI = Common Gateway Interface authentication) information, proxy information, etc.) from the site information storage unit 21 of the storage device 20 (step S7503), and accesses the URL based on that information to obtain the header information (step S7504). The user therefore only needs to start the system management program and set the information to be registered in the site information storage unit 21 once; there is no need to input, each time, the information on the various access methods that differ for each specific site, and the management effort is reduced.
[0221]
Next, with reference to FIG. 8, the processing that the reference information processing unit 13 and the data processing unit 14 perform on the reference information storage unit 22 and the data information storage unit 23 of the storage device 20 when performing the above-described information collection processing will be described.
[0222]
As schematically shown in FIG. 8, when performing the information collection process, the reference information processing unit 13 and the data processing unit 14 carry out the process while referring to and updating the reference information storage unit 22 and the data information storage unit 23 of the storage device 20.
[0223]
As described above, each of the reference information storage unit 22 and the data information storage unit 23 has a duplicated storage area for the processing results of the data processing device 10: one area stores the results of the previous information acquisition processing, and the other is a storage area for the new information created when the information acquisition processing is next performed.
[0224]
Then, in the data processing device 10, the data processing unit 14 accesses the URL specified by the reference information processing unit 13 via the network 100, acquires the header information of the URL, registers it in the “new data information storage area” of the data information storage unit 23 (step S7504 in FIG. 7), and compares it with the header information of the corresponding page in the “previous data information storage area” of the data information storage unit 23 (step S7506 in FIG. 7).
[0225]
Note that, for convenience of description, the comparison of header information is described here using the example of comparing the update date and time, but the other items stored in the data information storage unit 23, such as the type and size, are compared in the same way.
[0226]
Here, if the header information in the two storage areas of the data information storage unit 23 matches as a result of the comparison (Yes in step S7506), the data processing unit 14 copies the URL information stored in the “previous reference information storage area” of the reference information storage unit 22 to the “new reference information storage area” of the reference information storage unit 22 as “unprocessed” (step S7507). If they do not match, or if no entry exists in the “previous data information storage area” of the data information storage unit 23 (No in step S7506), the data processing unit 14, provided that the page is not on an external site (No in step S7508), acquires the content, stores it as a temporary file (step S7509), extracts the link URL information included in the content (step S7510), and registers it as “unprocessed” in the “new reference information storage area” of the reference information storage unit 22 (step S7511).
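The handling of the duplicated reference information areas can be sketched as follows. The list-of-dictionaries layout and all names are hypothetical; the sketch only shows that links are either copied forward from the previous area or taken from freshly extracted content, and that they are registered as “unprocessed” in both cases.

```python
from typing import Dict, List


def update_reference_area(page: str,
                          headers_match: bool,
                          previous_links: Dict[str, List[str]],
                          extracted_links: List[str],
                          new_area: List[dict]) -> List[dict]:
    """Fill the new reference information area for one page (sketch)."""
    if headers_match:
        # Step S7507: reuse the links recorded at the previous collection.
        links = previous_links.get(page, [])
    else:
        # Steps S7509 to S7511: use the links extracted from the fetched content.
        links = extracted_links
    for target in links:
        new_area.append({"source": page, "target": target, "state": "unprocessed"})
    return new_area
```

Applied to the situation of FIG. 8 described below, the links from data 1 would be copied forward unchanged, while the new link from data 2 to data 3 would come from the freshly analyzed content of data 2.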
[0227]
Such processing saves the time and transfer volume that would otherwise be spent acquiring the content of URLs that have not changed since the previous acquisition, making it possible to acquire information at high speed while reducing the load on the CPU and the network 100.
[0228]
On the other hand, during the data collection processing, the reference information processing unit 13 acquires, from the new data information storage area of the data information storage unit 23, the “location on the network” information (see FIG. 9) of entries whose processing state in the new reference information storage area of the reference information storage unit 22 is “unprocessed” (step S78 in FIG. 6). If an “unprocessed” entry exists (No in step S79), it transmits a data processing request to the data processing management unit 16 in order to start a data processing unit 14, and changes the processing state in the reference information storage unit 22 to “processed” (step S80).
[0229]
In FIG. 8, three pages including the specific site (data 1 as the specific site, and data 2 and data 3 linked from data 1) exist on a device such as a server on the network 100. At the time of the previous (for example, first) information collection, links were set from data 1 to data 2 and data 3; by the time of the current (for example, second) information collection, a link from data 2 to data 3 has been newly set. In this case, data 2 has been updated, and its update date and time is January 1, 2003.
[0230]
FIG. 8 schematically shows a state in which the information previously collected by the information collecting / providing system 1 is stored in the storage area of the previous reference information in the reference information storage unit 22 and the storage area of the previous data information in the data information storage unit 23, and the information collected this time is recorded in the storage area of the new reference information of the reference information storage unit 22 and the storage area of the new data information of the data information storage unit 23.
[0231]
In this case, data 1 and data 3 have not changed between the previous time and this time. As a result of the current information collection, as shown in FIG. 8, the same information as in the previous data information storage area is registered in the new data information storage area of the data information storage unit 23, and the information of data 1 and data 3 is copied from the storage area of the previous reference information of the reference information storage unit 22 to the storage area of the new reference information.
[0232]
More specifically, for data 1 and data 3, at the time of the current information collection, the header information of each is collected by the data acquisition unit 141 of the data processing unit 14 (step S7504 in FIG. 7), and each piece of information is registered in the “storage area for new data information” of the data information storage unit 23. At this time, the information is compared with the information registered in the “storage area of previous data information” of the data information storage unit 23 (step S7506). Since the comparison shows no change between the previous time and this time (none of the items has changed), data 1 and data 3 are not analyzed by the data analysis unit 142 (step S7510), and the information in the “storage area of the previous reference information” is copied as it is to the “storage area of the new reference information” of the reference information storage unit 22 (step S7507).
[0233]
On the other hand, data 2 was changed (updated) between the previous time and this time, in that a link to data 3 was set on January 1, 2003. The update date and time and the update content are therefore acquired; the update date and time is stored in the “new data information storage area” of the data information storage unit 23, and the update content is stored in the “new reference information storage area” of the reference information storage unit 22.
[0234]
More specifically, at the time of the current information collection, the header information of data 2 is collected by the data acquisition unit 141 of the data processing unit 14, and each piece of information is registered in the “new data information storage area” of the data information storage unit 23. At this time, it is compared with the information registered in the “previous data information storage area” of the data information storage unit 23 (step S7506), and the comparison shows that the “update date and time” differs. Data 2 is therefore acquired (step S7509) and analyzed by the data analysis unit 142 (step S7510), and the new reference information obtained from data 2 is registered in the “storage area for new reference information” of the reference information storage unit 22 (step S7511).
[0235]
Then, when all processing to be performed by the data processing units 14 has been completed and every entry in the “new reference information storage area” of the reference information storage unit 22 is processed (in this case, when the "unprocessed" entry in the lower row of the new reference information storage area shown in FIG. 8 has been changed to "processed" and all processing by the data processing units 14 (14a to 14c) has finished), all of the large amount of information in the specific site has been acquired, and the reference information processing unit 13 ends its processing.
[0236]
Next, the relationship among the site information storage unit 21, the reference information storage unit 22, and the data information storage unit 23 in the storage device 20 will be described more specifically with reference to the data tables shown in FIG. 9. Of the data registered in the reference information storage unit 22 and the data information storage unit 23, FIG. 9 extracts and shows only the data for the specific site “http://abcd.co.jp/0001.html” registered in the site information storage unit 21.
[0237]
As shown in FIG. 9, in the storage device 20, the value of the “serial number” of the site information storage unit 21 (“1” in FIG. 9) is registered in the “site” column of the reference information storage unit 22, whereby the data in the reference information storage unit 22 is associated with the data of the one specific site “http://abcd.co.jp/0001.html” registered in the site information storage unit 21. Similarly, the value of the “serial number” of the site information storage unit 21 (“1” in FIG. 9) is registered in the “site” column of the data information storage unit 23, whereby the data in the data information storage unit 23 is associated with the data of the specific site in the site information storage unit 21.
[0238]
In the storage device 20, the “serial number” of the data information storage unit 23 is registered in the “link source” and “link destination” columns of the “link information” of the reference information storage unit 22, whereby the data in the reference information storage unit 22 is associated with the data in the data information storage unit 23.
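The associations among the three storage units can be sketched with plain records keyed by serial number. The class and field names below are hypothetical simplifications of the columns of FIG. 9, and the sample values in the usage note are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SiteRecord:        # site information storage unit 21
    serial: int
    url: str


@dataclass
class DataRecord:        # data information storage unit 23
    serial: int
    site: int            # serial number of the SiteRecord
    location: str        # location on the network


@dataclass
class ReferenceRecord:   # reference information storage unit 22
    site: int            # serial number of the SiteRecord
    link_source: int     # serial number of a DataRecord
    link_destination: int
    state: str


def resolve_link(ref: ReferenceRecord, data: List[DataRecord]) -> Tuple[str, str]:
    """Translate the serial numbers of one link into network locations."""
    by_serial = {d.serial: d for d in data}
    return (by_serial[ref.link_source].location,
            by_serial[ref.link_destination].location)
```

For instance, a reference record with link source 1 and link destination 2 resolves to the locations of data records 1 and 2, both of which carry the site serial number 1.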
[0239]
Further, in the storage device 20, the reference information storage unit 22 and the data information storage unit 23 are each duplicated: one area is used as the storage area for the previously collected information, and the other as the storage area for the most recently collected information.
[0240]
Here, for the one specific site “http://abcd.co.jp/0001.html” targeted for information acquisition, assume that the first information collection process found only the top page 0001.html and its lower page 0002.html, but that by the time of the second information collection process a page 0003.html had been newly added below 0002.html. Processes such as the updating of data in the storage units 21 to 23 will be described under this assumption.
[0241]
First, in the first information collection process, the data processing unit 14 (referred to as 14a), having received an instruction to process the site indicated by “http://abcd.co.jp/0001.html” through the data processing request of step S7501 in FIG. 7 described above, obtains the various information on “http://abcd.co.jp/0001.html” from the site information storage unit 21 (the data in column (1) of FIG. 9; in this case, proxy information) (step S7503), and acquires information from http://abcd.co.jp/0001.html.
[0242]
When accessing http://abcd.co.jp/0001.html, the data processing unit 14a first acquires the header information of http://abcd.co.jp/0001.html using the data acquisition unit 141 (step S7504). In this case, since there is no previously acquired information (No in step S7506), the processing of steps S7509 to S7512 is performed.
[0243]
More specifically, in step S7509, the data processing unit 14a uses the data acquisition unit 141 to acquire the content of http://abcd.co.jp/0001.html (the top page), creates most of the content ball 30 described later, and stores the acquired data as a temporary file for analysis by the data analysis unit 142. In the next step S7510, the data processing unit 14a analyzes the content with the data analysis unit 142 and extracts the necessary information to fill in the missing portions of the content ball 30. In the next step S7511, the data registration unit 143 newly registers the above-described data in the reference information storage unit 22 and the data information storage unit 23, and in step S7512 the completed content ball 30 is provided to the cooperation system.
[0244]
That is, in step S7511, based on the information extracted in the previous step, the data processing unit 14a records data for the page (0001.html), such as the number of layers (1 in this example), the type (HTML in this example), the size (1024 bytes in this example), and the update date and time (00:00 on December 1, 2002 in this example), in the corresponding columns of the new data information recording area of the data information storage unit 23 ((5) in FIG. 9). If there is information on link destinations linked from the page (0001.html), that information is recorded in the corresponding columns of the storage area of the new reference information in the reference information storage unit 22 ((2) in FIG. 9).
[0245]
Then, in step S7511, the data registration unit 143 of the data processing unit 14a, having recorded each piece of information about the page (0001.html) in the reference information storage unit 22 and the data information storage unit 23, regards the collection of all data for the page as complete and records a flag indicating that data collection for the page has been completed (“OK” in FIG. 9) in the “collection status” column of the new data information recording area ((5) in FIG. 9) of the data information storage unit 23. It then notifies the data processing management unit 16 of the end (step S7512), and at this time the content ball 30 is transmitted to the cooperation system.
[0246]
In this case, when the data processing unit 14a analyzes the content of page 0001.html (step S7510), the link page “0002.html” described in the body text (content) of 0001.html is extracted. Therefore, in step S7511, the data registration unit 143 secures (newly creates) a record column for 0002.html ((6) in FIG. 9) in the storage area of the new data information in the data information storage unit 23 and records, in the “collection status” column of the secured (newly created) record for 0002.html, a flag (for example, “not yet”) indicating that the data of that page has not yet been collected.
[0247]
Having extracted 0002.html in step S7510, the data processing unit 14a also records the link information found at the link source (in this case, “http://abcd.co.jp/0001.html”). Accordingly, in the next step S7511, the data registration unit 143 records the link information (link source, link destination, line number, tag name) in the record secured in the new reference information storage area of the reference information storage unit 22 to indicate that 0001.html links to 0002.html (row (2) in FIG. 9), and records in its “processing state” column a “not yet” flag indicating that the data of the specific content (in this case, the content of “http://abcd.co.jp/0002.html”) has not all been collected yet. The example shown in row (2) of FIG. 9 is a case in which line 32 of the body of http://abcd.co.jp/0001.html contains link destination information pointing to 0002.html, and the tag name is the HREF of the A tag.
[0248]
Then, in the information collecting / providing system 1, the reference information processing unit 13, which monitors the “processing state” column of the reference information storage unit 22, responds to the recording performed by the data processing unit 14a. To activate the next data processing unit 14 (14b), it transmits a data processing request including “http://abcd.co.jp/0002.html” to the data processing management unit 16 (step S72 in FIG. 6), and the data processing management unit 16 outputs a start command (step S75).
[0249]
Subsequently, the newly started data processing unit 14b accesses the page based on “http://abcd.co.jp/0002.html” included in the start command and, as before, acquires the header information of http://abcd.co.jp/0002.html with the data acquisition unit 141 (step S7504). Since there is no previously acquired information (No in step S7506) and the page is not on an external site (No in step S7508), the processing of steps S7509 to S7512 is performed in the same manner as described above.
[0250]
That is, in step S7509, the data processing unit 14b acquires the content of http://abcd.co.jp/0002.html, creates most of the content ball 30 described later, and stores the acquired data as a temporary file for analysis by the data analysis unit 142. In the next step S7510, the data analysis unit 142 of the data processing unit 14b analyzes the content and extracts the necessary information to fill in the missing portions of the content ball 30. In the next step S7511, the data registration unit 143 newly registers the data described above in the reference information storage unit 22 and the data information storage unit 23, and in step S7512 the completed content ball 30 is provided to the cooperative system.
[0251]
That is, in step S7511, based on the information extracted in the previous step, the data processing unit 14b records data for the page (0002.html), such as the hierarchy level (2 in this example), the type (HTML in this example), the size (1024 bytes in this example), and the update date and time (00:00 on December 1, 2002, in this example), in the new data information recording area of the data information storage unit 23 (row (6) in FIG. 9). If the page (0002.html) contains link destination information, that information is recorded in the new reference information storage area of the reference information storage unit 22.
[0252]
In this case, at the time of the first information collection (patrol), 0002.html contains no link information (no link to 0003.html). Therefore, in step S7511, the data registration unit 143 of the data processing unit 14b does not secure (newly create) any record in the data information storage unit 23 or the reference information storage unit 22. Once the information on the page (0002.html) of the specific site has been recorded in the new data information recording area (row (6) in FIG. 9) of the data information storage unit 23, it regards all data collection for the page as complete, records a flag indicating this (“OK” in FIG. 9) in the “collection status” column, and changes the “not yet” flag in the “processing state” column of the reference information storage unit 22 (row (2) in FIG. 9) to “completed”. As a result of this processing, every processing state in the reference information storage unit 22 is “completed”, which indicates that data collection for all pages of the specific site (that is, every page under http://abcd.co.jp/) has finished and that the first information collection of this specific site can be ended.
[0253]
That is, when the data registration unit 143 of the data processing unit 14b notifies the data processing management unit 16 of the end of processing in the next step S7512, the reference information processing unit 13, which monitors the “processing state” column of the reference information storage unit 22, determines that “all data of the reference information has been collected” (Yes in step S79 in FIG. 6), and the first information collection of this specific site ends.
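The first collection pass walked through above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: `collect_site` and `links_of` are hypothetical names, pages are "fetched" through a caller-supplied function instead of HTTP, and the parallel data processing units 14 are replaced by a simple sequential queue. Only the flag values (“not yet”, “OK”, “completed”) follow the text.

```python
def collect_site(start, links_of):
    """Walk one site; return (data_info, reference_info, finished).

    links_of(page) is a hypothetical callback returning the pages that
    `page` links to (standing in for steps S7509-S7510).
    """
    data_info = {start: "not yet"}   # page -> "collection status"
    reference_info = {}              # (link source, link destination) -> "processing state"
    pending = [start]
    while pending:
        page = pending.pop(0)
        for dest in links_of(page):
            if dest not in data_info:
                # Newly discovered page: secure a record and queue it.
                data_info[dest] = "not yet"
                pending.append(dest)
            already_done = data_info[dest] == "OK"
            reference_info[(page, dest)] = "completed" if already_done else "not yet"
        # All data for this page has been registered (step S7511).
        data_info[page] = "OK"
        for source, dest in reference_info:
            if dest == page:
                reference_info[(source, dest)] = "completed"
    # Corresponds to the check in step S79: every processing state is "completed".
    finished = all(state == "completed" for state in reference_info.values())
    return data_info, reference_info, finished
```

For the two-page site of the first collection, `collect_site("0001.html", {"0001.html": ["0002.html"], "0002.html": []}.__getitem__)` marks both pages “OK” and the single link record “completed”.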
[0254]
In the information collection / providing system 1, at the time of the next (second) information collection (patrol) of the specific site, before the data processing unit 14 is activated, the reference information processing unit 13 treats the “new” reference information storage area of the reference information storage unit 22 and the “new” data information storage area of the data information storage unit 23 as the “previous” reference information storage area and the “previous” data information storage area, respectively, and newly secures a storage area for new reference information and a storage area for new data information.
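The hand-over of the “new” areas to the “previous” areas before each patrol can be pictured as follows. This is only a sketch: the class `StorageUnit` and the method `start_next_collection` are hypothetical names, and plain dictionaries stand in for the storage areas of the reference information storage unit 22 and the data information storage unit 23.

```python
from dataclasses import dataclass, field


@dataclass
class StorageUnit:
    """Hypothetical stand-in for storage unit 22 or 23."""
    previous: dict = field(default_factory=dict)  # "previous" storage area
    new: dict = field(default_factory=dict)       # "new" storage area

    def start_next_collection(self) -> None:
        # What was "new" in the last patrol becomes "previous",
        # and an empty "new" area is secured for this patrol.
        self.previous = self.new
        self.new = {}
```

Calling `start_next_collection()` on both units before the data processing units are started leaves the last patrol's records available for comparison.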
[0255]
Next, the second information collection process performed by the data processing device 10 after a new page 0003.html has been added to the specific site “http://abcd.co.jp/0001.html” will be described.
[0256]
In the second information collection process as well, the data processing unit 14 (again, 14a) receives an instruction to process the site indicated by “http://abcd.co.jp/0001.html” through the data processing request in step S7501 of FIG. 7 described above. As before, it obtains the “access method” information (proxy information in this case) from the information on “http://abcd.co.jp/0001.html” held in the site information storage unit 21 (row (1) in FIG. 9) (step S7503) and accesses http://abcd.co.jp/0001.html.
[0257]
Having accessed http://abcd.co.jp/0001.html, the data processing unit 14a first acquires the header information of http://abcd.co.jp/0001.html with the data acquisition unit 141 (step S7504). In this case, however, the acquired header information matches the previous information (that is, the information in row (5) of FIG. 9) (Yes in step S7506), so the contents (including link information) of the top page (0001.html) are regarded as unchanged, and the data registration unit 143 performs the process of step S7507 this time. That is, in step S7507, the data registration unit 143 of the data processing unit 14a copies all the data in each column of the previous data information recording area (rows (5) and (6) in FIG. 9) of the data information storage unit 23 (that is, the data for the top page 0001.html and its lower page 0002.html) to the new data information recording area (rows (7) and (8) in FIG. 9), recording a “not yet” flag only in the “collection status” column for 0002.html. It likewise copies the information in the previous reference information recording area (row (2) in FIG. 9) of the reference information storage unit 22 to the new reference information recording area (row (3) in FIG. 9), recording a “not yet” flag only in the “processing state” column.
[0258]
Note that the content ball 30 is created and updated by the data acquisition unit 141 and the data analysis unit 142, and the data registration unit 143 performs the final update using the information recorded in the reference information storage unit 22 and the data information storage unit 23. As a result, a content ball 30 whose page state is NONE (no update), with the same content as the previous time, is provided to the cooperative system.
[0259]
Then, in the information collecting / providing system 1, the reference information processing unit 13, which monitors the “processing state” column of the reference information storage unit 22, responds to the recording performed by the data processing unit 14a. To activate the next data processing unit 14 (14b), it transmits a data processing request including “http://abcd.co.jp/0002.html” to the data processing management unit 16 (step S72 in FIG. 6), and the data processing management unit 16 outputs a start command (step S75).
[0260]
Subsequently, the newly started data processing unit 14b accesses the page based on “http://abcd.co.jp/0002.html” included in the start command, acquires the header information of http://abcd.co.jp/0002.html with the data acquisition unit 141 (step S7504), and compares the acquired header information with the header information acquired last time (that is, the information in row (6) of FIG. 9) (step S7506).
[0261]
In this case, the “size” and “update date and time” in the header information of page 0002.html differ from the previous time (No in step S7506), so the content of page 0002.html is also regarded as different, and the processes of steps S7509 to S7512 are performed.
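The branch taken at step S7506 can be summarized in a short sketch. The names `HeaderInfo` and `steps_after_header_check` are hypothetical, the header is reduced to the two fields named in the text (size and update date and time), and the external-site check of step S7508 is deliberately left out.

```python
from typing import NamedTuple, Optional


class HeaderInfo(NamedTuple):
    size: int      # size in bytes
    updated: str   # update date and time, e.g. "2002-12-01T00:00"


def steps_after_header_check(previous: Optional[HeaderInfo],
                             current: HeaderInfo) -> str:
    """Return which steps follow the comparison in step S7506."""
    if previous is not None and previous == current:
        # Yes in S7506: copy the previous records instead of re-fetching.
        return "S7507"
    # No in S7506 (first access or changed header): fetch and analyze again.
    return "S7509-S7512"
```

With the values used in this example, an unchanged 0001.html (1024 bytes, December 1, 2002) leads to step S7507, while 0002.html, now 2048 bytes and dated January 1, 2003, leads to steps S7509 to S7512.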
[0262]
That is, in step S7509, the data processing unit 14b acquires the content of http://abcd.co.jp/0002.html, creates most of the content ball 30 described later, and stores the acquired data as a temporary file for analysis by the data analysis unit 142. In the next step S7510, the data analysis unit 142 of the data processing unit 14b analyzes the content and extracts the necessary information to fill in the missing portions of the content ball 30. In the next step S7511, the data registration unit 143 newly registers the data described above in the reference information storage unit 22 and the data information storage unit 23, and in step S7512 the completed content ball 30 is provided to the cooperative system.
[0263]
That is, in step S7511, based on the information extracted in the previous step, the data processing unit 14b records data for the page (0002.html), such as the hierarchy level (2 in this example), the type (HTML in this example), the size (2048 bytes in this example), and the update date and time (00:00 on January 1, 2003, in this example), in the new data information recording area of the data information storage unit 23 (row (8) in FIG. 9). If the page (0002.html) contains link destination information, that information is recorded in the new reference information storage area of the reference information storage unit 22.
[0264]
Then, in step S7511, once the data registration unit 143 of the data processing unit 14b has recorded each piece of information about the page (0002.html) in the reference information storage unit 22 and the data information storage unit 23, it regards all data collection for the page as complete and records a flag indicating this (“OK” in FIG. 9) in the “collection status” column of the new data information recording area (row (8) in FIG. 9) of the data information storage unit 23. It then notifies the data processing management unit 16 of the end of processing (step S7512).
[0265]
In this case, at the time of the second information collection (patrol), page 0002.html contains a link to http://abcd.co.jp/0003.html. When the data processing unit 14b analyzed the content of page 0002.html (step S7510), it extracted the linked page “0003.html” described in the body (content) of 0002.html. Therefore, in step S7511, the data registration unit 143 secures (newly creates) a record for 0003.html (row (9) in FIG. 9) in the new data information storage area of the data information storage unit 23, and records in the “collection status” column of that record a “not yet” flag indicating that the data of the page has not yet been collected.
[0266]
Having extracted 0003.html in step S7510, the data processing unit 14b also records the link information found at the link source (in this case, “http://abcd.co.jp/0002.html”). Accordingly, the data registration unit 143 records the link information (link source, link destination, line number, tag name) in the record secured in the new reference information storage area of the reference information storage unit 22 to indicate that 0002.html links to 0003.html (row (4) in FIG. 9), and records in its “processing state” column a “not yet” flag indicating that the data of the specific content (in this case, the content of the lower page “http://abcd.co.jp/0003.html”) has not all been collected yet. The example shown in row (4) of FIG. 9 is a case in which the body of http://abcd.co.jp/0002.html contains link destination information pointing to 0003.html, and the tag name is the HREF of the A tag.
[0267]
Then, in the information collecting / providing system 1, the reference information processing unit 13, which monitors the “processing state” column of the reference information storage unit 22, responds to the recording performed by the data processing unit 14b. To activate the next data processing unit 14 (14c), it transmits a data processing request including “http://abcd.co.jp/0003.html” to the data processing management unit 16 (step S72 in FIG. 6), and the data processing management unit 16 outputs a start command (step S75).
[0268]
Next, the newly started data processing unit 14c accesses the page based on “http://abcd.co.jp/0003.html” included in the start command and, as before, acquires the header information of http://abcd.co.jp/0003.html with the data acquisition unit 141 (step S7504). Since this is the first access to the page, there is no previously acquired information (No in step S7506), and the page is not on an external site (No in step S7508), the processing of steps S7509 to S7512 is performed in the same manner as described above.
[0269]
In this case, page 0003.html contains no link information. Therefore, in step S7511, the data registration unit 143 of the data processing unit 14c does not secure (newly create) any record in the data information storage unit 23 or the reference information storage unit 22. Once the information on the page (0003.html) of the specific site has been recorded in the new data information recording area (row (9) in FIG. 9) of the data information storage unit 23, it regards all data collection for the page as complete, records a flag indicating this (“OK” in FIG. 9) in the “collection status” column, and changes the “not yet” flag in the “processing state” column of the reference information storage unit 22 (row (4) in FIG. 9) to “completed”. As a result of this processing, every processing state in the reference information storage unit 22 is “completed”, which indicates that the collection of all data at the specific site (http://abcd.co.jp/0001.html) has finished and that the second information collection of this specific site can be ended.
[0270]
That is, when the data registration unit 143 of the data processing unit 14c notifies the data processing management unit 16 of the end of processing in the next step S7512, the reference information processing unit 13, which monitors the “processing state” column of the reference information storage unit 22, determines that “all data of the reference information has been collected” (Yes in step S79 of FIG. 6), and the second information collection of this specific site ends.
[0271]
In the information collection / providing system 1, at the time of the next (third) information collection (patrol) of the specific site, before the data processing unit 14 is activated, the reference information processing unit 13 overwrites the previous reference information storage area of the reference information storage unit 22 with the data in its new reference information storage area (rows (3) and (4) in FIG. 9), and overwrites the previous data information storage area of the data information storage unit 23 with the data in its new data information storage area (rows (7), (8), and (9) in FIG. 9). By performing such processing, the third and subsequent collections can be carried out by the same processing as described above.
[0272]
In this example, the operation such as data collection at a site having only three pages has been described. However, at a site including more pages, the above processing is repeated to collect data.
[0273]
Also, to avoid complicating the explanation, this example has described data collection and related operations at a site where each page links to only one other page. In many cases, however, a page links to a plurality of other pages. In that case, as described above, a plurality of data processing units 14 are activated within the range of the maximum number of simultaneous accesses set in the site information storage unit 21 (see FIG. 3), and each data processing unit 14 performs the processing of FIG. 7 (and FIG. 11 described later).
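One way to picture running several data processing units within a maximum number of simultaneous accesses is a bounded worker pool. The sketch below is an illustration only: `process_linked_pages`, `process_page`, and `max_simultaneous_accesses` are hypothetical names, and a thread pool from the Python standard library stands in for the data processing management unit 16 starting units 14.

```python
from concurrent.futures import ThreadPoolExecutor


def process_linked_pages(urls, process_page, max_simultaneous_accesses):
    """Process every URL, never running more than the configured
    number of page-processing tasks at the same time."""
    with ThreadPoolExecutor(max_workers=max_simultaneous_accesses) as pool:
        # map() preserves the input order of the URLs in its results.
        return list(pool.map(process_page, urls))
```

Here `process_page` would perform the per-page processing of FIG. 7; the pool size plays the role of the limit stored in the site information storage unit 21.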
[0274]
Further, even when a page of a specific site links to a page of another site (for example, when the page 0003.html of the specific site http://abcd.co.jp/0001.html described above links to page 0001.html of another site http://wxyz.co.jp), all the data relating to the one specific site is collected by repeating the above-described processing. However, as described above, in this case the content itself of http://wxyz.co.jp/0001.html is not acquired, and no access is made to any page linked from that page (for example, http://wxyz.co.jp/0002.html).
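The text does not say how a link is judged to point to an external site. As a sketch only, one plausible test is to compare host names; `is_external` is a hypothetical helper and the host-name comparison is an assumption, not something the description specifies.

```python
from urllib.parse import urlsplit


def is_external(site_url: str, link_url: str) -> bool:
    """Assumed rule: a link is external when its host name differs
    from the host name of the specific site being collected."""
    return urlsplit(link_url).hostname != urlsplit(site_url).hostname
```

Under this rule, a crawler would skip fetching the content of an external page and would not follow its links, as described above for http://wxyz.co.jp/0001.html.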
[0275]
FIG. 9 also shows an example in which the site information of another specific site (specific site 2), http://efgh.co.jp/0001.html, is registered. When a plurality of specific sites are registered and information is collected from a plurality of sites, the information collection / providing system 1 performs the same processing for each site simultaneously and in parallel (see FIG. 2). However, in the example shown in FIG. 9, the collection start date and time are set differently for specific site 1 and specific site 2, so their information collection and related processing are not performed at the same time.
[0276]
Further, FIG. 9 shows the contents stored in the reference information storage unit 22 and the data information storage unit 23 when the specific site is a Web site. However, even when the specific site is a directory-tree file system or a group of domain-joined network devices, information collection and provision of information to the cooperative system can be realized by the same processing.
[0277]
(Cooperation between this system and other systems)
Next, cooperation between the information collection / providing system 1 and another system, that is, a process of providing the collected information to the cooperation system and the like will be described in detail with reference to FIGS.
[0278]
FIG. 10 schematically shows the cooperation (connection form) between the data processing device 10 of the information collection / providing system 1 and another system. On the other hand, FIG. 11 is a flowchart for explaining the execution operation of the data processing apparatus 10 and another system, and shows the operation of the subroutine of step S71 in FIG. FIG. 12 shows the content of the content ball as a unified data format transmitted from the data processing device 10 to another system.
[0279]
In the data processing device 10 of the information collecting / providing system 1, the reference information processing unit 13, which manages processing of the entire site, and the data processing unit 14, which processes the data of individual pages included in the site, provide the collected information to the cooperative systems connected to the information collecting / providing system 1 locally or via the network 100.
[0280]
Here, the collected information provided (transmitted) from the information collecting / providing system 1 to the cooperative system consists of the content ball 30 (see FIGS. 10 and 12), which is passed using the RAM in the CPU, and a temporary file (not shown) that stores the content itself. Note that the temporary file may not be provided, depending on the information required by the cooperative system.
[0281]
Further, the information collecting / providing system 1 permits the connected cooperative system to access each of the storage units 21, 22, 23 in the storage device 20, and enables the cooperative system to appropriately refer to the registered data. By doing so, it also provides supplementary information.
[0282]
The collected information (the content ball 30 and the temporary file) is provided (transmitted) from the information collecting / providing system 1 to the cooperative system at the following timings: (a) at the start of site information collection (START), immediately before the information collecting / providing system 1 collects the information of a specific site (that is, during the processing of step S71 in FIG. 6); (b) at the completion of site information collection (END), immediately after the information collection / providing system 1 has finished collecting all the information of the specific site (that is, during the processing of step S81 in FIG. 6); and when the information collection / providing system 1 finishes the data processing of a page included in the specific site (that is, in step S7511 in FIG. 7). The last timing is either (c) page processing completed (INFO), when the individual page was processed normally, or (d) page processing abnormality (ERR), when the data of the page could not be processed normally because an abnormality or the like was detected.
[0283]
The information provided at (a) the start of site information collection or (b) the completion of site information collection includes the site name of the specific site. The information provided at (c) page processing completed includes information indicating the processing result of the page included in the site, and the information provided at (d) page processing abnormality includes information indicating the abnormality detected during processing of the page.
[0284]
The content ball 30 and the temporary file are created by the reference information processing unit 13 of the data processing device 10 in the case of (a) start of site information collection or (b) completion of site information collection, and by the data processing unit 14 of the data processing device 10 in the case of (c) page processing completed or (d) page processing abnormality.
[0285]
The information collecting / providing system 1 provides a common interface to the other systems with which it cooperates. Examples of this common interface include the processing shown in steps S7131 to S7136 in FIG. 11, the processing for registering with the information collection / providing system 1, and the processing for detecting events from the information collection / providing system 1.
[0286]
In the process of registering each cooperative system in the information collecting / providing system 1 (that is, registering the cooperative system name in the site information storage unit 21 of the storage device 20 shown in FIG. 9), the position in FIG. 10 at which the cooperative system is to be coupled can be set on an input screen of a display unit (not shown). The coupling is set using the system name of the cooperative system (see FIG. 9), and the coupling position is stored in the CPU of the information collection / providing system 1.
[0287]
For example, in the example illustrated in FIG. 10, cooperative system A is set to be coupled directly to the information collection / providing system 1, and cooperative system B is set to be coupled to system A, which was coupled earlier. In the present embodiment, adopting this form of coupling guarantees continuity of processing: the information providing process for cooperative system B is performed after the information providing process for cooperative system A.
[0288]
On the other hand, in the example shown in FIG. 10, cooperative system C is set to be coupled directly to the information collection / providing system 1, independently of cooperative systems A and B. In the present embodiment, even when a cooperative system A or the like has been coupled earlier, another system can be coupled directly to the information collection / providing system 1 itself. This also guarantees independence between cooperative systems, so that one system is not affected by the others.
[0289]
In the information collection / providing system 1 of the present embodiment, the form of system cooperation (coupling mode) can thus be selected for each cooperative system according to its processing, which simplifies the development of network applications that use the system.
[0290]
(Data structure of content ball)
By receiving the content ball 30, a system coupled to the information collecting / providing system 1 can learn both the processing status of the data processing device 10 and the information of the specific site. Like the common interface described above, the content ball 30 has the same data structure (data format) for (a) start of site information collection, (b) completion of site information collection, (c) page processing completed, and (d) page processing abnormality. The data structure of the content ball 30 is described below with reference to FIG. 12.
[0291]
As shown in FIG. 12, the content ball 30 includes a message status. The message status takes one of the values “START”, “END”, “INFO”, and “ERR”, which indicate (a) start of site information collection, (b) completion of site information collection, (c) page processing completed, and (d) page processing abnormality, respectively.
[0292]
The cooperative system refers to the message status of the content ball 30 to perform processing unique to that cooperative system. More specifically, for example, when the message status is START, it initializes its own database as pre-processing; when the message status is INFO, it obtains the temporary file from the server by referring to the “temporary file name” of the content ball 30 and searches it for words as post-processing. Such processing unique to the cooperative system is performed in step S7132 (pre-processing) or step S7136 (post-processing) described later.
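How a cooperative system might branch on the message status can be sketched as follows. Everything here is an assumption made for illustration: the function `handle_content_ball`, the dictionary keys (`message_status`, `page_info`, `location`, `temp_file_name`), and the use of a plain dictionary as the cooperative system's database. The sketch also folds pre-processing and post-processing into one function, whereas the description places them in separate steps (S7132 and S7136).

```python
import os
import tempfile


def handle_content_ball(ball: dict, database: dict, search_word: str) -> None:
    """Hypothetical handler run by a cooperative system for each content ball."""
    status = ball["message_status"]
    if status == "START":
        # Pre-processing: initialize the cooperative system's own database.
        database.clear()
    elif status == "INFO" and ball.get("temp_file_name"):
        # Post-processing: read the temporary file and search it for a word.
        with open(ball["temp_file_name"], encoding="utf-8") as f:
            text = f.read()
        database[ball["page_info"]["location"]] = search_word in text


# Usage: a START ball clears stale entries, then an INFO ball is indexed.
database = {"stale": True}
handle_content_ball({"message_status": "START"}, database, "patrol")

with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("<html>site patrol report</html>")

info_ball = {
    "message_status": "INFO",
    "page_info": {"location": "http://abcd.co.jp/0002.html"},
    "temp_file_name": tmp.name,
}
handle_content_ball(info_ball, database, "patrol")
os.remove(tmp.name)  # clean up the demonstration file
```

Balls with status END or ERR, or INFO balls without a temporary file, are ignored by this particular handler; a real cooperative system would choose its own behavior for them.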
[0293]
The content ball 30 includes site information. This site information is the same as the site information in the site information storage unit 21 of the storage device 20. By referring to the site information of the content ball 30, the cooperative system can also access the site information storage unit 21 of the storage device 20, for example according to the settings of its pre-processing and post-processing, and refer to its contents.
[0294]
The content ball 30 contains page information. As shown in FIG. 12, the page information indicates the location of the data on the network (specifically, a URL or the like), its hierarchy level, type, size, update date, and collection status, and has the same content as the corresponding record of the data information storage unit 23. By referring to the page information of the content ball 30, the cooperative system can likewise access the data information storage unit 23 of the storage device 20, for example according to the settings of its pre-processing and post-processing, and refer to the contents of a specific page.
[0295]
The content ball 30 includes a page state. The page state indicates the result of comparing the state of the page collected by the data processing unit 14 with the data from the previous collection (that is, the data registered in the previous data information storage area of the data information storage unit 23). As shown in FIG. 12, NEW is set if the page collected this time is a newly added page, UPDATE is set if the page has been updated since the previous collection, and NONE is set if the page is the same as at the previous collection.
[0296]
Specifically, at the first collection from a specific site, the page state of every content ball 30 is NEW. At the second and subsequent collections, the page state is set to NEW (new page), UPDATE (updated), or NONE (no update) according to the result of comparison with the previous data.
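The three page states can be derived from a comparison with the previous record, as in the sketch below. The names `PageState` and `page_state` are hypothetical, and comparing only the size and the update date and time is an assumption drawn from the header comparison in the worked example; the description does not list the exact fields compared.

```python
from enum import Enum
from typing import Optional


class PageState(Enum):
    NEW = "NEW"        # page did not exist at the previous collection
    UPDATE = "UPDATE"  # page changed since the previous collection
    NONE = "NONE"      # page is unchanged


def page_state(previous: Optional[dict], current: dict) -> PageState:
    """Compare the current record with the previous one (None if absent)."""
    if previous is None:
        return PageState.NEW
    changed = (previous["size"], previous["updated"]) != (
        current["size"], current["updated"])
    return PageState.UPDATE if changed else PageState.NONE
```

At the first collection every page has no previous record, so every call returns `PageState.NEW`, matching the behavior described above.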
[0297]
The content ball 30 includes a temporary file name. The temporary file name indicates the name and location (directory name, etc.) of the temporary file that stores the content acquired from the server via the network 100 by the data acquisition unit 141 of the data processing unit 14. Therefore, the cooperative system can access the temporary file by referring to the temporary file name of the content ball 30 and refer to the content itself.
[0298]
The page information and the page state of the content ball 30 are set when the message status is INFO (page processing completed) or ERR (page processing abnormality); when the message status is START (site information collection start) or END (site information collection end), they are not set and remain empty. The temporary file name of the content ball 30 is set when the data processing device 10 has acquired the content itself and created a temporary file, for example when the page state is NEW (new page) or UPDATE (updated).
[0299]
Next, with reference to the flowchart of FIG. 11, the subroutine of step S71 of FIG. 6 will be described in detail, that is, the process by which the data processing device 10, when it begins collecting information of the specific site, notifies the cooperative system of (a) the start of site information collection (START).
[0300]
FIG. 11 is the flowchart for (a) the start of site information collection (START), but the same flow is also followed for (b) the completion of site information collection (END), (c) page processing completed (INFO), and (d) page processing abnormality (ERR). Accordingly, in the case of (b) site information collection completion (END), the flowchart of FIG. 11 is the subroutine of step S81 of FIG. 6, and in the case of (c) page processing completed (INFO) or (d) page processing abnormality (ERR), it is the subroutine of step S7511 in FIG. 7.
[0301]
In addition, the processing of steps S711 to S713 in FIG. 11 is mainly performed by the reference information processing unit 13 in the case of (a) start of site information collection (START) and (b) completion of site information collection (END). In the case of (c) page processing completed (INFO) and (d) page processing abnormality (ERR), it is mainly performed by the data processing unit 14.
[0302]
On the other hand, the flowchart (steps S7131 to S7136) shown on the right side of FIG. 11 shows processing performed by the cooperative system that has received the content ball 30 in step S713.
[0303]
In the data processing device 10, the reference information processing unit 13 activated based on the above-described activation instruction acquires the site information (URL or the like) of the specific site in step S71 of FIG. and creates the content ball 30 (step S711).
[0304]
In this case, the content ball 30 in FIG. 12 includes only the message status (= START) and site information, and does not include page information and the like.
[0305]
In addition, when (b) site information collection is completed (END), the reference information processing unit 13 creates the content ball 30 as a process in step S81 in FIG.
[0306]
In this case, the content ball 30 in FIG. 12 includes only the message status (= END) and the site information, and does not include the page information and the like.
[0307]
On the other hand, when (c) page processing is completed (INFO) and (d) page processing is abnormal (ERR), the data processing unit 14 creates the content ball 30 from the collected data in step S7511 of FIG.
[0308]
In this case, the content ball 30 contains all the information shown in FIG.
[0309]
In these cases, when the reference information processing unit 13 or the data processing unit 14 acquires the content and creates a temporary file, the temporary file name is also stored in the content ball 30.
[0310]
In the present embodiment, the information collecting / providing system 1 does not create a temporary file when the message status is START, END, or ERR, or when the message status is INFO and the page status is NONE. In other words, in the present embodiment, a temporary file is created only when the message status is INFO and the page status is NEW or UPDATE, that is, only when the system 1 has never collected the current content of the page (including the case where the page has been updated).
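The rule for creating a temporary file can be written as a single predicate. The function name creates_temp_file is hypothetical; the condition itself is taken directly from the paragraph above.

```python
from typing import Optional


def creates_temp_file(message_status: str, page_state: Optional[str] = None) -> bool:
    """Return True only for INFO messages whose page state is NEW or UPDATE."""
    return message_status == "INFO" and page_state in ("NEW", "UPDATE")
```

For example, creates_temp_file("INFO", "NONE") and creates_temp_file("ERR") are both false, because in those cases no new content has been fetched.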
[0311]
In the next step S712, the reference information processing unit 13 (in the case of (c) and (d) above, the data processing unit 14; the same applies hereinafter) acquires the cooperative system name (see FIG. 9) registered in the site information storage unit 21 of the storage device 20, and proceeds to step S713.
[0312]
In step S713, the reference information processing unit 13 transmits the content ball 30 created in S711 to the cooperation system (here, the cooperation system A in FIG. 10).
[0313]
In step S713, the reference information processing unit 13 waits until the cooperative system (the cooperative system A in this case) to which the content ball 30 is transmitted completes the processing in steps S7131 to S7136 described below.
[0314]
Hereinafter, processing performed by the cooperation system (cooperation system A) that is the transmission destination of the content ball 30 will be described with reference to the flowchart on the right side of FIG.
[0315]
Upon receiving the content ball 30 from the reference information processing unit 13 (or the data processing unit 14) in step S7131, the cooperation system A refers to the above-described message status of the received content ball 30 in the next step S7132, performs the first process (pre-process) unique to the cooperative system A as necessary, and then shifts to step S7133.
[0316]
In step S7133, the cooperative system A accesses the site information storage unit 21 of the storage device 20 and obtains from it the cooperation system names registered for the cooperative system A (system A) in the “cooperation system name” column of the specific site (in this case, the cooperation system names of the cooperation system A and of the cooperation system B, which is related to the cooperation system A), and the flow shifts to step S7134.
[0317]
In step S7134, the cooperative system A determines whether the obtained cooperative system names, stored in a CPU (not shown) of the cooperative system A, include another cooperative system (that is, whether another cooperation system name is associated with the cooperative system A). If Yes, that is, if it is determined that another cooperation system other than the cooperation system A takes part in the current cooperation, the process proceeds to step S7135. If No, that is, if it is determined that no cooperative system other than the cooperative system A takes part in the current cooperation, the flow shifts to step S7136.
[0318]
In step S7135, the cooperative system A transfers (ie, copies and transmits) the content ball 30 to another cooperative system (the cooperative system B in FIG. 10 in this case), and then proceeds to step S7136.
[0319]
Note that another cooperative system (the cooperative system B in this case) that has received the content ball 30 from the cooperative system A performs the processes of steps S7131 to S7136, similarly to the cooperative system A. In this case, the cooperation system A waits in step S7135 until the processing of the cooperation system B ends, and when the processing of the cooperation system B ends, the processing proceeds to step S7136.
[0320]
In step S7136, the cooperative system A refers to the above-described message status of the content ball 30, performs a second process (post-process) unique to the cooperative system A as necessary, and then ends the series of processes.
[0321]
Thus, when the processing of steps S7131 to S7136 by the cooperative system A ends, the reference information processing unit 13 (the data processing unit 14 in the case of (c) and (d) above) exits the processing of step S713, and the process proceeds to step S72 in FIG.
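The receiving side of the flow (steps S7131 to S7136) can be sketched as follows. This is a simplified illustration, not the patented implementation: the class CooperativeSystem, the in-memory name table standing in for the "cooperation system name" column of the site information storage unit 21, and the log entries standing in for the unique pre- and post-processes are all assumptions. A plain synchronous call models the waiting described for steps S713 and S7135.

```python
from typing import Dict, List


class CooperativeSystem:
    def __init__(self, name: str, name_table: Dict[str, List[str]],
                 systems: Dict[str, "CooperativeSystem"], log: List[str]) -> None:
        self.name = name
        self.name_table = name_table   # stand-in for the site information storage
        self.systems = systems         # hypothetical registry of reachable systems
        self.log = log                 # records the order of the unique processes

    def receive(self, ball: dict) -> None:
        # S7131: receive the content ball.
        status = ball["message_status"]
        # S7132: first unique process (pre-process), chosen by message status.
        self.log.append(f"{self.name}:pre:{status}")
        # S7133: obtain the cooperative system names registered for this system.
        names = self.name_table.get(self.name, [])
        # S7134: check whether any system other than this one is registered.
        others = [n for n in names if n != self.name]
        # S7135: transfer a copy to each other system and wait until it returns.
        for other in others:
            self.systems[other].receive(dict(ball))
        # S7136: second unique process (post-process), then finish.
        self.log.append(f"{self.name}:post:{status}")
```

With a table such as {"A": ["A", "B"], "B": ["B"]}, a ball sent to system A is pre-processed by A, fully handled by B, and only then post-processed by A, which matches the nesting described for steps S7135 and S7136.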
[0322]
The series of processes described with reference to FIG. 11 is performed once each when the information collection of the specific site is started and finished (that is, in the above-described cases (a) and (b)), and once each time data (a page or the like) within the specific site is processed (that is, in cases (c) and (d) above). Specifically, for example, when a certain Web site (specific site) has 100 pages, the creation / transmission of the content ball 30 and the processing in the cooperative system are performed 1 + 100 + 1 = 102 times.
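The count in the example above follows from emitting one notification at the start, one per page, and one at the end. A minimal sketch, in which the function name and the (URL, success flag) input format are assumptions made for illustration:

```python
from typing import Iterator, List, Tuple


def notification_sequence(pages: List[Tuple[str, bool]]) -> Iterator[str]:
    """Yield one message status per content ball for a single site crawl."""
    yield "START"                        # (a) site information collection start
    for _url, ok in pages:
        yield "INFO" if ok else "ERR"    # (c) page completed or (d) page abnormal
    yield "END"                          # (b) site information collection end


# A site with 100 pages produces 1 + 100 + 1 = 102 content balls.
pages = [(f"http://example.com/page{i}.html", True) for i in range(100)]
total = len(list(notification_sequence(pages)))
```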
[0323]
For this reason, it is desirable that the cooperative system estimate in advance the CPU usage and processing time of its own device for each of the above-described unique processes in step S7132 or step S7136, perform unique processes that use a large amount of CPU and take a long time (a) when the site information collection is started and / or (b) when the site information collection is completed, and perform unique processes that use little CPU and take little time (c) when page processing is completed and / or (d) when a page processing abnormality occurs. The reason is that, in the data processing device 10 of the information collecting / providing system 1, the plurality of data processing units 14 (14a to 14n) usually operate at the same time, so the CPU and the network 100 are used most heavily in cases (c) and (d) above.
[0324]
As described above, in the information collection / provision system 1 of the embodiment, provided data (content ball 30) in the same data format is transmitted to the cooperative system at each of the following times: when data collection for a specific site is started ((a) site information collection start (START)), when data processing for one page in the specific site ends ((c) page processing completed or (d) page processing abnormal), and when data collection for the specific site ends ((b) site information collection completed (END)). The cooperative system serving as the client side can therefore refer to a plurality of different forms of information existing in the specific site without being conscious of the differences in form, which reduces the trouble of management and information reference. In addition, on the cooperative system side, the reception and analysis processing of the content ball 30 can be unified, which reduces the man-hours for creating the program of the cooperative part and improves the development efficiency of network applications using the present system.
[0325]
Then, by analyzing each received content ball 30, the cooperative system can, for example, find new or updated pages of the specific site from information such as the “page state”, learn that a link is broken when the “message status” is “ERR”, determine the total number of pages of the specific site by counting the content balls 30 whose “message status” is “INFO”, and determine the total capacity of the specific site. Furthermore, the cooperative system can obtain various information related to a specific site by accessing the storage units 21, 22, and 23 of the storage device 20 and referring to necessary information.
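The kind of analysis described above might look like the following on the cooperative system side. This is a sketch only: the function name, the dictionary keys, and the shape of the result are hypothetical, and the total-capacity calculation is omitted because the description does not state which field it would be derived from.

```python
from typing import Dict, Iterable, List, Union


def summarize_site(balls: Iterable[dict]) -> Dict[str, Union[int, List[str]]]:
    """Summarize the content balls received for one specific site."""
    total_pages = 0
    new_or_updated: List[str] = []
    broken_links: List[str] = []
    for ball in balls:
        status = ball.get("message_status")
        if status == "INFO":
            # Every completed page counts toward the total number of pages.
            total_pages += 1
            if ball.get("page_state") in ("NEW", "UPDATE"):
                new_or_updated.append(ball["page_info"])
        elif status == "ERR":
            # An abnormal page is treated as a broken link.
            broken_links.append(ball["page_info"])
    return {
        "total_pages": total_pages,
        "new_or_updated": new_or_updated,
        "broken_links": broken_links,
    }
```

START and END balls are simply skipped here, since they carry no page information.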
[0326]
Therefore, according to the information collecting / providing system 1 of the present embodiment, it is possible to provide various information providing services to each cooperative system and to meet various service needs other than search.
[0327]
In the above-described embodiment, an example in which the data processing apparatus 10 acquires data shared on the network 100 via the network 100 has been described. However, the present invention is not limited to this. For example, the data processing device 10 and another computer (not shown) may form a LAN so that the data processing device 10 acquires the data in the other computer, or the data processing device 10 may be directly incorporated into a computer to acquire data in that computer.
[0328]
Furthermore, in the above-described embodiment, a case has been mainly described in which the information collection and provision target is WWW data. However, the present invention is not limited to this. For example, a directory tree type file system may be targeted for information collection: the tree is treated as a hierarchy (link) and the files stored in each directory as collected data, all the files in a certain directory are acquired, subdirectories are detected, and the files in the subdirectories are acquired in turn. By continuing in this way, it is also possible to acquire all data in the directory tree.
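Applied to a directory tree, the same collect-then-follow-links loop can be sketched as below. The function name collect_files is hypothetical, and the breadth-first queue is one possible traversal order chosen for illustration; the description does not prescribe one.

```python
import os
from collections import deque
from typing import List


def collect_files(root: str) -> List[str]:
    """Collect every file under root, treating subdirectories as links."""
    collected: List[str] = []
    pending = deque([root])              # directories still to be visited
    while pending:
        directory = pending.popleft()
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    # A detected subdirectory plays the role of a link page.
                    pending.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # A file plays the role of collected content.
                    collected.append(entry.path)
    return collected
```

Symbolic links are not followed in this sketch so that the traversal cannot loop.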
[0329]
In addition, a domain participation type network may be targeted for information collection: the domain is treated as a hierarchy (link) and the network devices belonging to the domain as collected data. The status of all network devices in a certain domain is acquired, subdomains are detected, and the status of the network devices in the subdomains is acquired in turn, so that it is also possible to acquire the status of all the network devices in the domain (for example, various states such as “starting up” and “operating without problems”).
[0330]
As described above, according to the information collection / providing system 1, high-speed and high-quality information collection is performed, and information in a designated site can be efficiently and quickly collected.
[0331]
Further, according to the information collection and provision system 1, it is possible to reduce the load on hardware resources such as a CPU and a network when collecting information.
[0332]
Further, according to the information collecting / providing system 1, an improvement in service related to information provision after information collection in a designated site is realized.
[0333]
According to the information collecting / providing system 1, information in a local computer or data shared on a network that has a hierarchical (link) relationship but differs in format (for example, not only HTML (Hyper Text Markup Language) but also Flash of Macromedia, a directory tree type file system, a domain participation type network device group, etc.) is recognized as one file system, that is, as an information group of the same format. Data is collected based on the hierarchical relationship, registered in a storage device, and further provided to a cooperative system. As a result, users of this system and other cooperative systems can refer to a plurality of different forms of information existing in a specific site without being particularly conscious of the differences, and the time and effort for management and information reference are greatly reduced.
[0334]
【Effects of the Invention】
As described above in detail, according to the present invention, a system for high-speed and high-quality information collection is constructed, and an information collection system, an information collection method, and an information collection program that efficiently and quickly collect information in a designated site can be provided.
[0335]
Further, according to the present invention, it is possible to provide an information collection system, an information collection method, and an information collection program capable of reducing the load on hardware resources such as a CPU and a network when collecting information.
[0336]
Further, according to the present invention, it is possible to provide an information collection system, an information collection method, and an information collection program, which realize improved services related to information provision after information collection within a designated site.
[Brief description of the drawings]
FIG. 1 is a functional block diagram showing a schematic configuration of an information collecting / providing system to which the present invention is applied.
FIG. 2 is a view for explaining a plurality of configurations in a reference information processing section and a data processing section in the information collection / providing system in a case where information collection is performed simultaneously and in parallel on a plurality of sites and a plurality of contents.
FIG. 3 is a diagram showing an operation of the information collection / providing system when information is collected simultaneously and in parallel for a plurality of contents in comparison with an operation of a system of a conventional robot type engine, and is a diagram for describing an operation of a notification (calling) function between the data processing management unit and each data processing unit when two data processing units collect information on mutually different contents in the device.
FIG. 4 is a diagram for describing a configuration in a data processing unit in the information collection / providing system when a process is performed transparently for a plurality of different protocols and data formats.
FIG. 5 is a flowchart showing an outline of the operation of the entire system when information is acquired for a specific site in the information collection / providing system, and is mainly for describing operations up to the start of information acquisition.
FIG. 6 is a flowchart for explaining an outline of an information acquisition execution operation for one specific site, and shows processing of a routine derived from step S7 in FIG. 5.
FIG. 7 is a flowchart for explaining details of an information acquisition execution operation for one page in one specific site, and shows an operation of a routine derived from step S75 in FIG. 6.
FIG. 8 is a diagram illustrating operations of a reference information processing unit and each data processing unit in the information collection / providing system, data stored in each storage unit of the storage device, and the like.
FIG. 9 is a diagram illustrating a data table of information stored in each storage unit of the storage device in the information collection / providing system.
FIG. 10 is a diagram schematically showing cooperation between a data processing device of an information collection / providing system and another system.
FIG. 11 is a flowchart showing a process in which the information collection / providing system creates a content ball and transmits the content ball to the cooperative system, and a process performed by the cooperative system which has received the content ball, and is a diagram for explaining a subroutine of step S7511 in FIG. 7.
FIG. 12 is a diagram illustrating a data structure of a content ball transmitted from the information collecting / providing system to the cooperation system.
[Explanation of symbols]
1 Information collection and provision system
10 Data processing device
11 System management unit
12 Site management unit (management means)
13 (13a, 13b, 13c,...) Reference information processing unit (management means, provided data generation means)
14 (14a, 14b, 14c, ...) Data processing unit (data processing means, provided data generation means)
141 (141a, 141b,...) Data acquisition unit (page access means, header information acquisition means, determination means, content acquisition means)
142 (142a, 142b, ...) Data analysis unit (content acquisition means)
143 Data registration unit (information registration means)
15 Reference information processing management unit (management means)
16 (16a, 16b, 16c, ...) Data processing management unit (management means)
20 Storage device
21 Site information storage unit
22 Reference information storage unit
23 Data information storage unit
30 Content ball (provided data)
100 Network

Claims

An information collection system comprising:
a data processing device comprising a plurality of data processing means, each for accessing one page in one site and collecting and processing various data on a plurality of contents constituting the site, and a management means for managing the data processing means; and
a storage device for storing at least specific site information including information indicating a site for the data processing device to access first, and information on each content in the site,
wherein the data processing means of the data processing device has: page access means for accessing a page based on the specific site information stored in advance in the storage device and a link page linked to the page; header information obtaining means for obtaining the header information of the accessed page; content obtaining means for obtaining the content of the accessed page; link information obtaining means for obtaining the link information indicating the location of the link page linked to the accessed page; and information registration means for registering, in the storage device, predetermined information based on each piece of acquired information, and
the management means of the data processing device has: activation management means for activating n (n is 1 or more) of the data processing means based on information stored in the storage device and the link information obtained by the link information obtaining means; and end management means for ending data collection for the site based on the link information obtained by the link information obtaining means.

In the storage device, collection start information indicating a collection start date and time is stored as the specific site information,
2. The information collection system according to claim 1, wherein the activation management unit of the data processing device activates the data processing unit based on the collection start information.

In the storage device, as the specific site information, coexistence number upper limit information indicating the maximum simultaneous coexistence number of the data processing unit that processes each page in the one site is stored,
The information collection system according to claim 1, wherein the activation management unit of the data processing device activates the data processing unit within a range of the maximum number of simultaneous coexistences of the coexistence number upper limit information.

The data processing means, when the registration in the storage device by the information registration means is completed, notifies the management means that the processing has been completed,
4. The information collection system according to claim 1, wherein the management unit performs, based on the notification, activation of the data processing unit by the activation management unit or termination of data collection by the termination management unit.

The data processing device has a determination unit that determines whether there is a change in the content based on the header information acquired by the header information acquisition unit,
The information collection system according to any one of claims 1 to 4, wherein the content acquisition unit does not acquire the content of the accessed page when the determination unit determines that there is no change.

The storage device includes a storage area for storing header information acquired by the header information acquisition unit,
6. The information collection system according to claim 5, wherein the determination unit compares the currently acquired header information with the previously acquired header information stored in the storage device, and determines that there is a content change if they do not match.

The data processing device has a site determination unit that determines whether the page is a page in the site based on the header information acquired by the header information acquisition unit,
7. The information collection system according to claim 1, wherein the content acquisition unit does not acquire the content of the accessed page when the site determination unit determines that the page is not a page in the site.

The content acquisition unit includes a plurality of types of analysis programs for analyzing the content of the page,
The link information obtaining means obtains link information including content type information indicating a type of content of the page,
Upon activation of the data processing unit, the activation management unit outputs a data processing request including the link information,
The information collection system according to any one of claims 1 to 7, wherein the content acquisition unit in the activated data processing unit analyzes the content of the page accessed by the page access unit, using an analysis program corresponding to the content type information included in the data processing request.

The page access unit includes a plurality of types of programs regarding a communication protocol for accessing the page,
The link information obtaining unit obtains link information including information on a method of accessing the link page from the content obtained by the content obtaining unit,
Upon activation of the data processing unit, the activation management unit outputs a data processing request including the link information,
9. The information collection system according to claim 1, wherein the page access unit of the activated data processing unit accesses the link page using a program corresponding to the access method included in the data processing request.

In the storage device, as the specific site information, a cooperative system name of another system that wants information about the one site is stored,
The information collection system according to any one of claims 1 to 9, wherein the data processing apparatus further comprises: provided data generation means for generating, based on each piece of information collected and processed by the data processing unit, provided data to be provided to the other system; and provided data transmission means for transmitting the generated provided data to the other system based on the cooperative system name.

11. The information collection system according to claim 10, wherein the provided data generation unit generates the provided data in the same data format at the start of data collection for the site, at the end of processing of one page in the site by the data processing unit, and at the end of data collection for the site.

12. The information collection system according to claim 10, wherein the provided data transmitting unit transmits the content acquired by the content acquiring unit to another system based on the cooperative system name.

An information collection method using:
a data processing device comprising a plurality of data processing means, each for accessing one page in one site and collecting and processing various data on a plurality of contents constituting the site, and a management means for managing the data processing means; and
a storage device for storing at least specific site information including information indicating a site to be accessed first by the data processing device and information on each content in the site,
wherein the data processing means of the data processing device performs: a page access process of accessing a page based on the specific site information stored in advance in the storage device and a link page linked to the page; a header information acquisition process of acquiring header information of the accessed page; a content acquisition process of acquiring the content of the accessed page; a link information acquisition process of acquiring link information indicating the location of a link page linked to the accessed page; and an information registration process of registering, in the storage device, predetermined information based on each piece of acquired information, and
the management means of the data processing device performs: an activation process of activating n (n is 1 or more) of the data processing means based on the information stored in the storage device and the link information acquired in the link information acquisition process; and a termination process of terminating data collection for the site based on the link information acquired in the link information acquisition process.

In the storage device, collection start information indicating a collection start date and time is stored as the specific site information,
14. The information collection method according to claim 13, wherein, in the activation process, the data processing unit is activated based on the collection start information.

In the storage device, as the specific site information, coexistence number upper limit information indicating the maximum simultaneous coexistence number of the data processing unit that processes each page in the one site is stored,
15. The information collection method according to claim 13, wherein, in the activation process, the data processing unit is activated within a range of a maximum number of simultaneous coexistences of the coexistence number upper limit information.

The data processing unit, when the information registration process is completed, executes a notification process of notifying the management unit that the process has been completed,
16. The information collection method according to claim 13, wherein the management unit executes the activation processing or the termination processing based on the notification.

Prior to the content acquisition process, the data processing device performs a determination process to determine whether to acquire the content,
17. The information collection method according to claim 13, wherein when it is determined that the content should not be acquired, the content acquisition process is not performed.

The storage device includes a storage area for storing header information acquired in the header information acquisition process,
The information collection method according to claim 17, wherein, in the determination processing, the header information acquired this time is compared with the header information acquired last time stored in the storage device, and if they match, it is determined that the content should not be acquired.

The data processing device performs a site determination process to determine whether the page is a page in the site, based on the header information acquired in the header information acquisition process,
19. The information collection method according to claim 13, wherein the content acquisition processing is not performed when it is determined that the page is not a page in the site.

The data processing means of the data processing device includes a plurality of types of analysis programs for analyzing the content of the page,
In the link information obtaining process, link information including content type information indicating the type of content of the page is obtained,
In the activation processing, upon activation of the data processing unit, a data processing request including the link information is output,
The information collection method according to any one of claims 13 to 19, wherein the activated data processing means, in the content acquisition processing on the page accessed by the page access processing, analyzes the content using an analysis program corresponding to the content type information included in the data processing request.

The data processing unit of the data processing device includes a plurality of types of programs regarding a communication protocol for accessing a site,
In the link information obtaining process, from the content obtained in the content obtaining process, obtain link information including information on an access method to the link page,
In the activation processing, upon activation of the data processing unit, a data processing request including the link information is output,
21. The information collection method according to claim 13, wherein the activated data processing means executes the page access processing using a program corresponding to an access method included in the data processing request.

In the storage device, as the specific site information, a cooperative system name of another system that wants information about the one site is stored,
22. The information collection method according to claim 13, wherein the data processing device performs a provided data generation process of generating, based on each piece of information obtained by the data processing unit, provided data to be provided to the other system, and a provided data transmission process of transmitting the generated provided data to the other system based on the cooperation system name.

23. The information collection method according to claim 22, wherein, in the provided data generation processing, the provided data is generated in the same data format at the start of data collection for a site, at the end of processing by a data processing unit for one page in the site, and at the end of data collection for the site.

24. The information collection method according to claim 22, wherein, in the provided data transmission processing, the content acquired in the content acquisition processing is transmitted to another system based on the cooperation system name.

25. An information collection program for an information collection system comprising: a data processing device having a plurality of data processing means, each accessing one page in one site to collect and process various data on a plurality of contents constituting the site, and management means for managing the data processing means; and
a storage device for storing at least specific site information including information indicating the site to be accessed first by the data processing device and information on each content in the site, the program causing:
the data processing means of the data processing device to function as page access means for accessing a page based on the specific site information or a link page linked to that page, header information acquisition means for acquiring header information of the accessed page, content acquisition means for acquiring the content of the accessed page, link information acquisition means for acquiring link information indicating the location of a link page linked to the accessed page, and information registration means for registering predetermined information in the storage device based on the acquired information; and
the management means of the data processing device to function as activation management means for activating n (n being 1 or more) of the data processing means based on the information stored in the storage device and the link information acquired by the link information acquisition means, and as end management means for ending data collection for the site based on the link information acquired by the link information acquisition means.
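The division of roles in claim 25 can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the in-memory `SITE` table stands in for real page access, and the names `process_page` and `collect_site` are hypothetical. Workers run one after another rather than in parallel, which keeps the example deterministic.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (header information, content, links).
SITE = {
    "/index": ({"Last-Modified": "Mon"}, "home", ["/a", "/b"]),
    "/a": ({"Last-Modified": "Tue"}, "page a", ["/b", "/index"]),
    "/b": ({"Last-Modified": "Wed"}, "page b", []),
}


def process_page(url, storage):
    """One data processing means: handles exactly one page."""
    headers, content, links = SITE[url]  # page access, header and content acquisition
    storage[url] = {"headers": headers, "content": content}  # information registration
    return links  # link information handed back to the manager


def collect_site(start_url):
    """Management means: activates workers from link information, then ends."""
    storage = {}
    pending = deque([start_url])
    seen = {start_url}
    while pending:  # end management: collection ends when no unvisited link remains
        url = pending.popleft()
        for link in process_page(url, storage):  # activation of one worker per page
            if link in SITE and link not in seen:
                seen.add(link)
                pending.append(link)
    return storage
```

Calling `collect_site("/index")` registers all three pages and stops once the link information yields no page that has not already been processed.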

26. The information collection program according to claim 25, wherein collection start information indicating a collection start date and time is stored in the storage device as the specific site information, and
the activation management means functions as means for activating the data processing means based on the collection start information.

27. The information collection program according to claim 25, wherein coexistence number upper limit information indicating the maximum number of data processing means that may simultaneously process pages in the one site is stored in the storage device as the specific site information, and
the activation management means functions as means for activating the data processing means within the maximum simultaneous coexistence number indicated by the coexistence number upper limit information.
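One way to picture the upper limit on simultaneously coexisting data processing means is a fixed-size worker pool. This is only a sketch under assumptions: the use of Python's `ThreadPoolExecutor`, the name `run_with_limit`, and the short sleep standing in for page processing are all illustrative and not taken from the patent.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(pages, max_coexistence):
    """Process pages with at most max_coexistence workers alive at once."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def worker(page):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)  # stand-in for fetching and processing one page
        with lock:
            state["active"] -= 1
        return page

    # The pool size plays the role of the coexistence number upper limit.
    with ThreadPoolExecutor(max_workers=max_coexistence) as pool:
        done = list(pool.map(worker, pages))
    return done, state["peak"]
```

The returned peak is the largest number of workers observed running at the same moment, and it never exceeds `max_coexistence`.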

28. The information collection program according to claim 25, wherein the data processing means functions as means for notifying the management means that processing has been completed when registration in the storage device by the information registration means is completed, and
the management means functions as means for performing, based on the notification, either activation of a data processing means by the activation management means or ending of the data collection by the end management means.

29. The information collection program according to any one of claims 25 to 28, wherein the data processing device functions as determination means for determining whether the content has changed based on the header information acquired by the header information acquisition means, and
when the determination means determines that there is no change, the content acquisition means functions as means that does not acquire the content of the accessed page.

30. The information collection program according to claim 29, wherein the storage device functions to include a storage area for storing the header information acquired by the header information acquisition means, and
the determination means functions as means for comparing the header information acquired this time with the previously acquired header information stored in the storage device, and for determining that the content has changed if the two do not match.
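The header comparison can be sketched as follows. The claims do not say which header fields are compared, so the choice of `ETag`, `Last-Modified`, and `Content-Length` is an assumption, as are the function name `has_changed` and the use of a plain dictionary as the storage area.

```python
def has_changed(url, new_headers, header_store,
                keys=("ETag", "Last-Modified", "Content-Length")):
    """Compare this visit's headers with those stored on the previous visit."""
    previous = header_store.get(url)  # None on the first visit
    current = {key: new_headers.get(key) for key in keys}
    header_store[url] = current  # keep for the next collection run
    return previous != current  # a mismatch is treated as "content changed"
```

A page seen for the first time has no stored headers, so it is reported as changed and its content is acquired. On later runs the content would be fetched only when the stored and current values differ.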

31. The information collection program according to claim 25, wherein the data processing device functions as site determination means for determining, based on the header information acquired by the header information acquisition means, whether the page is a page in the site, and
the content acquisition means functions as means that does not acquire the content of the accessed page when the site determination means determines that the page is not a page in the site.

32. The information collection program according to any one of claims 25 to 31, wherein the content acquisition means functions as means having a plurality of types of analysis programs for analyzing the content of a page,
the link information acquisition means functions as means for acquiring link information including content type information indicating the type of content of the page,
the activation management means functions as means for outputting, upon activating a data processing means, a data processing request including the link information, and
the content acquisition means of the activated data processing means functions as means for analyzing the content of the page accessed by the page access means using the analysis program corresponding to the content type information included in the data processing request.
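Selecting an analysis program from the content type information carried in the data processing request might look like the sketch below. The request layout (a dictionary with `url` and `content_type`), the two analyzers, and their regular expressions are assumptions made for illustration. A production crawler would use a real HTML parser rather than a regular expression.

```python
import re


def analyze_html(body):
    """Hypothetical HTML analysis program: extract href targets."""
    return re.findall(r'href="([^"]+)"', body)


def analyze_text(body):
    """Hypothetical plain-text analysis program: extract bare URLs."""
    return re.findall(r"https?://\S+", body)


# One analysis program per content type.
ANALYZERS = {"text/html": analyze_html, "text/plain": analyze_text}


def analyze(request, body):
    """Pick the analysis program named by the request's content type."""
    content_type = request["content_type"].split(";")[0].strip().lower()
    analyzer = ANALYZERS.get(content_type)
    if analyzer is None:
        return []  # no analysis program registered for this type
    return analyzer(body)
```

Because the content type travels inside the request, the activated worker does not need to inspect the page again to decide how to parse it.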

33. The information collection program according to claim 25, wherein the page access means functions as means having a plurality of types of access programs, one for each communication protocol used to access a page,
the link information acquisition means functions as means for acquiring, from the content acquired by the content acquisition means, link information including information about the access method for the link page,
the activation management means functions as means for outputting, upon activating a data processing means, a data processing request including the link information, and
the page access means of the activated data processing means functions as means for accessing the link page using the access program corresponding to the access method included in the data processing request.
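The choice of access program by access method can be sketched in the same spirit. Here the access method is assumed to be the URL scheme, and the two access programs are stubs that only describe what they would do. Both the stubs and the request layout are hypothetical and perform no network access.

```python
from urllib.parse import urlsplit


def access_http(url):
    """Stub access program for HTTP and HTTPS."""
    return f"HTTP GET {url}"


def access_ftp(url):
    """Stub access program for FTP."""
    return f"FTP RETR {url}"


ACCESSORS = {"http": access_http, "https": access_http, "ftp": access_ftp}


def make_request(link):
    """Build a data processing request that carries the access method."""
    return {"url": link, "access_method": urlsplit(link).scheme}


def access(request):
    """Dispatch to the access program matching the request's access method."""
    accessor = ACCESSORS.get(request["access_method"])
    if accessor is None:
        raise ValueError(f"no access program for {request['access_method']!r}")
    return accessor(request["url"])
```

For example, `access(make_request("ftp://example.com/f.txt"))` is routed to the FTP stub, while an `https` link goes to the HTTP stub.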

34. The information collection program according to any one of claims 25 to 33, wherein the storage device functions as means for storing, as the specific site information, a cooperation system name of another system that wants information about the one site, and
the data processing device functions as provided data generation means for generating, for each piece of information collected and processed by the data processing device, provided data to be provided to the other system, and as provided data transmission means for transmitting the generated provided data to the other system based on the cooperation system name.

35. The information collection program according to claim 34, wherein the provided data generation means functions as means for generating the provided data in the same data format at the start of data collection for the site, at the end of processing by a data processing means for one page in the site, and at the end of data collection for the site.
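Generating provided data in one format for all three events could be sketched as below. The JSON encoding, the field names, and the event labels are assumptions, since the claims require only that the format be the same in each case.

```python
import json


def make_provided_data(event, site, page=None, payload=None):
    """Build one provided-data record; every event uses the same four fields."""
    record = {
        "event": event,  # e.g. "collection_start", "page_done", "collection_end"
        "site": site,
        "page": page,  # None for site-level events
        "payload": payload or {},
    }
    return json.dumps(record, sort_keys=True)
```

A receiving system can then parse start, per-page, and end notifications with a single routine and branch only on the `event` field.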

36. The information collection program according to claim 34, wherein the provided data transmission means functions as means for transmitting the content acquired by the content acquisition means to the other system based on the cooperation system name.