JP2002245061A - Keyword extraction

Info

Publication number: JP2002245061A
Application number: JP2001036577A
Authority: JP
Inventors: Takashige Tanaka; 敬重田中
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2001-02-14
Filing date: 2001-02-14
Publication date: 2002-08-30

Abstract

(57) [Abstract]

[Problem] It has been difficult to build a database that allows large volumes of text data, such as web pages, to be searched easily and accurately.

[Solution] Web page data, each page being one unit of text data, is collected by a crawling engine and subjected to morphological analysis to extract words. For these words, TFIDF, a measure of how unevenly a word's occurrences are distributed, is calculated, and only words scoring at or above a predetermined value are taken as keywords. Using these words, a vector representing the text data is computed and a database is built. At search time, a query sentence is entered, keywords are cut out of it, the vector represented by those keywords is compared against the database, and similar sites are output. Because similarity is judged on vectors expressed by the words that characterize a document, rather than by simple word comparison, search accuracy can be improved.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to techniques for searching and classifying text data that forms a coherent unit, and more specifically to techniques for efficiently extracting keywords, assigning classifications to build a database, generating summaries, or performing searches.

[0002]

[Prior Art] Various methods have been proposed for handling data consisting mainly of large volumes of text, such as web pages accessible on the Internet. For example, the many search engines that exist on networks such as the Internet for the purpose of searching such web pages let a client submit search keywords and then refer the client to pages containing text with the matching keywords.

[0003] Such searches rely on techniques such as crawling each site before the client runs a search, collecting all the text data found there and building a database from it, or building the database from the text of the top page only. In these cases, words have been extracted from the text data by preparing a thesaurus or the like in advance and extracting only the words found in it, or by simply extracting runs of kanji or katakana as words.

[0004] To identify the relevant text data in a search, some methods only check whether a keyword is present, but methods have also been proposed that compute a vector from the many words taken from the text data and judge its similarity to a vector computed from the search keywords. If the thesaurus contains 10,000 words, for example, a space spanned by those 10,000 words is assumed, and the vector that the words contained in a particular text form in this space is computed in advance. Each component of the vector varies with the frequency of occurrence of the corresponding word. For example, as shown in FIG. 18(A), if the word "mountain" (山) appears three times and the word "river" (川) appears five times as keywords in a text A, the vector A of this text is expressed, as shown in FIG. 18(B), as a vector having "mountain" and "river" as its components. Similarly, if "mountain" appears only twice and "river" never appears in a text B, its vector B lies along the "mountain" axis, and if only "river" appears, three times, in a text C, its vector C lies along the "river" axis, as illustrated. When the keywords "mountain, river" are given, by contrast, the resulting vector D is, as illustrated, a unit vector having a "mountain" component and a "river" component, and comparison with vectors A, B, and C determines that text A has the highest similarity.
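The comparison in the FIG. 18 example can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular prior-art system: the text above does not name the similarity measure, so cosine similarity is assumed here, and the function and variable names are hypothetical. The query D is written as one unit along each axis; scaling it does not change the cosine.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Components are (count of "mountain", count of "river"), as in FIG. 18.
texts = {"A": (3, 5), "B": (2, 0), "C": (0, 3)}
query_d = (1, 1)  # keywords "mountain, river" (assumed encoding)

scores = {name: cosine_similarity(vec, query_d) for name, vec in texts.items()}
best = max(scores, key=scores.get)  # the text judged most similar to the query
```

Under this assumed measure, text A scores highest, while B and C score the same because each lies along a single axis.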

[0005]

[Problems to Be Solved by the Invention] Such search techniques, however, cannot handle large volumes of text data efficiently. A simple keyword search produces so much noise that the set of retrieved text data becomes enormous. Taking Internet sites as an example, suppose a crawling engine acquires the text data of sites around the world connected to the Internet and the words they contain are registered as keywords. A search on a word such as "PC" then hits hundreds of thousands of sites, because a page matches even if the word appears only in a passing sentence such as "The questionnaire can also be accessed from a PC."

[0006] On the other hand, methods that use the words contained in the text data to obtain a vector for the data as a whole, and then use this vector to judge similarity and reflect it in the search results, require a great deal of computation time and effort because the text data of a single site contains so many words.

[0007] These problems are not confined to search. They arise throughout techniques for handling natural-language text data, such as building search databases and creating summaries.

[0008] The object of the present invention is to solve these problems and to realize a highly accurate technique for handling text data while reducing the amount of computation.

[0009]

[Means for Solving the Problems, and Their Operation and Effects] The keyword extraction method of the present invention, which solves at least part of the above problems, is a method of extracting, from text data forming a coherent unit, keywords for performing predetermined processing on that text data. In essence, the method subjects the text data to morphological analysis to extract words, evaluates the degree to which each extracted word occurs frequently in a skewed manner within the text data, and extracts words whose evaluation value is at or above a predetermined value as keywords of the text data.

[0010] The summary generation method of the present invention, which uses the same technique, is a method of generating a summary from text data forming a coherent unit. In essence, the method subjects the text data to morphological analysis to extract words, evaluates the degree to which each extracted word occurs frequently in a skewed manner within the text data, extracts words whose evaluation value is at or above a predetermined value as keywords of the text data, and combines the extracted keywords to generate a summary.

[0011] Because this keyword extraction technique extracts words from the text data by morphological analysis, there is no need to prepare a thesaurus or the like for extraction in advance. Moreover, because it evaluates the degree to which each extracted word occurs frequently in a skewed manner within the text data and takes as keywords the words whose evaluation value is at or above a predetermined value, the number of keywords can be reduced without lowering their accuracy. It is known that in natural-language text, frequently occurring words tend to be keywords. Because the technique uses not just high frequency but the degree to which the occurrences are skewed, keywords can be extracted while excluding general-purpose words such as "こと" ("thing") and "時" ("time"). Furthermore, combining the keywords obtained in this way yields an accurate summary very simply. Summary generation by combining keywords can take various forms. These range from a simple configuration that produces a fixed-form sentence such as "This text is about" + "(the extracted keywords)" + "." to configurations that take the nouns and verbs obtained by morphological analysis, the strength of their association, and information on their positions in the text data (for example, whether they occur in the same sentence) and combine them appropriately to output a summary.
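The simplest configuration mentioned above, the fixed-form sentence, might look like the following sketch. The function name, the English wording of the template, and the comma separator are assumptions made for illustration. The source gives the template only in Japanese and does not specify how the keywords are joined.

```python
def template_summary(keywords):
    """Build a fixed-form summary from extracted keywords ("" if there are none)."""
    if not keywords:
        return ""
    # Assumed English rendering of the fixed-form sentence described in the text.
    return "This text is about " + ", ".join(keywords) + "."

example_summary = template_summary(["sedan", "quality"])
```

The more elaborate configurations, which use word associations and sentence positions, would need the output of the morphological analyzer and are not sketched here.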

[0012] Here, the text data forming a coherent unit can be assumed to be data residing in a site reachable over a network, that is, a so-called web page. The number of sites connected to a network such as the Internet, and the amount of such text data residing on them, is enormous, so the keyword extraction of the present invention is highly effective.

[0013] When extracting words by the morphological analysis, it is also preferable to restrict extraction to a subset of words including nouns and sa-hen nouns (verbal nouns that take "suru"), since this reduces the number of words to be considered. In Japanese, nouns and sa-hen nouns are known to carry much of the meaning. Of course, because morphological analysis is used, verbs can also easily be extracted in their base form. It is also possible to further select basic words, known as basic vocabulary, from among the verbs, for example "run", "drink", and "eat", and use them as keywords.

[0014] The degree to which a word occurs frequently in a skewed manner can be evaluated by a value obtained by normalizing the number of times the word appears in the text data by the amount of that text data. One such measure is known as TFIDF, which is defined by the following equations. In these equations, db is the collection of text data under consideration (normally the data to be covered by the database), d is an individual text making up that collection, and t is a word contained in that text.

[0015]

TFIDF = TF(d, t) × Idf(t)   ... (1)

where TF(d, t) is the number of times the word t appears in the text d, and Idf(t) is given by equation (2):

Idf(t) = LOG{DB(db) / f(t, db)}   ... (2)

Here, DB(db) is the number of texts in the collection db, and f(t, db) is the number of texts in the collection in which the word t appears.
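Equations (1) and (2) can be sketched directly in Python. This is an illustrative reading under stated assumptions: each text is already a list of extracted words, LOG is taken as the natural logarithm (the source does not give the base), and the function names, the sample collection, and the threshold of 1.0 are hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(db):
    """Return one {word: TFIDF} dict per text; db is a list of word lists."""
    n_texts = len(db)                      # DB(db): number of texts in the collection
    doc_freq = Counter()                   # f(t, db): number of texts containing t
    for words in db:
        doc_freq.update(set(words))
    scores = []
    for words in db:
        tf = Counter(words)                # TF(d, t): occurrences of t in this text
        scores.append({t: tf[t] * math.log(n_texts / doc_freq[t]) for t in tf})
    return scores

def extract_keywords(score, threshold):
    """Keep the words whose TFIDF is at or above the threshold."""
    return sorted(t for t, s in score.items() if s >= threshold)

sample_db = [
    ["sedan", "quality", "sedan", "thing"],
    ["river", "mountain", "thing"],
    ["river", "thing", "river"],
]
sample_scores = tfidf_scores(sample_db)
sample_keywords = extract_keywords(sample_scores[0], threshold=1.0)
```

In this toy collection the word "thing" appears in every text, so its Idf is LOG(3/3) = 0 and it is never selected, which mirrors how general-purpose words are filtered out.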

[0016] The keyword extraction technique described above can also be used to build a database. The database construction method of the present invention is a method of classifying a plurality of units of text data by keywords to build a database. In essence, for each of the plurality of units of text data in turn, the method subjects the text data to morphological analysis to extract words, evaluates the degree to which each extracted word occurs frequently in a skewed manner within the text data, extracts words whose evaluation value is at or above a predetermined value as keywords of the text data, and classifies the text data by the vector expressed by the extracted keywords, thereby building a database in which the plurality of units of text data are classified at least by those vectors.

[0017] With this database construction method, words are extracted from the text data by morphological analysis, so there is no need to prepare a thesaurus or the like for extraction in advance. Moreover, because the method evaluates the degree to which each extracted word occurs frequently in a skewed manner within the text data and takes as keywords the words whose evaluation value is at or above a predetermined value, the number of keywords can be reduced without lowering their accuracy. It is known that in natural-language text, frequently occurring words tend to be keywords. Because the method uses not just high frequency but the degree to which the occurrences are skewed, keywords can be extracted while excluding general-purpose words such as "こと" ("thing") and "時" ("time").

[0018] On that basis, the text data in question can be classified by the vectors expressed by the extracted keywords, and a database classified at least by these vectors can be built.

[0019] In this database construction method, a category may additionally be specified for each unit of the text data, and the vector-based classification may be performed separately for each category when the database is built. Because similar words can appear in different categories, classifying with categories prepared in advance is effective in raising the accuracy of the database. For example, even if the same word "PC" occurs frequently and in a skewed manner, its significance to a person searching is entirely different on a mail-order site and on a site intended to explain technical terms. Separating such sites by category in advance is therefore also effective for subsequent searches.

[0020] A search method paired with a database built by this construction method can also be conceived. That is, it is a method of searching a plurality of units of text data by keywords. In essence, for each unit in turn, the method subjects the text data to morphological analysis to extract words, evaluates the degree to which each extracted word occurs frequently in a skewed manner within the text data, extracts words whose evaluation value is at or above a predetermined value as keywords of the text data, and classifies the text data by the vector expressed by the extracted keywords, so that a database in which the plurality of units of text data are classified at least by those vectors is built in advance. When search keywords are entered, a vector formed from the search keywords is obtained, and matching text data is retrieved from the database according to its similarity to that vector.
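The retrieval step described above can be sketched as a ranking over stored keyword vectors. The sketch assumes that each database entry maps an address to a sparse {keyword: weight} vector, that the query vector gives every search keyword weight 1, and that similarity is the cosine. The source fixes none of these details, and the addresses, weights, and function names below are hypothetical.

```python
import math

def cosine_sparse(u, v):
    """Cosine similarity between two sparse vectors stored as {word: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(database, query_keywords, top_n=3):
    """Return up to top_n addresses whose vectors are most similar to the query."""
    query_vec = {t: 1.0 for t in query_keywords}
    ranked = sorted(
        ((cosine_sparse(vec, query_vec), url) for url, vec in database.items()),
        reverse=True,
    )
    return [url for score, url in ranked[:top_n] if score > 0]

# Hypothetical database of keyword vectors keyed by address.
database = {
    "http://example.com/cars": {"sedan": 2.2, "quality": 1.1},
    "http://example.com/hiking": {"mountain": 3.3, "river": 1.1},
    "http://example.com/pc": {"PC": 2.0, "memory": 1.5},
}
hits = search(database, ["sedan", "quality"])
```

Entries that share no keyword with the query score zero and are dropped, so only the car page is returned for this query.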

[0021] Because this approach compares vectors rather than individual keywords, it enables searches that take into account the region indicated by the keywords as a whole, in other words their semantic coherence.

[0022] Here, the text data forming a coherent unit can be assumed to be data residing in a site reachable over a network, that is, a so-called web page. The number of sites connected to a network such as the Internet, and the amount of such text data residing on them, is enormous, so the database construction of the present invention is highly effective.

[0023] When extracting words by the morphological analysis, it is also preferable to restrict extraction to a subset of words including nouns and sa-hen nouns (verbal nouns that take "suru"), since this reduces the number of words to be considered. In Japanese, nouns and sa-hen nouns are known to carry much of the meaning. Of course, because morphological analysis is used, verbs can also easily be extracted in their base form. It is also possible to further select basic words, known as basic vocabulary, from among the verbs, for example "run", "drink", and "eat", and use them as keywords.

[0024] The degree to which a word occurs frequently in a skewed manner can be evaluated by a value obtained by normalizing the number of times the word appears in the text data by the amount of that text data. One such measure is known as TFIDF (described above).

[0025] Naturally, inventions corresponding to these keyword extraction, summary generation, database construction, and search methods may also take the form of devices and programs that implement the methods, and recording media on which such programs are recorded.

[0026]

[Other Aspects of the Invention] The keyword extraction technique of the present invention can also be used for translation, for example. In translation it is effective to compile translation examples into a database, and the technique can be applied to searching such a database.

[0027]

[Embodiments of the Invention] Embodiments of the present invention are described below by way of an example.

(1) Configuration of the embodiment: First, the configuration of the embodiment is described with reference to FIG. 1. FIG. 1 is a schematic diagram of the system that builds the database in this embodiment. The system is implemented as a database server 200 connected to a large-scale network 100 such as the Internet. A vast number of servers 300, 310, 320, ... are connected to the network 100, and the storage devices in these servers 300 hold many web pages WP. A web page given its own address (URL) is referred to here as one unit of text data. This text data may include meta tags and the like. These web pages normally consist of a top-level page FP pointed to directly by the address (hereinafter the "front page") and lower-level pages BD that can be called from the front page FP (hereinafter called the "body" for convenience); see FIG. 5, described in detail later. Web pages consisting of a single page, or pages with complex link structures, are of course also possible, but for convenience the following description takes a web page consisting of a front page FP and a body BD as the standard configuration.

[0028] The database server 200 includes a network interface (NT-I/F) 210 that controls data exchange with the network 100, a CPU 220 that performs processing, a ROM 230 that stores processing programs and fixed data, a RAM 240 used as a work area, a timer 250 that manages time, a database (DB) 260 that accumulates the various data described later, a hard disk 270 that stores a Japanese dictionary and the like, and so on. The database 260 is in practice stored in a storage device such as a hard disk, but for convenience of explanation it is treated here as an independent device.

[0029] This system classifies the text data in the sites hosted on the many servers 300, ... that are published via the network 100, and makes it available in searchable form. To do so, it performs several processes following the procedure shown in FIG. 3. The procedure is as follows. First, data is collected (step S10). Next, keywords are extracted from the collected text data and the data is classified (step S20), and a database is built using the resulting keywords (step S30). The resulting database is then published so that users can access it freely (step S40). The database 260 thus becomes available to anyone, or only to registered members.

[0030] (2) Database construction process: The database construction process described as steps S10 to S30 is now explained in detail. In this embodiment it consists of a keyword extraction process and a process that classifies the text data using the extracted keywords and thereby builds the database. As shown in FIG. 4, when the database construction process starts, a crawler robot first collects data (step S11). In this process, addresses specifying sites on the servers 300, ... are issued via the network 100 and text data is collected from those sites. When the network 100 is the Internet, the sites to visit are specified one after another by addresses called IP addresses. Because global IP addresses are allocated on a roughly regional basis, the crawling engine's targets can also be limited, for example to Japan only, or to Japan and the United States. In addition, when address information called a URL is obtained from a site specified by an IP address, an address beginning with "http://" denotes hypertext, that is, a so-called web page, so text data may be collected only from pages having such addresses.

[0031] The URL of the hypertext obtained first when a given IP address is specified can be treated as the front page FP of that text data. For example, as shown in FIG. 5, if the URL obtained by specifying an IP address is "http://www.AAA.xx.jp/INDEX.HTML", this URL is taken as the front page FP. When data is read from the front page FP, the page contains the addresses of other pages linked from it. In the example of FIG. 5 these are "http://www.AAA.xx.Jp/BBB/INDEX.HTML", "http://www.AAA.xx.jp/CCC/INDEX.HTML", and so on. The crawling engine also retrieves all the text data at these link destinations. It does not, however, follow links to other web pages. In other words, it collects the text data within the same IP address, together with its URLs.
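The rule of following only links that stay within the same site can be sketched with the Python standard library. Note the assumption: the text restricts collection to the same IP address, whereas this sketch compares host names as a stand-in, since resolving addresses would require network access. The class and function names are hypothetical, and the third link in the sample HTML is an invented off-site address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def same_site_links(front_page_url, html):
    """Return absolute http:// links on the front page that stay on its host."""
    collector = LinkCollector()
    collector.feed(html)
    site_host = urlsplit(front_page_url).hostname  # hostname is lower-cased
    links = []
    for href in collector.hrefs:
        absolute = urljoin(front_page_url, href)
        parts = urlsplit(absolute)
        if parts.scheme == "http" and parts.hostname == site_host:
            links.append(absolute)
    return links

front_page = "http://www.AAA.xx.jp/INDEX.HTML"
sample_html = (
    '<a href="BBB/INDEX.HTML">B</a>'
    '<a href="http://www.AAA.xx.Jp/CCC/INDEX.HTML">C</a>'
    '<a href="http://other.example/">elsewhere</a>'
)
body_links = same_site_links(front_page, sample_html)
```

Because host names are compared case-insensitively, the "xx.Jp" and "xx.jp" spellings in the FIG. 5 example are treated as the same site, and the off-site link is skipped.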

[0032] The text data obtained in this way is next subjected to morphological analysis to extract words (step S21). Morphological analysis is a well-known technique of Japanese language processing, so a detailed description is omitted. As shown in FIG. 6, the text data is analyzed using a Japanese dictionary JD prepared in advance in a storage device such as the hard disk 270, in particular a so-called reverse-lookup dictionary Ijd, and the words making up each document are determined by morphological analysis. For example, for the sentence "ＤＤという車は、品質を重視したセダンである。" ("The car called DD is a sedan that emphasizes quality.") in the example of FIG. 5, looking up the reverse-lookup Japanese dictionary Ijd yields candidate words such as "ＤＤ", "と", "いう", "という", "い", "う", "車", "は", "品質", "を", "重視", "した", "し", "た", "セダン", "で", "ある", "である", and "あ". Single kana such as "い", "う", "あ", "し", and "た" are also cut out as words because, for example, the stem "い" of "いう" (言う, "to say") or the stem "う" of "うる" (売る, "to sell") may appear in a sentence.

[0033] The dictionary Ijd stores these words together with their grammatical information. The extracted words are therefore next arranged according to that grammatical information, and a sequence that does not break down grammatically is sought. Known techniques for this analysis include the multi-clause longest-match method and the minimum-cost method, which test which of the possible word combinations is the most plausible as Japanese. Taking "品質を" ("quality" followed by the object particle) as an example, under the rule that independent word + attached word (particle) is preferable to independent word + independent word + attached word (particle), "品質" (independent word, noun) + "を" (attached word, particle) is judged more plausible as Japanese than "品" (independent word, noun) + "質" (independent word, noun) + "を" (attached word, particle).
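The idea of choosing the most plausible segmentation can be illustrated with a toy minimum-cost search. This is a deliberately simplified sketch, not the analyzer of the embodiment: it assumes every dictionary word has a fixed cost and ignores the grammatical connection rules that a real minimum-cost analyzer would also score. The function name and the toy dictionary are hypothetical.

```python
def segment_min_cost(text, dictionary):
    """Return the lowest-cost split of text into dictionary words, or None.

    dictionary maps each known word to a cost; dynamic programming over
    end positions keeps the cheapest way to reach each position.
    """
    n = len(text)
    inf = float("inf")
    best_cost = [inf] * (n + 1)
    back = [None] * (n + 1)          # start index of the last word ending here
    best_cost[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in dictionary:
                cost = best_cost[start] + dictionary[word]
                if cost < best_cost[end]:
                    best_cost[end] = cost
                    back[end] = start
    if best_cost[n] == inf:
        return None                  # some part of the text is not in the dictionary
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]

# Toy dictionary: every word costs 1, so fewer words means lower total cost.
toy_costs = {"品": 1, "質": 1, "品質": 1, "を": 1}
segmentation = segment_min_cost("品質を", toy_costs)
```

With uniform costs the two-word split "品質" + "を" (total cost 2) beats the three-word split "品" + "質" + "を" (total cost 3), matching the preference stated in the text.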

[0034] After the morphological analysis, only words corresponding to nouns and sa-hen nouns are extracted, based on the part-of-speech information of the words obtained. Base forms of verbs, adverbs, adjectives, and the like may of course be extracted as well. How words are extracted depends in part on the kind of text data to be classified. For example, it is suitable to extract mainly nouns from ordinary text data, and mainly adjectives and the like from text data about appreciating literary works or works of art. For text data about sports, verbs might also be extracted.

[0035] In this example, morphological analysis was performed using information on the grammatical connections between words. However, if the words to be extracted are limited to kanji compounds and katakana words, words can also be extracted by a simpler procedure: take consecutive kanji strings and consecutive katakana strings out of the text data as words, and check whether each of them is listed in a noun dictionary.
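The simplified extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the Unicode ranges chosen for "kanji" and "katakana", the function name `extract_words`, and the tiny noun dictionary are all assumptions made for the example.

```python
import re

# Assumed character classes: runs of kanji (CJK Unified Ideographs plus the
# iteration mark 々) or runs of katakana (including the long-vowel mark ー).
CANDIDATE = re.compile(r"[\u4E00-\u9FFF\u3005]+|[\u30A1-\u30FA\u30FC]+")

def extract_words(text, noun_dictionary):
    """Keep only the kanji/katakana runs that are listed in the noun dictionary."""
    return [w for w in CANDIDATE.findall(text) if w in noun_dictionary]

# Hypothetical noun dictionary for the sample sentence of FIG. 5.
nouns = {"車", "品質", "重視", "セダン"}
words = extract_words("ＤＤという車は品質を重視したセダンである", nouns)
print(words)
```

Because hiragana never matches either character class, particles and inflectional endings act as natural word boundaries, which is why no grammatical analysis is needed in this variant.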

[0036] The words extracted by morphological analysis are registered in a temporary registration database, whose structure is described next. Unlike the database finally obtained, this temporary registration database provisionally holds, for the keyword extraction processing, the addresses of the text data collected by the crawling engine and the words extracted from that text data. It consists of the tables TB described below.

[0037] The temporary registration database has the structure shown in FIG. 7. It consists of four tables, "Host", "Page", "keyword", and "word". The "Host" and "Page" tables are related by the ID number, the "Page" and "keyword" tables by PageID, and the "keyword" and "word" tables by WordID.

[0038] FIGS. 8 to 11 show the details of these tables TB. The "Host" table HTB consists of an ID and a "HostName", and a different ID is associated with each different web page (that is, each site). Text data is therefore classified in units of sites that have an ID in this table HTB (usually sites with a domain name corresponding to one IP address).

[0039] As shown in FIG. 9, the "Page" table PTB associates a "HostID", a "PageID", and an "address". The "HostID" is the same as the ID in the "Host" table HTB. Addresses within the same site carry the same "HostID", and a "PageID" is assigned to each page beneath it. "PageID" values are never duplicated: each address has a different "PageID", so every page has a unique number. In the example of FIG. 9, the web page (site) represented by "www.AAA.xx.jp" ("HostID" = 22) therefore contains pages with addresses such as "www.AAA.xx.jp/CCC/INDEX.HTML", "www.AAA.xx.jp/DDD/power.HTML", and "www.AAA.xx.jp/EEE/Keep.HTML". A "page" in this description is not a page as a unit of printing but a body of text data to which one URL is assigned. As long as it has a single URL, it is one page, whether it consists of very little text data or of text data that would run to dozens of printed pages.

[0040] In this embodiment, the "Page" table PTB covers all of the "HostID" values together, but a separate table may be provided for each "HostID". Likewise, the "keyword" table KTB and the "word" table WTB described below may be provided for each "Page" and for each "keyword", respectively.

[0041] As shown in FIG. 10, the "keyword" table KTB associates a "WordID", a "PageID", and a "Cost". The "WordID" is the ID assigned to a word extracted by the earlier morphological analysis, with a unique value for each word. The number of times each word appears in the page with a given "PageID" is counted and stored in "Cost".

[0042] The "word" table WTB stores a "WordID", a "word", and an "F value" in association with one another. Since the "keyword" table KTB gives the in-page occurrence count "Cost" for each word, the number of pages for which this count is nonzero, that is, the number of pages on which the word appears, is accumulated for each word (each "WordID"). The word t and its cumulative value F(t, db) are then obtained and stored for each site. In the example of FIG. 11, the word "car" appears on 306 pages of this site, while a general term such as "feature" appears on many more pages, 1240.
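The relationship between the "Page", "keyword", and "word" tables can be sketched in a few lines. This is a hedged illustration only: the class and function names are hypothetical, the rows are made-up sample data (apart from HostID 22, PageID 4, WordID 3, and the first address, which appear in the text), and each (WordID, PageID) pair is assumed to occur at most once in the "keyword" table.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Page:            # one row of the "Page" table PTB
    host_id: int
    page_id: int
    address: str

@dataclass(frozen=True)
class KeywordRow:      # one row of the "keyword" table KTB
    word_id: int
    page_id: int
    cost: int          # occurrences of the word in that page

def f_values(keyword_rows, pages, host_id):
    """F value per WordID: number of pages of one site on which the word appears."""
    site_pages = {p.page_id for p in pages if p.host_id == host_id}
    counts = Counter()
    for row in keyword_rows:
        if row.page_id in site_pages and row.cost > 0:
            counts[row.word_id] += 1
    return dict(counts)

# Hypothetical sample rows for illustration.
pages = [
    Page(22, 4, "www.AAA.xx.jp/CCC/INDEX.HTML"),
    Page(22, 5, "www.AAA.xx.jp/DDD/power.HTML"),
    Page(23, 6, "www.BBB.xx.jp/index.html"),
]
rows = [KeywordRow(3, 4, 2), KeywordRow(3, 5, 1), KeywordRow(7, 4, 3), KeywordRow(3, 6, 5)]
print(f_values(rows, pages, 22))
```

The F values are derived entirely from the "keyword" table, so the "word" table can be rebuilt at any time after new pages are crawled.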

[0043] This completes the processing up to step S22 shown in FIG. 4. Next, frequency evaluation is performed (step S23). Specifically, this processing obtains the TFIDF value described earlier. The TFIDF value is obtained from equations (1) and (2). Applied to the web pages held by each site, equation (1) becomes

TFIDF = TF(d, t) × Idf(t) … (1)

where TF is the number of times the word t appears in the page (body text BD) d specified by one URL. In the case shown in FIGS. 6 to 11, taking "car" as an example, "WordID" = 3, and the word "car" appears twice in the address "www.AAA.xx.jp/CCC/INDEX.HTML", for which "PageID" = 4. The occurrence frequency normalized by the total word count of 807001 is therefore TF(d, "car") = 2 / 807001 ≈ 0.0000024.

[0044] The Idf, on the other hand, is calculated from FIG. 11 and the following equation (2).

Idf = LOG{DB(db) / F(t, db)} … (2)

Here, DB(db) is the total number of pages in a specific site, that is, in the example of FIG. 9, the number of pages with the same "HostID", which was 36456 in this example. F(t, db) is the F value shown in FIG. 11. Calculating equation (2) for "car" gives

Idf = LOG(36456 / 306) = LOG(119.14) = 4.7802760

and therefore

TFIDF = TF × Idf = 0.0000024 × 4.7802760 = 0.0000115
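The worked example for "car" can be checked with a short script. This is a sketch under stated assumptions: the function names are illustrative, and LOG is taken to be the natural logarithm, which is the reading that reproduces the value 4.7802760 given for 36456 / 306.

```python
import math

def tf(occurrences, total_words):
    """Occurrence count normalized by the total number of words."""
    return occurrences / total_words

def idf(pages_in_site, pages_with_word):
    """Equation (2), assuming LOG denotes the natural logarithm."""
    return math.log(pages_in_site / pages_with_word)

tf_car = tf(2, 807001)          # "car" appears twice among 807001 words
idf_car = idf(36456, 306)       # 36456 pages in the site, 306 contain "car"
tfidf_car = tf_car * idf_car    # equation (1)

print(f"TF={tf_car:.7f} Idf={idf_car:.5f} TFIDF={tfidf_car:.7f}")
```

Without intermediate rounding the product comes out near 0.0000118; the slightly smaller figure in the text follows from first rounding TF to 0.0000024.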

[0045] FIG. 12 shows the results of the above calculation for the words "car", "feature", "next generation", and "energy" in "PageID" = 4. By occurrence count alone (the TF value), the order is "feature" > "car" = "next generation" > "energy", whereas by TFIDF value the order becomes "next generation" > "feature" > "car" > "energy".

[0046] Next, the average vector is calculated (FIG. 4, step S24). The TFIDF calculation described above is performed for one "PageID". Since one site usually consists of multiple pages, performing the calculation for each page (text data with a unique URL under one IP address) yields a TFIDF value for each page. The average vector is obtained by averaging these TFIDF values. That is, if N is the number of pages in one site and TFIDFi (i = 1, 2, ..., N) is the TFIDF value of each page, the average vector TFIDFav is obtained as

TFIDFav = (TFIDF1 + TFIDF2 + ... + TFIDFN) / N

This gives the TFIDF value of one word for that site.
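The averaging step can be sketched as below. This is an illustration only: the function name and sample values are hypothetical, and a word that does not occur on a page is assumed to contribute a TFIDF of 0 for that page, consistent with dividing by the total page count N.

```python
def site_average(page_tfidf):
    """Average each word's TFIDF over all N pages of one site.

    page_tfidf: list with one dict per page, mapping word -> TFIDF on that page.
    """
    n = len(page_tfidf)
    totals = {}
    for page in page_tfidf:
        for word, value in page.items():
            totals[word] = totals.get(word, 0.0) + value
    return {word: total / n for word, total in totals.items()}

# Made-up per-page values, chosen to be easy to verify by hand.
pages = [{"car": 0.5, "feature": 0.25}, {"car": 0.25}]
averages = site_average(pages)
print(averages)
```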

[0047] Once the average TFIDF value has been obtained for each keyword, only words whose TFIDF value is at least a predetermined value, for example 0.00001, are extracted as keywords. Next, the vector Ba for this site (www.AAA.xx.jp) is calculated and registered in the database as the keywords of this site (step S31). That is,

Ba = (b1, b2, ..., bm)

where b1, b2, ..., bm are the words whose average TFIDF is at least 0.00001, together with their average TFIDF values. After the vector Ba has been obtained in this way, the above processing is repeated for all words appearing on all pages of all sites. As a result, the information on the vast number of sites collected by the crawling engine is accumulated as a set of vectors Ba, .... This set corresponds to the database 260.
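A minimal sketch of the thresholding that produces the vector Ba follows. The function name and the sample averages are hypothetical, and representing Ba as a word-to-value mapping is an assumption made for readability.

```python
def build_site_vector(average_tfidf, threshold=0.00001):
    """Keep only words whose average TFIDF is at least the threshold."""
    return {word: value for word, value in average_tfidf.items() if value >= threshold}

# Made-up average TFIDF values for one site.
averages = {"next generation": 0.000031, "car": 0.0000115, "energy": 0.000004}
ba = build_site_vector(averages)
print(ba)
```

Storing only the surviving words keeps each site's vector sparse, which is one way to realize the reduced element count mentioned in the text.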

[0048] In the vector calculation and registration processing above, the vector Ba may consist only of the words whose average TFIDF is at least the predetermined value, or it may have every word in the dictionary as an element. In the latter case, the TFIDF values of words whose TFIDF value is at or below the predetermined value are approximated as 0. Either way, the number of vector elements decreases or the number of zero-valued elements increases, so the calculation is easy to perform.

[0049] When the database 260 has been completed by the above processing, it is opened to the outside, either for free use or for use by registered members. The database could be accessed directly, but to make it accessible to an unspecified number of clients over the network 100, it is usual to prepare, within the server 200, a site with a CGI for accessing the database 260, so that clients can reach the database 260 from a browser via the network 100. A method of searching for web pages using the database is described next. FIG. 13 is a flowchart showing the processing at search time. First, a client starting a search accesses the site prepared for searching (step S400). As a result, a search screen such as that shown in FIG. 14 is displayed.

[0050] The client enters the search content as Japanese text in the keyword entry box KB provided on this screen (step S410). For example, as shown in FIG. 14(A), a character string such as "next generation" is entered in the box TB for character-string input. A search field or the like may be specified at the same time, as shown in the figure. When a narrowing search is needed, the interface may display FIG. 14(A) again so that the search is narrowed step by step, or multiple words may be entered separated by commas (,), as in "next generation, car". Alternatively, as illustrated in FIG. 14(B), a natural-language sentence such as "about next-generation cars" may be entered. In parallel with the input of the search text, the system monitors whether the "search" button BB has been pressed (step S420). When it is pressed, the entered words or sentence are read. For input as in FIG. 14(A), the words and the field are extracted; for input as in FIG. 14(B), the sentence is morphologically analyzed. Words are extracted in either case (step S430). When words are extracted by morphological analysis, they may be limited to nouns and sa-variant nouns or may include other parts of speech. FIG. 14(B) also shows schematically how words are extracted from the search sentence.

[0051] After the words, or the words and the field, have been extracted, the vector Bs of the s words D1, D2, ..., Ds obtained is calculated (step S440), and the database 260 is searched for the site whose vector is closest to this vector Bs (step S450). As shown schematically in FIG. 15, each site is stored in the database 260 as a vector whose elements are many words weighted by their TFIDF values. The similarity between the vector formed by the search keywords obtained from the given sentence and each vector registered in the database 260 is therefore determined, and the search results are output in order, starting from the site with the most similar vector (step S470). The output search results are sent to the client via the network 100 and displayed on the screen of the client's machine.
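The similarity search can be sketched as follows. The text does not say which similarity measure is used, so cosine similarity is an assumption here, as are the unit weight given to every query word, the function names, and the sample site "www.BBB.xx.jp" with its values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as word -> weight dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_sites(query_words, site_vectors):
    """Build the query vector Bs and return (site, score) pairs, most similar first."""
    bs = {word: 1.0 for word in query_words}   # assumed: equal weight per query word
    scored = [(site, cosine(bs, vector)) for site, vector in site_vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up database contents for illustration.
database = {
    "www.AAA.xx.jp": {"next generation": 0.00003, "car": 0.00002},
    "www.BBB.xx.jp": {"feature": 0.00005},
}
results = rank_sites(["next generation", "car"], database)
for site, score in results:
    print(site, round(score, 3))
```

Cosine similarity ignores vector length, so a site is not favoured merely because its TFIDF values are larger overall; other measures, such as Euclidean distance, would rank differently.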

[0052] With this method, sites are classified using information on how unevenly words appear in the pages that make up each site (the TFIDF value), and each site is registered in the database 260 as a set of words whose TFIDF value is at least a predetermined value. Because the search looks at the similarity between this data and the vector of the words given as search keywords, rather than simply at keyword matches, it can capture the distinctive features of each site.

[0053] Next, a second embodiment of the present invention is described. The second embodiment performs almost the same processing as the first, but when the database is built, classifications are first assigned manually to a number of representative sites as preliminary processing. That is, the crawling engine collects information on, for example, several thousand sites, words are extracted from the text data on each site, and TFIDF values are calculated. When the vector is obtained, the person registering the site looks at its front page FP and assigns a classification appropriate to that front page. In other words, a classification item is added when the vector consisting of words whose TFIDF value is at least the predetermined value is registered in step S31 of FIG. 4. Various classifications such as "mail order", "hobbies", "politics", and "economy" can be used as classification items. Industrial classifications and the like may of course also be used.

[0054] In this case, as illustrated in FIG. 16, the vectors of the many sites in a manually assigned classification are distributed with a certain spread. The center of this spread is defined as the vector BC1, BC2, ... representing that classification. The degree of variation (variance) about the center can also be determined from the spread of the vectors of the processed sites. With this processing done in advance, when the crawling engine subsequently collects the text data of all sites on the Internet, the classification of each site can easily be determined from the vector obtained. The database 260 may register the information of each site without any specific classification, as in the first embodiment, but assigning classifications also makes it possible to present the information in a form such as a table of contents.

[0055] According to this embodiment, the center and spread of each classification can be defined as vectors, so when the text data of a new site has been analyzed, the classification to which the site belongs can easily be determined. If a site cannot be assigned to any classification, the operator of the server 200 may be warned to that effect so that a new classification can be assigned.
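One way to realize the center and spread of a classification is sketched below. Everything specific here is an assumption for illustration: Euclidean distance, taking the spread to be the largest distance of a labelled site from its center, the `slack` factor, the function names, and the made-up vectors (which are not realistic TFIDF magnitudes).

```python
import math

def centroid(vectors):
    """Center of a group of sparse vectors (word -> weight dicts)."""
    words = set().union(*vectors)
    return {w: sum(v.get(w, 0.0) for v in vectors) / len(vectors) for w in words}

def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in words))

def train(labelled):
    """labelled: category -> list of manually classified site vectors."""
    model = {}
    for category, vectors in labelled.items():
        centre = centroid(vectors)
        radius = max(distance(v, centre) for v in vectors)  # assumed spread measure
        model[category] = (centre, radius)
    return model

def classify(vector, model, slack=1.5):
    """Nearest category whose spread covers the vector, or None if there is none."""
    best = None
    for category, (centre, radius) in model.items():
        d = distance(vector, centre)
        if d <= slack * radius and (best is None or d < best[1]):
            best = (category, d)
    return best[0] if best else None

model = train({
    "mail order": [{"price": 1.0, "order": 1.0}, {"price": 3.0, "order": 1.0}],
    "politics": [{"election": 2.0}, {"election": 4.0}],
})
print(classify({"price": 2.5, "order": 1.0}, model))
print(classify({"recipe": 10.0}, model))
```

A `None` result corresponds to the case in which no classification fits and the operator would be warned. A category trained from a single site has zero spread under this definition, so in practice a minimum radius would be needed.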

[0056] When a database 260 with such classifications is available, a client performing a search can first specify a classification to narrow the search range. Because the number of sites on the Internet is very large, searching with a classification specified is an effective way to improve search efficiency.

[0057] Next, a third embodiment of the present invention is described. The third embodiment is a summary sentence generation device that generates a summary sentence from given text data. This device is provided in the server 200 of the first embodiment and generates summary sentences using the database generation processing described there. As shown in FIG. 17, after registration in the database 260 is complete (FIG. 4, step S31), the keywords registered for one site are read out (step S500), and the five words L with the highest TFIDF values are taken from among them (step S510). These words L1, L2, ..., L5 are then arranged to generate the sentence "This site relates to L1, L2, L3, L4 and L5." (step S520). Since this sentence can be regarded as the shortest expression of the site's content, it is registered in the database 260 (step S530). Afterwards, when a client performs a search and sites similar to the content specified by the search keywords are output, this sentence is output as a summary together with the URL.
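Steps S500 to S520 can be sketched as below. The function name and the sample vector are hypothetical, and the English template stands in for the Japanese sentence pattern of the original.

```python
def summarize(site_vector, count=5):
    """Build the template sentence from the `count` words with the highest TFIDF."""
    top = sorted(site_vector, key=site_vector.get, reverse=True)[:count]
    if not top:
        return ""
    if len(top) == 1:
        listed = top[0]
    else:
        listed = ", ".join(top[:-1]) + " and " + top[-1]
    return f"This site relates to {listed}."

# Made-up registered keywords with their average TFIDF values.
site_vector = {
    "car": 0.00005, "next generation": 0.00009, "feature": 0.00004,
    "energy": 0.00003, "sedan": 0.00002, "quality": 0.00001,
}
summary = summarize(site_vector)
print(summary)
```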

[0058] According to this embodiment, a summary sentence that expresses the content of a site very concisely can be generated easily and used as highly effective information for understanding the content of a retrieved site. In this example, only nouns and sa-variant nouns are registered as keywords. However, if verbs, adjectives, and the like are also registered as keywords, and relationships between those words, such as whether they appeared on the same page, are stored, a sentence of a fixed form may be generated using morphological analysis. For example, if the adjective a1 and the verb V1 appeared on one page together with the noun L1, a sentence such as "This site relates to a1 L1 doing V1." can be generated, with L1 marked as the subject by the particle "ga". Some combinations of a noun L1 and a verb V1 can form "subject + predicate" and others "object + predicate", and this information can be prepared in advance in a dictionary or the like. The noun L1 and the verb V1 can therefore be tested, and a sentence such as "This site relates to doing V1 to a1 L1." can be generated instead, with L1 marked as the object by the particle "o". The sentence ending can likewise be generated as natural Japanese, for example "V1 suru koto" ("doing V1") when V1 is a sa-variant noun.

[0059] Embodiments of the present invention have been described above, but the invention is in no way limited to these embodiments and can of course be carried out in various other forms without departing from its gist. For example, it may be realized as a device or method that only builds the database, or as a device or method that only extracts keywords. It can also be applied to a translation device. It is known that translation does not go well when it simply converts between languages using grammatical information (the number of rules needed grows without bound), and that a semantically more accurate translation is obtained by preparing abundant usage examples, finding an example that matches the text to be translated, and applying it. The present invention can therefore be applied to given text data to extract keywords, which are then used to identify the matching usage example.

[Brief description of the drawings]

FIG. 1 is a schematic diagram showing the overall configuration in each embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of the database server 200.

FIG. 3 is an explanatory diagram outlining the steps in the embodiment.

FIG. 4 is a flowchart showing the database construction processing performed by the database server 200.

FIG. 5 is an explanatory diagram showing how data is linked in web pages.

FIG. 6 is an explanatory diagram illustrating morphological analysis of text data.

FIG. 7 is an explanatory diagram showing the configuration of the temporary registration database.

FIG. 8 is an explanatory diagram showing an example of the "Host" table HTB.

FIG. 9 is an explanatory diagram showing an example of the "Page" table PTB.

FIG. 10 is an explanatory diagram showing an example of the "keyword" table KTB.

FIG. 11 is an explanatory diagram showing an example of the "word" table WTB.

FIG. 12 is an explanatory diagram showing an example of the calculation of TFIDF values.

FIG. 13 is a flowchart showing the processing at search time in the embodiment.

FIG. 14 is an explanatory diagram showing an example of the search screen.

FIG. 15 is an explanatory diagram schematically showing how similarity is determined in a search.

FIG. 16 is an explanatory diagram schematically showing the relationship between classifications and vectors.

FIG. 17 is a flowchart showing the summary sentence generation processing.

FIG. 18 is an explanatory diagram showing a conventional method of judging the similarity of data by obtaining vectors from keywords.

[Explanation of symbols]

100 … network
200 … database server
220 … CPU
230 … ROM
240 … RAM
250 … timer
260 … database
270 … hard disk
300, 310, 320 … server

Claims

[Claims]

1. A keyword extraction method for extracting, from text data forming a certain unit, keywords for performing predetermined processing on that text data, the method comprising: morphologically analyzing the text data forming the certain unit to extract words; evaluating the degree to which each extracted word appears frequently and unevenly in the text data; and extracting words whose evaluation value is at least a predetermined value as keywords of the text data.

2. The extraction method according to claim 1, wherein the text data forming the certain unit is data existing in a site connectable via a network.

3. The extraction method according to claim 1, wherein, in extracting words by the morphological analysis, the words extracted are limited to a subset of words including nouns and sa-variant nouns.

4. The extraction method according to any of the above claims, wherein the degree to which a word appears frequently and unevenly is evaluated by a value obtained by normalizing the number of times the word appears in the text data by the amount of the text data.

5. A database construction method for classifying a plurality of text data items, each forming a certain unit, using keywords and constructing a database, the method comprising, sequentially for the plurality of text data items: morphologically analyzing the text data forming the certain unit to extract words; evaluating the degree to which each extracted word appears frequently and unevenly in the text data; extracting words whose evaluation value is at least a predetermined value as keywords of the text data; and classifying the plurality of text data items by vectors represented by the extracted keywords, thereby constructing a database in which the plurality of text data items are classified at least by the vectors.

6. The database construction method according to claim 5, wherein the text data forming the certain unit is data existing in a site connectable via a network.

7. The database construction method according to claim 5, wherein, in extracting words by the morphological analysis, the words extracted are limited to a subset of words including nouns and sa-variant nouns.

8. The database construction method according to any of the above claims, wherein the degree to which a word appears frequently and unevenly is evaluated by a value obtained by normalizing the number of times the word appears in the text data by the amount of the text data.

9. The database construction method according to claim 5, wherein a category is specified for each of the plurality of text data items forming a certain unit, and, when the database is constructed, the plurality of text data items represented by the vectors are classified by the category.

10. A method for generating a summary sentence from text data having a certain unity, wherein the text data having the certain unity is morphologically analyzed to extract words, and the extracted words are Evaluating the degree of frequent occurrence in the text data, extracting words whose evaluation value is equal to or greater than a predetermined value as keywords in the text data, and combining the extracted keywords to generate a summary sentence Method.

11. A method for searching, using keywords, a plurality of text data each having a certain unity, the method comprising: for each of the plurality of text data in turn, performing morphological analysis to extract words, evaluating the degree to which the extracted words appear frequently and unevenly in the text data, and extracting words whose evaluation value is equal to or greater than a predetermined value as keywords of the text data; classifying the plurality of text data by a vector represented by the extracted keywords; constructing a database in which the plurality of text data is classified by at least the vector; when a keyword to be searched for is input, obtaining a vector composed of the search keyword; and searching the database for matching text data based on the degree of similarity with that vector.
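
The search step can be sketched as follows. Cosine similarity is used here as one plausible measure of the "degree of similarity"; the claim itself does not name a measure. The database is assumed to be a mapping from a site identifier to its {keyword: weight} vector, and the names `cosine`, `search` and `top_n` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(database, query_keywords, top_n=3):
    """Rank database entries by similarity to a vector built from the query.

    Each query keyword gets weight 1.0 (an assumption); entries with
    zero similarity are omitted from the result.
    """
    query = {word: 1.0 for word in query_keywords}
    ranked = sorted(
        ((cosine(query, vec), doc_id) for doc_id, vec in database.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [(doc_id, score) for score, doc_id in ranked[:top_n] if score > 0]
```

Because the comparison is made between whole keyword vectors, a site can rank highly even when it shares only some of the query keywords, provided those keywords carry most of its weight.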

12. An apparatus for extracting, from text data having a certain unity, keywords for performing predetermined processing on the text data, the apparatus comprising: morphological analysis means for morphologically analyzing the text data having the certain unity to extract words; frequency evaluation means for evaluating the degree to which the extracted words appear in the text data in a biased manner; and keyword extraction means for extracting words whose evaluation value is equal to or greater than a predetermined value as keywords of the text data.

13. An apparatus for classifying, using keywords, a plurality of text data each having a certain unity and constructing a database, the apparatus comprising: morphological analysis means for morphologically analyzing the text data to extract words; frequency evaluation means for evaluating the degree to which the extracted words appear in the text data in a biased manner; keyword extraction means for extracting words whose evaluation value is equal to or greater than a predetermined value as keywords of the text data; and classification means for classifying the plurality of text data by a vector represented by the extracted keywords, wherein the processing by each of these means is performed sequentially on the plurality of text data to construct a database in which the plurality of text data is classified by at least the vector.

14. An apparatus for generating a summary sentence from text data having a certain unity, comprising: morphological analysis means for morphologically analyzing the text data having the certain unity to extract words; frequency evaluation means for evaluating the degree to which the extracted words appear frequently and unevenly in the text data; keyword extraction means for extracting words whose evaluation value is equal to or greater than a predetermined value as keywords of the text data; and sentence generation means for generating a summary sentence by combining the extracted keywords.

15. An apparatus for searching, using keywords, a plurality of text data each having a certain unity, comprising: database storage means for storing a database obtained by, for each of the plurality of text data in turn, performing morphological analysis to extract words, evaluating the degree to which the extracted words appear frequently and unevenly in the text data, extracting words whose evaluation value is equal to or greater than a predetermined value as keywords of the text data, and classifying the plurality of text data by a vector represented by the extracted keywords, so that the plurality of text data is classified by at least the vector; vector calculation means for obtaining, when a keyword to be searched for is input, a vector composed of the search keyword; and search means for searching the database for matching text data based on the degree of similarity with that vector.

16. A program for causing a computer to extract, from text data having a certain unity, keywords for performing predetermined processing on the text data, the program causing the computer to realize: a function of morphologically analyzing the text data to extract words; a function of evaluating the degree to which the extracted words appear frequently and in a biased manner in the text data; and a function of extracting words whose evaluation value is equal to or greater than a predetermined value as keywords of the text data.

17. A program for causing a computer to classify, using keywords, a plurality of text data each having a certain unity and to construct a database, the program causing the computer to realize: a function of, for each of the plurality of text data in turn, performing morphological analysis to extract words, evaluating the degree to which the extracted words appear frequently and unevenly in the text data, extracting words whose evaluation value is equal to or greater than a predetermined value as keywords of the text data, and classifying the plurality of text data by a vector represented by the extracted keywords; and a function of constructing a database in which the plurality of text data is classified by at least the vector.

18. A program for causing a computer to generate a summary sentence from text data having a certain unity, the program causing the computer to realize: a function of morphologically analyzing the text data having the certain unity to extract words; a function of evaluating the degree to which the extracted words appear in the text data in a biased manner; a function of extracting words whose evaluation value is equal to or greater than a predetermined value as keywords of the text data; and a function of generating a summary sentence by combining the extracted keywords.