JPS63217418A

JPS63217418A - System for extracting japanese key word text

Info

Publication number: JPS63217418A
Application number: JP62051317A
Authority: JP
Inventors: Toshihiko Kobayashi; 敏彦小林
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1987-03-05
Filing date: 1987-03-05
Publication date: 1988-09-09

Abstract

PURPOSE: To extract key words by a consistent extraction method, with no human intervention, by deleting character strings that cannot characterize a text with reference to a non-key word file and extracting the character strings that a character string appearance frequency table shows to have a high appearance frequency. CONSTITUTION: When Japanese text data is converted into codes, a character string extracting means 15 extracts character strings containing at least one of KANJI (Chinese characters) and KATAKANA (the angular form of the Japanese syllabary) from the text, based on the range of KANJI codes and the range of KATAKANA codes. A key word extracting means 18 then decides, from the information in a non-key word file 16, whether each extracted character string is generally unsuitable for characterizing a text. The appearance frequency of each character string not decided to be unsuitable is obtained from a character string appearance frequency table 17, and the character strings with a high appearance frequency are taken as the key words of the text. Key words of Japanese text data can thus be extracted with no human intervention.

Description

[Detailed description of the invention]

[Industrial application field]

The present invention relates to a Japanese text keyword extraction method.

[Conventional technology]

Conventionally, keywords have been extracted from Japanese text by a person reading through the text and manually picking out the keywords that characterize it.

[Problem that the invention seeks to solve]

In the conventional manual method described above, a person reads through the text and picks out the keywords by hand. Extracting even a single keyword therefore takes a long time. Extraction also tends to depend on each individual's judgment, so when several people are involved it is difficult to assign uniform keywords.

[Means for solving problems]

The Japanese text keyword extraction method of the present invention comprises: character/code conversion means for converting Japanese characters into codes; character string extraction means for extracting character strings containing at least one of kanji and katakana from the coded data of the Japanese text generated by the character/code conversion means; a non-keyword file storing information for excluding, from the character strings extracted by the character string extraction means, those that cannot characterize the text; a character string appearance count table recording each character string in the Japanese text under consideration together with its number of occurrences; and keyword extraction means that, referring to the information in the non-keyword file, excludes from the extracted character strings those that cannot characterize the text, creates the character string appearance count table for each Japanese text from the character strings that were not excluded, and extracts as keywords only those entries in the created table that occur many times.

[Operation]

When Japanese text data is encoded, the character string extraction means extracts from the text the character strings that contain at least one of kanji and katakana, based on the kanji code range and the katakana code range. The keyword extraction means judges, from the information in the non-keyword file, whether each such string is generally unsuitable for characterizing a text, obtains from the character string appearance count table the number of occurrences of each string not judged unsuitable, and takes the strings that occur most often as the keywords of the text.
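The code-range test that drives the extraction can be sketched as follows. This is only an illustration: the patent does not name a character encoding, so the Unicode block boundaries and the function names below are assumptions made for the example, not part of the original disclosure.

```python
# Assumed code ranges. The patent only speaks of a "kanji code range" and a
# "katakana code range"; Unicode blocks are used here purely for illustration.
KANJI_RANGE = (0x4E00, 0x9FFF)      # CJK Unified Ideographs block
KATAKANA_RANGE = (0x30A0, 0x30FF)   # Katakana block


def in_range(ch, code_range):
    """Return True if the code of the single character ch lies in code_range."""
    low, high = code_range
    return low <= ord(ch) <= high


def is_kanji_or_katakana(ch):
    """Return True if ch falls in the assumed kanji or katakana range."""
    return in_range(ch, KANJI_RANGE) or in_range(ch, KATAKANA_RANGE)
```

With these assumed ranges, hiragana and Latin letters fall outside both ranges and therefore act as delimiters between candidate strings.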

Keywords can therefore be extracted from Japanese text data without human intervention.

[Example]

Next, an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing an embodiment of the Japanese text keyword extraction method of the present invention. FIG. 2 is a diagram showing an example of the information format of the non-keyword file 16 in FIG. 1. FIGS. 3, 4 and 5 are flowcharts showing the processing of the character string extraction means 15, the keyword extraction means 18 and the database updating means 19 in FIG. 1, respectively.

As shown in FIG. 1, this embodiment consists of an input means 11, a display device 12, a Japanese text creation means 13, a Japanese text file 14, a character string extraction means 15, a non-keyword file 16, a character string appearance count table 17, a keyword extraction means 18, a database updating means 19 and a keyword database 20.

The input means 11 provides input to the Japanese text creation means 13 and the database updating means 19.

The display device 12 is a device that displays information from the input means 11 and the keyword extraction means 18. The Japanese text creation means 13 generates coded data from the character string information of the text entered through the input means 11.

The Japanese text file 14 is a file in which the coded text information generated by the Japanese text creation means 13 is registered. The character string extraction means 15 extracts character strings containing kanji or katakana from the text data, using the kanji code range and katakana code range that it holds. The non-keyword file 16 lists kanji and katakana character strings that cannot characterize a text, so that strings such as 「該当」 ("applicable"), 「本日」 ("today") and 「思」 ("think"), which contain kanji or katakana but generally do not characterize a text as keywords, can be identified; this information is stored in the format shown in FIG. 2. The character string appearance count table 17 records, as shown in Table 1, the character strings containing kanji or katakana in the Japanese text under consideration together with their numbers of occurrences.

The keyword extraction means 18 takes all the character strings consisting of kanji and katakana that the character string extraction means 15 extracted from the text, excludes those that generally cannot characterize a text on the basis of the non-keyword file 16, obtains the number of occurrences of each remaining string using the character string appearance count table 17, and displays the strings that occur many times on the display device 12. The database updating means 19 takes, from among the keywords displayed on the display device 12 by the keyword extraction means 18, those that the user judges not to characterize texts in general and enters through the input means 11, and adds them to the non-keyword file 16 as non-keywords; it registers the keywords that were not marked as non-keywords in the keyword database 20. As shown in Table 2, keywords are registered in the keyword database 20 together with the text number of the corresponding text data in the Japanese text file 14.

Table 2

Next, the processing of the character string extraction means 15, the keyword extraction means 18 and the database updating means 19 will be described with reference to FIGS. 3, 4 and 5.

(1) Character string extraction means 15 (FIG. 3)

First, one character of the text coded by the Japanese text creation means 13 is input (step 31), and it is determined whether the text data has ended (step 32). If the text data has not ended, it is determined whether the character currently being examined is a kanji or katakana character (step 33). If it is either, the character is appended to the end of the keyword candidate string under construction if there is one; otherwise a new candidate string is started consisting of that single character. Processing then returns to step 31 (step 34). If the character is neither kanji nor katakana, it is determined whether a string containing kanji or katakana is currently under construction (step 35). If there is, the constructed string becomes one of the input strings for the keyword extraction means 18, the string under construction is emptied, and processing returns to step 31 (step 36).

When step 32 determines that the text data has ended, processing ends (step 37).
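The loop of steps 31 to 37 can be sketched roughly as below. The Unicode ranges, the function names and the use of a Python string as the "coded text" are assumptions for illustration. The final flush of a string still under construction when the text ends is also an assumption: the flowchart description does not state what happens to it.

```python
def is_kanji_or_katakana(ch):
    """Assumed code-range test (Unicode blocks, not the patent's own codes)."""
    code = ord(ch)
    return 0x4E00 <= code <= 0x9FFF or 0x30A0 <= code <= 0x30FF


def extract_strings(text):
    """Collect maximal runs of kanji/katakana characters from text."""
    candidates = []   # strings handed on to the keyword extraction stage
    current = ""      # candidate string under construction
    for ch in text:                        # steps 31-32: read until text ends
        if is_kanji_or_katakana(ch):       # step 33
            current += ch                  # step 34: extend or start a string
        elif current:                      # step 35: is a string in progress?
            candidates.append(current)     # step 36: emit it and reset
            current = ""
    if current:
        # Assumption: emit a run that reaches the very end of the text.
        candidates.append(current)
    return candidates


print(extract_strings("本発明はキーワードを抽出する"))
```

Hiragana particles such as は and を fall outside both ranges, so they split the sentence into the candidate strings 本発明, キーワード and 抽出.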

(2) Keyword extraction means 18 (FIG. 4)

First, after the character string extraction means 15 has extracted all the character strings containing kanji or katakana from the text, the non-keyword file 16 is consulted and, of all the kanji and katakana strings extracted from this text, only those not in the non-keyword file 16 are retained (step 41). The character string appearance count table 17 is created for the retained strings and the occurrence counts are updated (step 42). Strings whose occurrence count exceeds a preset threshold are extracted as keywords (step 43), the extracted keywords are displayed on the display device 12 (step 44), and processing ends.
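Steps 41 to 43 amount to a filter, a count and a threshold test. The following is a minimal sketch under stated assumptions: the non-keyword file is modeled as an in-memory set, the appearance count table as a `collections.Counter`, and `min_count` is a hypothetical name for the preset threshold. Displaying the result (step 44) is left to the caller.

```python
from collections import Counter


def extract_keywords(candidates, non_keywords, min_count):
    """Return candidate strings that survive filtering and occur often enough."""
    # Step 41: keep only strings that are not listed as non-keywords.
    kept = [s for s in candidates if s not in non_keywords]
    # Step 42: build the appearance count table for this text.
    table = Counter(kept)
    # Step 43: keep strings whose count is strictly greater than the threshold.
    return [s for s, count in table.items() if count > min_count]


candidates = ["抽出", "本日", "抽出", "文字列", "抽出", "文字列"]
print(extract_keywords(candidates, {"本日"}, min_count=1))
```

The comparison is strict because the description speaks of strings occurring more often than the preset number; a different reading of the threshold would only change the comparison operator.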

(3) Database updating means 19 (FIG. 5)

First, after the keyword extraction means 18 has finished extracting keywords with reference to the non-keyword file 16 and the character string appearance count table 17, those of the extracted keywords that the user designates through the input means 11 as generally unable to characterize a text are deleted (step 51). The keywords that were not deleted are registered in the keyword database 20 (step 52), the strings newly designated in step 51 through the input means 11 as unable to characterize a text are added to the non-keyword file 16 (step 53), and processing ends.
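Steps 51 to 53 can be sketched as a small update routine. The data structures here are assumptions for illustration only: the keyword database is modeled as a dictionary from text number to keyword list, the non-keyword file as a set, and `rejected` stands for the strings the user designated through the input means.

```python
def update_database(text_number, keywords, rejected, non_keywords, keyword_db):
    """Apply the user's rejections and record the remaining keywords."""
    # Step 51: delete the keywords the user marked as non-characterizing.
    accepted = [k for k in keywords if k not in rejected]
    # Step 52: register the remaining keywords under the text number.
    keyword_db[text_number] = accepted
    # Step 53: remember the rejected strings as non-keywords for later texts.
    non_keywords.update(rejected)
    return accepted


non_keywords = {"本日"}
keyword_db = {}
update_database(1, ["抽出", "文字列", "場合"], {"場合"}, non_keywords, keyword_db)
print(keyword_db)
```

Because the rejected strings are added to the non-keyword set, they are filtered out automatically the next time a text is processed, which is how the user's judgment accumulates over time.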

Next, the overall operation of this embodiment will be described concretely.

Suppose, for example, that the input means 11, display device 12, Japanese text creation means 13, Japanese text file 14, character string extraction means 15, non-keyword file 16, character string appearance count table 17, keyword extraction means 18, database updating means 19 and keyword database 20 are, respectively, a keyboard, a CRT monitor, a Japanese text creation system that is a computer program, a character string extraction system that is a computer program, a non-keyword file using a magnetic disk as its recording medium, a character string appearance count table in the computer's main memory, a keyword extraction system that is a computer program, a database updating system that is a computer program, and a keyword database using a magnetic disk as its storage medium.

When Japanese text data is registered in the Japanese text file from the keyboard, together with a text data number, the Japanese text creation system converts the entered characters into codes. The character string extraction system takes the converted codes as input and extracts character strings containing kanji or katakana by the processing of FIG. 3 described above. From the extracted strings, the keyword extraction system, by the processing of FIG. 4 described above, excludes those that generally cannot characterize a text, counts the occurrences of identical strings by creating the character string appearance count table, and displays the strings that occur many times on the CRT monitor. Based on input from the keyboard, the database updating system follows the processing of FIG. 5 described above: it deletes those strings displayed on the CRT monitor that are judged unable to characterize a text, registers the strings not judged unsuitable in the keyword database together with the text number of the Japanese text, and registers the strings judged generally unable to characterize a text in the non-keyword file.

[Effect of the invention]

As described above, the present invention extracts character strings containing kanji or katakana from Japanese text using the kanji code range and the katakana code range, deletes from the extracted strings those that cannot characterize the text by referring to the non-keyword file, and extracts as keywords the strings that occur many times according to the character string appearance count table. This has the effect that keywords for Japanese text data can be extracted by a consistent extraction technique without human intervention.

[Brief explanation of the drawing]

FIG. 1 is a diagram showing an embodiment of the Japanese text keyword extraction method of the present invention; FIG. 2 is a diagram showing an example of the representation format of the information in the non-keyword file 16 in FIG. 1; and FIGS. 3, 4 and 5 are flowcharts showing the processing of the character string extraction means 15, the keyword extraction means 18 and the database updating means 19 in FIG. 1, respectively.

11: input means; 12: display device; 13: Japanese text creation means; 14: Japanese text file; 15: character string extraction means; 16: non-keyword file; 17: character string appearance count table; 18: keyword extraction means; 19: database updating means; 20: keyword database.

Claims

[Claims] A Japanese text keyword extraction method comprising: character/code conversion means for converting Japanese characters into codes; character string extraction means for extracting character strings containing at least one of kanji and katakana from the coded data of the Japanese text generated by the character/code conversion means; a non-keyword file storing information for excluding, from the character strings extracted by the character string extraction means, those that cannot characterize the text; a character string appearance count table recording each character string in the Japanese text under consideration together with its number of occurrences; and keyword extraction means that, referring to the information in the non-keyword file, excludes from the character strings extracted by the character string extraction means those that cannot characterize the text, creates the character string appearance count table for each Japanese text from the character strings that were not excluded, and extracts as keywords only those entries in the created table that occur many times.