JPH0486948A

JPH0486948A - Method for preparing kana-added data base utilizing dictionary by fields

Info

Publication number: JPH0486948A
Application number: JP2202973A
Authority: JP
Inventors: Masa Saito; 斎藤　雅; Hiroshi Teranishi; 浩寺西; Takahiro Nakajima; 孝浩中島
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 1990-07-31
Filing date: 1990-07-31
Publication date: 1992-03-19

Abstract

PURPOSE: To build databases of personal names, technical terms, and the like by using natural language processing, a branch of AI, to perform word segmentation and the addition of Japanese syllabary (kana) readings automatically when preparing a kana-added database. CONSTITUTION: First, preprocessing is applied to a stored database: data are extracted and code-converted, the serial number (ID) of a field-specific dictionary 106 is set on the code-converted data, and a natural language processing input file is then prepared; these operations are repeated for all the data. The field-specific dictionary 106 is a table pairing Chinese characters (kanji) with their readings. Each field to be processed is registered on the computer in advance and managed by an ID starting from 1. When the dictionary 106 is used, its ID is set in the data attribute of the natural language processing input data record, and a natural language processing input file record is prepared for each extracted data item.

Description

[Detailed Description of the Invention] Object of the Invention: (Field of Industrial Application) This invention relates to a method for creating databases such as CD-ROMs, in which a natural language processing system is used to create a kana-furi database (a database annotated with kana readings) with the aid of field-specific dictionaries.

(Prior Art) Recently, document data accumulated for printed matter is increasingly being reused to create CD-ROMs and databases. Extracting keywords for database searches and assigning kana readings (kana-furi) have traditionally been done manually by experts. In specialized fields such as personal names, addresses, and medicine in particular, the readings themselves are very difficult, and the work was almost impossible for anyone but a specialist.

(Problem to Be Solved by the Invention) Conventionally, keywords for database searches have been extracted by having experts select important words from a document and then attach readings to them. Keyword extraction for a database therefore required a great deal of labor, and the work itself was inefficient. Creating keywords was especially difficult in specialized fields such as personal names.

This invention was made in view of the circumstances described above. Its object is to provide a method for automatically creating a kana-furi database by applying natural language processing technology, a branch of AI (artificial intelligence), together with field-specific dictionaries.

Structure of the Invention: (Means for Solving the Problem) This invention relates to a method for creating a kana-furi database using field-specific dictionaries. The above object is achieved by preprocessing a database, creating a natural language processing output file by natural language processing that refers to a field-specific dictionary and a basic dictionary, and creating the kana-furi database by post-processing.

(Operation) In this invention, natural language processing, which is a type of AI, is used to create the kana-furi database: word segmentation (part-of-speech decomposition) and kana-furi are performed automatically on the input source text with reference to the field-specific dictionary and the basic dictionary.

Using dictionaries built into the computer and AI techniques, a document is decomposed into elements such as nouns, particles, and verbs; readings are added to the kanji in the segmented document, and keywords are extracted. Because the machine handles work that was formerly done by hand, only the same checking as before remains to be done afterwards. The kana-furi database so created is processed and used as an index for CD-ROMs and online databases, and the kana-furi function can also be used to typeset books with full ruby (readings on every kanji).

(Embodiment) First, the natural language processing system used in this invention will be described.

FIG. 7 shows an example of the hardware configuration of the natural language processing system. Host machine 10 contains a CPU 11 and main memory 12, and a magnetic disk unit 14 and a cassette magnetic tape unit 15 are connected to it via bus line 13. A magnetic tape unit 20, a laser printer 21, and a console terminal 23 are also connected to host machine 10, and a confirmation/correction terminal 22 is connected via an RS-232C interface 16.

FIG. 8 shows the software configuration of the natural language processing system. Input data from magnetic tape are taken in by input processing 101, and the information processed by host machine 10 becomes output data on magnetic tape through output processing 120. That is, input processing 101 copies the natural language processing system input data magnetic tape onto a disk file as input data 102, checks the kanji codes and the like, and then converts the data into records for Japanese language processing. Output processing 120 copies the processing result file on disk to the natural language processing output magnetic tape as processing result data 121. Driver 103 classifies and analyzes input data 102, controls Japanese language processing system 110, obtains the word segmentation, kana-furi, and keyword extraction results, and edits and outputs the processing results in the natural language processing system output data format.

Japanese language processing system 110 performs morphological analysis via basic dictionary access routine 112, extracts the readings of all words recognized by the language processing, and outputs them as a kana-furi output sentence. Noun-string extraction uses the word recognition results of the language processing: a word is extracted as a noun when its part of speech falls under (a) or (b) below.

(a) Common nouns, sa-hen (suru-verb) nouns, adjectival-verb nouns, derived nouns, temporal nouns, numerals, proper nouns, pronouns, and formal nouns.

(b) Affixes, which are extracted as nouns when the adjacent part of speech is as follows.

■ Prefix: the following part of speech is a common noun, sa-hen noun, adjectival-verb noun, derived noun, temporal noun, numeral, proper noun, pronoun, or formal noun.

■ Suffix: the preceding part of speech is a common noun, sa-hen noun, adjectival-verb noun, derived noun, temporal noun, numeral, proper noun, pronoun, or formal noun.

In addition, the Japanese text and the keyword analysis table obtained as above are taken as input, and the input Japanese text is analyzed in cooperation with access file routine 111, using techniques such as statistical analysis, syntactic analysis, and knowledge processing, in order to extract keywords, narrow them down, and evaluate their importance.
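The noun extraction rule in (a) and (b) can be illustrated with a minimal Python sketch. The tag names (`common_noun`, `prefix`, and so on) and the token representation are assumptions made for illustration; the patent lists the part-of-speech categories but does not give a tag set or data layout.

```python
from typing import List, Tuple

# Hypothetical tag names for the nine noun categories listed in (a).
NOUN_TAGS = {
    "common_noun", "sahen_noun", "adjectival_noun", "derived_noun",
    "temporal_noun", "numeral", "proper_noun", "pronoun", "formal_noun",
}

Token = Tuple[str, str]  # (surface form, part-of-speech tag)


def extract_nouns(tokens: List[Token]) -> List[str]:
    """Return the surface forms that rules (a) and (b) treat as nouns."""
    nouns = []
    for i, (surface, pos) in enumerate(tokens):
        if pos in NOUN_TAGS:
            # Rule (a): the word itself belongs to a noun category.
            nouns.append(surface)
        elif pos == "prefix":
            # Rule (b), prefix: the FOLLOWING word must be a noun category.
            if i + 1 < len(tokens) and tokens[i + 1][1] in NOUN_TAGS:
                nouns.append(surface)
        elif pos == "suffix":
            # Rule (b), suffix: the PRECEDING word must be a noun category.
            if i > 0 and tokens[i - 1][1] in NOUN_TAGS:
                nouns.append(surface)
    return nouns
```

Under these assumptions, a suffix that follows a verb is not extracted, while a suffix such as 氏 that follows a proper noun is.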

Terminal communication processing 123 communicates with confirmation/correction terminal 22 and converts data for terminal output. It also converts correction data from the terminal into the output file format and writes them. List output processing 122 edits the processing result data whose output was requested from the terminal into printer output data and sends the printer output data to laser printer 21.

The natural language processing functions that host machine 10 can handle are the following four, and processing can switch among them for each record of the input file.

A. Processing type 1: word segmentation
B. Processing type 2: kana-furi I (kana-furi per segmented word)
C. Processing type 3: kana-furi II (kana-furi per kanji, full ruby)
D. Processing type 4: keyword extraction and kana-furi of the keywords

Next, each function (processing types 1 to 4) will be described.

A. Word segmentation (processing type 1): Japanese text (mixed kanji and kana) is input and segmented, and part-of-speech information is added for nouns, verbs, and adjectives. The output consists of the segmentation, marked with slashes "/", and the part-of-speech information (noun, verb, adjective, unknown word). The output format of processing type 1 is as shown in FIG. 9.

B. Kana-furi I (processing type 2: kana-furi per segmented word): Japanese text (mixed kanji and kana) is input and segmented, and kana readings are assigned to each segmented word. Readings are given in katakana, and part-of-speech information is added for nouns, verbs, and adjectives. The output consists of the segmentation marked with slashes, the part-of-speech information (noun, verb, adjective, unknown word), and the kana-furi result for each segmented word element. The output format of processing type 2 is as shown in FIG. 10.

C. Kana-furi II (processing type 3): Processing type 3 provides kana-furi using field-specific dictionary 106 and full ruby assignment (kana-furi per kanji or kanji string). Kana-furi using field-specific dictionary 106 assigns readings to item data such as personal names, place names, and various technical terms by means of a dictionary dedicated to that kind of item. The item data are used as the key to search field-specific dictionary 106; if a match is found, the kana registered in field-specific dictionary 106 are assigned. If no kana are obtained in this way, the Japanese language processing system is called and the kana are assigned using basic dictionary 115.

The input format is "item data" for a single item; when several items are processed in one record, the items are separated by slashes, as in "item data 1"/"item data 2"/.../"item data N". The output information is the reading (in katakana) of each input item and an identifier of the source dictionary of the kana data (indicating which dictionary the kana were taken from). The output format of processing type 3 is as shown in FIG. 11, and an identification code (for example AA, AB, AC) is assigned to distinguish three cases: (1) the reading was obtained from field-specific dictionary 106; (2) the reading was obtained from basic dictionary 115; (3) the reading is registered in neither field-specific dictionary 106 nor basic dictionary 115.
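The lookup order described for processing type 3 (field-specific dictionary first, basic dictionary as fallback, with a code recording where the reading came from) might be sketched as follows. This is only an illustration under stated assumptions: the mapping of the codes AA, AB, and AC to the three cases follows the order in which the text lists them and is not confirmed by the source, both dictionaries are modeled as plain Python dicts, and the fallback here is a simple lookup rather than the morphological analysis that the Japanese language processing system would perform.

```python
from typing import Dict, List, Tuple

# Assumed mapping of identification codes to the three cases in the text.
FROM_FIELD_DICT = "AA"   # reading found in the field-specific dictionary
FROM_BASIC_DICT = "AB"   # reading found in the basic dictionary
NOT_REGISTERED = "AC"    # reading found in neither dictionary


def assign_readings(
    record: str,
    field_dict: Dict[str, str],
    basic_dict: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Split a slash-separated record and return (item, reading, code) triples."""
    results = []
    for item in record.split("/"):
        if item in field_dict:
            results.append((item, field_dict[item], FROM_FIELD_DICT))
        elif item in basic_dict:
            results.append((item, basic_dict[item], FROM_BASIC_DICT))
        else:
            # No reading available; the code flags the item for manual checking.
            results.append((item, "", NOT_REGISTERED))
    return results
```

Keeping the source code next to each reading is what later allows proofreading results to be routed back to the right dictionary.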

The data handled by kana-furi using field-specific dictionary 106 are item data (mainly proper nouns) such as personal names, place names, and various technical terms, whereas the data handled by full ruby assignment are Japanese sentences of mixed kanji and kana. The full ruby function (kana-furi per kanji or kanji string) takes Japanese text (mixed kanji and kana) as input and assigns kana to every kanji. Kana (ruby) are assigned to each kanji or kanji string in the input source text (that is, to characters other than JIS non-kanji), and the ruby are given in "group ruby" form.

The output format is as shown in FIG. 12.
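As a rough illustration of group ruby, the following Python sketch attaches one reading to each run of consecutive kanji. Several points are assumptions: the regular expression treats the basic CJK Unified Ideographs block (plus the iteration mark 々) as "kanji", which only approximates the JIS kanji set; the readings come from a toy dict rather than from morphological analysis; and the parenthesized output notation is invented here, since the actual format of FIG. 12 is not reproduced in the text.

```python
import re
from typing import Dict

# Approximation of "kanji": basic CJK Unified Ideographs plus the mark 々.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff々]+")


def add_group_ruby(text: str, readings: Dict[str, str]) -> str:
    """Append a reading in parentheses after each kanji run that has one."""

    def annotate(match: "re.Match[str]") -> str:
        run = match.group(0)
        reading = readings.get(run)
        # Runs without a known reading are left unannotated.
        return f"{run}({reading})" if reading else run

    return KANJI_RUN.sub(annotate, text)
```

The reading is attached to the whole run rather than to individual characters, which is the sense in which the ruby is treated as a group.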

D. Keyword extraction and kana-furi of keywords (processing type 4): Free keywords are extracted from the input Japanese text by the language processing function of the Japanese language processing system, and readings are added to the extracted keywords.

The output consists of the extracted keywords, the readings of the keywords (in katakana), and the keyword analysis results; the output format is as shown in FIG. 13. The analysis information area holds the analysis information obtained while the Japanese language processing system was identifying the keywords.

The purpose of confirmation/correction terminal 22 is to make it easy to check and correct processing results: it receives the input source text data and processing result data 121 in a processing result file from host machine 10 via terminal communication processing 123, displays them on the terminal's display, and outputs them to laser printer 21 of host machine 10. By keyboard operation at terminal 22, the operator specifies the job name of the processing result file to be checked or corrected; the input source text data and processing result data 121 are then displayed record by record on the terminal's display for checking and correction.

The display format depends on the processing type, as described in (A) to (D) below.

(A) For processing type 1 (word segmentation), the input source text and its segmentation result are displayed on the screen.

(B) For processing type 2 (kana-furi per segmented word), the input source text and the kana-furi result for each segmented word are displayed on the screen.

(C) For processing type 3 (full ruby), the kana-furi results for all kanji in the input source text are displayed on the screen in a different color.

(D) For processing type 4 (keyword extraction), the input source text, the keywords extracted from it, and their kana-furi results are displayed on the screen.

Next, the processing result data are corrected by keyboard operation. The basic correction functions are described below.

Correction is possible only for processing types 3 and 4. For processing type 3 (full ruby), the kana-furi results can be corrected; for processing type 4 (keyword extraction), the kana-furi results can be corrected and keywords can be inserted, deleted, and reordered.

When processing result data 121 have been corrected at terminal 22, the corrected data are sent to host machine 10 by keyboard operation. Host machine 10 updates the records of the processing result file on the basis of the corrected data.

Meanwhile, by keyboard operation at terminal 22, a specified processing result file or record can be printed on laser printer 21 of host machine 10. When the operator presses the P key (print key) to request printing of a processing result file or of a single processing result record, the records retrieved from host machine 10 are printed in the format for each processing type.

The above is an outline of the natural language processing system. This invention uses that system to create a kana-furi database of personal names and the like automatically. This embodiment describes the case in which field-specific dictionary 106 is a dictionary of personal names and a personal-name kana-furi database is created.

FIG. 1 shows the processing flow of this invention. Preprocessing is first performed on a database stored on a magnetic storage medium or the like (step S10). The details of the preprocessing are shown in FIG. 2: data are first extracted (step S11), and the extracted data are code-converted (step S12). The ID of field-specific dictionary 106 is then set on the code-converted data (step S13), after which the natural language processing input file is created (step S14). These operations are repeated for all the data.

Data extraction takes from the database the names (family and given names) to which kana are to be assigned in this processing. The data to be code-converted are in many cases written in JIS code or CTS (Computer Type Setting) code.

Because the code system of a natural language processing system is generally a system-specific code, the data must be code-converted. As for setting the field-specific dictionary ID, field-specific dictionary 106 is a table in which kanji are paired with their readings. Each field to be processed is registered on the computer in advance and managed by a serial number (ID) starting from 1.

When field-specific dictionary 106 is used, the field-specific dictionary ID is set in the data attribute of the data record of the natural language processing input file. Creating the natural language processing input file means creating one natural language processing input file record for each extracted data item.
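The preprocessing steps described above (extract, code-convert, set the dictionary ID, create one input record per item) can be sketched in Python as follows. All names here (`NlpInputRecord`, `preprocess`, `to_system_code`) are hypothetical. In particular, the real conversion from JIS or CTS code to the system-specific code is not described in the text, so Unicode NFKC normalization is used purely as a stand-in for "some code conversion".

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class NlpInputRecord:
    dictionary_id: int  # data attribute: field-specific dictionary ID (from 1)
    data: str           # code-converted item, e.g. a personal name


def to_system_code(text: str) -> str:
    """Stand-in for the JIS/CTS to system-specific code conversion."""
    return unicodedata.normalize("NFKC", text)


def preprocess(
    extracted: Iterable[str],
    convert_code: Callable[[str], str],
    dictionary_id: int,
) -> List[NlpInputRecord]:
    """Build one input record per extracted item, tagged with the dictionary ID."""
    if dictionary_id < 1:
        raise ValueError("field-specific dictionary IDs start from 1")
    return [NlpInputRecord(dictionary_id, convert_code(item)) for item in extracted]
```

Passing the conversion in as a function keeps the record-building loop independent of whichever external code the source database happens to use.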

The data preprocessed as described above then undergo natural language processing in step S1, which is described in detail below. For kana-furi using field-specific dictionary 106, as shown in FIG. 3, pattern matching is first performed against field-specific dictionary 106 (step S1A); if a match is found, its reading is output, and otherwise ordinary word segmentation and kana-furi are performed with reference to basic dictionary 115 (step S1B). The output format is as shown in FIG. 11. In ordinary natural language processing, a natural language processing input file is created, and the natural language processing refers to basic dictionary 115 (system dictionary 131 plus user dictionary 132) to perform word segmentation (part-of-speech decomposition) and kana-furi, as shown in FIG. 5, on input source text data such as that shown in FIG. 4. A part-of-speech identification ID is attached immediately before each segmented item so that the part of speech of the word can be determined.

Next, post-processing is performed on the natural language processing output file (step S20). The details of the post-processing are shown in FIG. 6. First, code conversion is performed (step S21): because the natural language processing system outputs its results in the system-specific code, the kana-furi result data are converted to CTS code. Next, the database is created (step S22): the code-converted data are output to file records in database format and registered in the database. The contents of the personal-name kana-furi file are then output as a list (step S2), and the name kana-furi data are proofread by marking corrections in red on the list. The keyword data for which proofreading has been completed become the personal-name kana-furi database.

For data on which kana-furi was not performed correctly, field-specific dictionary 106 is revised in order to improve the accuracy of the next natural language processing run. The handling follows the source-dictionary identifier of each result: where the reading was obtained from field-specific dictionary 106, the corresponding entry in the field-specific dictionary is corrected; in the other cases, the reading is checked and corrected as necessary and then registered in field-specific dictionary 106, so that accuracy improves from the next run onward. Other conceivable field-specific dictionaries include dictionaries of medical terminology, economic terminology, and chemical technology terminology.
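The dictionary maintenance step after proofreading might look like the following Python sketch. It assumes, without confirmation from the source, that code AA marks a reading taken from the field-specific dictionary and that any other code (basic dictionary or not registered) means the item has no entry there yet; the function name and the tuple layout are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# (item, assigned reading, source code, proofread reading)
Proofread = Tuple[str, str, str, str]


def update_field_dictionary(
    field_dict: Dict[str, str],
    proofread: Iterable[Proofread],
) -> Dict[str, int]:
    """Apply proofreading results to the field-specific dictionary in place."""
    stats = {"corrected": 0, "registered": 0}
    for item, assigned, code, correct in proofread:
        if code == "AA":
            # The reading came from the field dictionary: fix a wrong entry.
            if assigned != correct:
                field_dict[item] = correct
                stats["corrected"] += 1
        else:
            # Reading came from elsewhere (or was missing): register the
            # checked reading so the next run finds it in the field dictionary.
            field_dict[item] = correct
            stats["registered"] += 1
    return stats
```

Each run therefore leaves the field-specific dictionary covering more of the data, which is how the text expects accuracy to improve over successive runs.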

By using a field-specific dictionary of personal names, kana-furi such as that shown in Table 1 can be performed.

Basic dictionary 115 is the most fundamental dictionary for natural language processing (word segmentation and kana-furi) and consists of system dictionary 131 and user dictionary 132. The accuracy of the natural language processing can be improved by revising user dictionary 132.

In this invention, a general-purpose file (hereinafter, NL file) is used as the general-purpose input/output file between the CTS and the natural language processing. As shown in FIG. 14, there are three kinds of NL file, namely the NL in-file, the NL out-file, and the NL information file, and all three have the same format. The overall format consists of a header record and data records. The header record contains the record identifier, sequence number, file identifier, job name, manuscript name, CTS system name, and so on. The data records contain the record identifier, sequence number, data number, processing type, data, and so on.
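The record layout listed above can be modeled with a short Python sketch. The field names follow the items named in the text, but their types, the class names, and the helper `group_by_data_no` are assumptions; the text does not give field widths or encodings. The helper illustrates treating all data records that share a data number as one processing unit.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Iterable


class ProcessingType(IntEnum):
    SEGMENTATION = 1        # word segmentation
    KANA_PER_WORD = 2       # kana-furi I
    KANA_PER_KANJI = 3      # kana-furi II (field dictionary / full ruby)
    KEYWORD_EXTRACTION = 4  # keyword extraction with kana-furi


@dataclass
class HeaderRecord:
    record_id: str
    sequence_no: int
    file_id: str
    job_name: str
    manuscript_name: str
    cts_system_name: str


@dataclass
class DataRecord:
    record_id: str
    sequence_no: int
    data_no: int
    processing_type: ProcessingType
    data: str


def group_by_data_no(records: Iterable[DataRecord]) -> Dict[int, str]:
    """Concatenate, in sequence order, the data of records sharing a data number."""
    units: Dict[int, str] = {}
    for rec in sorted(records, key=lambda r: r.sequence_no):
        units[rec.data_no] = units.get(rec.data_no, "") + rec.data
    return units
```

Because the processing type is stored per data record, a single file can mix the four functions, consistent with the per-record switching the text describes.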

As shown in FIG. 15, input routine S100 reads the NL in-file together with parameters and creates the natural language processing input file and the NL information file. Its details are shown in FIG. 16.

The NL in-file is read, functions are deleted as specified by the parameters, code conversion (external code to system-specific code) is performed, and the natural language processing input file is created. The position information of the deleted functions and the code conversion information are stored in the information file, and after processing ends the job name and so on are output as a list. The parameter check (step S101) analyzes whether function deletion is to be executed and the instructions for code conversion. Header record creation (step S102) creates the header records of the natural language processing input file and the NL information file from the contents of the header record of the NL in-file. Reading the data with the same data number (step S103) takes all valid data of the records having the same data number as the processing unit.

Valid data are therefore extracted from those data records of the NL in-file that have the same data number.

In data processing (step S104), functions are deleted from the data extracted from the NL in-file and code conversion is performed. Information on the deleted functions and the code conversion information are output to the NL information file, and the processed data are output to the natural language processing input file. In data record creation (step S105), the data with the same data number after processing (function deletion and code conversion) are output to the natural language processing input file, and the processing information is output to the NL information file.

Meanwhile, as shown in FIG. 17, output routine S200 of FIG. 14 is the post-processing for natural language processing: it reads the natural language processing output file and the NL information file together with parameters and creates the NL out-file. Its details are shown in FIG. 18. That is, it reads the natural language processing output file and the NL information file, restores functions and performs code conversion (system-specific code to external code) as specified by the parameters, and creates the NL out-file. After processing ends, the job name and so on are output as a list. The parameter check (step S201) analyzes whether function restoration is to be executed and the instructions for code conversion. Header record creation (step S203) creates the header record of the NL out-file from the contents of the header records of the NL information file and the natural language processing output file. Reading the data with the same data number (step S204) takes all valid data of the records having the same data number as the processing unit. The data records of the natural language processing output file contain both input source text data and processing result data, but only the processing result data are treated as valid data. The processing result data are therefore extracted from those data records of the natural language processing output file that have the same data number. In data processing (step S205), functions are restored and code conversion is performed on the data extracted from the natural language processing output file. The processed data are output to the NL out-file.
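The pairing of function deletion in the input routine with function restoration in the output routine can be illustrated by the Python sketch below. It rests on strong assumptions: typesetting functions are written here in an invented brace syntax such as `{B}`, the recorded position is an offset into the function-free text, and restoration is shown only for text whose length is unchanged. How the actual system maps positions onto processing results that contain added readings is not described in the source.

```python
import re
from typing import List, Tuple

# Invented syntax for embedded typesetting functions, e.g. "{B}".
FUNCTION = re.compile(r"\{[^{}]*\}")

Removed = List[Tuple[int, str]]  # (offset in the stripped text, function code)


def strip_functions(text: str) -> Tuple[str, Removed]:
    """Remove functions and record where each one sat in the stripped text."""
    removed: Removed = []
    parts: List[str] = []
    offset = 0  # length of stripped text emitted so far
    last = 0    # end of the previous match in the original text
    for match in FUNCTION.finditer(text):
        parts.append(text[last:match.start()])
        offset += match.start() - last
        removed.append((offset, match.group(0)))
        last = match.end()
    parts.append(text[last:])
    return "".join(parts), removed


def restore_functions(plain: str, removed: Removed) -> str:
    """Re-insert the recorded functions at their recorded offsets."""
    out: List[str] = []
    last = 0
    for offset, code in removed:
        out.append(plain[last:offset])
        out.append(code)
        last = offset
    out.append(plain[last:])
    return "".join(out)
```

The list of removed functions plays the role of the NL information file: it is written by the input side and consumed by the output side, so the natural language processing itself never sees the typesetting codes.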

The present invention can be used to support the construction of databases such as CD-ROMs: it can extract search keywords and add readings to the extracted keywords. It can also be used in printing work, as a support system for producing printed matter with full ruby annotation using the kana-furi function, for automatically adding kana to items such as addresses and names in directories, and for index creation.
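As a rough illustration of why readings help with index creation, the following Python sketch sorts entries by their kana reading and groups them under the first kana. The function name `build_index` and the sample entries are hypothetical. Sorting by Unicode code point is a simplification that only approximates Japanese dictionary order (small kana, voiced kana, and the long-vowel mark are handled differently by real collation rules).

```python
from itertools import groupby
from typing import Dict, List, Tuple


def build_index(entries: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Sort (surface, reading) pairs by reading and group them by first kana.

    Assumes every reading is a non-empty katakana string.
    """
    ordered = sorted(entries, key=lambda e: e[1])
    return {
        head: [surface for surface, _ in group]
        for head, group in groupby(ordered, key=lambda e: e[1][0])
    }


# Made-up sample data: kanji surface forms paired with katakana readings.
entries = [("中島", "ナカジマ"), ("斎藤", "サイトウ"), ("寺西", "テラニシ")]
index = build_index(entries)
```

Without the readings, the kanji surface forms alone give no usable sort key, which is the gap the kana-furi step fills.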

Effect of the Invention:

As described above, the method of the present invention for creating a kana-furi database using a field-specific dictionary makes it possible to create databases of personal names, technical terms, and the like automatically, without requiring specialized knowledge or skills.

[Brief Description of the Drawings]

FIG. 1 is a flowchart showing an example of the operation of the present invention.
FIG. 2 is a flowchart showing an example of the pre-processing operation.
FIG. 3 is a flowchart showing the operation of the natural language processing.
FIG. 4 is a diagram showing an example of an original text to be subjected to natural language processing.
FIG. 5 is a diagram showing an example of word-separated kana.
FIG. 6 is a flowchart showing an example of the post-processing operation.
FIG. 7 is a block diagram showing an example of the hardware configuration of the natural language processing system.
FIG. 8 is a diagram showing an example of its software configuration.
FIG. 9 is a diagram showing the output format of word-separated writing.
FIG. 10 is a diagram showing the output format of kana-furi per word-separated unit.
FIG. 11 is a diagram showing the output format of kana-furi using a field-specific dictionary.
FIG. 12 is a diagram showing the output format of full ruby annotation.
FIG. 13 is a diagram showing the output format of keyword extraction and kana-furi applied to the keywords.
FIG. 14 is a flowchart showing an example of the configuration of a general-purpose file used in this invention.
FIG. 15 is a diagram showing the input and output of the input routine.
FIG. 16 is a flowchart showing the details of the input routine.
FIG. 17 is a diagram showing the input and output of the output routine.
FIG. 18 is a flowchart showing the details of the output routine.

10: host machine; 11: CPU; 12: memory; 14: magnetic disk device; 15: cassette magnetic tape device; 20: magnetic tape device; 21: laser printer; 22: verification/correction terminal; 23: console terminal.

Clean copy of the drawings (no change in content). Applicant's agent: 安形 雄三.

Procedural amendment (formality), November 20, 1990. To the Commissioner of the Patent Office, 植松 敏.
1. Indication of the case: 1990 Patent Application No. 202973
2. Title of the invention: Method for creating a kana-furi database using a field-specific dictionary
3. Relationship to the case: Patent applicant (289) Dai Nippon Printing Co., Ltd.
4. Agent
5. Date of amendment order: October 15, 1990 (dispatch date: October 30, 1990)

Claims

[Claims]

1. A method for creating a kana-furi database using a field-specific dictionary, characterized in that a database is pre-processed, a natural language processing output file is created by natural language processing that refers to a field-specific dictionary and a basic dictionary, and a kana-furi database is created by post-processing.

2. The method for creating a kana-furi database using a field-specific dictionary according to claim 1, wherein the field-specific dictionary is corrected when the keyword data is proofread.

3. The method for creating a kana-furi database using a field-specific dictionary according to claim 1, wherein the pre-processing repeats data extraction, code conversion, setting of the ID of the field-specific dictionary, and creation of a natural language processing input file.

4. The method for creating a kana-furi database using a field-specific dictionary according to claim 1, wherein the post-processing performs code conversion and creation of a database-format file on the natural language processing output file, and repeats these operations.
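The dictionary reference in claim 1 can be sketched in Python as follows. The abstract describes the field-specific dictionary as a table pairing kanji with readings, managed by an ID starting from 1. Everything else here is an assumption for illustration: the names `FIELD_DICTIONARIES`, `BASIC_DICTIONARY`, and `lookup_reading`, the sample entries, and in particular the rule that the field-specific dictionary takes priority over the basic dictionary, which the source does not state.

```python
from typing import Dict, Optional

# Hypothetical field-specific dictionaries, keyed by an ID starting from 1.
# Each is a table pairing kanji with a reading (sample entries are made up).
FIELD_DICTIONARIES: Dict[int, Dict[str, str]] = {
    1: {"中島": "ナカシマ"},  # e.g. a "personal names" field
}

# Hypothetical basic dictionary shared by all fields.
BASIC_DICTIONARY: Dict[str, str] = {
    "中島": "ナカジマ",
    "辞書": "ジショ",
}


def lookup_reading(word: str, dictionary_id: int) -> Optional[str]:
    """Return a reading for `word`, or None if neither dictionary has one.

    Assumption: the field-specific dictionary selected by `dictionary_id`
    is consulted first, and the basic dictionary is the fallback.
    """
    field_dictionary = FIELD_DICTIONARIES.get(dictionary_id, {})
    if word in field_dictionary:
        return field_dictionary[word]
    return BASIC_DICTIONARY.get(word)
```

Under these assumptions, correcting a field-specific dictionary during proofreading (claim 2) amounts to updating one entry in the table for that ID, so that later runs on the same field pick up the corrected reading.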