JP4389102B2

JP4389102B2 - Technical literature search system

Info

Publication number: JP4389102B2
Application number: JP2002294626A
Authority: JP
Inventors: 宍戸広信
Original assignee: 宍戸広信
Priority date: 2002-10-08
Filing date: 2002-10-08
Publication date: 2009-12-24
Anticipated expiration: 2022-10-08
Also published as: JP2004133510A

Description

【０００１】
【発明の属する技術分野】
本発明は、特許情報データベース等の技術文献を検索するシステムに関するものである。
【０００２】
【従来の技術】
従来の特許情報の検索システムには、例えば、株式会社パトリスにより提供されているＰＡＴＯＬＩＳ（登録商標）−ＷＥＢのような特許情報検索システムが知られている。こうした従来のシステムを用いて先行技術の調査や、中間処理の異議・無効手続を行うための文献調査を実施する場合には、国際特許分類の分類コードや、キーワード等の検索条件を論理式にして、テキスト検索等を行っている。こうした検索システムを用いての調査は一般的に以下のような手順で行われる。
手順１、分類コードや論理式などの適宜な条件で検索を行う。
手順２、検索結果の文献の要約部等に着目して１次スクリーニングを行う。
手順３、１次スクリーニングで抽出した文献を精査し２次抽出を行う。
手順４、２次抽出した文献から、構成要件に該当する部分を抜き出して対比資料を作成する。
【０００３】
また、特許文献１に開示されている特許公報検索システムには、特許情報データベースからバッチ転送された特許公報データを、特別の処理支援プログラムにより再構築して、例えば、企業の開発セクション等で利用しやすいような、特定技術の分野の開発状態、先行技術などが一目で分かるように構成する等により、企業内等のセクション毎に再利用しやすい形式に編集し直して、利用するようにしたものがある。
【０００４】
一方、特許文献２〜特許文献４には特許情報データベースとは直接関係ないが、一般的な文書データ等の各種検索技術が開示されている。
特許文献２にはＳＧＭＬ（ＳｔａｎｄａｒｄＧｅｎｅｒａｌｉｚｅｄＭａｒｋｕｐＬａｎｇｕａｇｅ）などのマーク付け言語を利用して構造化した文書データ、すなわち、文書の表題、章題、本文と言った文書構成要素の名前とその範囲が、適当な記号を用いて文書中にマーク付けされた文書による構造化文書データベースを構築して、文書構成要素により構成したキーワードにより類似文書の検索を行うものである。
【０００５】
また、特許文献３には、文書データのデータベースを構築する文書構成要素として、マーク付け言語等を利用して章、節、段落、枠といった論理構造に分割される文書構成要素を、木構造のデータ構造でデータベースに記憶して、データベース検索技術等を利用して検索するものである。
【０００６】
特許文献４は、検索単位データとして文書中より抽出された単語と、他の特許文献にも見られる「文」「段落」と言った文書構成要素との、両方で構築して検索する例を開示している。
そして検索結果を判定する検索適合文書検索語関連算出部では、「単語」あるいは「文書構成要素」の位置情報を算出し、例えば、検索単語が文中に何回出現したかを調べ、出現位置を文末（あるいは文頭）から見た位置情報等で表し、近接演算を行って適合度を判定している。
【０００７】
【特許文献１】
特開２００１−２２７９４号公報（段落〔００２０〕、図１）
【特許文献２】
特開平７−４４５６７号公報（段落〔００１９〕、図１）
【特許文献３】
特開平８−４４７６６号公報（段落〔００１０〕、図３）
【特許文献４】
特開２００２−１８９７５４号公報（〔要約〕、図１）
【０００８】
【発明が解決しようとする課題】
上記従来の手法では、手順１においては、文献データ全体として検索条件が判定される（検索条件に指定した語句それぞれが文献データ中に存在するか、を検索する）ために、関連性の薄い文献も多く検索されてしまう（一般に「ノイズ」と呼ばれる）という問題があった。
手順２の場合は、文献データの一部分にのみ着目してスクリーニングを行うので、重要な文献を見落としてしまう恐れがあるという問題があった。
手順３、４、の場合は、構成要件チェック表などを作成して、文献ひとつひとつ対応関係を把握しながら作業を行うため、作業量が膨大になるという問題があった。
【０００９】
特許文献１に示される技術を用いれば、上記手順２〜３が効率化されるが、依然、文献データをひとつひとつ目読検査する必要があるものである。
特許文献２に示される技術を用いれば、類似度順の検査を行うことが出来るため、早期に目的の文献を発見できる可能性が高まる。しかし、類似度は文書全体としての評価であるため、上記手順３〜４の作業に対しては、何ら改善の手段が提供されない。
特許文献３に示される技術は、文書の構成要素に着目した検索技術であるが、文書の再利用を目的としたものであるため、本願の課題である「複数の構成要素を全て（可能な限り多く含む）文献を検索する」という目的に利用するには、更なる改良が必要である。
特許文献４に示される技術を用いれば、検索条件が文書全体としての評価でなく、段落等を単位とした合致判定が行われるので、上記手順１における検索ノイズを小さくできるが、本願の目的に利用する場合の効果は、特許文献２の技術と同程度である。
【００１０】
そこで本発明は、高効率な文書データ検索技術、データベース検索技術などを導入して、所望の特許情報・公報を迅速に、正確に無駄無く検索できる技術文献検索システムを提供することを目的としている。
【００１１】
上記目的を達成するため、請求項１に記載の発明は、電子化されて記憶手段に格納された、複数の単位データからなる技術文献データを検索するコンピュータシステムであって、検索対象技術を構成する複数の構成要素毎に、前記各構成要素を表わすデータの入力を受け付ける構成要素入力手段と、前記各構成要素毎に検索条件の入力を受け付ける検索条件入力手段と、前記各構成要素毎に、前記検索条件に基づいて前記技術文献データを検索する検索手段と、検索された前記技術文献データ中から前記検索条件に合致する単位データを抽出する単位データ抽出手段と、検索された前記技術文献毎に、当該技術文献が前記構成要素毎の検索条件を満たすか否かを示す構成要素配列データと前記検索条件に合致した前記単位データおよび当該技術文献の識別データとを対応付けて記録する記録手段と、を備えたことを特徴とする。
【００１２】
また、請求項２に記載の発明は、請求項１記載の技術文献検索システムにおいて、技術文章データの入力を受け付ける入力手段と、前記技術文章データを解析して複数の文章データに分割し、該分割された文章データを前記構成要素入力手段に引き渡す技術構成分解手段を備えたことを特徴としている。
また、請求項３に記載の発明は、請求項１記載の技術文献検索システムにおいて、用語の辞書データを記憶する辞書記憶手段と、前記辞書データを参照して前記構成要素を表わすデータから技術用語を抽出し、抽出された技術用語を前記検索条件入力手段に引き渡す技術用語抽出手段を備えたことを特徴としている。
また、請求項４に記載の発明は、請求項１記載の技術文献検索システムにおいて、類似語の組を複数記憶する類似語記憶手段と、検索条件入力手段により入力を受け付けられた検索条件に含まれるキーワード毎に、該キーワードに対応する類似語を前記類似語記憶手段より取得し、取得した類似語を前記検索条件に拡張して追加する類似語追加手段とを備えたことを特徴としている。
また、請求項５に記載の発明は、請求項１記載の技術文献検索システムにおいて、検索適合度の条件を定義する適合度条件の入力を受け付ける適合度条件入力手段と、検索された技術文献毎に前記検索適合度を算出する検索適合度算出手段とを有し、前記単位データ抽出手段は、前記検索適合度が前記適合条件に合致する技術文献データを対象として単位データの抽出処理を行うことを特徴としている。
また、請求項６に記載の発明は、請求項１記載の技術文献検索システムにおいて、前記検索手段が検索対象とすべき技術文献データの属性の入力を受け付ける対象条件入力手段を有し、前記検索手段は前記属性に合致する技術文献データのみを検索することを特徴としている。
また、請求項７に記載の発明は、前記検索手段は、前記技術文献データ内の所定の範囲毎に前記検索条件に基づく検索を行うことを特徴としている。
また、請求項８に記載の発明は、前記検索手段は、前記技術文献データがチャプター化されている場合に、所定のチャプターに属するデータを検索しないことを特徴としている。
【００１３】
【発明の実施の形態】
以下、本発明の実施の形態について図を参照して説明する。
［第１の実施の形態］
図１は、本発明の、第１の実施の形態に係る技術文献検索システムのブロック図である。
本実施の形態は、パーソナルコンピュータ（以下「ＰＣ」）やワークステーションなどで動作可能なソフトウェアシステムである。市販されているごく一般的なＰＣ等を利用できるので、ハードウェア構成（図示せず）についての説明は次の通り簡単に留める。
【００１４】
図１の検索条件入力部２は、検索条件入力画面を表示するプログラム、ディスプレイ等の表示デバイスおよびマウス、キーボード等の入力デバイスである。文献データ１は、ＨＤＤ等の記憶装置に格納される。検索処理部３は、検索条件入力部２で入力された検索条件による検索処理を実施するプログラムである。ヒットテーブル４および抽出データリスト５は、検索処理部３により検索結果として出力され、主記憶またはＨＤＤ等の二次記憶装置に格納される。
検索結果出力部６は、検索結果を表示するプログラムおよびディスプレイ等の表示デバイスであり、検索結果の表示画面はマウス、キーボード等の入力デバイスにより操作する。
【００１５】
まず、図４を参照し、検索条件の入力について説明する。
図４は、コンピュータのディスプレイに表示される検索条件の入力画面を示す一例である。
図４の「ＩＰＣ」の欄は、検索対象とする文献データをフィルタリングするために用いられ、以下「対象条件」として説明する。対象条件は、ＩＰＣ（ＩｎｔｅｒｎａｔｉｏｎａｌＰａｔｅｎｔＣｌａｓｓｉｆｉｃａｔｉｏｎ）とは別の文献分類コードや、文献の発行日その他の書誌事項、あるいは文献データを全体として検索するためのキーワードの組み合わせを入力するものであってもよい。
【００１６】
「構成要件」の欄には、検索対象技術の各構成要素を表す文章データを入力する。
「検索条件」の欄には、各構成要素に対応する検索式を入力する。
「重要」の欄は、特に重視する構成要素を指定するために用いるものであり、マウスでクリックする等してチェックＯＮ／ＯＦＦの切り替えが可能となっている。
「検索場所」は、文献データが格納されている記憶装置上の位置を指定するものである。ディレクトリを直接キーボード入力してもよいし、また、「参照」ボタンをマウスでクリックすることによって、ディレクトリの一覧（図示せず）よりＧＵＩ操作で選択することも可能である。
「ヒット条件」の欄には、検索された文献データそれぞれについて、検索結果として出力するか否かを定めるためのしきい値を入力する。
「抽出箇所の数」の欄には、各構成要素に対応する単位データをひとつの文献あたりいくつまで検索ないし抽出するかを定めるものである。
「重要チェック要素を必須」の欄は、前記「重要」の欄がチェックＯＮにされた構成要件については必ず検索条件が満たされなければならないことを表すものである。すなわち、構成要件Ｂの検索条件が満たされない文献データは検索結果として出力されない。
操作者がキーボードやマウスを用いて上記の各入力を行い「検索実行」のボタンにより指示を行うと、検索処理が開始される。
【００１７】
次に、図２のフローを参照して検索処理について説明する。
検索処理の開始に当たり、まず、構成要素配列を生成する（Ｓ１００）。
ここで構成要素配列とは、各構成要素（図４に示す入力画面の「構成要件」）に対応する検索条件が満たされるか否かを記憶するための配列である。本実施の形態では、構成要素の数の配列要素を持つ整数配列であり、先頭の配列要素を示す添字が１であるものとする。
続いて、抽出データリストを生成する（Ｓ１０１）。
抽出データリストは、図３に示すように、構成要素毎に検索条件に合致する文章データを格納するための領域である。リストの各要素が各構成要素に対応しており、リストの各要素は更に検索条件に合致する文章データ（文字列データ）のリストを格納可能に構成される。
【００１８】
次に検索対象の文献データが残されているか否かを判断する（Ｓ１０２）。肯定判定であれば、ヒットテーブルを適合度に応じて並べ替えて出力する（Ｓ１０３）。文献データが終わりでなければ、続けて文献データを読込み（Ｓ１０４）、対象条件に合致するかの判定を行う（Ｓ１０５）。例えば、特許文献データの場合は、文献データ中に書誌データが含まれているので、書誌データと入力された対象条件を比較することで判定を行うことができる。
Ｓ１０５が肯定判定であればＳ１０６に進み、構成要素配列、抽出データリストを初期化（ゼロクリア）する。
【００１９】
次に文献データから本文部分以外（例えば書誌データ）を除去し（Ｓ１０７）、テキスト整形を行う（Ｓ１０８）。
特許文献における書誌事項記載部分、一般技術文献における引用・参考文献記載部分などは検索対象とされないのが好ましいため、このような部分を除去する処理をＳ１０７にて行う。特許文献など、章構成や文書構造化のためのタグが定義された文献データの場合は、該定義にしたがって、除去すべき部分を容易に判定できる。構造化文書以外では、一般に引用・参考文献は文献データの最後に記載されるので、文献データの内容を検査し、「引用文献」「参考文献」などの文字列のみからなる行が存在したら、その行以降を除去するようにするとよい。
Ｓ１０８での処理は、マーク付け言語などのフォーマットのタグ等を除去して普通文とし、文献データの所定の範囲ごとに１行のテキストデータとなるように改行を再編成する。
ここで所定の範囲とは、句点（。や．）で区切られた一文を表す文字列データでもよいし、複数文からなる段落を表す文字列データであってもよい。段落の区切りは、ＣＲ（ＣａｒｒｉａｇｅＲｅｔｕｒｎ）やＬＦ（ＬｉｎｅＦｅｅｄ）などの改行コードにより判定することができる。
【００２０】
Ｓ１０９では、検索中の構成要素の番号を示すための変数ｎをゼロに初期化する。Ｓ１１０では、次の構成要素の検索処理に移るために、変数ｎをインクリメントする。
Ｓ１１１では、変数ｎが入力された構成要素の数よりも大きいか、すなわち全ての構成要素について検索処理が行われたか否かを判定する。判定が肯定であればＳ１１２へ進み、否定であればＳ１１６へ進む。
【００２１】
Ｓ１１６〜Ｓ１１９の手順は、Ｓ１０８により整形されたテキストデータに対して、行単位に検索が行われるものである。Ｓ１１６では、検索中の行番号を示すための変数Ｌをゼロに初期化し、Ｓ１１７において変数Ｌがインクリメントされる。Ｓ１１８では、変数Ｌの値が前記テキストデータの行数を超えたか否かが判定され、肯定判定であればＳ１１０に戻り、次の構成要素の検索処理が行われる。否定判定であればＳ１１９において、現在の行Ｌが構成要素ｎに対応する検索条件Ｎに合致するか否かの判定を行う。Ｓ１１９の判定が否定であれば、Ｓ１１７に戻り、次の行の検索処理が行われる。判定が肯定であれば、Ｓ１２０に進む。
【００２２】
Ｓ１２０では、構成要素配列［ｎ］の値をインクリメントする。構成要素配列［ｎ］の値は、構成要素ｎに対応する（検索条件Ｎに合致する）行の数を表している。
次に、Ｓ１２１において、抽出データリスト［ｎ］に行Ｌの内容を追加する。
Ｓ１２２では、構成要素配列［ｎ］の値が図４に示す入力画面の「抽出箇所の数」に入力された値に達したか否かの判定を行う。肯定判定であればＳ１１０に戻り、次の構成要素の検索処理に移る。否定判定であればＳ１１７へ戻り、次の行の検索処理を続行する。
【００２３】
前述の通り、Ｓ１１１において全ての構成要素の検索処理が完了したと判断された場合、Ｓ１１２以降の処理が行われる。Ｓ１１２では、構成要素充足率の算出を行う。
構成要素充足率の算出は、構成要素配列を参照し、
値＞０である配列要素の数÷配列要素の全体の数×１００
であり、すなわち、
検索条件に合致する段落（行）がある構成要素の数÷構成要素の数×１００
と実質同一となる。
次にＳ１１３において、算出された構成要素充足率が所定値以上であるか否かを判定する。ここで所定値とは、例えば図４の「ヒット条件」に入力された値である。否定判定であれば、当該文献データは全体として検索条件に合致しないこととなり、Ｓ１０２へ戻って次の文献データの検索処理を開始する。
【００２４】
Ｓ１１４においては、図４の「重要」のチェックが付された構成要素が全て充足しているか否かを判定する。これは、
構成要素配列［ｍ］の値＞０
となるか否かにより判定することができる（ｍは「重要」のチェックが付された構成要素の先頭からの位置を示す）。否定判定であれば、Ｓ１１３の場合と同様に、Ｓ１０２へ戻って次の文献データの検索処理を開始する。なお、図４において「重要チェックを必須」が選択されていない場合は、Ｓ１１４の判定処理は不要であり、そのままＳ１１５へ進む。
Ｓ１１４が肯定判定であることにより、該文献データが検索条件に合致することとなり、Ｓ１１５において該文献データの情報をヒットテーブルに追加する。
【００２５】
図５は、ヒットテーブルの一例であり、「文献識別番号」は公報番号などであり、「スコア」は検索適合度、充足率などを表す数字である。「スコア」は、構成要素充足率をそのまま用いてもよいし、他の条件により算出してもよい。また、検索条件に合致する段落の数を加味したスコアとしてもよい。Ｓ１１３においてこのような「スコア」を算出し、構成要件充足率に代えて該スコアによりＳ１１３の判定を行うものであってもよい。
また、ここでは抽出した段落データ（または文章データ）を持続的に格納するための新たなメモリ領域を割当て、フロー処理において格納された抽出データリストの内容を複写し、割り当てられたメモリ領域へのポインタをヒットテーブルの「抽出データリストへのポインタ」に記憶する。
以上の手順により、全ての文献の検索処理が終わると、Ｓ１０３において、「スコア」の値が大きい順にヒットテーブルの並べ替えを行う。
【００２６】
図６は、ヒットテーブルに基づき検索結果が出力された表示画面の一例である。
図６の「文献番号」にはヒットテーブルの「文献識別番号」が対応する。
Ａ〜Ｆの列は、ヒットテーブルの「構成要素配列」に対応し
構成要素配列の値＞０
であるときに「＊」を表示して、該構成要素に対応する段落が該文献に存在することを示している。「＊」を表示するのに代えて、構成要素配列の値そのものを表示してもよい。
図６は更に、左方の表において行１の構成要素（この場合Ｂ）の欄がマウスクリック等で選択されたときの様子を示している。すなわち、右方には、構成要素Ｂに対応する文章データ（図４の「構成要件２の文章データ」に相当）、およびそれに対応する検索条件式、更に、行１の文献（特開ＸＸＸＸ−ＸＸＸＸＸＸ）より抽出された文章データのうち、構成要素Ｂに対応する文章データが出力される。このように左方の表において、所定の構成要素の欄を選択すると、図４における入力データおよびヒットテーブルを参照して、対応するデータが右方に表示されるものである。更に、抽出された文章データのうち、検索条件式中に現れる語はハイライト表示される。
図６に示すＡ〜Ｆの列のうち、図４の「重要」にチェックが付された構成要素に対応する列は太字あるいは色分け等により強調表示するとよい。
【００２７】
更に、図６において、文献を選択するためのチェックボックスが表示されており、操作者は選択した文献と構成要素との対応表を作成する指示を行うことができる（図示せず）。対応表の一例は図７に示す通り、選択された文献より抽出された文章データが各構成要素に対応する体裁を有しているものであって、プリンタ等の印字装置に出力されてもよいし、ＨＴＭＬ形式等の文書ファイルとして出力されてもよい。
また、図６の画面において、例えば「特開ＸＸＸＸ−ＸＸＸＸＸＸ」の文字上をマウスでダブルクリックすると、ヒットテーブルの「ファイル名」の値に基づき、特開ＸＸＸＸ−ＸＸＸＸＸＸの原文データが該ファイルから読み込まれてディスプレイに表示される（図示せず）。
このようにして、操作者は、各文献に目的とする記述が存在するか否かを容易に確認することができる。
【００２８】
［変形例］
上記は、あらかじめＨＤＤ上に格納された文献データを対象条件によりフィルタリングした後、対象条件に合致する文献データに対して構成要素ごとの検索処理を行うものであったが、次のように変形することができる。
図４に示す入力画面で入力が完了し、操作者に検索実行の指示がなされたあと、通信回線を介して外部の文献データベースサーバーに接続し、該データベースに対して対象条件による検索実行のクエリーを発行し、検索結果の文献データを指定された「検索場所」にダウンロードし、ダウンロードされた文献データを対象として構成要素ごとの検索処理を行うようにしてもよい。この場合、Ｓ１０５の処理は不要となる。
【００２９】
［応用例］
本実施の形態は更に以下のように応用することが可能である。
［応用例１］
図４において、構成要件の文章データは、構成要件ごとに該当する枠内に入力されるものであったが、所定の方法で入力された一連の文章データを解析して各構成要素に自動的に分解し、各構成要件入力欄にセットするようにしてもよい。
具体的には、一連の文章データである文字列データを、読点（“、”や“，”などの記号）で区切り、区切られた各部分を一構成要素として解釈するものである。
更に、各部分の文字数が所定の数以下であるときに、次の部分と連結してひとつの構成要素とする補正を行ってもよい。
【００３０】
［応用例２］
技術用語と品詞を対応付けた辞書データを用意し、構成要素である部分文章データから、前記辞書に存在する技術用語を抽出して検索条件入力欄にセットするようにすることができる。また、汎用語辞書データを更に用意して、該汎用語辞書に含まれる技術用語を前記抽出された技術用語群から取り除くなどして、技術的特徴に直接結びつかない語句（例えば「手段」「ステップ」「方法」「装置」など）を検索条件入力欄にセットされないようにしてもよい。
検索条件入力欄への引き渡し方法については、抽出した技術用語を単純に羅列して検索条件入力欄にセットし、その結果、各用語が論理積（ＡＮＤ結合）または論理和（ＯＲ結合）で検索されるようにするものであってもよい。
他の引き渡し方法としては、抽出した技術用語を名詞群と非名詞群に分け、各群の内部においてはＯＲ結合、各群相互間はＡＮＤ結合となるようにしてもよい。
【００３１】
［応用例３］
類似語辞書データを更に用意し、検索条件入力欄にセットされた各検索語について、類似語辞書に登録されている用語については、該検索語とそれに対応する類似語とをＯＲ結合した検索条件式に拡張するものであってもよい。
一例としては、検索条件入力欄に、
構成＊要素＊検索
と入力された後、操作者よりの類似語拡張の指示により各語を類似語辞書において検索し、取得された類似語を用いて
（構成＋構造）＊（要素＋エレメント）＊（検索＋検出＋検査）
の様に検索式を拡張するものである。
この例では、類似語辞書には「構成」に対応する類似語として「構造」が記憶されており、同様に「要素」に対して「エレメント」が、「検索」に対して「検出」および「検査」が記憶されているものである。
【００３２】
上記応用例１〜応用例３は、文章データが入力された後、操作者の指示を介さず自動的に処理が実行され、検索処理が開始されるようにしてもよい。
また、文章データの記述スタイルを複数に類型化し、該類型中から操作者が選択した類型に応じた文章データ分解・用語抽出処理が選択されて実行されるものであってもよい。
【００３３】
［第２の実施の形態］
第２の実施の形態は、クライアント／サーバーのコンピュータシステム（図示せず）により運用され、クライアントコンピュータにて検索条件を入力してサーバーに送信し、サーバーコンピュータにおいては、クライアントより受信した検索条件にもとづき検索処理を行い、検索結果をクライアントに送信するものである。
【００３４】
クライアントは第１の実施の形態と同様、一般的なＰＣを用いて図４に示す入力画面を用いた検索条件の入力が行われる。そして入力された内容が通信回線を介してサーバーに送信される。
第１の実施の形態においては、文献データごとに逐一検索処理を行っていたが、第２の実施の形態においては、サーバー上に一般的な構成の文献データベースが構築されており、この検索システムを効果的に利用する点で第１の実施の形態と大きく異なる。
【００３５】
図８は、第２の実施の形態に係るサーバー上の検索システムのブロック図である。
サーバー上の記憶装置には、文献データが格納された原文データベース１０と、該原文データベース１０に対応する文献識別番号と書誌データが対応づけられた書誌インデックス１１、文献識別番号と全文検索用のインデックスを対応づけた全文インデックス１２が記憶されている。
【００３６】
サーバーにおいては、検索条件をクライアントより受信した後、以下の通り処理が行われる。
まず、検索条件に含まれる対象条件を用いて１次検索を行う。対象条件がＩＰＣなどの書誌事項であれば、書誌検索エンジン１３により書誌インデックスを検索し、該当する文献識別番号のリストを得る（識別データリスト１（１５））。対象条件に全文検索のためのキーワードが含まれている場合は、全文検索エンジン１４により全文インデックスを検索し、該当する文献識別番号のリストを得る（識別データリスト２（１６））。
【００３７】
次いで、演算処理部１７により識別データリスト１（１５）と識別データリスト２（１６）の和集合を演算し、ヒットリスト１８を得る。
以上により、ヒットリスト１８は、対象条件に合致する文献識別番号のリストとなる。このヒットリスト１８に示される文献データを原文データベース１０から読み込み、第１の実施の形態と同様に、図２のフローチャートに従う２次検索（構成要素ごとの検索）処理を行うことで、検索結果を得ることができる。
【００３８】
しかし、この方法では、処理すべき文献データの数が膨大になり、サーバーの処理負荷が高くなるおそれがあるので、以下の様にして最終ヒットリストを作成するようにするとよい。
ヒットリストが得られた後、構成要素ごとの検索条件ｎで全文インデックス１２を検索し、識別番号リスト２１_１〜２１_ｎ（識別データリスト２（１６）に代わる）を出力する。
そして、前記ヒットリスト、および識別番号リスト２１_１〜２１_ｎに含まれる文献識別番号の和集合をとり、最終ヒットリストとする。
こうすれば、２次検索処理の対象文献を必要最小限に絞り込むことができる。
そして、最終ヒットリストに示される文献データを原文データベース１０から読み込み、第１の実施の形態と同様に、図２のフローチャートに従う２次検索（構成要素ごとの検索）処理を行うことで、検索結果を得ることができる。サーバーは検索結果として得られたヒットテーブルと抽出データリスト（図１のヒットテーブル４と抽出データリスト５に対応）をクライアントに送信し、クライアントではこれに基づき、図６に示す表示出力を行うことができる。
【００３９】
前述した最終ヒットリストを演算する段階で、構成要素充足率の計算を行い、指定された条件に合致するもののみを最終ヒットリストに出力してもよい。
具体的には、
１、構成要素の数×指定された構成要素充足率（ヒット条件）÷１００＝必要数Ｋ（小数切捨）とする
２、各ヒットリストを連結し、識別番号順に並べ替えを行い、リストの先頭から検査して同一の識別番号がＫ個以上連続すれば該識別番号を最終ヒットリストに出力する
というものである。こうすれば、図２のフロー中「構成要件充足率の算出」（Ｓ１１２）「充足率が所定値以上？」（Ｓ１１３）などの処理は不要となる。
【００４０】
これと同時に、更に必須要素充足の判定を行い、指定された条件に合致するもののみを最終ヒットリストに出力してもよい。具体的には、
１、必須要素として指定された（図４において「重要」のチェックが付された）構成要素に対応する識別番号リスト２_ｎの和集合を採る・・・（Ｘ）
２、必須要素として指定された以外の構成要素に対応する識別番号リスト２_ｎを連結する・・・（Ｙ）
３、構成要素の数×指定された構成要素充足率÷１００−必須要素の数＝必要数ｋ（小数切捨）とする
４、（Ｙ）の先頭から検査し、
ａ、同一番号がｋ個以上ある
ｂ、該番号が（Ｘ）に存在する
が共に肯定判定であれば、最終ヒットリストに出力する。
こうすれば、図２のフロー中「必須要素すべて充足？」（Ｓ１１４）の処理が不要となる。
【００４１】
［第３の実施の形態］
第３の実施の形態もまた、クライアント／サーバーの形で運用されるものである。
第２の実施の形態との違いは、全文検索エンジンが近接演算（近傍検索）機能を有していることである。
第３の実施の形態においても、まず第２の実施の形態と同様の１次検索処理を行う。
次に、全文検索エンジンを用いて、構成要素ごとの検索条件式を用いた近傍検索を行う。この検索処理を行った結果、図１０に示すようなデータが得られる。
【００４２】
図１０（ａ）は、ひとつの構成要素について検索を行った結果得られるものであり、検索条件式に合致する文献データ中の範囲：位置１〜位置ｍが得られる。
そして、全ての構成要素についての近傍検索が終了すると、図１０（ａ）の検索結果が構成要素の数だけ得られるので、これらを図１０（ｂ）の通り統合する。具体的には、文献識別番号については和集合をとり、図１０（ａ）の位置情報を図１０（ｂ）の構成要素位置情報の対応する領域に複写する。
そして、図１０（ｂ）に含まれる文献識別番号の集合と、前記１次検索の結果であるヒットリストに含まれる文献識別番号の集合との積をとり（ＡＮＤ演算）、最終ヒットリスト（図１０（ｂ））とする。
【００４３】
最終ヒットリストに含まれる各文献に対して、図９のフローに従う処理を行う。
図９のフローの手順は、図２のフロー手順を変形したものである。図２の手順と大きく異なるところは、行ごとに検索条件に合致するか否かの判定が不要であることである。なぜなら、最終ヒットリスト（図１０（ｂ））に、構成要素に該当する単位データの位置が既に取り出されているからである。
図９における、図２のステップ番号と同じステップ番号を付された手順内の処理は、図２に関して前述したものと同様の処理が行われるものであるため、第１の実施の形態に記載した説明を参照されたい。
【００４４】
Ｓ１２３においては、変数ｍを初期化し、Ｓ１２４で該変数ｍをインクリメントする。変数ｍは、図１０（ｂ）のデータにおける位置１〜位置ｍを定めるためのインデックス値として用いるものである。
Ｓ１２５において、構成要素ｎに対応する複数の位置情報のうち、位置ｍの情報を得る。
Ｓ１２６では、Ｓ１２５で取得した位置情報で示される範囲のテキストデータを文献データ中から抽出する。以降の手順は、図２における場合と同様である。
【００４５】
このように本発明によれば、「文」「段落」などの構成要素の集合として文献データを捉えた効率的な特許情報の検索システムを構築でき、特許情報データベースの検索技術に限定されず、技術単語、構成要素などの単位データの構造を効率的にカスタマイズして生成することによって、他の技術文献（例えば、ＩＳＯ、各種研究所、大学の資料データベース等）の技術文献検索システムにも適用可能である。
【００４６】
【発明の効果】
以上説明したように、本発明によれば、電子化された技術文献データを検索する文献検索システムであり、技術の構成要素データと共に構成要素毎に検索条件を入力することで技術文献データを検索し、検索条件に合致する単位データを抽出して、構成要素データと単位データおよび該単位データに対応する技術データの識別データとを対応付けて出力するので、先行技術、異議・無効調査その他の目的で特許情報データベースを検索する場合や、他の技術情報の検索に適用した場合の利用者の処理が簡単化され、迅速で正確に所望の文献識別番号を検索できるという効果がある。
【図面の簡単な説明】
【図１】本発明の、第１の実施の形態に係る技術文献検索システムのブロック図である。
【図２】図１に示すシステムの処理のフローチャートである。
【図３】抽出データリストの構造を示す図である。
【図４】検索入力画面の一例を示す図である。
【図５】ヒットテーブルの一例を示す図である。
【図６】検索結果の出力画面の一例を示す図である。
【図７】検索条件と検索結果とを対応させた出力例を示す図である。
【図８】本発明の、第２の実施の形態に係る技術文献検索システムのブロック図である。
【図９】本発明の、第３の実施の形態に係る技術文献検索システムの処理のフローチャートである。
【図１０】第３の実施の形態で用いられる近傍検索の出力の一例を示す図であり、（ａ）はひとつの構成要素についての検索結果を示し、（ｂ）は構成要素を統合したリストを示している。
【符号の説明】
１文献データ
２検索条件入力部
３検索処理部
４ヒットテーブル
５抽出データリスト
６検索結果出力部
１０原文データベース
１１書誌インデックス
１２全文インデックス
１３書誌検索エンジン
１４全文検索エンジン
１５識別データリスト１
１６識別データリスト２
１７演算処理部
１８ヒットリスト

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a system for searching technical documents, such as those held in a patent information database.
[0002]
[Prior art]
A known example of a conventional patent information search system is PATOLIS (registered trademark)-WEB, provided by PATOLIS Corporation. When such a conventional system is used for a prior art search, or for a document search in support of opposition or invalidation proceedings during prosecution, search conditions such as International Patent Classification codes and keywords are combined into a logical expression, and a text search or the like is performed. A search using such a system generally proceeds as follows.
Procedure 1: A search is performed under appropriate conditions, such as classification codes and logical expressions.
Procedure 2: Primary screening is performed, focusing on the abstract and similar parts of each retrieved document.
Procedure 3: The documents selected in the primary screening are examined closely, and a secondary selection is made.
Procedure 4: From the documents of the secondary selection, the passages corresponding to the component requirements are extracted and a comparison chart is prepared.
[0003]
In the patent gazette search system disclosed in Patent Document 1, patent gazette data batch-transferred from a patent information database is reconstructed by a dedicated processing support program and re-edited into a format that each section of a company can easily reuse, for example so that a development section can see at a glance the state of development and the prior art in a specific technical field.
[0004]
Meanwhile, Patent Documents 2 to 4 disclose various search techniques for general document data and the like, although they are not directly related to patent information databases.
Patent Document 2 constructs a structured document database from document data structured with a markup language such as SGML (Standard Generalized Markup Language), that is, from documents in which the names and extents of document components such as the title, chapter headings, and body text are marked in the document with appropriate symbols, and searches for similar documents using keywords composed from the document components.
[0005]
In Patent Document 3, document components that a markup language or the like divides into logical structures such as chapters, sections, paragraphs, and frames are stored in a database as a tree-structured data structure and are searched using database search techniques or the like.
[0006]
Patent Document 4 discloses an example in which the search unit data is built from both words extracted from a document and document components such as “sentence” and “paragraph”, as also seen in the other patent documents, and a search is performed on them.
Its unit for calculating the relation between relevant documents and search terms, which evaluates the search results, calculates position information for each “word” or “document component”. For example, it checks how many times a search word appears in a sentence, expresses each occurrence as a position counted from the end (or the beginning) of the sentence, and performs a proximity operation to determine the degree of relevance.
[0007]
[Patent Document 1]
JP 2001-22794 A (paragraph [0020], FIG. 1)
[Patent Document 2]
JP H07-44567 A (paragraph [0019], FIG. 1)
[Patent Document 3]
JP H08-44766 A (paragraph [0010], FIG. 3)
[Patent Document 4]
JP 2002-189754 A (abstract, FIG. 1)
[0008]
[Problems to be solved by the invention]
In the conventional approach described above, Procedure 1 evaluates the search condition against the document data as a whole (that is, it checks whether each term specified in the search condition appears somewhere in the document data), so many documents of low relevance are also retrieved (generally called “noise”).
In Procedure 2, screening focuses on only part of the document data, so important documents may be overlooked.
In Procedures 3 and 4, the work involves preparing a component requirement check table or the like and working out the correspondence for each document one by one, so the amount of work becomes enormous.
[0009]
The technique of Patent Document 1 makes Procedures 2 and 3 more efficient, but each document still has to be read and checked one by one.
The technique of Patent Document 2 allows documents to be examined in order of similarity, which increases the chance of finding the target document early. However, because the similarity is an evaluation of the document as a whole, it offers no improvement for the work in Procedures 3 and 4.
The technique of Patent Document 3 is a search technique that focuses on the components of a document, but it is intended for document reuse, so further improvement is needed before it can serve the purpose of the present application, namely “searching for documents that contain all (or as many as possible) of a plurality of components”.
With the technique of Patent Document 4, the search condition is evaluated not against the document as a whole but per unit such as a paragraph, so the search noise in Procedure 1 can be reduced. For the purpose of the present application, however, its effect is comparable to that of Patent Document 2.
[0010]
Accordingly, an object of the present invention is to provide a technical document search system that introduces highly efficient document data search techniques, database search techniques, and the like, and can retrieve the desired patent information and gazettes quickly, accurately, and without waste.
[0011]
To achieve the above object, the invention according to claim 1 is a computer system for searching technical document data that is digitized, stored in storage means, and composed of a plurality of unit data, the system comprising: component input means for accepting, for each of a plurality of components constituting a search target technology, input of data representing that component; search condition input means for accepting input of a search condition for each component; search means for searching the technical document data for each component based on the search condition; unit data extraction means for extracting, from the retrieved technical document data, unit data that matches the search condition; and recording means for recording, for each retrieved technical document, component array data indicating whether the technical document satisfies the search condition of each component, in association with the unit data that matched the search condition and identification data of the technical document.
[0012]
The invention according to claim 2 is the technical document search system according to claim 1, further comprising: input means for accepting input of technical text data; and technical composition decomposition means for analyzing the technical text data, dividing it into a plurality of pieces of text data, and passing the divided text data to the component input means.
The invention according to claim 3 is the technical document search system according to claim 1, further comprising: dictionary storage means for storing dictionary data of terms; and technical term extraction means for extracting technical terms from the data representing the components by referring to the dictionary data, and passing the extracted technical terms to the search condition input means.
The invention according to claim 4 is the technical document search system according to claim 1, further comprising: similar word storage means for storing a plurality of sets of similar words; and similar word addition means for acquiring from the similar word storage means, for each keyword included in the search condition accepted by the search condition input means, the similar words corresponding to that keyword, and expanding the search condition by adding the acquired similar words.
The invention according to claim 5 is the technical document search system according to claim 1, further comprising: relevance condition input means for accepting input of a relevance condition that defines a condition on the search relevance; and search relevance calculation means for calculating the search relevance for each retrieved technical document, wherein the unit data extraction means extracts unit data only from technical document data whose search relevance satisfies the relevance condition.
The invention according to claim 6 is the technical document search system according to claim 1, further comprising target condition input means for accepting input of an attribute of the technical document data that the search means should search, wherein the search means searches only technical document data matching the attribute.
The invention according to claim 7 is characterized in that the search means performs the search based on the search condition for each predetermined range within the technical document data.
The invention according to claim 8 is characterized in that, when the technical document data is divided into chapters, the search means does not search data belonging to predetermined chapters.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[First Embodiment]
FIG. 1 is a block diagram of a technical literature search system according to the first embodiment of the present invention.
This embodiment is a software system that can run on a personal computer (hereinafter “PC”), a workstation, or the like. Because an ordinary commercially available PC can be used, the hardware configuration (not shown) is described only briefly below.
[0014]
The search condition input unit 2 in FIG. 1 consists of a program that displays the search condition input screen, a display device such as a monitor, and input devices such as a mouse and keyboard. The document data 1 is stored in a storage device such as an HDD. The search processing unit 3 is a program that performs search processing according to the search conditions entered through the search condition input unit 2. The hit table 4 and the extracted data list 5 are output by the search processing unit 3 as search results and are stored in main memory or in a secondary storage device such as an HDD.
The search result output unit 6 consists of a program that displays the search results and a display device such as a monitor. The search result screen is operated with input devices such as a mouse and keyboard.
[0015]
First, input of the search conditions will be described with reference to FIG. 4.
FIG. 4 shows an example of the search condition input screen displayed on the computer display.
The “IPC” field in FIG. 4 is used to filter the document data to be searched, and is referred to below as the “target condition”. Instead of IPC (International Patent Classification) codes, the target condition may be a different document classification code, the publication date or other bibliographic items of a document, or a combination of keywords for searching the document data as a whole.
[0016]
In the “component requirement” fields, text data describing each component of the search target technology is entered.
In the “search condition” fields, the search expression corresponding to each component is entered.
The “important” field is used to designate components that are to be given particular weight. Each check box can be toggled ON or OFF by, for example, clicking it with the mouse.
“Search location” specifies the location on the storage device where the document data is stored. A directory may be typed in directly from the keyboard, or it may be selected through the GUI from a directory list (not shown) opened by clicking the “Browse” button with the mouse.
In the “hit condition” field, a threshold is entered that determines whether each retrieved document is output as a search result.
The “number of extraction locations” field specifies the maximum number of unit data per document that are searched for or extracted for each component.
The “require important-checked elements” field indicates that the search condition must be satisfied for every component requirement whose “important” box is checked. For example, when component requirement B is checked as important, document data that does not satisfy the search condition for B is not output as a search result.
When the operator has made the above entries with the keyboard and mouse and presses the “Run search” button, the search process starts.
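As an illustration only, the values collected on the FIG. 4 screen could be held in a structure like the following Python sketch. The class names, field names, default values, and the sample IPC code and search expressions are assumptions made for this example and do not appear in the specification.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentEntry:
    """One row of the input screen (field names are illustrative)."""
    text: str                # "component requirement" text
    condition: str           # "search condition" expression, kept as raw text
    important: bool = False  # state of the "important" check box


@dataclass
class SearchForm:
    """All values collected before the search is run."""
    target_condition: str                      # e.g. an IPC code used as a filter
    components: list[ComponentEntry] = field(default_factory=list)
    search_location: str = "."                 # directory holding the document data
    hit_threshold: float = 50.0                # "hit condition", in percent (assumed default)
    max_extractions: int = 3                   # "number of extraction locations" (assumed default)
    require_important: bool = False            # "require important-checked elements"


# Hypothetical sample input: two components, the second marked as important.
form = SearchForm(
    target_condition="G06F17/30",
    components=[
        ComponentEntry("input means for each component", "component*input"),
        ComponentEntry("search means for each component", "component*search", important=True),
    ],
    require_important=True,
)

# 1-based positions of the components whose "important" box is checked.
important_positions = [i for i, c in enumerate(form.components, start=1) if c.important]
print(important_positions)
```

Keeping the search expression as raw text here is deliberate: how the expression is parsed and evaluated is a separate concern from collecting the form values.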
[0017]
Next, the search process will be described with reference to the flowchart of FIG. 2.
At the start of the search process, a component array is first created (S100).
The component array is an array that records whether the search condition corresponding to each component (each “component requirement” on the input screen of FIG. 4) is satisfied. In this embodiment it is an integer array with as many elements as there are components, and the index of its first element is 1.
Subsequently, an extracted data list is generated (S101).
As shown in FIG. 3, the extracted data list is an area for storing text data that matches the search condition for each component. Each element of the list corresponds to each component, and each element of the list is configured to be able to store a list of text data (character string data) that matches the search condition.
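The two structures created in S100 and S101 can be pictured with the short Python sketch below. It is a minimal illustration, not the implementation of the embodiment; the function names are hypothetical, and the unused element at index 0 is simply one way to reproduce the 1-based subscripts of the description in a language whose lists start at 0.

```python
def new_component_array(num_components: int) -> list[int]:
    # Index 0 is unused padding so that component n lives at index n,
    # matching the 1-based subscripts used in the description.
    return [0] * (num_components + 1)


def new_extracted_data_list(num_components: int) -> list[list[str]]:
    # One inner list per component; each holds the text strings that
    # matched that component's search condition.
    return [[] for _ in range(num_components + 1)]


component_array = new_component_array(3)
extracted = new_extracted_data_list(3)

# Record one matching line for component 2.
component_array[2] += 1
extracted[2].append("a line that matched the condition for component 2")
```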
[0018]
Next, it is determined whether all of the document data to be searched has been processed (S102). If so, the hit table is sorted by relevance and output (S103). If document data remains, the next document is read (S104) and checked against the target condition (S105). In the case of patent document data, for example, the document data includes bibliographic data, so the check can be made by comparing the bibliographic data with the entered target condition.
If S105 is affirmative, the process proceeds to S106, where the component array and the extracted data list are initialized (cleared to zero).
[0019]
Next, the parts of the document data other than the body text (for example, bibliographic data) are removed (S107), and the text is reformatted (S108).
Parts such as the bibliographic section of a patent document or the citation and reference section of a general technical document should preferably be excluded from the search, so S107 removes them. For document data in which tags for chapter structure or document structuring are defined, such as patent documents, the parts to remove can easily be identified from those definitions. For unstructured documents, citations and references generally appear at the end of the document data, so the content can be scanned and, if a line consisting solely of a string such as “Cited documents” or “References” is found, that line and everything after it can be removed.
S108 removes format tags such as those of a markup language to obtain plain text, and rearranges the line breaks so that each predetermined range of the document data becomes one line of text.
Here, the predetermined range may be a character string representing a single sentence delimited by a full stop (。 or ．), or a character string representing a paragraph made up of several sentences. Paragraph boundaries can be detected from line break codes such as CR (Carriage Return) and LF (Line Feed).
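The following Python sketch shows one way S107 and S108 could look for an unstructured Japanese document, using a sentence as the predetermined range. The function names, the set of heading strings, and the regular expressions are assumptions for illustration; in particular, the tag removal is deliberately crude and is not a real markup parser.

```python
import re

# Heading strings that mark the start of a reference section (assumed list).
REFERENCE_HEADINGS = {"引用文献", "参考文献", "Cited documents", "References"}


def strip_reference_section(text: str) -> str:
    """S107 (unstructured case): drop a reference heading line and everything after it."""
    kept = []
    for line in text.splitlines():
        if line.strip() in REFERENCE_HEADINGS:
            break
        kept.append(line)
    return "\n".join(kept)


def reshape_to_sentences(text: str) -> list[str]:
    """S108: remove markup tags and return one sentence per list element."""
    plain = re.sub(r"<[^>]+>", "", text)      # crude tag removal
    plain = re.sub(r"[\r\n]+", "", plain)     # discard the original line breaks
    # Each sentence runs up to and including the next full stop (。 or ．).
    sentences = re.findall(r"[^。．]+[。．]?", plain)
    return [s.strip() for s in sentences if s.strip()]


raw = "<p>検索条件を入力する。</p>\n<p>文献を検索する。</p>\n参考文献\n特開2001-22794"
lines = reshape_to_sentences(strip_reference_section(raw))
print(lines)
```

Splitting on paragraphs instead of sentences would only change the second function: the CR and LF codes would be kept as delimiters rather than discarded.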
[0020]
In S109, the variable n, which indicates the number of the component currently being searched, is initialized to zero. In S110, n is incremented to move on to the next component.
In S111, it is determined whether n is greater than the number of components entered, that is, whether the search has been performed for every component. If so, the process proceeds to S112; otherwise it proceeds to S116.
[0021]
Steps S116 to S119 search the text data reformatted in S108 line by line. In S116, the variable L, which indicates the line number currently being searched, is initialized to zero, and in S117 it is incremented. In S118, it is determined whether L exceeds the number of lines in the text data; if so, the process returns to S110 and the next component is searched. Otherwise, S119 determines whether the current line L matches the search condition N corresponding to component n. If it does not, the process returns to S117 and the next line is searched. If it does, the process proceeds to S120.
[0022]
In S120, the value of the component array [n] is incremented. The value of the component array [n] represents the number of lines that correspond to the component n (that is, match the search condition N).
Next, in S121, the contents of line L are added to the extracted data list [n].
In S122, it is determined whether or not the value of the component array [n] has reached the value input in “Number of extraction locations” on the input screen shown in FIG. If the determination is affirmative, the process returns to S110 and moves on to the search process for the next component. If the determination is negative, the process returns to S117 and the search process for the next line is continued.
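The loop of S109 to S122 can be illustrated with the following sketch. It assumes, purely for illustration, that each search condition is represented as a function that returns True when a line matches; the specification does not prescribe this representation, and the names `search_components` and `max_hits` are hypothetical.

```python
def search_components(lines, conditions, max_hits):
    """Search every line once per component and collect matching lines.

    lines      -- shaped text data, one unit (sentence or paragraph) per entry
    conditions -- one predicate per component (hypothetical representation)
    max_hits   -- the "Number of extraction locations" value
    """
    component_array = [0] * len(conditions)
    extracted = [[] for _ in conditions]
    for n, matches in enumerate(conditions):      # S109 to S111: next component
        for line in lines:                        # S116 to S118: next line
            if not matches(line):                 # S119: line does not match
                continue
            component_array[n] += 1               # S120: count the matching line
            extracted[n].append(line)             # S121: keep its contents
            if component_array[n] >= max_hits:    # S122: enough locations found
                break
    return component_array, extracted
```

For example, with conditions for "sensor", "valve and opens", and "pump", a document that never mentions a pump ends with a zero in the third position of the component array.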
[0023]
As described above, when it is determined in S111 that the search process for all the constituent elements has been completed, the processes after S112 are performed. In S112, the component satisfaction rate is calculated.
The component satisfaction rate is calculated from the component array as
(number of array elements with value > 0) ÷ (total number of array elements) × 100
which is substantially the same as
(number of components having paragraphs (lines) that match the search condition) ÷ (number of components) × 100
Next, in S113, it is determined whether or not the calculated component satisfaction rate is equal to or greater than a predetermined value. Here, the predetermined value is, for example, a value input in the “hit condition” in FIG. If a negative determination is made, the document data as a whole does not match the search conditions, and the process returns to S102 to start the next document data search process.
[0024]
In S114, it is determined whether or not all the components checked as “important” in FIG. 4 are satisfied, that is, whether
Value of component array [m] > 0
holds for each m (m indicates the position, counted from the top, of a component checked as “important”). If the determination is negative, as in S113, the process returns to S102 and the search process for the next document data is started. Note that if “important check is essential” is not selected in FIG. 4, the determination process of S114 is unnecessary, and the process proceeds directly to S115.
If the determination in S114 is affirmative, the document data matches the search condition, and the information of the document data is added to the hit table in S115.
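The determinations of S112 to S114 can be sketched as follows. The function and parameter names are hypothetical, and positions are counted from zero here, which is an assumption made for the example only.

```python
def satisfaction_rate(component_array):
    """S112 (sketch): percentage of components with at least one matching line."""
    satisfied = sum(1 for value in component_array if value > 0)
    return satisfied / len(component_array) * 100


def is_hit(component_array, threshold, important=(), important_required=True):
    """S113 and S114 (sketch): decide whether a document goes into the hit table.

    threshold          -- the "hit condition" value, in percent
    important          -- zero-based positions of components checked "important"
    important_required -- whether "important check is essential" is selected
    """
    if satisfaction_rate(component_array) < threshold:    # S113
        return False
    if important_required:                                # S114
        return all(component_array[m] > 0 for m in important)
    return True
```

With a component array of [2, 1, 0, 3], three of four components are satisfied, so the rate is 75; the document is a hit at a threshold of 70 unless the third component is marked important and the important check is essential.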
[0025]
FIG. 5 is an example of a hit table, where “document identification number” is a publication number or the like, and “score” is a number that represents search suitability, a satisfaction rate, or the like. The “score” may use the component satisfaction rate as it is, or may be calculated according to other conditions. The score may also take into account the number of paragraphs that match the search condition. Such a “score” may be calculated in S113, and the determination in S113 may be based on the score instead of the component satisfaction rate.
Also, at this point, a new memory area for contiguously storing the extracted paragraph data (or sentence data) is allocated, the contents of the extracted data list accumulated during the flow are copied into it, and a pointer to the allocated memory area is stored in “pointer to extracted data list” of the hit table.
When the search process for all documents is completed by the above procedure, the hit table is rearranged in descending order of the “score” value in S103.
[0026]
FIG. 6 is an example of a display screen on which search results are output based on the hit table.
The “reference number” in FIG. 6 corresponds to the “reference number” in the hit table.
The columns A to F correspond to the “component array” in the hit table. Where
Component array value > 0
holds, “*” is displayed to indicate that a paragraph corresponding to the component exists in the document. Instead of displaying “*”, the value of the component array itself may be displayed.
FIG. 6 further shows a state in which one component column (in this case B) in row 1 of the left table has been selected by a mouse click or the like. On the right side are output the text data of the component B (corresponding to “text data of constituent requirement 2” in FIG. 4), the search condition expression corresponding to it, and, out of the text data extracted from the document in row 1 (JP XXXX-XXXXXX), the text data corresponding to the component B. In this way, when a component column is selected in the left table, the corresponding data is displayed on the right with reference to the input data of FIG. 4 and the hit table. Furthermore, within the extracted sentence data, words appearing in the search condition expression are highlighted.
Of the columns A to F shown in FIG. 6, the column corresponding to the component for which “important” in FIG. 4 is checked may be highlighted by bold or color coding.
[0027]
Further, in FIG. 6, a check box for selecting a document is displayed, and the operator can instruct creation of a correspondence table between the selected documents and the components (not shown). An example of the correspondence table is shown in FIG. 7: the text data extracted from each selected document is laid out against the corresponding component, and the table may be output to an output device such as a printer. Alternatively, it may be output as a document file in HTML format or the like.
On the screen of FIG. 6, when the user double-clicks, for example, the characters “JPXXXX-XXXXXXX”, the original text data of JPXXXX-XXXXXXX is read from the file indicated by the “file name” value in the hit table and displayed on the display (not shown).
In this way, the operator can easily confirm whether or not the target description exists in each document.
[0028]
[Modification]
In the above, the literature data stored in the HDD in advance is first filtered according to the target condition, and the search processing for each component is then performed on the literature data that matches the target condition. However, the following is also possible.
After input on the input screen shown in FIG. 4 is completed and the operator instructs execution of a search, the system may connect to an external document database server via a communication line, issue to that database a query for a search according to the target condition, download the resulting document data to the designated “search location”, and perform the search process for each component on the downloaded document data. In this case, the process of S105 is unnecessary.
[0029]
[Application example]
This embodiment can be further applied as follows.
[Application Example 1]
In FIG. 4, the text data of each constituent requirement is input in the frame corresponding to that requirement. Alternatively, a series of text data input by a predetermined method may be analyzed, automatically divided into constituent elements, and set in the respective constituent requirement input fields.
Specifically, the character string data forming the series of text data is split at separators (symbols such as “、” and “,”), and each resulting part is interpreted as one constituent element.
Further, when the number of characters in each part is equal to or less than a predetermined number, correction may be performed by connecting to the next part to form one component.
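A rough sketch of this decomposition follows. The separator set, the threshold `min_chars`, and the choice to re-join short parts with ", " are all assumptions made for illustration; the specification only states that text is split at separators and that a part at or below a predetermined length may be connected to the next part.

```python
import re


def split_components(text, min_chars=5):
    """Split a series of text data into components at separator marks.

    A part whose length is min_chars or less is connected to the next part
    (hypothetical threshold and joining rule).
    """
    parts = [p.strip() for p in re.split(r"[、，,]", text) if p.strip()]
    components, carry = [], ""
    for part in parts:
        current = f"{carry}, {part}" if carry else part
        if len(current) <= min_chars:
            carry = current            # too short: connect to the next part
        else:
            components.append(current)
            carry = ""
    if carry:                          # a trailing short part has no next part
        components.append(carry)
    return components
```

With `min_chars=6`, the text "A pump, a heat sensor attached to the pump, and a controller that stops the pump" yields two components, because the six-character part "A pump" is joined to the part that follows it.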
[0030]
[Application Example 2]
Dictionary data in which technical terms are associated with parts of speech may be prepared, and technical terms found in the dictionary may be extracted from the partial text data that serves as a constituent element and set in the search condition input field. Further, by preparing general term dictionary data and removing the terms included in the general term dictionary from the extracted group of technical terms, words that are not directly related to technical features (for example, “means”, “step”, “method”, “apparatus”) can be kept out of the search condition input field.
As for the method of delivery to the search condition input field, the extracted technical terms may simply be enumerated and set in the search condition input field, so that the terms are searched as a logical product (AND combination) or a logical sum (OR combination).
As another delivery method, the extracted technical terms may be divided into a noun group and a non-noun group, with the terms within each group OR-combined and the groups AND-combined.
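The term extraction and the second delivery method can be sketched as below. The dictionary contents, the naive substring matching, and the function names are hypothetical. The sketch writes OR as "+" and AND as "*", following the notation of the search expressions in Application Example 3.

```python
def extract_terms(text, technical_dict, general_terms):
    """Return technical terms found in text, excluding general terms.

    technical_dict maps each term to its part of speech (hypothetical data).
    Matching is plain substring search, which is a simplification.
    """
    found = [t for t in technical_dict if t in text and t not in general_terms]
    return sorted(found, key=text.index)  # keep order of first appearance


def build_condition(terms, technical_dict):
    """OR-combine terms within the noun and non-noun groups, AND-combine the groups."""
    nouns = [t for t in terms if technical_dict[t] == "noun"]
    others = [t for t in terms if technical_dict[t] != "noun"]
    groups = ["(" + " + ".join(g) + ")" for g in (nouns, others) if g]
    return " * ".join(groups)
```

For the text "means to detect heat with a sensor and close a valve", with "means" listed as a general term, the extracted terms are "detect", "sensor", and "valve", and the resulting condition is "(sensor + valve) * (detect)".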
[0031]
[Application Example 3]
Similar word dictionary data may further be prepared for the search words set in the search condition input field. For a term registered in the similar word dictionary, the search condition expression may be expanded so that the search word and its corresponding similar words are OR-combined.
As an example, when the search condition input field contains
Configuration * element * search
and the operator instructs similar word expansion, the similar word dictionary is searched and the acquired similar words are used to expand the search expression to
(Configuration + Structure) * (Element + Element) * (Search + Detection + Inspection)
In this example, the similar word dictionary stores “structure” as a similar word corresponding to “configuration”, and similarly “element” for “element”, and “detection” and “inspection” for “search”.
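The expansion can be sketched as follows, again writing AND as "*" and OR as "+". The function name and the dictionary representation are hypothetical, and the sketch only handles a flat AND expression. In the sample dictionary, "element" is deliberately left without similar words, so it passes through unchanged.

```python
def expand_similar(expression, similar_words):
    """Expand each AND-ed term into an OR group of the term and its similar words.

    similar_words maps a term to a list of similar words (hypothetical data).
    Terms that are not registered in the dictionary are left as they are.
    """
    expanded = []
    for term in (t.strip() for t in expression.split("*")):
        alternatives = [term] + similar_words.get(term, [])
        if len(alternatives) > 1:
            expanded.append("(" + " + ".join(alternatives) + ")")
        else:
            expanded.append(term)
    return " * ".join(expanded)


similar = {"configuration": ["structure"], "search": ["detection", "inspection"]}
print(expand_similar("configuration * element * search", similar))
```

Keeping unregistered terms untouched means an operator can request expansion without first checking which words the dictionary covers.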
[0032]
In the first to third application examples, after the text data is input, the process may be automatically executed without the operator's instruction, and the search process may be started.
Alternatively, the description style of the text data may be classified into a plurality of types, and text data decomposition / term extraction processing corresponding to the type selected by the operator from the types may be selected and executed.
[0033]
[Second Embodiment]
The second embodiment is operated by a client / server computer system (not shown). The client computer inputs search conditions and transmits them to the server. The server computer performs the search process based on the search conditions received from the client and transmits the search result to the client.
[0034]
As in the first embodiment, the client inputs a search condition using an input screen shown in FIG. 4 using a general PC. Then, the input content is transmitted to the server via the communication line.
In the first embodiment, search processing is performed on the document data one document at a time. The second embodiment differs greatly from the first in that a document database of general configuration is constructed on the server and used effectively.
[0035]
FIG. 8 is a block diagram of a search system on a server according to the second embodiment.
The storage device on the server stores the original text database 10, which holds the document data; the bibliographic index 11, which associates document identification numbers with the bibliographic data of the documents in the original text database 10; and the full-text index 12, which associates document identification numbers with the index for full-text search.
[0036]
In the server, after receiving the search condition from the client, the following processing is performed.
First, a primary search is performed using the target conditions included in the search conditions. If the target condition is a bibliographic item such as IPC, the bibliographic search engine 13 searches the bibliographic index 11 to obtain a list of corresponding document identification numbers (identification data list 1 (15)). When a keyword for full-text search is included in the target condition, the full-text search engine 14 searches the full-text index 12 and obtains a list of corresponding document identification numbers (identification data list 2 (16)).
[0037]
Next, the arithmetic processing unit 17 calculates the union of the identification data list 1 (15) and the identification data list 2 (16) to obtain the hit list 18.
As described above, the hit list 18 is a list of document identification numbers that match the target condition. The search result can be obtained by reading the document data listed in the hit list 18 from the original text database 10 and performing a secondary search (search for each component) according to the flowchart of FIG. 2, as in the first embodiment.
[0038]
However, in this method, the number of document data to be processed becomes enormous and the processing load on the server may increase. Therefore, it is preferable to create the final hit list as follows.
After the hit list is obtained, the full-text index 12 is searched with the search condition n for each component, and identification number lists 21_1 to 21_n are output (instead of the identification data list 2 (16)).
The union of the document identification numbers included in the hit list and the identification number lists 21_1 to 21_n is taken as the final hit list.
In this way, the target documents for the secondary search process can be narrowed down to the minimum necessary.
Then, the document data listed in the final hit list is read from the original text database 10, and the search result is obtained by performing the secondary search (search for each component) according to the flowchart of FIG. 2, as in the first embodiment. The server sends the hit table and the extracted data list obtained as the search result (corresponding to hit table 4 and extracted data list 5 in FIG. 1) to the client, and the client can produce the display output shown in FIG. 6 based on them.
[0039]
At the stage of calculating the final hit list described above, the component satisfaction rate may be calculated, and only those that meet the specified conditions may be output to the final hit list.
Specifically,
1. Number of components × specified component satisfaction rate (hit condition) ÷ 100 = required number K (rounded down)
2. Concatenate the hit lists, sort them in order of identification number, inspect the result from the top, and output an identification number to the final hit list if the same identification number appears K or more times in succession.
In this way, processing such as “calculation of the component satisfaction rate” (S112) and “is the satisfaction rate equal to or greater than a predetermined value?” (S113) in the flow of FIG. 2 becomes unnecessary.
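The two steps above can be sketched as follows, assuming that the lists being concatenated are the per-component identification number lists and that the hit condition is given as an integer percentage. The function name is hypothetical. Each list is de-duplicated first so that one component cannot count twice for the same document, which is an added safeguard rather than something the text states.

```python
from itertools import groupby


def final_hit_list(id_lists, hit_condition):
    """Return the identification numbers found in at least K per-component lists.

    id_lists      -- one list of document identification numbers per component
    hit_condition -- specified component satisfaction rate, integer percent
    """
    # Step 1: required number K, rounded down (integer arithmetic).
    k = len(id_lists) * hit_condition // 100
    # Step 2: concatenate, sort, and look for runs of K or more equal numbers.
    merged = sorted(doc_id for ids in id_lists for doc_id in set(ids))
    return [doc_id for doc_id, run in groupby(merged) if len(list(run)) >= k]
```

With three components and a hit condition of 70, K is 2, so only documents that appear in at least two of the three lists remain.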
[0040]
At the same time, it may be determined whether or not the essential elements are satisfied, and only those that meet the specified condition may be output to the final hit list. Specifically,
1. Take the union of the identification number lists 2_n corresponding to the constituent elements designated as essential elements (checked as “important” in FIG. 4): (X)
2. The identification number lists 2_n corresponding to constituent elements other than those designated as essential elements: (Y)
3. Number of components × specified component satisfaction rate ÷ 100 − number of essential elements = required number k (rounded down)
4. Inspect (Y) from the beginning and check whether
a. the same number appears k or more times
b. the number is in (X)
If both hold, the number is output to the final hit list.
By doing so, the processing of “are all essential elements satisfied?” (S114) in the flow of FIG. 2 becomes unnecessary.
[0041]
[Third Embodiment]
The third embodiment is also operated in the form of a client / server.
The difference from the second embodiment is that the full-text search engine has a proximity calculation (neighbor search) function.
Also in the third embodiment, first, a primary search process similar to that in the second embodiment is performed.
Next, using the full-text search engine, a neighborhood search using a search condition formula for each component is performed. As a result of this search processing, data as shown in FIG. 10 is obtained.
[0042]
FIG. 10A is obtained as a result of searching for one component, and a range: position 1 to position m in the document data that matches the search condition formula is obtained.
When the neighborhood search for all the components is completed, the search results in FIG. 10A are obtained by the number of components, and these are integrated as shown in FIG. Specifically, the union of the document identification numbers is taken, and the position information in FIG. 10A is copied to the corresponding area of the component element position information in FIG.
Then, the intersection (AND operation) of the set of document identification numbers included in FIG. 10B and the set of document identification numbers included in the hit list obtained as a result of the primary search is computed to obtain the final hit list (FIG. 10B).
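The integration of the per-component neighborhood search results and the AND operation with the primary hit list can be sketched as follows. The data layout is an assumption made for illustration: each per-component result is taken to be a mapping from document identification number to a list of (start, end) ranges, and the function names are hypothetical.

```python
def integrate(per_component_results):
    """Merge per-component results into one table keyed by document number.

    per_component_results[n] maps a document identification number to the
    list of (start, end) ranges matching component n (hypothetical layout).
    Each table row holds one list of ranges per component.
    """
    count = len(per_component_results)
    table = {}
    for n, result in enumerate(per_component_results):
        for doc_id, positions in result.items():
            row = table.setdefault(doc_id, [[] for _ in range(count)])
            row[n] = list(positions)  # copy position info into the component slot
    return table


def final_hits(table, primary_hit_list):
    """Keep only documents that also appear in the primary search hit list."""
    keep = set(table) & set(primary_hit_list)
    return {doc_id: table[doc_id] for doc_id in sorted(keep)}
```

Because the table already carries the matching ranges for every component, the later flow can extract text directly from those ranges instead of testing each line against the search condition.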
[0043]
A process according to the flow of FIG. 9 is performed on each document included in the final hit list.
The flow procedure of FIG. 9 is a modification of the flow procedure of FIG. A significant difference from the procedure of FIG. 2 is that it is not necessary to determine whether or not the search condition is matched for each row. This is because the position of the unit data corresponding to the constituent element has already been extracted from the final hit list (FIG. 10B).
In FIG. 9, the steps given the same step numbers as in FIG. 2 are the same as those described above with reference to FIG. 2; see the description of the first embodiment.
[0044]
In S123, the variable m is initialized, and in S124, the variable m is incremented. The variable m is used as an index value for determining the positions 1 to m in the data of FIG.
In S125, information on the position m is obtained from the plurality of pieces of position information corresponding to the component n.
In S126, text data in a range indicated by the position information acquired in S125 is extracted from the literature data. The subsequent procedure is the same as in FIG.
[0045]
Thus, according to the present invention, an efficient patent information search system that treats literature data as a set of components such as “sentences” and “paragraphs” can be constructed. The invention is not limited to patent information database search technology; by efficiently customizing and generating unit data structures such as technical terms and components, it can also be applied to search systems for other technical literature (for example, databases of ISO, various research institutes, universities, and the like).
[0046]
[Effect of the invention]
As described above, the present invention is a literature retrieval system for retrieving digitized technical literature data, in which data on the technical constituent elements is input together with a search condition for each constituent element, the technical literature data is searched, unit data that matches the search condition is extracted, and the component data, the unit data, and the identification data of the technical literature corresponding to the unit data are output in association with each other. Therefore, whether a patent information database is searched for a given purpose or the system is applied to the search of other technical information, the user's work is simplified, and the desired document identification number can be found quickly and accurately.
[Brief description of the drawings]
FIG. 1 is a block diagram of a technical literature search system according to a first embodiment of the present invention.
FIG. 2 is a flowchart of processing of the system shown in FIG.
FIG. 3 is a diagram showing a structure of an extracted data list.
FIG. 4 is a diagram illustrating an example of a search input screen.
FIG. 5 is a diagram illustrating an example of a hit table.
FIG. 6 is a diagram illustrating an example of a search result output screen.
FIG. 7 is a diagram illustrating an output example in which search conditions are associated with search results.
FIG. 8 is a block diagram of a technical literature search system according to a second embodiment of the present invention.
FIG. 9 is a flowchart of processing of a technical literature search system according to a third embodiment of the present invention.
FIG. 10 is a diagram showing an example of the output of the neighborhood search used in the third embodiment, where FIG. 10A shows a search result for one component and FIG. 10B shows a list in which the results for the components are integrated.
[Explanation of symbols]
1 Literature data
2 Search condition input part
3 Search processing section
4 Hit table
5 Extracted data list
6 Search result output section
10 Original text database
11 Bibliographic index
12 Full-text index
13 Bibliographic search engine
14 Full-text search engine
15 Identification data list 1
16 Identification data list 2
17 Arithmetic processing part
18 Hit list

Claims

A computer system for searching technical literature data that is composed of a plurality of unit data and is electronically stored in a storage means, comprising:
Component input means for accepting, for each of a plurality of components constituting the technique to be searched, an input of data representing that component;
Search condition input means for accepting an input of a search condition for each component;
Search means for searching the technical literature data for each component on the basis of the search condition;
Unit data extraction means for extracting unit data that matches the search condition from the searched technical literature data; and
Recording means for recording, for each searched technical document, component array data indicating whether the technical document satisfies the search condition for each component, the unit data that matches the search condition, and identification data of the technical document in association with each other.

The technical literature search system according to claim 1,
Input means for receiving input of technical text data;
A technical literature search system comprising: a technical composition decomposing unit that analyzes and divides the technical text data into a plurality of text data and delivers the divided text data to the component input unit.

The technical literature search system according to claim 1,
Dictionary storage means for storing dictionary data of terms;
A technical literature search system comprising technical term extraction means for extracting technical terms from data representing the constituent elements with reference to the dictionary data and delivering the extracted technical terms to the search condition input means.

The technical literature search system according to claim 1,
Similar word storage means for storing a plurality of sets of similar words;
A technical literature search system comprising similar word addition means for acquiring, from the similar word storage means, a similar word corresponding to each keyword included in the search condition accepted by the search condition input means, and expanding the search condition by adding the acquired similar word.

The technical literature search system according to claim 1,
Matching condition input means for receiving input of a matching condition that defines the required search suitability, and search suitability calculating means for calculating the search suitability of each retrieved technical document,
The technical document search system, wherein the unit data extraction means performs the process of extracting unit data on technical document data whose search suitability satisfies the matching condition.

The technical literature search system according to claim 1,
The search means has target condition input means for receiving input of attributes of technical literature data to be searched,
The technical document search system, wherein the search means searches only technical document data that matches the attribute.

The technical document search system according to claim 1, wherein the search means performs a search based on the search condition for each predetermined range in the technical document data.

2. The technical document search system according to claim 1, wherein the search means does not search data belonging to a predetermined chapter when the technical document data is chaptered.