JP2024128597A - Similar document search device, similar document search method and program

Info

Publication number: JP2024128597A
Application number: JP2023037638A
Authority: JP
Inventors: Kei Furukawa; Takumi Oyama
Original assignee: Shimizu Construction Co Ltd; Shimizu Corp
Current assignee: Shimizu Construction Co Ltd; Shimizu Corp
Priority date: 2023-03-10
Filing date: 2023-03-10
Publication date: 2024-09-24

Abstract

To provide a similar document search device, a similar document search method, and a program that can improve search accuracy. SOLUTION: A device 10 searches, from among a plurality of documents registered in advance, for similar documents, that is, documents similar to related documents, which are documents related to an input search term. The device includes: a related document extraction unit 14 that, based on the input search term, extracts the related documents linked in advance to the search term in a predetermined rule base; and a similar document search unit 16 that calculates the similarity between the extracted related documents and each document registered in advance, and searches for the similar documents based on the calculated similarity. SELECTED DRAWING: Figure 1

Description

The present invention relates to a similar document search device, a similar document search method, and a program that search for documents based on document similarity.

A known form of computer-based natural language processing is search processing that retrieves, from among documents stored in a database, documents similar to an input document (see, for example, Patent Document 1). Another known technique extracts keywords that characterize each document to be searched using a specific algorithm, such as one based on frequency of occurrence, calculates a measure such as cosine similarity, which expresses how close the angle is between the vectors of a user-entered word and a keyword in the language space, and outputs highly relevant documents.
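
As a reference point for the cosine similarity mentioned above, the following is a minimal sketch that assumes the word and the keyword have already been converted into numeric vectors of equal length. The function name `cosine_similarity` and the sample vectors are illustrative assumptions, not part of the cited prior art.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors.

    Returns 0.0 when either vector has zero length, since the angle
    is undefined in that case (an assumption made for this sketch).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vectors: the closer the directions, the closer the value is to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

A value near 1 means the two vectors point in almost the same direction, and a value near 0 means they share almost nothing in this representation.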

Patent Document 1: Japanese Patent No. 6190904

However, when a user wants to retrieve a certain document group A, search accuracy may drop if the similarity between the input conditions (equivalent to the words above) and document group A is very low. In other words, if document group A contains few keywords similar to those words, the vector-based similarity calculation of the conventional search method may not be effective.

For example, suppose the search terms (words) are set to "Structure: S (steel-frame) construction" and "Within 300 m of the coast" in order to retrieve and output document group A on salt damage countermeasures from among documents in the construction field. In natural language processing, "Within 300 m of the coast" is highly likely to be treated almost the same as, say, "Within 3000 m of the coast," a condition under which salt damage countermeasures are rarely required, so document groups unrelated to salt damage countermeasures are also likely to be output. Thus, if the meaning carried by numerical values in the search terms is not taken into account, search accuracy may decrease.

The present invention has been made in view of the above, and aims to provide a similar document search device, a similar document search method, and a program that can improve search accuracy.

To solve the above problem and achieve this object, the similar document search device according to the present invention is a device that searches, from among a plurality of documents registered in advance, for similar documents, that is, documents similar to related documents, which are documents related to an input search term. The device is characterized by comprising: a related document extraction unit that, based on the input search term, extracts the related documents linked in advance to the search term in a predetermined rule base; and a similar document search unit that calculates the similarity between the extracted related documents and each document registered in advance, and searches for the similar documents based on the calculated similarity.

The similar document search method according to the present invention is a method that searches, from among a plurality of documents registered in advance, for similar documents, that is, documents similar to related documents, which are documents related to an input search term. The method is characterized by comprising: a step of extracting, based on the input search term, the related documents linked in advance to the search term in a predetermined rule base; and a step of calculating the similarity between the extracted related documents and each document registered in advance, and searching for the similar documents based on the calculated similarity.

The program according to the present invention is characterized by causing a computer to execute the similar document search method described above.

The similar document search device according to the present invention searches, from among a plurality of documents registered in advance, for similar documents, that is, documents similar to related documents, which are documents related to an input search term. Because the device comprises a related document extraction unit that, based on the input search term, extracts the related documents linked in advance to the search term in a predetermined rule base, and a similar document search unit that calculates the similarity between the extracted related documents and each document registered in advance and searches for the similar documents based on the calculated similarity, it has the effect of improving search accuracy.

Likewise, the similar document search method according to the present invention searches, from among a plurality of documents registered in advance, for similar documents, that is, documents similar to related documents, which are documents related to an input search term. Because the method comprises a step of extracting, based on the input search term, the related documents linked in advance to the search term in a predetermined rule base, and a step of calculating the similarity between the extracted related documents and each document registered in advance and searching for the similar documents based on the calculated similarity, it has the effect of improving search accuracy.

FIG. 1 is a schematic configuration diagram showing an embodiment of the similar document search device according to the present invention.
FIG. 2 is a schematic flow diagram showing an embodiment of the similar document search method according to the present invention.
FIG. 3 is a diagram showing a search example according to this embodiment.

Embodiments of the similar document search device, similar document search method, and program according to the present invention are described in detail below with reference to the drawings. The present invention is not limited to these embodiments.

As shown in FIG. 1, a similar document search device 10 according to an embodiment of the present invention includes an input unit 12, a related document extraction unit 14, a similar document search unit 16, an output unit 18, and a storage unit 20.

The input unit 12 accepts input of the search terms that serve as keys in a similar document search, and consists of, for example, a keyboard serving as an input interface and an input field provided on a display screen. Search terms are assumed to be lexical items such as phrases with a definite meaning or numerical values accompanied by a unit such as length. In the construction business field, for example, one or more of the following can be used as search terms: project information (construction site, total floor area, area, structure, etc.), performance information (coastal area, soft ground, special structure, etc.), and finishing information (wall: stone, roof: waterproofing, etc.). Multiple options may be prepared in advance as keywords so that a search term can be entered by selecting one of them in the input unit 12. Multiple search terms may also be entered to allow AND searches and OR searches. For example, the two search terms "Structure: S (steel-frame) construction" and "Within 300 m of the coast" may be entered.

Based on the search term entered in the input unit 12, the related document extraction unit 14 extracts from the storage unit 20 the related documents linked in advance to that term in a predetermined rule base. The related documents and the rule base are stored in the storage unit 20. In the rule base, each search term is registered in advance together with information on the related documents associated with it (for example, the document titles). This associates a search term, which is a word or phrase, with a related document, which is a full document. The rule base can be built by analyzing search terms and documents in a large body of past data and codifying the relationships between them as rules. These relationships may be set based on, for example, weighting values calculated from how often search terms were entered in the past and how often similar documents were viewed, or they may be set manually in advance. If the search term is a fragment of information about a construction project, documents describing frequently occurring defect cases may be used as the related documents. In the example above, when the search terms are "Structure: S (steel-frame) construction" and "Within 300 m of the coast," the rule base may link them so that "documents on salt damage countermeasures" become the related documents.
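
The rule-base lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify how the rule base is stored, so the `Rule` class, the `RULE_BASE` entries (including the "Ground subsidence defect cases" title), and the AND-style matching in `extract_related_titles` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hypothetical rule-base entry: search terms linked to related-document titles."""
    terms: frozenset
    related_titles: tuple


# Illustrative entries only; the storage format is an assumption of this sketch.
RULE_BASE = [
    Rule(
        frozenset({"Structure: S (steel-frame) construction", "Within 300 m of the coast"}),
        ("Salt damage countermeasures",),
    ),
    Rule(frozenset({"Soft ground"}), ("Ground subsidence defect cases",)),
]


def extract_related_titles(search_terms, rule_base=RULE_BASE):
    """Return titles of related documents for every rule whose terms are all entered."""
    entered = set(search_terms)
    titles = []
    for rule in rule_base:
        if rule.terms <= entered:  # AND match: every term of the rule was entered
            for title in rule.related_titles:
                if title not in titles:
                    titles.append(title)
    return titles


titles = extract_related_titles(
    ["Structure: S (steel-frame) construction", "Within 300 m of the coast"]
)
```

Because the lookup matches whole search terms rather than comparing their wording, "Within 3000 m of the coast" does not trigger the salt damage rule in this sketch.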

The similar document search unit 16 treats each related document extracted by the related document extraction unit 14 as an input document, calculates by natural language processing the similarity between this input document and each search-target document stored in the storage unit 20, and, based on the calculated similarity, searches for similar documents that are highly similar to the related document. The similarity can be calculated, for example, by splitting the related document and each search-target document into morphemes and counting the number of words that appear in both, or by a known similar document search technique such as the vector space method described in Patent Document 1 above.
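
The common-word counting approach mentioned above might look like the sketch below. It assumes English text split on word characters as a stand-in for morphological analysis of Japanese; the function names `tokenize`, `common_word_count`, and `rank_by_similarity` are illustrative and do not come from the patent.

```python
import re


def tokenize(text):
    """Lowercase word split; a simple stand-in for morphological analysis."""
    return re.findall(r"\w+", text.lower())


def common_word_count(doc_a, doc_b):
    """Number of distinct words that appear in both documents."""
    return len(set(tokenize(doc_a)) & set(tokenize(doc_b)))


def rank_by_similarity(related_doc, documents):
    """Return (title, score) pairs sorted by descending common-word count.

    `documents` maps a title to its body text (a hypothetical layout).
    """
    scored = [
        (title, common_word_count(related_doc, body))
        for title, body in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A real system would likely replace `tokenize` with a morphological analyzer and could swap the raw count for a vector-space measure without changing the ranking step.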

The output unit 18 outputs information about the similar documents found by the similar document search unit 16 as the similar document search result, and consists of, for example, a display or printer that presents the text of the similar documents. For example, the titles of the similar documents can be output as the search result.

The storage unit 20 stores the multiple documents to be searched by the similar document search unit 16, multiple related documents, and the rule base, and consists of, for example, a database or memory. The documents and related documents stored in the storage unit 20 are electronic documents on electronic media that contain text-format data, such as electronic books, electronic files, and web pages. Each electronic document has at least a body and a title. The electronic documents may be, for example, laws and regulations, company standards, construction manuals, collections of tips, and collections of don'ts used in the construction business field.

An example of the hardware of the similar document search device 10 is a computer equipped with a CPU, RAM, ROM, a hard disk, a communication interface, and so on. A similar document search can be performed by storing a program that implements the functions described above in the RAM or ROM and having the CPU execute it. Such a program is also within the scope of the present invention.

Next, the similar document search method according to an embodiment of the present invention is described. This method is carried out, for example, by the units of the similar document search device 10 performing steps S1 to S4 shown in FIG. 2.

First, in step S1, a search term is entered in the input unit 12. Next, in step S2, the related document extraction unit 14 extracts, based on the search term and the rule base, the related documents linked in advance to the search term. Next, in step S3, the similar document search unit 16 calculates the similarity between the related documents and each document stored in the storage unit 20. Next, in step S4, similar documents are searched for based on the calculated similarity, and the search result is output from the output unit 18.
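
Steps S1 to S4 can be strung together roughly as in the following sketch. It makes several assumptions that the patent leaves open: the rule base is a dictionary keyed by the set of entered terms, similarity is a simple common-word count, and the function `search_similar` and its parameters (`rule_base`, `related_docs`, `corpus`, `top_k`) are hypothetical names chosen for illustration.

```python
import re


def tokenize(text):
    """Set of lowercase words; a stand-in for morphological analysis."""
    return set(re.findall(r"\w+", text.lower()))


def search_similar(search_terms, rule_base, related_docs, corpus, top_k=3):
    """Run steps S1 to S4 and return the titles of the most similar documents."""
    # S1: accept the entered search terms.
    entered = frozenset(search_terms)

    # S2: extract the related document linked to the terms in the rule base.
    related_title = rule_base.get(entered)
    if related_title is None:
        return []
    related_tokens = tokenize(related_docs[related_title])

    # S3: calculate the similarity between the related document and each document.
    scores = {
        title: len(related_tokens & tokenize(body))
        for title, body in corpus.items()
    }

    # S4: search by similarity and output the titles, most similar first.
    ranked = sorted(scores, key=lambda title: scores[title], reverse=True)
    return [title for title in ranked if scores[title] > 0][:top_k]
```

Calling `search_similar` with terms that have no rule-base entry returns an empty list in this sketch; how the actual device handles that case is not stated in the source.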

FIG. 3 is a conceptual diagram showing a search example according to this embodiment. As shown in the figure, when one or more pieces of information are entered as search terms, such as project information (construction site, total floor area, area, structure, etc.), performance information (coastal area, soft ground, special structure, etc.), or finishing information (wall: stone, roof: waterproofing, etc.), the related documents linked to the search terms (documents describing defect cases) are extracted from the storage unit 20 based on the configured rule base. Natural language processing (similar document search) is then performed between the extracted related documents and each document stored in the storage unit 20, and the titles of similar documents that are highly similar to the related documents are output. The example in the figure shows a case in which a collection of tips, a construction manual, and a collection of don'ts are output as similar documents.

In this way, this embodiment converts the entered search term, in effect, into an intermediary related document and then uses natural language processing to perform a similar document search between the related document and each document, thereby outputting highly relevant similar documents. This allows a highly accurate search even when the documents contain few keywords related to the search term. Search accuracy can therefore be improved compared with the conventional method described above.

In particular, this embodiment makes the meaning carried by numerical values in a search term, which natural language processing handles poorly, easier to process by replacing the term with a related document. In the example above, the search term "Within 300 m of the coast" is highly likely to be treated almost the same in natural language processing even if it were "Within 3000 m of the coast." However, replacing "Within 300 m of the coast" with a related document on salt damage countermeasures makes it possible to attach the meaning that "within 300 m" is related to "salt damage." As a result, the similar documents ultimately retrieved can be associated with the search term through the word "salt damage."
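
The point about numerical values can be illustrated with a small sketch. It uses word-level Jaccard overlap as a simple stand-in for a surface similarity measure, which is an assumption of this example and not the measure used in the patent; the `RULES` dictionary is likewise hypothetical.

```python
import re


def jaccard(text_a, text_b):
    """Word-level Jaccard overlap: shared words divided by all distinct words."""
    words_a = set(re.findall(r"\w+", text_a.lower()))
    words_b = set(re.findall(r"\w+", text_b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


near = "Within 300 m of the coast"
far = "Within 3000 m of the coast"

# Surface comparison: the two terms share 5 of 7 distinct words,
# so they look largely alike even though their meanings differ.
surface = jaccard(near, far)

# Hypothetical rule base: only the 300 m condition is linked to salt damage.
RULES = {near: "Salt damage countermeasures"}
related_near = RULES.get(near)  # linked to the salt damage document
related_far = RULES.get(far)    # no link, so no salt damage document is used
```

The rule-base link, rather than the wording of the term, is what carries the meaning of the number in this sketch.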

As described above, the similar document search device according to the present invention searches, from among a plurality of documents registered in advance, for similar documents, that is, documents similar to related documents, which are documents related to an input search term. Because the device comprises a related document extraction unit that, based on the input search term, extracts the related documents linked in advance to the search term in a predetermined rule base, and a similar document search unit that calculates the similarity between the extracted related documents and each document registered in advance and searches for the similar documents based on the calculated similarity, search accuracy can be improved.

Likewise, the similar document search method according to the present invention searches, from among a plurality of documents registered in advance, for similar documents, that is, documents similar to related documents, which are documents related to an input search term. Because the method comprises a step of extracting, based on the input search term, the related documents linked in advance to the search term in a predetermined rule base, and a step of calculating the similarity between the extracted related documents and each document registered in advance and searching for the similar documents based on the calculated similarity, search accuracy can be improved.

As described above, the similar document search device, similar document search method, and program according to the present invention are useful for searching for documents based on document similarity, and are particularly suited to improving search accuracy.

10 Similar document search device
12 Input unit
14 Related document extraction unit
16 Similar document search unit
18 Output unit
20 Storage unit

Claims

A similar document search device that searches, from among a plurality of documents registered in advance, for similar documents, which are documents similar to related documents, which are documents related to an input search term, the device comprising:
a related document extraction unit that, based on the input search term, extracts the related documents linked in advance to the search term in a predetermined rule base; and
a similar document search unit that calculates the similarity between the extracted related documents and each document registered in advance, and searches for the similar documents based on the calculated similarity.

A similar document search method for searching, from among a plurality of documents registered in advance, for similar documents, which are documents similar to related documents, which are documents related to an input search term, the method comprising the steps of:
extracting, based on the input search term, the related documents linked in advance to the search term in a predetermined rule base; and
calculating the similarity between the extracted related documents and each document registered in advance, and searching for the similar documents based on the calculated similarity.

A program that causes a computer to execute the similar document search method described in claim 2.