JP5249848B2 - Information retrieval method and apparatus, program, and computer-readable recording medium

Info

Publication number: JP5249848B2
Application number: JP2009111147A
Authority: JP
Inventors: 幸生植松; 俊介小長井; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2009-04-30
Filing date: 2009-04-30
Publication date: 2013-07-31
Anticipated expiration: 2029-04-30
Also published as: JP2010262379A

Description

The present invention relates to an information retrieval method, apparatus, program, and computer-readable recording medium, and in particular to those for creating a search index that uses the position information of words occurring in the documents of a stored document group to perform phrase search and compound-word search at high speed and with low memory use.

FIG. 10 shows the configuration of a general information retrieval apparatus.

The information retrieval apparatus shown in the figure consists of an index creation unit 1, which creates an index and stores it in a search index storage unit 3, and a search result set specifying unit 2, which refers to the index in the search index storage unit 3 to identify a search result set and returns the search results. The index creation unit 10 includes a morphological analysis unit 11, an inverted list creation unit 12, and an inverted index addition unit 13.

Consider the index creation unit 1, which is the subject of the present invention. To search efficiently for phrases and compound words, an inverted index generally holds the position information of each word. This list of position information is called an "inverted list". The simplest way to hold an inverted list is to store word positions as shown in FIG. 11 (see, for example, Non-Patent Document 1). In that figure, the word word_1 appears four times in the document with ID 1, at positions [1, 3, 54, 58]. An index of this type stores, for each occurrence of a word, its ordinal position counted from the beginning of the document. This position information is used when the search term is a compound word or a phrase. For example, suppose the query is the phrase "東京都" (Tokyo Metropolis), word_1 is "東京" (Tokyo), and word_2 is "都" (Metropolis). In the document with ID 1, the first word is word_1 and the second word is word_2, so "東京都" appears as a phrase. In the document with ID 144, on the other hand, word_1 and word_2 do not appear next to each other, so the words "東京" and "都" co-occur in the document but do not form the phrase. The process of checking whether the words appear as a phrase is called "concatenation processing".
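The concatenation processing described above can be sketched as follows. This is a minimal illustration, not the apparatus of FIG. 10: the dictionary layout and the function name `phrase_docs` are assumptions, and every position other than those of word_1 in document 1 is made up for the example.

```python
from typing import Dict, List

# Hypothetical positional inverted index: word -> {document ID -> word positions}.
# Only the positions of "東京" in document 1 come from the text; the rest are invented.
index: Dict[str, Dict[int, List[int]]] = {
    "東京": {1: [1, 3, 54, 58], 144: [7, 20]},
    "都": {1: [2, 4, 55, 59], 144: [12]},
}


def phrase_docs(index, first, second):
    """Return IDs of documents in which `second` occurs directly after `first`."""
    hits = []
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = set(index.get(second, {}).get(doc_id, []))
        # The phrase occurs if some position p of `first` has `second` at p + 1.
        if any(p + 1 in second_positions for p in first_positions):
            hits.append(doc_id)
    return hits


print(phrase_docs(index, "東京", "都"))  # [1]
```

Document 144 contains both words but never at consecutive positions, so it is excluded, which mirrors the co-occurrence case in the text.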

Specifically, the method of creating an inverted list is described using the following input document as an example.

Example sentence:
ID = 1: 『東京で東京都知事選挙が行われた。…。東京出身で東京都知事に立候補したのは３人で、…』 ("The Tokyo gubernatorial election was held in Tokyo. …. Three people from Tokyo ran for Governor of Tokyo, and …")
For the example sentence above, the inverted list for "東京" (Tokyo) is created. FIG. 12 shows the detailed flow of inverted index creation.

First, the morphological analysis unit 11 of the index creation unit 1 divides the character string into words (step 10).

The division uses morphological analysis with existing tools such as "mecab (http://mecab.sourceforege.net/)" or "chasen (http://chasen.naist.jp/hiki/ChaSen/)". The following is the result of morphologically analyzing the example sentence above and dividing it into words.

Example sentence:
ID = 1: 東京／で／東京／都／知事／選挙／が／行わ／れ／た／。／…／。／東京／出身／で／東京／都／知事／に／立候補／した／の／は／３人／で／…
Each span delimited by slashes is one word.

Next, the inverted list creation unit 12 sets i = 0 and creates an empty inverted list (step 11).

An inverted list is one element of the inverted index.

It is determined whether the i-th word, word(i), is the end of the line (the end of the document) (step 12). If it is not, i is added to the inverted list for word(i) (step 13). The value of i at this point represents the position information. For example, with the document above as input, word(1) is "東京" (Tokyo) and is not the end of the line, so the position i is added to the inverted list of word(i).

Next, i is advanced by one (step 14), and the process repeats from step 12.

When the end of the line is reached in step 12, the inverted lists created so far are added to the overall inverted index in the search index storage unit 300 (step 15). Specifically, if the example sentence above is document number 1 and word_1 is "東京" (Tokyo), then "東京" appears as the 1st, 3rd, 65th, and 69th morphemes, so the first inverted list of word_1 is as shown in FIG. 11. Repeating this for all documents produces the inverted index used for search.
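Steps 10 to 15 can be sketched as a single pass over the segmented words. This is only an illustration under stated assumptions: the function name `build_inverted_lists` is hypothetical, positions are counted from 1 to match the "1st, 3rd" wording, and the input is the first sentence of the example only, because the remainder of the document is elided in the text.

```python
from collections import defaultdict


def build_inverted_lists(tokens):
    """Map each word to the 1-based positions at which it occurs in one document."""
    inverted = defaultdict(list)
    for i, word in enumerate(tokens, start=1):
        inverted[word].append(i)  # step 13: record position i for word(i)
    return dict(inverted)


# First sentence of the example, already segmented (the output of step 10).
tokens = "東京／で／東京／都／知事／選挙／が／行わ／れ／た／。".split("／")
lists = build_inverted_lists(tokens)
print(lists["東京"])  # [1, 3]
print(lists["都"])    # [4]
```

Every occurrence adds one entry, so a frequent word produces a long list. That growth is the index-size problem the text goes on to describe.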

As shown above, the prior-art inverted index stores one position entry for every occurrence of a word in a document, which inflates the index size.

Witten et al., Managing Gigabytes, Morgan Kaufmann Publishers, 1999.

However, with the prior-art index format described above, the position information stored for concatenation processing and the memory it consumes are large. In addition, concatenation processing must examine all the words appearing in the document, which makes it slow.

The present invention has been made in view of the above points. Its object is to provide an information retrieval method, apparatus, program, and computer-readable recording medium that can reduce the index size, at the cost of possible false detections in concatenation processing.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention is an information retrieval method that creates an index from a document group stored in a document database, stores it in search index storage means that stores an inverted index, and identifies, from the created index, a search result set to be returned as a search result, the method performing:
a position information aggregation step of reading a document from the document database, dividing the character string of the document into words (step 1), aggregating the position information, which indicates at what ordinal position from the beginning each word occurs, into numerical values no greater than a preset fixed length, and storing the result in first storage means (step 2);
a mapping step (step 3) of acquiring from the first storage means the sequence of position information obtained in the position information aggregation step and mapping it to one inverted list in second storage means; and
an inverted index addition step (step 4) of acquiring from the second storage means the inverted list mapped in the mapping step and adding it to inverted index storage means.

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 1) is an information retrieval apparatus having index creation means 100, which creates an index from a document group stored in a document database 30 and stores it in search index storage means 300 that stores an inverted index, and search result set specifying means 200, which identifies, from the created index, a search result set to be returned as a search result, wherein
the index creation means 100 has:
word dividing means 110 that reads a document from the document database 30 and divides the character string of the document into words;
position information aggregating means 130 that aggregates the position information, which indicates at what ordinal position from the beginning each divided word occurs, into numerical values no greater than a preset fixed length, and stores the result in first storage means;
position information mapping means 140 that acquires from the first storage means the sequence of position information obtained by the position information aggregating means 130 and maps it to one inverted list in second storage means; and
inverted index addition means 150 that acquires from the second storage means the inverted list mapped by the position information mapping means 140 and adds it to the inverted index storage means 300.

Further, the present invention (claim 2) has position information aggregating means 130 that includes means for aggregating the position information using an arbitrarily specified delimiter, instead of the fixed length, when aggregating the position information.

The present invention (claim 3) is an information retrieval program for causing a computer to function as each of the means constituting the information retrieval apparatus according to claim 1 or 2.

The present invention (claim 4) is a computer-readable recording medium storing the information retrieval program according to claim 3.

As described above, the present invention divides the position information stored for each word into segments and, using a fixed length, aggregates each position into its count from the beginning of its segment. Although false detections in concatenation processing can occur, the index size can be greatly reduced.

In addition, using a delimiter makes it possible to prevent words that are not adjacent from being treated as if they were adjacent.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the principle configuration of the present invention.
FIG. 3 is a configuration diagram of the information processing retrieval apparatus in the first embodiment of the present invention.
FIG. 4 is an example of creating an inverted index in the first embodiment of the present invention.
FIG. 5 is an example of the structure of the inverted index in the first embodiment of the present invention.
FIG. 6 is a flowchart of inverted index creation in the first embodiment of the present invention.
FIG. 7 is a configuration diagram of the information processing apparatus in the second embodiment of the present invention.
FIG. 8 is an example of the structure of an inverted index using a delimiter in the second embodiment of the present invention.
FIG. 9 is a flowchart of inverted index creation in the second embodiment of the present invention.
FIG. 10 is a configuration diagram of a general information retrieval apparatus.
FIG. 11 is an example of the structure of an inverted index according to the prior art.
FIG. 12 is a flowchart of inverted index creation according to the prior art.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 shows the configuration of an information processing apparatus according to an embodiment of the present invention.

Like the conventional information retrieval apparatus shown in FIG. 10, the information processing apparatus of the present invention consists of an index creation unit 100, a search result set specifying unit 200, and a search index storage unit 300, but the configuration of the index creation unit 100 differs.

The index creation unit 100 of the present invention has: a morphological analysis unit 110 that reads a document group from an external document DB 30 and divides each document into words; an inverted list creation unit 120 that creates, for each word, an inverted list consisting of the inverted index and the word's position information and stores it in a memory (not shown); a position information aggregating unit 130 that aggregates the position information of each word stored in the memory and stores the result in the memory; a position information mapping unit 140 that acquires the position information from the memory and maps it to an array; and an inverted index addition unit 150 that appends the mapped array to the inverted index storage unit 300. In other words, it is the configuration of FIG. 10 with the addition of the position information aggregating unit 130, which aggregates the position information contained in a document, and the position information mapping unit 140, which maps that position information to an array.

The prior-art inverted index described above stores one position entry for every occurrence in the document, so the index grows large. In the present invention, the absolute position values are compressed to a fixed length L. Specifically, a fixed length L is set in advance, and the index entries are aggregated so that they fit within that fixed length.

FIG. 4 shows an example in which word_1 of the inverted index of FIG. 11 is aggregated with a fixed length L = 64. In the prior art, the position information "1, 3, 65, 69" is stored in the search index storage unit 3 for document ID = 1. In the present invention, to keep these values at 64 or below, the position information aggregating unit 130 converts each position to its remainder modulo 64, so that every value is no greater than the fixed length L. For example, the remainder of 65 divided by 64 is 1. This yields the sequence "1, 3, 1, 4". The position information mapping unit 140 then maps this sequence to a single inverted list, finally giving the position information "1, 3, 4" as the inverted index of the present invention.
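The two stages, aggregation by remainder followed by mapping of duplicates, can be sketched as below. The function names `aggregate` and `map_to_list` are hypothetical, and the sketch assumes a plain remainder operation, under which a position that is an exact multiple of L becomes 0. Note that a plain remainder sends 69 to 5, so the sketch prints [1, 3, 5] rather than the "1, 3, 4" quoted from FIG. 4.

```python
def aggregate(positions, fixed_length):
    """Reduce each absolute position to its remainder modulo the fixed length."""
    return [p % fixed_length for p in positions]


def map_to_list(positions):
    """Merge identical positions into one entry, keeping first-occurrence order."""
    return list(dict.fromkeys(positions))


aggregated = aggregate([1, 3, 65, 69], 64)
print(aggregated)               # [1, 3, 1, 5]
print(map_to_list(aggregated))  # [1, 3, 5]
```

Four stored positions shrink to three here. The saving grows with the number of occurrences, since at most L distinct values can remain per word and document.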

FIG. 5 shows an example of an inverted index in which every value has been reduced to 64 or below using the present invention. All position values in the figure are 64 or less. When the same concatenation processing as in the prior art is performed on this index, it may report words as adjacent when they are not actually adjacent, but the index size and the amount of computation can be greatly reduced.
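A small sketch of how such a false detection can arise, assuming remainder-based aggregation with L = 64. The absolute positions below are invented for illustration, and `adjacent` is a hypothetical helper, not a function named in the text.

```python
def adjacent(first_positions, second_positions):
    """Concatenation check: is some position of the second word exactly one past the first?"""
    second = set(second_positions)
    return any(p + 1 in second for p in first_positions)


L = 64
first_abs, second_abs = [1], [66]  # hypothetical absolute positions, far apart

print(adjacent(first_abs, second_abs))  # False: the words are not adjacent

# After aggregation, position 66 becomes 2 and looks adjacent to position 1.
first_agg = sorted({p % L for p in first_abs})
second_agg = sorted({p % L for p in second_abs})
print(adjacent(first_agg, second_agg))  # True: a false detection
```

The aggregated index therefore behaves like a filter that may over-report matches, which is the trade-off the text accepts in return for a smaller index.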

FIG. 6 is a flowchart of index creation in an embodiment of the present invention.

Step 201) The morphological analysis unit 110 reads a document from the document DB 30 and divides the character string of the document into words. The morphological analysis method is the same as in the prior art.

Step 202) The inverted list creation unit 120 sets position = 0, i = 0, 1, and creates an empty index in a memory (not shown). The fixed length L can be given any value; the smaller the value, the smaller the index size, but the more false detections occur in concatenation processing.

Step 203) It is checked whether word(i) is the end of the line. If it is not, the process proceeds to step 204; if it is, the process proceeds to step 209.

Step 204) The position information aggregating unit 130 checks whether i is greater than the fixed length L. If it is greater, the process proceeds to step 206; if it is smaller, the process proceeds to step 205.

Step 205) When i is smaller than the fixed length L, the position information aggregating unit 130 advances position by one and proceeds to step 207.

Step 206) The position information aggregating unit 130 resets position to 0 and proceeds to step 207.

Step 207) The inverted list creation unit 120 adds position to the inverted list of word(i) in the memory (not shown).

Step 208) i is advanced by one, and the process returns to step 203.

Step 209) When the end of the line is reached in step 203, the position information mapping unit 140 reads the inverted list from the memory (not shown) and maps the position information. Mapping means merging identical position information into a single entry. In the example of FIG. 4, aggregating the position information [1, 3, 65, 69] of document ID = 1 (FIG. 4(a)) using the fixed length L gives [1, 3, 1, 4] (FIG. 4(b)), and mapping the identical position information for ID = 1 gives [1, 3, 4] (FIG. 4(c)). The position information mapped in this way is stored in the memory (not shown).

Step 210) The inverted index addition unit 150 reads the mapped inverted list from the memory (not shown), adds it to the overall inverted index, stores it in the search index storage unit 300, and ends the processing on the index creation unit 100 side.

As a result, an inverted index composed of position information no greater than the fixed length L is created.
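To give a feel for the size reduction, the following sketch counts stored entries before and after aggregation for one word in one document. It assumes remainder-based aggregation; the function name `stored_entries` and the occurrence pattern (a word appearing every tenth position in a 1000-word document) are made up for illustration.

```python
def stored_entries(positions, fixed_length):
    """Return (entries before aggregation, entries after aggregation and mapping)."""
    distinct = {p % fixed_length for p in positions}
    return len(positions), len(distinct)


# Hypothetical word that occurs at positions 1, 11, 21, ..., 991.
positions = list(range(1, 1001, 10))
print(stored_entries(positions, 64))  # (100, 32)
```

Whatever the input, the second number can never exceed L, because only L distinct remainders exist.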

[Second Embodiment]
Because the position information aggregation method described above cuts at a fixed length L, it may split a sentence. In example sentence 1, cutting at a fixed length of 4 gives the following.

Example sentence (ID = 1): ／東京／で／東京／都
／知事／選挙／が／行わ
／れ／た／。／…／
Consequently, when "都知事" (Governor of Tokyo) is searched for, missed detections can occur in addition to false detections. This embodiment therefore presents an index creation method that uses a delimiter instead of the fixed length used in the first embodiment.
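The missed detection can be reproduced with a short sketch. It assumes each word is stored with its 1-based position inside its fixed-length segment, as in the split shown above; `segment_positions` is a hypothetical name and only the first eight words of the example are used.

```python
def segment_positions(tokens, fixed_length):
    """Pair each token with its 1-based position inside its fixed-length segment."""
    return [(tok, i % fixed_length + 1) for i, tok in enumerate(tokens)]


tokens = "東京／で／東京／都／知事／選挙／が／行わ".split("／")
by_word = {}
for tok, pos in segment_positions(tokens, 4):
    by_word.setdefault(tok, set()).add(pos)

print(by_word["都"])    # {4}: last word of the first segment
print(by_word["知事"])  # {1}: first word of the second segment

# Concatenation check for the query "都知事": is "知事" at position p + 1?
found = any(p + 1 in by_word["知事"] for p in by_word["都"])
print(found)  # False: the phrase straddles a segment boundary and is missed
```

The two words are adjacent in the document, but the segment boundary between them resets the counter, so the check fails.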

FIG. 7 shows the configuration of an information retrieval apparatus according to the second embodiment of the present invention.

The information retrieval apparatus shown in the figure replaces the position information aggregating unit 130 of FIG. 3 with a delimiter position information aggregating unit 230, which aggregates position information using a delimiter.

The delimiter can be specified arbitrarily. When "。" is used as the delimiter, the result is as follows.

Example sentence (ID = 1):
／東京／で／東京／都／知事／選挙／が／行わ／れ／た／。

…
／東京／出身／で／東京／都／知事／に／立候補／した／の／は／３人／で／…
With this approach, false detections may still occur, but missed detections can be prevented.

FIG. 8 shows an example in which position information is aggregated using a delimiter. The difference from FIG. 4 is that position information is aggregated at sentence breaks even when the values are 64 or below. For example, for document ID = 170 the prior-art inverted index holds [2, 6, 8, 10]; because the sixth word is a sentence break, this is aggregated to [2, 6, 2, 6] and finally mapped to [2, 6]. Because the same strings are likely to appear at the beginning of sentences, this kind of mapping can contribute to reducing the index size.

FIG. 9 is a flowchart of inverted index creation in the second embodiment of the present invention.

Only the differences from FIG. 6 are described in detail below.

Step 302) The delimiter position information aggregating unit 230 sets the delimiter. For example, "。" or "．" can be set. The delimiter may also be given as a pattern such as a regular expression.

Step 304) The delimiter position information aggregating unit 230 determines whether word(i) is the delimiter described above. If it is a delimiter, the process proceeds to step 306; if not, the process proceeds to step 305.

Step 305) If it is not a delimiter, position is advanced by one and the process proceeds to step 307.

Step 306) If it is a delimiter, position is set to 0 and the process proceeds to step 307.

Step 307) The inverted list creation unit 120 adds position to the inverted list of word(i) in the memory (not shown).

Processing other than the above is the same as the operation of FIG. 6.

This procedure makes it possible to create an inverted list that uses a delimiter.
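Steps 302 to 307, followed by the mapping of duplicates, can be sketched as follows. The names `DELIMITER` and `delimiter_positions` are hypothetical, the regular expression is only one possible way of giving the delimiter as a pattern, and the input joins the two sentences of the example while leaving out the part that is elided in the text.

```python
import re
from collections import defaultdict

# Step 302: delimiter given as a pattern (hypothetical choice covering "。" and "．").
DELIMITER = re.compile(r"[。．]")


def delimiter_positions(tokens):
    """Reset position to 0 at each delimiter, otherwise advance it by one."""
    inverted = defaultdict(list)
    position = 0
    for word in tokens:
        if DELIMITER.fullmatch(word):
            position = 0      # step 306
        else:
            position += 1     # step 305
        inverted[word].append(position)  # step 307
    # Mapping: merge identical positions into one entry per word.
    return {w: list(dict.fromkeys(p)) for w, p in inverted.items()}


tokens = ("東京／で／東京／都／知事／選挙／が／行わ／れ／た／。／"
          "東京／出身／で／東京／都／知事／に／立候補／した").split("／")
lists = delimiter_positions(tokens)
print(lists["東京"])  # [1, 3, 4]
print(lists["都"])    # [4, 5]
print(lists["知事"])  # [5, 6]
```

In both sentences "知事" sits one position after "都" (4 and 5, then 5 and 6), so the concatenation check for "都知事" still succeeds, unlike the fixed-length split.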

The operations of the components of the information processing apparatus described above can be built as a program, which can be installed and executed on a computer used as the information processing apparatus or distributed via a network.

The program so built can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer or distributed.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible within the scope of the claims.

30 document database (DB)
100 index creation means, index creation unit
110 word dividing means, word dividing unit
120 inverted list creation unit
130 position information aggregating means, position information aggregating unit
140 position information mapping means, position information mapping unit
150 inverted index addition means, inverted index addition unit
200 search result set specifying means, search result set specifying unit
230 delimiter position information aggregating unit
300 inverted index storage means, inverted index storage unit

Claims

An information retrieval apparatus comprising: index creation means for creating an index from a document group stored in a document database and storing it in search index storage means that stores an inverted index; and search result set specifying means for identifying, from the created index, a search result set to be returned as a search result, wherein
the index creation means includes:
word dividing means for reading a document from the document database and dividing the character string of the document into words;
position information aggregating means for aggregating the position information, which indicates at what ordinal position from the beginning each divided word occurs, into numerical values no greater than a preset fixed length, and storing the result in first storage means;
position information mapping means for acquiring from the first storage means the sequence of position information obtained by the position information aggregating means and mapping it to one inverted list in second storage means; and
inverted index addition means for acquiring from the second storage means the inverted list mapped by the position information mapping means and adding it to the inverted index storage means.

The information retrieval apparatus according to claim 1, wherein the position information aggregating means includes means for aggregating the position information using an arbitrarily specified delimiter, instead of the fixed length, when aggregating the position information.

An information retrieval program for causing a computer to function as each of the means constituting the information retrieval apparatus according to claim 1 or 2.

A computer-readable recording medium storing the information retrieval program according to claim 3.