JP2007334402A

JP2007334402A - Server, system and method for retrieving clustered vector data

Info

Publication number: JP2007334402A
Application number: JP2006162105A
Authority: JP
Inventors: Atsushi Hiroike; 敦廣池
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2006-06-12
Filing date: 2006-06-12
Publication date: 2007-12-27
Anticipated expiration: 2026-06-12
Also published as: JP5137339B2

Abstract

【課題】高次元ベクトルデータを対象とした類似検索において、クラスタリングによって高速処理を実装した場合、一般には、全件を参照した検索結果と一致する保証はない。高い精度を求めると、多くのクラスタの参照が必要となり、迅速な応答を実現できない。
【解決手段】検索エンジンが一回に参照するクラスタ数を一定とした上で、検索エンジンとアプリケーションとの間で、検索結果、及び、参照されたクラスタに関する情報を送受信する。検索エンジンは、既に参照されたクラスタに関する処理を省略し、さらに一定個数のクラスタに関する処理を実行することによって、一定の応答時間で、より精度の高い検索結果を構成することが可能となる。
【選択図】図４In a similar search for high-dimensional vector data, when high-speed processing is implemented by clustering, there is generally no guarantee that the search results refer to all cases. When high accuracy is required, many clusters need to be referred to, and a quick response cannot be realized.
A search engine and an application transmit and receive search results and information related to the referenced cluster, with a fixed number of clusters referenced by the search engine at a time. The search engine omits the processing related to the already referenced clusters and further executes the processing related to a certain number of clusters, thereby making it possible to construct a more accurate search result with a constant response time.
[Selection] Figure 4

Description

本発明は、計算機上でのベクトルデータの検索に関する。 The present invention relates to retrieval of vector data on a computer.

ベクトルデータの検索として一般的なのは、地図データ等の空間位置情報に基づく検索である。この場合、ベクトルデータの次元数は、高々２〜数次元程度となる。これに対して、画像・映像等を対象とする類似検索では、各データを特徴付けるデータとして、数１０次元〜数１００次元のベクトルが用いられる。類似検索では、検索のキーとなるデータと、ベクトル空間中で距離が小さいデータの検索、すなわち、ベクトル空間中での最近接検索が行われる。例えば、静止画像の類似検索では、画像中の色分布のヒストグラム等が、データを特徴付ける特徴量ベクトルとして用いられる。 A general search for vector data is a search based on spatial position information such as map data. In this case, the number of dimensions of the vector data is about 2 to several dimensions at most. On the other hand, in a similar search for images / videos, vectors of several tens of dimensions to several hundreds of dimensions are used as data characterizing each data. In the similar search, a search for data serving as a search key and data having a short distance in the vector space, that is, a closest search in the vector space is performed. For example, in a similarity search of still images, a histogram of color distribution in the image is used as a feature vector that characterizes data.

最近接検索では、ベクトル間の距離を知る必要がある。距離計算の対象データ数をＮ、特徴量ベクトルの次元数をＭとすれば、検索キーデータと距離計算の対象データとの間でＮ回の距離計算が必要となり、かつ、各距離計算に必要となる時間は、Ｍに比例する。従って、最近接検索を線形探索で実現した場合、１回の検索時間のＮ×Ｍに比例した計算時間が必要となる。 In the nearest neighbor search, it is necessary to know the distance between vectors. If the number of target data for distance calculation is N and the number of dimensions of the feature vector is M, N distance calculations are required between the search key data and the target data for distance calculation, and each distance calculation is required. Is proportional to M. Therefore, when the closest search is realized by a linear search, a calculation time proportional to N × M of one search time is required.

最近接検索の処理を高速化する手法として、検索キーデータに応じて距離計算の対象データを絞り込む方法が複数提案されている。多次元インデキシングと総称される一群の手法は、データベースの分野で利用されているバランス木の概念を多次元空間の処理に拡張したものである。多次元インデキシングでは、空間中の領域が木構造で管理される。そして、検索キーデータが与えられると、その検索キーデータを含む領域が定義されたリーフをLog Ｎのオーダで検索する。一方、パターン認識の分野では、距離が近いもの同士を予め分類しておく、クラスタリングに基づく高速化がしばしば用いられる。具体的な分類の手法としては、k-means法が一般的である。 As a technique for speeding up the nearest neighbor search process, a plurality of methods for narrowing the target data for distance calculation according to the search key data have been proposed. A group of techniques collectively referred to as multidimensional indexing is an extension of the concept of balanced trees used in the field of databases to multidimensional space processing. In multidimensional indexing, regions in space are managed in a tree structure. When the search key data is given, the leaf in which the area including the search key data is defined is searched with the order of Log N. On the other hand, in the field of pattern recognition, speeding up based on clustering is often used in which objects that are close to each other are classified in advance. As a specific classification method, the k-means method is common.

一方、非特許文献１は、高次元で分布する確率ベクトルに関する解析によって、通常の確率分布で成立するある条件の下で、任意の点と他の標本分布中の中との最小距離と最大距離との比が、空間の高次元化に伴い１に収束すること、すなわち、全ての標本点間の距離の差が無くなって行くことを示した。従って、上記の各種高速化手法が、高次元空間の処理で、一般的な意味で良好な効果を生むことは困難であり、通常は、その性能は線形探索に劣るものとなる、と結論付けている。ただし、非特許文献１は、標本の分布が一様構造ではなく、クラスタ構造を持つ場合は、クラスタリング処理が有効性を発揮することも指摘している。
Kevin Beyer, Jonathan Goldstein, Raghu Rmakrishnan, Uri Shaft.: "When Is Nearest Neighbor Meaningful", Proceeding of International Conference on Database Theory, 1999, p.217-235. On the other hand, Non-Patent Document 1 discloses that a minimum distance and a maximum distance between an arbitrary point and another sample distribution under certain conditions established by a normal probability distribution based on analysis of probability vectors distributed in a high dimension. It has been shown that the ratio of and converges to 1 as the dimension of the space increases, that is, the difference in distance between all sample points disappears. Therefore, it is concluded that the above-mentioned various speed-up methods are difficult to produce a good effect in a general sense in high-dimensional space processing, and the performance is usually inferior to linear search. ing. However, Non-Patent Document 1 also points out that the clustering process is effective when the sample distribution is not a uniform structure but has a cluster structure.
Kevin Beyer, Jonathan Goldstein, Raghu Rmakrishnan, Uri Shaft .: "When Is Nearest Neighbor Meaningful", Proceeding of International Conference on Database Theory, 1999, p.217-235.

対象となるデータに対して適切な特徴量ベクトルが抽出されていると仮定すれば、標本分布は、ある程度のクラスタ構造を持っていると想定される。従って、そこから適切にクラスタ構造を抽出することが出来れば、クラスタリングに基づく検索の高速化が期待できる。ただし、実際にどの程度の高速化が可能かは、存在しているクラスタが互いにどの程度分離しているか、及び、検索キーとクラスタとの関係に依存する。 Assuming that an appropriate feature vector is extracted for the target data, the sample distribution is assumed to have a certain degree of cluster structure. Therefore, if a cluster structure can be appropriately extracted therefrom, it is possible to expect a high-speed search based on clustering. However, how much speed can actually be increased depends on how far the existing clusters are separated from each other and the relationship between the search key and the clusters.

例えば、データ全体がＮｃ個のクラスタに分類されており、各クラスタには、それを構成するメンバ（すなわち、各クラスタに含まれるデータ）の平均ベクトルが保存されているものとする。検索時には、まず、検索キーのベクトルと各クラスタの平均ベクトルとの距離が計算される。次に、その距離が小さい順序にクラスタを参照していき、各クラスタのメンバと検索キーとの距離計算を行うことによって、類似検索の結果、すなわち、検索キー近傍のデータを取得する。 For example, it is assumed that the entire data is classified into Nc clusters, and each cluster stores an average vector of members (that is, data included in each cluster). When searching, first, the distance between the search key vector and the average vector of each cluster is calculated. Next, the clusters are referred to in ascending order of the distance, and the distance between the members of each cluster and the search key is calculated to obtain the result of the similarity search, that is, the data near the search key.

真の検索結果は、全数検索（すなわち、全クラスタの全メンバを対象とする検索）を行った場合の結果である。次に、上記の手続きにおいて、幾つのクラスタを参照すれば、真の結果と一致する結果を得られるか、について検討する。 The true search result is a result when an exhaustive search (that is, a search for all members of all clusters) is performed. Next, in the above procedure, it is examined how many clusters can be referred to obtain a result that matches the true result.

クラスタ間の分離が良く、かつ、検索キーが何れかのクラスタの平均ベクトルの近傍にある場合は、少数個のクラスタを参照するだけで十分であろう。仮に、最近接クラスタのメンバ数が、要求される検索結果数より十分大きければ、一つのクラスタのみの参照で済むかも知れない。逆に、検索キーが、クラスタ境界付近に位置する場合は、少なくとも、その境界に接するクラスタを参照する必要が生じる。また、検索キー近傍のクラスタの分離が悪い場合は、多数のクラスタを参照する必要が生じる。 If the separation between clusters is good and the search key is near the average vector of any cluster, it will be sufficient to reference a small number of clusters. If the number of members of the nearest cluster is sufficiently larger than the required number of search results, it may be necessary to refer to only one cluster. Conversely, when the search key is located near the cluster boundary, it is necessary to refer to at least a cluster in contact with the boundary. Also, when the separation of clusters near the search key is poor, it is necessary to refer to a large number of clusters.

実際の検索時には、真の結果は未知である。従って、真の結果になるべく近い結果を得たければ、なるべく多くのクラスタを参照する必要がある。しかし、あまり多数のクラスタを参照すると、線形探索と同等の計算量となってしまうため、検索の高速化は実現しない。このことは、クラスタリングではなく、多次元インデキシングを用いた場合も同様である。この場合、多数のリーフを参照することになる。 During the actual search, the true result is unknown. Therefore, it is necessary to refer to as many clusters as possible in order to obtain a result as close as possible to the true result. However, if too many clusters are referenced, the amount of calculation is equivalent to that of linear search, so that the search speed cannot be increased. The same applies to the case where multidimensional indexing is used instead of clustering. In this case, a large number of leaves are referred to.

さらに、クラスタリング及び多次元インデキシングのいずれが用いられる場合にも、データとの距離計算以外の処理が必要となる。すなわち、クラスタリングでは、クラスタ平均との距離計算、多次元インデキシングでは、木構造を辿る際の領域判定が必要である。また、大量のデータへのアクセスに関しては、一般に、単純な線形探索の方が効率的である。従って、これらの高速化手法は、少なくとも、アルゴリズム評価のレベルで十分な有効性を示す必要がある。 Furthermore, when clustering or multidimensional indexing is used, processing other than the calculation of the distance to the data is required. That is, in clustering, distance calculation with the cluster average and multidimensional indexing require area determination when tracing a tree structure. In general, a simple linear search is more efficient for accessing a large amount of data. Therefore, it is necessary that these speed-up methods have sufficient effectiveness at least at the level of algorithm evaluation.

現在、類似検索を必要とする多くの分野で、扱うべきデータ量が増大しており、線形探索に代わる高速な検索が必要とされている。従って、上述した問題点を克服する高速検索技術がますます必要となっている。 Currently, in many fields that require similar search, the amount of data to be handled is increasing, and high-speed search instead of linear search is required. Accordingly, there is an increasing need for high-speed search technology that overcomes the above-mentioned problems.

実際の類似検索の利用を考えた場合、必ずしも、真の結果を正確に知る必要がなく、それよりも、アプリケーション上での迅速な応答が要求される場合が多い。本発明では、クラスタリングに基づいた検索エンジンにおいて、検索エンジンとそれを使用するアプリケーションが以下のようなデータを送受信することによって、類似検索を行うアプリケーションにとって最適な処理系を実現する。 When considering the use of an actual similarity search, it is not always necessary to know the true result accurately, and a quick response on the application is often required. In the present invention, in a search engine based on clustering, a search engine and an application using the search engine transmit and receive the following data, thereby realizing an optimum processing system for an application that performs a similar search.

検索エンジンは、クライアントのアプリケーションから検索要求を受けると、所定の個数のクラスタを参照して類似検索を実行し、取得した類似検索の結果をアプリケーション側に返す。この際、検索エンジンは、検索のために参照したクラスタに関する情報も合わせてアプリケーション側に渡す。アプリケーション側に返された検索結果は、一般に、上位、すなわち、高類似度の結果の信頼性は高いが、下位の結果の信頼性は低い。アプリケーション側が、より真の結果に近い検索結果を必要とする場合、再度、同一の検索条件での検索要求をエンジン側に発行する。この際、アプリケーションは、以前に取得した検索結果及び以前に参照したクラスタに関する情報も合わせて検索エンジンに送信する。検索エンジンは、以前に参照したクラスタに関する処理を省略し、より低類似度の所定の個数のクラスタを参照し、クラスタメンバのデータに関する類似検索処理を行う。その結果、以前の検索結果よりも高類似度のデータを発見した場合、検索エンジンは、検索結果を更新し、参照したクラスタの情報とともに、検索結果をアプリケーション側に返す。このような処理を繰り返すことによって、アプリケーションは、必要に応じて、より精度の高い類似検索の結果を取得することが可能となる。 When the search engine receives a search request from the client application, the search engine performs a similar search with reference to a predetermined number of clusters, and returns the acquired similar search result to the application side. At this time, the search engine also passes the information about the cluster referred for the search to the application side. The search result returned to the application side is generally high in reliability of the high-order, that is, high-similarity results, but low in reliability of the low-order results. When the application side needs a search result that is closer to the true result, it issues a search request under the same search condition to the engine side again. At this time, the application also transmits the search result acquired previously and information about the previously referenced cluster to the search engine. The search engine omits the process related to the previously referred cluster, refers to a predetermined number of clusters having a lower similarity, and performs a similar search process regarding the data of the cluster members. As a result, when data having a higher similarity than the previous search result is found, the search engine updates the search result and returns the search result to the application side together with the information of the referenced cluster. By repeating such processing, the application can acquire the result of the similarity search with higher accuracy as necessary.

より具体的には、本願で開示する代表的な発明は、データを入出力するインターフェースと、前記インターフェースに接続されるプロセッサと、前記プロセッサに接続される一つ以上の記憶装置と、を備える検索サーバにおいて、前記記憶装置には、各々が複数のクラスタのいずれかに含まれる複数の第１ベクトルデータと、前記複数の第１ベクトルデータの前記クラスタごとの代表値と、が格納され、前記プロセッサは、第２ベクトルデータを含む検索要求を受信すると、受信した前記第２ベクトルデータをキーとして前記代表値を検索し、前記第２ベクトルデータとの距離が近い前記代表値を含む前記クラスタから順に、第１の所定の数の前記クラスタに含まれる複数の第１ベクトルデータを、前記第２ベクトルデータをキーとして検索し、前記検索された第１ベクトルデータのうち、前記第２ベクトルデータとの距離が近い第２の所定の数の前記第１ベクトルデータを、前記インターフェースを介して出力することを特徴とする。 More specifically, a representative invention disclosed in the present application includes an interface for inputting and outputting data, a processor connected to the interface, and one or more storage devices connected to the processor. In the server, the storage device stores a plurality of first vector data each included in any of a plurality of clusters, and representative values for the clusters of the plurality of first vector data, and the processor Receives the search request including the second vector data, searches the representative value using the received second vector data as a key, and sequentially starts from the cluster including the representative value that is close to the second vector data. A plurality of first vector data included in the first predetermined number of clusters are searched using the second vector data as a key. Of the first vector data the search, the first vector data of a predetermined number of distance second close between the second vector data, and outputs through the interface.

検索エンジン側においては、１回の検索要求に対して常に一定個数のクラスタのみが参照されるため、ほぼ一定の応答時間が実現される。アプリケーション側においては、その利用の文脈に応じて、要求する検索精度を制御可能であるため、エンドユーザにとって最適な処理系を構成することが可能となる。 On the search engine side, only a fixed number of clusters are always referred to for a single search request, so that a substantially constant response time is realized. On the application side, the required search accuracy can be controlled in accordance with the usage context, so that it is possible to configure an optimum processing system for the end user.

以下、本発明の実施の一形態として、画像を対象とした類似検索システムについて説明する。 Hereinafter, a similarity search system for an image will be described as an embodiment of the present invention.

図１は、本発明の実施の形態の類似検索システムの構成を示すブロック図である。 FIG. 1 is a block diagram showing the configuration of the similarity search system according to the embodiment of this invention.

検索エンジンが稼動するサーバ計算機１１０は、通信基盤１２０を経由して、アプリケーションプログラムが稼動するクライアント計算機１３０と接続され、クライアント計算機１３０に検索等のサービスを提供する。 The server computer 110 on which the search engine operates is connected to the client computer 130 on which the application program operates via the communication infrastructure 120, and provides services such as search to the client computer 130.

通信基盤１２０は、サーバ計算機１１０とクライアント計算機１３０とを接続するネットワーク（例えば、ＩＰネットワーク）である。 The communication infrastructure 120 is a network (for example, an IP network) that connects the server computer 110 and the client computer 130.

サーバ計算機１１０は、少なくとも、相互に接続されたインターフェース（Ｉ／Ｆ）１５１、ＣＰＵ１５２、メモリ１５３及びハードディスク１５４を備える。 The server computer 110 includes at least an interface (I / F) 151, a CPU 152, a memory 153, and a hard disk 154 that are connected to each other.

Ｉ／Ｆ１５１は、通信基盤１２０に接続され、サーバ計算機１１０とクライアント計算機１３０の間の通信に使用される。 The I / F 151 is connected to the communication infrastructure 120 and is used for communication between the server computer 110 and the client computer 130.

ＣＰＵ１５２は、メモリ１５３に格納されたプログラムを実行するプロセッサである。 The CPU 152 is a processor that executes a program stored in the memory 153.

メモリ１５３は、ＣＰＵ１５２によって実行されるプログラム及びＣＰＵ１５２によって参照されるデータを格納する記憶装置である。本実施の形態のメモリ１５３は、いわゆる主記憶装置であり、例えば、ランダムアクセス可能な半導体記憶装置である。本実施の形態のメモリ１５３は、少なくとも、検索サーバプロセス１１１を実現するためのサーバ・プログラム及びデータを格納する。 The memory 153 is a storage device that stores a program executed by the CPU 152 and data referred to by the CPU 152. The memory 153 according to the present embodiment is a so-called main memory device, for example, a semiconductor memory device that can be randomly accessed. The memory 153 of this embodiment stores at least a server program and data for realizing the search server process 111.

ハードディスク１５４は、一つ以上のハードディスクドライブ（ＨＤＤ）からなる記憶装置である。本実施の形態のハードディスク１５４は、画像サーバ１４０に格納された画像の特徴量ベクトルに関する情報を特徴量データ１１４及びクラスタ管理情報１１５として格納する。 The hard disk 154 is a storage device including one or more hard disk drives (HDD). The hard disk 154 according to the present embodiment stores information about the feature vector of the image stored in the image server 140 as feature data 114 and cluster management information 115.

特徴量ベクトルとは、画像サーバ１４０に格納されている画像の特徴をベクトルデータとして数値化したものである。特徴量ベクトルは、従来から知られている種々の方法によって算出することができる。 The feature amount vector is obtained by digitizing the feature of an image stored in the image server 140 as vector data. The feature vector can be calculated by various conventionally known methods.

本実施の形態において、各特徴量ベクトルは、複数のクラスタのいずれかに分類される。相互に距離が近い特徴量ベクトルは、同じクラスタに分類されることが望ましい。特徴量ベクトルは、どのような方法でクラスタに分類されてもよいが、本実施の形態では、k-means法によって分類される。 In the present embodiment, each feature vector is classified into one of a plurality of clusters. It is desirable that feature vectors that are close to each other are classified into the same cluster. The feature amount vector may be classified into clusters by any method, but in the present embodiment, it is classified by the k-means method.

一つの特徴量データ１１４は、一つのクラスタに分類された一つ以上の画像を識別するデータＩＤと、そのＩＤによって識別される画像データの特徴量ベクトルと、の組を含む。なお、各クラスタに分類された画像及び特徴ベクトルは、クラスタメンバとも記載される。 One feature amount data 114 includes a set of a data ID for identifying one or more images classified into one cluster and a feature amount vector of image data identified by the ID. Note that the images and feature vectors classified into each cluster are also described as cluster members.

クラスタ管理情報１１５は、各クラスタを識別するクラスタＩＤと、そのクラスタＩＤによって識別されるクラスタの代表値と、の組を含む。 The cluster management information 115 includes a set of a cluster ID for identifying each cluster and a representative value of the cluster identified by the cluster ID.

本実施の形態において、各クラスタの代表値とは、各クラスタに含まれる特徴量ベクトルの平均ベクトル（クラスタ平均）である。しかし、平均ベクトル以外の値がクラスタの代表値として使用されてもよい。例えば、平均ベクトルに近接するクラスタメンバの特徴量ベクトルが使用されてもよいし、他の統計的な代表値、例えば、最頻値、中央値等が使用されてもよい。k-means法による最適化の結果、各クラスタが完全に分離している場合、各特徴量ベクトルは、その特徴量ベクトルと最も距離が近い代表値を含むクラスタに含まれる。 In the present embodiment, the representative value of each cluster is an average vector (cluster average) of feature quantity vectors included in each cluster. However, a value other than the average vector may be used as the representative value of the cluster. For example, the feature vector of cluster members close to the average vector may be used, or other statistical representative values such as the mode value and the median value may be used. As a result of optimization by the k-means method, when each cluster is completely separated, each feature vector is included in a cluster including a representative value closest to the feature vector.

なお、本実施の形態のハードディスク１５４は、光ディスク装置、フラッシュメモリのような半導体記憶装置、又は、その他のいかなる種類の記憶装置によって置き換えられてもよい。 Note that the hard disk 154 of this embodiment may be replaced with an optical disk device, a semiconductor storage device such as a flash memory, or any other type of storage device.

画像サーバ１４０は、画像データを格納する記憶装置（図示省略）を備え、通信基盤１２０に接続される計算機である。 The image server 140 is a computer that includes a storage device (not shown) that stores image data and is connected to the communication infrastructure 120.

サーバ計算機１１０内の検索サーバプロセス１１１は、クラスタリングされた（すなわち、クラスタに分類された）検索対象を管理している。システム稼動時には、クラスタ管理情報１１５は、サーバ計算機のメモリ１５３内にクラスタ管理情報１１２として展開されている。各クラスタ情報１１３として、各クラスタＩＤ、そのＩＤによって識別されるクラスタの代表値である平均ベクトル、及び、クラスタメンバを識別するデータＩＤ列等が格納されている。各クラスタメンバの特徴量ベクトルは、特徴量データ１１４として一括してハードディスク１５４上で管理される。このため、メモリ１５３内のクラスタ情報１１３として、さらに、各クラスタメンバの特徴量ベクトルを格納したハードディスク上の位置が格納されている。 The search server process 111 in the server computer 110 manages search targets that are clustered (that is, classified into clusters). When the system is operating, the cluster management information 115 is expanded as the cluster management information 112 in the memory 153 of the server computer. As each cluster information 113, each cluster ID, an average vector which is a representative value of the cluster identified by the ID, a data ID string for identifying cluster members, and the like are stored. The feature vector of each cluster member is managed on the hard disk 154 collectively as feature data 114. For this reason, as the cluster information 113 in the memory 153, the position on the hard disk that stores the feature vector of each cluster member is further stored.

なお、クラスタ管理情報１１２は、ハードディスク１５４上に記録されたクラスタ管理情報のコピーである。このため、クラスタに対する更新が生じた場合、クラスタ管理情報１１２だけでなく、ハードディスク１５４上のクラスタ管理情報１１５も更新される。しかし、検索処理においてクラスタ管理情報１１５が直接参照されることはない。 The cluster management information 112 is a copy of the cluster management information recorded on the hard disk 154. Therefore, when an update to the cluster occurs, not only the cluster management information 112 but also the cluster management information 115 on the hard disk 154 is updated. However, the cluster management information 115 is not directly referenced in the search process.

クライアント計算機１３０は、通信基盤１２０に接続され、アプリケーションプログラム（図示省略）が稼動する計算機である。図１には二つのクライアント計算機１３０を示すが、本類似検索システムは任意の数のクライアント計算機１３０を備えてもよい。 The client computer 130 is a computer connected to the communication infrastructure 120 and running an application program (not shown). Although two client computers 130 are shown in FIG. 1, the present similarity search system may include any number of client computers 130.

クライアント計算機１３０は、いかなる構成の計算機であってもよい。図１には、典型的なクライアント計算機１３０の構成を示す。すなわち、図１のクライアント計算機１３０は、ＣＰＵ１３１、メモリ１３２、Ｉ／Ｆ１３３、入力装置１３４及び出力装置１３５を備える。 The client computer 130 may be a computer having any configuration. FIG. 1 shows a configuration of a typical client computer 130. That is, the client computer 130 of FIG. 1 includes a CPU 131, a memory 132, an I / F 133, an input device 134, and an output device 135.

ＣＰＵ１３１は、メモリ１３２に格納されたプログラムを実行するプロセッサである。 The CPU 131 is a processor that executes a program stored in the memory 132.

メモリ１３２は、ＣＰＵ１３１によって実行されるプログラム等を格納する記憶装置である。メモリ１３２は、少なくとも、アプリケーションプログラム（図示省略）を格納する。 The memory 132 is a storage device that stores programs executed by the CPU 131. The memory 132 stores at least an application program (not shown).

Ｉ／Ｆ１３３は、通信基盤１２０に接続され、クライアント計算機１３０とサーバ計算機１１０との間の通信に使用されるインターフェースである。 The I / F 133 is an interface that is connected to the communication infrastructure 120 and is used for communication between the client computer 130 and the server computer 110.

入力装置１３４は、クライアント計算機１３０のユーザから入力を受け付ける装置である。入力装置１３４は、例えば、キーボード、ポインティングデバイス又は画像スキャナ等を含んでもよい。 The input device 134 is a device that receives input from the user of the client computer 130. The input device 134 may include, for example, a keyboard, a pointing device, an image scanner, or the like.

出力装置１３５は、クライアント計算機１３０のユーザに情報を表示する装置である。具体的には、例えば、類似検索の結果として取得された画像が出力装置１３５に表示される。出力装置１３５は、例えばＣＲＴ又は液晶ディスプレイのような画像表示装置である。 The output device 135 is a device that displays information to the user of the client computer 130. Specifically, for example, an image acquired as a result of the similarity search is displayed on the output device 135. The output device 135 is an image display device such as a CRT or a liquid crystal display.

次に、本類似検索システムにおけるデータ登録時の処理について説明する。 Next, processing at the time of data registration in the similar search system will be described.

本システムでは、k-means法に基づいたクラスタリングを採用している。ただし、データの登録ごとに、全データのクラスタリングを行ったのでは、実用的な処理時間で登録処理を行うことは不可能である。本システムでは、新規データ登録時に、そのデータと距離が近いクラスタを、近接クラスタとして所定の個数検索し、検索された近接クラスタに関してk-means法の最適化を実行する。また、各クラスタのメンバのハードディスク上の格納領域を、物理的にも連続的に確保するために、クラスタ生成時に所定の量のディスク領域を確保する。それに従い、各クラスタメンバの最大数も制限される。 This system uses clustering based on the k-means method. However, if all data is clustered for each data registration, it is impossible to perform the registration process in a practical processing time. In this system, when new data is registered, a predetermined number of clusters having a distance close to that data are searched as neighboring clusters, and optimization of the k-means method is executed for the retrieved neighboring clusters. In addition, in order to continuously and physically secure storage areas on the hard disks of the members of each cluster, a predetermined amount of disk area is secured at the time of cluster generation. Accordingly, the maximum number of each cluster member is also limited.

図２は、本発明の実施の形態においてデータ登録時に実行される処理を示すフローチャートである。 FIG. 2 is a flowchart showing processing executed at the time of data registration in the embodiment of the present invention.

図２に示す処理は、検索サーバプロセス１１１を実現するサーバ・プログラムの一部として実行される。従って、図２に示す処理は、ＣＰＵ１５２によって実行される。 The process shown in FIG. 2 is executed as a part of a server program that implements the search server process 111. Therefore, the process shown in FIG.

ＣＰＵ１５２は、登録対象の新規データｘを与えられると、まず、近接クラスタを検索し、近接クラスタの集合Ｃ*を取得する（２１０）。具体的には、ＣＰＵ１５２は、各クラスタの平均ベクトルと新規データｘとを比較し、新規データｘと距離が近い平均ベクトルによって代表されるクラスタから順に、所定の数のクラスタを近接クラスタの集合Ｃ*として取得する。 When given the new data x to be registered, the CPU 152 first searches for a nearby cluster and acquires a set C * of neighboring clusters (210). Specifically, the CPU 152 compares the average vector of each cluster with the new data x, and selects a predetermined number of clusters in order from the cluster represented by the average vector whose distance is close to the new data x. Get as *.

次に、ＣＰＵ１５２は、近接クラスタの集合Ｃ*の中の最近接クラスタｃ*（すなわち、新規データｘと最も距離が近い平均ベクトルによって代表されるクラスタ）に、新規データｘを追加する（２２０）。 Next, the CPU 152 adds the new data x to the nearest cluster c * (that is, the cluster represented by the average vector closest to the new data x) in the set C * of adjacent clusters (220). .

次に、ＣＰＵ１５２は、パラメータt及びパラメータiを、それぞれ、「０」及び「１」に初期化する（２２１、２２２）。パラメータtは、k-means法の更新の反復回数を計数するために使用される。パラメータiは、近接クラスタの集合Ｃ*に要素として含まれるクラスタを指示するために使用される。 Next, the CPU 152 initializes the parameter t and the parameter i to “0” and “1”, respectively (221, 222). The parameter t is used to count the number of iterations of the k-means method update. The parameter i is used to indicate a cluster included as an element in the set C * of neighboring clusters.

その後、ステップ２３０以降に示す、k-means法による最適化のループに入る。 Thereafter, an optimization loop based on the k-means method shown in step 230 and after is entered.

具体的には、ＣＰＵ１５２は、近接クラスタの集合Ｃ*の要素である各クラスタについて（２３０）、クラスタのメンバ数が制限M_maxを超えるか否かを判定する（２３１）。 Specifically, the CPU 152 determines whether or not the number of cluster members exceeds the limit M_max for each cluster that is an element of the set C * of adjacent clusters (230) (231).

具体的には、ＣＰＵ１５２は、ステップ２３０において、パラメータiが集合Ｃ*の要素数以下であるか否かを判定する。ステップ２３０において、パラメータiが集合Ｃ*の要素数以下であると判定された場合、ＣＰＵ１５２は、集合Ｃ*のi番目の要素であるクラスタｃを対象として（２３４）、クラスタｃのメンバ数がM_maxを超えるか否かを判定する（２３１）。なお、最適化ループに入った時点では、新規データｘが追加された最近接クラスタｃ*以外のクラスタは、メンバ数制限を超えないことが前提となる。 Specifically, CPU 152 determines in step 230 whether parameter i is equal to or less than the number of elements in set C *. If it is determined in step 230 that the parameter i is less than or equal to the number of elements in the set C *, the CPU 152 targets the cluster c that is the i-th element of the set C * (234), and the number of members of the cluster c is It is determined whether or not M_max is exceeded (231). At the time of entering the optimization loop, it is assumed that the cluster other than the nearest cluster c * to which the new data x is added does not exceed the member number limit.

仮に最近接クラスタｃ*のメンバ数がM_maxを超えた場合、ＣＰＵ１５２は、そのクラスタを２分割し（２３２）、新たに生成されたクラスタｄを近接クラスタの集合Ｃ*の要素に加える（２３３）。クラスタを２分割する方法としては、種々の方法が考えられる。ここでは、そのクラスタ内のベクトル分布に関して主軸を求め、各メンバのベクトルの主軸への射影が、クラスタ平均ベクトルの射影のどちら側に存在するかを判定することによって、メンバを二つのクラスタに群分けする。 If the number of members of the closest cluster c * exceeds M_max, the CPU 152 divides the cluster into two (232), and adds the newly generated cluster d to the elements of the set C * of neighboring clusters (233). . Various methods can be considered as a method of dividing a cluster into two. Here, the principal axis is obtained with respect to the vector distribution in the cluster, and the members are grouped into two clusters by determining on which side the projection of the vector of each member onto the principal axis is on the projection of the cluster average vector. Divide.

ステップ２３３が実行された後、ＣＰＵ１５２の処理は、ステップ２３１に戻る。ステップ２３１では、分割後のクラスタのメンバ数がM_maxを超えているか否かが判定される。分割後のクラスタのメンバ数がM_maxを超えていると判定された場合、そのクラスタをさらに分割するために、処理はステップ２３２に進む。一方、分割後のクラスタのメンバ数がM_maxを超えていないと判定された場合、次のクラスタについてステップ２３１の判定を実行するために、ＣＰＵ１５２は、パラメータiの値に１を加算して（２３５）、ステップ２３０に戻る。 After step 233 is executed, the process of the CPU 152 returns to step 231. In step 231, it is determined whether the number of cluster members after the division exceeds M_max. If it is determined that the number of members of the cluster after the division exceeds M_max, the process proceeds to step 232 to further divide the cluster. On the other hand, when it is determined that the number of members of the cluster after the division does not exceed M_max, the CPU 152 adds 1 to the value of the parameter i in order to execute the determination of step 231 for the next cluster (235 ) And return to Step 230.

ステップ２３０において、パラメータiが集合Ｃ*の要素数を超えたと判定された場合、集合Ｃ*の要素である全てのクラスタのメンバ数がM_max以内であることが確認された。この場合、ＣＰＵ１５２は、k-means法による最適化の反復回数tをチェックする（２４０）。 In step 230, when it is determined that the parameter i exceeds the number of elements of the set C *, it is confirmed that the number of members of all the clusters that are elements of the set C * is within M_max. In this case, the CPU 152 checks the number of optimization iterations t by the k-means method (240).

本システムにおいて、図２に示す最適化は、あくまでクラスタの部分集合を対象としたものであり、クラスタ全体での最適化を意味しない。また、データの追加は、その後も繰り返し行われることを想定しており、その度に最適化が実行される。従って、ある時点での最適化を極端に重視する必要はなく、反復の最大数t_maxは、数回程度で十分である。 In this system, the optimization shown in FIG. 2 is intended only for a subset of clusters, and does not mean optimization for the entire cluster. Further, it is assumed that the addition of data is repeatedly performed thereafter, and optimization is executed each time. Therefore, it is not necessary to place an extreme importance on optimization at a certain point in time, and it is sufficient that the maximum number of iterations t_max is several times.

ステップ２４０において、反復回数を示すパラメータtが反復の最大数t_max以上であると判定された場合、k-means法による最適化が所定の回数実行されたため、図２の処理が終了する。あるいは、ステップ２４０において、集合Ｃ*が変化していないと判定された場合、さらに最適化を実行する必要がないと考えられる。従って、この場合も、図２の処理が終了する。 If it is determined in step 240 that the parameter t indicating the number of iterations is equal to or greater than the maximum number of iterations t_max, optimization by the k-means method has been performed a predetermined number of times, and the processing in FIG. 2 ends. Alternatively, if it is determined in step 240 that the set C * has not changed, it is considered that no further optimization is required. Therefore, also in this case, the process of FIG.

一方、ステップ２４０において、反復回数を示すパラメータtが反復の最大数t_maxより小さく、かつ、集合Ｃ*が変化していると判定された場合、クラスタの最適化を実行する必要があるため、ＣＰＵ１５２は、k-means法によって集合Ｃ*を更新する（２５０）。 On the other hand, if it is determined in step 240 that the parameter t indicating the number of iterations is smaller than the maximum number of iterations t_max and the set C * is changing, it is necessary to perform cluster optimization. Updates the set C * by the k-means method (250).

ステップ２５０の処理は、通常のk-means法と同様である。すなわち、近接クラスタに含まれる全データは、その時点での最も近接したクラスタ平均を持つクラスタに配分される。これによって、各近接クラスタのメンバ、及び、クラスタ平均が更新され、ステップ２３０に戻る。最適化ループに入った時点とは異なり、今回は、大きくクラスタの状態が変化した場合、複数のクラスタがメンバ数の上限を超える可能性がある。また、２分割しただけでは不十分であるため、再度分割が必要となる場合、あるいは、新たに生成されたクラスタが上限を超える場合も生じる可能性がある。このため、ＣＰＵ１５２は、全てのクラスタのメンバ数が上限M_max以下となるように処理（ステップ２３０からステップ２３５）した後、ステップ２４０に移行する。ステップ２４０で、反復数が最大数t_maxに達したか、あるいは、近接クラスタの集合Ｃ*の状態に全く変化がない場合、処理を終了する。 The process of step 250 is the same as the normal k-means method. That is, all data included in the adjacent cluster is distributed to the cluster having the closest cluster average at that time. This updates the members of each neighboring cluster and the cluster average, and returns to step 230. Unlike the point in time when the optimization loop is entered, this time, if the cluster state changes greatly, there is a possibility that a plurality of clusters may exceed the upper limit of the number of members. Moreover, since it is not sufficient to divide into two, there is a possibility that division may be necessary again, or a newly generated cluster may exceed the upper limit. For this reason, the CPU 152 proceeds to step 240 after performing processing (from step 230 to step 235) so that the number of members of all clusters is equal to or less than the upper limit M_max. In step 240, if the number of iterations reaches the maximum number t_max, or if there is no change in the state of the set C * of neighboring clusters, the process is terminated.

図２に示した一連の処理は、メモリ１５３上の作業領域で実行され、最終的な近接クラスタの状態が、ハードディスク１５４上に保存される。このハードディスク１５４上への保存時には、オプションとして、以下の機能が用意されている。 The series of processing shown in FIG. 2 is executed in the work area on the memory 153, and the final proximity cluster state is stored on the hard disk 154. When saving on the hard disk 154, the following functions are provided as options.

一般に、類似性が高いクラスタは、更新時、及び、検索時に同時に参照される可能性が高い。従って、類似性が高いクラスタ同士が、ハードディスク上でなるべく近傍に集まるように配列すれば、ディスク走査の負荷を低減できるはずである。 In general, a cluster having high similarity is highly likely to be referred to at the time of update and search. Therefore, if the clusters having high similarity are arranged as close to each other as possible on the hard disk, the disk scanning load should be reduced.

式（１）は、クラスタ位置の再配列の手続きを定義するためのエネルギー関数である。Ncは、クラスタ数、v_iは、i番目の位置にあるクラスタの平均ベクトルを表す。本エネルギー関数は、相前後する位置にあるクラスタ平均の２乗距離の総和として定義されている。 Equation (1) is an energy function for defining a cluster position rearrangement procedure. Nc represents the number of clusters, and v_i represents an average vector of clusters at the i-th position. This energy function is defined as the sum of the squared distances of the cluster averages at successive positions.

ただし、両端の位置、すなわち、１番目の位置とNc番目の位置での境界条件は、両端で折り返しを行った形式で定義している。具体的には、存在しない０番目の位置のクラスタの平均ベクトルv_0の代わりに、２番目の位置のクラスタの平均ベクトルv_2が用いられる。また、存在しないNc+1番目の位置のクラスタの平均ベクトルv_Nc+1の代わりに、Nc-1番目の位置のクラスタの平均ベクトルv_Nc-1が用いられる。この時、i番目の位置にあるクラスタとj番目の位置にあるクラスタを交換した場合のエネルギー関数の変化量は、式（２）によって算出される。ただし、j>iとする。 However, the boundary conditions at the positions at both ends, that is, the first position and the Nc-th position are defined in a form in which folding is performed at both ends. Specifically, the average vector v_2 of the cluster at the second position is used instead of the average vector v_0 of the cluster at the zeroth position that does not exist. Further, instead of the average vector v_Nc + 1 of the non-existing Nc + 1-th cluster, the average vector v_Nc-1 of the Nc-1-th cluster is used. At this time, the amount of change in the energy function when the cluster at the i-th position and the cluster at the j-th position are exchanged is calculated by Expression (2). However, j> i.

上記のエネルギー変化量に基づき、エネルギー関数が減少するように配列内のクラスタ位置を更新すれば、配列中で隣り合う位置に存在するクラスタ同士の距離が相対的に小さい状態が実現できる。 If the cluster positions in the array are updated so that the energy function is reduced based on the energy change amount, a state in which the distance between clusters existing at adjacent positions in the array is relatively small can be realized.

図３は、本発明の実施の形態において実行されるクラスタ位置の再配列の処理を示すフローチャートである。 FIG. 3 is a flowchart showing cluster position rearrangement processing executed in the embodiment of the present invention.

図３に示す処理は、検索サーバプロセス１１１を実現するサーバ・プログラムの一部として実行される。従って、図３に示す処理は、ＣＰＵ１５２によって実行される。 The process shown in FIG. 3 is executed as part of a server program that implements the search server process 111. Therefore, the process shown in FIG.

まず、ＣＰＵ１５２は、現在更新対象となっている近接クラスタの集合Ｃ*から、式（１）によって算出されるエネルギーの減少量が最大となるクラスタの組を探す（３１０）。 First, the CPU 152 searches a set of clusters that maximizes the amount of energy reduction calculated by the equation (1) from the set C * of adjacent clusters that are currently being updated (310).

次に、ＣＰＵ１５２は、ステップ３１０の条件に該当するクラスタの組を発見したか否かを判定する（３２０）。 Next, the CPU 152 determines whether or not a cluster set corresponding to the condition of step 310 has been found (320).

該当するクラスタの組が発見されない場合、エネルギーを減少させる位置の交換が存在しない（言い換えると、現在のクラスタの配列のエネルギーが最も小さい）。この場合、各クラスタは最適に配置されていると考えられるため、処理を終了する。 If no such cluster set is found, there is no exchange of positions that reduces energy (in other words, the current cluster array has the lowest energy). In this case, since each cluster is considered to be optimally arranged, the processing ends.

一方、該当するクラスタの組が発見された場合、ＣＰＵ１５２は、その位置を交換することによって配列を更新し（３３０）、次のクラスタの組を探すためにステップ３１０に戻る。最終的には、こうして得られた配列上の位置に従って、クラスタメンバの特徴量ベクトルをハードディスク１５４上へ保存する。 On the other hand, if a corresponding cluster set is found, the CPU 152 updates the array by exchanging its position (330) and returns to step 310 to search for the next cluster set. Finally, the cluster member feature vector is stored on the hard disk 154 in accordance with the position on the array thus obtained.

次に、本システムにおける検索処理について説明する。 Next, search processing in this system will be described.

図４は、本発明の実施の形態の類似検索における、クライアント・サーバ間の情報の流れ、及び、各プログラム内での処理の概略を示す説明図である。 FIG. 4 is an explanatory diagram showing an outline of information flow between the client and the server and processing in each program in the similarity search according to the embodiment of this invention.

クライアント・プログラムは、クライアント計算機１３０のメモリ１３２に格納され、ＣＰＵ１３１によって実行されるアプリケーションプログラムである。ＣＰＵ１３１は、必要に応じて入力装置１３４及び出力装置１３５を制御しながら、図４に示すクライアント側の処理を実行する。一方、サーバ・プログラムは、サーバプロセス１１１を実現するプログラムであり、サーバ計算機１１０のメモリ１５３に格納され、ＣＰＵ１５２によって実行される。 The client program is an application program that is stored in the memory 132 of the client computer 130 and executed by the CPU 131. The CPU 131 executes the processing on the client side shown in FIG. 4 while controlling the input device 134 and the output device 135 as necessary. On the other hand, the server program is a program for realizing the server process 111, stored in the memory 153 of the server computer 110, and executed by the CPU 152.

また、以下の説明においてクライアント・プログラムからサーバ・プログラムに送信されるデータ、及び、サーバ・プログラムからクライアント・プログラムに返されるデータは、実際には、Ｉ／Ｆ１３３、通信基盤１２０及びＩ／Ｆ１５１を介して送受信される。 In the following description, data transmitted from the client program to the server program and data returned from the server program to the client program are actually transmitted to the I / F 133, the communication infrastructure 120, and the I / F 151. Are sent and received via

クライアント計算機のユーザは、入力装置１３４を使用して、検索要求を入力することができる。ユーザからの要求（４０１）を受け付けたクライアント・プログラムは、類似検索のキーとなるベクトルデータ（以下、キーデータ）を検索条件として含む検索要求を、サーバ・プログラムに送信する（４０２）。 A user of the client computer can input a search request using the input device 134. The client program that has received the request (401) from the user transmits a search request including vector data (hereinafter referred to as key data) serving as a key for similarity search as a search condition to the server program (402).

サーバ・プログラムは、まず、クラスタを対象とした類似検索を実行する（４０３）。以下、これをクラスタ間検索処理と呼ぶ。クラスタ間検索処理によって、キーデータと全クラスタの平均ベクトルとの間の距離計算が実行され、距離が小さい（すなわち、近い）順序にソートされた所定の個数Rc個のクラスタからなる配列が得られる。 The server program first executes a similarity search for the cluster (403). Hereinafter, this is referred to as an intercluster search process. The inter-cluster search process calculates the distance between the key data and the average vector of all clusters, and obtains an array consisting of a predetermined number of Rc clusters sorted in the order of small (ie, close) distance. .

次に、このソートされたクラスタ配列中のメンバを参照し、各クラスタメンバの特徴量ベクトルとキーデータとの間の距離計算を実行することによって、類似検索の結果を導出する（４０４）。具体的には、ソートされたRc個のクラスタの全メンバの特徴量ベクトルとキーデータとの間の距離を計算し、その距離が小さい順にクラスタメンバをソートする。以下、この処理をクラスタ内検索処理と呼ぶ。 Next, by referring to the members in the sorted cluster array and calculating the distance between the feature vector of each cluster member and the key data, the result of the similarity search is derived (404). Specifically, the distance between the feature vector of all members of the sorted Rc clusters and the key data is calculated, and the cluster members are sorted in ascending order of the distance. Hereinafter, this processing is referred to as intra-cluster search processing.

サーバ・プログラムは、類似検索の結果を、クライアント・プログラムに返す（４０５）。クライアント・プログラムに返される類似検索の結果は、少なくとも、キーデータとの距離が近いものから順に、所定の個数のクラスタメンバの識別子を含む。この際、サーバ・プログラムは、実際に参照されたクラスタ（すなわち、類似検索が終了したクラスタ）に関する情報を類似検索の結果に付加して返す。以下、前回までの類似検索の際に実際に参照されたクラスタに関する情報を「参照クラスタ情報」と記載する。検索結果及び参照クラスタ情報については、後で図５及び図６を参照して説明する。 The server program returns the result of the similarity search to the client program (405). The result of the similarity search returned to the client program includes identifiers of a predetermined number of cluster members in order from the one closest to the key data. At this time, the server program returns information about the cluster actually referred to (that is, the cluster for which the similar search has been completed) added to the result of the similar search. Hereinafter, the information about the cluster actually referred to in the previous similar search is referred to as “reference cluster information”. Search results and reference cluster information will be described later with reference to FIGS.

検索結果を受信したクライアント・プログラムは、検索結果の表示等の処理を実行する（４０６）。例えば、クライアント・プログラムは、受信した検索結果に含まれる識別子によって識別される画像を、出力装置１３５に表示してもよい。検索結果の表示の例については、後で図８を参照して説明する。 The client program that has received the search result executes processing such as display of the search result (406). For example, the client program may display an image identified by an identifier included in the received search result on the output device 135. An example of the display of the search result will be described later with reference to FIG.

その後、再度、ユーザから同一条件の検索要求を受け付けると（４０７）、クライアント・プログラムは、検索条件とともに、以前取得した参照クラスタ情報及び検索結果を、検索要求としてサーバ・プログラムに送信する（４０８）。 After that, when a search request with the same condition is received from the user again (407), the client program transmits the previously acquired reference cluster information and search result together with the search condition to the server program as a search request (408). .

検索要求を受信すると、サーバ・プログラムは、再びクラスタ間検索処理を実行する（４０９）。この際、サーバ・プログラムは、参照クラスタ情報を用いて、参照済みのクラスタに関する距離計算を省略することができる。ステップ４０９の検索の結果、前回検索されたクラスタ数RcにさらにRc個加えたRc×２個のクラスタが、各クラスタの平均ベクトルとキーデータとの間の距離が小さい順にソートされる。その結果、ソートされたRc×２個のクラスタからなる配列が得られる。 When the search request is received, the server program executes the inter-cluster search process again (409). At this time, the server program can omit the distance calculation regarding the referenced cluster by using the reference cluster information. As a result of the search in step 409, Rc × 2 clusters obtained by adding Rc to the number of clusters Rc searched last time are sorted in ascending order of the distance between the average vector of each cluster and the key data. As a result, an array composed of sorted Rc × 2 clusters is obtained.

次のクラスタ内検索処理（４１０）において、サーバ・プログラムは、前回参照された上位Rc個のクラスタのメンバの検索を省略し、Rc+1位からRc×2位までの順位にあるRc個のクラスタについて、キーデータを用いてクラスタメンバとの類似検索を実行し、その検索結果を前回の検索結果とマージする。具体的には、今回得られた検索結果と前回得られた検索結果を合わせて、それらの検索結果であるクラスタメンバをキーデータとの間の距離が近い順にソートする。 In the next intra-cluster search process (410), the server program skips the search for the members of the top Rc clusters that were referenced last time, and the Rc items in the rank from Rc + 1 to Rc × 2 For the cluster, the key data is used to perform a similarity search with cluster members, and the search result is merged with the previous search result. Specifically, the search results obtained this time and the search results obtained last time are combined, and the cluster members that are the search results are sorted in order of increasing distance from the key data.

サーバ・プログラムは、こうして更新された検索結果と、Rc×２個の参照クラスタに関する情報とを、クライアント・プログラムに返す（４１１）。 The server program returns the search result updated in this way and the information related to Rc × 2 reference clusters to the client program (411).

以下、同様の処理を繰り返すことによって、参照クラスタの個数が増えていく。これによって、クライアント・プログラムは、より高い精度の検索結果を、ユーザからの要求に応じて、逐次的に取得することができる。１回の検索要求に応じて新たに検索されるクラスタの数は一定（Rc個）である。そして、各クラスタのメンバ数には上限（M_max）がある。このため、１回の検索要求に応じた検索処理に要する時間は、一定の上限を超えることがない。各クラスタのメンバ数が概ね同じであれば、１回の検索要求に応じた検索処理に要する時間も概ね同じとなる。 Thereafter, the number of reference clusters increases by repeating similar processing. As a result, the client program can sequentially obtain search results with higher accuracy in response to a request from the user. The number of clusters newly searched in response to one search request is constant (Rc). There is an upper limit (M_max) for the number of members in each cluster. For this reason, the time required for the search processing corresponding to one search request does not exceed a certain upper limit. If the number of members of each cluster is substantially the same, the time required for the search processing corresponding to one search request is also approximately the same.

図５は、本発明の実施の形態の参照クラスタ情報の説明図である。 FIG. 5 is an explanatory diagram of the reference cluster information according to the embodiment of this invention.

具体的には、図５は、例えば図４のステップ４０５、４０８及び４１１において、クライアント・プログラムとサーバ・プログラムの間で送受信される参照クラスタ情報の説明図である。 Specifically, FIG. 5 is an explanatory diagram of reference cluster information transmitted / received between the client program and the server program in, for example, steps 405, 408, and 411 of FIG.

各クラスタは、各クラスタを識別するクラスタＩＤによって管理されている。参照クラスタ情報は、タイムスタンプ５１０、既に検索のために参照されたクラスタを識別するクラスタＩＤの列、及び、各クラスタの平均ベクトルとキーデータとの間の距離を示す情報を含む。タイムスタンプ５１０には、その参照クラスタ情報が生成された時刻が記録される。 Each cluster is managed by a cluster ID that identifies each cluster. The reference cluster information includes a time stamp 510, a cluster ID column for identifying a cluster already referenced for search, and information indicating a distance between the average vector of each cluster and key data. The time stamp 510 records the time when the reference cluster information was generated.

なお、図５において、各クラスタＩＤと距離の組は、距離の値が昇順となるように整列される。すなわち、距離の値が最も小さいクラスタＩＤが参照クラスタ情報の先頭となる。 In FIG. 5, the pairs of cluster IDs and distances are aligned so that the distance values are in ascending order. That is, the cluster ID having the smallest distance value is the head of the reference cluster information.

図６は、本発明の実施の形態の検索結果の説明図である。 FIG. 6 is an explanatory diagram of search results according to the embodiment of this invention.

具体的には、図６は、例えば図４のステップ４０５、４０８及び４１１において、クライアント・プログラムとサーバ・プログラムの間で送受信される検索結果の説明図である。 Specifically, FIG. 6 is an explanatory diagram of search results transmitted and received between the client program and the server program in, for example, steps 405, 408, and 411 of FIG.

検索結果は、クラスタ内検索処理の結果として得られたクラスタメンバを識別するデータＩＤの列、及び、各クラスタメンバの特徴量ベクトルとキーデータとの間の距離を示す情報を含む。 The search result includes a data ID column for identifying the cluster member obtained as a result of the intra-cluster search process, and information indicating the distance between the feature quantity vector of each cluster member and the key data.

なお、図５と同様、図６のデータＩＤと距離の組は、距離の値が昇順となるように整列される。 As in FIG. 5, the pairs of data IDs and distances in FIG. 6 are aligned so that the distance values are in ascending order.

図７は、本発明の実施の形態のサーバ・プログラムにおいて実行されるクラスタ内検索処理を示すフローチャートである。 FIG. 7 is a flowchart showing intra-cluster search processing executed in the server program according to the embodiment of this invention.

図７の処理は、図４のステップ４０４及び４１０において実行される。従って、図７の処理は、サーバ・プログラムの一部として、サーバ計算機１１０のＣＰＵ１５２によって実行される。 The process of FIG. 7 is executed in steps 404 and 410 of FIG. Accordingly, the processing of FIG. 7 is executed by the CPU 152 of the server computer 110 as part of the server program.

最初に、処理の概要を説明する。 First, an outline of the process will be described.

ステップ７１０の判定は、クラスタ間検索によって得られたクラスタ配列の先頭から、参照済みのクラスタをスキップするための判定である。クラスタのメンバ及びクラスタの順序に変更がない場合、クラスタ間検索の結果は不変である。この場合、参照済みクラスタを対象としたクラスタ内検索処理の結果は、既に取得した結果と同じになるはずである。従って、この場合、参照済みクラスタを対象としたクラスタ内検索処理を省略することができる。 The determination in step 710 is a determination for skipping the referenced cluster from the beginning of the cluster array obtained by the inter-cluster search. When there is no change in the cluster members and the cluster order, the inter-cluster search results are unchanged. In this case, the result of the intra-cluster search process for the referenced cluster should be the same as the already acquired result. Therefore, in this case, the intra-cluster search process for the referenced cluster can be omitted.

ただし、データの更新処理によって、クラスタ全体の状態が変わった場合、クラスタ間検索の結果は、変わってしまう可能性がある。例えば、クラスタに新たなメンバが追加された場合（図２参照）、さらに、クラスタの順序が変更された場合（図３参照）、それらのクラスタについては、クラスタ内検索処理を実行する必要がある。クラスタ内検索処理を省略できるか否かの判定が、ステップ７２０において実行される。 However, if the status of the entire cluster changes due to data update processing, the inter-cluster search result may change. For example, when a new member is added to the cluster (see FIG. 2), and when the order of the clusters is changed (see FIG. 3), it is necessary to execute a search process within the cluster for those clusters. . A determination is made in step 720 as to whether the intra-cluster search process can be omitted.

ステップ７３０以降のループにおいて、参照済みクラスタを除く所定のRc個のクラスタのメンバを対象とする類似検索が実行される。類似検索のために最後に参照された後に状態が変わってしまったクラスタがある場合、そのクラスタは参照済みクラスタに含まれない。 In a loop after step 730, a similarity search is performed on members of predetermined Rc clusters excluding the referenced cluster. If there is a cluster whose state has changed since it was last referenced for similarity search, that cluster is not included in the referenced cluster.

次に、処理の詳細を説明する。 Next, details of the processing will be described.

以下の説明において、配列Ａは、参照クラスタ情報によって示されるクラスタの配列に相当する。配列Ｂは、クラスタ間検索処理の結果として得られた配列に相当する。例えば、図７の処理が図４のステップ４１０において実行される場合、ステップ４０８においてクライアント・プログラムから送信された参照クラスタ情報（１）が示すクラスタの配列が配列Ａに相当し、ステップ４０９のクラスタ間検索処理の結果が配列Ｂに相当する。また、パラメータi及びjは、配列Ａ又はＢに含まれる要素を示すために使用される。 In the following description, the array A corresponds to the array of clusters indicated by the reference cluster information. The array B corresponds to the array obtained as a result of the intercluster search process. For example, when the process of FIG. 7 is executed in step 410 of FIG. 4, the array of clusters indicated by the reference cluster information (1) transmitted from the client program in step 408 corresponds to the array A, and the cluster of step 409 The result of the inter-search process corresponds to the array B. Parameters i and j are used to indicate the elements included in array A or B.

最初に、サーバ・プログラムは、パラメータiの値を「０」に初期化する（７０１）。 First, the server program initializes the value of the parameter i to “0” (701).

次に、サーバ・プログラムは、パラメータiの値が配列Ａの要素数以下であるか否かを判定する（７１０）。 Next, the server program determines whether or not the value of the parameter i is less than or equal to the number of elements in the array A (710).

ステップ７１０において、パラメータiの値が配列Ａの要素数以下であると判定された場合、次に、サーバ・プログラムは、クラスタＡ[i]とクラスタＢ[i]が同一であるか否かを判定する（７２０）。 If it is determined in step 710 that the value of parameter i is less than or equal to the number of elements in array A, then the server program determines whether cluster A [i] and cluster B [i] are the same. Determine (720).

クラスタはクラスタＩＤによって管理される。従って、Ａ[i]とＢ[i]のクラスタＩＤが異なる場合、これらのクラスタは同一でないと判定される。さらに、Ａ[i]とＢ[i]のクラスタＩＤが同一であっても、状態が異なる場合がある。したがって、ステップ７２０の判定のために、タイムスタンプ５１０が参照される。Ａ[i]とＢ[i]のクラスタＩＤが同一でも、Ａ[i]に関するタイムスタンプ５１０が示す時刻以降にＢ[i]が更新されていた場合、これらのクラスタは異なるクラスタであると判定される。 A cluster is managed by a cluster ID. Therefore, when the cluster IDs of A [i] and B [i] are different, it is determined that these clusters are not the same. Furthermore, even if the cluster IDs of A [i] and B [i] are the same, the states may be different. Therefore, the time stamp 510 is referred to for the determination in step 720. Even if the cluster IDs of A [i] and B [i] are the same, if B [i] has been updated after the time indicated by the time stamp 510 for A [i], these clusters are determined to be different clusters. Is done.

ステップ７２０において、Ａ[i]とＢ[i]が異なるクラスタであると判定された場合、クラスタＢ[i]の状態は、最後に参照された後で更新されている。この場合、クラスタＢ[i]及びその下位のクラスタを対象とするクラスタ内検索処理を実行するため、サーバ・プログラムはステップ７２２に進む。 If it is determined in step 720 that A [i] and B [i] are different clusters, the state of cluster B [i] has been updated since it was last referenced. In this case, the server program proceeds to Step 722 in order to execute the intra-cluster search process for the cluster B [i] and its lower clusters.

一方、ステップ７２０において、Ａ[i]とＢ[i]が同一のクラスタであると判定された場合、参照済みクラスタであるＢ[i]を対象とするクラスタ内検索処理は、省略することができる。この場合、次の要素について判定するため、サーバ・プログラムは、パラメータiの値を１加算して（７２１）、ステップ７１０に戻る。 On the other hand, if it is determined in step 720 that A [i] and B [i] are the same cluster, the intra-cluster search process for B [i] that is the referenced cluster may be omitted. it can. In this case, in order to determine the next element, the server program adds 1 to the value of the parameter i (721), and returns to step 710.

ステップ７１０において、パラメータiの値が配列Ａの要素数を超えると判定された場合、クラスタＢ[i]及びそれより下位のクラスタは、まだクラスタ内検索処理の対象になったことがない。この場合、クラスタＢ[i]及びその下位のクラスタを対象とするクラスタ内検索処理を実行するため、サーバ・プログラムはステップ７２２に進む。 If it is determined in step 710 that the value of the parameter i exceeds the number of elements in the array A, the cluster B [i] and lower clusters have not yet been subjected to intra-cluster search processing. In this case, the server program proceeds to Step 722 in order to execute the intra-cluster search process for the cluster B [i] and its lower clusters.

新たに参照すべきクラスタの開始位置（すなわち、ステップ７２２の時点のＢ[i]）が確定したら、サーバ・プログラムは、そのクラスタから始まるRc個のクラスタに関してクラスタ内検索処理を実行する。 When the start position of a cluster to be newly referred to (that is, B [i] at the time of step 722) is determined, the server program executes an intra-cluster search process for Rc clusters starting from the cluster.

具体的には、サーバ・プログラムは、ステップ７２２において、パラメータjの値を「１」に初期化する。 Specifically, in step 722, the server program initializes the value of the parameter j to “1”.

次に、サーバ・プログラムは、パラメータjの値が、所定の参照クラスタ数Rc以下であるか否かを判定する（７３０）。 Next, the server program determines whether or not the value of the parameter j is equal to or less than a predetermined reference cluster number Rc (730).

ステップ７３０において、パラメータjの値がRc以下であると判定された場合、まだRc個のクラスタを対象とするクラスタ内検索処理が終了していない。この場合、サーバ・プログラムは、クラスタＢ[i+j]を対象とするクラスタ内検索処理を実行する（７４０）。具体的には、サーバ・プログラムは、クラスタＢ[i+j]のメンバとキーデータとの間の距離を計算する。 If it is determined in step 730 that the value of the parameter j is equal to or less than Rc, the intra-cluster search process for Rc clusters has not been completed yet. In this case, the server program executes an intra-cluster search process for the cluster B [i + j] (740). Specifically, the server program calculates the distance between the members of the cluster B [i + j] and the key data.

次に、サーバ・プログラムは、既に取得した検索結果に、クラスタＢ[i+j]から得られた結果をマージすることによって、検索結果を更新する（７５０）。ここで、既に取得した検索結果とは、クライアント・プログラムから送信された検索要求に含まれる検索結果（例えば、図４のステップ４０８に示す検索結果（１））である。具体的には、サーバ・プログラムは、既に取得した検索結果であるクラスタメンバに、ステップ７４０において検索されたクラスタのメンバを追加し、それらのクラスタメンバを、検索キーデータからの距離が小さい順にソートする。 Next, the server program updates the search result by merging the result obtained from the cluster B [i + j] with the already obtained search result (750). Here, the already acquired search result is a search result included in the search request transmitted from the client program (for example, search result (1) shown in step 408 in FIG. 4). Specifically, the server program adds the members of the cluster searched in step 740 to the cluster members that are already acquired search results, and sorts these cluster members in ascending order of the distance from the search key data. To do.

次に、サーバ・プログラムは、次のクラスタについて処理するため、パラメータjの値を１加算して（７５１）、ステップ７３０に戻る。 Next, in order to process the next cluster, the server program adds 1 to the value of the parameter j (751) and returns to step 730.

ステップ７３０において、パラメータjの値がRcを超えると判定された場合、Rc個のクラスタを対象とするクラスタ内検索処理が終了した。この場合、サーバ・プログラムは、処理を終了する。 If it is determined in step 730 that the value of the parameter j exceeds Rc, the intra-cluster search process for Rc clusters is completed. In this case, the server program ends the process.

次に、本システムにおけるユーザに対する検索結果の表示について説明する。 Next, display of search results for the user in this system will be described.

図８は、本発明の実施の形態において表示される検索結果表示画面の状態遷移の例の説明図である。 FIG. 8 is an explanatory diagram of an example of the state transition of the search result display screen displayed in the embodiment of the present invention.

図１から図７を参照して説明したように、本実施の形態のサーバ・プログラムは、クライアント・プログラムから送信されたベクトルデータをキーとして、多数のベクトルデータを対象とする類似検索を実行する。そして、サーバ・プログラムは、検索結果をクライアント・プログラムに返す。クライアント・プログラムは、検索結果であるベクトルデータ自体を表示してもよいが、ベクトルデータに関連付けられたデータを表示してもよい。 As described with reference to FIGS. 1 to 7, the server program according to the present embodiment executes a similarity search for a large number of vector data using the vector data transmitted from the client program as a key. . Then, the server program returns the search result to the client program. The client program may display the vector data itself as a search result, but may display data associated with the vector data.

本実施の形態において、検索対象のベクトルデータは、画像の特徴量ベクトルである。この場合、クライアント・プログラムは、検索によって得られた特徴量ベクトル自体を表示するのではなく、その特徴量ベクトルに対応する画像を検索結果として表示する。特徴量ベクトルに対応する画像は、画像サーバ１４０に格納されている。 In the present embodiment, the vector data to be searched is an image feature vector. In this case, the client program does not display the feature vector obtained by the search but displays an image corresponding to the feature vector as a search result. An image corresponding to the feature vector is stored in the image server 140.

図６に示すように、サーバ・プログラムからクライアント・プログラムに送信される検索結果がデータＩＤを含む場合、クライアント・プログラムは、受信したデータＩＤによって識別される画像を画像サーバ１４０から取得して、検索結果として出力装置１３５に表示する。 As shown in FIG. 6, when the search result transmitted from the server program to the client program includes a data ID, the client program acquires an image identified by the received data ID from the image server 140, and The search result is displayed on the output device 135.

図８には、検索によって得られた特徴量ベクトルに対応する画像を、検索結果として表示する画面の例を示す。なお、図８の画面は、クライアント計算機の出力装置１３５に表示される。 FIG. 8 shows an example of a screen that displays an image corresponding to the feature vector obtained by the search as a search result. The screen in FIG. 8 is displayed on the output device 135 of the client computer.

図８の例では、検索結果の最大数を１００件とし、１画面上に２０件ずつ画像を表示する画面構成となっている。画面上の表示は、２０件だけであるが、クライアント・プログラムは、サーバ・プログラムに対して、特徴量ベクトルとキーデータとの間の類似性が高い上位１００件の検索結果を常に要求するよう設定されている。 In the example of FIG. 8, the maximum number of search results is 100, and the screen configuration displays 20 images on one screen. Although only 20 items are displayed on the screen, the client program always requests the server program for the top 100 search results having high similarity between the feature vector and the key data. Is set.

図８は、類似検索結果表示画面８１０、８２０、８３０及び８４０が順次表示される例を示す。各画面８１０等は、検索結果８１１及びボタン８１２等を含む。検索結果８１１は、検索結果である画像、又は、その画像を縮小したサムネイル画像である。 FIG. 8 shows an example in which similar search result display screens 810, 820, 830, and 840 are sequentially displayed. Each screen 810 and the like includes a search result 811 and a button 812 and the like. The search result 811 is an image as a search result or a thumbnail image obtained by reducing the image.

検索結果表示の初期画面８１０には、検索結果１００件中の１位から２０位までが、特徴量ベクトルとキーデータとの間の類似性が高い順序に配列され、表示されている（８１１）。初期画面８１０の表示は、図４のステップ４０６に相当する。従って、このとき表示される検索結果は、図４のステップ４０５における検索結果（１）に相当する。 On the initial screen 810 of the search result display, the first to twentieth out of 100 search results are arranged and displayed in the order of high similarity between the feature vector and the key data (811). . The display of the initial screen 810 corresponds to step 406 in FIG. Therefore, the search result displayed at this time corresponds to the search result (1) in step 405 of FIG.

初期画面８１０には、さらに、ボタン８１２及び８１３が表示される。ボタン８１２は、検索結果の最初の２０件（すなわち、１位から２０位までの検索結果）を表示する要求を受け付けるボタンである。一方、ボタン８１３は、次の２０件（すなわち、２１位から４０位までの検索結果）を表示する要求を受け付けるボタンである。 On the initial screen 810, buttons 812 and 813 are further displayed. The button 812 is a button for receiving a request to display the first 20 search results (that is, the search results from the first place to the 20th place). On the other hand, the button 813 is a button for receiving a request to display the next 20 items (that is, the search results from the 21st place to the 40th place).

初期画面には、最初の２０件が表示されている。このため、ユーザは、ボタン８１２を操作することができない。この場合、ボタン８１２は、ボタン８１３と異なる態様（例えば、異なる色彩又は形状）で表示される。 The initial 20 items are displayed on the initial screen. For this reason, the user cannot operate the button 812. In this case, the button 812 is displayed in a different mode (for example, different color or shape) from the button 813.

ここで、ユーザがボタン８１３を操作すると、再度、クライアント・プログラムが検索要求をサーバに送信する。入力装置がマウスを含む場合、ボタン８１３の操作は、マウスクリックであってもよい。ユーザによるボタン８１３の操作が図４のステップ４０７に相当し、その結果送信される検索要求がステップ４０８に相当する。サーバ・プログラムは、受信した検索要求に従って、類似検索処理を実行し、その結果をクライアント・プログラムに返す（図４のステップ４０９、４１０及び４１１）。 Here, when the user operates the button 813, the client program transmits a search request to the server again. When the input device includes a mouse, the operation of the button 813 may be a mouse click. The operation of the button 813 by the user corresponds to step 407 in FIG. 4, and the search request transmitted as a result corresponds to step 408. The server program executes a similar search process according to the received search request and returns the result to the client program (steps 409, 410 and 411 in FIG. 4).

その結果、クライアント・プログラムは、更新された１００件の検索結果を取得する。検索結果取得後、画面８１０は画面８２０に遷移する。本システムでは、上位の検索結果は相対的に安定しており、通常の場合、画面８２０には、更新された検索結果中の２１位から４０位までの画像が、類似性が高い順序で表示される。 As a result, the client program acquires 100 updated search results. After obtaining the search results, the screen 810 transitions to the screen 820. In this system, the upper search results are relatively stable, and in the normal case, the images from the 21st to the 40th in the updated search results are displayed on the screen 820 in the order of high similarity. Is done.

ただし、検索結果の更新の結果、上位２０位以内に変動が生じる場合も当然ある。この場合、画面８２０には、更新された検索結果中の上位４０位以内の画像で、かつ、まだ画面８１０に表示されていない画像が、類似性が高い順に２０件表示される。例えば、更新された検索結果中の４位と８位の画像が、以前の検索結果中の２０位以内に含まれていなかった場合、画面８２０に表示される検索結果は、先頭に、４位、次に８位の画像が表示され、その後に、２１位から３８位までの画像が表示される。この場合、３９位と４０位の画像は画面８２０に表示されない。 However, as a result of updating the search result, there are naturally cases where fluctuation occurs within the top 20 positions. In this case, on the screen 820, 20 images in the top 40 positions in the updated search result and not yet displayed on the screen 810 are displayed in descending order of similarity. For example, when the 4th and 8th images in the updated search result are not included in the 20th position in the previous search results, the search result displayed on the screen 820 is 4th in the top. Next, the 8th image is displayed, and then the 21st to 38th images are displayed. In this case, the 39th and 40th images are not displayed on the screen 820.

画面８２０には、ボタン８２１、８２２及び８２３が表示される。ボタン８２１及び８２２の機能は、それぞれ、画面８１０のボタン８１２及び８１３と同等である。ボタン８２３は、画面８２０に表示されている検索結果の次の２０件（すなわち、４１位から６０位までの検索結果）を表示する要求を受け付けるボタンである。従って、ユーザがボタン８２１を操作すると、上位２０位までの検索結果表示に戻る。ユーザがボタン８２３を操作すると、４１位から６０位までの検索結果が表示される。ユーザは、ボタン８２２を操作することができない。 On the screen 820, buttons 821, 822, and 823 are displayed. The functions of the buttons 821 and 822 are the same as the buttons 812 and 813 of the screen 810, respectively. The button 823 is a button for receiving a request to display the next 20 search results displayed on the screen 820 (that is, the search results from the 41st place to the 60th place). Therefore, when the user operates the button 821, the search result display of the top 20 is returned. When the user operates the button 823, search results from the 41st place to the 60th place are displayed. The user cannot operate the button 822.

例えば、ユーザがボタン８２１をクリックすると、画面８２０は画面８３０に遷移する。この際には、検索結果は更新されない。画面８３０に表示されている検索結果の画像は、画像８１０に表示されているものと全く同一である。しかし、画面８３０に表示されているボタンの数は、画面８２０と同じである。 For example, when the user clicks the button 821, the screen 820 transitions to the screen 830. At this time, the search result is not updated. The search result image displayed on the screen 830 is exactly the same as that displayed on the image 810. However, the number of buttons displayed on the screen 830 is the same as that on the screen 820.

具体的には、画面８３０には、ボタン８３１、８３２及び８３３が表示される。これらのボタンの機能は、それぞれ、画面８２０のボタン８２１、８２２及び８２３と同等である。ただし、画面８１０のボタン８１２と同様、ユーザは、ボタン８３１を操作することができない。 Specifically, buttons 831, 832 and 833 are displayed on the screen 830. The functions of these buttons are equivalent to the buttons 821, 822 and 823 on the screen 820, respectively. However, like the button 812 on the screen 810, the user cannot operate the button 831.

その後、ユーザが、まだ表示されていない下位の検索結果の表示を順次要求すると、上記と同様の手順によって検索結果が更新され、ボタンの数が増加する。最終的に８１位から１００位までの検索結果を表示する画面８４０に至るまで、上記の処理が実行される。画面８４０には、８１位から１００位までの検索結果の画像と、五つのボタン８４１から８４５が表示される。ボタン８４１から８４３の機能は、それぞれ、画面８２０のボタン８２１から８２３と同等である。ボタン８４４は、６１位から８０位までの検索結果の表示を要求するためのボタンである。ボタン８４５は、８１位から１００位までの検索結果の表示を要求するためのボタンである。 Thereafter, when the user sequentially requests display of lower-order search results that are not yet displayed, the search results are updated by the same procedure as described above, and the number of buttons increases. The above processing is executed until the screen 840 that finally displays the search results from the 81st place to the 100th place is displayed. On the screen 840, search result images from the 81st place to the 100th place and five buttons 841 to 845 are displayed. The functions of the buttons 841 to 843 are equivalent to the buttons 821 to 823 of the screen 820, respectively. The button 844 is a button for requesting display of search results from the 61st place to the 80th place. The button 845 is a button for requesting display of search results from the 81st place to the 100th place.

仮に、画面８１０において、はじめから、４１位以降の検索結果の表示を要求するボタンが表示される場合、例えば、４１位から６０位までの検索結果の表示が要求される場合がある。この要求がなされた時点で、２１位から４０位までの検索結果もまだ表示されていない。このため、サーバ・プログラムは、２１位から４０位までの検索結果を表示するためのRc個のクラスタを対象とするクラスタ内検索処理と、４１位から６０位までの検索結果を表示するための次のRc個のクラスタを対象とするクラスタ内検索処理とを実行する必要がある。言い換えると、サーバ・プログラムは、１回の検索要求に応じて、Rc×２個のクラスタを対象とするクラスタ内検索処理を実行する必要がある。その結果、Rc個のクラスタを対象とするクラスタ内検索処理が実行される場合と比較して、１回の検索要求に対する応答時間が長くなる。 If, on the screen 810, a button requesting display of search results after the 41st place is displayed from the beginning, for example, display of search results from the 41st place to the 60th place may be requested. When this request is made, the search results from the 21st place to the 40th place are not displayed yet. For this reason, the server program displays the search results in the clusters for the Rc clusters for displaying the search results from the 21st place to the 40th place and the search results from the 41st place to the 60th place. It is necessary to execute the intra-cluster search process for the next Rc clusters. In other words, the server program needs to execute an intra-cluster search process for Rc × 2 clusters in response to a single search request. As a result, the response time for a single search request becomes longer than when the intra-cluster search process for Rc clusters is executed.

一方、図８に示すユーザインタフェースによれば、上位の検索結果が表示された場合のみ、その表示された検索結果の下位に連続する検索結果の表示要求を受け付けることが許可される。例えば、最初に上位２０位までの２０件の検索結果を表示することが要求されたとき、その要求に応じて表示される画面８１０には、１位から２０位までの検索結果が表示される。しかし、この画面８１０には、次の２０件（すなわち２１位から４０位まで）の表示を要求するためのボタン８１３が表示されるが、４１位以降の検索結果の表示を要求するボタンは表示されない。ボタン８１３が操作され、２１位から４０位までの検索結果が一度表示されると、その後、４１位から６０位までの検索結果を表示するボタン８２３、８３３又は８４３が表示される（画面８２０、８３０又は８４０）。このため、サーバ・プログラムは、Rc個を超える数のクラスタを対象とする検索要求を受けることがない。このため、サーバ・プログラムは、１回の検索要求に対して、概ね一定の応答時間内に検索結果を返すことができる。 On the other hand, according to the user interface shown in FIG. 8, only when a higher-order search result is displayed, it is permitted to accept a display request for a search result continuous below the displayed search result. For example, when it is requested to display 20 search results from the top 20 first, the search results from 1st to 20th are displayed on the screen 810 displayed in response to the request. . However, on this screen 810, a button 813 for requesting display of the next 20 items (that is, 21st to 40th) is displayed, but a button for requesting display of search results after the 41st is displayed. Not. When the button 813 is operated and the search results from the 21st place to the 40th place are displayed once, then the buttons 823, 833 or 843 for displaying the search results from the 41st place to the 60th place are displayed (screen 820, 830 or 840). For this reason, the server program does not receive a search request for more than Rc clusters. Therefore, the server program can return the search result within a substantially fixed response time in response to one search request.

上記画面構成の特徴は、検索結果の表示範囲指定に制約を設け、ある範囲の表示が、その一つ上位の範囲が表示された後に、はじめて可能になるようにした点である。この制約は、上位から順番に見ていく、という、通常に行われるユーザ操作を妨げるものではなく、そのような操作を自然に促すものである。特に、類似性に限らず何かしらの基準でソートされたデータを見る場合、上位から順番に内容を確認していくのが、操作の流れとしては最も自然である。従って、操作性の点でユーザに対して不満を与えることはない。また、不用な機能が除かれている分、より分かり易い画面となっている。 A feature of the above screen configuration is that a restriction is imposed on the display range designation of the search result, and a certain range can be displayed only after the upper range is displayed. This restriction does not prevent a user operation that is normally performed to look in order from the top, but naturally encourages such an operation. In particular, when viewing data sorted according to some criteria, not limited to similarity, it is the most natural operation flow to check the contents in order from the top. Therefore, there is no dissatisfaction with the user in terms of operability. In addition, since unnecessary functions are removed, the screen is easier to understand.

図８に示す画面による検索結果表示は、類似検索結果の順位を正確に反映するものではない。また、最終的な検索結果の上位１００件に含まれているにも関わらず、検索結果として表示されない画像も、稀には存在する。まず、順位に関して言えば、類似検索の場合、目的は類似したデータを閲覧することであり、正確な順位は、ユーザにとって特に意味のある情報とは言えない。従って、ユーザにとって特に不都合は生じない。検索結果の内容に関しては、先述したように、本システムでは、上位の検索結果は相対的に安定しており、表示から外れるデータは、検索結果中で類似性が低い方のデータとなる。類似検索では、類似性が高いデータに対してユーザが関心を持っている、と想定されるので、これも、ユーザにとっての不都合とはならない。 The search result display on the screen shown in FIG. 8 does not accurately reflect the rank of similar search results. In addition, there are rarely images that are not displayed as search results even though they are included in the top 100 final search results. First, regarding the ranking, in the case of the similar search, the purpose is to browse similar data, and the accurate ranking is not particularly meaningful information for the user. Therefore, there is no particular inconvenience for the user. Regarding the contents of the search results, as described above, in this system, the higher-order search results are relatively stable, and the data out of the display is the data with the lower similarity in the search results. In the similarity search, since it is assumed that the user is interested in data having high similarity, this is also not inconvenient for the user.

一方、図８のような画面構成と異なり、大量の検索結果、例えば、上位１０００件を表示するようなユーザインタフェースも存在する。このような大量のデータを表示する方法に関しては、特開２０００−２９８８５号公報及び特開２００４−６２３５６号公報に記載されている。大量のデータを表示する場合、画像データの転送に時間を要するため、一度に全画像が表示されることはない。この場合、検索結果の更新は、所定の時間間隔で自動的に実行される。検索結果が更新された時点で、クライアント・プログラムが表示するべき画像も更新される。更新によって検索結果から外れたデータは、画面上からも除去される。この自動更新は、使用者が、別の検索条件を指定するまで繰り返される。この方式を採用した場合、画面上には、常に最新の検索結果が表示されることになる。また、自動更新による表示画像の変動は、更新回数に応じて減少し、アプリケーション側が新たに取得する必要がある画像の数も減少する。従って、不必要に通信負荷等が生じることはない。 On the other hand, unlike the screen configuration shown in FIG. 8, there is a user interface that displays a large amount of search results, for example, the top 1000 items. A method for displaying such a large amount of data is described in Japanese Patent Application Laid-Open Nos. 2000-29885 and 2004-62356. When displaying a large amount of data, since it takes time to transfer the image data, all images are not displayed at once. In this case, the search result is automatically updated at predetermined time intervals. When the search result is updated, the image to be displayed by the client program is also updated. Data deviating from the search result due to the update is also removed from the screen. This automatic update is repeated until the user specifies another search condition. When this method is adopted, the latest search result is always displayed on the screen. In addition, the fluctuation of the display image due to the automatic update decreases according to the number of updates, and the number of images that the application side needs to newly acquire also decreases. Therefore, an unnecessary communication load or the like does not occur.

以上の本発明の実施の形態は、検索対象のベクトルデータが画像の特徴量ベクトルである場合を例として説明したが、本発明は、いかなる種類のベクトルデータの検索に対しても適用することができる。 In the above embodiment of the present invention, the case where the vector data to be searched is the feature vector of the image has been described as an example. However, the present invention can be applied to search for any kind of vector data. it can.

以上、本発明の実施の形態によれば、検索対象のベクトルデータが予めクラスタに分類され、各クラスタの代表値が定められる。任意のベクトルデータをキーとする検索要求が発行された場合、キーデータとの距離が近い代表値によって代表される所定の数のクラスタのみを対象として、類似検索が実行される。このため、高速な類似検索を実現することができる。各クラスタが、相互に類似するベクトルデータによって構成される場合、１回目の検索にある程度の精度を期待することができる。１回目の検索要求に応じた検索によって目的のデータを取得できなかった場合、前回と同一のベクトルデータをキーとする検索要求を繰り返し発行することによって、類似検索の対象クラスタの範囲が順次拡大される。その結果、検索要求を繰り返すごとに、検索精度が向上する。ユーザは、必要な精度の検索結果を得るまで検索要求を繰り返し発行することができる。検索要求を繰り返す場合、既に検索が実行されたクラスタを対象とする類似検索が省略される。このため、類似検索の対象の範囲を拡大しても、１回の検索要求に対応して実際に類似検索が実行されるクラスタの数を一定とすることができる。その結果、検索要求に対する応答時間が一定の値以下に抑えられる。 As described above, according to the embodiment of the present invention, the vector data to be searched is classified into clusters in advance, and the representative value of each cluster is determined. When a search request using arbitrary vector data as a key is issued, a similarity search is executed only for a predetermined number of clusters represented by representative values that are close to the key data. For this reason, a high-speed similarity search can be realized. When each cluster is composed of mutually similar vector data, a certain degree of accuracy can be expected for the first search. If the target data could not be acquired by the search according to the first search request, the range of target clusters for the similar search is sequentially expanded by repeatedly issuing a search request using the same vector data as the previous key. The As a result, the search accuracy improves each time the search request is repeated. The user can issue a search request repeatedly until a search result with a required accuracy is obtained. When the search request is repeated, a similar search for a cluster that has already been searched is omitted. For this reason, even if the range of the target of the similar search is expanded, the number of clusters in which the similar search is actually executed in response to one search request can be made constant. As a result, the response time to the search request is suppressed to a certain value or less.

本発明の実施の形態の類似検索システムの構成を示すブロック図である。It is a block diagram which shows the structure of the similar search system of embodiment of this invention. 本発明の実施の形態においてデータ登録時に実行される処理を示すフローチャートである。It is a flowchart which shows the process performed at the time of data registration in embodiment of this invention. 本発明の実施の形態において実行されるクラスタ位置の再配列の処理を示すフローチャートである。It is a flowchart which shows the process of the rearrangement of the cluster position performed in embodiment of this invention. 本発明の実施の形態の類似検索における、クライアント・サーバ間の情報の流れ、及び、各プログラム内での処理の概略を示す説明図である。It is explanatory drawing which shows the outline of the process in each information flow between the client and server in the similar search of embodiment of this invention, and each program. 本発明の実施の形態の参照クラスタ情報の説明図である。It is explanatory drawing of the reference cluster information of embodiment of this invention. 本発明の実施の形態の検索結果の説明図である。It is explanatory drawing of the search result of embodiment of this invention. 本発明の実施の形態のサーバ・プログラムにおいて実行されるクラスタ内検索処理を示すフローチャートである。It is a flowchart which shows the search process in a cluster performed in the server program of embodiment of this invention. 本発明の実施の形態において表示される検索結果表示画面の状態遷移の例の説明図である。It is explanatory drawing of the example of the state transition of the search result display screen displayed in embodiment of this invention.

Explanation of symbols

１１０サーバ計算機。
１１１検索サーバプロセス。
１１２クラスタ管理情報
１１３クラスタ情報
１１４クラスタメンバの特徴量データ
１１５記録されたクラスタ管理情報
１２０通信基盤
１３０クライアント計算機
１３１、１５２ＣＰＵ
１３２、１５３メモリ
１３３、１５１インターフェース
１３４入力装置
１３５出力装置
１５４ハードディスク
２１０近接クラスタの取得
２２０最近接クラスタへのデータ追加
２３０近接クラスタ要素数に関するループの判定部分
２３１クラスタメンバ数のチェック
２３２クラスタの分割
２３３クラスタの追加
２４０ k-means法の反復回数のチェック
２５０ k-means法の実行
３１０配置交換する組の探索
３２０組が発見できたか否かの判定
３３０クラスタ配列の更新
４０１ユーザ入力
４０２クライアントからサーバへの送信情報（初回）
４０３クラスタ間検索処理
４０４クラスタ内検索処理
４０５サーバからクライアントへの返信情報（初回）
４０６検索結果表示
４０７ユーザ入力
４０８クライアントからサーバへの送信情報（２回目）
４０９クラスタ間検索処理
４１０クラスタ内検索処理
４１１サーバからクライアントへの返信情報（２回目）
５１０タイムスタンプ
７１０参照済みクラスタ数の判定
７２０クラスタ同一性の判定
７３０新規参照するクラスタ数の判定
７４０クラスタ内での類似検索
７５０検索結果の更新
８１０類似画像検索結果表示画面／１位から２０位
８１１検索結果
８１２、８２１、８３１、８４１１位から２０位を表示するためのボタン
８１３、８２２、８３２、８４２２１位から４０位を表示するためのボタン
８２０類似画像検索結果表示画面／２１位から４０位
８２３、８３３、８４３４１位から６０位を表示するためのボタン
８３０類似画像検索結果再表示画面／１位から２０位
８３１表示する順位を切り替えるボタンの列
８２１１位から２０位を表示するためのボタン
８４０類似画像検索結果表示画面／８１位から１００位
８４１表示する順位を切り替えるボタンの列
８４４６１位から８０位を表示するためのボタン
８４５８１位から１００位を表示するためのボタン 110 Server computer.
111 Search server process.
112 Cluster management information 113 Cluster information 114 Cluster member feature data 115 Recorded cluster management information 120 Communication infrastructure 130 Client computer 131, 152 CPU
132, 153 Memory 133, 151 Interface 134 Input device 135 Output device 154 Hard disk 210 Acquisition of proximity cluster 220 Addition of data to nearest cluster 230 Loop determination part 231 regarding number of adjacent cluster elements Checking the number of cluster members 232 Cluster division 233 Addition of cluster 240 Check of number of iterations of k-means method 250 Execution of k-means method 310 Search for pair to be exchanged 320 Judgment whether pair has been found 330 Update cluster array 401 User input 402 From client to server Transmission information (first time)
403 Inter-cluster search process 404 In-cluster search process 405 Return information from server to client (first time)
406 Search result display 407 User input 408 Transmission information from client to server (second time)
409 Inter-cluster search process 410 In-cluster search process 411 Return information from server to client (second time)
510 time stamp 710 determination of number of referenced clusters 720 determination of cluster identity 730 determination of number of newly referred clusters 740 similarity search in cluster 750 search result update 810 similar image search result display screen / first to 20th 811 Search results 812, 821, 831, 841 Buttons 813, 822, 832, 842 for displaying the first place to the 20th place Button 820 for displaying the 21st place to the 40th place 820 Similar image search result display screen / 21st to 40th place Positions 823, 833, 843 Button 830 for displaying positions 41 to 60 830 Similar image search result redisplay screen / first to 20th positions 831 Button row 821 for switching the display order for displaying first to 20th positions Button 840 Similar image search result display screen / Display from 81st to 100th 841 Button for the button 845 81 position for displaying the 80-position from the column 844 61 of the button for switching the position to display the position 100

Claims

In a search server comprising an interface for inputting and outputting data, a processor connected to the interface, and one or more storage devices connected to the processor,
The storage device stores a plurality of first vector data each included in one of a plurality of clusters, and representative values for the clusters of the plurality of first vector data,
The processor is
When the search request including the second vector data is received, the representative value is searched using the received second vector data as a key,
A plurality of first vector data included in a first predetermined number of the clusters are searched in order from the cluster including the representative value that is close to the second vector data using the second vector data as a key. ,
A search server characterized in that a second predetermined number of the first vector data having a short distance to the second vector data among the searched first vector data is output via the interface.

2. The search server according to claim 1, wherein each of the first vector data is included in the cluster including the representative value that is closest to the first vector data.

The processor is
When a search request including the second vector data and information indicating the cluster for which the search for the first vector data has been completed using the second vector data as a key is received, other than the cluster indicated by the received information The representative value of the cluster of the second vector data as a key, and
Among the clusters other than the cluster indicated by the received information, a plurality of the clusters included in the first predetermined number of the clusters in order from the cluster including the representative value that is close to the second vector data. The search server according to claim 1, wherein the first vector data is searched using the second vector data as a key.

2. The search server according to claim 1, wherein the processor outputs information indicating the cluster for which the search of the first vector data has been completed using the second vector data as a key.

The processor is
When a search request including the second vector data and information indicating the cluster for which the search for the first vector data has been completed using the second vector data as a key is received, other than the cluster indicated by the received information The representative value of the cluster of the second vector data as a key, and
Among the clusters other than the cluster indicated by the received information, a plurality of the clusters included in the first predetermined number of the clusters in order from the cluster including the representative value that is close to the second vector data. The search server according to claim 4, wherein the first vector data is searched using the second vector data as a key.

The one or more storage devices include a first storage device that stores the first vector data, and a second storage device that stores the representative value,
The first storage device is a disk storage device;
The search server according to claim 1, wherein the second storage device is a randomly accessible storage device.

In a search system comprising a terminal computer and a search server connected via a network,
The search server includes an interface connected to the network, a processor connected to the interface, and one or more storage devices connected to the processor,
The storage device stores a plurality of first vector data each included in one of a plurality of clusters, and representative values for the clusters of the plurality of first vector data,
When the terminal computer receives the input of the second vector data, the terminal computer transmits a search request including the received second vector data to the search server via the network,
The processor is
When a search request including the second vector data is received via the interface, the representative value is searched using the received second vector data as a key,
A plurality of first vector data included in a first predetermined number of the clusters are searched in order from the cluster including the representative value that is close to the second vector data using the second vector data as a key. ,
Out of the searched first vector data, a second predetermined number of the first vector data having a short distance to the second vector data is output via the interface;
The terminal device, when receiving the first vector data from the search server, displays the received first vector data or data associated with the received first vector data as a search result. system.

The search system according to claim 7, wherein each of the first vector data is included in the cluster including the representative value that is closest to the first vector data.

The processor is
When a search request including the second vector data and information indicating the cluster for which the search for the first vector data has been completed using the second vector data as a key is received, other than the cluster indicated by the received information The representative value of the cluster of the second vector data as a key, and
Among the clusters other than the cluster indicated by the received information, a plurality of the clusters included in the first predetermined number of the clusters in order from the cluster including the representative value that is close to the second vector data. The search system according to claim 7, wherein the first vector data is searched using the second vector data as a key.

8. The search system according to claim 7, wherein the processor outputs, via the interface, information indicating the cluster for which the search of the first vector data has been completed using the second vector data as a key. .

The processor is
When a search request including the second vector data and information indicating the cluster for which the search for the first vector data has been completed using the second vector data as a key is received, other than the cluster indicated by the received information The representative value of the cluster of the second vector data as a key, and
Among the clusters other than the cluster indicated by the received information, a plurality of the clusters included in the first predetermined number of the clusters in order from the cluster including the representative value that is close to the second vector data. The search system according to claim 10, wherein the first vector data is searched using the second vector data as a key.

The one or more storage devices include a first storage device that stores the first vector data, and a second storage device that stores the representative value,
The first storage device is a disk storage device;
The search system according to claim 7, wherein the second storage device is a randomly accessible storage device.

When the terminal device receives the first vector data searched using the second vector data as a key and the information indicating the cluster for which the search of the first vector data is completed from the search server, the terminal device The search request including vector data, the received first vector data, and the received information indicating the cluster is transmitted to the search server via the network. Search system.

The terminal device
Upon receiving a display request for the second predetermined number of search results, the search request is transmitted to the search server;
Displaying the second predetermined number of search results in ascending order of distance between the second vector data and the first vector data;
Only when the second predetermined number of search results are displayed, reception of a display request for the second predetermined search result that is consecutively lower than the displayed search results is permitted. Item 14. The search system according to Item 13.

A computer comprising: an interface for inputting / outputting data; a processor connected to the interface; and one or more storage devices connected to the processor;
The storage device stores a plurality of first vector data each included in one of a plurality of clusters, and representative values for the clusters of the plurality of first vector data,
The method
When the search request including the second vector data is received, the representative value is searched using the received second vector data as a key,
A plurality of first vector data included in a first predetermined number of the clusters are searched in order from the cluster including the representative value that is close to the second vector data using the second vector data as a key. ,
A method of outputting, via the interface, a second predetermined number of the first vector data having a short distance from the second vector data among the searched first vector data.

The method according to claim 15, wherein each of the first vector data is included in the cluster including the representative value that is closest to the first vector data.

The method
When a search request including the second vector data and information indicating the cluster for which the search for the first vector data has been completed using the second vector data as a key is received, other than the cluster indicated by the received information The representative value of the cluster of the second vector data as a key, and
Among the clusters other than the cluster indicated by the received information, a plurality of the clusters included in the first predetermined number of the clusters in order from the cluster including the representative value that is close to the second vector data. The method according to claim 15, wherein the first vector data is searched using the second vector data as a key.

16. The method according to claim 15, wherein the method outputs information indicating the cluster for which the search of the first vector data has been completed using the second vector data as a key.

The method
When a search request including the second vector data and information indicating the cluster for which the search for the first vector data has been completed using the second vector data as a key is received, other than the cluster indicated by the received information The representative value of the cluster of the second vector data as a key, and
Among the clusters other than the cluster indicated by the received information, a plurality of the clusters included in the first predetermined number of the clusters in order from the cluster including the representative value that is close to the second vector data. 19. The method according to claim 18, wherein the first vector data is searched using the second vector data as a key.

The one or more storage devices include a first storage device that stores the first vector data, and a second storage device that stores the representative value,
The first storage device is a disk storage device;
The method of claim 15, wherein the second storage device is a randomly accessible storage device.